BioGeoBEARS Biogeographic Analysis Skill

BioGeoBEARS Biogeographic Analysis

Overview

BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) performs probabilistic inference of ancestral geographic ranges on phylogenetic trees. This skill helps set up complete biogeographic analyses by:

Validating and reformatting input files (phylogenetic tree and geographic distribution data)
Generating organized analysis folder structure
Creating customized RMarkdown analysis scripts
Guiding users through parameter selection and model choices
Producing publication-ready visualizations

When to Use This Skill

Use this skill when users request:

"Analyze biogeography on my phylogeny"
"Reconstruct ancestral ranges for my species"
"Run BioGeoBEARS analysis"
"Which areas did my ancestors occupy?"
"Test biogeographic models (DEC, DIVALIKE, BAYAREALIKE)"

The skill triggers when users mention phylogenetic biogeography, ancestral area reconstruction, or provide tree + distribution data.

Required Inputs

Users must provide:

Phylogenetic tree (Newick format, .nwk, .tre, or .tree file)
- Must be rooted
- Tip labels will be matched to geography file
- Branch lengths required
Geographic distribution data (any tabular format)
- Species names (matching tree tips)
- Presence/absence data for different geographic areas
- Can be CSV, TSV, Excel, or already in PHYLIP format

Workflow

Step 1: Gather Information

When a user requests a BioGeoBEARS analysis, ask for:

Input file paths:
- "What is the path to your phylogenetic tree file?"
- "What is the path to your geographic distribution file?"
Analysis parameters (if not specified):
- Maximum range size (how many areas can a species occupy simultaneously?)
- Which models to compare (default: all six - DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J)
- Output directory name (default: "biogeobears_analysis")

Use the AskUserQuestion tool to gather this information efficiently:

Example questions:
- "Maximum range size" - options based on number of areas (e.g., for 4 areas: "All 4 areas", "3 areas", "2 areas")
- "Models to compare" - options: "All 6 models (recommended)", "Only base models (DEC, DIVALIKE, BAYAREALIKE)", "Only +J models", "Custom selection"
- "Visualization type" - options: "Pie charts (show probabilities)", "Text labels (show most likely states)", "Both"

Step 2: Validate and Prepare Input Files

Validate Tree File

Use the Read tool to check the tree file:

# In R, basic validation:
library(ape)
tr <- read.tree("path/to/tree.nwk")
print(paste("Tips:", length(tr$tip.label)))
print(paste("Rooted:", is.rooted(tr)))
print(tr$tip.label)  # Check species names

Verify:

File can be parsed as Newick
Tree is rooted (if not, ask user which outgroup to use)
Note the tip labels for geography file validation

Validate and Reformat Geography File

Use scripts/validate_geography_file.py to validate or reformat the geography file.

If file is already in PHYLIP format (starts with numbers):

python scripts/validate_geography_file.py path/to/geography.txt --validate --tree path/to/tree.nwk

This checks:

Correct tab delimiters
Species names match tree tips
Binary codes are correct length
No spaces in species names or binary codes

If file is in CSV/TSV format (needs reformatting):

python scripts/validate_geography_file.py path/to/distribution.csv --reformat -o geography.data --delimiter ","

Or for tab-delimited:

python scripts/validate_geography_file.py path/to/distribution.txt --reformat -o geography.data --delimiter tab

The script will:

Detect area names from header row
Convert presence/absence data to binary (handles "1", "present", "TRUE", etc.)
Remove spaces from species names (replace with underscores)
Create properly formatted PHYLIP file

Always validate the reformatted file before proceeding:

python scripts/validate_geography_file.py geography.data --validate --tree path/to/tree.nwk

Step 3: Set Up Analysis Folder Structure

Create an organized directory for the analysis:

biogeobears_analysis/
├── input/
│   ├── tree.nwk                 # Original or copied tree
│   ├── geography.data            # Validated/reformatted geography file
│   └── original_data/            # Original input files
│       ├── original_tree.nwk
│       └── original_distribution.csv
├── scripts/
│   └── run_biogeobears.Rmd       # Generated RMarkdown script
├── results/                      # Created by analysis (output directory)
│   ├── [MODEL]_result.Rdata      # Saved model results
│   └── plots/                    # Visualization outputs
│       ├── [MODEL]_pie.pdf
│       └── [MODEL]_text.pdf
└── README.md                     # Analysis documentation

Create this structure programmatically:

mkdir -p biogeobears_analysis/input/original_data
mkdir -p biogeobears_analysis/scripts
mkdir -p biogeobears_analysis/results/plots

# Copy files
cp path/to/tree.nwk biogeobears_analysis/input/
cp geography.data biogeobears_analysis/input/
cp original_files biogeobears_analysis/input/original_data/

Step 4: Generate RMarkdown Analysis Script

Use the template at scripts/biogeobears_analysis_template.Rmd and customize it with user parameters.

Copy and customize the template:

cp scripts/biogeobears_analysis_template.Rmd biogeobears_analysis/scripts/run_biogeobears.Rmd

Create a parameter file or modify the YAML header in the Rmd to use the user's specific settings:

Example customization via R code:

# Edit YAML parameters programmatically or provide as params when rendering
rmarkdown::render(
  "biogeobears_analysis/scripts/run_biogeobears.Rmd",
  params = list(
    tree_file = "../input/tree.nwk",
    geog_file = "../input/geography.data",
    max_range_size = 4,
    models = "DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J",
    output_dir = "../results"
  ),
  output_file = "../results/biogeobears_report.html"
)

Or create a run script:

# biogeobears_analysis/run_analysis.sh
#!/bin/bash
cd "$(dirname "$0")/scripts"

R -e "rmarkdown::render('run_biogeobears.Rmd', params = list(
  tree_file = '../input/tree.nwk',
  geog_file = '../input/geography.data',
  max_range_size = 4,
  models = 'DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J',
  output_dir = '../results'
), output_file = '../results/biogeobears_report.html')"

Step 5: Create README Documentation

Generate a README.md in the analysis directory explaining:

What files are present
How to run the analysis
What parameters were used
How to interpret results

Example:

# BioGeoBEARS Analysis

## Overview

Biogeographic analysis of [NUMBER] species across [NUMBER] geographic areas.

## Input Data

- **Tree**: `input/tree.nwk` ([NUMBER] tips)
- **Geography**: `input/geography.data` ([NUMBER] species × [NUMBER] areas)
- **Areas**: [A, B, C, ...]

## Parameters

- Maximum range size: [NUMBER]
- Models tested: [LIST]

## Running the Analysis

### Option 1: Using RMarkdown directly

```r
library(rmarkdown)
render("scripts/run_biogeobears.Rmd",
       output_file = "../results/biogeobears_report.html")

Option 2: Using the run script

bash run_analysis.sh

Outputs

Results will be saved in results/:

biogeobears_report.html - Full analysis report with visualizations
[MODEL]_result.Rdata - Saved R objects for each model
plots/[MODEL]_pie.pdf - Ancestral range reconstructions (pie charts)
plots/[MODEL]_text.pdf - Ancestral range reconstructions (text labels)

Interpreting Results

The HTML report includes:

Model Comparison - AIC scores, AIC weights, best-fit model
Parameter Estimates - Dispersal (d), extinction (e), founder-event (j) rates
Likelihood Ratio Tests - Statistical comparisons of nested models
Ancestral Range Plots - Visualizations on phylogeny
Session Info - R package versions for reproducibility

Model Descriptions

DEC: Dispersal-Extinction-Cladogenesis (general-purpose)
DIVALIKE: Emphasizes vicariance
BAYAREALIKE: Emphasizes sympatric speciation
+J: Adds founder-event speciation parameter

See references/biogeobears_details.md for detailed model descriptions.

Installation Requirements

# Install BioGeoBEARS
install.packages("rexpokit")
install.packages("cladoRcpp")
library(devtools)
devtools::install_github(repo="nmatzke/BioGeoBEARS")

# Other packages
install.packages(c("ape", "rmarkdown", "knitr", "kableExtra"))


### Step 6: Provide User Instructions

After setting up the analysis, provide clear instructions to the user:

Analysis Setup Complete!

Directory structure created at: biogeobears_analysis/

📁 Files created: ✓ input/tree.nwk - Phylogenetic tree ([N] tips) ✓ input/geography.data - Geographic distribution data (validated) ✓ scripts/run_biogeobears.Rmd - RMarkdown analysis script ✓ README.md - Documentation and instructions ✓ run_analysis.sh - Convenience script to run analysis

📋 Next steps:

Review the README.md for analysis details

Install BioGeoBEARS if not already installed:

install.packages("rexpokit")
install.packages("cladoRcpp")
library(devtools)
devtools::install_github(repo="nmatzke/BioGeoBEARS")

Run the analysis:

cd biogeobears_analysis
bash run_analysis.sh

Or in R:

setwd("biogeobears_analysis")
rmarkdown::render("scripts/run_biogeobears.Rmd",
                  output_file = "../results/biogeobears_report.html")

View results:
- Open results/biogeobears_report.html in web browser
- Check results/plots/ for PDF visualizations

⏱️ Expected runtime: [ESTIMATE based on tree size]

Small trees (<50 tips): 5-15 minutes
Medium trees (50-100 tips): 15-60 minutes
Large trees (>100 tips): 1-4 hours

💡 The HTML report includes model comparison, parameter estimates, and visualization of ancestral ranges on your phylogeny.


## Analysis Parameter Guidance

When users ask for guidance on parameters, consult `references/biogeobears_details.md` and provide recommendations:

### Maximum Range Size

**Ask**: "What's the maximum number of areas a species in your group can realistically occupy?"

Common approaches:
- **Conservative**: Number of areas - 1 (prevents unrealistic cosmopolitan ancestral ranges)
- **Permissive**: All areas (if biologically plausible)
- **Data-driven**: Maximum observed in extant species

**Impact**: Larger values increase computational time exponentially

### Model Selection

**Default recommendation**: Run all 6 models for comprehensive comparison

- DEC, DIVALIKE, BAYAREALIKE (base models)
- DEC+J, DIVALIKE+J, BAYAREALIKE+J (+J variants)

**Rationale**:
- Model comparison is key to inference
- +J parameter is often significant
- Small additional computational cost

If computation is a concern, suggest starting with DEC and DEC+J.

### Visualization Options

**Pie charts** (`plotwhat = "pie"`):
- Show probability distributions across all possible states
- Better for conveying uncertainty
- Can be cluttered with many areas

**Text labels** (`plotwhat = "text"`):
- Show only maximum likelihood state
- Cleaner, easier to read
- Doesn't show uncertainty

**Recommendation**: Generate both in the analysis (template does this automatically)

## Common Issues and Troubleshooting

### Species Name Mismatches

**Symptom**: Error about species in tree not in geography file (or vice versa)

**Solution**: Use the validation script with `--tree` option to identify mismatches, then either:
1. Edit the geography file to match tree tip labels
2. Edit tree tip labels to match geography file
3. Remove species that aren't in both

### Tree Not Rooted

**Symptom**: Error about unrooted tree

**Solution**:
```r
library(ape)
tr <- read.tree("tree.nwk")
tr <- root(tr, outgroup = "outgroup_species_name")
write.tree(tr, "tree_rooted.nwk")

Ask user which species to use as outgroup.

Formatting Errors in Geography File

Symptom: Validation errors about tabs, spaces, or binary codes

Solution: Use the reformat option:

python scripts/validate_geography_file.py input.csv --reformat -o geography.data

Optimization Fails to Converge

Symptom: NA values in parameter estimates or very negative log-likelihoods

Possible causes:

Tree and geography data mismatch
All species in same area (no variation)
Unrealistic max_range_size

Solution: Check input data quality and try simpler model first (DEC only)

Very Slow Runtime

Causes:

Large number of areas (>6-7 areas gets slow)
Large max_range_size
Many tips (>200)

Solutions:

Reduce max_range_size
Combine geographic areas if appropriate
Use force_sparse = TRUE in run object
Run on HPC cluster

Resources

This skill includes:

scripts/

validate_geography_file.py - Validates and reformats geography files
- Checks PHYLIP format compliance
- Validates against tree tip labels
- Reformats from CSV/TSV to PHYLIP
- Usage: python validate_geography_file.py --help
biogeobears_analysis_template.Rmd - RMarkdown template for complete analysis
- Model fitting for DEC, DIVALIKE, BAYAREALIKE (with/without +J)
- Model comparison with AIC, AICc, weights
- Likelihood ratio tests
- Parameter visualization
- Ancestral range plotting
- Customizable via YAML parameters

references/

biogeobears_details.md - Comprehensive reference including:
- Detailed model descriptions
- Input file format specifications
- Parameter interpretation guidelines
- Plotting options and customization
- Citations and further reading
- Computational considerations

Load this reference when:

Users ask about specific models
Need to explain parameter estimates
Troubleshooting complex issues
Users want detailed methodology for publications

Best Practices

Always validate input files before analysis - saves time debugging later
Organize analysis in a dedicated directory - keeps everything together and reproducible
Run all 6 models by default - model comparison is crucial for biogeographic inference
Document parameters and decisions - analysis README helps with reproducibility
Generate both visualization types - pie charts for uncertainty, text labels for clarity
Save intermediate results - the RMarkdown template does this automatically
Check parameter estimates - unrealistic values suggest data or model issues
Provide context with visualizations - explain what dispersal/extinction rates mean for the user's system

Output Interpretation

When presenting results to users, explain:

Model Selection

AIC weights represent probability that each model is best
ΔAIC < 2: Models essentially equivalent
ΔAIC 2-7: Considerably less support
ΔAIC > 10: Essentially no support

Parameter Estimates

d (dispersal rate): Higher = more range expansions
e (extinction rate): Higher = more local extinctions
j (founder-event rate): Higher = more jump dispersal at speciation
Ratio d/e: > 1 favors expansion, < 1 favors contraction

Ancestral Ranges

Pie charts: Larger slices = higher probability
Colors: Represent areas (single area = bright color, multiple areas = blended)
Node labels: Most likely ancestral range
Split events (at corners): Range changes at speciation

Statistical Tests

LRT p < 0.05: +J parameter significantly improves fit
High AIC weight (>0.7): Strong evidence for one model
Similar AIC weights: Model uncertainty - report results from multiple models

Example Usage

User: "I have a phylogeny of 30 bird species and their distributions across 5 islands. Can you help me figure out where their ancestors lived?"

Claude (using this skill):
1. Ask for tree and distribution file paths
2. Validate tree file (check 30 tips, rooted)
3. Validate/reformat geography file (5 areas)
4. Ask about max_range_size (suggest 4 areas)
5. Ask about models (suggest all 6)
6. Set up biogeobears_analysis/ directory structure
7. Copy template RMarkdown script with parameters
8. Generate README.md and run_analysis.sh
9. Provide clear instructions to run analysis
10. Explain expected outputs and how to interpret them

Result: User has complete, ready-to-run analysis with documentation

Attribution

This skill was created based on:

BioGeoBEARS package by Nicholas Matzke
Tutorial resources from http://phylo.wikidot.com/biogeobears
Example workflows from the BioGeoBEARS GitHub repository

Additional Notes

Time estimate for skill execution:

File validation: 1-2 minutes
Directory setup: < 1 minute
Total setup time: 5-10 minutes

Analysis runtime (separate from skill execution):

Depends on tree size and number of areas
Small datasets (<50 tips, ≤5 areas): 10-30 minutes
Large datasets (>100 tips, >5 areas): 1-6 hours

Installation requirements (user must have):

R (≥4.0)
BioGeoBEARS R package
Supporting packages: ape, rmarkdown, knitr, kableExtra
Python 3 (for validation script)

When to consult references/:

Load biogeobears_details.md when users need detailed explanations of models, parameters, or interpretation
Reference it for troubleshooting complex issues
Use it to help users write methods sections for publications

Agent Skills: BioGeoBEARS Biogeographic Analysis

Install this agent skill to your local

Skill Files