BioGeoBEARS Biogeographic Analysis
Overview
BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) performs probabilistic inference of ancestral geographic ranges on phylogenetic trees. This skill helps set up complete biogeographic analyses by:
- Validating and reformatting input files (phylogenetic tree and geographic distribution data)
- Generating organized analysis folder structure
- Creating customized RMarkdown analysis scripts
- Guiding users through parameter selection and model choices
- Producing publication-ready visualizations
When to Use This Skill
Use this skill when users request:
- "Analyze biogeography on my phylogeny"
- "Reconstruct ancestral ranges for my species"
- "Run BioGeoBEARS analysis"
- "Which areas did my ancestors occupy?"
- "Test biogeographic models (DEC, DIVALIKE, BAYAREALIKE)"
The skill triggers when users mention phylogenetic biogeography, ancestral area reconstruction, or provide tree + distribution data.
Required Inputs
Users must provide:
- Phylogenetic tree (Newick format, .nwk, .tre, or .tree file)
  - Must be rooted
  - Tip labels will be matched to geography file
  - Branch lengths required
- Geographic distribution data (any tabular format)
  - Species names (matching tree tips)
  - Presence/absence data for different geographic areas
  - Can be CSV, TSV, Excel, or already in PHYLIP format
Workflow
Step 1: Gather Information
When a user requests a BioGeoBEARS analysis, ask for:
- Input file paths:
  - "What is the path to your phylogenetic tree file?"
  - "What is the path to your geographic distribution file?"
- Analysis parameters (if not specified):
  - Maximum range size (how many areas can a species occupy simultaneously?)
  - Which models to compare (default: all six - DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J)
  - Output directory name (default: "biogeobears_analysis")
Use the AskUserQuestion tool to gather this information efficiently:
Example questions:
- "Maximum range size" - options based on number of areas (e.g., for 4 areas: "All 4 areas", "3 areas", "2 areas")
- "Models to compare" - options: "All 6 models (recommended)", "Only base models (DEC, DIVALIKE, BAYAREALIKE)", "Only +J models", "Custom selection"
- "Visualization type" - options: "Pie charts (show probabilities)", "Text labels (show most likely states)", "Both"
Step 2: Validate and Prepare Input Files
Validate Tree File
Use the Read tool to check the tree file:
# In R, basic validation:
library(ape)
tr <- read.tree("path/to/tree.nwk")
print(paste("Tips:", length(tr$tip.label)))
print(paste("Rooted:", is.rooted(tr)))
print(paste("Has branch lengths:", !is.null(tr$edge.length))) # Branch lengths are required
print(tr$tip.label) # Check species names
Verify:
- File can be parsed as Newick
- Tree is rooted (if not, ask user which outgroup to use)
- Note the tip labels for geography file validation
Validate and Reformat Geography File
Use scripts/validate_geography_file.py to validate or reformat the geography file.
If file is already in PHYLIP format (starts with numbers):
python scripts/validate_geography_file.py path/to/geography.txt --validate --tree path/to/tree.nwk
This checks:
- Correct tab delimiters
- Species names match tree tips
- Binary codes are correct length
- No spaces in species names or binary codes
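The checks listed above can be sketched in a few lines of Python. This is an illustration only, not the code of the bundled scripts/validate_geography_file.py. It assumes the usual BioGeoBEARS geography layout: a header line giving the species count, the area count, and the area names in parentheses, followed by one `name<TAB>binary_code` row per species. The function name `validate_geography` and its arguments are hypothetical.

```python
def validate_geography(text, tip_labels=None):
    """Return a list of problems found in PHYLIP-style geography text.

    Hypothetical helper for illustration; an empty list means no problems.
    """
    problems = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]

    # Header: species count and area count come first (area names follow).
    header = lines[0].split()
    try:
        n_species, n_areas = int(header[0]), int(header[1])
    except (IndexError, ValueError):
        return ["header must start with species count and area count"]

    rows = lines[1:]
    if len(rows) != n_species:
        problems.append(f"header says {n_species} species, found {len(rows)}")

    seen = set()
    for ln in rows:
        parts = ln.split("\t")
        if len(parts) != 2:
            problems.append(f"expected exactly one tab in row: {ln!r}")
            continue
        name, code = parts
        if " " in name or " " in code:
            problems.append(f"space in name or code: {ln!r}")
        if len(code) != n_areas or set(code) - {"0", "1"}:
            problems.append(f"{name}: code must be {n_areas} characters of 0/1")
        seen.add(name)

    # Optional cross-check against tree tip labels.
    if tip_labels is not None:
        tips = set(tip_labels)
        for name in sorted(tips - seen):
            problems.append(f"in tree but not in geography file: {name}")
        for name in sorted(seen - tips):
            problems.append(f"in geography file but not in tree: {name}")
    return problems


example = "2\t3 (A B C)\nsp_a\t110\nsp_b\t001\n"
print(validate_geography(example, tip_labels=["sp_a", "sp_b"]))
```

Returning a list of messages rather than raising on the first error lets all mismatches be reported to the user in one pass.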
If file is in CSV/TSV format (needs reformatting):
python scripts/validate_geography_file.py path/to/distribution.csv --reformat -o geography.data --delimiter ","
Or for tab-delimited:
python scripts/validate_geography_file.py path/to/distribution.txt --reformat -o geography.data --delimiter tab
The script will:
- Detect area names from header row
- Convert presence/absence data to binary (handles "1", "present", "TRUE", etc.)
- Remove spaces from species names (replace with underscores)
- Create properly formatted PHYLIP file
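As a rough sketch of what this reformatting involves (not the bundled script itself), the conversion can be written with Python's standard csv module. The function name `csv_to_phylip` and the exact set of values treated as "present" are assumptions made for this example. The input is assumed to have a header row with the species column first and one column per area.

```python
import csv
import io

# Values counted as presence; this set is an assumption for the sketch.
TRUTHY = {"1", "present", "true", "yes"}


def csv_to_phylip(csv_text, delimiter=","):
    """Convert a presence/absence table to PHYLIP-style geography text."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header = rows[0]
    records = [r for r in rows[1:] if r]  # skip blank lines

    # Area names come from the header row (first column is the species name).
    areas = [a.strip() for a in header[1:]]
    out = [f"{len(records)}\t{len(areas)} ({' '.join(areas)})"]

    for rec in records:
        # Replace runs of whitespace in species names with underscores.
        name = "_".join(rec[0].split())
        code = "".join(
            "1" if cell.strip().lower() in TRUTHY else "0" for cell in rec[1:]
        )
        out.append(f"{name}\t{code}")
    return "\n".join(out) + "\n"


table = "species,A,B,C\nHomo sapiens,1,0,present\nPan troglodytes,TRUE,absent,0\n"
print(csv_to_phylip(table))
```

For tab-delimited input, pass `delimiter="\t"`.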
Always validate the reformatted file before proceeding:
python scripts/validate_geography_file.py geography.data --validate --tree path/to/tree.nwk
Step 3: Set Up Analysis Folder Structure
Create an organized directory for the analysis:
biogeobears_analysis/
├── input/
│ ├── tree.nwk # Original or copied tree
│ ├── geography.data # Validated/reformatted geography file
│ └── original_data/ # Original input files
│ ├── original_tree.nwk
│ └── original_distribution.csv
├── scripts/
│ └── run_biogeobears.Rmd # Generated RMarkdown script
├── results/ # Created by analysis (output directory)
│ ├── [MODEL]_result.Rdata # Saved model results
│ └── plots/ # Visualization outputs
│ ├── [MODEL]_pie.pdf
│ └── [MODEL]_text.pdf
└── README.md # Analysis documentation
Create this structure programmatically:
mkdir -p biogeobears_analysis/input/original_data
mkdir -p biogeobears_analysis/scripts
mkdir -p biogeobears_analysis/results/plots
# Copy files
cp path/to/tree.nwk biogeobears_analysis/input/
cp geography.data biogeobears_analysis/input/
cp original_files biogeobears_analysis/input/original_data/
Step 4: Generate RMarkdown Analysis Script
Use the template at scripts/biogeobears_analysis_template.Rmd and customize it with user parameters.
Copy and customize the template:
cp scripts/biogeobears_analysis_template.Rmd biogeobears_analysis/scripts/run_biogeobears.Rmd
Create a parameter file or modify the YAML header in the Rmd to use the user's specific settings:
Example customization via R code:
# Edit YAML parameters programmatically or provide as params when rendering
rmarkdown::render(
  "biogeobears_analysis/scripts/run_biogeobears.Rmd",
  params = list(
    tree_file = "../input/tree.nwk",
    geog_file = "../input/geography.data",
    max_range_size = 4,
    models = "DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J",
    output_dir = "../results"
  ),
  output_file = "../results/biogeobears_report.html"
)
Or create a run script:
#!/bin/bash
# biogeobears_analysis/run_analysis.sh
# Render from the scripts/ directory so the relative paths below resolve
cd "$(dirname "$0")/scripts" || exit 1
R -e "rmarkdown::render('run_biogeobears.Rmd', params = list(
  tree_file = '../input/tree.nwk',
  geog_file = '../input/geography.data',
  max_range_size = 4,
  models = 'DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J',
  output_dir = '../results'
), output_file = '../results/biogeobears_report.html')"
Step 5: Create README Documentation
Generate a README.md in the analysis directory explaining:
- What files are present
- How to run the analysis
- What parameters were used
- How to interpret results
Example:
# BioGeoBEARS Analysis
## Overview
Biogeographic analysis of [NUMBER] species across [NUMBER] geographic areas.
## Input Data
- **Tree**: `input/tree.nwk` ([NUMBER] tips)
- **Geography**: `input/geography.data` ([NUMBER] species × [NUMBER] areas)
- **Areas**: [A, B, C, ...]
## Parameters
- Maximum range size: [NUMBER]
- Models tested: [LIST]
## Running the Analysis
### Option 1: Using RMarkdown directly
```r
library(rmarkdown)
render("scripts/run_biogeobears.Rmd",
       output_file = "../results/biogeobears_report.html")
```
Option 2: Using the run script
bash run_analysis.sh
Outputs
Results will be saved in results/:
- biogeobears_report.html - Full analysis report with visualizations
- [MODEL]_result.Rdata - Saved R objects for each model
- plots/[MODEL]_pie.pdf - Ancestral range reconstructions (pie charts)
- plots/[MODEL]_text.pdf - Ancestral range reconstructions (text labels)
Interpreting Results
The HTML report includes:
- Model Comparison - AIC scores, AIC weights, best-fit model
- Parameter Estimates - Dispersal (d), extinction (e), founder-event (j) rates
- Likelihood Ratio Tests - Statistical comparisons of nested models
- Ancestral Range Plots - Visualizations on phylogeny
- Session Info - R package versions for reproducibility
Model Descriptions
- DEC: Dispersal-Extinction-Cladogenesis (general-purpose)
- DIVALIKE: Emphasizes vicariance
- BAYAREALIKE: Emphasizes sympatric speciation
- +J: Adds founder-event speciation parameter
See references/biogeobears_details.md for detailed model descriptions.
Installation Requirements
# Install BioGeoBEARS dependencies from CRAN (devtools is needed for the GitHub install)
install.packages(c("rexpokit", "cladoRcpp", "devtools"))
# Install BioGeoBEARS from GitHub
devtools::install_github(repo = "nmatzke/BioGeoBEARS")
# Other packages
install.packages(c("ape", "rmarkdown", "knitr", "kableExtra"))
### Step 6: Provide User Instructions
After setting up the analysis, provide clear instructions to the user:
Analysis Setup Complete!
Directory structure created at: biogeobears_analysis/
📁 Files created:
✓ input/tree.nwk - Phylogenetic tree ([N] tips)
✓ input/geography.data - Geographic distribution data (validated)
✓ scripts/run_biogeobears.Rmd - RMarkdown analysis script
✓ README.md - Documentation and instructions
✓ run_analysis.sh - Convenience script to run analysis
📋 Next steps:
- Review the README.md for analysis details
- Install BioGeoBEARS if not already installed:
  install.packages("rexpokit")
  install.packages("cladoRcpp")
  library(devtools)
  devtools::install_github(repo="nmatzke/BioGeoBEARS")
- Run the analysis:
  cd biogeobears_analysis
  bash run_analysis.sh
  Or in R:
  setwd("biogeobears_analysis")
  rmarkdown::render("scripts/run_biogeobears.Rmd", output_file = "../results/biogeobears_report.html")
- View results:
  - Open results/biogeobears_report.html in web browser
  - Check results/plots/ for PDF visualizations
⏱️ Expected runtime: [ESTIMATE based on tree size]
- Small trees (<50 tips): 5-15 minutes
- Medium trees (50-100 tips): 15-60 minutes
- Large trees (>100 tips): 1-4 hours
💡 The HTML report includes model comparison, parameter estimates, and visualization of ancestral ranges on your phylogeny.
## Analysis Parameter Guidance
When users ask for guidance on parameters, consult `references/biogeobears_details.md` and provide recommendations:
### Maximum Range Size
**Ask**: "What's the maximum number of areas a species in your group can realistically occupy?"
Common approaches:
- **Conservative**: Number of areas - 1 (prevents unrealistic cosmopolitan ancestral ranges)
- **Permissive**: All areas (if biologically plausible)
- **Data-driven**: Maximum observed in extant species
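**Impact**: The number of possible ranges (states) grows combinatorially with the number of areas and the maximum range size, so larger values sharply increase runtime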
### Model Selection
**Default recommendation**: Run all 6 models for comprehensive comparison
- DEC, DIVALIKE, BAYAREALIKE (base models)
- DEC+J, DIVALIKE+J, BAYAREALIKE+J (+J variants)
**Rationale**:
- Model comparison is key to inference
- +J parameter is often significant
- Small additional computational cost
If computation is a concern, suggest starting with DEC and DEC+J.
### Visualization Options
**Pie charts** (`plotwhat = "pie"`):
- Show probability distributions across all possible states
- Better for conveying uncertainty
- Can be cluttered with many areas
**Text labels** (`plotwhat = "text"`):
- Show only maximum likelihood state
- Cleaner, easier to read
- Doesn't show uncertainty
**Recommendation**: Generate both in the analysis (template does this automatically)
## Common Issues and Troubleshooting
### Species Name Mismatches
**Symptom**: Error about species in tree not in geography file (or vice versa)
**Solution**: Use the validation script with `--tree` option to identify mismatches, then either:
1. Edit the geography file to match tree tip labels
2. Edit tree tip labels to match geography file
3. Remove species that aren't in both
### Tree Not Rooted
**Symptom**: Error about unrooted tree
**Solution**:
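```r
library(ape)
tr <- read.tree("tree.nwk")
# resolve.root = TRUE makes the root bifurcating, so is.rooted(tr) returns TRUE
tr <- root(tr, outgroup = "outgroup_species_name", resolve.root = TRUE)
write.tree(tr, "tree_rooted.nwk")
```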
Ask user which species to use as outgroup.
Formatting Errors in Geography File
Symptom: Validation errors about tabs, spaces, or binary codes
Solution: Use the reformat option:
python scripts/validate_geography_file.py input.csv --reformat -o geography.data
Optimization Fails to Converge
Symptom: NA values in parameter estimates or very negative log-likelihoods
Possible causes:
- Tree and geography data mismatch
- All species in same area (no variation)
- Unrealistic max_range_size
Solution: Check input data quality and try simpler model first (DEC only)
Very Slow Runtime
Causes:
- Large number of areas (>6-7 areas gets slow)
- Large max_range_size
- Many tips (>200)
Solutions:
- Reduce max_range_size
- Combine geographic areas if appropriate
- Use force_sparse = TRUE in the run object
- Run on HPC cluster
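To see why the number of areas and max_range_size dominate runtime, it helps to count the possible geographic ranges (states), since the model's rate matrix has one row and one column per state. The short sketch below counts them as the number of ways to choose up to max_range_size areas, plus one for the empty (null) range. The function name `num_states` is made up for this example.

```python
from math import comb


def num_states(n_areas, max_range_size, include_null_range=True):
    """Count possible ranges: all area subsets of size 1..max_range_size."""
    max_range_size = min(max_range_size, n_areas)
    total = sum(comb(n_areas, k) for k in range(1, max_range_size + 1))
    return total + (1 if include_null_range else 0)


# 4 areas, any combination allowed: 15 non-empty ranges + null range = 16
print(num_states(4, 4))
# 8 areas: unrestricted vs. capped at 3 areas per range
print(num_states(8, 8), num_states(8, 3))
```

With 8 areas, capping ranges at 3 areas cuts the state count from 256 to 93, which is why reducing max_range_size or merging areas is usually the first thing to try.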
Resources
This skill includes:
scripts/
- validate_geography_file.py - Validates and reformats geography files
  - Checks PHYLIP format compliance
  - Validates against tree tip labels
  - Reformats from CSV/TSV to PHYLIP
  - Usage: python validate_geography_file.py --help
- biogeobears_analysis_template.Rmd - RMarkdown template for complete analysis
  - Model fitting for DEC, DIVALIKE, BAYAREALIKE (with/without +J)
  - Model comparison with AIC, AICc, weights
  - Likelihood ratio tests
  - Parameter visualization
  - Ancestral range plotting
  - Customizable via YAML parameters
references/
- biogeobears_details.md - Comprehensive reference including:
- Detailed model descriptions
- Input file format specifications
- Parameter interpretation guidelines
- Plotting options and customization
- Citations and further reading
- Computational considerations
Load this reference when:
- Users ask about specific models
- Need to explain parameter estimates
- Troubleshooting complex issues
- Users want detailed methodology for publications
Best Practices
- Always validate input files before analysis - saves time debugging later
- Organize analysis in a dedicated directory - keeps everything together and reproducible
- Run all 6 models by default - model comparison is crucial for biogeographic inference
- Document parameters and decisions - analysis README helps with reproducibility
- Generate both visualization types - pie charts for uncertainty, text labels for clarity
- Save intermediate results - the RMarkdown template does this automatically
- Check parameter estimates - unrealistic values suggest data or model issues
- Provide context with visualizations - explain what dispersal/extinction rates mean for the user's system
Output Interpretation
When presenting results to users, explain:
Model Selection
- AIC weights give each model's relative support, interpretable as the approximate probability that it is the best of the models compared
- ΔAIC < 2: Models essentially equivalent
- ΔAIC 2-7: Considerably less support
- ΔAIC > 10: Essentially no support
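For readers who want to see where these numbers come from, the following sketch computes AIC, ΔAIC, and Akaike weights from log-likelihoods and parameter counts. The function name `aic_table` and the log-likelihood values are invented for illustration and are not output from a real analysis.

```python
from math import exp


def aic_table(models):
    """models maps a model name to (log_likelihood, number_of_free_parameters)."""
    # AIC = 2k - 2 lnL
    aic = {name: 2 * k - 2 * lnl for name, (lnl, k) in models.items()}
    best = min(aic.values())
    # Delta AIC is measured from the best (lowest-AIC) model.
    delta = {name: a - best for name, a in aic.items()}
    # Akaike weight: exp(-delta/2), normalized to sum to 1 across models.
    rel = {name: exp(-0.5 * d) for name, d in delta.items()}
    total = sum(rel.values())
    return {
        name: {"AIC": aic[name], "dAIC": delta[name], "weight": rel[name] / total}
        for name in models
    }


# Hypothetical values: DEC has 2 free parameters (d, e), DEC+J has 3 (d, e, j).
for name, row in aic_table({"DEC": (-40.0, 2), "DEC+J": (-35.0, 3)}).items():
    print(name, row)
```

Because the weights are normalized over the models in the table, they change if models are added or removed from the comparison.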
Parameter Estimates
- d (dispersal rate): Higher = more range expansions
- e (extinction rate): Higher = more local extinctions
- j (founder-event rate): Higher = more jump dispersal at speciation
- Ratio d/e: > 1 favors expansion, < 1 favors contraction
Ancestral Ranges
- Pie charts: Larger slices = higher probability
- Colors: Represent areas (single area = bright color, multiple areas = blended)
- Node labels: Most likely ancestral range
- Split events (at corners): Range changes at speciation
Statistical Tests
- LRT p < 0.05: +J parameter significantly improves fit
- High AIC weight (>0.7): Strong evidence for one model
- Similar AIC weights: Model uncertainty - report results from multiple models
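The likelihood ratio test between a base model and its +J variant can be illustrated with the standard library alone. The sketch assumes the two models are nested and differ by exactly one free parameter (j), so the test statistic is compared to a chi-square distribution with 1 degree of freedom. The function name `lrt_one_df` and the log-likelihoods are hypothetical.

```python
from math import erfc, sqrt


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter."""
    # Test statistic: twice the log-likelihood gain of the richer model.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Chi-square survival function for 1 degree of freedom:
    # P(X > stat) = erfc(sqrt(stat / 2))
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for DEC (null) and DEC+J (alternative).
stat, p = lrt_one_df(-40.0, -35.0)
print(f"LRT statistic = {stat:.2f}, p = {p:.4g}")
```

The closed form with `erfc` only holds for 1 degree of freedom; comparing models that differ by more parameters needs a general chi-square routine.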
Example Usage
User: "I have a phylogeny of 30 bird species and their distributions across 5 islands. Can you help me figure out where their ancestors lived?"
Claude (using this skill):
1. Ask for tree and distribution file paths
2. Validate tree file (check 30 tips, rooted)
3. Validate/reformat geography file (5 areas)
4. Ask about max_range_size (suggest 4 areas)
5. Ask about models (suggest all 6)
6. Set up biogeobears_analysis/ directory structure
7. Copy template RMarkdown script with parameters
8. Generate README.md and run_analysis.sh
9. Provide clear instructions to run analysis
10. Explain expected outputs and how to interpret them
Result: User has complete, ready-to-run analysis with documentation
Attribution
This skill was created based on:
- BioGeoBEARS package by Nicholas Matzke
- Tutorial resources from http://phylo.wikidot.com/biogeobears
- Example workflows from the BioGeoBEARS GitHub repository
Additional Notes
Time estimate for skill execution:
- File validation: 1-2 minutes
- Directory setup: < 1 minute
- Total setup time: 5-10 minutes
Analysis runtime (separate from skill execution):
- Depends on tree size and number of areas
- Small datasets (<50 tips, ≤5 areas): 10-30 minutes
- Large datasets (>100 tips, >5 areas): 1-6 hours
Installation requirements (user must have):
- R (≥4.0)
- BioGeoBEARS R package
- Supporting packages: ape, rmarkdown, knitr, kableExtra
- Python 3 (for validation script)
When to consult references/:
- Load biogeobears_details.md when users need detailed explanations of models, parameters, or interpretation
- Reference it for troubleshooting complex issues
- Use it to help users write methods sections for publications