Data Extraction for Meta-Analysis
This skill teaches how to extract, convert, and prepare study data for meta-analysis.
Overview
Before running a meta-analysis, you need to extract effect sizes and their variances from each study. Studies report results in different formats, requiring conversion to a common metric.
When to Use This Skill
Activate this skill when users:
- Have study data in different formats
- Need to calculate effect sizes from raw data
- Ask about converting between effect size types
- Have missing standard deviations or standard errors
- Need to extract data from figures or tables
Data Requirements
Minimum Data Needed
| Outcome Type | Required Data |
|--------------|---------------|
| Binary | Events and totals for each group, OR 2x2 table |
| Continuous | Means, SDs, and sample sizes for each group |
| Correlation | Correlation coefficient (r) and sample size |
| Pre-calculated | Effect size and SE (or CI or variance) |
Effect Size Calculations
Binary Outcomes
From 2x2 Table
| | Treatment | Control |
|----------|-----------|---------|
| Event | a | b |
| No Event | c | d |
| Total | n1 | n2 |
Odds Ratio:
OR = (a/c) / (b/d) = (a*d) / (b*c)
log_OR = log(OR)
SE_log_OR = sqrt(1/a + 1/b + 1/c + 1/d)
Risk Ratio:
RR = (a/n1) / (b/n2)
log_RR = log(RR)
SE_log_RR = sqrt(1/a - 1/n1 + 1/b - 1/n2)
Risk Difference:
RD = (a/n1) - (b/n2)
SE_RD = sqrt((a*c/n1^3) + (b*d/n2^3))
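As a rough illustration, the binary formulas above can be checked by hand in base R. The counts are made up, and the function name `binary_es` is hypothetical. Cell labels follow the 2x2 table in this section (a and c are the treatment column, b and d the control column).

```r
# Sketch: OR, RR and RD from a single 2x2 table (hypothetical counts).
# a = treatment events,     b = control events,
# c = treatment non-events, d = control non-events
binary_es <- function(a, b, c, d) {
  n1 <- a + c  # treatment total
  n2 <- b + d  # control total
  list(
    log_or    = log((a * d) / (b * c)),
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d),
    log_rr    = log((a / n1) / (b / n2)),
    se_log_rr = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2),
    rd        = a / n1 - b / n2,
    se_rd     = sqrt(a * c / n1^3 + b * d / n2^3)
  )
}

es <- binary_es(a = 15, b = 30, c = 85, d = 70)
exp(es$log_or)  # odds ratio on the original scale
exp(es$log_rr)  # risk ratio on the original scale
es$rd           # risk difference
```

Ratio measures are pooled on the log scale, so keep `log_or` and `log_rr` with their SEs and exponentiate only for reporting.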
Continuous Outcomes
Standardized Mean Difference (SMD/Hedges' g)
# Pooled SD
s_pooled = sqrt(((n1-1)*sd1^2 + (n2-1)*sd2^2) / (n1+n2-2))
# Cohen's d
d = (mean1 - mean2) / s_pooled
# Hedges' g (bias-corrected)
J = 1 - (3 / (4*(n1+n2-2) - 1))
g = J * d
# Variance
var_g = (n1+n2)/(n1*n2) + g^2/(2*(n1+n2))
Mean Difference (MD)
MD = mean1 - mean2
SE_MD = sqrt(sd1^2/n1 + sd2^2/n2)
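The continuous-outcome formulas above can be sketched as one small base R function. The summary statistics are invented for illustration and `smd_md` is a hypothetical helper name, not part of any package.

```r
# Sketch: Hedges' g and mean difference from summary statistics
# (all input values are hypothetical).
smd_md <- function(mean1, sd1, n1, mean2, sd2, n2) {
  df <- n1 + n2 - 2
  s_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df)
  d <- (mean1 - mean2) / s_pooled          # Cohen's d
  J <- 1 - 3 / (4 * df - 1)                # small-sample correction factor
  g <- J * d                               # Hedges' g
  var_g <- (n1 + n2) / (n1 * n2) + g^2 / (2 * (n1 + n2))
  md <- mean1 - mean2                      # raw mean difference
  se_md <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  c(g = g, se_g = sqrt(var_g), md = md, se_md = se_md)
}

res <- smd_md(mean1 = 12, sd1 = 4, n1 = 40, mean2 = 10, sd2 = 4, n2 = 40)
res
```

With 40 participants per group the correction factor J is close to 1, so g and d differ only slightly. The correction matters more in small studies.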
R Code for Effect Size Calculation
Using escalc() Function
library(metafor)
# Note: escalc() takes ai/bi as the treatment group's events/non-events and
# ci/di as the control group's, unlike the a/b/c/d layout in the 2x2 table above
# Binary outcomes - Odds Ratio
dat_binary <- escalc(measure = "OR",
                     ai = events_treat, bi = nonevents_treat,
                     ci = events_ctrl, di = nonevents_ctrl,
                     data = mydata)
# Binary outcomes - Risk Ratio
dat_rr <- escalc(measure = "RR",
                 ai = events_treat, bi = nonevents_treat,
                 ci = events_ctrl, di = nonevents_ctrl,
                 data = mydata)
# Continuous outcomes - SMD (Hedges' g)
dat_smd <- escalc(measure = "SMD",
                  m1i = mean_treat, sd1i = sd_treat, n1i = n_treat,
                  m2i = mean_ctrl, sd2i = sd_ctrl, n2i = n_ctrl,
                  data = mydata)
# Continuous outcomes - Mean Difference
dat_md <- escalc(measure = "MD",
                 m1i = mean_treat, sd1i = sd_treat, n1i = n_treat,
                 m2i = mean_ctrl, sd2i = sd_ctrl, n2i = n_ctrl,
                 data = mydata)
# Correlations
dat_cor <- escalc(measure = "ZCOR",  # Fisher's z
                  ri = correlation, ni = sample_size,
                  data = mydata)
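Because correlations are analysed on the Fisher's z scale, results need to be transformed back to r for reporting. A minimal base R sketch of that round trip, with a hypothetical r and n:

```r
# Sketch: Fisher's z transformation by hand (hypothetical r and n).
r <- 0.45
n <- 60

z     <- atanh(r)       # same as 0.5 * log((1 + r) / (1 - r))
var_z <- 1 / (n - 3)    # sampling variance of z
se_z  <- sqrt(var_z)

# 95% CI on the z scale, then back-transform to the correlation scale
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z
ci_r <- tanh(ci_z)
ci_r
```

The back-transformed interval is not symmetric around r, which is expected because the correlation scale is bounded at -1 and 1.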
From Pre-calculated Statistics
# From OR and 95% CI
log_or <- log(OR)
se_log_or <- (log(CI_upper) - log(CI_lower)) / (2 * 1.96)
# From SMD and 95% CI
se_smd <- (CI_upper - CI_lower) / (2 * 1.96)
# From p-value and sample size (approximate)
# For a two-sided t-test; the p-value gives magnitude only, so take the sign from the study
t_value <- qt(1 - p_value/2, df = n1 + n2 - 2)
d <- t_value * sqrt(1/n1 + 1/n2)
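A quick way to check a back-calculated SE is to rebuild the confidence interval from it and compare with what the paper reports. The sketch below does this for a ratio measure. The reported OR and CI are hypothetical values chosen for illustration.

```r
# Sketch: recover log(OR) and its SE from a reported OR and 95% CI
# (the reported values are hypothetical).
or_reported <- 0.75
ci_lower    <- 0.60
ci_upper    <- 0.94

log_or    <- log(or_reported)
se_log_or <- (log(ci_upper) - log(ci_lower)) / (2 * qnorm(0.975))

# Sanity check: rebuild the CI from the recovered SE. It should be close
# to the reported interval if the CI is symmetric on the log scale.
rebuilt <- exp(log_or + c(-1, 1) * qnorm(0.975) * se_log_or)
round(rebuilt, 2)
```

A clear mismatch usually means the interval is not a 95% CI, was computed on a different scale, or was transcribed incorrectly.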
Handling Missing Data
Missing SDs
Option 1: Impute from other studies
# Use median SD from studies that report it
median_sd <- median(dat$sd, na.rm = TRUE)
dat$sd[is.na(dat$sd)] <- median_sd
Option 2: Calculate from CI or SE
# From 95% CI for mean
SD = sqrt(n) * (CI_upper - CI_lower) / (2 * 1.96)
# From SE
SD = SE * sqrt(n)
Option 3: Estimate from IQR (when only median and IQR are reported)
# Normal approximation (cf. Wan et al. 2014); unreliable for strongly skewed data
SD = IQR / 1.35
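The three conversions above can be wrapped in small helpers so the same rule is applied to every study. This is only a sketch: the function names are hypothetical, the inputs are invented, and the CI version uses the normal critical value, which is an approximation for small samples where the authors probably used a t distribution.

```r
# Sketch: three ways to recover a missing SD (all inputs hypothetical).
sd_from_se <- function(se, n) se * sqrt(n)

sd_from_ci <- function(lower, upper, n, level = 0.95) {
  crit <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% CI
  sqrt(n) * (upper - lower) / (2 * crit)
}

sd_from_iqr <- function(q1, q3) (q3 - q1) / 1.35  # assumes near-normal data

sd_from_se(se = 1.2, n = 25)
sd_from_ci(lower = 7.6, upper = 12.4, n = 25)
sd_from_iqr(q1 = 8, q3 = 14)
```

Recording which helper produced each SD makes it easy to run a sensitivity analysis that excludes imputed or converted values.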
Missing Sample Sizes
Option 1: Use reported total N
# If only total N given, assume equal groups
n1 = n2 = N / 2
Option 2: Contact authors
- Always the best option for critical missing data
Zero Events
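# Add continuity correction (0.5 to all cells, only in studies with a zero cell)
# add = 0.5 and to = "only0" are the escalc() defaults, shown here for clarity;
# adding 0.5 to the columns directly would also alter studies without zeros
dat_corrected <- escalc(measure = "OR",
                        ai = events_treat, bi = nonevents_treat,
                        ci = events_ctrl, di = nonevents_ctrl,
                        add = 0.5, to = "only0",
                        data = mydata)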
# Or use the Peto OR (needs no correction for single zero cells; works best
# when events are rare and group sizes are similar)
dat_peto <- escalc(measure = "PETO",
                   ai = events_treat, bi = nonevents_treat,
                   ci = events_ctrl, di = nonevents_ctrl,
                   data = mydata)
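To see what a continuity correction does to the numbers, it can help to apply it by hand. The base R sketch below adds 0.5 only to studies that contain a zero cell. The three studies are hypothetical, and this is an illustration of the idea rather than a replacement for `escalc()`.

```r
# Sketch: add 0.5 to every cell, but only in studies with a zero cell.
# The three studies below are hypothetical.
dat <- data.frame(
  events_treat    = c(0, 5, 3),
  nonevents_treat = c(50, 45, 47),
  events_ctrl     = c(4, 9, 0),
  nonevents_ctrl  = c(46, 41, 50)
)

cells    <- as.matrix(dat)
has_zero <- apply(cells == 0, 1, any)   # TRUE for studies needing correction
cells[has_zero, ] <- cells[has_zero, ] + 0.5

# Odds ratio: (treatment events * control non-events) /
#             (treatment non-events * control events)
log_or <- log((cells[, "events_treat"] * cells[, "nonevents_ctrl"]) /
              (cells[, "nonevents_treat"] * cells[, "events_ctrl"]))
se_log_or <- sqrt(rowSums(1 / cells))

data.frame(has_zero, log_or, se_log_or)
```

The second study has no zero cell, so its estimate is identical to the uncorrected one.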
Data Extraction Checklist
□ Study identifier (author, year)
□ Sample sizes (treatment and control)
□ Outcome data:
  □ Binary: events in each group
  □ Continuous: means and SDs
□ Effect size (if pre-calculated)
□ Confidence interval or SE
□ Follow-up duration
□ Subgroup information
□ Risk of bias assessment
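Part of the checklist can be enforced in code once the extraction sheet is loaded. The sketch below flags absent columns and studies with gaps. The column names and the two example rows are hypothetical and would need to match your own sheet.

```r
# Sketch: check an extraction sheet against a few checklist fields.
# Column names and rows are hypothetical; adapt them to your own sheet.
required_cols <- c("study", "year", "n_treat", "n_ctrl")

extraction <- data.frame(
  study   = c("Study A", "Study B"),
  year    = c(2018, 2021),
  n_treat = c(40, NA),
  n_ctrl  = c(42, 55),
  stringsAsFactors = FALSE
)

# Columns that the sheet does not contain at all
missing_cols <- setdiff(required_cols, names(extraction))

# Studies with at least one missing value in the required columns present
present    <- intersect(required_cols, names(extraction))
incomplete <- extraction$study[!complete.cases(extraction[present])]

missing_cols
incomplete
```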
Common Conversion Scenarios
Scenario 1: Only p-value reported
# Convert p-value to effect size (approximate)
# Requires sample sizes
z <- qnorm(1 - p_value/2)
d <- z * sqrt(1/n1 + 1/n2)
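The normal-based conversion above and the t-based version give slightly different answers, and neither recovers the sign of the effect. A small sketch comparing the two follows. The helper name `d_from_p` and the `direction` argument are hypothetical, and the inputs are invented.

```r
# Sketch: approximate d from a two-sided p-value (hypothetical inputs).
# The p-value carries no sign, so the direction must come from the paper:
# direction = 1 if treatment scored higher, -1 if lower.
d_from_p <- function(p_value, n1, n2, direction = 1) {
  scale <- sqrt(1 / n1 + 1 / n2)
  t_val <- qt(1 - p_value / 2, df = n1 + n2 - 2)  # t-based
  z_val <- qnorm(1 - p_value / 2)                 # normal approximation
  direction * c(d_t = t_val * scale, d_z = z_val * scale)
}

d_from_p(p_value = 0.03, n1 = 50, n2 = 50, direction = -1)
```

The two estimates converge as sample size grows. Reports such as "p < 0.05" give only a bound, so a conversion from them is conservative rather than exact.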
Scenario 2: Median and IQR reported
# Estimate mean and SD (Wan et al. 2014)
# The 1.35 divisor is a large-sample approximation that assumes roughly normal data
mean_est <- (q1 + median + q3) / 3
sd_est <- (q3 - q1) / 1.35
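For small studies the fixed 1.35 divisor tends to understate the SD. The sketch below compares it with a sample-size-adjusted divisor. That adjusted formula is believed to follow Wan et al. (2014), but treat it as an assumption and verify it against the paper before relying on it. The function name and inputs are hypothetical.

```r
# Sketch: estimate mean and SD from median and quartiles.
# The divisor 2 * qnorm((0.75 * n - 0.125) / (n + 0.25)) is an assumption
# attributed to Wan et al. (2014); it approaches 1.35 as n grows.
est_from_quartiles <- function(q1, med, q3, n) {
  divisor <- 2 * qnorm((0.75 * n - 0.125) / (n + 0.25))
  c(mean        = (q1 + med + q3) / 3,
    sd_large_n  = (q3 - q1) / 1.35,
    sd_adjusted = (q3 - q1) / divisor)
}

est_from_quartiles(q1 = 8, med = 10, q3 = 14, n = 20)    # small study
est_from_quartiles(q1 = 8, med = 10, q3 = 14, n = 2000)  # large study
```

Both versions assume an approximately normal distribution, so estimates from clearly skewed data are best flagged for sensitivity analysis.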
Scenario 3: Different scales across studies
# Use SMD to standardize
# This puts all studies on same scale
dat <- escalc(measure = "SMD", ...)
Teaching Framework
Step 1: Identify What's Reported
"What statistics does the study report?
- Raw data (means, SDs, events)?
- Effect size with CI?
- Just a p-value?"
Step 2: Determine Target Effect Size
"What effect size is appropriate for your research question?
- Binary outcome → OR or RR
- Continuous, same scale → MD
- Continuous, different scales → SMD"
Step 3: Calculate or Convert
"Now let's calculate the effect size and its variance..."
Step 4: Verify
"Let's double-check:
- Does the direction make sense?
- Is the magnitude plausible?
- Does the CI seem reasonable?"
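The verification step can be partly automated. The sketch below checks that an extracted effect and SE are usable and that they reproduce the reported 95% CI. The function name, the tolerance, and the example values are illustrative assumptions. For ratio measures, pass values on the log scale.

```r
# Sketch: simple plausibility checks for one extracted effect size.
# The tolerance and the function name are illustrative assumptions.
check_effect <- function(yi, sei, ci_lower = NA, ci_upper = NA, tol = 0.02) {
  issues <- character(0)
  if (!is.finite(yi) || !is.finite(sei) || sei <= 0) {
    issues <- c(issues, "effect or SE is missing, infinite, or non-positive")
  } else if (!is.na(ci_lower) && !is.na(ci_upper)) {
    # Rebuild the 95% CI from the extracted values and compare
    rebuilt <- yi + c(-1, 1) * qnorm(0.975) * sei
    if (any(abs(rebuilt - c(ci_lower, ci_upper)) > tol)) {
      issues <- c(issues, "rebuilt 95% CI does not match the reported CI")
    }
  }
  if (length(issues) == 0) "ok" else issues
}

check_effect(yi = 0.40, sei = 0.10, ci_lower = 0.20, ci_upper = 0.60)
check_effect(yi = 0.40, sei = 0.10, ci_lower = 0.10, ci_upper = 0.60)
```

Direction and magnitude still need a human check against the paper, since a sign flip produces an internally consistent but wrong record.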
Assessment Questions
- Basic: "What data do you need to calculate an odds ratio?"
  - Correct: Events and non-events (or totals) for each group
- Intermediate: "A study reports mean difference = 5, p = 0.03, n = 50 per group. How do you get the SE?"
  - Correct: Use p-value to get t-statistic, then SE = MD / t
- Advanced: "Studies use different depression scales (BDI, HDRS). How do you combine them?"
  - Correct: Use standardized mean difference (SMD) to put on common scale
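The intermediate question can be worked through in a few lines of base R. This sketch assumes the reported p-value is two-sided and comes from an independent two-sample t-test, which the question does not state explicitly.

```r
# Worked sketch of the intermediate question:
# MD = 5, p = 0.03, n = 50 per group.
# Assumes a two-sided p-value from an independent two-sample t-test.
md      <- 5
p_value <- 0.03
n1      <- 50
n2      <- 50

t_value <- qt(1 - p_value / 2, df = n1 + n2 - 2)
se_md   <- abs(md) / t_value

c(t = t_value, se = se_md)
```

The resulting SE can then be entered alongside the mean difference as a pre-calculated effect size.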
Related Skills
- meta-analysis-fundamentals - Understanding effect sizes
- r-code-generation - Automating calculations
- grade-assessment - Evaluating certainty of evidence
Adaptation Guidelines
Glass (the teaching agent) MUST adapt this content to the learner:
- Language Detection: Detect the user's language from their messages and respond naturally in that language
- Cultural Context: Adapt examples to local healthcare systems and research contexts when relevant
- Technical Terms: Maintain standard English terms (e.g., "forest plot", "effect size", "I²") but explain them in the user's language
- Level Adaptation: Adjust complexity based on user's demonstrated knowledge level
- Socratic Method: Ask guiding questions in the detected language to promote deep understanding
- Local Examples: When possible, reference studies or guidelines familiar to the user's region
Example Adaptations:
- 🇧🇷 Portuguese: Use Brazilian health system examples (SUS, ANVISA guidelines)
- 🇪🇸 Spanish: Reference PAHO/OPS guidelines for Latin America
- 🇨🇳 Chinese: Include examples from Chinese medical literature