# Data journalism methodology

Systematic approaches for finding, analyzing, and presenting data in journalism.

## Data acquisition

### Public data sources
## Federal data sources
### General
- Data.gov - Federal open data portal
- Census Bureau (census.gov) - Demographics, economic data (see the API example after this list)
- BLS (bls.gov) - Employment, inflation, wages
- BEA (bea.gov) - GDP, economic accounts
- Federal Reserve (federalreserve.gov) - Financial data
- SEC EDGAR - Corporate filings
### Specific domains
- EPA (epa.gov/data) - Environmental data
- FDA (fda.gov/data) - Drug approvals, recalls, adverse events
- CDC WONDER - Health statistics
- NHTSA - Vehicle safety data
- DOT - Transportation statistics
- FEC - Campaign finance
- USASpending.gov - Federal contracts and grants
### State and local
- State open data portals (search: "[state] open data")
- Socrata-powered sites (many cities/states)
- OpenStreets, municipal GIS portals
- State comptroller/auditor reports
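
Many of these portals expose APIs. Below is a minimal sketch of pulling one table from the Census Bureau's American Community Survey API with `requests`. The dataset year and the variable code (`B01003_001E`, total population) are assumptions here; verify them against the Census developer documentation before relying on the results.

```python
import requests
import pandas as pd

# Total population by state from the ACS 5-year API.
# The year and variable code (B01003_001E = total population) are
# assumptions - confirm at https://www.census.gov/data/developers.html
url = 'https://api.census.gov/data/2021/acs/acs5'
params = {'get': 'NAME,B01003_001E', 'for': 'state:*'}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()

rows = resp.json()  # first row is the header
df = pd.DataFrame(rows[1:], columns=rows[0])
df['B01003_001E'] = pd.to_numeric(df['B01003_001E'])
print(df.sort_values('B01003_001E', ascending=False).head())
```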
### Data request strategies
## Getting data that isn't public
### FOIA for datasets
- Request databases, not just documents
- Ask for data dictionary/schema
- Request in native format (CSV, SQL dump)
- Specify field-level needs
### Building your own dataset
- Scraping public information (see the sketch after this list)
- Crowdsourcing from readers
- Systematic document review
- Surveys (with proper methodology)
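
As a concrete scraping example, here is a minimal sketch using `requests` and BeautifulSoup. The URL, table layout, and column names are hypothetical placeholders; adapt the selectors to the real page, check its terms of service and robots.txt, and rate-limit your requests.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical example: pull a simple HTML table of inspection records.
# The URL and column names are placeholders - adapt the selectors to the
# real page, check its terms of service, and throttle your requests.
url = 'https://example.gov/inspections'  # placeholder
resp = requests.get(url, headers={'User-Agent': 'newsroom-data-desk'}, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, 'html.parser')
table = soup.find('table')

records = []
for row in table.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    records.append(cells)

df = pd.DataFrame(records, columns=['facility', 'date', 'result'])  # assumed columns
df.to_csv('inspections.csv', index=False)
```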
### Commercial data sources (for newsrooms)
- LexisNexis
- Refinitiv
- Bloomberg
- Industry-specific databases
## Data cleaning and preparation

### Common data problems
```python
import pandas as pd
import numpy as np

# Load messy data
df = pd.read_csv('raw_data.csv')

# 1. INCONSISTENT FORMATTING
# Problem: Names in different formats
# "SMITH, JOHN" vs "John Smith" vs "smith john"

def standardize_name(name):
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().lower()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.title()

df['name_clean'] = df['name'].apply(standardize_name)

# 2. DATE INCONSISTENCIES
# Problem: Dates in multiple formats
# "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"

def parse_date(date_str):
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None
    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            continue
    # Fall back to pandas' flexible parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None

df['date_clean'] = df['date'].apply(parse_date)

# 3. MISSING VALUES
# Strategy depends on context

# Check missing value patterns
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Options:
# - Drop rows with critical missing values
df_clean = df.dropna(subset=['required_field'])

# - Flag as missing BEFORE filling (preserve for analysis;
#   flagging after fillna would find nothing)
df['amount_missing'] = df['amount'].isna()

# - Fill with appropriate values
df['category'] = df['category'].fillna('Unknown')
df['amount'] = df['amount'].fillna(df['amount'].median())

# 4. DUPLICATES
# Find and handle duplicates

# Exact duplicates
print(f"Exact duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Fuzzy duplicates (similar but not identical)
# Use record linkage or manual review
from fuzzywuzzy import fuzz  # the maintained fork is 'thefuzz' (same API)

def find_similar_names(names, threshold=85):
    """Find potentially duplicate name pairs (O(n^2) - for modest lists)."""
    duplicates = []
    for i, name1 in enumerate(names):
        for name2 in names[i+1:]:
            score = fuzz.ratio(str(name1).lower(), str(name2).lower())
            if score >= threshold:
                duplicates.append((name1, name2, score))
    return duplicates

possible_dupes = find_similar_names(df['name_clean'].dropna().unique())

# 5. OUTLIERS
# Identify potential data entry errors

def flag_outliers(series, method='iqr', threshold=1.5):
    """Flag statistical outliers.

    threshold=1.5 is the usual IQR multiplier; for method='zscore',
    pass a larger threshold (3 is conventional).
    """
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold
    raise ValueError(f"Unknown method: {method}")

df['amount_outlier'] = flag_outliers(df['amount'])
print(f"Outliers found: {df['amount_outlier'].sum()}")

# 6. DATA TYPE CORRECTIONS
# Ensure proper types for analysis

# Convert to numeric (handling errors)
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Convert to categorical (saves memory, enables ordering)
df['status'] = pd.Categorical(df['status'],
                              categories=['Pending', 'Active', 'Closed'],
                              ordered=True)

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')
```
### Data validation checklist
## Pre-analysis data validation
### Structural checks
- [ ] Row count matches expected
- [ ] Column count and names correct
- [ ] Data types appropriate
- [ ] No unexpected null columns
### Content checks
- [ ] Date ranges make sense
- [ ] Numeric values within expected bounds
- [ ] Categorical values match expected options
- [ ] Geographic data resolves correctly
- [ ] IDs are unique where expected
### Consistency checks
- [ ] Totals add up to expected values
- [ ] Cross-tabulations balance
- [ ] Related fields are consistent
- [ ] Time series is continuous
### Source verification
- [ ] Can trace back to original source
- [ ] Methodology documented
- [ ] Known limitations noted
- [ ] Update frequency understood
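
Several of these checks can be automated. Below is a minimal pandas sketch; the column names (`id`, `date`, `amount`, `category`) and the expected bounds are assumptions for illustration, not a general-purpose validator.

```python
import pandas as pd

def validate(df):
    """Run basic pre-analysis checks; raises AssertionError on failure.

    Column names and bounds are illustrative - adapt them to your data.
    """
    # Structural checks
    assert len(df) > 0, "Dataset is empty"
    expected = {'id', 'date', 'amount', 'category'}
    missing = expected - set(df.columns)
    assert not missing, f"Missing columns: {missing}"

    # Content checks
    dates = pd.to_datetime(df['date'], errors='coerce')
    assert dates.notna().all(), "Unparseable dates found"
    assert dates.between('2000-01-01', '2025-12-31').all(), "Dates out of range"
    assert (df['amount'] >= 0).all(), "Negative amounts found"
    assert df['id'].is_unique, "Duplicate IDs found"

    # Consistency check: monthly time series has no gaps
    months = dates.dt.to_period('M').drop_duplicates()
    full_range = pd.period_range(months.min(), months.max(), freq='M')
    assert len(months) == len(full_range), "Gap in monthly time series"

validate(df)
```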
## Statistical analysis for journalism

### Basic statistics with context
```python
# Essential statistics for any dataset
def describe_for_journalism(df, column):
    """Generate journalist-friendly statistics."""
    stats = {
        'count': len(df[column].dropna()),
        'missing': df[column].isna().sum(),
        'min': df[column].min(),
        'max': df[column].max(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
    }
    # Percentiles for context
    stats['25th_percentile'] = df[column].quantile(0.25)
    stats['75th_percentile'] = df[column].quantile(0.75)
    stats['90th_percentile'] = df[column].quantile(0.90)
    stats['99th_percentile'] = df[column].quantile(0.99)
    # Distribution shape
    stats['skewness'] = df[column].skew()
    return stats

# Example interpretation
stats = describe_for_journalism(df, 'salary')
print(f"""
SALARY ANALYSIS
---------------
We analyzed {stats['count']:,} salary records.
The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.
The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['median'] else 'lower'} than the median,
indicating the distribution is {'right-skewed (pulled up by high earners)'
if stats['skewness'] > 0 else 'left-skewed'}.
The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")
```
### Comparisons and context
```python
# Year-over-year change
def calculate_change(current, previous):
    """Calculate change with multiple metrics."""
    absolute = current - previous
    if previous != 0:
        percent = (current - previous) / previous * 100
    else:
        percent = float('inf') if current > 0 else 0
    return {
        'current': current,
        'previous': previous,
        'absolute_change': absolute,
        'percent_change': percent,
        'direction': ('increased' if absolute > 0
                      else 'decreased' if absolute < 0 else 'unchanged'),
    }

# Per capita calculations (essential for fair comparisons)
def per_capita(value, population):
    """Calculate rate per 100,000 (the standard for crime, health, etc.)."""
    return (value / population) * 100000

# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")  # 5000.0
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")  # 1600.0
# City A actually has the higher crime rate despite fewer total crimes!

# Inflation adjustment
def adjust_for_inflation(amount, from_year, to_year, cpi_data):
    """Adjust dollar amounts for inflation using a CPI lookup table."""
    from_cpi = cpi_data[from_year]
    to_cpi = cpi_data[to_year]
    return amount * (to_cpi / from_cpi)

# Always adjust when comparing dollars across years!
```
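
For example, with illustrative CPI-U annual averages (the real figures live at bls.gov and should be verified before publication):

```python
# Illustrative CPI-U annual averages - verify current figures at bls.gov
cpi_data = {2000: 172.2, 2023: 304.7}

salary_2000 = 50_000
in_2023_dollars = adjust_for_inflation(salary_2000, 2000, 2023, cpi_data)
print(f'${salary_2000:,} in 2000 is about ${in_2023_dollars:,.0f} in 2023 dollars')
# Roughly $88,000 - always state the index and base year you used
```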
### Correlation vs causation
## Reporting correlations responsibly
### What you CAN say
- "X and Y are correlated"
- "As X increases, Y tends to increase"
- "Areas with higher X also tend to have higher Y"
- "X is associated with Y"
### What you CANNOT say (without more evidence)
- "X causes Y"
- "X leads to Y"
- "Y happens because of X"
### Questions to ask before implying causation
1. Is there a plausible mechanism?
2. Does the timing make sense (cause before effect)?
3. Is there a dose-response relationship?
4. Has the finding been replicated?
5. Have confounding variables been controlled?
6. Are there alternative explanations?
### Red flags for spurious correlations
- Extremely high correlation (r > 0.95) with unrelated things
- No logical connection between variables
- Third variable could explain both
- Small sample size with high variance
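
To see how a lurking third variable manufactures a correlation, here is a small simulation on purely synthetic data: city population drives both ice cream sales and crime counts, so the two look tightly linked until you normalize by population.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic data: population drives both variables independently
population = rng.uniform(10_000, 1_000_000, n)
ice_cream_sales = 0.02 * population + rng.normal(0, 2000, n)
crimes = 0.005 * population + rng.normal(0, 500, n)

df = pd.DataFrame({'population': population,
                   'ice_cream_sales': ice_cream_sales,
                   'crimes': crimes})

# The raw correlation looks dramatic (roughly 0.9)...
print(df['ice_cream_sales'].corr(df['crimes']))

# ...but vanishes once you control for the confounder by
# normalizing both variables per resident (near 0)
per_capita = df[['ice_cream_sales', 'crimes']].div(df['population'], axis=0)
print(per_capita['ice_cream_sales'].corr(per_capita['crimes']))
```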
## Data visualization

### Chart selection guide
## Choosing the right chart
### Comparison
- **Bar chart**: Compare categories
- **Grouped bar**: Compare categories across groups
- **Bullet chart**: Actual vs target
### Change over time
- **Line chart**: Trends over time
- **Area chart**: Cumulative totals over time
- **Slope chart**: Change between two points
### Distribution
- **Histogram**: Distribution of one variable
- **Box plot**: Compare distributions across groups
- **Violin plot**: Detailed distribution shape
### Relationship
- **Scatter plot**: Relationship between two variables
- **Bubble chart**: Three variables (x, y, size)
- **Connected scatter**: Change in relationship over time
### Composition
- **Pie chart**: Parts of a whole (use sparingly, max 5 slices)
- **Stacked bar**: Parts of whole across categories
- **Treemap**: Hierarchical composition
### Geographic
- **Choropleth**: Values by region (use normalized data!)
- **Dot map**: Individual locations
- **Proportional symbol**: Magnitude at locations
### Visualization best practices
```python
import matplotlib.pyplot as plt
import seaborn as sns  # optional, for additional chart types

# Journalist-friendly chart defaults
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.titlesize': 16,
    'axes.labelsize': 12,
    'axes.spines.top': False,
    'axes.spines.right': False,
})

def create_bar_chart(data, title, source, xlabel='', ylabel=''):
    """Create a publication-ready bar chart from a {label: value} dict."""
    fig, ax = plt.subplots()

    # Create bars (cast dict views to lists for matplotlib)
    bars = ax.bar(list(data.keys()), list(data.values()), color='#2c7bb6')

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:,.0f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    ha='center', va='bottom',
                    fontsize=10)

    # Labels and title
    ax.set_title(title, fontweight='bold', pad=20)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

    # Add source annotation
    fig.text(0.99, 0.01, f'Source: {source}',
             ha='right', va='bottom', fontsize=9, color='gray')

    plt.tight_layout()
    return fig

# Example
data = {'2020': 1200, '2021': 1450, '2022': 1380, '2023': 1620}
fig = create_bar_chart(data,
                       'Annual Widget Production',
                       'Department of Widgets, 2024',
                       ylabel='Units produced')
fig.savefig('chart.png', dpi=150, bbox_inches='tight')
```
### Avoiding misleading visualizations
## Chart integrity checklist
### Axes
- [ ] Y-axis starts at zero (for bar charts)
- [ ] Axis labels are clear
- [ ] Scale is appropriate (not truncated to exaggerate)
- [ ] Both axes labeled with units
### Data representation
- [ ] All data points visible
- [ ] Colors are distinguishable (including colorblind)
- [ ] Proportions are accurate
- [ ] 3D effects not distorting perception
### Context
- [ ] Title describes what's shown, not conclusion
- [ ] Time period clearly stated
- [ ] Source cited
- [ ] Sample size/methodology noted if relevant
- [ ] Uncertainty shown where appropriate
### Honesty
- [ ] Cherry-picking dates avoided
- [ ] Outliers explained, not hidden
- [ ] Dual axes justified (usually avoid)
- [ ] Annotations don't mislead
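
The first checkbox is the one most often violated. This short matplotlib sketch (synthetic numbers) puts a truncated axis next to a zero baseline, showing how much a ~4% difference can be visually inflated.

```python
import matplotlib.pyplot as plt

years = ['2022', '2023']
values = [96, 100]  # a ~4% increase

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Misleading: the truncated axis makes the change look enormous
ax1.bar(years, values, color='#d7191c')
ax1.set_ylim(95, 101)
ax1.set_title('Truncated axis (misleading)')

# Honest: a zero baseline keeps the bars in proportion
ax2.bar(years, values, color='#2c7bb6')
ax2.set_ylim(0, 110)
ax2.set_title('Zero baseline (honest)')

plt.tight_layout()
plt.show()
```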
## Story structure for data journalism

### Data story framework
## The data story arc
### 1. The hook (nut graf)
- What's the key finding?
- Why should readers care?
- What's the human impact?
### 2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations
### 3. The context
- How does this compare to past?
- How does this compare to elsewhere?
- What's the trend?
### 4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Affected voices
### 5. The implications
- What does this mean going forward?
- What questions remain?
- What actions could result?
### 6. The methodology box
- Where did data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?
### Methodology documentation template
## How we did this analysis
### Data sources
[List all data sources with links and access dates]
### Time period
[Specify exactly what time period is covered]
### Definitions
[Define key terms and how you operationalized them]
### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]
### Limitations
- [Limitation 1]
- [Limitation 2]
### What we excluded and why
- [Excluded category]: [Reason]
### Verification
[How findings were verified/checked]
### Code and data availability
[Link to GitHub repo if sharing code/data]
### Contact
[How readers can reach you with questions]
## Tools and resources

### Essential tools
| Tool | Purpose | Cost |
|------|---------|------|
| Python + pandas | Data analysis | Free |
| R + tidyverse | Statistical analysis | Free |
| Excel/Sheets | Quick analysis | Free/Low |
| Datawrapper | Charts for web | Free tier |
| Flourish | Interactive viz | Free tier |
| QGIS | Mapping | Free |
| Tabula | PDF table extraction | Free |
| OpenRefine | Data cleaning | Free |
### Learning resources
- NICAR - National Institute for Computer-Assisted Reporting (a program of Investigative Reporters & Editors)
- Knight Center for Journalism in the Americas
- Data Journalism Handbook (datajournalism.com)
- Flowing Data (flowingdata.com)
- The Pudding (pudding.cool) - exemplary visual data stories