Source Analyzer
Audience: Data engineers planning ETL pipelines.
Goal: Characterize data sources for extraction planning by analyzing volume, update patterns, schema, and quality baselines.
Analysis Dimensions
Volume Analysis
- Current row count
- Historical growth rate (if multiple snapshots available)
- Projected growth
- Peak vs average volume
- Seasonality patterns
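A minimal sketch of the growth math, assuming two row-count snapshots are available (the dates and counts below are illustrative, not from any real source):

```python
# Estimate daily growth and project future volume from two snapshots.
from datetime import date

# Hypothetical snapshots: (snapshot_date, row_count)
snapshots = [
    (date(2024, 1, 1), 4_750_000),
    (date(2024, 2, 1), 5_200_000),
]

(d0, n0), (d1, n1) = snapshots[0], snapshots[-1]
daily_growth = (n1 - n0) / (d1 - d0).days     # rows added per day
horizon_days = 365
projected = n1 + daily_growth * horizon_days  # naive linear projection

print(f"daily growth ~ {daily_growth:,.0f} rows/day")
print(f"projected rows in {horizon_days} days ~ {projected:,.0f}")
```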
Update Patterns
- Full refresh vs incremental
- Update frequency (hourly, daily, weekly)
- Batch arrival times
- Late-arriving data windows
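Batch arrival times can often be surfaced by bucketing sampled update timestamps by hour; a sketch with illustrative values:

```python
# Bucket update timestamps by hour; spikes suggest batch arrival windows.
from collections import Counter
from datetime import datetime

timestamps = [  # sampled updated_at values (illustrative)
    datetime(2024, 3, 1, 10, 2), datetime(2024, 3, 1, 10, 5),
    datetime(2024, 3, 1, 14, 1), datetime(2024, 3, 2, 10, 3),
]

arrivals_by_hour = Counter(ts.hour for ts in timestamps)
for hour, count in arrivals_by_hour.most_common():
    print(f"{hour:02d}:00  {count} updates")
```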
Schema Characteristics
- Column count and types
- Nullable patterns
- Primary key candidates
- Foreign key relationships
- Schema change history
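A sketch of schema profiling against a PostgreSQL source via `information_schema`, assuming the `psycopg2` driver; the DSN and table name are placeholders:

```python
# List columns, types, and nullability for one table.
import psycopg2

SQL = """
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = %s
ORDER BY ordinal_position
"""

with psycopg2.connect("dbname=orders_db") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(SQL, ("customer_orders",))       # placeholder table
        for name, dtype, nullable in cur.fetchall():
            print(f"{name:<20} {dtype:<15} nullable={nullable}")
```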
Data Quality Baseline
- Typical null rates per column
- Expected value ranges
- Cardinality expectations
- Known data issues
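A sketch of baselining one column's null rate and cardinality; the table and column echo the example specification below and are illustrative:

```python
# Null rate and distinct-value count for a single column.
import psycopg2

SQL = """
SELECT
    count(*)                               AS total_rows,
    count(*) FILTER (WHERE status IS NULL) AS null_rows,
    count(DISTINCT status)                 AS distinct_values
FROM customer_orders
"""

with psycopg2.connect("dbname=orders_db") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    total, nulls, distinct = cur.fetchone()
    print(f"null rate: {nulls / total:.2%}, cardinality: {distinct}")
```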
Workflow
- Inventory the source: file location or database connection, access patterns (API, file drop, DB query), authentication requirements
- Sample multiple time periods: analyze data from different dates to identify temporal patterns (see the sketch after this list)
- Profile the schema: document all columns, identify stable vs volatile columns, note computed/derived columns
- Assess quality: baseline quality metrics, known quirks, required transformations
- Document extraction requirements: optimal extraction method, incremental key columns, filtering criteria
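A sketch of step 2, running the same profile over several date windows to expose temporal drift; table, column, and window values are illustrative:

```python
# Compare row counts and a key metric across two sampling windows.
import psycopg2

WINDOWS = [("2024-01-01", "2024-01-08"), ("2024-06-01", "2024-06-08")]
SQL = """
SELECT count(*), avg(total_amount)
FROM customer_orders
WHERE order_date >= %s AND order_date < %s
"""

with psycopg2.connect("dbname=orders_db") as conn, conn.cursor() as cur:
    for start, end in WINDOWS:
        cur.execute(SQL, (start, end))
        rows, avg_amount = cur.fetchone()
        print(f"{start}..{end}: rows={rows}, avg_amount={avg_amount}")
```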
Output: Source Specification
```yaml
source:
  name: customer_orders
  type: postgresql
  connection: orders_db

extraction:
  method: incremental
  key_column: updated_at
  batch_size: 100000
  frequency: hourly

schema:
  columns:
    - name: order_id
      type: bigint
      nullable: false
      primary_key: true
    - name: customer_id
      type: bigint
      nullable: false
      foreign_key: customers.id
    - name: order_date
      type: date
      nullable: false
    - name: total_amount
      type: decimal(10,2)
      nullable: false
    - name: status
      type: varchar(20)
      nullable: true
      values: [pending, confirmed, shipped, delivered, cancelled]
    - name: updated_at
      type: timestamp
      nullable: false
      incremental_key: true

volume:
  current_rows: 5_200_000
  daily_growth: 15_000
  peak_hours: [10, 14, 18]

quality:
  known_issues:
    - "status can be null for orders before 2023"
    - "total_amount occasionally negative (refunds)"
  null_rates:
    customer_id: 0%
    order_date: 0%
    status: 0.5%

recommendations:
  - "Use updated_at for incremental loads"
  - "Add check constraint for status values"
  - "Consider partitioning by order_date"
```
Comparison Analysis
When analyzing multiple related sources:
| Attribute | Orders | Order_Items |
|-----------|--------|-------------|
| Row count | 5.2M | 18.7M |
| Daily growth | 15K | 52K |
| Key column | order_id | item_id |
| Join key | order_id | order_id |
| Update lag | < 1 hour | < 1 hour |
- Relationship: 1:N (avg 3.6 items per order)
- Join strategy: hash join on order_id
- Load order: Orders first, then Order_Items
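Before fixing the load order, it is worth confirming the claimed fan-out against the source itself; a sketch using the same illustrative tables:

```python
# Measure average items per order to validate the 1:N relationship.
import psycopg2

SQL = """
SELECT count(*)::float / count(DISTINCT order_id) AS items_per_order
FROM order_items
"""

with psycopg2.connect("dbname=orders_db") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    print(f"avg items per order: {cur.fetchone()[0]:.1f}")  # expect ~3.6
```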
API Source Analysis
For API-based sources:
```yaml
api_source:
  name: stripe_payments
  base_url: https://api.stripe.com/v1
  auth: bearer_token
  endpoints:
    - path: /charges
      method: GET
      pagination: cursor
      rate_limit: 100/sec
      params:
        created[gte]: "{last_sync}"

extraction:
  strategy: cursor_pagination
  page_size: 100
  sync_frequency: 15min
  full_refresh_weekly: true

data_characteristics:
  avg_response_size: 50KB
  records_per_page: 100
  typical_daily_volume: 5000

error_handling:
  retry_codes: [429, 500, 502, 503]
  max_retries: 3
  backoff: exponential
```
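A sketch of the extraction loop this spec implies: cursor pagination with exponential backoff on the retryable codes. It uses the `requests` library; the bearer token and `last_sync` value are placeholders:

```python
# Paginate /charges with a cursor, retrying transient failures.
import time
import requests

BASE = "https://api.stripe.com/v1/charges"
HEADERS = {"Authorization": "Bearer sk_test_placeholder"}  # placeholder
RETRY_CODES = {429, 500, 502, 503}

def fetch_charges(last_sync: int):
    params = {"limit": 100, "created[gte]": last_sync}
    while True:
        for attempt in range(3):                  # max_retries: 3
            resp = requests.get(BASE, headers=HEADERS, params=params)
            if resp.status_code not in RETRY_CODES:
                break
            time.sleep(2 ** attempt)              # exponential backoff
        resp.raise_for_status()
        body = resp.json()
        yield from body["data"]
        if not body.get("has_more"):
            return
        # Cursor pagination: resume after the last object seen.
        params["starting_after"] = body["data"][-1]["id"]
```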