Source Analyzer
Audience: Data engineers planning ETL pipelines.
Goal: Characterize data sources for extraction planning by analyzing volume, update patterns, schema, and quality baselines.
Analysis Dimensions
Volume Analysis
- Current row count
- Historical growth rate (if multiple snapshots available)
- Projected growth
- Peak vs average volume
- Seasonality patterns
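A minimal sketch of the growth math, assuming two row-count snapshots are available (the dates and counts below are illustrative, not from any real source):

```python
# Estimate daily growth and project future volume from two snapshots.
from datetime import date

# Hypothetical snapshots: (snapshot_date, row_count)
snapshots = [
    (date(2024, 1, 1), 4_750_000),
    (date(2024, 2, 1), 5_200_000),
]

(d0, n0), (d1, n1) = snapshots[0], snapshots[-1]
daily_growth = (n1 - n0) / (d1 - d0).days     # rows added per day
horizon_days = 365
projected = n1 + daily_growth * horizon_days  # naive linear projection

print(f"daily growth ~ {daily_growth:,.0f} rows/day")
print(f"projected rows in {horizon_days} days ~ {projected:,.0f}")
```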
Update Patterns
- Full refresh vs incremental
- Update frequency (hourly, daily, weekly)
- Batch arrival times
- Late-arriving data windows
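Batch arrival times can often be surfaced by bucketing sampled update timestamps by hour; a sketch with illustrative values:

```python
# Bucket update timestamps by hour; spikes suggest batch arrival windows.
from collections import Counter
from datetime import datetime

timestamps = [  # sampled updated_at values (illustrative)
    datetime(2024, 3, 1, 10, 2), datetime(2024, 3, 1, 10, 5),
    datetime(2024, 3, 1, 14, 1), datetime(2024, 3, 2, 10, 3),
]

arrivals_by_hour = Counter(ts.hour for ts in timestamps)
for hour, count in arrivals_by_hour.most_common():
    print(f"{hour:02d}:00  {count} updates")
```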
Schema Characteristics
- Column count and types
- Nullable patterns
- Primary key candidates
- Foreign key relationships
- Schema change history
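A sketch of schema profiling against a PostgreSQL source via `information_schema`, assuming the `psycopg2` driver; the DSN and table name are placeholders:

```python
# List columns, types, and nullability for one table.
import psycopg2

SQL = """
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = %s
ORDER BY ordinal_position
"""

with psycopg2.connect("dbname=orders_db") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(SQL, ("customer_orders",))       # placeholder table
        for name, dtype, nullable in cur.fetchall():
            print(f"{name:<20} {dtype:<15} nullable={nullable}")
```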
Data Quality Baseline
- Typical null rates per column
- Expected value ranges
- Cardinality expectations
- Known data issues
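A sketch of baselining one column's null rate and cardinality; the table and column echo the example specification below and are illustrative:

```python
# Null rate and distinct-value count for a single column.
import psycopg2

SQL = """
SELECT
    count(*)                               AS total_rows,
    count(*) FILTER (WHERE status IS NULL) AS null_rows,
    count(DISTINCT status)                 AS distinct_values
FROM customer_orders
"""

with psycopg2.connect("dbname=orders_db") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    total, nulls, distinct = cur.fetchone()
    print(f"null rate: {nulls / total:.2%}, cardinality: {distinct}")
```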
Workflow
- Inventory the source: file location or database connection, access patterns (API, file drop, DB query), authentication requirements
- Sample multiple time periods: analyze data from different dates to identify temporal patterns (see the sketch after this list)
- Profile the schema: document all columns, identify stable vs volatile columns, note computed/derived columns
- Assess quality: baseline quality metrics, known quirks, required transformations
- Document extraction requirements: optimal extraction method, incremental key columns, filtering criteria
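A sketch of step 2, running the same profile over several date windows to expose temporal drift; table, column, and window values are illustrative:

```python
# Compare row counts and a key metric across two sampling windows.
import psycopg2

WINDOWS = [("2024-01-01", "2024-01-08"), ("2024-06-01", "2024-06-08")]
SQL = """
SELECT count(*), avg(total_amount)
FROM customer_orders
WHERE order_date >= %s AND order_date < %s
"""

with psycopg2.connect("dbname=orders_db") as conn, conn.cursor() as cur:
    for start, end in WINDOWS:
        cur.execute(SQL, (start, end))
        rows, avg_amount = cur.fetchone()
        print(f"{start}..{end}: rows={rows}, avg_amount={avg_amount}")
```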
Output: Source Specification
```yaml
source:
  name: customer_orders
  type: postgresql
  connection: orders_db

extraction:
  method: incremental
  key_column: updated_at
  batch_size: 100000
  frequency: hourly

schema:
  columns:
    - name: order_id
      type: bigint
      nullable: false
      primary_key: true
    - name: customer_id
      type: bigint
      nullable: false
      foreign_key: customers.id
    - name: order_date
      type: date
      nullable: false
    - name: total_amount
      type: decimal(10,2)
      nullable: false
    - name: status
      type: varchar(20)
      nullable: true
      values: [pending, confirmed, shipped, delivered, cancelled]
    - name: updated_at
      type: timestamp
      nullable: false
      incremental_key: true

volume:
  current_rows: 5_200_000
  daily_growth: 15_000
  peak_hours: [10, 14, 18]

quality:
  known_issues:
    - "status can be null for orders before 2023"
    - "total_amount occasionally negative (refunds)"
  null_rates:
    customer_id: 0%
    order_date: 0%
    status: 0.5%

recommendations:
  - "Use updated_at for incremental loads"
  - "Add check constraint for status values"
  - "Consider partitioning by order_date"
```
Comparison Analysis
When analyzing multiple related sources:
| Attribute | Orders | Order_Items |
|-----------|--------|-------------|
| Row count | 5.2M | 18.7M |
| Daily growth | 15K | 52K |
| Key column | order_id | item_id |
| Join key | order_id | order_id |
| Update lag | < 1 hour | < 1 hour |
- Relationship: 1:N (avg 3.6 items per order)
- Join strategy: hash join on order_id
- Load order: Orders first, then Order_Items
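Before fixing the load order, it is worth confirming the claimed fan-out against the source itself; a sketch using the same illustrative tables:

```python
# Measure average items per order to validate the 1:N relationship.
import psycopg2

SQL = """
SELECT count(*)::float / count(DISTINCT order_id) AS items_per_order
FROM order_items
"""

with psycopg2.connect("dbname=orders_db") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    print(f"avg items per order: {cur.fetchone()[0]:.1f}")  # expect ~3.6
```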
API Source Analysis
For API-based sources:
```yaml
api_source:
  name: stripe_payments
  base_url: https://api.stripe.com/v1
  auth: bearer_token
  endpoints:
    - path: /charges
      method: GET
      pagination: cursor
      rate_limit: 100/sec
      params:
        created[gte]: "{last_sync}"

extraction:
  strategy: cursor_pagination
  page_size: 100
  sync_frequency: 15min
  full_refresh_weekly: true

data_characteristics:
  avg_response_size: 50KB
  records_per_page: 100
  typical_daily_volume: 5000

error_handling:
  retry_codes: [429, 500, 502, 503]
  max_retries: 3
  backoff: exponential
```
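A sketch of the extraction loop this spec implies: cursor pagination with exponential backoff on the retryable codes. It uses the `requests` library; the bearer token and `last_sync` value are placeholders:

```python
# Paginate /charges with a cursor, retrying transient failures.
import time
import requests

BASE = "https://api.stripe.com/v1/charges"
HEADERS = {"Authorization": "Bearer sk_test_placeholder"}  # placeholder
RETRY_CODES = {429, 500, 502, 503}

def fetch_charges(last_sync: int):
    params = {"limit": 100, "created[gte]": last_sync}
    while True:
        for attempt in range(3):                  # max_retries: 3
            resp = requests.get(BASE, headers=HEADERS, params=params)
            if resp.status_code not in RETRY_CODES:
                break
            time.sleep(2 ** attempt)              # exponential backoff
        resp.raise_for_status()
        body = resp.json()
        yield from body["data"]
        if not body.get("has_more"):
            return
        # Cursor pagination: resume after the last object seen.
        params["starting_after"] = body["data"][-1]["id"]
```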