Agent Skills: Source Analyzer

Analyze data source characteristics including update frequency, volume patterns, and schema stability.

UncategorizedID: majesticlabs-dev/majestic-marketplace/source-analyzer

Install this agent skill to your local

pnpm dlx add-skill https://github.com/majesticlabs-dev/majestic-marketplace/tree/HEAD/plugins/majestic-data/skills/source-analyzer

Skill Files

Browse the full folder contents for source-analyzer.

Download Skill

Loading file tree…

plugins/majestic-data/skills/source-analyzer/SKILL.md

Skill Metadata

Name
source-analyzer
Description
Analyze data source characteristics including update frequency, volume patterns, and schema stability.

Source Analyzer

Audience: Data engineers planning ETL pipelines.

Goal: Characterize data sources for extraction planning by analyzing volume, update patterns, schema, and quality baselines.

Analysis Dimensions

Volume Analysis

  • Current row count
  • Historical growth rate (if multiple snapshots available)
  • Projected growth
  • Peak vs average volume
  • Seasonality patterns

Update Patterns

  • Full refresh vs incremental
  • Update frequency (hourly, daily, weekly)
  • Batch arrival times
  • Late-arriving data windows

Schema Characteristics

  • Column count and types
  • Nullable patterns
  • Primary key candidates
  • Foreign key relationships
  • Schema change history

Data Quality Baseline

  • Typical null rates per column
  • Expected value ranges
  • Cardinality expectations
  • Known data issues

Workflow

  1. Inventory the source - File location/database connection, access patterns (API, file drop, DB query), authentication requirements
  2. Sample multiple time periods - Analyze data from different dates, identify temporal patterns
  3. Profile the schema - Document all columns, identify stable vs volatile columns, note computed/derived columns
  4. Assess quality - Baseline quality metrics, known quirks, required transformations
  5. Document extraction requirements - Optimal extraction method, incremental key columns, filtering criteria

Output: Source Specification

source:
  name: customer_orders
  type: postgresql
  connection: orders_db

  extraction:
    method: incremental
    key_column: updated_at
    batch_size: 100000
    frequency: hourly

  schema:
    columns:
      - name: order_id
        type: bigint
        nullable: false
        primary_key: true
      - name: customer_id
        type: bigint
        nullable: false
        foreign_key: customers.id
      - name: order_date
        type: date
        nullable: false
      - name: total_amount
        type: decimal(10,2)
        nullable: false
      - name: status
        type: varchar(20)
        nullable: false
        values: [pending, confirmed, shipped, delivered, cancelled]
      - name: updated_at
        type: timestamp
        nullable: false
        incremental_key: true

  volume:
    current_rows: 5_200_000
    daily_growth: 15_000
    peak_hours: [10, 14, 18]

  quality:
    known_issues:
      - "status can be null for orders before 2023"
      - "total_amount occasionally negative (refunds)"
    null_rates:
      customer_id: 0%
      order_date: 0%
      status: 0.5%

  recommendations:
    - "Use updated_at for incremental loads"
    - "Add check constraint for status values"
    - "Consider partitioning by order_date"

Comparison Analysis

When analyzing multiple related sources:

| Attribute | Orders | Order_Items | |-----------|--------|-------------| | Row count | 5.2M | 18.7M | | Daily growth | 15K | 52K | | Key column | order_id | item_id | | Join key | order_id | order_id | | Update lag | < 1 hour | < 1 hour |

Relationship: 1:N (avg 3.6 items per order) Join strategy: Hash join on order_id Load order: Orders first, then Order_Items

API Source Analysis

For API-based sources:

api_source:
  name: stripe_payments
  base_url: https://api.stripe.com/v1
  auth: bearer_token

  endpoints:
    - path: /charges
      method: GET
      pagination: cursor
      rate_limit: 100/sec
      params:
        created[gte]: "{last_sync}"

  extraction:
    strategy: cursor_pagination
    page_size: 100
    sync_frequency: 15min
    full_refresh_weekly: true

  data_characteristics:
    avg_response_size: 50KB
    records_per_page: 100
    typical_daily_volume: 5000

  error_handling:
    retry_codes: [429, 500, 502, 503]
    max_retries: 3
    backoff: exponential