Agent Skills: STCI Data Operations Skill

Research and onboard new data sources for STCI, producing source profiles, collector stubs, and validation rules

ID: jeremylongshore/stci/stci-dataops

Install this agent skill into your local environment:

pnpm dlx add-skill https://github.com/intent-solutions-io/stci-standard-inference-token-cost-index/tree/HEAD/skills/stci-dataops

Skill Files


skills/stci-dataops/SKILL.md

Skill Metadata

Name
stci-dataops
Description
Research and onboard new data sources for STCI, producing source profiles, collector stubs, and validation rules

STCI Data Operations Skill

Purpose

This skill assists with researching, evaluating, and onboarding new data sources for STCI (the Standard Inference Token Cost Index). Given a provider or aggregator, it produces structured outputs ready for implementation.

Capabilities

  1. Source Research: Analyze a pricing source (provider page, aggregator API) and extract key characteristics
  2. Legal Assessment: Review ToS, robots.txt, and rate limits for compliance
  3. Source Profile: Generate a complete source profile following the template
  4. Collector Stub: Produce a Python collector module for the source
  5. Validation Rules: Define source-specific validation and anomaly detection

Invocation

Use this skill when you need to:

  • Add a new LLM provider to STCI
  • Evaluate an aggregator API for data sourcing
  • Create a source profile for a pricing endpoint
  • Generate collector code for a new source

Workflow

Step 1: Source Identification

Given a source URL or name, gather:

  • Source type (provider, aggregator, community)
  • API endpoint (if available)
  • Pricing page URL
  • Data format (JSON, HTML, etc.)

Step 2: Legal Review

Check and document:

  • [ ] robots.txt allows access to target endpoints
  • [ ] ToS permits automated data access
  • [ ] ToS permits data redistribution
  • [ ] Rate limits and API quotas
  • [ ] Authentication requirements

Decision Matrix:

| Condition | Action |
|-----------|--------|
| API explicitly public + no ToS restrictions | Proceed to T1/T2 source |
| API available but ToS unclear | Request legal review before production |
| Scraping required + ToS prohibits | Do NOT proceed - use alternative |
| Manual collection only viable | Proceed as T4 source with caveats |
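
The robots.txt check can be automated. Below is a minimal sketch using only Python's standard-library urllib.robotparser; the user agent string is an assumption, so match it to whatever your collector actually sends:

# Hypothetical helper: does robots.txt permit fetching a target path?
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def robots_allows(base_url: str, path: str, user_agent: str = "stci-collector") -> bool:
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    parser.read()  # fetches and parses robots.txt
    return parser.can_fetch(user_agent, urljoin(base_url, path))

# e.g. robots_allows("https://openrouter.ai", "/api/v1/models")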

Step 3: Data Structure Analysis

For API sources:

# Fetch sample data
curl -s [API_URL] | head -100

# Analyze structure
# - Identify model ID field
# - Identify pricing fields (input, output, per-request)
# - Identify metadata fields (context window, etc.)
# - Note rate units (per-token, per-1K, per-1M)
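
As a Python alternative to the curl probe above, a minimal inspection sketch (API_URL is a placeholder, and the "data" envelope is an assumption some pricing APIs use):

# Hypothetical inspection snippet: fetch one sample payload and list the
# top-level fields so model ID, pricing, and metadata fields can be spotted.
import json
import urllib.request

API_URL = "https://example.com/api/v1/models"  # placeholder endpoint

with urllib.request.urlopen(API_URL, timeout=30) as resp:
    payload = json.load(resp)

# Some APIs wrap records in a "data" envelope; unwrap if present.
records = payload["data"] if isinstance(payload, dict) and "data" in payload else payload
print(sorted(records[0].keys()))  # candidate model_id / pricing / metadata fields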

For HTML sources:

# Check page structure
# - Identify pricing table elements
# - Note update indicators (timestamps, version)
# - Assess scraping complexity

Step 4: Generate Source Profile

Output a completed source profile using the template at 000-docs/009-DR-TMPL-source-profile.md.

Required sections:

  1. Source Identification
  2. Legal & Compliance
  3. Data Availability
  4. Pricing Data Structure
  5. Collection Strategy
  6. Normalization Mapping
  7. Validation Rules
  8. Monitoring & Alerting
  9. Source Acceptance Checklist

Step 5: Generate Collector Stub

Produce a Python module following this pattern:

# services/collector/sources/{source_id}.py

from datetime import date
from typing import List

from .base import BaseSource

class {SourceName}Source(BaseSource):
    """
    {Source description}
    See: {Source URL}
    """

    API_URL = "{api_url}"

    @property
    def source_id(self) -> str:
        return "{source_id}"

    @property
    def source_tier(self) -> str:
        return "{tier}"  # T1, T2, T3, or T4

    def fetch(self, target_date: date) -> List[dict]:
        # Fetch raw pricing data for target_date; return one dict per model
        pass
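
For a JSON API source, a concrete fetch() might look like the sketch below. The requests dependency, the {"data": [...]} payload shape, and the observation field names are illustrative assumptions, not the project's actual schema:

# Illustrative fetch() body for the stub above.
import requests  # assumed HTTP client; swap in the project's own

def fetch(self, target_date: date) -> List[dict]:
    resp = requests.get(self.API_URL, timeout=30)
    resp.raise_for_status()
    observations = []
    for model in resp.json().get("data", []):
        observations.append({
            "observation_id": f"obs-{target_date.isoformat()}-{self.source_id}-{model['id']}",
            "model_id": model["id"],
            # per-token USD strings -> USD per 1M tokens
            "input_rate": float(model["pricing"]["prompt"]) * 1_000_000,
            "output_rate": float(model["pricing"]["completion"]) * 1_000_000,
        })
    return observations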

Step 6: Define Validation Rules

Specify source-specific rules:

validation:
  required_fields:
    - model_id
    - input_rate
    - output_rate

  rate_bounds:
    input_max: 100.0  # USD per 1M
    output_max: 500.0

  model_id_pattern: "^[a-z0-9-]+/[a-z0-9.-]+$"

  cross_reference:
    enabled: true
    tolerance: 0.10  # 10% tolerance
    reference_source: "openrouter"
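
To make the rules concrete, here is a minimal validator sketch. Field names mirror the YAML above; the real config loader and the cross-reference check against the reference source are out of scope here:

# Illustrative validator for the rules above; cross-reference checking
# against "openrouter" is omitted for brevity.
import re

REQUIRED_FIELDS = ("model_id", "input_rate", "output_rate")
RATE_BOUNDS = {"input_rate": 100.0, "output_rate": 500.0}  # USD per 1M
MODEL_ID_PATTERN = re.compile(r"^[a-z0-9-]+/[a-z0-9.-]+$")

def validate(observation: dict) -> list:
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in observation:
            errors.append(f"missing required field: {field}")
    if not MODEL_ID_PATTERN.match(str(observation.get("model_id", ""))):
        errors.append("model_id does not match pattern")
    for field, bound in RATE_BOUNDS.items():
        if float(observation.get(field, 0.0)) > bound:
            errors.append(f"{field} exceeds {bound} USD per 1M")
    return errors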

Step 7: Generate Test Fixtures

Create test data for the source:

// data/fixtures/{source_id}_sample.json
[
  {
    "observation_id": "obs-2026-01-01-{source_id}-{model}",
    "provider": "{provider}",
    ...
  }
]
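
A quick way to exercise the fixture is a pytest sketch that reuses the illustrative validate() helper from Step 6; the fixture path here is a placeholder:

# Hypothetical pytest check: every fixture row passes the Step 6 rules.
import json
from pathlib import Path

def test_fixture_rows_validate():
    rows = json.loads(Path("data/fixtures/example_sample.json").read_text())
    for row in rows:
        assert validate(row) == [], row["observation_id"]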

Output Checklist

After running this skill, you should have:

  • [ ] Source profile document (000-docs/0XX-DR-REFF-{source_id}-profile.md)
  • [ ] Collector module (services/collector/sources/{source_id}.py)
  • [ ] Test fixtures (data/fixtures/{source_id}_sample.json)
  • [ ] Validation config update (data/fixtures/methodology.yaml)
  • [ ] Beads task for implementation tracking

Example: Onboarding OpenRouter

Input

Source: OpenRouter
URL: https://openrouter.ai
API: https://openrouter.ai/api/v1/models

Research Output

Legal Assessment:

  • robots.txt: Allows /api/
  • ToS: Public API, no explicit redistribution restriction found
  • Rate limits: Standard API limits apply
  • Auth: None required for models endpoint
  • Risk: LOW

Data Structure:

{
  "data": [
    {
      "id": "openai/gpt-4o",
      "name": "GPT-4o",
      "pricing": {
        "prompt": "0.0000025",
        "completion": "0.00001"
      }
    }
  ]
}

Normalization:

  • id → model_id
  • pricing.prompt × 1M → input_rate_usd_per_1m
  • pricing.completion × 1M → output_rate_usd_per_1m
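
Applied to the sample record above, the conversion works out as follows (plain arithmetic, shown as Python for clarity):

# Per-token USD strings from the sample payload -> USD per 1M tokens.
input_rate_usd_per_1m = float("0.0000025") * 1_000_000   # 2.50
output_rate_usd_per_1m = float("0.00001") * 1_000_000    # 10.00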

Source Tier: T1 (public API, high confidence)

Related Documents

  • 007-RL-RSRC-data-strategy-research.md
  • 008-PM-RISK-legal-source-risks.md
  • 009-DR-TMPL-source-profile.md
  • 010-DR-SOPS-data-ops-practices.md

STCI Data Operations Skill, version 1.0.0