LangChain CI Integration (Python) Skill

LangChain CI Integration (Python)

Overview

A PR passes every test on your laptop. You push. GHA runs pytest and aborts during collection — before a single test executes — with:

PytestUnraisableExceptionWarning: Exception ignored in: ...
DeprecationWarning: langchain_community.llms ...

The org runs pytest -W error and a provider SDK emitted a DeprecationWarning at import time, which the warning filter promoted to an exception while pytest was still walking the test tree. This is P45 and it blocks every PR for the team until someone pins a filterwarnings config.

Meanwhile the integration suite has its own failure mode: a VCR cassette recorded three months ago at temperature=0 against Anthropic is now flaking against a snapshot. temperature=0 is not deterministic on Claude — it still nucleus-samples (P05) — so the cassette captured one valid completion, not the valid completion. And yesterday a reviewer caught Authorization: Bearer sk-ant-... in a cassette file that had been committed six weeks ago (P44) because vcrpy records all request headers by default.

This skill covers the outer loop: the GitHub Actions workflow, the unit / integration / eval gate separation, VCR cassette hygiene, pytest warning policy, and a merge-blocking eval regression gate. The inner loop — fake model fixtures, VCR recording workflow, local determinism tricks — lives in langchain-local-dev-loop (F23); cross-reference it, do not duplicate it. Pin: langchain-core 1.0.x, langgraph 1.0.x, actions/checkout@v4, actions/setup-python@v5, vcrpy 6.x. Pain-catalog anchors: P05, P43, P44, P45.

Prerequisites

Python 3.10, 3.11, or 3.12 (matrix)
langchain-core >= 1.0, < 2.0, langgraph >= 1.0, < 2.0
pytest >= 8, pytest-asyncio, vcrpy >= 6 (integration)
langchain-local-dev-loop (F23) applied locally — fixtures and recording workflow
GitHub repo with Actions enabled; secrets set for any live-API nightly job

Instructions

Step 1 — GHA workflow skeleton with four jobs

Single workflow at .github/workflows/tests.yml. Matrix on unit only; keep integration and eval single-version to control cost.

name: tests

on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * *"  # nightly live-API re-record check (06:00 UTC)

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: pip
          cache-dependency-path: |
            pyproject.toml
            requirements*.txt
      - run: pip install -e ".[test]"
      - run: pytest tests/unit/ -W error --timeout=30 -q

  integration:
    needs: unit
    if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'run-integration')
    runs-on: ubuntu-latest
    env:
      RUN_INTEGRATION: "1"
      VCR_MODE: "none"  # replay-only; nightly cron flips to "once"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,integration]"
      - run: pytest tests/integration/ -W error --timeout=60 -q

  eval:
    needs: unit
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # need base ref for delta comparison
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,eval]"
      - run: python scripts/run_eval.py --baseline origin/${{ github.base_ref }} --head HEAD --n 100
      # run_eval.py posts a PR comment and exits nonzero on regression > threshold

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[dev]"
      - run: ruff check .
      - run: python scripts/dryrun_load_chains.py   # catches ImportError migration regressions

See GHA Workflow Reference for the full job definitions including the secret-injection pattern, the matrix caching nuance, and the softprops/action-gh-release-style PR comment action used by the eval job.

Step 2 — Unit job: `-W error` + `filterwarnings` to neutralize P45

Root cause of the collection abort: pytest collects tests by importing them. Some provider SDKs emit DeprecationWarning on import. With -W error those become exceptions during collection. Fix at the filter level, not by dropping -W error (which would mask real warnings).

In pyproject.toml:

[tool.pytest.ini_options]
filterwarnings = [
    "error",
    # P45 — neutralize known import-time noise; scoped per module so new
    # warnings from YOUR code still fail the build.
    "ignore::DeprecationWarning:langchain_community.*",
    "ignore::DeprecationWarning:pydantic.*",
    "ignore:Pydantic serializer warnings:UserWarning",
]
asyncio_mode = "auto"
testpaths = ["tests"]

The ordering matters — "error" first, specific "ignore" entries after, so the filters override the global promote-to-error. Keep the list narrow: a blanket ignore::DeprecationWarning hides regressions you need to see.

Unit tests use FakeListChatModel fixtures from F23 (do not redefine them here). One CI-specific gotcha (P43): FakeListChatModel does not emit response_metadata["token_usage"], so any callback that asserts on token counts will break. Either subclass the fake and inject generation_info, or gate the assertion:

def test_chain_uses_tokens(patched_chat_model):
    result = chain.invoke({"input": "hi"})
    if patched_chat_model.__class__.__name__ == "FakeListChatModel":
        pytest.skip("fake model doesn't emit token_usage (P43)")
    assert result.response_metadata["token_usage"]["total_tokens"] > 0

Budget: unit job should finish in <2 minutes across the 3-version matrix. If it doesn't, something is calling out to a real provider — check with pytest --collect-only -q | wc -l and audit which tests lack fake-model fixtures.

Step 3 — Integration job: VCR replay + `filter_headers` (P44)

Integration tests replay pre-recorded VCR cassettes. Three rules:

Gate the job. if: contains(github.event.pull_request.labels.*.name, 'run-integration') or env.RUN_INTEGRATION == "1", plus a nightly cron that flips to VCR_MODE=once and re-records against live APIs. PRs default to pure replay.
Enforce filter_headers at the fixture level — not per-test. A single conftest.py prevents any contributor from recording a cassette with raw credentials.
Pre-commit + CI both scan cassettes for leaked keys. Belt and suspenders.

Fixture (lives in tests/integration/conftest.py, owned by this skill's pipeline concern — F23 owns the recording workflow):

import vcr
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": [
            "authorization",
            "x-api-key",
            "anthropic-version",
            ("openai-organization", "REDACTED"),
        ],
        "filter_post_data_parameters": ["api_key"],
        "record_mode": "none",  # CI default: replay only
        "match_on": ["method", "scheme", "host", "port", "path", "query"],
    }

Integration suite must finish in <5 minutes wall-clock on the runner, or you will start getting cancellation flakes from the concurrency block. If you exceed 5 minutes, split into a nightly-only long tier.

See Integration Gating for the full live-vs-replay decision tree, cost-per-run budget worksheet, and the VCR_MODE flip pattern.

Step 4 — Eval-regression gate: merge-blocking PR comment

The eval job runs the langchain-eval-harness harness (see that skill for the harness itself — this skill only covers the CI wire-up) against both the PR branch and the merge base. Post a comment; block merge on regression.

scripts/run_eval.py is a thin CI wrapper: check out baseline and head via git worktree, run the harness at each ref, diff the results, post a PR comment, exit nonzero on regression. Full implementation in Eval Regression Gate.

Thresholds:

| Gate | Threshold | Rationale | |------|-----------|-----------| | Aggregate score | drop >2% | One-sigma noise on n=100 with well-behaved evals | | Per-example score | drop >5% on any single case | Catches quiet regressions masked by aggregate averaging | | Sample size floor | n ≥ 100 | Below this, aggregate delta is dominated by noise |

The PR comment is a Markdown table with before / after / Δ per metric plus a bold red line if the gate failed. Required-status-check on the eval job completes the enforcement. See Eval Regression Gate for the comment template and the noise-budget calculation.

Step 5 — Pre-commit hooks: secret scan + prompt lint

Two layers: local (pre-commit) and CI (re-runs the same hooks as a final catch). Local alone is not sufficient — contributors can skip with -n. CI alone is slow. Run both.

.pre-commit-config.yaml:

repos:
  - repo: local
    hooks:
      - id: vcr-secret-scan
        name: VCR cassette secret scan (P44)
        entry: python scripts/scan_cassettes.py
        language: system
        files: "tests/integration/cassettes/.*\\.ya?ml$"
        pass_filenames: true

      - id: prompt-convention-lint
        name: prompt-convention lint
        entry: python scripts/lint_prompts.py
        language: system
        files: "prompts/.*\\.j2$|src/.*prompts?\\.py$"

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
      - id: ruff-format

  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
        args: ["--baseline", ".secrets.baseline"]

scan_cassettes.py greps for sk-[A-Za-z0-9]{20,}, sk-ant-[A-Za-z0-9_-]{20,}, AIza[A-Za-z0-9_-]{35} (Google), xoxb-, and Bearer [A-Za-z0-9._-]{20,}. Fail on any match. This is your last line of defense before P44 ships to main. See Pre-Commit Hooks for the full pattern list, the prompt-convention lint rules (aligned with claude-prompt-conventions), and the detect-secrets baseline-rotation policy.

Step 6 — Dry-run chain loader: catch ImportError migration breaks

LangChain 0.x → 1.0 moved integrations into provider packages. A chain that imports from langchain.chat_models import ChatOpenAI works in local dev if you still have the old compat shim installed, and explodes in CI. Dry-run-load every chain module at lint time:

# scripts/dryrun_load_chains.py
import importlib, pathlib, sys, traceback

failures = []
for py in pathlib.Path("src/chains").rglob("*.py"):
    mod = str(py.with_suffix("")).replace("/", ".")
    try:
        importlib.import_module(mod)
    except Exception:
        failures.append((mod, traceback.format_exc()))

if failures:
    for mod, tb in failures:
        print(f"::error::chain {mod} failed to import\n{tb}")
    sys.exit(1)

Runs in the lint job. Costs ~5 seconds. Catches every ImportError and every top-level NameError from a bad rename before a single unit test fires.

Output

GHA workflow with four isolated jobs (unit / integration / eval / lint)
pyproject.toml filterwarnings config that survives -W error (P45)
VCR conftest.py fixture with enforced filter_headers (P44)
run_eval.py CI wrapper that posts PR comments and blocks merge on regression
.pre-commit-config.yaml with cassette secret scan + prompt lint + ruff
Dry-run chain loader that catches migration ImportErrors

Gate policy

| Gate | Required? | Target speed | On failure | |------|-----------|--------------|------------| | unit (3 Python versions) | yes, every PR | <2 min | block PR | | lint + dryrun-load | yes, every PR | <30 s | block PR | | integration (VCR replay) | on run-integration label or nightly | <5 min | block merge when run | | integration (live, nightly cron) | no | <15 min | open issue on fail | | eval regression (n≥100) | yes, every PR | <10 min | block merge if agg >2% or per-example >5% | | pre-commit (local) | yes | <10 s | reject commit |

Error Handling

| Error | Cause | Fix | |-------|-------|-----| | PytestUnraisableExceptionWarning during collection | -W error + SDK import-time DeprecationWarning (P45) | Add scoped filterwarnings = ["ignore::DeprecationWarning:langchain_community.*"] to pyproject.toml | | VCR replay mismatch after weeks of passing | Cassette recorded at temp=0 on Anthropic (P05); model drift | Re-record on nightly cron with VCR_MODE=once; treat replay mismatches as eval-gate concerns, not unit failures | | sk-ant-... in cassette flagged by reviewer | vcrpy records all headers by default (P44) | Enforce filter_headers in conftest.py; add scan_cassettes.py to pre-commit AND CI | | Callback AssertionError: 'token_usage' not in response_metadata | FakeListChatModel doesn't emit metadata (P43) | Subclass the fake to inject generation_info, or pytest.skip on fake-model detection | | ImportError: cannot import name 'ChatOpenAI' from 'langchain.chat_models' in CI only | Legacy compat shim installed locally, not in CI | Add dryrun_load_chains.py to lint job; fail at lint, not at test | | Eval job times out at 10 min | n too large or harness not using asyncio concurrency | Cap at n=100 for PRs; run n=500 nightly; see F23 for async harness pattern | | Concurrency block cancels integration run | Long job + rapid pushes | Do not disable; keep integration <5 min or split long tier to nightly |

Examples

Wiring a new repo from scratch

Copy the Step 1 workflow, the Step 2 pyproject.toml block, and the Step 5 pre-commit config. Create tests/unit/, tests/integration/cassettes/, scripts/run_eval.py, scripts/dryrun_load_chains.py, scripts/scan_cassettes.py. Apply langchain-local-dev-loop (F23) first so fake-model fixtures exist before the unit job runs. Enable required status checks: unit (3.10), unit (3.11), unit (3.12), lint, eval. Integration stays optional (label-gated).

See GHA Workflow Reference for the complete copy-pasteable workflow.

Hardening after a P44 cassette-leak incident

Rotate every leaked key first (not a CI concern — incident response). Then: add scan_cassettes.py to pre-commit, re-scan the full history with git log -p -- tests/integration/cassettes/, rewrite history with git-filter-repo if keys hit main, enforce the filter_headers fixture going forward. See Pre-Commit Hooks for the full pattern list and the detect-secrets baseline-rotation playbook.

Wiring the eval harness into an existing repo

The harness itself lives in langchain-eval-harness. THIS skill only supplies run_eval.py (the CI wrapper that reads the harness output, computes deltas, and posts PR comments) plus the gate thresholds. Drop in the Step 4 script, add the eval job to .github/workflows/tests.yml, make eval a required status check. See Eval Regression Gate for the PR-comment Markdown template and the n≥100 noise-budget derivation.

Resources

LangChain Python: Testing
FakeListChatModel API
vcrpy docs — filtering sensitive data
GitHub Actions docs
pytest filterwarnings
Pair skill: langchain-local-dev-loop (F23) — fake fixtures, local recording
Pair skill: langchain-eval-harness — eval suite the gate runs against
Pack pain catalog: docs/pain-catalog.md (entries P05, P43, P44, P45)

Agent Skills: LangChain CI Integration (Python)

Install this agent skill to your local

Skill Files