Agent Skills: /marketplace:build-golden-set — Build or Improve a Golden Evaluation Set

ID: pproenca/dot-claude/build-golden-set

Install this agent skill locally:

pnpm dlx add-skill https://github.com/pproenca/dot-claude/tree/HEAD/domain_plugins/marketplace/skills/build-golden-set

Skill Files

Browse the full folder contents for build-golden-set.

domain_plugins/marketplace/skills/build-golden-set/SKILL.md

Skill Metadata

Name
build-golden-set
Description

/marketplace:build-golden-set — Build or Improve a Golden Evaluation Set

If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.

Golden sets are the foundation of pre-ship confidence for search and recsys changes. This skill assembles one from real data and known regressions, then grows it over time.

Usage

/marketplace:build-golden-set search
/marketplace:build-golden-set recsys
/marketplace:build-golden-set            # asks which type

Workflow

1. Load Company Context

Read:

  • golden-set.md — the existing golden set, if any. Don't duplicate cases.
  • events.md — to know which tables hold query logs and impression history
  • gotchas.md — most important: every prior incident is a regression candidate
  • marketplace.md — to know what the primary conversion event is
  • surfaces.md — to pick the target surface(s)

2. Choose the Mode

Search mode

Targets ~~search engine — builds queries + expected-result assertions.

Recsys mode

Targets ~~personalisation engine — builds user archetypes + expected-recommendation assertions.

If the argument is ambiguous, ask.

3. Branch A — Search Golden Set

3a. Identify the target surface

Which search surface? Some systems have multiple (autocomplete, full search, zero-result fallback). Default to the main full-search surface.

3b. Pull candidate queries

From ~~data warehouse (query logs):

-- Top queries by 30-day volume
SELECT
  query_text,
  COUNT(*) AS n,
  AVG(clicked) AS ctr,
  AVG(result_count) AS avg_result_count,
  AVG(zero_result) AS zero_result_rate
FROM search_log
WHERE event_date >= CURRENT_DATE - INTERVAL 30 DAYS
  AND surface_id = '{{surface}}'
GROUP BY 1
ORDER BY n DESC
LIMIT 100

Then cluster the queries by intent class:

  • Transactional local — "dog sitter brighton" (strong local + entity)
  • Transactional time — "weekend stay" (time-bound)
  • Transactional combined — "dog sitter brighton this weekend" (local + entity + time)
  • Exploratory — "house sit europe" (broad, discovery-seeking)
  • Known-item — the user is looking for a specific listing they've seen before
  • Navigational — "login", "help" (not real search; should route elsewhere)
  • Zero-intent — empty query or single-char noise

For each cluster, pick 3-5 representative queries — high-volume + hard-case + edge-case.
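The clustering step can be sketched with simple keyword rules. This is a toy illustration in Python; the buckets and terms are assumptions drawn from the example queries above, not a production classifier:

```python
import re

# Toy keyword buckets per intent class; all terms are illustrative.
# A real pipeline would use a trained classifier or query embeddings.
LOCAL = {"brighton", "london"}
TIME = {"weekend", "stay", "tonight"}
ENTITY = {"sitter", "sit", "house", "dog", "cat"}
NAV = {"login", "help", "account"}

def intent_class(query: str) -> str:
    """Assign a raw query string to one coarse intent class."""
    q = query.strip().lower()
    if len(q) <= 1:
        return "zero-intent"
    tokens = set(re.findall(r"[a-z]+", q))
    if tokens & NAV:
        return "navigational"
    has_local, has_time, has_entity = (
        bool(tokens & LOCAL), bool(tokens & TIME), bool(tokens & ENTITY))
    if has_local and has_time and has_entity:
        return "transactional-combined"
    if has_local and has_entity:
        return "transactional-local"
    if has_time:
        return "transactional-time"
    return "exploratory"
```

Hard cases and edge cases then come from the tail of each bucket, not from the classifier itself.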

3c. Add regression candidates from gotchas.md

For every past incident with a query signature, add the exact query that broke. These are the most valuable cases — they stop the same bug from reappearing.

If gotchas.md is thin, ask: "Any past search regressions you remember? Give me the query that broke."

3d. Add negative tests

Critical: negative tests (must NOT appear) catch more regressions than positive tests. Examples:

  • For "dog sitter brighton", a cat-only listing MUST NOT appear in top-5
  • For "long stay london" (21+ days), short-stay-only listings MUST NOT appear in top-10
  • For "safe home" queries, listings flagged for moderation MUST NOT appear at all

For each cluster, propose at least one negative test.
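A negative test of this kind reduces to one check: nothing in the top-N may match the forbidden filter. A minimal sketch, with illustrative field names:

```python
def negative_test_violations(results, forbidden, top_n):
    """Return listings in the top-N that match the forbidden filter.

    `results` is the ranked result list; `forbidden` is a predicate on a
    listing's features. Field names below are illustrative.
    """
    return [listing for listing in results[:top_n] if forbidden(listing)]

# "dog sitter brighton": a cat-only listing MUST NOT appear in top-5
results = [
    {"id": 1, "pet_types": {"dog"}, "geo": "brighton"},
    {"id": 2, "pet_types": {"cat"}, "geo": "brighton"},   # cat-only: should be flagged
    {"id": 3, "pet_types": {"dog", "cat"}, "geo": "hove"},
]
offenders = negative_test_violations(results, lambda l: l["pet_types"] == {"cat"}, top_n=5)
```

An empty `offenders` list means the negative test passes; anything else is a regression.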

3e. Capture assertions

For each query, record:

| Field | Content |
|-------|---------|
| Query | "{{text}}" |
| Intent class | {{class}} |
| Volume | {{30d count}} |
| Must appear in top-N | {{listing_ids or feature filter, e.g., "geo=brighton AND pet_friendly=dog"}} |
| Must NOT appear | {{listing_ids or filter}} |
| Ordering constraint | {{e.g., "listing X must rank above listing Y"}} (optional) |
| Provenance | {{log / hand-curated / regression from gotchas.md#case}} |
| Notes | {{edge cases, gotchas, ambiguity}} |

Prefer feature-filter assertions ("any listing matching geo=brighton AND pet=dog") over listing-ID assertions — they survive listing churn.
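One way to hold such a case is with the filters as feature predicates rather than ID lists. A sketch (all names are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

Listing = dict  # a listing's features as a plain dict in this sketch

@dataclass
class GoldenCase:
    """One golden-set row. Filters are feature predicates, not ID lists,
    so the case survives listing churn."""
    query: str
    intent_class: str
    must_appear: Callable[[Listing], bool]
    must_not_appear: Callable[[Listing], bool]
    top_n: int = 5
    provenance: str = "hand-curated"

    def check(self, results: list) -> bool:
        head = results[: self.top_n]
        return (any(self.must_appear(l) for l in head)
                and not any(self.must_not_appear(l) for l in head))

case = GoldenCase(
    query="dog sitter brighton",
    intent_class="transactional-local",
    must_appear=lambda l: l["geo"] == "brighton" and "dog" in l["pet_types"],
    must_not_appear=lambda l: "dog" not in l["pet_types"],
)
```

The predicates keep passing as individual listings come and go, as long as the ranked head still satisfies the feature constraints.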

3f. Present and write

Show the draft to the user, ask for corrections, then write to golden-set.md in the context skill (or to a path the user specifies if no context skill exists).

4. Branch B — Recsys Golden Set

4a. Identify the target surface

Homefeed? Similar listings? Saved search alert? Email? Each has different archetypes.

4b. Pull candidate user archetypes

From ~~data warehouse:

-- Sample users from each key cohort
WITH cohorts AS (
  SELECT
    user_id,
    CASE
      WHEN DATE_DIFF(CURRENT_DATE, signup_date, DAY) < 7 THEN 'new_paid'
      WHEN DATE_DIFF(CURRENT_DATE, signup_date, DAY) < 90 THEN 'recent_paid'
      WHEN first_booking_date IS NULL THEN 'paid_no_booking'
      ELSE 'repeat'
    END AS cohort,
    home_country,
    role,
    primary_pet_type
  FROM users
  WHERE active = TRUE
)
SELECT * FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY cohort, role, home_country
      ORDER BY RAND()  -- dialect-dependent; e.g. RANDOM() in some warehouses
    ) AS rn
  FROM cohorts
)
WHERE rn = 1

For each cohort × role × key geo, pick a representative user and define the archetype.

4c. Add cold-start archetypes

Cold-start is the highest-leverage case for recsys golden sets. Always include:

  • Anonymous visitor, new country, inferred role, no history
  • Registered but < 5 interactions, role inferred from onboarding
  • Paid but zero bookings, 2 weeks old
  • Long-dormant user, no interactions in 90+ days (near-cold-start)
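The archetypes above can be captured as fixtures plus a single membership rule. A sketch, assuming illustrative field names rather than the real user schema:

```python
# The four cold-start archetypes as fixture records.
COLD_START_ARCHETYPES = [
    {"name": "anon_new_country", "registered": False, "interactions": 0},
    {"name": "light_registered", "registered": True, "interactions": 3},
    {"name": "paid_no_booking", "registered": True, "interactions": 0,
     "paid": True, "account_age_days": 14},
    {"name": "long_dormant", "registered": True, "interactions": 40,
     "days_since_last_interaction": 120},
]

def is_cold_start(user: dict) -> bool:
    """(Near-)cold-start: little history, or history that has gone stale."""
    return (user.get("interactions", 0) < 5
            or user.get("days_since_last_interaction", 0) >= 90)
```

Encoding the rule once makes it easy to verify that every fixture in the set actually exercises the cold-start path.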

4d. Add regression archetypes from gotchas.md

Every recsys incident that affected a cohort is a candidate archetype. Add one user per cohort-based regression.

4e. Capture expected top-N and forbidden sets

For each archetype, record:

| Field | Content |
|-------|---------|
| Archetype | {{cohort × role × geo × key constraint}} |
| Expected top-N features | {{e.g., "majority should be role=opposite AND geo=same-country AND pet_match=true"}} |
| Forbidden features | {{e.g., "no listings flagged cold-start-only", "no listings with 0 reviews"}} |
| Diversity constraint | {{optional — e.g., "top-10 must include ≥ 3 distinct suppliers"}} |
| Provenance | {{synthetic / sampled from logs / regression from gotchas.md#case}} |
| Notes | {{edge cases}} |

Prefer feature-based expected sets over item-ID-based sets — listings churn, features don't.
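A recsys case of this shape reduces to three checks: expected features hold for the majority of the top-N, forbidden features for none of it, and diversity is satisfied. A sketch with illustrative field names:

```python
def check_recsys_case(recs, expected, forbidden, min_distinct_suppliers=None, top_n=10):
    """Evaluate one recsys golden case over the top-N recommendations."""
    head = recs[:top_n]
    if sum(expected(r) for r in head) <= len(head) // 2:
        return False  # expected features not in the majority
    if any(forbidden(r) for r in head):
        return False  # a forbidden listing slipped in
    if min_distinct_suppliers is not None:
        if len({r["supplier_id"] for r in head}) < min_distinct_suppliers:
            return False  # diversity constraint violated
    return True

# Archetype: new paid owner — majority pet-matched, nothing with 0 reviews
recs = [
    {"pet_match": True, "reviews": 3, "supplier_id": 1},
    {"pet_match": True, "reviews": 1, "supplier_id": 2},
    {"pet_match": False, "reviews": 2, "supplier_id": 3},
]
ok = check_recsys_case(recs,
                       expected=lambda r: r["pet_match"],
                       forbidden=lambda r: r["reviews"] == 0,
                       min_distinct_suppliers=3)
```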

4f. Present and write

Same as Branch A.

5. Grow Over Time

Golden sets are living documents. Every diagnosed incident (via /marketplace:diagnose) should add at least one new case. Prompt the user:

"The diagnosis revealed a regression in {{pattern}}. Do you want to add {{n}} case(s) to the golden set?"

6. Maintain Quality

Apply these quality rules:

  • Minimum size: 10 cases. Below that, coverage is too thin to be reliable.
  • Maximum manageable size: around 300 cases. Beyond that, run the eval on a sample.
  • Balance: no more than 30% of cases from a single intent class / archetype.
  • At least 20% negative tests: they catch more regressions than positive tests.
  • At least 10% from gotchas.md: regression prevention is the whole point.
  • Provenance tracked for every case: if you don't know where a case came from, you can't defend it when it breaks.
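These rules are mechanical enough to lint automatically. A sketch of such a validator, assuming each case carries the fields from the tables above (key names are illustrative):

```python
from collections import Counter

def quality_issues(cases):
    """Check a golden set against the balance rules above.

    Each case is a dict with `intent_class`, `is_negative`, and
    `provenance` keys. Returns human-readable violations; an empty
    list means the set passes.
    """
    issues = []
    n = len(cases)
    if n < 10:
        issues.append(f"too small: {n} cases (minimum 10)")
    elif n > 300:
        issues.append(f"too large to run in full: {n} cases (sample beyond 300)")
    if n == 0:
        return issues
    for cls, count in Counter(c["intent_class"] for c in cases).items():
        if count / n > 0.30:
            issues.append(f"imbalanced: {cls} is {count / n:.0%} of cases (max 30%)")
    if sum(c["is_negative"] for c in cases) / n < 0.20:
        issues.append("fewer than 20% negative tests")
    if sum("gotchas" in c["provenance"] for c in cases) / n < 0.10:
        issues.append("fewer than 10% of cases sourced from gotchas.md")
    return issues
```

Running this on every write to golden-set.md keeps the set defensible as it grows.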

Read-only posture

This skill reads from ~~search engine, ~~data warehouse, and ~~personalisation engine but never writes. File writes are limited to golden-set.md in the user's context skill (and only with confirmation).

Examples

Build a search golden set from scratch

/marketplace:build-golden-set search

Pulls 30d of queries, clusters, proposes cases, captures assertions, writes to context skill.

Grow an existing golden set with regression cases

/marketplace:build-golden-set search

If a golden set exists, go into iteration mode: enumerate unresolved cases from gotchas.md, propose new tests, merge.

Build a recsys golden set for a new surface

/marketplace:build-golden-set recsys

Ask which surface, pull archetypes, capture expected top-N features, write.

Tips

  • Start with gotchas.md. Every past incident is a case you already know should pass.
  • Prefer feature-based assertions over listing-ID-based ones. IDs churn; features don't.
  • Always include anonymous-visitor and cold-start cohorts for recsys — they're the cases most likely to fail silently.
  • Always include a navigational case for search — "login" should NEVER return listings, and this is a common regression point after query rewrites.
  • Don't skip negative tests. A golden set with only positive tests misses 60%+ of real regressions.
  • Re-run the golden set weekly, not just pre-ship. Silent degradation happens.

Related Commands

  • /marketplace:bootstrap-context — run first to ensure a home for the golden set
  • /marketplace:review-change — the primary consumer of this artefact
  • /marketplace:diagnose — every diagnosis session is an opportunity to add cases