/marketplace:review-change — Pre-Ship Review of a Relevance or Personalisation Change

If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.

Stress-test a proposed change to search or personalisation before it ships. Pulls the golden set, runs offline eval, walks the rule checklist, checks observability for blind spots, and recommends a ship decision.

Usage

/marketplace:review-change <free-form description of the change, or a PR link>

Examples:

/marketplace:review-change adding synonym map "doggy -> dog" and expanding "weekend" to "(friday OR saturday OR sunday)"
/marketplace:review-change swapping homefeed recipe from user-personalization-v1 to user-personalization-v2
/marketplace:review-change adding serving-time filter "cap-provider-4pct" to homefeed
/marketplace:review-change https://github.com/company/repo/pull/1234

Workflow

1. Load Company Context

Read the <company>-marketplace-context skill:

indexes.md — if the change touches ~~search engine
recipes.md — if the change touches ~~personalisation engine
golden-set.md — the offline eval corpus
observability.md — the monitors that should catch a regression
gotchas.md — any prior incidents matching the change type

2. Classify the Change

| Type | Signal |
|------|--------|
| ~~search engine relevance | synonym, analyzer, mapping, BM25, LTR model, query rewrite, filter clause movement, boost weight |
| ~~personalisation engine change | recipe swap, dataset schema, filter DSL, re-ranker, cold-start solution, feature set |
| Observability / infrastructure | monitor threshold, alert routing, dashboard only; route to /marketplace:build-observability instead |

If ambiguous, ask. If the change touches both types, split the review into two passes.
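
As a rough illustration of the table above, a keyword match can pre-sort a change description before a human confirms the classification. This is a sketch only: the keyword lists are a hypothetical subset of the signals, and `classify_change` and `route` are illustrative names, not part of the skill.

```python
# Hypothetical keyword subsets taken from the signal table; extend per company context.
SIGNALS = {
    "search": ["synonym", "analyzer", "mapping", "bm25", "ltr model", "query rewrite", "boost"],
    "personalisation": ["recipe", "dataset schema", "re-ranker", "cold-start", "feature set"],
    "observability": ["monitor threshold", "alert routing", "dashboard"],
}


def classify_change(description: str) -> list[str]:
    """Return every change type whose signals appear in the description."""
    text = description.lower()
    return [kind for kind, words in SIGNALS.items() if any(w in text for w in words)]


def route(description: str) -> str:
    """Map a description to a review branch, a redirect, or a request to ask."""
    kinds = set(classify_change(description))
    if kinds == {"observability"}:
        return "route to /marketplace:build-observability"
    branches = [
        branch
        for kind, branch in (("search", "Branch A"), ("personalisation", "Branch B"))
        if kind in kinds
    ]
    if not branches:
        return "ask"  # ambiguous: no known signal matched
    return " + ".join(branches)  # two entries means two review passes
```

A substring match like this will misfire on unusual wording, so treat its answer as a first guess and still ask when the description is ambiguous.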

3. Branch A — Search Change

3a. Summarise the change

Restate the change in structured form:

File(s) affected: e.g., analyzer JSON, query builder, mapping template
Scope: which index(es), which field(s), which query type
Reversibility: trivial to roll back, or requires re-index?

3b. Run offline eval against the golden set

Load golden-set.md. If empty or thin, advise running /marketplace:build-golden-set search first.

If ~~search engine is connected:

Execute each golden query against the current production settings
Execute each against a staging clone or point-in-time snapshot with the proposed change applied
Compute per-query metrics: nDCG@10, MRR, zero-result rate, position delta for must-appear items
Report per-query diff, sorted by worst regression

If not connected: Advise running the diff manually, and describe the minimal shape of the comparison the user should run.
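
The minimal shape of that comparison could look like the sketch below. It assumes the golden set is available as a mapping from query to graded item IDs and that each run is a mapping from query to a ranked list of item IDs; those shapes and all function names are assumptions for illustration, not the format of golden-set.md. The nDCG here uses linear gain (grade divided by the log discount), which is one of several common variants.

```python
import math


def ndcg_at_k(ranked_ids, grades, k=10):
    """nDCG@k with linear gain; grades maps item id -> relevance (missing = 0)."""
    dcg = sum(grades.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


def reciprocal_rank(ranked_ids, grades):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if grades.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


def zero_result_rate(run, queries):
    """Share of queries that returned no results at all."""
    return sum(1 for q in queries if not run.get(q)) / len(queries)


def diff_runs(golden, baseline, candidate, k=10):
    """Per-query (query, old nDCG, new nDCG, delta), worst regression first."""
    rows = []
    for query, grades in golden.items():
        old = ndcg_at_k(baseline.get(query, []), grades, k)
        new = ndcg_at_k(candidate.get(query, []), grades, k)
        rows.append((query, old, new, new - old))
    return sorted(rows, key=lambda row: row[3])
```

MRR is the mean of `reciprocal_rank` over all golden queries; reporting it next to the sorted `diff_runs` output gives both the aggregate and the per-query view.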

3c. Walk the rule checklist

Apply these checks (from marketplace-search-recsys-planning):

| Rule | Applies when | Check |
|------|-------------|-------|
| query-curate-synonyms-by-domain | synonym change | Are the synonyms validated per intent class? |
| index-match-index-and-query-time-analyzers | analyzer change | Do both sides use the same analyzer? |
| rank-tune-bm25-parameters-last | BM25 parameter change | Is the golden-set effect measured before rolling out? |
| rank-normalise-scores-across-retrieval-primitives | score change | Are scores from different primitives normalised? |
| retrieve-use-filter-clauses-for-exact-matches | filter→must movement | Does the change lose a hard constraint? |
| retrieve-use-bool-structure-deliberately | bool rewrite | Is the should / must / filter split explicit? |
| query-use-fuzzy-matching-for-typos | fuzzy change | Is the fuzziness budget bounded? |
| query-normalise-before-anything-else | any query-time change | Is normalisation still earliest in the pipeline? |
| index-use-index-templates-for-consistency | mapping change | Does the change propagate via template? |
| measure-track-ndcg-mrr-zero-result-rate | any | Will the deployed system track the metrics before / after? |

3d. Check observability

From observability.md:

Which monitors would catch a regression within 6 hours?
Are they currently green? Are there pre-existing alerts that would confound the change's effect?
Is the SLO error budget sufficient to absorb a partial regression?
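
These three questions can be answered mechanically once the monitors are listed in a structured form. The sketch below assumes a hypothetical `Monitor` record with a worst-case detection time and a current status; observability.md may not store monitors this way, and every field and function name here is illustrative.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    metric: str
    detection_hours: float  # worst-case time from regression to alert (assumed field)
    status: str             # "green" or "alerting"


def observability_check(monitors, metrics_at_risk, max_hours=6.0):
    """Split fast-enough monitors by status and list metrics nobody watches in time."""
    fast = [
        m for m in monitors
        if m.metric in metrics_at_risk and m.detection_hours <= max_hours
    ]
    covered = {m.metric for m in fast}
    return {
        "catching": [m.name for m in fast if m.status == "green"],
        # Already firing: a regression from this change would be hard to tell apart.
        "already_alerting": [m.name for m in fast if m.status != "green"],
        "blind_spots": sorted(set(metrics_at_risk) - covered),
    }


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the SLO error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad <= 0:
        return 0.0  # a 100% target leaves no budget to absorb anything
    return 1 - (total_events - good_events) / allowed_bad
```

For example, a 99% SLO with 9,950 good events out of 10,000 has spent half of its budget, so `error_budget_remaining(0.99, 9950, 10000)` is approximately 0.5.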

3e. Produce review verdict

Structured output:

## Review: {{change description}}

### Summary
{{1-paragraph restatement}}

### Golden-set diff
- **Aggregate**: nDCG@10 {{delta}}, MRR {{delta}}, zero-result {{delta}}
- **Worst regressions** ({{n}} queries):
  - {{query}}: nDCG {{old → new}} — {{suspected cause}}
  - ...
- **Improvements** ({{n}} queries):
  - ...

### Rule checklist
- [✓] {{rule}} — {{notes}}
- [✗] {{rule}} — {{issue + remediation}}

### Observability
- **Monitors that would catch regression**: {{list}}
- **SLO error budget**: {{state}}
- **Blind spots**: {{list}}

### Verdict
{{ship | ship-with-A/B | partial-rollback | no-ship}}

### Recommended next step
{{specific action}}

### Gotcha log entry (draft)
{{1-line entry for gotchas.md if this ships and something goes wrong}}

4. Branch B — Personalisation Change

4a. Summarise the change

Restate in structured form:

Target: dataset, solution, filter, re-ranker, feature, cold-start fallback
Scope: which surfaces the change affects
Reversibility: instant rollback (campaign / solution switch), re-train required, or dataset rebuild required?

4b. Sanity-check against the library

Apply these checks (from marketplace-personalisation):

| Rule | Applies when | Check |
|------|-------------|-------|
| schema-design-conservatively | schema change | Does the new schema deprecate a field safely? |
| schema-meet-minimum-dataset-sizes | recipe / dataset change | Are the minimum dataset size requirements still met? |
| schema-include-context-everywhere | schema change | Is the context propagated to training and serving? |
| schema-prefer-categorical-fields | feature change | Are features categorical where appropriate? |
| recipe-default-to-user-personalization-v2 | recipe swap | Is the new recipe the right choice for this surface's intent? |
| recipe-sims-for-item-page-only | surface change | Is the recipe matched to the surface's job? |
| recipe-defer-hpo-until-baseline-measured | HPO change | Is a baseline measured first? |
| cold-tag-cold-start-recs | cold-start fallback change | Are cold-start impressions tagged for downstream analysis? |
| cold-reserve-exploration-slots | ranker change | Does exploration get reserved slots? |
| loop-detect-death-spirals | ranker change | Is the monitor in place? |
| loop-reserve-random-exploration | ranker change | Is there a serving-time exploration reserve? |
| loop-optimize-completed-outcome | label change | Is the ranking label the completed outcome (e.g., booking), not an intermediate (e.g., click)? |
| match-balance-supply-demand | ranker change | Does the ranker respect feasibility? |
| match-cap-provider-exposure | ranker change | Is there a single-supplier cap? |
| infer-use-filters-api | filter change | Is the filter using the managed API and not a hand-rolled rewrite? |
| infer-cache-responses-short-ttl | serving change | Is the TTL conservative enough to avoid staleness? |
| track-use-stable-opaque-item-ids | ID change | Are item IDs stable and opaque? |
| track-stream-events-via-putevents | event change | Is streaming used for live-user signals? |
| track-capture-negative-signals | event change | Are negative signals captured? |
| obs-slice-metrics-by-segment | observability | Are metrics decomposed by segment? |

If the change introduces or modifies a feature (text, vision, wizard-sourced, structured, or derived), also apply these rules from marketplace-recsys-feature-engineering:

| Rule | Applies when | Check |
|------|-------------|-------|
| firstp-start-from-the-decision-not-the-algorithm | any new feature | Is the feature tied to a specific decision, not "nice to have"? |
| firstp-tie-every-feature-to-a-specific-solution | any new feature | Does it have at least one named consuming solution? |
| firstp-reject-features-you-cannot-serve-at-inference | any new feature | Is training-serving parity guaranteed from design time? |
| firstp-kill-features-a-popularity-baseline-already-captures | any new feature | Does it beat a popularity-baseline ablation? |
| audit-measure-coverage-before-modelling | any new feature | Is the source asset ≥ 80% coverage? |
| quality-version-feature-definitions-in-one-registry | any new feature | Is the definition registered, versioned, and owned? |
| quality-serve-training-and-inference-from-one-store | any new feature | Is the feature served from the same store used for training? |
| quality-gate-features-on-coverage-and-drift | any new feature | Are coverage-floor and PSI-drift alarms configured? |
| quality-scrub-pii-before-features-leave-secure-zone | face / text feature | Is PII scrubbing in place before encoding? |
| prove-ship-one-feature-at-a-time | rollout | Is the change isolated to a single feature? |
| prove-measure-lift-against-feature-ablated-variant | rollout | Is the A/B against a feature-ablated variant (not just control)? |
| prove-dedicate-random-exploration-slice-to-new-features | rollout | Is a 3-5% exploration slice reserved? |
| prove-kill-features-that-dont-earn-maintenance | any existing feature being modified | Is it on the quarterly kill-review list? |
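
The coverage and PSI-drift checks in the table can be made concrete with a small gate. This is a sketch under stated assumptions: the 80% coverage floor comes from the table, while the 0.2 PSI alarm level is a common rule of thumb rather than a value from this skill, and the function names and pre-binned input shape are hypothetical.

```python
import math


def coverage(values):
    """Share of rows where the feature is present (not None)."""
    return sum(v is not None for v in values) / len(values)


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts (same bins, same order)."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp empty bins to eps so the logarithm stays defined.
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        score += (a_share - e_share) * math.log(a_share / e_share)
    return score


def gate(values, train_bins, serve_bins, coverage_floor=0.80, psi_alarm=0.2):
    """Pass only if coverage meets the floor and drift stays under the alarm level."""
    return {
        "coverage_ok": coverage(values) >= coverage_floor,
        "drift_ok": psi(train_bins, serve_bins) < psi_alarm,  # assumed threshold
    }
```

Here `train_bins` would hold the feature's histogram at training time and `serve_bins` the same histogram computed from serving traffic; identical histograms give a PSI of 0.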

4c. Check observability

Same as the Search branch (3d): which monitors would catch the regression, how much SLO error budget remains, and where the blind spots are.

4d. Offline eval (if possible)

Personalisation offline eval is harder than search because labels are sparse. If a historical holdout set exists, run it. Otherwise, propose online evaluation via a small A/B as the eval mechanism.

4e. Produce review verdict

Same structure as Branch A.

5. Experiment Design (if ship-with-A/B)

When the verdict is ship-with-A/B, draft an experiment:

### Experiment Design

- **Primary outcome**: {{metric}} (directly tied to primary revenue event from `marketplace.md`)
- **Guardrails**: {{list, including latency and error rate}}
- **MDE**: {{minimum detectable effect, and the sample-size math}}
- **Cohort**: {{who sees treatment}}
- **Traffic split**: {{percentage}}
- **Duration**: {{estimate}}
- **Ship criterion**: {{explicit decision rule — "primary non-inferior + at least one guardrail positive"}}
- **Kill criterion**: {{explicit — "primary underperforms by MDE or any guardrail regresses by X"}}
- **Interleaving** (search only): {{use if a fast preference signal is feasible — see `measure-run-interleaving-for-fast-experiments`}}
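
For the MDE line, the sample-size math could be sketched as follows. This assumes the primary outcome is a conversion rate compared with a two-sided two-proportion z-test under a normal approximation, with an equal traffic split; the defaults (alpha 0.05, power 0.80) are conventional choices, not values prescribed by this skill, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Users per arm needed to detect an absolute lift of mde_abs.

    Normal-approximation formula for comparing two proportions, equal split.
    """
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def duration_days(n_per_arm, daily_users_per_arm):
    """Whole days until each arm has collected n_per_arm users."""
    return math.ceil(n_per_arm / daily_users_per_arm)


# Example: 5% baseline conversion, detect a 0.5 percentage-point absolute lift.
n = sample_size_per_arm(0.05, 0.005)
days = duration_days(n, daily_users_per_arm=5_000)
```

Halving the MDE roughly quadruples the required sample, which is why the Duration estimate should be derived from the MDE rather than chosen independently.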

6. Decisions Log

After review, offer to append the decision to a decisions log in the context skill (create if missing). Per plan-maintain-a-decisions-log: every relevance / recsys change should leave a trail.

Read-only posture

This skill runs read-only queries against ~~search engine, ~~personalisation engine, ~~observability, and ~~data warehouse. It does not deploy changes, re-train models, or modify configurations. All writes are to files inside the context skill.

Examples

Search change

/marketplace:review-change adding synonym map "doggy -> dog" and expanding "weekend"

Returns the structured search review with golden-set diff, rule checklist, observability check, and verdict, plus an experiment draft when the verdict is ship-with-A/B.

Personalisation change

/marketplace:review-change swap homefeed recipe from user-personalization-v1 to v2

Returns the structured personalisation review with rule checklist, observability check, and verdict, plus an experiment draft when the verdict is ship-with-A/B.

Change to a surface lacking a golden set

/marketplace:review-change re-ranking listing-detail similar items

First advises running /marketplace:build-golden-set recsys for the similar-items surface, then proceeds with whatever eval is possible.

Tips

Never ship a ranking change without at least a partial golden-set diff. The cost of a regression is much higher than the cost of running the diff.
Watch for "obvious" synonym changes — they have the highest surprise rate because they affect many queries silently.
Recipe swaps have a long tail: changes that pass offline eval can still fail in production for 2-4 weeks as the feedback loop reshapes.
Prefer rollback readiness over forward-fix heroics. A change that's easy to roll back is a change you can ship with confidence.
Don't review a change without checking gotchas.md. Past incidents often preview the failure modes of similar future changes.

Related Commands

/marketplace:build-golden-set — prerequisite for any search-branch review
/marketplace:diagnose — run after ship if the change shows unexpected effects
/marketplace:build-observability — if the review reveals a monitoring blind spot
