Agent Skills: RAG & Search Engineering — Complete Reference

RAG and search engineering — chunking, hybrid retrieval, reranking, and nDCG evaluation. Use when building retrieval-augmented generation pipelines.

ID: vasilyu1983/ai-agents-public/ai-rag

Install this agent skill to your local environment:

pnpm dlx add-skill https://github.com/vasilyu1983/AI-Agents-public/tree/HEAD/frameworks/shared-skills/skills/ai-rag

Skill Files

Browse the full folder contents for ai-rag.

frameworks/shared-skills/skills/ai-rag/SKILL.md

Skill Metadata

Name: ai-rag
Description: RAG and search engineering — chunking, hybrid retrieval, reranking, and nDCG evaluation. Use when building retrieval-augmented generation pipelines.

RAG & Search Engineering — Complete Reference

Build production-grade retrieval systems with hybrid search, grounded generation, and measurable quality.

This skill covers:

  • RAG: Chunking, contextual retrieval, grounding, adaptive/self-correcting systems
  • Search: BM25, vector search, hybrid fusion, ranking pipelines
  • Evaluation: recall@k, nDCG, MRR, groundedness metrics

Modern Best Practices (Jan 2026):

  • Separate retrieval quality from answer quality; evaluate both (RAG: https://arxiv.org/abs/2005.11401).
  • Default to hybrid retrieval (sparse + dense) with reranking when precision matters (DPR: https://arxiv.org/abs/2004.04906).
  • Use a failure taxonomy to debug systematically (Seven Failure Points in RAG: https://arxiv.org/abs/2401.05856).
  • Treat freshness/invalidation as first-class; staleness is a correctness bug, not a UX issue.
  • Add grounding gates: answerability checks, citation coverage checks, and refusal-on-missing-context defaults.
  • Threat-model RAG: retrieved text is untrusted input (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).

Default posture: deterministic pipeline, bounded context, explicit failure handling, and telemetry for every stage.
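
The grounding gates above can be as simple as an answerability check in front of generation. A minimal sketch, assuming a score-bearing evidence structure from the retriever; the thresholds and field names are illustrative, not part of any specific framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    doc_id: str
    text: str
    score: float  # retriever/reranker relevance score, higher is better

def answerability_gate(evidence: list[Evidence],
                       min_score: float = 0.35,
                       min_passages: int = 1) -> bool:
    """Pass only if enough sufficiently relevant evidence was retrieved.

    The thresholds are illustrative; calibrate them on a labeled set of
    answerable vs. unanswerable queries.
    """
    strong = [e for e in evidence if e.score >= min_score]
    return len(strong) >= min_passages

def answer_or_refuse(question: str,
                     evidence: list[Evidence],
                     generate: Callable[[str, list[Evidence]], str]) -> dict:
    """Refuse by default on missing context; otherwise generate with citations."""
    if not answerability_gate(evidence):
        return {"answer": None,
                "refusal": "Insufficient supporting evidence retrieved.",
                "citations": []}
    return {"answer": generate(question, evidence),
            "citations": [e.doc_id for e in evidence]}
```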

Scope note: For prompt structure and output contracts used in the generation phase, see ai-prompt-engineering.

Quick Reference

| Task | Tool/Framework | Command/Pattern | When to Use |
|------|----------------|-----------------|-------------|
| Decide RAG vs alternatives | Decision framework | RAG if: freshness + citations + corpus size; else: fine-tune/caching | Avoid unnecessary retrieval latency/complexity |
| Chunking & parsing | Chunker + parser | Start simple; add structure-aware chunking per doc type | Ingestion for docs, code, tables, PDFs |
| Retrieval | Sparse + dense (hybrid) | Fusion (e.g., RRF) + metadata filters + top-k tuning | Mixed query styles; high recall requirements |
| Precision boost | Reranker | Cross-encoder/LLM rerank of top-k candidates | When top-k contains near-misses/noise |
| Grounding | Output contract + citations | Quote/ID citations; answerability gate; refuse on missing evidence | Compliance, trust, and auditability |
| Evaluation | Offline + online eval | Retrieval metrics + answer metrics + regression tests | Prevent silent regressions and staleness failures |
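
The fusion step in the Retrieval row is often reciprocal rank fusion (RRF). A minimal sketch, assuming each retriever returns a best-first list of document IDs; k=60 is the commonly cited default, not a tuned value:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each input list is ordered best-first (e.g., one from BM25, one from a
    dense retriever). k=60 is the widely used default; tune per corpus.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and dense retriever outputs before reranking.
bm25_hits = ["doc_7", "doc_2", "doc_9"]
dense_hits = ["doc_2", "doc_5", "doc_7"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])  # doc_2 and doc_7 rise to the top
```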

Decision Tree: RAG Architecture Selection

Building RAG system: [Architecture Path]
    ├─ Document type?
    │   ├─ Page/section-structured? → Structure-aware chunking (pages/sections + metadata)
    │   ├─ Technical docs/code? → Structure-aware + code-aware chunking (symbols, headers)
    │   └─ Simple content? → Fixed-size token chunking with overlap (baseline)
    │
    ├─ Retrieval accuracy low?
    │   ├─ Query ambiguity? → Query rewriting + multi-query expansion + filters
    │   ├─ Noisy results? → Add reranker + better metadata filters
    │   └─ Mixed queries? → Hybrid retrieval (sparse + dense) + reranking
    │
    ├─ Dataset size?
    │   ├─ <100k chunks? → Flat index (exact search)
    │   ├─ 100k-10M? → HNSW (low latency)
    │   └─ >10M? → IVF/ScaNN/DiskANN (scalable)
    │
    └─ Production quality?
        └─ Add: ACLs, freshness/invalidation, eval gates, and telemetry (end-to-end)
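
For the baseline branch of the tree (fixed-size token chunking with overlap), a minimal sketch assuming the tiktoken tokenizer; the chunk size and overlap values are starting points to tune against retrieval metrics, not recommendations:

```python
import tiktoken  # pip install tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 400, overlap: int = 50,
                    encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into fixed-size token windows with overlap."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(enc.decode(window))
    return chunks
```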

Core Concepts (Vendor-Agnostic)

  • Pipeline stages: ingest → chunk → embed → index → retrieve → rerank → pack context → generate → verify.
  • Two evaluation planes: retrieval relevance (did we fetch the right evidence?) vs generation fidelity (did we use it correctly?).
  • Freshness model: staleness budget, invalidation triggers, and rebuild strategy (incremental vs full).
  • Trust boundaries: retrieved content is untrusted; apply the same rigor as user input (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).
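
For the retrieval-relevance plane, the usual metrics (recall@k, MRR, nDCG@k) are each only a few lines. A minimal sketch, assuming per-query relevance judgments keyed by document ID:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """nDCG@k with graded relevance judgments in `gains` (doc_id -> gain)."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```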

Implementation Practices (Tooling Examples)

  • Use a retrieval API contract: query, filters, top_k, trace_id, and returned evidence IDs.
  • Instrument each stage with tracing/metrics (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).
  • Add caches deliberately: embeddings cache, retrieval cache (query+filters), and response cache (with invalidation).
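
One way to spell out the retrieval API contract above is as typed request/response objects; this is a sketch with illustrative field names, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalRequest:
    query: str
    top_k: int = 10
    filters: dict[str, str] = field(default_factory=dict)  # e.g. {"tenant": "acme", "lang": "en"}
    trace_id: str = ""  # propagate through every stage for telemetry

@dataclass
class EvidenceChunk:
    evidence_id: str  # stable ID, reused later for citations
    doc_id: str
    text: str
    score: float
    metadata: dict[str, str] = field(default_factory=dict)

@dataclass
class RetrievalResponse:
    request: RetrievalRequest
    results: list[EvidenceChunk]
    latency_ms: float = 0.0  # recorded per stage for tracing/metrics
```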

Do / Avoid

Do

  • Do keep retrieval deterministic: fixed top_k, stable ranking, explicit filters.
  • Do enforce document-level ACLs at retrieval time (not only at generation time).
  • Do include citations with stable IDs and verify citation coverage in tests.

Avoid

  • Avoid shipping RAG without a test set and regression gate.
  • Avoid "stuff everything" context packing; it increases cost and can reduce accuracy.
  • Avoid mixing corpora without metadata and tenant isolation.
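
The citation-coverage check from the Do list can run as a regression test against the output contract. A minimal sketch, assuming inline markers of the form [doc_id]; that marker format is an assumption about the contract, not a fixed convention:

```python
import re

def citation_coverage(answer: str, evidence_ids: set[str]) -> float:
    """Share of sentences carrying at least one citation to retrieved evidence."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    cited = 0
    for sentence in sentences:
        refs = set(re.findall(r"\[([^\[\]]+)\]", sentence))
        if refs & evidence_ids:
            cited += 1
    return cited / len(sentences)

# Example regression gate: fail the suite if coverage drops below a floor.
answer = "Latency fell 30% after enabling HNSW [doc_12]. Costs were unchanged [doc_7]."
assert citation_coverage(answer, {"doc_12", "doc_7"}) >= 0.9
```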

When to Use This Skill

Use this skill when the user asks:

  • "Help me design a RAG pipeline."
  • "How should I chunk this document?"
  • "Optimize retrieval for my use case."
  • "My RAG system is hallucinating — fix it."
  • "Choose the right vector database / index type."
  • "Create a RAG evaluation framework."
  • "Debug why retrieval gives irrelevant results."

Tool/Model Recommendation Protocol

When users ask for vendor/model/framework recommendations, validate claims against current primary sources.

Triggers

  • "What's the best vector database for [use case]?"
  • "What should I use for [chunking/embedding/reranking]?"
  • "What's the latest in RAG development?"
  • "Current best practices for [retrieval/grounding/evaluation]?"
  • "Is [Pinecone/Qdrant/Chroma] still relevant in 2026?"
  • "[Vector DB A] vs [Vector DB B]?"
  • "Best embedding model for [use case]?"
  • "What RAG framework should I use?"

Required Checks

  1. Read data/sources.json and start from sources with "add_as_web_search": true.
  2. Verify 1-2 primary docs per recommendation (release notes, benchmarks, docs).
  3. If browsing isn't available, state assumptions and give a verification checklist.

What to Report

After checking, provide:

  • Current landscape: What vector DBs/embeddings are popular NOW (not 6 months ago)
  • Emerging trends: Techniques gaining traction (late interaction, agentic RAG, graph RAG)
  • Deprecated/declining: Approaches or tools losing relevance
  • Recommendation: Based on fresh data, not just static knowledge

Example Topics (verify with current sources)

  • Vector databases (Pinecone, Qdrant, Weaviate, Milvus, pgvector, LanceDB)
  • Embedding models (OpenAI, Cohere, Voyage AI, Jina, Sentence Transformers)
  • Reranking (Cohere Rerank, Jina Reranker, FlashRank, RankGPT)
  • RAG frameworks (LlamaIndex, LangChain, Haystack, txtai)
  • Advanced RAG (contextual retrieval, agentic RAG, graph RAG, CRAG)
  • Evaluation (RAGAS, TruLens, DeepEval, BEIR)

Related Skills

For adjacent topics, reference these skills:

  • ai-llm - Prompting, fine-tuning, instruction datasets
  • ai-agents - Agentic RAG workflows and tool routing
  • ai-llm-inference - Serving performance, quantization, batching
  • ai-mlops - Deployment, monitoring, security, privacy, and governance
  • ai-prompt-engineering - Prompt patterns for RAG generation phase

Templates

  • System Design (Start Here)
  • Chunking & Ingestion
  • Embedding & Indexing
  • Retrieval & Reranking
  • Context Packaging & Grounding
  • Evaluation
  • Search Configuration
  • Query Rewriting

Navigation

  • Resources
  • Templates
  • Data

Use this skill whenever the user needs retrieval-augmented system design or debugging, not prompt work or deployment.

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
  • Prefer primary sources; report source links and dates for volatile information.
  • If web access is unavailable, state the limitation and mark guidance as unverified.