Senior Prompt Engineer
Prompt engineering patterns, LLM evaluation frameworks, and agentic system design.
Table of Contents
- Quick Start
- Tools Overview
- Prompt Engineering Workflows
- Reference Documentation
- Common Patterns Quick Reference
Quick Start
# Analyze and optimize a prompt file
python scripts/prompt_optimizer.py prompts/my_prompt.txt --analyze
# Evaluate RAG retrieval quality
python scripts/rag_evaluator.py --contexts contexts.json --questions questions.json
# Visualize agent workflow from definition
python scripts/agent_orchestrator.py agent_config.yaml --visualize
Tools Overview
1. Prompt Optimizer
Analyzes prompts for token efficiency, clarity, and structure. Generates optimized versions.
Input: Prompt text file or string
Output: Analysis report with optimization suggestions
Usage:
# Analyze a prompt file
python scripts/prompt_optimizer.py prompt.txt --analyze
# Output:
# Token count: 847
# Estimated cost: $0.0025 (GPT-4)
# Clarity score: 72/100
# Issues found:
# - Ambiguous instruction at line 3
# - Missing output format specification
# - Redundant context (lines 12-15 repeat lines 5-8)
# Suggestions:
# 1. Add explicit output format: "Respond in JSON with keys: ..."
# 2. Remove redundant context to save 89 tokens
# 3. Clarify "analyze" -> "list the top 3 issues with severity ratings"
# Generate optimized version
python scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt
# Count tokens for cost estimation
python scripts/prompt_optimizer.py prompt.txt --tokens --model gpt-4
# Extract and manage few-shot examples
python scripts/prompt_optimizer.py prompt.txt --extract-examples --output examples.json
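To make the "ambiguous instruction" finding more concrete, here is a minimal sketch of a vague-word scan. The word list and the name `flag_vague_lines` are assumptions chosen for illustration; they are not taken from prompt_optimizer.py, whose actual patterns may differ.

```python
import re

# Hypothetical word list; the real detector's patterns may differ.
VAGUE_WORDS = ("analyze", "some", "various", "etc")
VAGUE_PATTERN = re.compile(r"\b(" + "|".join(VAGUE_WORDS) + r")\b", re.IGNORECASE)


def flag_vague_lines(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, word) pairs for each vague word found."""
    hits = []
    for number, line in enumerate(prompt.splitlines(), start=1):
        for match in VAGUE_PATTERN.finditer(line):
            hits.append((number, match.group(1).lower()))
    return hits


if __name__ == "__main__":
    sample = "You are a reviewer.\nAnalyze the code and list some issues."
    for line_number, word in flag_vague_lines(sample):
        print(f"line {line_number}: vague word '{word}'")
```

A plain word match like this also fires on legitimate uses, so each hit is best treated as a prompt for review rather than a confirmed issue.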
2. RAG Evaluator
Evaluates Retrieval-Augmented Generation quality by measuring context relevance and answer faithfulness.
Input: Retrieved contexts (JSON) and questions/answers
Output: Evaluation metrics and quality report
Usage:
# Evaluate retrieval quality
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json
# Output:
# === RAG Evaluation Report ===
# Questions evaluated: 50
#
# Retrieval Metrics:
# Context Relevance: 0.78 (target: >0.80)
# Retrieval Precision@5: 0.72
# Coverage: 0.85
#
# Generation Metrics:
# Answer Faithfulness: 0.91
# Groundedness: 0.88
#
# Issues Found:
# - 8 questions had no relevant context in top-5
# - 3 answers contained information not in context
#
# Recommendations:
# 1. Improve chunking strategy for technical documents
# 2. Add metadata filtering for date-sensitive queries
# Evaluate with custom metrics
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json \
--metrics relevance,faithfulness,coverage
# Export detailed results
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json \
--output report.json --verbose
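For readers unfamiliar with the retrieval metrics in the report, the sketch below computes Precision@K and Mean Reciprocal Rank from per-context relevance judgments. It assumes the judgments are already available as booleans in ranked order; how rag_evaluator.py decides relevance is not reproduced here, and the function names are illustrative.

```python
def precision_at_k(relevant: list[bool], k: int = 5) -> float:
    """Fraction of the top-k retrieved contexts that are relevant."""
    if k <= 0:
        return 0.0
    return sum(relevant[:k]) / k


def reciprocal_rank(relevant: list[bool]) -> float:
    """1 / rank of the first relevant context, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[list[bool]]) -> float:
    """Average reciprocal rank over all evaluated questions."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(run) for run in runs) / len(runs)


if __name__ == "__main__":
    # One list per question: True where the ranked context was judged relevant.
    judgments = [[True, False, True, False, False], [False, True, False, False, False]]
    print("Precision@5:", [precision_at_k(run) for run in judgments])
    print("MRR:", mean_reciprocal_rank(judgments))
```

Precision@K here divides by K even when fewer than K contexts were retrieved, which penalizes short result lists; dividing by the number retrieved is a common alternative.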
3. Agent Orchestrator
Parses agent definitions and visualizes execution flows. Validates tool configurations.
Input: Agent configuration (YAML/JSON)
Output: Workflow visualization, validation report
Usage:
# Validate agent configuration
python scripts/agent_orchestrator.py agent.yaml --validate
# Output:
# === Agent Validation Report ===
# Agent: research_assistant
# Pattern: ReAct
#
# Tools (4 registered):
# [OK] web_search - API key configured
# [OK] calculator - No config needed
# [WARN] file_reader - Missing allowed_paths
# [OK] summarizer - Prompt template valid
#
# Flow Analysis:
# Max depth: 5 iterations
# Estimated tokens/run: 2,400-4,800
# Potential infinite loop: No
#
# Recommendations:
# 1. Add allowed_paths to file_reader for security
# 2. Consider adding early exit condition for simple queries
# Visualize agent workflow (ASCII)
python scripts/agent_orchestrator.py agent.yaml --visualize
# Output:
# ┌─────────────────────────────────────────┐
# │ research_assistant │
# │ (ReAct Pattern) │
# └─────────────────┬───────────────────────┘
# │
# ┌────────▼────────┐
# │ User Query │
# └────────┬────────┘
# │
# ┌────────▼────────┐
# │ Think │◄──────┐
# └────────┬────────┘ │
# │ │
# ┌────────▼────────┐ │
# │ Select Tool │ │
# └────────┬────────┘ │
# │ │
# ┌─────────────┼─────────────┐ │
# ▼ ▼ ▼ │
# [web_search] [calculator] [file_reader]
# │ │ │ │
# └─────────────┼─────────────┘ │
# │ │
# ┌────────▼────────┐ │
# │ Observe │───────┘
# └────────┬────────┘
# │
# ┌────────▼────────┐
# │ Final Answer │
# └─────────────────┘
# Export workflow as Mermaid diagram
python scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
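The Think, Select Tool, Observe cycle in the diagram above can be sketched as a bounded loop. This is an illustration of the ReAct pattern only, not code from agent_orchestrator.py: `run_react`, `scripted_policy`, and `add_numbers` are hypothetical names, and the policy function stands in for a real LLM call.

```python
from typing import Callable

# A tool takes a string argument and returns an observation string.
Tool = Callable[[str], str]
# The policy returns (action, argument); action is a tool name or "final_answer".
Policy = Callable[[str, list[str]], tuple[str, str]]


def run_react(query: str, tools: dict[str, Tool], think: Policy,
              max_iterations: int = 5) -> str:
    """Run the loop until the policy answers or the iteration cap is hit."""
    observations: list[str] = []
    for _ in range(max_iterations):
        action, argument = think(query, observations)   # Think + Select Tool
        if action == "final_answer":
            return argument                              # Final Answer
        if action not in tools:
            observations.append(f"unknown tool: {action}")
            continue
        observations.append(tools[action](argument))    # Observe
    return "stopped: iteration limit reached"


def add_numbers(expression: str) -> str:
    """Stub calculator: sums integers separated by '+'."""
    return str(sum(int(part) for part in expression.split("+")))


def scripted_policy(query: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: call the calculator once, then answer."""
    if not observations:
        return "calculator", "2+2"
    return "final_answer", observations[-1]


if __name__ == "__main__":
    print(run_react("What is 2+2?", {"calculator": add_numbers}, scripted_policy))
```

The `max_iterations` cap plays the role of the "Max depth" figure in the validation report: it guarantees the loop terminates even if the policy never produces a final answer.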
Prompt Engineering Workflows
Prompt Optimization Workflow
Use when improving an existing prompt's performance or reducing token costs.
Step 1: Baseline current prompt
python scripts/prompt_optimizer.py current_prompt.txt --analyze --json --output baseline.json
Step 2: Identify issues
Review the analysis report for:
- Token waste (redundant instructions, verbose examples)
- Ambiguous instructions (unclear output format, vague verbs)
- Missing constraints (no length limits, no format specification)
Step 3: Apply optimization patterns
| Issue | Pattern to Apply |
|-------|------------------|
| Ambiguous output | Add explicit format specification |
| Too verbose | Extract to few-shot examples |
| Inconsistent results | Add role/persona framing |
| Missing edge cases | Add constraint boundaries |
Step 4: Generate optimized version
python scripts/prompt_optimizer.py current_prompt.txt --optimize --output optimized.txt
Step 5: Compare results
python scripts/prompt_optimizer.py optimized.txt --analyze --compare baseline.json
# Shows: token reduction, clarity improvement, issues resolved
Step 6: Validate with test cases
Run both prompts against your evaluation set and compare outputs.
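As a rough illustration of what the comparison in Step 5 reports, the sketch below diffs two analysis dictionaries. It assumes both carry the `token_count`, `clarity_score`, and `issues` keys listed for the JSON output format, and that `issues` is a list; `compare_analyses` is a hypothetical helper, and the optimized figures in the demo are made up.

```python
def compare_analyses(baseline: dict, optimized: dict) -> dict:
    """Summarize token reduction, clarity change, and resolved issue count."""
    base_tokens = baseline["token_count"]
    new_tokens = optimized["token_count"]
    reduction = (base_tokens - new_tokens) / base_tokens * 100 if base_tokens else 0.0
    return {
        "token_reduction_pct": round(reduction, 1),
        "clarity_delta": optimized["clarity_score"] - baseline["clarity_score"],
        "issues_resolved": len(baseline["issues"]) - len(optimized["issues"]),
    }


if __name__ == "__main__":
    # Baseline mirrors the sample report; the optimized values are illustrative.
    baseline = {"token_count": 847, "clarity_score": 72,
                "issues": ["ambiguous instruction", "missing output format", "redundant context"]}
    optimized = {"token_count": 758, "clarity_score": 85,
                 "issues": ["ambiguous instruction"]}
    print(compare_analyses(baseline, optimized))
```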
Few-Shot Example Design Workflow
Use when creating examples for in-context learning.
Step 1: Define the task clearly
Task: Extract product entities from customer reviews
Input: Review text
Output: JSON with {product_name, sentiment, features_mentioned}
Step 2: Select diverse examples (3-5 recommended)
| Example Type | Purpose |
|--------------|---------|
| Simple case | Shows basic pattern |
| Edge case | Handles ambiguity |
| Complex case | Multiple entities |
| Negative case | What NOT to extract |
Step 3: Format consistently
Example 1:
Input: "Love my new iPhone 15, the camera is amazing!"
Output: {"product_name": "iPhone 15", "sentiment": "positive", "features_mentioned": ["camera"]}
Example 2:
Input: "The laptop was okay but battery life is terrible."
Output: {"product_name": "laptop", "sentiment": "mixed", "features_mentioned": ["battery life"]}
Step 4: Validate example quality
python scripts/prompt_optimizer.py prompt_with_examples.txt --validate-examples
# Checks: consistency, coverage, format alignment
Step 5: Test with held-out cases
Ensure the model generalizes beyond your examples.
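Keeping the examples as data and rendering them through one function is a simple way to enforce the consistent formatting from Step 3. The sketch below is one possible approach; `build_few_shot_prompt` and the `input`/`output` dictionary keys are assumptions, not part of the scripts in this skill.

```python
import json


def build_few_shot_prompt(task: str, examples: list[dict], new_input: str) -> str:
    """Render a task description, numbered examples, and the new input."""
    parts = [task, ""]
    for index, example in enumerate(examples, start=1):
        parts.append(f"Example {index}:")
        parts.append(f'Input: "{example["input"]}"')
        # json.dumps keeps every example output in the same serialized form.
        parts.append(f"Output: {json.dumps(example['output'])}")
        parts.append("")
    parts.append(f'Input: "{new_input}"')
    parts.append("Output:")
    return "\n".join(parts)


if __name__ == "__main__":
    examples = [
        {"input": "Love my new iPhone 15, the camera is amazing!",
         "output": {"product_name": "iPhone 15", "sentiment": "positive",
                    "features_mentioned": ["camera"]}},
        {"input": "The laptop was okay but battery life is terrible.",
         "output": {"product_name": "laptop", "sentiment": "mixed",
                    "features_mentioned": ["battery life"]}},
    ]
    print(build_few_shot_prompt(
        "Task: Extract product entities from customer reviews",
        examples,
        "These headphones broke after a week.",
    ))
```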
Structured Output Design Workflow
Use when you need reliable JSON/XML/structured responses.
Step 1: Define schema
{
"type": "object",
"properties": {
"summary": {"type": "string", "maxLength": 200},
"sentiment": {"enum": ["positive", "negative", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1}
},
"required": ["summary", "sentiment"]
}
Step 2: Include schema in prompt
Respond with JSON matching this schema:
- summary (string, max 200 chars): Brief summary of the content
- sentiment (enum): One of "positive", "negative", "neutral"
- confidence (number 0-1): Your confidence in the sentiment
Step 3: Add format enforcement
IMPORTANT: Respond ONLY with valid JSON. No markdown, no explanation.
Start your response with { and end with }
Step 4: Validate outputs
python scripts/prompt_optimizer.py structured_prompt.txt --validate-schema schema.json
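Model responses can also be checked against the schema at runtime. The sketch below is a hand-rolled check that covers only the keywords used in the Step 1 schema (`type`, `enum`, `maxLength`, `minimum`, `maximum`, `required`); it is not a general JSON Schema validator, and `validate_response` is a hypothetical helper. A full validator library such as `jsonschema` would be the more robust choice in production.

```python
import json

# Schema copied from Step 1 of the workflow above.
SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string", "maxLength": 200},
        "sentiment": {"enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["summary", "sentiment"],
}


def is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_response(raw: str, schema: dict = SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the response passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    errors = [f"missing required key: {key}"
              for key in schema.get("required", []) if key not in data]
    for key, rules in schema.get("properties", {}).items():
        if key not in data:
            continue
        value = data[key]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{key}: expected a string")
        elif rules.get("type") == "number" and not is_number(value):
            errors.append(f"{key}: expected a number")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{key}: not one of {rules['enum']}")
        elif isinstance(value, str) and len(value) > rules.get("maxLength", len(value)):
            errors.append(f"{key}: longer than {rules['maxLength']} characters")
        elif is_number(value) and not (
            rules.get("minimum", value) <= value <= rules.get("maximum", value)
        ):
            errors.append(f"{key}: outside the allowed range")
    return errors


if __name__ == "__main__":
    print(validate_response('{"summary": "Good phone", "sentiment": "positive"}'))
    print(validate_response('{"summary": "Good phone", "sentiment": "angry"}'))
```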
Reference Documentation
| File | Contains | Load when user asks about |
|------|----------|---------------------------|
| references/prompt_engineering_patterns.md | 10 prompt patterns with input/output examples | "which pattern?", "few-shot", "chain-of-thought", "role prompting" |
| references/llm_evaluation_frameworks.md | Evaluation metrics, scoring methods, A/B testing | "how to evaluate?", "measure quality", "compare prompts" |
| references/agentic_system_design.md | Agent architectures (ReAct, Plan-Execute, Tool Use) | "build agent", "tool calling", "multi-agent" |
Common Patterns Quick Reference
| Pattern | When to Use | Example |
|---------|-------------|---------|
| Zero-shot | Simple, well-defined tasks | "Classify this email as spam or not spam" |
| Few-shot | Complex tasks, consistent format needed | Provide 3-5 examples before the task |
| Chain-of-Thought | Reasoning, math, multi-step logic | "Think step by step..." |
| Role Prompting | Expertise needed, specific perspective | "You are an expert tax accountant..." |
| Structured Output | Need parseable JSON/XML | Include schema + format enforcement |
Common Commands
# Prompt Analysis
python scripts/prompt_optimizer.py prompt.txt --analyze # Full analysis
python scripts/prompt_optimizer.py prompt.txt --tokens # Token count only
python scripts/prompt_optimizer.py prompt.txt --optimize # Generate optimized version
# RAG Evaluation
python scripts/rag_evaluator.py --contexts ctx.json --questions q.json # Evaluate
python scripts/rag_evaluator.py --contexts ctx.json --questions q.json --compare baseline.json # Compare to baseline
# Agent Development
python scripts/agent_orchestrator.py agent.yaml --validate # Validate config
python scripts/agent_orchestrator.py agent.yaml --visualize # Show workflow
python scripts/agent_orchestrator.py agent.yaml --estimate-cost # Token estimation
Troubleshooting
| Problem | Cause | Solution |
|---------|-------|----------|
| Token count seems inaccurate | Character-based estimation varies by language and special characters | Use the --model flag matching your target model; the estimator assumes about 3.5 characters per token for Claude models and 4.0 for GPT models |
| Clarity score is low despite clear prompt | Vague-pattern detector flags common words like "analyze" or "some" even in valid contexts | Review flagged lines individually; not every match is a true issue --- focus on genuinely ambiguous instructions |
| Few-shot examples not detected | Examples do not follow the Input:/Output: or Example N: labeling convention | Format examples with explicit Input: and Output: prefixes so the extractor can parse them |
| RAG evaluator shows 0.0 for all metrics | Input JSON schema mismatch --- missing question, content, or question_id keys | Verify JSON uses the expected keys (question/query, content/text, question_id/query_id) |
| Agent YAML parsing fails | Built-in YAML parser is simplified and cannot handle advanced syntax (anchors, multi-line blocks) | Convert config to JSON, or restructure YAML to use only simple key-value pairs and dash-prefixed lists |
| Optimization produces minimal changes | --optimize only performs whitespace normalization, not semantic rewriting | Use --analyze first to get suggestions, then manually apply structural improvements before re-running --optimize |
| Mermaid diagram renders incorrectly | More than 6 tools overflow the generated subgraph | Reduce tool count in the config or manually edit the Mermaid output to split into sub-diagrams |
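The first row of the table is easier to reason about with the arithmetic written out. The sketch below applies the 3.5 and 4.0 characters-per-token ratios mentioned there; rounding up and the prefix-based model-family lookup are assumptions, and `estimate_tokens` is not the tool's actual function.

```python
import math

# Ratios from the troubleshooting table; real tokenizers vary with language and content.
CHARS_PER_TOKEN = {"gpt": 4.0, "claude": 3.5}


def estimate_tokens(text: str, model: str = "gpt-4") -> int:
    """Estimate token count from character length for a given model family."""
    family = "claude" if model.startswith("claude") else "gpt"
    return math.ceil(len(text) / CHARS_PER_TOKEN[family])


if __name__ == "__main__":
    prompt = "Summarize the following review in one sentence."
    for model in ("gpt-4", "claude-3-sonnet"):
        print(model, estimate_tokens(prompt, model))
```

Because the estimate depends only on character count, non-English text and prompts heavy in code or symbols can diverge noticeably from a real tokenizer's count.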
Success Criteria
- Prompt clarity score above 70/100 on all production prompts, measured via prompt_optimizer.py --analyze
- Token efficiency improved by at least 30% after applying optimization suggestions and removing redundant content
- RAG context relevance at or above 0.80 across evaluation sets, verified by rag_evaluator.py
- Answer faithfulness at or above 0.95 with zero unsupported claims in critical workflows
- Agent validation passes with zero errors for all deployed agent configurations
- Cost per agent run within budget, with estimated monthly spend confirmed via agent_orchestrator.py --estimate-cost
- Few-shot example coverage includes edge cases: at least 1 simple, 1 complex, and 1 negative example per prompt template
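The numeric thresholds above lend themselves to an automated gate, for example in CI. The sketch below is one hypothetical way to encode them; `check_success` and its argument names are assumptions, and the budget and few-shot coverage criteria are left out because they depend on project-specific inputs.

```python
def check_success(clarity_score: float, context_relevance: float,
                  faithfulness: float, validation_errors: int) -> dict[str, bool]:
    """Apply the numeric thresholds from the success criteria list."""
    return {
        "clarity": clarity_score > 70,                 # above 70/100
        "context_relevance": context_relevance >= 0.80,
        "faithfulness": faithfulness >= 0.95,
        "agent_validation": validation_errors == 0,
    }


if __name__ == "__main__":
    # Values taken from the sample reports earlier in this document.
    results = check_success(clarity_score=72, context_relevance=0.78,
                            faithfulness=0.91, validation_errors=0)
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```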
Scope & Limitations
This skill covers:
- Static prompt analysis: token counting, clarity scoring, structure detection, and optimization suggestions
- RAG evaluation: context relevance, answer faithfulness, groundedness (ROUGE-L), and retrieval metrics (Precision@K, MRR, NDCG)
- Agent workflow design: configuration validation, ASCII/Mermaid visualization, and token cost estimation
- Few-shot example extraction and management from existing prompts
This skill does NOT cover:
- Live LLM calls or runtime prompt testing: all analysis is static and deterministic (see senior-ml-engineer for LLM integration)
- Vector database setup or embedding generation: the RAG evaluator scores pre-retrieved contexts only (see senior-data-engineer for pipeline orchestration)
- Fine-tuning, RLHF, or model training workflows (see senior-ml-engineer for model deployment)
- Production monitoring, A/B test execution, or real-time drift detection (see senior-data-scientist for experiment design)
Integration Points
| Skill | Integration | Data Flow |
|-------|-------------|-----------|
| senior-ml-engineer | LLM integration and model deployment | Optimized prompts from this skill feed into llm_integration_builder.py prompt templates |
| senior-data-scientist | A/B test design for prompt experiments | experiment_designer.py defines test parameters; this skill provides the prompt variants to compare |
| senior-data-engineer | RAG pipeline orchestration | pipeline_orchestrator.py builds the retrieval pipeline; this skill evaluates its output quality |
| senior-fullstack | End-to-end application scaffolding | Fullstack apps consume agent configs validated by agent_orchestrator.py |
| senior-security | Prompt injection and adversarial input review | Security analysis covers the attack surface; this skill ensures prompts include defensive constraints |
| senior-qa | Quality assurance for AI-powered features | QA test suites validate that optimized prompts produce consistent outputs in production |
Tool Reference
prompt_optimizer.py
Purpose: Static analysis tool for prompt engineering. Estimates token counts, scores clarity and structure, detects ambiguous instructions and redundant content, extracts few-shot examples, and generates optimized prompt versions.
Usage:
python scripts/prompt_optimizer.py <prompt_file> [options]
Parameters:
| Flag | Short | Type | Default | Description |
|------|-------|------|---------|-------------|
| prompt | (positional) | string | (required) | Path to the prompt text file to analyze |
| --analyze | -a | flag | off | Run full analysis (clarity, structure, issues, suggestions) |
| --tokens | -t | flag | off | Count tokens and estimate cost only |
| --optimize | -O | flag | off | Generate whitespace-optimized version of the prompt |
| --extract-examples | -e | flag | off | Extract few-shot examples (Input/Output pairs) as JSON |
| --model | -m | choice | gpt-4 | Model for token/cost estimation. Choices: gpt-4, gpt-4-turbo, gpt-3.5-turbo, claude-3-opus, claude-3-sonnet, claude-3-haiku |
| --output | -o | string | (none) | Write results to this file path |
| --json | -j | flag | off | Output analysis as JSON instead of human-readable report |
| --compare | -c | string | (none) | Path to a baseline analysis JSON file for comparison |
Example:
python scripts/prompt_optimizer.py prompt.txt --analyze --model claude-3-sonnet --json
Output Formats:
- Default (text): Human-readable report with metrics, scores, detected sections, issues, and suggestions
- JSON (--json): Structured PromptAnalysis object with keys: token_count, estimated_cost, model, clarity_score, structure_score, issues, suggestions, sections, has_examples, example_count, has_output_format, word_count, line_count
- Token-only (--tokens): Single-line token count and cost estimate
- Examples (--extract-examples): JSON array of {input_text, output_text, index} objects
- Optimized (--optimize): Cleaned prompt text with normalized whitespace
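To show why the Input:/Output: labeling convention matters for example extraction, here is a minimal regex-based sketch that produces records with the `input_text`, `output_text`, and `index` keys named above. It handles single-line pairs only; the pattern, the 1-based index, and the name `extract_examples` are assumptions rather than the tool's actual implementation.

```python
import json
import re

# Matches an "Input:" line immediately followed by an "Output:" line.
PAIR_PATTERN = re.compile(
    r"^Input:[ \t]*(.+?)[ \t]*\n[ \t]*Output:[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_examples(prompt: str) -> list[dict]:
    """Collect labeled Input/Output pairs in the order they appear."""
    return [
        {"input_text": match.group(1), "output_text": match.group(2), "index": index}
        for index, match in enumerate(PAIR_PATTERN.finditer(prompt), start=1)
    ]


if __name__ == "__main__":
    prompt = (
        "Example 1:\n"
        'Input: "Love my new iPhone 15, the camera is amazing!"\n'
        'Output: {"product_name": "iPhone 15", "sentiment": "positive"}\n'
        "Example 2:\n"
        'Input: "The laptop was okay but battery life is terrible."\n'
        'Output: {"product_name": "laptop", "sentiment": "mixed"}\n'
    )
    print(json.dumps(extract_examples(prompt), indent=2))
```

Examples written as free prose, or with labels such as "Q:" and "A:", would not match this pattern, which mirrors the "Few-shot examples not detected" entry in Troubleshooting.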
rag_evaluator.py
Purpose: Evaluates Retrieval-Augmented Generation quality by measuring context relevance (lexical overlap, term coverage), answer faithfulness (claim-level verification), groundedness (ROUGE-L), and retrieval metrics (Precision@K, MRR, NDCG).
Usage:
python scripts/rag_evaluator.py --contexts <contexts.json> --questions <questions.json> [options]
Parameters:
| Flag | Short | Type | Default | Description |
|------|-------|------|---------|-------------|
| --contexts | -c | string | (required) | Path to JSON file with retrieved contexts. Expected keys per object: question_id/query_id, content/text |
| --questions | -q | string | (required) | Path to JSON file with questions and answers. Expected keys per object: id, question/query, answer/response, expected/ground_truth |
| --k | | int | 5 | Number of top contexts to evaluate per question |
| --output | -o | string | (none) | Write detailed report to this JSON file |
| --json | -j | flag | off | Output as JSON instead of human-readable text |
| --verbose | -v | flag | off | Include per-question detail breakdowns in the report |
| --compare | | string | (none) | Path to a baseline report JSON for metric comparison |
Example:
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json --k 10 --verbose --output report.json
Output Formats:
- Default (text): Human-readable report with summary, retrieval metrics (context relevance, Precision@K), generation metrics (faithfulness, groundedness), issues, and recommendations
- JSON (--json): Structured RAGEvaluationReport object with keys: total_questions, avg_context_relevance, avg_faithfulness, avg_groundedness, retrieval_metrics, coverage, issues, recommendations, question_details
- Verbose (--verbose): Adds per-question question_details array containing individual context scores and faithfulness breakdowns
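Since a key mismatch in the input files is the usual cause of all-zero metrics, a pre-flight check can save a run. The sketch below maps the key aliases listed in the Parameters table onto one canonical name and fails loudly when neither spelling is present. The helper name `normalize_record` and the choice of canonical names are assumptions for illustration.

```python
# Accepted spellings per field, taken from the Parameters table above.
CONTEXT_ALIASES = {
    "question_id": ("question_id", "query_id"),
    "content": ("content", "text"),
}
QUESTION_ALIASES = {
    "question": ("question", "query"),
    "answer": ("answer", "response"),
}


def normalize_record(record: dict, aliases: dict[str, tuple[str, ...]]) -> dict:
    """Return a copy keyed by canonical names; raise KeyError if a field is absent."""
    normalized = {}
    for canonical, names in aliases.items():
        for name in names:
            if name in record:
                normalized[canonical] = record[name]
                break
        else:
            raise KeyError(f"record is missing one of: {', '.join(names)}")
    return normalized


if __name__ == "__main__":
    print(normalize_record({"query_id": "q1", "text": "Paris is the capital of France."},
                           CONTEXT_ALIASES))
    print(normalize_record({"query": "What is the capital of France?", "response": "Paris"},
                           QUESTION_ALIASES))
```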
agent_orchestrator.py
Purpose: Parses agent configurations (YAML or JSON), validates tool registrations and flow correctness, generates ASCII or Mermaid workflow diagrams, and estimates token costs per run and monthly spend.
Usage:
python scripts/agent_orchestrator.py <config_file> [options]
Parameters:
| Flag | Short | Type | Default | Description |
|------|-------|------|---------|-------------|
| config | (positional) | string | (required) | Path to agent configuration file (YAML or JSON) |
| --validate | -V | flag | off | Validate agent configuration (errors, warnings, tool status). Runs by default if no other action is specified |
| --visualize | -v | flag | off | Generate workflow diagram |
| --format | -f | choice | ascii | Visualization format. Choices: ascii, mermaid |
| --estimate-cost | -e | flag | off | Estimate token usage and costs |
| --runs | -r | int | 100 | Daily run count for monthly cost projection |
| --output | -o | string | (none) | Write output to this file path |
| --json | -j | flag | off | Output validation and cost results as JSON |
Example:
python scripts/agent_orchestrator.py agent.yaml --validate --visualize --format mermaid --output workflow.md
Output Formats:
- Validation (text): Agent info, tool status with OK/WARN indicators, flow analysis (max iterations, token estimate, loop detection), errors, and warnings
- Validation (JSON, --json): Structured ValidationResult object with keys: is_valid, errors, warnings, tool_status, estimated_tokens_per_run, potential_infinite_loop, max_depth
- Visualization (--visualize): ASCII box-drawing diagram (default) or Mermaid flowchart (--format mermaid) showing the agent pattern flow and registered tools
- Cost estimation (--estimate-cost): Token range per run, cost range per run, and projected monthly cost at the specified daily run rate
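The monthly projection is straightforward arithmetic, sketched below under stated assumptions: a single blended price per 1,000 tokens (the 0.01 figure in the demo is purely illustrative, not a real rate), a 30-day month, and the default of 100 runs per day. `project_monthly_cost` is a hypothetical helper, not the tool's code.

```python
def project_monthly_cost(tokens_per_run: tuple[int, int],
                         price_per_1k_tokens: float,
                         runs_per_day: int = 100,
                         days: int = 30) -> tuple[float, float]:
    """Return the (low, high) projected monthly cost for a token range per run."""
    def monthly(tokens: int) -> float:
        return round(tokens / 1000 * price_per_1k_tokens * runs_per_day * days, 2)

    low, high = tokens_per_run
    return monthly(low), monthly(high)


if __name__ == "__main__":
    # Token range from the sample validation report; the price is a placeholder.
    low, high = project_monthly_cost((2400, 4800), price_per_1k_tokens=0.01)
    print(f"Projected monthly cost: ${low:.2f} to ${high:.2f}")
```

Real pricing usually differs between input and output tokens, so a closer estimate would split the token range accordingly.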