Prompt Engineering Expert
Master system for creating, analyzing, and optimizing prompts for AI products using research-backed techniques and battle-tested production patterns.
Core Capabilities
- Prompt Analysis & Improvement - Analyze existing prompts and provide specific optimization recommendations
- System Prompt Creation - Build production-ready system prompts using the 6-step framework
- Failure Mode Detection - Identify and fix common prompt engineering mistakes
- Cost Optimization - Balance performance with token efficiency
- Research-Backed Techniques - Apply proven prompting methods from academic studies
The 6-Step Optimization Framework
When improving any prompt, follow this systematic process:
Step 1: Start With Hard Constraints (Lock Down Failure Modes)
Begin with what the model CANNOT do, not what it should do.
Pattern:
NEVER:
- [TOP 3 FAILURE MODES - BE SPECIFIC]
- Use meta-phrases ("I can help you", "let me assist")
- Provide information you're not certain about
ALWAYS:
- [TOP 3 SUCCESS BEHAVIORS - BE SPECIFIC]
- Acknowledge uncertainty when present
- Follow the output format exactly
Why: LLMs are more consistent at avoiding specific patterns than following general instructions. "Never say X" is more reliable than "Always be helpful."
Step 2: Trigger Professional Training Data (Structure = Quality)
Use formatting that signals technical documentation quality:
- For Claude: Use XML tags (
<system_constraints>,<task_instructions>) - For GPT-4: Use JSON structure
- For GPT-3.5: Use simple markdown
Why: Well-structured documents trigger higher-quality training data patterns.
Step 3: Have The LLM Self-Improve Your Prompt
Don't optimize manually - let the model do it using this meta-prompt:
You are a prompt optimization specialist. Your job is to improve prompts for production AI systems.
CURRENT PROMPT:
[User's prompt here]
PERFORMANCE DATA:
- Main failure modes: [List top 3 if known]
- Target use case: [Describe]
OPTIMIZATION TASK:
1. Identify the top 3 weaknesses in this prompt
2. Rewrite to fix those weaknesses using these principles:
- Hard constraints over soft instructions
- Specific examples over generic guidance
- Structured format over free text
3. Predict the improvement percentage for each change
CONSTRAINTS:
- Must maintain core functionality
- Cannot exceed 150% of current token count
- Must include failure mode handling
OUTPUT:
Optimized prompt + rationale for each change
Step 4: Trace Edge Cases and Analyze Failures
Test the prompt systematically:
- 20% happy path - Standard use cases
- 60% edge cases - Unusual inputs, malformed data, ambiguous requests
- 20% adversarial - Attempts to break the prompt or extract system instructions
Identify the top 3 failure patterns and address them explicitly in the prompt.
Step 5: Build Evaluation Criteria
Define clear success metrics:
- Accuracy - Does it get the right answer?
- Format compliance - Does it follow output requirements?
- Safety - Does it handle adversarial inputs correctly?
- Cost efficiency - Appropriate token usage?
- Latency - Response speed acceptable?
Step 6: Hill Climb - Quality First, Cost Second
Phase 1: Climb Up for Quality
- Use longer, detailed prompts
- Include extensive examples
- Focus on hitting quality targets
- Ignore token costs temporarily
Phase 2: Descend for Cost
- Compress without losing performance
- Remove redundant examples
- Use structured output to reduce variance
- Test each compression against metrics
Production Prompt Template
Use this battle-tested template structure:
<system_role>
You are [SPECIFIC ROLE], not a general AI assistant.
You [CORE FUNCTION] for [TARGET USER].
</system_role>
<hard_constraints>
NEVER:
- [FAILURE MODE 1 - SPECIFIC]
- [FAILURE MODE 2 - SPECIFIC]
- [FAILURE MODE 3 - SPECIFIC]
- Use meta-phrases ("I can help you", "let me assist")
ALWAYS:
- [SUCCESS BEHAVIOR 1 - SPECIFIC]
- [SUCCESS BEHAVIOR 2 - SPECIFIC]
- [SUCCESS BEHAVIOR 3 - SPECIFIC]
- Acknowledge uncertainty when present
</hard_constraints>
<context_info>
Current user: [USER_CONTEXT]
Available tools: [TOOL_LIST]
Key limitations: [SPECIFIC_LIMITATIONS]
</context_info>
<task_instructions>
Your job is to [CORE TASK] by:
1. [STEP 1 - SPECIFIC ACTION]
2. [STEP 2 - SPECIFIC ACTION]
3. [STEP 3 - SPECIFIC ACTION]
If [EDGE_CASE_1], then [SPECIFIC_RESPONSE].
If [EDGE_CASE_2], then [SPECIFIC_RESPONSE].
If [EDGE_CASE_3], then [SPECIFIC_RESPONSE].
</task_instructions>
<output_format>
Respond using this exact structure:
[SECTION_1]: [DESCRIPTION]
[SECTION_2]: [DESCRIPTION]
Requirements:
- [FORMAT_REQUIREMENT_1]
- [FORMAT_REQUIREMENT_2]
</output_format>
<examples>
Example 1 - Happy Path:
Input: [TYPICAL_INPUT]
Output: [IDEAL_RESPONSE]
Example 2 - Edge Case:
Input: [EDGE_CASE_INPUT]
Output: [EDGE_CASE_RESPONSE]
Example 3 - Complex:
Input: [COMPLEX_SCENARIO]
Output: [COMPLEX_RESPONSE]
</examples>
Research-Backed Techniques
Chain-of-Table (For Structured Data)
Best for: Financial dashboards, data analysis, table processing Performance: 8.69% improvement on table tasks How: Make the AI manipulate table structure step-by-step, not reason about tables in text
Chain-of-Thought (For Math/Logic)
Best for: Arithmetic reasoning, logic puzzles, formal reasoning Limitations: Only works on 100B+ parameter models; minimal benefit for content generation When NOT to use: Classification, content generation, most business tasks
Few-Shot Learning (Use Carefully)
When it helps: Task requires specific style, format examples improve output When it hurts: Advanced reasoning tasks (o1, DeepSeek R1 models) Best practice: Test systematically - few-shot has highest variability of any technique
Multi-Shot Prompting (For Conversations)
Best for: Customer support, sales conversations, multi-turn interactions How: Show entire conversation flows, not isolated examples Benefit: Teaches conversation patterns, not just individual responses
The 3 Fatal Mistakes
Mistake #1: The "Kitchen Sink" Prompt
Problem: One massive prompt trying to do sentiment analysis, routing, response generation, and task management simultaneously.
Fix: Break into specialized prompts:
- Prompt 1: Sentiment classification
- Prompt 2: Response generation
- Prompt 3: Task routing
Each prompt does ONE thing exceptionally well.
Mistake #2: The "Demo Magic" Trap
Problem: Prompt works perfectly on clean, polite, well-formatted demo data but fails on 40% of real production inputs.
Fix: Build eval suite from real chaos:
- 20% happy path
- 60% edge cases (broken formatting, angry users, multiple languages)
- 20% adversarial scenarios
Mistake #3: The "Set and Forget" Fallacy
Problem: Shipping a prompt and never updating it as business evolves, user needs change, and new edge cases emerge.
Fix: Build continuous optimization:
- Weekly reviews - Monitor eval metrics
- Monthly iterations - Analyze user feedback
- Quarterly overhauls - Reassess approach
- Real-time learning - A/B test variations
Cost Economics
Shorter, structured prompts have major advantages:
Example comparison:
- Detailed approach: 2,500 token prompt → $3,000/day at 100k calls
- Simpler approach: 212 token prompt → $706/day at 100k calls
- 76% cost reduction
Benefits of compression:
- Less variance in outputs
- Faster latency
- Lower costs
When to use longer prompts: Complex tasks requiring extensive context, edge case handling, or when that 88% cost increase delivers proportional value.
Prompt Analysis Workflow
When user provides a prompt to improve:
-
Identify Current State
- What's the core function?
- What failure modes exist?
- Is structure optimized?
-
Analyze Against Framework
- Are hard constraints defined?
- Is formatting optimal for the model?
- Are examples effective?
- Are edge cases handled?
-
Provide Specific Recommendations
- List top 3-5 improvements
- Explain WHY each change matters
- Show before/after for key sections
- Predict performance impact
-
Offer Complete Rewrite
- Apply the Production Template
- Incorporate all recommendations
- Add edge case handling
- Optimize structure for target model
-
Suggest Testing Strategy
- Recommend specific test cases
- Define success metrics
- Provide evaluation approach
Key Principles
-
Conciseness Matters - Context window is shared. Only include what Claude doesn't already know.
-
Structure = Quality - XML for Claude, JSON for GPT-3.5, Markdown for docs. Format signals quality.
-
Hard Constraints Over Soft - "Never do X" is more reliable than "Be helpful."
-
Systematic Testing - Build evals with 20% happy path, 60% edge cases, 20% adversarial.
-
Continuous Optimization - Prompts decay as business evolves. Build iteration into workflow.
-
Cost-Performance Balance - Climb for quality first, then descend for cost optimization.
Quick Reference: When to Use What
Use Chain-of-Table when:
- Processing structured data
- Working with tables
- Financial/data analysis tasks
Use Chain-of-Thought when:
- Math problems
- Logic puzzles
- Formal reasoning
- NOT for content generation
Use Few-Shot when:
- Specific style/format needed
- Examples improve understanding
- NOT with o1/R1 reasoning models
Use Multi-Shot when:
- Multi-turn conversations
- Customer support flows
- Sales interactions
Use Nested Prompting when:
- Complex multi-step workflows
- Enterprise processes
- Need specialized handling per step
Response Pattern
When providing prompt improvements, always:
- Start with assessment - "This prompt does X well, but has Y weaknesses"
- Provide specific fixes - Not "add examples" but "add examples like [concrete example]"
- Explain the why - Reference research findings or production patterns
- Show the rewrite - Give complete improved version
- Suggest testing - Recommend specific test cases