Performance Optimization Skill

Performance Optimization

Layer 2: Design Choices

Core Question

What's the bottleneck, and is optimization worth it?

Before optimizing:

- Have you measured? (Don't guess)
- What's the acceptable performance?
- Will optimization add complexity?

Performance Decision → Implementation

| Goal | Design Choice | Implementation |
|------|---------------|----------------|
| Reduce allocations | Pre-allocate, reuse | with_capacity, object pools |
| Improve cache | Contiguous data | Vec, SmallVec |
| Parallelize | Data parallelism | rayon, threads |
| Avoid copies | Zero-copy | References, Cow<T> |
| Reduce indirection | Inline data | smallvec, arrays |
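
The "Avoid copies" row can be illustrated with `Cow`. This is a minimal sketch using only the standard library; `normalize` is a hypothetical helper invented for the example, and the space-to-underscore rule is an arbitrary stand-in for whatever transformation a real hot path performs.

```rust
use std::borrow::Cow;

/// Replaces spaces with underscores, allocating only when a change is needed.
/// `normalize` is a hypothetical helper used for illustration.
fn normalize(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        // Slow path: a new String is allocated for the modified text.
        Cow::Owned(input.replace(' ', "_"))
    } else {
        // Fast path: hand back a borrow of the caller's data, no copy.
        Cow::Borrowed(input)
    }
}

fn main() {
    let unchanged = normalize("no_spaces");
    let changed = normalize("has spaces");
    assert!(matches!(unchanged, Cow::Borrowed(_)));
    assert!(matches!(changed, Cow::Owned(_)));
    println!("{unchanged} {changed}");
}
```

Callers can treat the result as a `&str` in both cases, so the allocation is paid only for inputs that actually need rewriting.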

Thinking Prompt

Before optimizing:

1. Have you measured?
   - Profile first → flamegraph, perf
   - Benchmark → criterion, cargo bench
   - Identify actual hotspots
2. What's the priority?
   - Algorithm (10x-1000x improvement)
   - Data structure (2x-10x)
   - Allocation (2x-5x)
   - Cache (1.5x-3x)
3. What's the trade-off?
   - Complexity vs speed
   - Memory vs CPU
   - Latency vs throughput
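
For a first rough measurement before reaching for criterion, a std-only timing loop can look like the sketch below. `time_it`, `sum_loop`, and `sum_formula` are hypothetical names made up for this example. The numbers it prints are wall-clock totals with no statistical treatment, so treat them as a sanity check rather than a benchmark, and build with `--release`.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the total elapsed wall-clock time.
/// `time_it` is a hypothetical helper, not part of any benchmarking crate.
fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box discourages the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed()
}

/// Candidate A: add the numbers one by one.
fn sum_loop(n: u64) -> u64 {
    (1..=n).fold(0, |acc, x| acc + x)
}

/// Candidate B: closed-form formula for the same sum.
fn sum_formula(n: u64) -> u64 {
    n * (n + 1) / 2
}

fn main() {
    let n = 100_000;
    // Check that both candidates agree before comparing their speed.
    assert_eq!(sum_loop(n), sum_formula(n));
    let a = time_it(100, || sum_loop(black_box(n)));
    let b = time_it(100, || sum_formula(black_box(n)));
    println!("loop: {a:?}, formula: {b:?}");
}
```

The compiler may rewrite the loop into the closed form on its own, which is one reason a measured result can differ from intuition.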

Trace Up ↑

To domain constraints (Layer 3):

"How fast does this need to be?"
    ↑ Ask: What's the performance SLA?
    ↑ Check: domain-* (latency requirements)
    ↑ Check: Business requirements (acceptable response time)

| Question | Trace To | Ask |
|----------|----------|-----|
| Latency requirements | domain-* | What's acceptable response time? |
| Throughput needs | domain-* | How many requests per second? |
| Memory constraints | domain-* | What's the memory budget? |

Trace Down ↓

To implementation (Layer 1):

"Need to reduce allocations"
    ↓ m01-ownership: Use references, avoid clone
    ↓ m02-resource: Pre-allocate with_capacity
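
A minimal sketch of both steps, assuming the element count is known up front. `squares` and `total_len` are hypothetical functions written for this example.

```rust
/// Builds the squares of 0..n with a single up-front allocation.
fn squares(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n as u64 {
        // Never reallocates: capacity already covers all n pushes.
        out.push(i * i);
    }
    out
}

/// Sums string lengths by borrowing each element instead of cloning it.
fn total_len(words: &[String]) -> usize {
    words.iter().map(|w| w.len()).sum()
}

fn main() {
    let v = squares(1000);
    assert_eq!(v.len(), 1000);
    assert!(v.capacity() >= 1000);

    let words = vec![String::from("alpha"), String::from("beta")];
    // `words` is only borrowed, so it is still usable afterwards.
    println!("{} bytes in {} words", total_len(&words), words.len());
}
```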

"Need to parallelize"
    ↓ m07-concurrency: Choose rayon or threads
    ↓ m07-concurrency: Consider async for I/O-bound
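
The "threads" option can be sketched with scoped threads from the standard library. `parallel_sum` is a hypothetical function for illustration. With rayon, the same reduction would typically be written with a parallel iterator instead of manual chunking.

```rust
use std::thread;

/// Sums `data` by splitting it into up to `workers` contiguous chunks,
/// with one scoped thread per chunk.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division so every element lands in some chunk;
    // `.max(1)` avoids a zero chunk size for empty input.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        // Combine the per-chunk partial sums.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    assert_eq!(parallel_sum(&data, 4), 500_500);
    println!("sum = {}", parallel_sum(&data, 4));
}
```

Scoped threads can borrow `data` directly, so no `Arc` or cloning is needed. For small inputs, thread start-up cost will likely outweigh any gain.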

"Need cache efficiency"
    ↓ Data layout: Prefer Vec over HashMap when possible
    ↓ Access patterns: Sequential over random access
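
To make the access-pattern point concrete, the sketch below stores a grid in one contiguous `Vec` and sums it in two orders. `Grid` and its methods are hypothetical. Both traversals return the same value. On large grids the strided one is usually slower, but that should be measured rather than assumed.

```rust
/// A `rows x cols` grid stored row-major in one contiguous Vec.
struct Grid {
    rows: usize,
    cols: usize,
    cells: Vec<u32>,
}

impl Grid {
    /// Fills the grid with 0, 1, 2, ... in row-major order.
    fn new(rows: usize, cols: usize) -> Self {
        let cells = (0..(rows * cols) as u32).collect();
        Grid { rows, cols, cells }
    }

    /// Sequential access: walks memory in storage order.
    fn sum_row_major(&self) -> u64 {
        self.cells.iter().map(|&c| u64::from(c)).sum()
    }

    /// Strided access: jumps `cols` elements between consecutive reads.
    fn sum_col_major(&self) -> u64 {
        let mut total = 0u64;
        for c in 0..self.cols {
            for r in 0..self.rows {
                total += u64::from(self.cells[r * self.cols + c]);
            }
        }
        total
    }
}

fn main() {
    let grid = Grid::new(512, 512);
    assert_eq!(grid.sum_row_major(), grid.sum_col_major());
    println!("sum = {}", grid.sum_row_major());
}
```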

Quick Reference

| Tool | Purpose |
|------|---------|
| cargo bench | Micro-benchmarks |
| criterion | Statistical benchmarks |
| perf / flamegraph | CPU profiling |
| heaptrack | Allocation tracking |
| valgrind / cachegrind | Cache analysis |

Optimization Priority

1. Algorithm choice     (10x - 1000x)
2. Data structure       (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization   (1.5x - 3x)
5. SIMD/Parallelism     (2x - 8x)
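
Algorithm choice sits at the top because it changes the complexity class rather than a constant factor. As an illustrative sketch, here are two hypothetical duplicate checks: one compares every pair, the other makes a single pass with a `HashSet`.

```rust
use std::collections::HashSet;

/// O(n^2): compares each element against everything after it.
fn has_duplicate_quadratic(items: &[u32]) -> bool {
    for (i, a) in items.iter().enumerate() {
        if items[i + 1..].contains(a) {
            return true;
        }
    }
    false
}

/// O(n) expected: one pass, remembering what has been seen.
fn has_duplicate_hashed(items: &[u32]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    // `insert` returns false when the value was already present.
    items.iter().any(|x| !seen.insert(*x))
}

fn main() {
    let unique: Vec<u32> = (0..1000).collect();
    let mut repeated = unique.clone();
    repeated.push(42);
    assert!(!has_duplicate_quadratic(&unique) && !has_duplicate_hashed(&unique));
    assert!(has_duplicate_quadratic(&repeated) && has_duplicate_hashed(&repeated));
    println!("both checks agree");
}
```

For very small slices the quadratic version may still win because it allocates nothing, which is why measuring comes first.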

Common Techniques

| Technique | When | How |
|-----------|------|-----|
| Pre-allocation | Known size | Vec::with_capacity(n) |
| Avoid cloning | Hot paths | Use references or Cow<T> |
| Batch operations | Many small ops | Collect then process |
| SmallVec | Usually small | smallvec::SmallVec<[T; N]> |
| Inline buffers | Fixed-size data | Arrays over Vec |
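
The "Inline buffers" row might look like this in practice: when the size is fixed at compile time, an array avoids the heap entirely. `byte_histogram` is a hypothetical function chosen because a byte has exactly 256 possible values.

```rust
/// Counts byte frequencies in a fixed-size array instead of a Vec or HashMap.
/// The array lives inline (on the stack here), so no heap allocation occurs.
fn byte_histogram(data: &[u8]) -> [usize; 256] {
    let mut counts = [0usize; 256];
    for &b in data {
        // A u8 is always a valid index into a 256-element array.
        counts[b as usize] += 1;
    }
    counts
}

fn main() {
    let counts = byte_histogram(b"hello world");
    assert_eq!(counts[b'l' as usize], 3);
    assert_eq!(counts[b'o' as usize], 2);
    println!("'l' appears {} times", counts[b'l' as usize]);
}
```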

Common Mistakes

| Mistake | Why Wrong | Better |
|---------|-----------|--------|
| Optimize without profiling | Wrong target | Profile first |
| Benchmark in debug mode | Meaningless | Always --release |
| Use LinkedList | Cache unfriendly | Vec or VecDeque |
| Hidden .clone() | Unnecessary allocs | Use references |
| Premature optimization | Wasted effort | Make it work first |
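
For the LinkedList row, `VecDeque` covers the usual queue use case with contiguous storage. The sketch below computes sliding-window sums. `window_sums` is a hypothetical function and assumes `k > 0`.

```rust
use std::collections::VecDeque;

/// Returns the sum of every window of `k` consecutive values.
/// Assumes `k > 0` (an illustrative precondition for this sketch).
fn window_sums(values: &[i64], k: usize) -> Vec<i64> {
    // Room for k items plus the one pushed just before eviction.
    let mut window: VecDeque<i64> = VecDeque::with_capacity(k + 1);
    let mut sum = 0;
    let mut out = Vec::new();
    for &v in values {
        window.push_back(v);
        sum += v;
        if window.len() > k {
            // Evict the oldest value; O(1) on a ring buffer.
            if let Some(old) = window.pop_front() {
                sum -= old;
            }
        }
        if window.len() == k {
            out.push(sum);
        }
    }
    out
}

fn main() {
    let sums = window_sums(&[1, 2, 3, 4], 2);
    assert_eq!(sums, vec![3, 5, 7]);
    println!("{sums:?}");
}
```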

Anti-Patterns

| Anti-Pattern | Why Bad | Better |
|--------------|---------|--------|
| Clone to avoid lifetimes | Performance cost | Proper ownership |
| Box everything | Indirection cost | Stack when possible |
| HashMap for small sets | Overhead | Vec with linear search |
| String concat in loop | Repeated reallocation; O(n^2) if each pass builds a new String | String::with_capacity + push_str |
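
A minimal sketch of the string-building fix: size the buffer once, then append in place. `join_csv` is a hypothetical function. For plain joining, the standard library's slice `join` already does the equivalent, so this mainly shows the pattern for cases with custom per-item formatting.

```rust
/// Joins `parts` with commas into one pre-sized String.
fn join_csv(parts: &[&str]) -> String {
    // Total bytes: every part plus one comma between neighbours.
    let len = parts.iter().map(|p| p.len()).sum::<usize>()
        + parts.len().saturating_sub(1);
    let mut out = String::with_capacity(len);
    for (i, part) in parts.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        // Appends into the existing buffer; no new String per iteration.
        out.push_str(part);
    }
    out
}

fn main() {
    let line = join_csv(&["id", "name", "score"]);
    assert_eq!(line, "id,name,score");
    println!("{line}");
}
```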

Related Skills

| When | See |
|------|-----|
| Reducing clones | m01-ownership |
| Concurrency options | m07-concurrency |
| Smart pointer choice | m02-resource |
| Domain requirements | domain-* |
