Agent Skills: ln-811-performance-profiler

Profiles runtime performance with CPU, memory, and I/O metrics. Use when measuring bottlenecks before optimization.

ID: levnikolaevich/claude-code-skills/ln-811-performance-profiler

Install this agent skill locally:

pnpm dlx add-skill https://github.com/levnikolaevich/claude-code-skills/tree/HEAD/skills-catalog/ln-811-performance-profiler

skills-catalog/ln-811-performance-profiler/SKILL.md

Paths: File paths (shared/, references/, ../ln-*) are relative to skills repo root. If not found at CWD, locate this SKILL.md directory and go up one level for repo root. If shared/ is missing, fetch files via WebFetch from https://raw.githubusercontent.com/levnikolaevich/claude-code-skills/master/skills/{path}.

ln-811-performance-profiler

Type: L3 Worker
Category: 8XX Optimization

Runtime profiler that executes the optimization target, measures multiple metrics (CPU, memory, I/O, time), instruments code for per-function breakdown, and produces a standardized performance map from real data.


Overview

| Aspect | Details |
|--------|---------|
| Input | Problem statement: target (file/endpoint/pipeline) + observed metric |
| Output | Performance map (multi-metric, per-function), suspicion stack, bottleneck classification |
| Pattern | Discover test → Baseline run → Static analysis → Deep profile → Performance map → Report |


Workflow

Phases: Test Discovery → Baseline Run → Static Analysis → Deep Profile → Performance Map → Report


Phase 0: Test Discovery/Creation

MANDATORY READ: Load shared/references/ci_tool_detection.md for test framework detection.

MANDATORY READ: Load shared/references/benchmark_generation.md for auto-generating benchmarks when none exist.

Find or create commands that exercise the optimization target. Two outputs: test_command (profiling/measurement) and e2e_test_command (functional safety gate).

Step 1: Discover test_command

| Priority | Method | Action |
|----------|--------|--------|
| 1 | User-provided | User specifies test command or API endpoint |
| 2 | Discover existing E2E test | Grep test files for target entry point (stop at first match) |
| 3 | Create test script | Generate per shared/references/benchmark_generation.md to .hex-skills/optimization/{slug}/profile_test.sh |

E2E discovery protocol (stop at first match):

| Priority | Method | How |
|----------|--------|-----|
| 1 | Route-based search | Grep e2e/integration test files for entry point route |
| 2 | Function-based search | Grep for entry point function name |
| 3 | Module-based search | Grep for import of entry point module |
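
The three-priority search above can be sketched as a small helper. This is only an illustration under stated assumptions: `discover_e2e_test` is a hypothetical name, the search is limited to `*.py` test files, and the regular expressions are one plausible reading of "grep for route / function / import".

```python
import re
from pathlib import Path

def discover_e2e_test(test_root, route=None, function=None, module=None):
    """Return (path, source) for the first matching test file, else (None, "none")."""
    # Priority order mirrors the table: route, then function name, then module import.
    # The patterns are assumptions; a real search may need stack-specific rules.
    patterns = [
        ("route", re.escape(route) if route else None),
        ("function", rf"\b{re.escape(function)}\b" if function else None),
        ("module", rf"^\s*(?:from|import)\s+{re.escape(module)}\b" if module else None),
    ]
    files = sorted(p for p in Path(test_root).rglob("*.py") if p.is_file())
    for source, pattern in patterns:
        if pattern is None:
            continue
        regex = re.compile(pattern, re.MULTILINE)
        for path in files:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if regex.search(text):
                return path, source  # stop at first match
    return None, "none"
```

The returned `source` value lines up with the e2e_test_source field (route / function / module / none).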

Test creation (if no existing test found):

| Target Type | Generated Script |
|-------------|------------------|
| API endpoint | curl -w "%{time_total}" -o /dev/null -s {endpoint} |
| Function | Stack-specific benchmark per shared/references/benchmark_generation.md |
| Pipeline | Full pipeline invocation with test input |

Step 2: Discover e2e_test_command

If test_command came from E2E discovery (Step 1 priority 2): e2e_test_command = test_command.

Otherwise, run E2E discovery protocol again (same 3-priority table) to find a separate functional safety test.

If not found: e2e_test_command = null, log: WARNING: No e2e test covers {entry_point}. Full test suite serves as functional gate.

Output

| Field | Description |
|-------|-------------|
| test_command | Command for profiling/measurement |
| e2e_test_command | Command for functional safety gate (may equal test_command, or null) |
| e2e_test_source | Discovery method: user / route / function / module / none |


Phase 1: Baseline Run (Multi-Metric)

Run test_command with system-level profiling. Capture simultaneously:

| Metric | How to Capture | When |
|--------|----------------|------|
| Wall time | time wrapper or test harness | Always |
| CPU time (user+sys) | /usr/bin/time -v or language profiler | Always |
| Memory peak (RSS) | /usr/bin/time -v (Max RSS) or tracemalloc / process.memoryUsage() | Always |
| I/O bytes | /usr/bin/time -v or structured logs | If I/O suspected |
| HTTP round-trips | Count from structured logs or application metrics | If network I/O in call graph |
| GPU utilization | nvidia-smi --query-gpu | Only if CUDA/GPU detected in stack |

Baseline Protocol

| Parameter | Value |
|-----------|-------|
| Runs | 3 |
| Metric | Median |
| Warm-up | 1 discarded run |
| Output | baseline (multi-metric snapshot) |
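
A minimal sketch of the protocol (1 discarded warm-up, 3 measured runs, median) might look like the following. `run_baseline` is a hypothetical helper, and it captures wall time only; CPU time, peak RSS, and I/O would still come from /usr/bin/time -v or a language profiler as listed in the metric table.

```python
import statistics
import subprocess
import time

def run_baseline(command, runs=3, warmup=1):
    """Run `command` warmup + runs times and return the median wall time in ms."""
    samples_ms = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        # check=True aborts the baseline if the test command itself fails.
        subprocess.run(
            command,
            shell=isinstance(command, str),
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        elapsed_ms = (time.perf_counter() - start) * 1000
        if i >= warmup:  # warm-up runs are executed but not recorded
            samples_ms.append(elapsed_ms)
    return {"wall_time_ms": statistics.median(samples_ms), "samples_ms": samples_ms}
```

With an odd number of runs the median is one of the measured samples, so a single outlier run does not shift the baseline.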


Phase 2: Static Analysis → Instrumentation Points

MANDATORY READ: Load bottleneck_classification.md

Trace call chain from code + build suspicion stack. Purpose: guide WHERE to instrument in Phase 3.

Step 1: Trace Call Chain

Starting from entry point, trace depth-first (max depth 5). At each step, READ the full function body.
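
As an illustration of the depth-limited, depth-first trace, the sketch below walks a call graph given as a plain dictionary. The dictionary representation, the name `trace_call_chain`, and the convention that the entry point sits at depth 0 are all assumptions; in practice each step is discovered by reading the function body rather than looked up in a prebuilt graph.

```python
def trace_call_chain(call_graph, entry_point, max_depth=5):
    """Depth-first walk of call_graph (dict: function name -> list of callees).

    Returns (steps, truncated): steps is a list of (depth, name) in visit order,
    truncated lists callees that were skipped because they exceed max_depth.
    """
    steps = []
    truncated = []

    def visit(name, depth, path):
        steps.append((depth, name))
        for callee in call_graph.get(name, []):
            if callee in path:  # avoid looping on recursive calls
                continue
            if depth + 1 > max_depth:
                truncated.append(callee)  # note truncation instead of descending
                continue
            visit(callee, depth + 1, path | {callee})

    visit(entry_point, 0, {entry_point})
    return steps, truncated
```

Recording the truncated callees makes it possible to report that the chain was cut at the depth limit rather than silently dropping those calls.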

Cross-service tracing: If service_topology is available from the coordinator and a step makes an HTTP/gRPC or message-queue call to another service, choose the action by whether that service's code is accessible:

| Situation | Action |
|-----------|--------|
| HTTP call to service with code in submodule/monorepo | Follow into that service's handler: resolve route → trace handler code (depth resets to 0 for the new service) |
| HTTP call to service without accessible code | Classify as External, record latency estimate |
| gRPC/message queue to known service | Same as HTTP: follow into handler if code accessible |

Record service: "{service_name}" on each step to track which service owns it. The performance_map steps tree can span multiple services.

Depth-First Rule: If code of the called service is accessible — ALWAYS profile INSIDE. NEVER classify an accessible service as "External/slow" without profiling its internals. "Slow" is a symptom, not a diagnosis.

5 Whys for each bottleneck: Before reporting a bottleneck, chain "why?" until you reach config/architecture level:

  1. "What is slow?" → alignment service (5.9s)
  2. "Why?" → 6 pairs × ~1s each
  3. "Why ~1s per pair?" → O(n²) mwmf computation
  4. "Why O(n²)?" → library default, not production config
  5. "Why default?" → matching_methods not configured → root cause = config

Step 2: Classify & Suspicion Scan

For each step, classify by type (CPU, I/O-DB, I/O-Network, I/O-File, Architecture, External, Cache) and scan for performance concerns.

Suspicion checklist (minimum, not limitation):

| Category | What to Look For |
|----------|------------------|
| Connection management | Client created per-request? Missing pooling? Missing reuse? |
| Data flow | Data read multiple times? Over-fetching? Unnecessary transforms? |
| Async patterns | Sync I/O in async context? Sequential awaits without data dependency? |
| Resource lifecycle | Unclosed connections? Temp files? Memory accumulation in loop? |
| Configuration | Hardcoded timeouts? Default pool sizes? Missing batch size config? |
| Redundant work | Same validation at multiple layers? Same data loaded twice? |
| Architecture | N+1 in loop? Batch API unused? Cache infra unused? Sequential-when-parallel? |
| (open) | Anything else spotted; the checklist does not limit findings |

Step 2b: Suspicion Deduplication

MANDATORY READ: Load shared/references/output_normalization.md

After generating suspicions across all call chain steps, normalize and deduplicate per §1-§2:

  • Normalize suspicion descriptions (replace specific values with placeholders)
  • Group identical suspicions across different steps → merge into single entry with affected_steps: [list]
  • Example: "Missing connection pooling" found in steps 1.1, 1.2, 1.3 → one suspicion with affected_steps: ["1.1", "1.2", "1.3"]

Step 3: Verify & Map to Instrumentation Points

FOR each suspicion:
  1. VERIFY: follow code to confirm or dismiss
  2. VERDICT: CONFIRMED → map to instrumentation point | DISMISSED → log reason
  3. For each CONFIRMED suspicion, identify:
     - function to wrap with timing
     - I/O call to count
     - memory allocation to track

Profiler Selection (per stack)

| Stack | Non-invasive profiler | Invasive (if non-invasive insufficient) |
|-------|-----------------------|-----------------------------------------|
| Python | py-spy, cProfile | time.perf_counter() decorators |
| Node.js | clinic, --prof | console.time() wrappers |
| Go | pprof (built-in) | Usually not needed |
| .NET | dotnet-trace | Stopwatch wrappers |
| Rust | cargo flamegraph | std::time::Instant |

Stack detection: per shared/references/ci_tool_detection.md.


Phase 3: Deep Profile

Profiler Hierarchy (escalate as needed)

| Level | Tool Examples | What It Shows | When to Use |
|-------|---------------|---------------|-------------|
| 1 | py-spy, cProfile, pprof, dotnet-trace | Function-level hotspots | Always (first pass) |
| 2 | line_profiler, per-line timing | Line-level timing in hotspot function | Hotspot function found but cause unclear |
| 3 | tracemalloc, memory_profiler | Per-line memory allocation | Memory metrics abnormal in baseline |

Step 1: Non-Invasive Profiling (preferred)

Run test_command with Level 1 profiler to get per-function breakdown without code changes.

Step 2: Escalation Decision

After Level 1 profiler run, evaluate result against suspicion stack from Phase 2:

| Profiler Result | Action |
|-----------------|--------|
| Hotspot function identified, time breakdown confirms suspicions | DONE: proceed to Phase 4 |
| Hotspot identified but internal cause unclear (CPU vs I/O inside one function) | Escalate to Level 2 (line-level timing) |
| Memory baseline abnormal (peak or delta) | Escalate to Level 3 (memory profiler) |
| Multiple suspicions unresolved, profiler granularity insufficient | Go to Step 3 (targeted instrumentation) |
| Profiler unavailable or overhead > 20% of wall time | Go to Step 3 (targeted instrumentation) |
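
One way to read the decision table is as an ordered set of checks. The sketch below is an assumption-heavy illustration: the table does not define precedence between rows, so the order chosen here (profiler problems first, then memory, then unclear hotspot, then unresolved suspicions) and the fallback for uncovered cases are choices of this example, and `next_action` is a hypothetical name.

```python
def next_action(hotspot_found, cause_clear, memory_abnormal,
                unresolved_suspicions, profiler_available, overhead_pct):
    """Map a Level 1 profiler outcome to the next step (row order is an assumption)."""
    if not profiler_available or overhead_pct > 20:
        return "targeted_instrumentation"
    if memory_abnormal:
        return "level_3_memory_profiler"
    if hotspot_found and not cause_clear:
        return "level_2_line_timing"
    if unresolved_suspicions >= 2:  # "multiple suspicions unresolved"
        return "targeted_instrumentation"
    if hotspot_found and cause_clear:
        return "done"
    # Case not covered by the table: fall back to targeted instrumentation (assumption).
    return "targeted_instrumentation"
```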

Stop Conditions (Profiler Escalation)

| Condition | Action |
|-----------|--------|
| Hotspot identified with clear cause | STOP: proceed to Performance Map |
| All 3 profiler levels exhausted | STOP: build map from best available data |
| Instrumentation breaks tests | STOP: revert instrumentation, use non-invasive data only |
| Profiler overhead > 20% of wall time | STOP: skip to targeted instrumentation |

Step 3: Targeted Instrumentation (proactive)

Add timing/logging along the call stack at instrumentation points identified in Phase 2 Step 3:

1. FOR each CONFIRMED suspicion without measured data:
     Add timing wrapper around target function/I/O call
     Add counter for I/O round-trips if network/DB suspected
     (cross-service: instrument in the correct service's codebase)
2. Re-run test_command (3 runs, median)
3. Collect per-function measurements from logs
4. Record list of instrumented files (may span multiple services)

| Instrumentation Type | When | Example |
|----------------------|------|---------|
| Timing wrapper | Always for unresolved suspicions | time.perf_counter() around function call |
| I/O call counter | Network or DB bottleneck suspected | Count HTTP requests, DB queries in loop |
| Memory snapshot | Memory accumulation suspected | tracemalloc.get_traced_memory() before/after |
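
For a Python target, the three instrumentation types could be sketched as follows. The names `timed`, `count_io`, `memory_delta_bytes`, `TIMINGS_MS`, and `IO_CALLS` are hypothetical, and collecting results in module-level dictionaries stands in for whatever logging the project already uses.

```python
import functools
import time
import tracemalloc
from collections import Counter

TIMINGS_MS = {}       # function name -> list of wall times in ms
IO_CALLS = Counter()  # label -> number of I/O calls observed

def timed(func):
    """Timing wrapper: record wall time of every call with time.perf_counter()."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            TIMINGS_MS.setdefault(func.__name__, []).append(elapsed_ms)
    return wrapper

def count_io(label):
    """I/O call counter: call next to each HTTP request or DB query."""
    IO_CALLS[label] += 1

def memory_delta_bytes(func, *args, **kwargs):
    """Memory snapshot: traced memory before/after one call, plus the peak."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = func(*args, **kwargs)
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before, peak
```

A call counter placed inside a loop makes N+1 patterns visible directly: the count grows with the input size instead of staying constant.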

KEEP instrumentation in place. The executor reuses it for post-optimization per-function comparison, then cleans up after strike. Report instrumented_files in output.


Phase 4: Build Performance Map

Standardized format — feeds into .hex-skills/optimization/{slug}/context.md for downstream consumption.

performance_map:
  test_command: "uv run pytest tests/automated/e2e/test_example.py -s"
  baseline:
    wall_time_ms: 7280
    cpu_time_ms: 850
    memory_peak_mb: 256
    memory_delta_mb: 45
    io_read_bytes: 1200000
    io_write_bytes: 500000
    http_round_trips: 13
  steps:                          # service field present only in multi-service topology
    - id: "1"
      function: "process_job"
      location: "app/services/job_processor.py:45"
      service: "api"             # optional — which service owns this step
      wall_time_ms: 7200
      time_share_pct: 99
      type: "function_call"
      children:
        - id: "1.1"
          function: "translate_binary"
          wall_time_ms: 7100
          type: "function_call"
          children:
            - id: "1.1.1"
              function: "tikal_extract"
              service: "tikal"   # cross-service: code traced into submodule
              wall_time_ms: 2800
              type: "http_call"
              http_round_trips: 1
            - id: "1.1.2"
              function: "mt_translate"
              service: "mt-engine"
              wall_time_ms: 3500
              type: "http_call"
              http_round_trips: 13
  bottleneck_classification: "I/O-Network"
  bottleneck_detail: "13 sequential HTTP calls to MT service (3500ms)"
  top_bottlenecks:
    - { step: "1.1.2", type: "I/O-Network", share: "48%" }
    - { step: "1.1.1", type: "I/O-Network", share: "38%" }
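
The share values in top_bottlenecks can be derived from the steps tree. The sketch below assumes shares are computed for leaf steps against the baseline wall time, which matches the numbers in the example map; `flatten` and `top_bottlenecks` are hypothetical helper names.

```python
def flatten(steps):
    """Yield every step in the tree, parents before children."""
    for step in steps:
        yield step
        yield from flatten(step.get("children", []))

def top_bottlenecks(performance_map, limit=3):
    """Rank leaf steps by wall time and express each as a share of the baseline."""
    total_ms = performance_map["baseline"]["wall_time_ms"]
    leaves = [s for s in flatten(performance_map["steps"]) if not s.get("children")]
    leaves.sort(key=lambda s: s["wall_time_ms"], reverse=True)
    return [
        {"step": s["id"], "share_pct": round(100 * s["wall_time_ms"] / total_ms)}
        for s in leaves[:limit]
    ]
```

For the example map, step 1.1.2 takes 3500 of 7280 ms (48%) and step 1.1.1 takes 2800 of 7280 ms (38%).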

Phase 5: Report

Report Structure

profile_result:
  entry_point_info:
    type: <string>                     # "api_endpoint" | "function" | "pipeline"
    location: <string>                 # file:line
    route: <string|null>               # API route (if endpoint)
    function: <string>                 # Entry point function name
  performance_map: <object>            # Full map from Phase 4
  bottleneck_classification: <string>  # Primary bottleneck type
  bottleneck_detail: <string>          # Human-readable description
  top_bottlenecks:
    - step, type, share, description
  optimization_hints:                  # CONFIRMED suspicions only (Phase 2)
    - hint with evidence
  suspicion_stack:                     # Full audit trail (confirmed + dismissed)
    - category: <string>
      location: <string>
      description: <string>
      verdict: <string>               # "confirmed" | "dismissed"
      evidence: <string>
      verification_note: <string>
  e2e_test:
    command: <string|null>             # E2E safety test command (from Phase 0)
    source: <string>                   # user / route / function / module / none
  instrumented_files: [<string>]       # Files with active instrumentation (empty if non-invasive only)
  wrong_tool_indicators: []            # Empty = proceed, non-empty = exit

Wrong Tool Indicators

| Indicator | Condition |
|-----------|-----------|
| external_service_no_alternative | 90%+ measured time in external service, no batch/cache/parallel path |
| within_industry_norm | Measured time within expected range for operation type |
| infrastructure_bound | Bottleneck is hardware (measured via system metrics) |
| already_optimized | Code already uses best patterns (confirmed by suspicion scan) |
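
Only the first indicator has a numeric threshold, so it is the one that lends itself to a short sketch. The function below assumes a flat list of measured leaf steps with a "type" field using the Phase 2 classification and a caller-supplied flag for whether a batch/cache/parallel path exists; the function name and input shape are hypothetical.

```python
def external_service_no_alternative(steps, total_ms, has_alternative_path):
    """True when 90%+ of measured time is External and no batch/cache/parallel path exists."""
    external_ms = sum(s["wall_time_ms"] for s in steps if s["type"] == "External")
    return total_ms > 0 and external_ms / total_ms >= 0.90 and not has_alternative_path
```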


Error Handling

| Error | Recovery |
|-------|----------|
| Cannot resolve entry point | Block: "file/function not found at {path}" |
| Test command fails on unmodified code | Block: "test fails before profiling — fix test first" |
| Profiler not available for stack | Fall back to invasive instrumentation (Phase 3 Step 3) |
| Instrumentation breaks tests | Revert immediately: git checkout -- . |
| Call chain too deep (> 5 levels) | Stop at depth 5, note truncation |
| Cannot classify step type | Default to "Unknown", use measured time |
| No I/O detected (pure CPU) | Classify as CPU, focus on algorithm profiling |


References

  • bottleneck_classification.md — classification taxonomy
  • latency_estimation.md — latency heuristics (fallback for static-only mode)
  • shared/references/ci_tool_detection.md — stack/tool detection
  • shared/references/benchmark_generation.md — benchmark templates per stack

Runtime Summary Artifact

MANDATORY READ: Load shared/references/coordinator_summary_contract.md

Write .hex-skills/runtime-artifacts/runs/{run_id}/optimization-profile/{slug}.json before finishing.

Definition of Done

  • [ ] Test command discovered or created for optimization target
  • [ ] E2E safety test discovered (or documented as unavailable)
  • [ ] Baseline measured: wall time, CPU, memory (3 runs, median)
  • [ ] Call graph traced and function bodies read
  • [ ] Suspicion stack built: each suspicion verified and mapped to instrumentation point
  • [ ] Deep profile completed (non-invasive preferred, invasive if needed)
  • [ ] Instrumented files reported (cleanup deferred to executor)
  • [ ] Performance map built in standardized format (real measurements)
  • [ ] Top 3 bottlenecks identified from measured data
  • [ ] Wrong tool indicators evaluated from real metrics
  • [ ] optimization_hints contain only CONFIRMED suspicions with measurement evidence
  • [ ] Report prepared with measured findings
  • [ ] Optimization profile artifact written to the shared location

Version: 3.0.0
Last Updated: 2026-03-15
