Agent Skills: Analyzing TDigest Metrics

Analyze percentile metrics (tdigest type) using OPAL for latency analysis and SLO tracking. Use when calculating p50, p95, p99 from pre-aggregated duration or latency metrics. Covers the critical double-combine pattern with align + m_tdigest() + tdigest_combine + aggregate. For simple metrics (counts, averages), see aggregating-gauge-metrics skill.

ID: rustomax/observe-community-mcp/analyzing-tdigest-metrics

Analyzing TDigest Metrics

TDigest metrics in Observe store pre-aggregated percentile data for efficient latency and duration analysis. This skill teaches the specialized pattern for querying tdigest metrics using OPAL.

When to Use This Skill

  • Calculating latency percentiles (p50, p95, p99) for services or endpoints
  • Analyzing request duration distributions
  • Setting or tracking SLOs (Service Level Objectives) based on percentiles
  • Understanding performance characteristics beyond simple averages
  • Working with any metric of type tdigest
  • When you need accurate percentile calculations from pre-aggregated data

Prerequisites

  • Access to Observe tenant via MCP
  • Understanding that tdigest metrics are pre-aggregated percentile structures
  • Metric dataset with type: tdigest
  • Familiarity with percentiles (p50 = median, p95 = 95th percentile, etc.)
  • Use discover_context() to find and inspect tdigest metrics

Key Concepts

What Are TDigest Metrics?

TDigest (t-digest) is a probabilistic data structure for estimating percentiles efficiently:

Pre-aggregated percentile data: Not raw values, but compressed statistical summaries

  • Stores distribution information in compact form
  • Enables accurate percentile calculations
  • Much more efficient than storing all raw values

Why percentiles matter:

  • Averages hide outliers: A service with avg 100ms might have p99 at 10 seconds
  • SLOs use percentiles: "p95 latency < 500ms" is a common SLO target
  • User experience: p95/p99 show what real users experience, not just average case

Common Examples:

  • span_sn_service_node_duration_tdigest_5m - Service-to-service latency percentiles
  • span_sn_service_edge_duration_tdigest_5m - Edge latency percentiles
  • request_duration_tdigest_5m - Request duration percentiles
  • database_query_duration_tdigest_5m - Database query latency percentiles

CRITICAL: The Double-Combine Pattern

TDigest metrics require a special pattern that's different from gauge metrics:

# WRONG - Missing second combine ❌
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)

# CORRECT - Double-combine pattern ✅
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Why the double combine?

  1. First tdigest_combine (in align): Combines tdigest data points within time buckets
  2. Second tdigest_combine (in aggregate): Re-combines tdigests across groups/dimensions
  3. Then tdigest_quantile: Calculates the actual percentile value

Pattern breakdown:

align options(bins: 1),
      combined:tdigest_combine(m_tdigest("metric_name"))  ← First combine
aggregate p95:tdigest_quantile(
                tdigest_combine(combined),                ← Second combine (NESTED!)
                0.95),                                    ← Quantile value (0.0-1.0)
          group_by(service_name)

Percentile Values

Percentiles are specified as decimal values from 0.0 to 1.0:

| Percentile | Value | Meaning |
|------------|-------|---------|
| p50 (median) | 0.50 | 50% of values are below this |
| p75 | 0.75 | 75% of values are below this |
| p90 | 0.90 | 90% of values are below this |
| p95 | 0.95 | 95% of values are below this |
| p99 | 0.99 | 99% of values are below this |
| p99.9 | 0.999 | 99.9% of values are below this |

Common SLO targets: p95 < 500ms, p99 < 1000ms
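
For example, p99.9 is expressed as 0.999. A minimal sketch, assuming the same service-node duration metric used in the examples below:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p999:tdigest_quantile(tdigest_combine(combined), 0.999),
          group_by(service_name)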

Summary vs Time-Series (Same as Gauge Metrics)

| Output Type | Pattern | Result | Pipe? |
|-------------|---------|--------|-------|
| Summary | options(bins: 1) | One row per group | NO |
| Time-Series | 5m, 1h | Many rows per group | YES |

Discovery Workflow

Step 1: Search for tdigest metrics

discover_context("duration tdigest", result_type="metric")
discover_context("latency percentile", result_type="metric")

Step 2: Get detailed metric schema

discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")

Step 3: Verify the metric type
Look for Type: tdigest (critical!)

Step 4: Note the available dimensions
These are used for group_by():

  • service_name, for_service_name
  • environment, for_environment
  • etc. (shown in discovery output)

Step 5: Write the query
Use the double-combine pattern with the dimensions you discovered, as in the sketch below.
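
A minimal end-to-end sketch, assuming the metric and dimensions found in Steps 1-4:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name, environment)

The Basic Patterns below walk through variations of this query.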

Basic Patterns

Pattern 1: Overall Percentiles (No Grouping)

Calculate percentiles across all data:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
          p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99:tdigest_quantile(tdigest_combine(combined), 0.99)

Output: Single row with overall p50, p95, p99 across entire time range.

Note: Both combines present, no group_by.

Pattern 2: Percentiles Per Service

Calculate percentiles broken down by dimension:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
          p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name)

Output: One row per service with percentiles.

Pattern 3: Single Percentile (Common for SLOs)

Get just p95 for SLO tracking:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name)
| sort desc(p95)
| limit 10

Output: Top 10 services by p95 latency.

Use case: Identify slowest services for optimization.

Pattern 4: Converting Units

TDigest values are often in nanoseconds - convert for readability:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50_ns:tdigest_quantile(tdigest_combine(combined), 0.50),
          p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name)
| make_col p50_ms:p50_ns / 1000000,
          p95_ms:p95_ns / 1000000,
          p99_ms:p99_ns / 1000000

Output: Percentiles in both nanoseconds and milliseconds.

Note: Check sample values in discover_context() to identify units.

Pattern 5: Time-Series Percentiles

Track percentiles over time buckets:

align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
            group_by(service_name)

Output: Multiple rows per service (one per 5-minute interval).

Note: Pipe | required for time-series pattern.

Use case: Dashboard charts showing latency trends over time.

Common Use Cases

SLO Tracking: p95 Latency Under Threshold

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name)
| make_col p95_ms:p95_ns / 1000000
| make_col slo_target:500,
          meets_slo:if(p95_ms < 500, "yes", "no")
| sort desc(p95_ms)

Use case: Check which services meet p95 < 500ms SLO target.

Output: Services with SLO compliance status.

Latency Distribution Analysis

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
          p75:tdigest_quantile(tdigest_combine(combined), 0.75),
          p90:tdigest_quantile(tdigest_combine(combined), 0.90),
          p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name)
| make_col p50_ms:p50 / 1000000,
          p95_ms:p95 / 1000000,
          p99_ms:p99 / 1000000

Use case: Understand full latency distribution to identify outliers.

Insight: Large gap between p95 and p99 indicates inconsistent performance.
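
To surface that gap directly, you could append a derived column to the query above. A sketch, assuming make_col supports subtraction the same way it supports the division already shown:

| make_col tail_gap_ms:p99_ms - p95_ms
| sort desc(tail_gap_ms)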

Comparing Services by Latency

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)
| limit 10

Use case: Find slowest services to prioritize optimization efforts.

Time-Series for Incident Investigation

align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
            group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000

Use case: See when latency spiked during an incident.

Output: Timeline of p95 latency for specific service.

Multi-Dimension Grouping

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name, environment)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)

Use case: Compare latency across services AND environments.

Complete Example

Scenario: You're tracking SLOs for your microservices. The target is p95 latency < 500ms and p99 latency < 1000ms for all production services.

Step 1: Discover tdigest metrics

discover_context("duration tdigest", result_type="metric")

Found: span_sn_service_node_duration_tdigest_5m (type: tdigest)

Step 2: Get metric details

discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")

Available dimensions: service_name, environment, for_service_name

Step 3: Query for SLO compliance

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name, environment)
| make_col p95_ms:p95_ns / 1000000,
          p99_ms:p99_ns / 1000000
| make_col p95_slo:if(p95_ms < 500, "✓", "✗"),
          p99_slo:if(p99_ms < 1000, "✓", "✗")
| filter environment = "production"
| sort desc(p95_ms)

Step 4: Interpret results

| service_name | environment | p95_ms | p99_ms | p95_slo | p99_slo |
|--------------|-------------|--------|--------|---------|---------|
| frontend | production | 19373.5 | 5641328.2 | ✗ | ✗ |
| featureflagservice | production | 5838.8 | 7473.9 | ✗ | ✗ |
| cartservice | production | 4136.6 | 5898.3 | ✗ | ✗ |
| productcatalogservice | production | 257.0 | 313.1 | ✓ | ✓ |
| currencyservice | production | 54.1 | 125.1 | ✓ | ✓ |

Insight: frontend, featureflagservice, and cartservice are violating their SLO targets and need optimization.

Step 5: Investigate frontend latency over time

align 1h, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
            p99:tdigest_quantile(tdigest_combine(combined), 0.99),
            group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000, p99_ms:p99 / 1000000

Output: Hourly p95/p99 trends to identify when latency degraded.
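
To make breaches stand out in that timeline, you could flag each hourly bucket against the target by reusing the if() pattern from Step 3. A sketch appended to the Step 5 query (the 500 ms threshold is this example's SLO, not a fixed rule):

| make_col breaches_slo:if(p95_ms > 500, "yes", "no")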

Common Pitfalls

Pitfall 1: Forgetting Second Combine

Wrong (most common mistake):

align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)

Correct:

align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Why: TDigest requires combining twice - once in align, once in aggregate.

Error message: "the field has to be aggregated or grouped"

Pitfall 2: Using m() Instead of m_tdigest()

Wrong:

align options(bins: 1), combined:tdigest_combine(m("duration_tdigest_5m"))

Correct:

align options(bins: 1), combined:tdigest_combine(m_tdigest("duration_tdigest_5m"))

Why: Tdigest metrics require m_tdigest() function, not m().

Check: Look for Type: tdigest in discover_context() output.

Pitfall 3: Wrong Pipe Usage (Same as Gauge)

Wrong (pipe with bins:1):

align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Correct:

# Summary - NO pipe
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

# Time-series - YES pipe
align 5m, combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Pitfall 4: Percentile Value Out of Range

Wrong:

aggregate p95:tdigest_quantile(tdigest_combine(combined), 95)

Correct:

aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Why: Quantile values must be 0.0 to 1.0 (not 1 to 100).

Pitfall 5: Not Converting Units

Wrong (values in nanoseconds, hard to read):

aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Result: p95 = 14675991.25 (what unit is this?)

Correct (convert to milliseconds):

aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95)
| make_col p95_ms:p95_ns / 1000000

Result: p95_ms = 14.68 (clearly milliseconds)

Tip: Check sample values in discovery to identify units (19-digit numbers = nanoseconds).

Percentile Reference

Common percentiles and their meanings:

| Percentile | Decimal | Meaning | Common Use |
|------------|---------|---------|------------|
| p50 | 0.50 | Median (middle value) | Typical user experience |
| p75 | 0.75 | 75th percentile | Better than average case |
| p90 | 0.90 | 90th percentile | Catching most outliers |
| p95 | 0.95 | 95th percentile | Standard SLO target |
| p99 | 0.99 | 99th percentile | Tail latency / worst 1% |
| p99.9 | 0.999 | 99.9th percentile | Extreme outliers |

SLO best practice: Track p95 and p99, not just averages.

Unit Conversion Reference

Common time unit conversions (assuming nanoseconds):

# Nanoseconds to milliseconds (most common)
make_col value_ms:value_ns / 1000000

# Nanoseconds to seconds
make_col value_sec:value_ns / 1000000000

# Nanoseconds to microseconds
make_col value_us:value_ns / 1000

How to identify units: Check sample values in discover_context():

  • 19 digits (1760201545280843522) = nanoseconds
  • 13 digits (1758543367916) = milliseconds
  • 10 digits (1758543367) = seconds
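
Applied in a full query, the conversion sits right after the aggregate. A sketch converting a p99 duration to seconds, assuming the underlying values are nanoseconds:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name)
| make_col p99_sec:p99_ns / 1000000000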

Best Practices

  1. Always use double-combine pattern - most critical rule for tdigest
  2. Verify metric type - must be tdigest (not gauge)
  3. Check units - convert nanoseconds to milliseconds for readability
  4. Use multiple percentiles - p50, p95, p99 show full distribution
  5. Calculate SLO compliance - add derived columns comparing to targets
  6. Sort and limit - focus on worst offenders with sort desc() | limit 10
  7. Use time-series for investigation - see when latency changed
  8. Group by relevant dimensions - service, environment, endpoint, etc.
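
As a starting point, here is a single query sketch that applies most of these practices (double-combine, unit conversion, an SLO column, sort and limit); treat it as a template to adapt, not a canonical form:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name, environment)
| make_col p95_ms:p95_ns / 1000000, p99_ms:p99_ns / 1000000
| make_col meets_slo:if(p95_ms < 500, "yes", "no")
| sort desc(p95_ms)
| limit 10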

Related Skills

  • aggregating-gauge-metrics - For count/sum/avg metrics (NOT percentiles)
  • working-with-intervals - For calculating percentiles from raw interval data (slower)
  • time-series-analysis - For event/interval trending with timechart

Summary

TDigest metrics enable efficient percentile calculations:

  • Core pattern: align + m_tdigest() + double tdigest_combine + tdigest_quantile
  • Critical rule: Use tdigest_combine() TWICE (in align AND in aggregate)
  • Metric function: m_tdigest() (NOT m())
  • Percentile values: 0.0 to 1.0 (0.95 = p95)
  • Common percentiles: p50 (median), p95 (SLO), p99 (tail latency)
  • Units: Often nanoseconds - convert to milliseconds for readability

Key distinction: TDigest metrics use special double-combine pattern, while gauge metrics use simple m() + aggregate.


Last Updated: November 14, 2025
Version: 1.0
Tested With: Observe OPAL (ServiceExplorer/Service Inspector Metrics)