Palantir Observability
Overview
Set up observability for Foundry integrations: structured JSON logging with request IDs, Prometheus metrics for API latency and errors, a health check endpoint, and alert rules.
Prerequisites
- Working Foundry integration
- Prometheus + Grafana (or equivalent monitoring stack)
- Familiarity with palantir-prod-checklist
Instructions
Step 1: Structured Logging
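```python
import json
import logging
import time
import uuid


class FoundryLogger:
    """Emit one JSON log line per Foundry API call."""

    def __init__(self):
        self.logger = logging.getLogger("foundry")
        # Attach the handler only once so repeated instantiation does not duplicate lines.
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(logging.Formatter("%(message)s"))
            self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_api_call(self, method: str, endpoint: str, status: int,
                     duration_ms: float, request_id: str = ""):
        is_error = status >= 400
        record = json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            # Pass request_id in to correlate several lines; otherwise a fresh one is generated.
            "request_id": request_id or str(uuid.uuid4())[:8],
            "service": "foundry",
            "method": method,
            "endpoint": endpoint,
            "status": status,
            "duration_ms": round(duration_ms, 2),
            "level": "error" if is_error else "info",
        })
        # Log at the matching severity so level-based filters see failures.
        self.logger.log(logging.ERROR if is_error else logging.INFO, record)
```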
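The logger above expects the caller to measure `duration_ms` itself. One way to avoid repeating that timing code is a small context manager. The following is a minimal sketch, not part of any Foundry SDK: `build_record`, `logged_call`, and the `foundry.example` logger name are hypothetical, and treating unexpected exceptions as status 500 is an assumption.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

logger = logging.getLogger("foundry.example")  # hypothetical logger name


def build_record(request_id, method, endpoint, status, duration_ms):
    """Build a log record with the same fields as the structured logger above."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "request_id": request_id,
        "service": "foundry",
        "method": method,
        "endpoint": endpoint,
        "status": status,
        "duration_ms": round(duration_ms, 2),
        "level": "error" if status >= 400 else "info",
    }


@contextmanager
def logged_call(method, endpoint, request_id=None):
    """Time the wrapped block and emit one JSON log line when it exits."""
    request_id = request_id or uuid.uuid4().hex[:8]
    outcome = {"status": 200}  # the caller may overwrite this with the real status
    start = time.monotonic()
    try:
        yield outcome
    except Exception:
        # Assumption: an unexpected failure is logged as 500 unless an error status was already set.
        if outcome["status"] < 400:
            outcome["status"] = 500
        raise
    finally:
        duration_ms = (time.monotonic() - start) * 1000
        record = build_record(request_id, method, endpoint, outcome["status"], duration_ms)
        logger.info(json.dumps(record))
```

Usage would look like `with logged_call("GET", "/ontologies") as outcome: ...`, setting `outcome["status"]` inside the block when the real status code is known.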
Step 2: Prometheus Metrics
```python
import time

import foundry  # assumed: the Foundry SDK module that exposes ApiError
from prometheus_client import Counter, Histogram, Gauge

foundry_requests = Counter(
    "foundry_api_requests_total",
    "Total Foundry API requests",
    ["method", "endpoint", "status"],
)
foundry_latency = Histogram(
    "foundry_api_latency_seconds",
    "Foundry API request latency",
    ["endpoint"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
foundry_health = Gauge(
    "foundry_api_healthy",
    "1 if Foundry API is reachable, 0 otherwise",
)


def instrumented_call(client, method, *args, **kwargs):
    """Call an SDK method and record its outcome and latency.

    `client` is not used here; `method` is expected to be a bound SDK method.
    """
    endpoint = method.__qualname__
    # Fallback label for failures that are not ApiError (e.g. network errors),
    # so the `finally` block never sees an unbound `status`.
    status = "error"
    start = time.monotonic()
    try:
        result = method(*args, **kwargs)
        status = 200
        return result
    except foundry.ApiError as e:
        status = e.status_code
        raise
    finally:
        duration = time.monotonic() - start
        foundry_requests.labels(method="API", endpoint=endpoint, status=str(status)).inc()
        foundry_latency.labels(endpoint=endpoint).observe(duration)
```
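The `buckets` argument sets upper bounds in seconds, and each exported `_bucket{le="..."}` series counts every observation at or below its bound, so the counts are cumulative. The sketch below illustrates that bookkeeping in plain Python. It is not how `prometheus_client` is implemented, and `cumulative_counts` is a hypothetical helper.

```python
from bisect import bisect_left

# Example upper bounds in seconds, matching the histogram defined above.
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def cumulative_counts(observations, bounds=BUCKETS):
    """Return {upper_bound: count}, where each bucket counts observations <= its bound."""
    per_bucket = [0] * (len(bounds) + 1)  # the last slot is the implicit +Inf bucket
    for value in observations:
        # bisect_left finds the first bound that is >= value.
        per_bucket[bisect_left(bounds, value)] += 1
    counts, running = {}, 0
    for bound, n in zip(list(bounds) + [float("inf")], per_bucket):
        running += n
        counts[bound] = running
    return counts
```

Anything slower than the largest bound lands only in the `+Inf` bucket, which is why the largest finite bound should sit at or above any latency threshold you want to alert on.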
Step 3: Health Check with Metrics
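```python
import asyncio
import time


async def foundry_health_check():
    """Probe Foundry and update the foundry_api_healthy gauge.

    Assumes `client` (an authenticated Foundry client) and `foundry_health`
    (the Gauge from Step 2) are available at module scope.
    """
    start = time.monotonic()
    try:
        # The SDK call is blocking, so run it in a worker thread to keep the event loop free.
        await asyncio.to_thread(lambda: list(client.ontologies.Ontology.list()))
        latency = (time.monotonic() - start) * 1000
        foundry_health.set(1)
        return {"status": "healthy", "latency_ms": round(latency, 1)}
    except Exception as e:
        foundry_health.set(0)
        return {"status": "unhealthy", "error": str(e)}
```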
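A health check that hangs is as unhelpful as one that fails, so it can be worth bounding how long the probe may run. The following sketch assumes Python 3.9+ (for `asyncio.to_thread`). `check_with_timeout` and its `probe` argument are hypothetical names, and the 5-second default is an arbitrary choice.

```python
import asyncio
import time


async def check_with_timeout(probe, timeout_s=5.0):
    """Run a blocking probe in a worker thread and bound how long we wait for it."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(asyncio.to_thread(probe), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"status": "unhealthy", "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}
    latency_ms = (time.monotonic() - start) * 1000
    return {"status": "healthy", "latency_ms": round(latency_ms, 1)}
```

A caller could pass the same probe used above, for example `await check_with_timeout(lambda: list(client.ontologies.Ontology.list()))`. Note that the timeout only stops waiting; the worker thread itself keeps running until the blocking call returns.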
Step 4: Alert Rules (Prometheus)
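```yaml
groups:
  - name: foundry
    rules:
      - alert: FoundryAPIDown
        expr: foundry_api_healthy == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Foundry API unreachable for 2+ minutes"
      - alert: FoundryHighErrorRate
        # Ratio of 5xx responses to all requests, not an absolute request rate.
        expr: |
          sum(rate(foundry_api_requests_total{status=~"5.."}[5m]))
            / sum(rate(foundry_api_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
      - alert: FoundryHighLatency
        # Quantiles need rate() over the buckets, aggregated by le.
        # >= rather than >: when 10 is the highest finite bucket bound, the estimate is capped at 10.
        expr: histogram_quantile(0.99, sum by (le) (rate(foundry_api_latency_seconds_bucket[5m]))) >= 10
        for: 10m
        labels:
          severity: warning
```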
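Prometheus estimates quantiles by linear interpolation inside the bucket that contains the requested rank, and when that bucket is `+Inf` it reports the highest finite bound instead. The sketch below is a simplified re-implementation for illustration only, not Prometheus's own code. The bucket counts in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs ending with +Inf."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan  # not reached for well-formed cumulative input
```

With counts such as `[(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 95), (2.5, 97), (5.0, 98), (10.0, 98), (float("inf"), 100)]`, two requests took longer than 10 s, yet the p99 estimate is exactly 10.0. A latency alert written as a strict `> 10` comparison would therefore never fire with these bucket bounds.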
Step 5: Dashboard Queries (Grafana)
```promql
# Request rate by status
sum by (status) (rate(foundry_api_requests_total[5m]))
# P99 latency
histogram_quantile(0.99, sum by (le) (rate(foundry_api_latency_seconds_bucket[5m])))
# Error ratio
sum(rate(foundry_api_requests_total{status=~"[45].."}[5m]))
  / sum(rate(foundry_api_requests_total[5m]))
```
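To make the error-ratio query concrete, the sketch below computes a per-second rate from raw counter samples and divides the error rate by the total rate. It is a deliberate simplification: Prometheus's `rate()` also extrapolates to the window edges, which is omitted here. The function names and sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp_s, value) samples, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from 0.
        increase += cur - prev if cur >= prev else cur
    return increase


def simple_rate(samples):
    """Average per-second increase over the sampled window."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


def error_ratio(error_samples, total_samples):
    """Fraction of requests that failed over the window (0.0 when there was no traffic)."""
    total = simple_rate(total_samples)
    return simple_rate(error_samples) / total if total > 0 else 0.0
```

For example, 30 failed requests out of 600 over a 5-minute window gives a ratio of 0.05, which is the 5% level used in the error-rate alert.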
Output
- Structured JSON logging with request IDs
- Prometheus metrics for requests, latency, and health
- Alert rules for API downtime, error rate, and latency
- Grafana dashboard queries
Error Handling
| Alert | Threshold | Action |
|-------|-----------|--------|
| API Down | Health check fails for 2 min | Page on-call, check palantir-incident-runbook |
| High Error Rate | 5xx > 5% for 5 min | Check Foundry status, review logs |
| High Latency | p99 > 10 s for 10 min | Review query complexity, check Foundry load |
| Rate Limited | Spike in 429 responses | Tune rate limiter settings |
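For the rate-limited case, one common client-side mitigation is to retry with exponential backoff and jitter. The sketch below is illustrative only: `RateLimitedError` is a hypothetical stand-in for whatever your client raises on HTTP 429, and the attempt count and delays are arbitrary defaults rather than Foundry recommendations.

```python
import random
import time


class RateLimitedError(Exception):
    """Hypothetical stand-in for the exception raised on an HTTP 429 response."""


def call_with_backoff(func, max_attempts=5, base_delay_s=0.5, max_delay_s=30.0,
                      sleep=time.sleep):
    """Call func, retrying on RateLimitedError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt; the actual wait is random below it.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can substitute a recorder instead of actually waiting.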
Next Steps
For multi-environment setup, see palantir-multi-env-setup.