Palantir Incident Runbook Skill

Overview

Rapid incident response for Foundry-related outages: API failures, transform build failures, authentication issues, and data pipeline stalls.

Prerequisites

- Access to application logs and Foundry build history
- Foundry service user credentials for health checks
- On-call escalation path defined

Instructions

Step 1: Triage (First 5 Minutes)

set -euo pipefail
echo "=== Foundry Incident Triage ==="
echo "Time: $(date -u)"

# 1. Check if Foundry itself is down
curl -s -o /dev/null -w "Foundry API: HTTP %{http_code}\n" \
  -H "Authorization: Bearer $FOUNDRY_TOKEN" \
  "https://$FOUNDRY_HOSTNAME/api/v2/ontologies" || echo "FOUNDRY UNREACHABLE"

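# 2. Check our app health (a failure here should not abort triage under set -e)
curl -s --max-time 10 http://localhost:8080/health | python -m json.tool \
  || echo "APP HEALTH CHECK FAILED"

# 3. Count error lines in the app log
# grep -c exits 1 when the count is 0, so guard it against set -e
grep -cE 'ApiError|status_code.*[45][0-9]{2}' /var/log/app/app.log || true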

Step 2: Classify Severity

| Severity | Criteria | Response Time |
|----------|----------|---------------|
| P1 Critical | Foundry API completely unreachable, all operations failing | Immediate |
| P2 High | Intermittent 429/5xx errors, degraded performance | 15 minutes |
| P3 Medium | Single transform failing, non-critical pipeline stalled | 1 hour |
| P4 Low | Deprecation warnings, minor performance degradation | Next business day |
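
The severity table above can be turned into a small decision function so that the classification is applied the same way on every page. This is a minimal sketch, not part of any Foundry tooling: the `Signals` type and its field names are hypothetical, and treating either "unreachable" or "all operations failing" as P1 is an assumption about how the table should be read.

```python
from typing import NamedTuple


class Signals(NamedTuple):
    """Hypothetical triage observations; field names are illustrative."""
    foundry_reachable: bool             # did the Foundry API probe get any HTTP response?
    all_operations_failing: bool        # is every call failing, not just some?
    intermittent_429_or_5xx: bool       # sporadic 429/5xx responses or degraded performance
    failing_transforms: int             # number of transforms currently failing
    noncritical_pipeline_stalled: bool  # a non-critical pipeline has stopped progressing


def classify_severity(s: Signals) -> str:
    """Map triage signals to the P1-P4 levels, checking the worst case first."""
    if not s.foundry_reachable or s.all_operations_failing:
        return "P1"
    if s.intermittent_429_or_5xx:
        return "P2"
    if s.failing_transforms >= 1 or s.noncritical_pipeline_stalled:
        return "P3"
    # Nothing above matched: deprecation warnings and similar low-priority noise.
    return "P4"
```

Checking the conditions from most to least severe means an incident that matches several rows is always assigned the highest applicable level.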

Step 3: Common Incident Playbooks

Playbook A: Authentication Failure (401/403)

# 1. Verify the token is set (prints the length only, never the token itself)
if [ -n "${FOUNDRY_TOKEN:-}" ]; then
  echo "Token set: yes (length ${#FOUNDRY_TOKEN})"
else
  echo "Token set: no"
fi

# 2. Test the token in FOUNDRY_TOKEN (export a freshly generated one first to rule out expiry)
python -c "
import os, foundry
client = foundry.FoundryClient(
    auth=foundry.UserTokenAuth(
        hostname=os.environ['FOUNDRY_HOSTNAME'],
        token=os.environ['FOUNDRY_TOKEN'],
    ),
    hostname=os.environ['FOUNDRY_HOSTNAME'],
)
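# An empty ontology list is not an auth failure, so avoid indexing [0] blindly
ontologies = list(client.ontologies.Ontology.list())
print('Auth OK:', ontologies[0].api_name if ontologies else 'no ontologies visible to this token')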
"
# 3. If still failing: regenerate credentials in Developer Console

Playbook B: Rate Limiting (429)

# 1. Check rate limit headers from last response
# 2. Enable request throttling
# 3. Review batch operations for unnecessary API calls
# See palantir-rate-limits for detailed implementation
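
As a rough illustration of step 2, request throttling on 429 responses is often implemented as exponential backoff that defers to a `Retry-After` header when one is present. The sketch below is generic and makes several assumptions: `send` is a hypothetical callable returning a `(status, headers)` pair, the header key is matched exactly as `Retry-After`, and whether Foundry actually sends that header is not confirmed by this runbook.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, headers); hypothetical shape


def retry_delay(attempt: int, headers: Mapping[str, str],
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    return min(cap, base * (2 ** attempt))


def call_with_throttling(send: Callable[[], Response], max_attempts: int = 5,
                         sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` until it stops returning 429 or the attempts run out."""
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status, headers
        if attempt < max_attempts - 1:
            delay = retry_delay(attempt, headers)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    return status, headers  # still rate limited after the final attempt
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests without real waiting. If the last attempt is still a 429, the response is returned to the caller, who decides whether to escalate.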

Playbook C: Transform Build Failure

1. Open Foundry > Pipeline Builder > failed build
2. Check the "Errors" tab for stack trace
3. Common causes:
   - OutOfMemoryError → add @configure(profile=["DRIVER_MEMORY_LARGE"])
   - AnalysisException → column name mismatch (case-sensitive)
   - Input dataset empty → check upstream pipeline
4. Fix code, commit, trigger rebuild
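
For the AnalysisException case in step 3, it can help to compare the column names the transform references against the names the input dataset actually has, separating case-only mismatches from columns that are missing altogether. The helper below is a hypothetical, dependency-free sketch; `diagnose_columns` is not part of the transforms API.

```python
from typing import Dict, Iterable, List


def diagnose_columns(expected: Iterable[str], actual: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the columns a transform references with those the input provides."""
    actual_list = list(actual)
    actual_set = set(actual_list)
    # Index the real column names by lowercase form to detect case-only differences.
    by_lower: Dict[str, str] = {}
    for name in actual_list:
        by_lower.setdefault(name.lower(), name)

    case_mismatch: List[str] = []
    missing: List[str] = []
    for name in expected:
        if name in actual_set:
            continue  # exact match, nothing to report
        match = by_lower.get(name.lower())
        if match is not None:
            case_mismatch.append(f"{name} -> {match}")
        else:
            missing.append(name)
    return {"case_mismatch": case_mismatch, "missing": missing}
```

With a PySpark DataFrame, the second argument would typically be `df.columns`. An entry such as `customer_id -> Customer_ID` in `case_mismatch` points to a rename in the code, while entries in `missing` suggest an upstream schema change.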

Step 4: Escalation

Level 1: On-call engineer (your team)
  → Check logs, verify credentials, restart service

Level 2: Platform team
  → Foundry enrollment issues, networking, VPN

Level 3: Palantir support
  → Create ticket with debug bundle (palantir-debug-bundle)
  → Include: error codes, timestamps, request IDs
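
The details requested at Level 3 (error codes, timestamps, request IDs) are easier to hand over when collected into one plain-text summary. The following is a minimal sketch under stated assumptions: `ErrorRecord` and `build_ticket_summary` are hypothetical names, the record fields are illustrative, and timestamps are assumed to be ISO 8601 strings in UTC with a uniform format so that they sort correctly as text.

```python
from typing import List, NamedTuple


class ErrorRecord(NamedTuple):
    """One failed request; the field names are illustrative."""
    timestamp: str   # ISO 8601 in UTC, same format for every record
    error_code: str  # HTTP status or API error name
    request_id: str  # identifier taken from the response or the logs


def build_ticket_summary(title: str, records: List[ErrorRecord]) -> str:
    """Render error codes, timestamps, and request IDs as plain text."""
    ordered = sorted(records, key=lambda r: r.timestamp)
    codes = sorted({r.error_code for r in ordered})
    lines = [
        f"Summary: {title}",
        f"Error codes: {', '.join(codes) if codes else 'none recorded'}",
    ]
    if ordered:
        lines.append(f"First seen: {ordered[0].timestamp}")
        lines.append(f"Last seen: {ordered[-1].timestamp}")
    lines.append("Requests:")
    lines += [f"  {r.timestamp}  {r.error_code}  {r.request_id}" for r in ordered]
    return "\n".join(lines)
```

The resulting text can be pasted into the support ticket alongside the debug bundle.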

Step 5: Postmortem Template

## Incident: [Title]
**Duration:** [start] to [end] ([X] minutes)
**Severity:** P[1-4]
**Impact:** [What was affected]

### Timeline
- HH:MM — Alert fired
- HH:MM — Investigation started
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Verified resolution

### Root Cause
[Description]

### Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
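
The header of the template can be filled in from the incident's start and end times so that the duration in minutes is computed rather than estimated. This is a small sketch, assuming both datetimes are already in UTC; `postmortem_header` is a hypothetical helper, not an existing tool.

```python
from datetime import datetime, timezone


def postmortem_header(title: str, start: datetime, end: datetime,
                      severity: int, impact: str) -> str:
    """Fill in the first four lines of the postmortem template."""
    if end < start:
        raise ValueError("end must not be before start")
    if severity not in (1, 2, 3, 4):
        raise ValueError("severity must be between 1 and 4")
    minutes = int((end - start).total_seconds() // 60)  # whole minutes, rounded down
    fmt = "%Y-%m-%d %H:%M UTC"  # assumes the datetimes are already in UTC
    return "\n".join([
        f"## Incident: {title}",
        f"**Duration:** {start.strftime(fmt)} to {end.strftime(fmt)} ({minutes} minutes)",
        f"**Severity:** P{severity}",
        f"**Impact:** {impact}",
    ])


if __name__ == "__main__":
    started = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
    resolved = datetime(2024, 5, 1, 10, 47, tzinfo=timezone.utc)
    print(postmortem_header("Example incident", started, resolved, 2, "Example impact"))
```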

Output

- Incident triaged and classified within 5 minutes
- Appropriate playbook executed
- Escalation, where needed, accompanied by a debug bundle
- Postmortem documented with action items

Error Handling

| Incident Type | First Action | Escalation Trigger |
|---------------|--------------|--------------------|
| API unreachable | Check Foundry status | If Foundry is up but we cannot connect |
| Auth failure | Test with fresh token | If new token also fails |
| Rate limiting | Enable throttling | If throttling does not resolve |
| Build failure | Check error logs | If error is infrastructure-related |

Next Steps

For proactive monitoring, see palantir-observability.