Palantir Incident Runbook
Overview
Rapid incident response for Foundry-related outages: API failures, transform build failures, authentication issues, and data pipeline stalls.
Prerequisites
- Access to application logs and Foundry build history
- Foundry service user credentials for health checks
- On-call escalation path defined
Instructions
Step 1: Triage (First 5 Minutes)
set -euo pipefail
echo "=== Foundry Incident Triage ==="
echo "Time: $(date -u)"
# 1. Check if Foundry itself is down (--max-time keeps a hung connection from blocking triage)
curl -s --max-time 10 -o /dev/null -w "Foundry API: HTTP %{http_code}\n" \
  -H "Authorization: Bearer $FOUNDRY_TOKEN" \
  "https://$FOUNDRY_HOSTNAME/api/v2/ontologies" || echo "FOUNDRY UNREACHABLE"
# 2. Check our app health (the fallback keeps set -e from aborting before step 3)
curl -s --max-time 10 http://localhost:8080/health | python3 -m json.tool || echo "APP HEALTH CHECK FAILED"
# 3. Count error lines in the app log (grep -c exits 1 on zero matches, hence || true)
grep -c "ApiError\|status_code.*[45][0-9][0-9]" /var/log/app/app.log || true
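The `grep -c` check above counts matches across the whole log file, so an old burst of errors can hide whether the problem is happening now. A minimal Python sketch, assuming the same log path and pattern as the triage script, counts matches only in the last lines of the file. The function name `count_recent_errors` and the 1000-line default window are illustrative choices, not part of the runbook.

```python
import re
from collections import deque

# Same pattern as the grep in the triage script:
# "ApiError", or "status_code" followed later on the line by a 4xx/5xx code.
ERROR_PATTERN = re.compile(r"ApiError|status_code.*[45][0-9][0-9]")


def count_recent_errors(log_path, window=1000):
    """Count lines matching ERROR_PATTERN among the last `window` lines.

    `window` is an assumed default for this sketch; tune it to the log volume.
    """
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        tail = deque(fh, maxlen=window)  # keeps only the final `window` lines
    return sum(1 for line in tail if ERROR_PATTERN.search(line))
```

For example, `count_recent_errors("/var/log/app/app.log", window=500)` would report how many of the last 500 lines look like API errors.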
Step 2: Classify Severity
| Severity | Criteria | Response Time |
|----------|----------|---------------|
| P1 Critical | Foundry API completely unreachable, all operations failing | Immediate |
| P2 High | Intermittent 429/5xx errors, degraded performance | 15 minutes |
| P3 Medium | Single transform failing, non-critical pipeline stalled | 1 hour |
| P4 Low | Deprecation warnings, minor performance degradation | Next business day |
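The P1 and P2 rows of the table above can be approximated from a handful of API probes. The sketch below is one possible reading of those criteria, not an official rule: the input shape (a list of HTTP status codes, with `None` for "no response") and the function names are assumptions made for illustration. P3 and P4 depend on which pipeline is affected, so the function returns `None` and leaves that call to the on-call engineer.

```python
def is_failure(code):
    """Treat no response (None), 429, and any 5xx as a failed probe."""
    return code is None or code == 429 or code >= 500


def classify_severity(status_codes):
    """Map recent Foundry API probe results to "P1", "P2", or None.

    Assumed interpretation for this sketch:
      - every probe failed  -> "P1" (all operations failing)
      - some probes failed  -> "P2" (intermittent 429/5xx)
      - no probes failed    -> None (P3/P4 need human judgment)
    """
    if not status_codes:
        raise ValueError("need at least one probe result")
    failures = sum(1 for code in status_codes if is_failure(code))
    if failures == len(status_codes):
        return "P1"
    if failures > 0:
        return "P2"
    return None
```

For example, `classify_severity([200, 429, 200])` falls under the intermittent-error case and returns `"P2"`.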
Step 3: Common Incident Playbooks
Playbook A: Authentication Failure (401/403)
# 1. Verify token is set
echo "Token set: $([ -n "${FOUNDRY_TOKEN:-}" ] && echo yes || echo NO)"
echo "Token length: ${#FOUNDRY_TOKEN}"
# 2. Export a fresh token as FOUNDRY_TOKEN, then test it
python3 -c "
import os, foundry
client = foundry.FoundryClient(
    auth=foundry.UserTokenAuth(
        hostname=os.environ['FOUNDRY_HOSTNAME'],
        token=os.environ['FOUNDRY_TOKEN'],
    ),
    hostname=os.environ['FOUNDRY_HOSTNAME'],
)
print('Auth OK:', list(client.ontologies.Ontology.list())[0].api_name)
"
# 3. If still failing: regenerate credentials in Developer Console
Playbook B: Rate Limiting (429)
# 1. Check rate limit headers from last response
# 2. Enable request throttling
# 3. Review batch operations for unnecessary API calls
# See palantir-rate-limits for detailed implementation
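The throttling step above can be sketched as a retry loop with exponential backoff. This is a minimal illustration, not the palantir-rate-limits implementation. It assumes the server may send the standard HTTP `Retry-After` header on a 429 (whether Foundry always does so is not confirmed here), and `send` is a hypothetical callable that performs one request and returns `(status_code, headers)`.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to backoff
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call `send()` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers_dict).
    Returns the last status code seen.
    """
    status = None
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    return status
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. If the final status is still 429 after `max_attempts`, that matches the escalation trigger "throttling does not resolve".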
Playbook C: Transform Build Failure
1. Open Foundry > Pipeline Builder > failed build
2. Check the "Errors" tab for stack trace
3. Common causes:
   - OutOfMemoryError → add @configure(profile=["DRIVER_MEMORY_LARGE"])
   - AnalysisException → column name mismatch (case-sensitive)
   - Input dataset empty → check upstream pipeline
4. Fix code, commit, trigger rebuild
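The common causes in step 3 above can be turned into a quick lookup when a stack trace is pasted into a chat or ticket. The sketch below is illustrative only: matching on substrings of the trace is an assumption, and the names `BUILD_ERROR_HINTS`, `FALLBACK_HINT`, and `diagnose_build_error` are hypothetical.

```python
# Markers and hints mirror the "common causes" list in Playbook C.
BUILD_ERROR_HINTS = [
    ("OutOfMemoryError",
     'Add @configure(profile=["DRIVER_MEMORY_LARGE"]) to the transform'),
    ("AnalysisException",
     "Check for a column name mismatch (names are case-sensitive)"),
]

# An empty input dataset has no fixed error marker, so it lives in the fallback.
FALLBACK_HINT = (
    "No known marker matched; check whether an input dataset is empty "
    "and read the full stack trace"
)


def diagnose_build_error(stack_trace):
    """Return the hints whose marker appears in `stack_trace`."""
    hints = [hint for marker, hint in BUILD_ERROR_HINTS if marker in stack_trace]
    return hints or [FALLBACK_HINT]
```

For example, passing a trace that contains `java.lang.OutOfMemoryError` returns the `@configure` hint.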
Step 4: Escalation
Level 1: On-call engineer (your team)
→ Check logs, verify credentials, restart service
Level 2: Platform team
→ Foundry enrollment issues, networking, VPN
Level 3: Palantir support
→ Create ticket with debug bundle (palantir-debug-bundle)
→ Include: error codes, timestamps, request IDs
Step 5: Postmortem Template
## Incident: [Title]
**Duration:** [start] to [end] ([X] minutes)
**Severity:** P[1-4]
**Impact:** [What was affected]
### Timeline
- HH:MM — Alert fired
- HH:MM — Investigation started
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Verified resolution
### Root Cause
[Description]
### Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
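The `[X] minutes` figure in the Duration line of the template is simple arithmetic on the start and end timestamps. A minimal sketch, assuming both are ISO 8601 strings with an explicit UTC offset; the function name `duration_line` is hypothetical.

```python
from datetime import datetime


def duration_line(start_iso, end_iso):
    """Render the template's Duration line from two ISO 8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    if end < start:
        raise ValueError("end is before start")
    minutes = round((end - start).total_seconds() / 60)
    return f"**Duration:** {start_iso} to {end_iso} ({minutes} minutes)"
```

An incident running from `2024-01-01T10:00:00+00:00` to `2024-01-01T10:47:00+00:00` would be rendered with `(47 minutes)`.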
Output
- Incident triaged and classified within 5 minutes
- Appropriate playbook executed
- Incident escalated with a debug bundle, if needed
- Postmortem documented with action items
Error Handling
| Incident Type | First Action | Escalation Trigger |
|---------------|--------------|--------------------|
| API unreachable | Check Foundry status | If Foundry is up but we cannot connect |
| Auth failure | Test with fresh token | If new token also fails |
| Rate limiting | Enable throttling | If throttling does not resolve |
| Build failure | Check error logs | If error is infrastructure-related |
Next Steps
For proactive monitoring, see palantir-observability.