Salesforce Incident Runbook Skill

Overview

Rapid incident response procedures for Salesforce integration failures, covering Salesforce-side outages, API limit exhaustion, authentication failures, and data sync issues.

Prerequisites

Salesforce CLI authenticated (sf org login web, or sf org login jwt for CI)
Access to Salesforce Status API
Monitoring dashboards configured (see salesforce-observability)
Communication channels (Slack, PagerDuty)

Quick Triage (Do This First)

# 1. Is Salesforce itself down?
curl -s https://api.status.salesforce.com/v1/incidents/active | jq '.[0:3]'
# If any incidents are returned → Salesforce-side issue, enable fallback mode

# 2. Check your org's instance status
# Find your instance at https://status.salesforce.com and replace NA45 below
curl -s "https://api.status.salesforce.com/v1/instances/NA45/status" | jq '.status'

# 3. Check API limits: are we out of calls?
# (sf limits api display is the older, deprecated name for this command)
sf org list limits --target-org my-org --json | jq '.result[] | select(.name == "DailyApiRequests")'
# If remaining = 0 → REQUEST_LIMIT_EXCEEDED, see mitigation below

# 4. Check authentication
sf org display --target-org my-org --json | jq '.result.connectedStatus'
# If anything other than "Connected" (for example "RefreshTokenError") → re-authenticate

# 5. Check recent Apex debug logs in the org (your integration's own logs live elsewhere)
sf apex list log --target-org my-org --json | jq '.result[0:5]'

Decision Tree

Integration returning errors?
├── YES: Is status.salesforce.com showing an incident?
│   ├── YES → Salesforce outage. Enable fallback mode. Monitor status page.
│   └── NO → Check error type below:
│       ├── INVALID_SESSION_ID (401) → Token expired. Re-authenticate.
│       ├── REQUEST_LIMIT_EXCEEDED (403) → API limit hit. Reduce calls.
│       ├── UNABLE_TO_LOCK_ROW (409) → Record contention. Retry with backoff.
│       ├── MALFORMED_QUERY / INVALID_FIELD → Code bug. Check SOQL.
│       └── 500/503 → Salesforce-side. Wait and retry.
└── NO: Is data syncing correctly?
    ├── YES → Likely resolved or intermittent. Monitor.
    └── NO → Check CDC subscription, query timestamps, bulk job status.
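
The error branches of the tree above can be sketched as a lookup plus a retry helper for the two retryable branches. This is a minimal sketch, not part of the original runbook: the names `Action`, `classify`, and `retryWithBackoff` are hypothetical, and it assumes the thrown error exposes an `errorCode` string (as jsforce errors usually do) and optionally an HTTP `status`.

```typescript
// Hypothetical names: Action, SfError, classify, retryWithBackoff are illustrative only.
type Action =
  | "reauthenticate"
  | "reduce-calls"
  | "retry-with-backoff"
  | "fix-query"
  | "wait-and-retry"
  | "investigate";

interface SfError {
  errorCode?: string; // e.g. "INVALID_SESSION_ID" (assumed field name)
  status?: number;    // HTTP status, if the client exposes it (assumed field name)
}

// Maps an error to the action named in the decision tree.
function classify(err: SfError): Action {
  switch (err.errorCode) {
    case "INVALID_SESSION_ID":
      return "reauthenticate";
    case "REQUEST_LIMIT_EXCEEDED":
      return "reduce-calls";
    case "UNABLE_TO_LOCK_ROW":
      return "retry-with-backoff";
    case "MALFORMED_QUERY":
    case "INVALID_FIELD":
      return "fix-query";
  }
  if (err.status === 500 || err.status === 503) return "wait-and-retry";
  return "investigate";
}

// Retries only the branches the tree marks as retryable, with exponential backoff.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      const action = classify((e ?? {}) as SfError);
      const retryable = action === "retry-with-backoff" || action === "wait-and-retry";
      if (!retryable || attempt >= maxAttempts) throw e;
      // Delay grows as base, 2x base, 4x base, ...
      const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Authentication and limit errors are deliberately not retried here: retrying them only burns more calls without fixing the cause.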

Immediate Actions by Error Type

REQUEST_LIMIT_EXCEEDED — API Limit Exhausted

// This is a P1 — your integration is completely blocked

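// 1. Check what's consuming API calls
// (conn is assumed to be an authenticated jsforce Connection)
const limits = await conn.request('/services/data/v59.0/limits/');
// Each entry has the shape { Max, Remaining }
console.log('API calls:', limits.DailyApiRequests);
console.log('Bulk API:', limits.DailyBulkV2QueryJobs);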
// Limits reset on a 24-hour rolling basis

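// 2. Identify top consumers (Enterprise+ orgs with EventLogFile)
// EventLogFile has no UserId field: each row points to a CSV log file,
// so download the file and count rows per USER_ID column.
// Daily files usually appear the day after the activity they record.
const apiLogs = await conn.query(`
  SELECT Id, EventType, LogDate, LogFileLength
  FROM EventLogFile
  WHERE EventType = 'API' AND LogDate >= YESTERDAY
  ORDER BY LogDate DESC
  LIMIT 10
`);
// CSV body: /services/data/v59.0/sobjects/EventLogFile/<Id>/LogFile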

// 3. Immediate mitigation: pause non-critical integrations
// Set env var: SF_CRITICAL_ONLY=true
// Only allow essential operations (auth, health check, critical writes)
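
Step 3 above describes the critical-only switch in comments only. One way it might look in code is sketched below. The operation tags and the `isAllowed` helper are hypothetical and not part of the original runbook; the caller is assumed to pass in the flag, for example from `process.env.SF_CRITICAL_ONLY === 'true'`.

```typescript
// Hypothetical operation tags; adapt them to what your integration actually does.
type Operation =
  | "auth"
  | "health-check"
  | "critical-write"
  | "report-sync"
  | "bulk-export";

// Operations that stay enabled while the API budget is exhausted.
const CRITICAL_OPERATIONS: ReadonlySet<Operation> = new Set<Operation>([
  "auth",
  "health-check",
  "critical-write",
]);

// Returns true if the operation may call Salesforce right now.
// criticalOnly mirrors the SF_CRITICAL_ONLY environment flag.
function isAllowed(op: Operation, criticalOnly: boolean): boolean {
  return !criticalOnly || CRITICAL_OPERATIONS.has(op);
}

// Wraps a Salesforce call and refuses to run it when it is non-critical
// and the critical-only flag is set.
async function guarded<T>(
  op: Operation,
  criticalOnly: boolean,
  call: () => Promise<T>,
): Promise<T> {
  if (!isAllowed(op, criticalOnly)) {
    throw new Error(`Skipped "${op}": critical-only mode is enabled`);
  }
  return call();
}
```

Failing fast with an explicit error keeps skipped work visible in logs, so it can be replayed once the limit window frees up.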

INVALID_SESSION_ID — Authentication Failure

# Token expired or revoked — re-authenticate
sf org login web --alias my-org --instance-url https://login.salesforce.com

# For CI/automated: re-auth with JWT (variables quoted to survive spaces or empty values)
sf org login jwt \
  --client-id "$SF_CLIENT_ID" \
  --jwt-key-file server.key \
  --username "$SF_USERNAME" \
  --alias my-org

# Verify connection is restored
sf org display --target-org my-org

Salesforce System Outage

// Enable graceful degradation: serve stale data from cache
// (assumes redis is an ioredis client and getConnection() returns a jsforce Connection)
const FALLBACK_MODE = process.env.SF_FALLBACK_MODE === 'true';

async function queryWithFallback<T>(soql: string, cacheKey: string): Promise<T[]> {
  if (FALLBACK_MODE) {
    const cached = await redis.get(cacheKey);
    if (cached) {
      console.warn('SF FALLBACK: serving cached data');
      return JSON.parse(cached);
    }
    throw new Error('Salesforce unavailable and no cached data');
  }

  const conn = await getConnection();
  const result = await conn.query<T>(soql);

  // Always update cache for fallback
  await redis.set(cacheKey, JSON.stringify(result.records), 'EX', 3600);
  return result.records;
}
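
The function above only serves cached data when SF_FALLBACK_MODE is set by hand. A possible variant, sketched below, also falls back automatically when the live call fails. It is not the original author's implementation: `Cache`, `MemoryCache`, and `liveThenStale` are hypothetical names, and the in-memory cache stands in for Redis so the sketch is self-contained.

```typescript
// Minimal cache interface so the sketch does not depend on a specific Redis client.
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in for Redis; swap in a Redis-backed implementation in production.
class MemoryCache implements Cache {
  private store = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    const value = this.store.get(key);
    return value === undefined ? null : value;
  }

  async set(key: string, value: string): Promise<void> {
    this.store.set(key, value);
  }
}

interface FallbackResult<T> {
  records: T[];
  stale: boolean; // true when the records came from cache
}

// Tries the live query first; on any failure, serves the last cached copy if one exists.
async function liveThenStale<T>(
  fetchLive: () => Promise<T[]>,
  cache: Cache,
  cacheKey: string,
): Promise<FallbackResult<T>> {
  try {
    const records = await fetchLive();
    await cache.set(cacheKey, JSON.stringify(records));
    return { records, stale: false };
  } catch (err) {
    const cached = await cache.get(cacheKey);
    if (cached === null) throw err; // nothing to fall back to
    return { records: JSON.parse(cached) as T[], stale: true };
  }
}
```

Returning a `stale` flag lets callers label the data as possibly outdated instead of silently presenting it as fresh.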

Communication Templates

Internal (Slack)

P1 INCIDENT: Salesforce Integration
Status: INVESTIGATING
Error: [REQUEST_LIMIT_EXCEEDED / INVALID_SESSION_ID / SF outage]
Impact: [Data sync paused / API calls failing / user-facing errors]
Current action: [Checking limits / re-authenticating / enabling fallback]
Next update: [time]
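
If updates are posted by a bot rather than typed by hand, the template above can be filled in by a small formatter. This is a sketch under assumptions: the `IncidentUpdate` shape and `formatIncidentUpdate` are hypothetical, and the status values other than INVESTIGATING are illustrative, not taken from the runbook.

```typescript
// Hypothetical shape for one status update; field names mirror the template lines.
interface IncidentUpdate {
  severity: "P1" | "P2" | "P3" | "P4";
  status: "INVESTIGATING" | "MITIGATING" | "MONITORING" | "RESOLVED"; // illustrative values
  error: string;
  impact: string;
  currentAction: string;
  nextUpdate: string;
}

// Renders the update in the same six-line layout as the Slack template.
function formatIncidentUpdate(u: IncidentUpdate): string {
  return [
    `${u.severity} INCIDENT: Salesforce Integration`,
    `Status: ${u.status}`,
    `Error: ${u.error}`,
    `Impact: ${u.impact}`,
    `Current action: ${u.currentAction}`,
    `Next update: ${u.nextUpdate}`,
  ].join("\n");
}
```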

Postmortem Template

## Incident: Salesforce [Error Type]
**Date:** YYYY-MM-DD | **Duration:** X hours | **Severity:** P[1-4]

### Summary
[One sentence — e.g., "API limit exhausted due to unoptimized batch job"]

### Root Cause
[e.g., "New sync job ran SELECT * on Contact (3M records) using individual queries instead of Bulk API"]

### Impact
- API calls blocked for [duration]
- [N] users affected / [N] records not synced

### Timeline
- HH:MM — Alerts fired: REQUEST_LIMIT_EXCEEDED
- HH:MM — Triage: identified bulk sync as consumer
- HH:MM — Mitigated: paused sync job
- HH:MM — Resolved: API limit rolled over

### Action Items
- [ ] Migrate sync to Bulk API 2.0 — @owner — due date
- [ ] Add API budget guard (80% warning) — @owner — due date
- [ ] Set up EventLogFile monitoring for top consumers — @owner — due date
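
The "API budget guard (80% warning)" action item in the template above could start from something like the sketch below. It assumes the `{ Max, Remaining }` shape that the limits endpoint returns for DailyApiRequests; `apiBudgetState` and the state names are hypothetical.

```typescript
// Shape of one entry from the /limits/ response, e.g. limits.DailyApiRequests.
interface LimitEntry {
  Max: number;
  Remaining: number;
}

type BudgetState = "ok" | "warning" | "exhausted";

// Classifies API budget usage; warnAt is the used fraction that triggers a warning.
function apiBudgetState(limit: LimitEntry, warnAt = 0.8): BudgetState {
  // A zero or negative Max is treated as exhausted so division by zero never occurs.
  if (limit.Max <= 0 || limit.Remaining <= 0) return "exhausted";
  const usedFraction = (limit.Max - limit.Remaining) / limit.Max;
  return usedFraction >= warnAt ? "warning" : "ok";
}
```

A scheduled job could call this every few minutes and page on "warning", well before REQUEST_LIMIT_EXCEEDED blocks the integration.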

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status API | Network issue | Check https://status.salesforce.com manually |
| sf CLI auth expired | Token expired or revoked | Re-authenticate with sf org login web (or sf org login jwt in CI) |
| Limits API returns 403 | Limit already exceeded | Pause non-critical traffic and wait for the rolling 24-hour window to free up calls |
| Bulk job stuck | Processing timeout | Abort the job (PATCH its state to Aborted through the Bulk API 2.0 job endpoint), then resubmit it |

Next Steps

For data handling, see salesforce-data-handling.
