Oracle Cloud Incident Runbook
Overview
Self-service runbook for when OCI instances go down and the status page stays green. OCI's status page has a history of not acknowledging outages in real time (London Jan 2026 — 502s and instances disappearing for 10 minutes with no status update). OCI Support response times average 4+ hours for Sev-1 tickets. This runbook gives you health probes, automated instance recovery, cross-AD failover, and cross-region failover — all executable without waiting on Oracle.
Purpose: Detect OCI service degradation independently, recover instances automatically, and fail over to alternate availability domains or regions when the primary is impacted.
Prerequisites
- OCI CLI installed and configured —
~/.oci/configvalidated (seeoraclecloud-install-auth) - Python 3.8+ with the OCI SDK —
pip install oci - Pre-configured resources: at least one compute instance, a VCN with subnets in multiple ADs
- IAM policies:
manage instances,manage volumes,inspect work-requestsin the target compartment - Boot volume backups enabled (recovery depends on having a recent backup)
Instructions
Step 1: Independent Health Probes
Do not trust the OCI status page alone. Run your own health checks against the OCI API:
import oci
import time
config = oci.config.from_file("~/.oci/config")
def probe_oci_health(config):
"""Probe OCI API endpoints independently of the status page."""
results = {}
# Probe 1: Identity service (lightest call)
try:
start = time.time()
identity = oci.identity.IdentityClient(config)
identity.list_regions()
results["identity"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
except oci.exceptions.ServiceError as e:
results["identity"] = {"status": "degraded", "error": str(e.status)}
# Probe 2: Compute service
try:
start = time.time()
compute = oci.core.ComputeClient(config)
compute.list_instances(compartment_id=config["tenancy"], limit=1)
results["compute"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
except oci.exceptions.ServiceError as e:
results["compute"] = {"status": "degraded", "error": str(e.status)}
# Probe 3: Networking service
try:
start = time.time()
network = oci.core.VirtualNetworkClient(config)
network.list_vcns(compartment_id=config["tenancy"], limit=1)
results["networking"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
except oci.exceptions.ServiceError as e:
results["networking"] = {"status": "degraded", "error": str(e.status)}
return results
health = probe_oci_health(config)
for service, status in health.items():
print(f" {service}: {status['status']} ({status.get('latency_ms', 'N/A')}ms)")
Step 2: Severity Classification
| Level | Condition | Detection | Response Time |
|-------|-----------|-----------|---------------|
| P1 — Total Outage | All API probes fail, instances unreachable | All 3 probes return degraded | Immediate — trigger cross-region failover |
| P2 — Partial Degradation | Some APIs slow (>2s latency), instance agent unresponsive | Latency >2000ms on any probe | 15 min — attempt instance recovery, prepare failover |
| P3 — Intermittent | Sporadic 429/500 errors, some requests succeed | Occasional probe failures | 1 hour — enable retries, monitor trend |
Step 3: Instance Recovery Actions
# Check current instance state
oci compute instance get --instance-id "$INSTANCE_OCID" \
--query 'data."lifecycle-state"' --raw-output
# Action: Reset (hard reboot) — fastest recovery for hung instances
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action RESET
# Action: Reboot (graceful) — for instances still responding to ACPI
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action SOFTRESET
# Action: Stop then Start — forces reallocation to new host hardware
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action STOP
# Wait for STOPPED state
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action START
Step 4: Cross-AD Failover
When an entire availability domain is impacted, launch a replacement instance in a different AD using the latest boot volume backup:
import oci
config = oci.config.from_file("~/.oci/config")
compute = oci.core.ComputeClient(config)
blockstorage = oci.core.BlockstorageClient(config)
# Find latest boot volume backup
backups = blockstorage.list_boot_volume_backups(
compartment_id="COMPARTMENT_OCID",
boot_volume_id="BOOT_VOLUME_OCID",
sort_by="TIMECREATED",
sort_order="DESC",
limit=1,
).data
if not backups:
raise RuntimeError("No boot volume backups found — cannot failover")
latest_backup = backups[0]
print(f"Using backup: {latest_backup.id} from {latest_backup.time_created}")
# Launch replacement instance in alternate AD
launch_details = oci.core.models.LaunchInstanceDetails(
availability_domain="AD-2", # Different from the impacted AD
compartment_id="COMPARTMENT_OCID",
shape="VM.Standard.E4.Flex",
shape_config=oci.core.models.LaunchInstanceShapeConfigDetails(ocpus=2, memory_in_gbs=16),
display_name="failover-instance",
source_details=oci.core.models.InstanceSourceViaBootVolumeDetails(
source_type="bootVolume",
boot_volume_id=latest_backup.id,
),
create_vnic_details=oci.core.models.CreateVnicDetails(
subnet_id="SUBNET_OCID_IN_AD2",
),
)
response = compute.launch_instance(launch_details)
print(f"Failover instance launching: {response.data.id}")
Step 5: Cross-Region Failover
For region-wide outages, use a pre-replicated boot volume in the DR region:
# List available boot volume replicas in DR region
oci bv boot-volume-backup list \
--compartment-id "$COMPARTMENT_OCID" \
--query 'data[0].id' --raw-output \
--region us-phoenix-1
# Launch instance in DR region
oci compute instance launch \
--compartment-id "$COMPARTMENT_OCID" \
--availability-domain "AD-1" \
--shape "VM.Standard.E4.Flex" \
--shape-config '{"ocpus":2,"memoryInGBs":16}' \
--display-name "dr-failover-instance" \
--source-boot-volume-id "$DR_BACKUP_OCID" \
--subnet-id "$DR_SUBNET_OCID" \
--region us-phoenix-1
Step 6: Status Monitoring Script
Run this in a loop during an incident to detect recovery:
#!/bin/bash
while true; do
STATUS=$(oci compute instance get --instance-id "$INSTANCE_OCID" \
--query 'data."lifecycle-state"' --raw-output 2>&1)
TIMESTAMP=$(date -u +%H:%M:%S)
echo "$TIMESTAMP | Instance: $STATUS"
if [ "$STATUS" = "RUNNING" ]; then
echo "$TIMESTAMP | Instance recovered."
break
fi
sleep 30
done
Output
Successful execution produces:
- Independent health probe results for Identity, Compute, and Networking services
- Severity classification (P1/P2/P3) based on probe results
- Instance recovery action (reset, reboot, or stop/start) applied to the impacted instance
- Cross-AD failover instance launched from the latest boot volume backup (if AD-level failure)
- Cross-region failover instance launched in the DR region (if region-level failure)
- Continuous monitoring log tracking instance lifecycle state until recovery
Error Handling
| Error | Code | Cause | Solution |
|-------|------|-------|----------|
| NotAuthenticated | 401 | API key expired during incident | Use oci session authenticate for token-based short-term auth |
| NotAuthorizedOrNotFound | 404 | IAM policy missing for recovery actions | Pre-create policy: allow group sre to manage instances in compartment prod |
| TooManyRequests | 429 | Rate limited during mass recovery | OCI has no Retry-After header — back off 30 seconds between calls |
| InternalError | 500 | OCI service itself is degraded | This confirms the outage — proceed to cross-region failover |
| No boot volume backups | — | Backups not configured | Cannot failover without backups — enable in Block Storage > Boot Volumes > Backup Policies |
| ServiceError status -1 | — | API timeout (region unreachable) | Switch to DR region immediately: --region us-phoenix-1 |
Examples
Quick health check (CLI one-liner):
# Test if OCI API is responsive
oci iam region list --output table && echo "OCI API: OK" || echo "OCI API: UNREACHABLE"
Automated recovery wrapper:
#!/bin/bash
set -euo pipefail
INSTANCE_OCID="${1:?Usage: recover.sh <instance-ocid>}"
STATE=$(oci compute instance get --instance-id "$INSTANCE_OCID" \
--query 'data."lifecycle-state"' --raw-output)
case "$STATE" in
RUNNING) echo "Instance is healthy" ;;
STOPPED) oci compute instance action --instance-id "$INSTANCE_OCID" --action START ;;
*) oci compute instance action --instance-id "$INSTANCE_OCID" --action RESET ;;
esac
Resources
- OCI Instance Recovery — stop, start, reboot, and reset actions
- OCI Status Page — official service health (may lag actual incidents)
- OCI CLI Reference — command-line interface documentation
- OCI Python SDK Reference — ComputeClient, BlockstorageClient APIs
- OCI Known Issues — platform-known issues and workarounds
- SDK Troubleshooting — connectivity and timeout debugging
Next Steps
After stabilizing the incident, run oraclecloud-debug-bundle to collect forensic data for the postmortem, or review oraclecloud-prod-checklist to harden your environment against future outages.