Kubernetes Debugger
Systematic debugging workflows for Kubernetes issues using MCP kubernetes tools.
Prerequisites
Install Kubernetes MCP Server
claude mcp add kubernetes --scope user -- npx mcp-server-kubernetes
Requirements:
- Access to a Kubernetes cluster configured for kubectl (minikube, Rancher Desktop, GKE, EKS, AKS, etc.)
- kubeconfig at `~/.kube/config` (default) or `KUBECONFIG` env var set
- Helm v3 in PATH (optional, for Helm operations)
Alternative installation methods:
# Global install
npm install -g mcp-server-kubernetes
# Or run directly with npx (no install)
npx mcp-server-kubernetes
Verify installation:
claude mcp list # Should show 'kubernetes' server
Quick Reference: MCP Tools
| Tool | Use For |
|------|---------|
| kubectl_get | List resources, check status, find resource names |
| kubectl_describe | Detailed info, events, conditions |
| kubectl_logs | Container stdout/stderr, application errors |
| exec_in_pod | Run commands inside containers |
| kubectl_rollout | Deployment rollout status/history |
| node_management | Cordon/drain/uncordon nodes |
| kubectl_generic | Other kubectl subcommands (e.g. top) |
Debugging Decision Tree
Issue reported
│
├─ Pod not running? ──────────► See: Pod Debugging Workflow
│
├─ Service unreachable? ──────► See: Service/Network Debugging
│
├─ Deployment stuck? ─────────► See: Deployment Debugging
│
├─ Node issues? ──────────────► See: Node Debugging
│
└─ Performance/Resources? ────► See: Resource Debugging
Pod Debugging Workflow
Step 1: Get Pod Status
kubectl_get(resourceType="pods", namespace="<ns>")
Common statuses and their meaning:
- Pending: Scheduling issues (resources, node selector, affinity)
- CrashLoopBackOff: Container crashing repeatedly
- ImagePullBackOff/ErrImagePull: Cannot pull container image
- Running but not ready: Readiness probe failing
- Terminating: Stuck deletion (finalizers, PDB)
Step 2: Check Events and Conditions
kubectl_describe(resourceType="pod", name="<pod>", namespace="<ns>")
Look for in output:
- Events section: Scheduling failures, image pull errors, probe failures
- Conditions: PodScheduled, Initialized, ContainersReady, Ready
- Container State: Waiting (reason), Running, Terminated (exit code)
Step 3: Get Container Logs
kubectl_logs(resourceType="pod", name="<pod>", namespace="<ns>", container="<container>")
Options:
- `previous=true`: logs from the crashed (previous) container
- `tail=100`: last N lines
- `since="1h"`: logs from the last hour
Step 4: Exec Into Container (if running)
exec_in_pod(name="<pod>", namespace="<ns>", command=["sh", "-c", "<cmd>"])
Useful commands:
- `["cat", "/etc/resolv.conf"]` - check DNS config
- `["env"]` - verify environment variables
- `["ls", "-la", "/app"]` - check mounted files
- `["nc", "-zv", "<host>", "<port>"]` - test connectivity
Common Pod Issues
CrashLoopBackOff
- Get logs with `kubectl_logs(previous=true)` for the crashed container
- Check the exit code in `kubectl_describe` output
- Common causes:
- Exit code 1: Application error
- Exit code 137: OOMKilled (check memory limits)
- Exit code 143: SIGTERM (graceful shutdown issue)
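These exit codes are not arbitrary: any code above 128 means the process died from a signal, where code = 128 + signal number. This can be reproduced locally without a cluster:

```shell
# Exit code = 128 + fatal signal number.
# SIGKILL is signal 9 (what the kernel OOM killer sends), so OOMKilled reports 137.
sh -c 'kill -9 $$'    # subshell kills itself with SIGKILL
echo "SIGKILL exit code: $?"   # 137

# SIGTERM is signal 15 (sent on pod termination), hence 143.
sh -c 'kill -15 $$'   # subshell kills itself with SIGTERM
echo "SIGTERM exit code: $?"   # 143
```

Exit codes below 128 (such as 1) come from the application itself, not a signal, so the fix is in the app's logs rather than in resource limits.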
ImagePullBackOff
- Check image name/tag in describe output
- Verify image exists in registry
- Check imagePullSecrets if private registry
- Look for "Failed to pull image" in events
Pending Pod
- Check events for scheduling failure reason
- Common causes:
- `Insufficient cpu/memory`: node capacity exhausted
- `node(s) didn't match node selector`: wrong labels
- `PersistentVolumeClaim not bound`: storage issue
- `0/N nodes available`: taints/tolerations mismatch
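For the node-selector and taint cases, the fix usually lives in the pod spec. A minimal sketch (label and taint names are illustrative, not from your cluster):

```yaml
# Hypothetical pod spec fragment addressing selector/taint scheduling failures
spec:
  nodeSelector:
    disktype: ssd            # pod only schedules to nodes labeled disktype=ssd
  tolerations:
    - key: "maintenance"     # tolerate nodes tainted maintenance=true:NoSchedule
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
```

Compare these fields against the node's actual labels and taints in `kubectl_describe` output for the node.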
Service/Network Debugging
Step 1: Verify Service Exists
kubectl_get(resourceType="services", namespace="<ns>")
kubectl_describe(resourceType="service", name="<svc>", namespace="<ns>")
Step 2: Check Endpoints
kubectl_get(resourceType="endpoints", name="<svc>", namespace="<ns>")
No endpoints? Check:
- Pod labels match service selector
- Pods are Running and Ready
- Target port matches container port
Step 3: Test DNS Resolution
exec_in_pod(name="<debug-pod>", command=["nslookup", "<service>.<namespace>.svc.cluster.local"])
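In-cluster Service DNS names follow the `<service>.<namespace>.svc.<cluster-domain>` pattern, where the cluster domain defaults to `cluster.local`. Assembling the name to test is plain string composition (names below are illustrative):

```shell
# Build the in-cluster FQDN for a Service; svc/ns values are examples
svc="my-svc"
ns="default"
fqdn="${svc}.${ns}.svc.cluster.local"
echo "$fqdn"   # my-svc.default.svc.cluster.local
```

Within the same namespace, the short name `<service>` alone also resolves via the pod's DNS search path (visible in `/etc/resolv.conf`).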
Step 4: Test Connectivity
exec_in_pod(name="<debug-pod>", command=["nc", "-zv", "<service>", "<port>"])
Deployment Debugging
Check Rollout Status
kubectl_rollout(subCommand="status", resourceType="deployment", name="<deploy>", namespace="<ns>")
View Rollout History
kubectl_rollout(subCommand="history", resourceType="deployment", name="<deploy>", namespace="<ns>")
Rollback if Needed
kubectl_rollout(subCommand="undo", resourceType="deployment", name="<deploy>", namespace="<ns>")
Common Issues
- Progressing stuck: New pods failing (check ReplicaSet pods)
- Available < desired: Pods not passing readiness probes
- Surge/unavailable conflicts: Check deployment strategy
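The surge/unavailable settings live under the deployment's rollout strategy. A sketch of the relevant fragment (values are examples, not recommendations):

```yaml
# Deployment strategy fragment controlling pod counts during a rollout
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 pod above the desired count
      maxUnavailable: 0    # never drop below the desired count
```

With `maxUnavailable: 0`, a rollout cannot make progress unless new pods become Ready, so a failing readiness probe stalls it entirely.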
Node Debugging
Check Node Status
kubectl_get(resourceType="nodes")
kubectl_describe(resourceType="node", name="<node>")
Node Conditions to Check
| Condition | Problem If |
|-----------|------------|
| Ready | False or Unknown |
| MemoryPressure | True |
| DiskPressure | True |
| PIDPressure | True |
| NetworkUnavailable | True |
Drain Node for Maintenance
node_management(operation="cordon", nodeName="<node>") # Prevent new pods
node_management(operation="drain", nodeName="<node>", confirmDrain=true) # Evict pods
# After maintenance:
node_management(operation="uncordon", nodeName="<node>")
Resource Debugging
Check Resource Usage
kubectl_generic(command="top", resourceType="pods", namespace="<ns>")
kubectl_generic(command="top", resourceType="nodes")
OOMKilled Detection
- Run `kubectl_describe` on the pod and look for "OOMKilled" in the container state
- Check memory limits vs actual usage
- Solutions:
- Increase memory limits
- Fix memory leak in application
- Add memory requests for better scheduling
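Requests and limits are set per container in the pod spec; a minimal sketch (sizes are placeholders to adjust against observed usage):

```yaml
# Container resources fragment: requests guide scheduling, limits cap usage
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this gets the container OOMKilled (exit 137)
```

Setting the memory request equal to the limit gives the pod a Guaranteed QoS class for memory, making it less likely to be evicted under node pressure.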
CPU Throttling
- Check if CPU limits are too restrictive
- Consider removing CPU limits (keep requests)
- Use `kubectl top pods` to see actual usage
Reference Files
- references/pod-states.md: Complete pod state reference
- references/common-errors.md: Error messages and solutions
- references/network-debug.md: Network troubleshooting details