Kubernetes
Overview
This skill covers Kubernetes resource configuration, deployment strategies, cluster architecture, security hardening, Helm chart development, and production operations. It helps create production-ready manifests, troubleshoot cluster issues, and implement security best practices.
Agent Delegation
senior-software-engineer (Opus) - DEFAULT. Application architecture on K8s, multi-service design, cluster architecture AND implementation, operators, CRDs, RBAC, NetworkPolicies, PodSecurityStandards, admission control, manifests, and configs software-engineer (Sonnet) - ONLY for boilerplate manifests following established patterns or scaffolding from a concrete plan code-reviewer - Reviews K8s YAML for best practices and security issues
Note: For cluster architecture and security hardening, use senior-software-engineer with /loom-kubernetes, /loom-security-audit, and /loom-threat-model skills
Instructions
1. Design Resource Architecture
- Plan namespace organization and boundaries
- Define resource requests/limits based on workload
- Design service mesh topology (if applicable)
- Plan for high availability (replicas, affinity, PDBs)
- Choose workload type (Deployment/StatefulSet/DaemonSet/Job)
- Design storage strategy (PVCs, StorageClasses, ephemeral volumes)
2. Create Resource Manifests
- Write Deployments/StatefulSets/DaemonSets with proper lifecycle
- Configure Services (ClusterIP/NodePort/LoadBalancer) and Ingress
- Set up ConfigMaps and Secrets (sealed secrets, external secrets)
- Define RBAC policies (ServiceAccounts, Roles, RoleBindings)
- Implement NetworkPolicies for pod-to-pod communication
- Configure PodDisruptionBudgets for availability guarantees
3. Develop Helm Charts
- Structure charts (Chart.yaml, values.yaml, templates/)
- Use template functions (_helpers.tpl for common labels)
- Parameterize configurations (replicas, image tags, resources)
- Define dependencies (requirements.yaml or Chart.yaml dependencies)
- Implement hooks (pre-install, post-upgrade, etc.)
- Test charts locally (helm template, helm lint, helm install --dry-run)
- Package and publish charts to registries
4. Implement Security Best Practices
- Run as non-root user (runAsUser, runAsNonRoot)
- Drop all capabilities (securityContext.capabilities.drop: [ALL])
- Use read-only root filesystems (readOnlyRootFilesystem: true)
- Prevent privilege escalation (allowPrivilegeEscalation: false)
- Implement PodSecurityStandards (restricted profile)
- Use NetworkPolicies for zero-trust networking
- Scan images for vulnerabilities (integrate Trivy, Snyk)
- Rotate secrets regularly (external secrets operator)
- Audit RBAC permissions (principle of least privilege)
5. Configure Observability
- Add Prometheus annotations for scraping
- Configure logging (stdout/stderr, fluentd/loki)
- Implement health probes (liveness, readiness, startup)
- Export metrics (custom metrics for HPA)
- Set up distributed tracing (OpenTelemetry)
- Configure alerting rules
6. Troubleshoot Issues
- Pod crashes:
kubectl logs,kubectl describe pod, events - Image pull failures: Check ImagePullSecrets, registry auth
- Networking: Test DNS resolution, check NetworkPolicies
- Resource constraints: Check node capacity, pod eviction
- Scheduling failures: Node selectors, taints/tolerations, affinity
- Performance: Use
kubectl top, Prometheus metrics - CrashLoopBackOff: Check startup probes, init containers
- Pending pods: Describe pod for scheduling failures
7. Production Operations
- Rolling updates with proper rollout strategy
- Blue-green deployments (multiple services)
- Canary deployments (traffic splitting via service mesh)
- Backup and restore (Velero for cluster state)
- Disaster recovery planning (multi-region, etcd backups)
- Capacity planning (resource quotas, limit ranges)
- Cost optimization (right-sizing, spot instances, cluster autoscaler)
- Upgrade planning (test in staging, rolling node upgrades)
Best Practices
- Use Namespaces: Organize resources logically, enable RBAC boundaries
- Set Resource Limits: Prevent resource exhaustion, enable QoS classes
- Health Probes: Liveness checks only core process health (never external dependencies — that causes cascading restarts); readiness additionally checks dependencies and drains the pod; use a startup probe instead of a large initialDelaySeconds for slow starts
- Rolling Updates: Zero-downtime deployments with maxSurge/maxUnavailable
- Secrets Management: Use external secrets, never hardcode credentials
- Label Everything: Enable filtering, selection, and monitoring
- Use Helm/Kustomize: Template and manage manifests as code
- Security Contexts: Always run as non-root with minimal privileges
- NetworkPolicies: Default deny, explicit allow for zero-trust
- Pod Disruption Budgets: Protect availability during disruptions
- Resource Quotas: Prevent runaway resource consumption per namespace
- Admission Control: Use OPA/Kyverno for policy enforcement
- Immutable Infrastructure: Never SSH into pods, replace instead
- GitOps: Use ArgoCD/Flux for declarative deployment
Examples
Example 1: Production Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
namespace: production
labels:
app: api-server
version: v1.2.0
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: api-server
template:
metadata:
labels:
app: api-server
version: v1.2.0
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
serviceAccountName: api-server
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault # REQUIRED by Restricted PSS; omitting it = pod rejected
containers:
- name: api
image: myregistry.io/api-server:v1.2.0
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
protocol: TCP
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: api-secrets
key: database-url
- name: LOG_LEVEL
valueFrom:
configMapKeyRef:
name: api-config
key: log-level
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# startupProbe gates liveness/readiness; prefer it over a large
# initialDelaySeconds so liveness can stay tight at runtime.
startupProbe:
httpGet:
path: /health/live
port: http
failureThreshold: 30 # 30 * 10s = 5 min startup budget
periodSeconds: 10
livenessProbe:
httpGet:
path: /health/live # process health only — no DB/cache/upstream checks
port: http
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: http
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir:
sizeLimit: 100Mi
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: api-server
topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-server
Example 2: Service and Ingress
apiVersion: v1
kind: Service
metadata:
name: api-server
namespace: production
labels:
app: api-server
spec:
type: ClusterIP
ports:
- port: 80
targetPort: http
protocol: TCP
name: http
selector:
app: api-server
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-server
namespace: production
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/rate-limit: "100"
nginx.ingress.kubernetes.io/rate-limit-window: "1m"
spec:
ingressClassName: nginx # canonical since 1.18; kubernetes.io/ingress.class is deprecated
tls:
- hosts:
- api.example.com
secretName: api-tls-cert
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-server
port:
number: 80
Example 3: ConfigMap and Secret
apiVersion: v1
kind: ConfigMap
metadata:
name: api-config
namespace: production
data:
log-level: "info"
max-connections: "100"
cache-ttl: "3600"
feature-flags: |
{
"new-checkout": true,
"beta-features": false
}
---
apiVersion: v1
kind: Secret
metadata:
name: api-secrets
namespace: production
type: Opaque
# WARNING: Secrets are only base64-encoded, NOT encrypted at rest in etcd by default.
# Enable encryption at rest or use External Secrets Operator / a secrets manager.
# Never commit real credentials — placeholders only.
stringData:
database-url: "postgresql://user:CHANGE_ME@db-host:5432/myapp"
api-key: "CHANGE_ME"
Example 4: Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# WARNING: memory-based HPA is unreliable for load-driven scaling — memory is
# non-compressible and runtimes rarely return heap to the OS, so it scales out
# but never scales back in. Prefer CPU or external/custom metrics (RPS, queue
# depth via KEDA). Every container must set resources.requests.cpu or the HPA
# silently ignores the pod for CPU metrics.
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
Example 5: Helm Chart Structure
mychart/
Chart.yaml # Chart metadata
values.yaml # Default configuration values
templates/
_helpers.tpl # Template helpers
deployment.yaml # Deployment template
service.yaml # Service template
ingress.yaml # Ingress template
configmap.yaml # ConfigMap template
secret.yaml # Secret template
NOTES.txt # Post-install notes
charts/ # Chart dependencies
.helmignore # Files to ignore
Chart.yaml:
apiVersion: v2
name: api-server
description: Production API server Helm chart
type: application
version: 1.2.0
appVersion: "1.2.0"
keywords:
- api
- backend
maintainers:
- name: Platform Team
email: platform@example.com
dependencies:
- name: postgresql
version: "12.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: postgresql.enabled
values.yaml:
replicaCount: 3
image:
repository: myregistry.io/api-server
tag: "1.2.0"
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 80
targetPort: 8080
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: api.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: api-tls-cert
hosts:
- api.example.com
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
# Avoid targetMemoryUtilizationPercentage: memory-based HPA rarely scales back in.
postgresql:
enabled: true
auth:
username: apiuser
database: apidb
templates/_helpers.tpl:
{{- define "api-server.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- define "api-server.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{- define "api-server.labels" -}}
helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}
app.kubernetes.io/name: {{ include "api-server.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
Example 6: NetworkPolicy (Zero-Trust)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-server-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: api-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: postgresql
ports:
- protocol: TCP
port: 5432
- to:
# Same list element: namespaceSelector AND podSelector (zero-trust).
- namespaceSelector:
matchLabels:
name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53 # required: DNS falls back to TCP for large responses / DNSSEC
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
Example 7: RBAC Configuration
apiVersion: v1
kind: ServiceAccount
metadata:
name: api-server
namespace: production
labels:
app: api-server
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: api-server
namespace: production
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
resourceNames: ["api-secrets", "database-creds"]
verbs: ["get"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: api-server
namespace: production
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: api-server
subjects:
- kind: ServiceAccount
name: api-server
namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: pod-reader
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: read-pods-global
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: pod-reader
subjects:
- kind: ServiceAccount
name: monitoring
namespace: observability
Example 8: StatefulSet with Persistent Storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
namespace: production
spec:
serviceName: postgresql-headless
replicas: 3
selector:
matchLabels:
app: postgresql
template:
metadata:
labels:
app: postgresql
spec:
containers:
- name: postgres
image: postgres:15-alpine
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_DB
value: myapp
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-creds
key: username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-creds
key: password
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
name: postgresql-headless
namespace: production
spec:
clusterIP: None
selector:
app: postgresql
ports:
- port: 5432
targetPort: postgres
name: postgres
Troubleshooting Commands
# Pod debugging
kubectl get pods -n production
kubectl describe pod api-server-abc123 -n production
kubectl logs api-server-abc123 -n production
kubectl logs api-server-abc123 -n production --previous
kubectl logs api-server-abc123 -c container-name -n production
kubectl exec -it api-server-abc123 -n production -- /bin/sh
# Events and status
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl get events -n production --field-selector involvedObject.name=api-server-abc123
# Resource usage
kubectl top nodes
kubectl top pods -n production
kubectl describe node node-1
# Network debugging
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- /bin/bash
kubectl exec -it api-server-abc123 -n production -- nslookup kubernetes.default
kubectl exec -it api-server-abc123 -n production -- curl -v http://other-service
# Configuration
kubectl get configmap api-config -n production -o yaml
kubectl get secret api-secrets -n production -o jsonpath='{.data}'
# Deployments and rollouts
kubectl rollout status deployment/api-server -n production
kubectl rollout history deployment/api-server -n production
kubectl rollout undo deployment/api-server -n production
kubectl rollout restart deployment/api-server -n production
# RBAC debugging
kubectl auth can-i get pods --as=system:serviceaccount:production:api-server -n production
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json | jq '.items[] | select(.subjects[]?.name=="api-server")'
# NetworkPolicy debugging
kubectl describe networkpolicy api-server-netpol -n production
# Helm operations
helm list -n production
helm get values api-server -n production
helm history api-server -n production
helm upgrade api-server ./mychart -n production --values values.yaml
helm rollback api-server 2 -n production
helm uninstall api-server -n production
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl get --raw '/livez?verbose' # componentstatuses deprecated since 1.19
kubectl get --raw '/readyz?verbose'
kubectl api-resources
kubectl api-versions
Common Issues and Solutions
| Issue | Cause | Solution | | ----------------------- | ------------------------------------------------ | -------------------------------------------------- | | ImagePullBackOff | Registry auth missing or image not found | Check imagePullSecrets, verify image exists | | CrashLoopBackOff | App crashes on startup | Check logs, verify config, add startup probe | | Pending pod | Insufficient resources or scheduling constraints | Check node capacity, taints, tolerations, affinity | | OOMKilled | Memory limit exceeded | Increase memory limit, fix memory leak | | Service unreachable | Wrong selector or port | Verify selector matches pod labels, check ports | | DNS resolution fails | CoreDNS issues or NetworkPolicy blocking | Check CoreDNS pods, verify NetworkPolicy egress | | PVC pending | StorageClass missing or no available volumes | Verify StorageClass, check provisioner | | RBAC permission denied | ServiceAccount lacks required permissions | Add Role/RoleBinding with necessary verbs | | Readiness probe failing | App not ready or probe misconfigured | Adjust initialDelaySeconds, check probe endpoint | | HPA not scaling | Metrics server missing or no resource requests | Install metrics-server, set resource requests |
Expert Practices: Idioms, Anti-Patterns & Gotchas
High-signal practices that separate working-on-my-laptop manifests from production-grade ones. Most of these fail silently — no API error, no event — so they bite only under load, during node maintenance, or in a hardened cluster.
Probes
Liveness checks only process health; readiness checks dependencies. A liveness probe that returns non-200 because a database, cache, or upstream is slow makes the kubelet restart the pod — which fixes nothing, drops in-flight requests, dumps warm cache, and adds load to the already-degraded dependency and the remaining pods. Under load this is a self-reinforcing restart cascade (e.g. timeoutSeconds: 1 + failureThreshold: 1 restarts a pod that merely took 2s). Liveness must hit an in-memory handler that only proves the request thread is responsive; dependency-awareness belongs in readiness, which drains the pod from Service endpoints without restarting it. Make liveness more tolerant than readiness (higher failureThreshold / longer periodSeconds) so a pod is drained before it is ever killed.
livenessProbe:
httpGet:
path: /healthz # in-memory only; no DB/cache/upstream calls
port: 8080
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready # may check DB connectivity, cache warmth, dependencies
port: 8080
periodSeconds: 10
failureThreshold: 2
Use a startupProbe, not a bloated initialDelaySeconds. Inflating liveness initialDelaySeconds to tolerate slow startup makes liveness equally slow to catch a real deadlock at runtime. While a startup probe is configured, Kubernetes does not run liveness or readiness probes until it succeeds. Size it as failureThreshold * periodSeconds to cover worst-case startup (e.g. 30 * 10s = 5 min), then give liveness tight values that only take effect after startup passes.
Prefer httpGet/tcpSocket/grpc over exec probes at high pod density. exec probes fork a new process every execution; at scale with short periodSeconds this adds measurable node CPU overhead and probe-latency spikes under pressure. tcpSocket is cheapest, httpGet moderate. For gRPC use the built-in grpc probe type (GA since 1.24) rather than shipping grpc_health_probe invoked via exec. Reserve exec for checks that genuinely need a command, with generous timeouts.
Resources & Scheduling
CPU limits cause CFS throttling even when the node has idle CPU. Limits are enforced by Linux CFS bandwidth control over a 100ms period: a container that exhausts its quota early is throttled to zero for the rest of that period regardless of free node CPU, injecting multi-millisecond latency spikes (severe for GC'd / latency-sensitive runtimes) that are invisible in average-utilization dashboards — watch container_cpu_cfs_throttled_periods_total, not kubectl top. Tradeoff: requests == limits gives Guaranteed QoS (evicted only after BestEffort and Burstable under node pressure) but limits bin-packing; omitting the CPU limit (requests only) maximizes burst headroom. For latency-sensitive services, prefer requests-only or generous limits, but always keep a memory limit (a clean OOM beats node-wide starvation).
# Latency-sensitive: request for scheduling, no CPU limit; keep a memory limit
resources:
requests: { cpu: 200m, memory: 256Mi }
limits: { memory: 256Mi }
QoS is computed per Pod — one container without requests == limits demotes the whole Pod. Guaranteed (evicted last) requires every container — app, init, sidecar — to set CPU and memory requests equal to limits. A single sidecar (monitoring agent, Envoy/Istio proxy, log shipper) with no resource spec silently drops the entire Pod to Burstable, which is evicted first. Memory is non-compressible: a container exceeding its memory limit is OOMKilled (exit 137) immediately, never throttled like CPU.
Use topologySpreadConstraints for HA spread; isolate rollout revisions with matchLabelKeys. Hard podAntiAffinity (requiredDuringScheduling, topologyKey: hostname) is binary and blocks scheduling once replicas exceed nodes — bad for HPA. The idiomatic pattern is hard zone spread via topologySpreadConstraints (maxSkew: 1, zone, DoNotSchedule) plus soft node spread via preferred podAntiAffinity. Without matchLabelKeys, a spread constraint counts all Deployment revisions together during a rollout, so new pods can cluster in one zone while the old+new mix still satisfies maxSkew. Add matchLabelKeys: [pod-template-hash] (beta, on by default since 1.27) so each revision spreads independently.
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-server
matchLabelKeys:
- pod-template-hash # each rollout revision spreads independently
Autoscaling
Memory-based HPA is an anti-pattern for load-responsive scaling. CPU is compressible — an overloaded pod's utilization rises and adding replicas sheds load, a valid signal. Memory is non-compressible and runtimes (JVM, Go, Node.js) rarely return heap to the OS, so utilization creeps up and stays high at rest: the HPA scales out but never scales back in, parking at maxReplicas. Two further traps: undersized memory requests make the percentage math run away; and if any container lacks a CPU request the HPA controller silently ignores that pod for CPU metrics (no event). Drive HPA from CPU or external/custom metrics (RPS, queue depth via KEDA); use VPA to right-size memory requests; never run HPA and VPA on the same metric. Reserve memory HPA for the rare case where memory truly tracks load (a cache filling with traffic), and target AverageValue (absolute bytes), not Utilization.
Networking
from/to list semantics: separate list items OR, selectors within one item AND. This is the most common NetworkPolicy security bug, and a one-dash indentation change silently flips the security posture with no API error. namespaceSelector + podSelector as sibling fields in the same list element are ANDed (pods matching the label and in the matching namespace — the zero-trust intent). As separate list elements (each with its own leading dash) they are ORed (any pod in the namespace, or any pod with the label cluster-wide — dangerously over-permissive). Both shapes are accepted silently.
ingress:
- from:
# AND: only role=client pods that are ALSO in team=frontend namespaces
- namespaceSelector:
matchLabels:
team: frontend
podSelector:
matchLabels:
role: client
NetworkPolicy is a silent no-op without an enforcing CNI. The API server stores policies regardless of whether the network plugin enforces them — the docs state plainly that a NetworkPolicy without a controller that implements it has no effect, with no error or event. Flannel alone does not enforce NetworkPolicy, so a Flannel-only cluster that looks locked down is wide open. Calico, Cilium, Antrea, and Weave Net do enforce it. Verify empirically after applying:
kubectl run test --image=busybox --restart=Never -it --rm -- wget -T2 -O- http://target-service
DNS egress must allow BOTH UDP/53 and TCP/53. A port entry allows exactly one protocol. DNS uses UDP/53 but falls back to TCP/53 for responses over the UDP size limit (many records, DNSSEC) and some resolver configs. UDP-only policies break DNS intermittently and unattributably — always list both.
Security
Restricted PSS requires seccompProfile: RuntimeDefault (or Localhost) — an unset profile is rejected. Under the Restricted Pod Security Standard seccompProfile.type is a restricted field whose only allowed values are RuntimeDefault or Localhost; Undefined/nil is not allowed (stricter than Baseline). A pod omitting it is rejected by the PodSecurity admission controller in any namespace labeled pod-security.kubernetes.io/enforce: restricted. Note: readOnlyRootFilesystem is good hardening but is not a PSS-checked control — don't conflate the two. (PodSecurity admission replaced PodSecurityPolicy, removed in 1.25.)
PodSecurity enforce is checked at pod creation, not on the Deployment. The admission controller evaluates the pod template when the ReplicaSet controller creates pods, not when you kubectl apply the Deployment. With only enforce: restricted, the Deployment applies cleanly but its pods are silently rejected — a broken rollout that is far harder to debug than a pre-apply warning. Add warn and audit labels and pin enforce-version so violations surface on the workload apply; roll Restricted out in warn mode first.
metadata:
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: v1.31
pod-security.kubernetes.io/warn: restricted # surfaces violations on Deployment apply
pod-security.kubernetes.io/audit: restricted
RBAC escalate, bind, impersonate are cluster-admin-equivalent; list/watch on secrets leak contents. These three verbs bypass the protection that stops a user granting rights they don't have: escalate lets them author a Role with more rights than they hold, bind lets them bind a role whose rights they lack, impersonate lets them act as a privileged identity. Treat any holder as having full cluster control. Separately, list/watch on secrets return the actual .data of every secret, not just metadata — grant application service accounts only get with an explicit resourceNames allowlist.
ClusterRoleBinding grants its role in EVERY namespace. A benign pod-reader ClusterRole bound via a ClusterRoleBinding lets the subject read pods (and any other granted resource, including secrets) in kube-system and everywhere else. To grant a ClusterRole's permissions within one namespace, reference it from a namespaced RoleBinding. Reserve ClusterRoleBindings for genuinely cluster-scoped needs (nodes, PVs, cluster-wide operators).
Disable automountServiceAccountToken on workloads that never call the API. Every pod gets an SA token mounted by default, even if the app never talks to the API server. Since 1.22 these are short-lived auto-rotating bound tokens, but they are still a usable scoped API credential handed to anyone who gets code execution in the container. Set automountServiceAccountToken: false on the ServiceAccount (or per-Pod, which takes precedence) and re-enable only on pods that genuinely need API access. High-value, low-effort CIS hardening.
Lifecycle & Availability Gotchas
A preStop sleep absorbs the endpoint-deregistration race — and eats the grace period. On deletion the kubelet sends SIGTERM (and runs preStop) in parallel with the endpoints controller removing the pod from EndpointSlices and kube-proxy asynchronously deleting iptables/IPVS rules on every node. A pod that exits immediately still receives — and RSTs — new connections via stale rules, causing intermittent 502/connection-reset on every rolling update (invisible on single-node clusters). Add a preStop sleep (commonly 5–15s) so the pod keeps serving while it deregisters. Timing trap: terminationGracePeriodSeconds starts at deletion and the preStop sleep is charged against it (SIGTERM is sent only after preStop returns), so set grace ≥ preStop sleep + app drain + margin or SIGKILL truncates the drain. Hooks are at-least-once — keep them idempotent.
terminationGracePeriodSeconds: 60 # 5s preStop + ~45s drain + margin
containers:
- name: app
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 5"] # serve through endpoint deregistration
StatefulSet PVCs are retained by default — opt into deletion explicitly. Deleting or scaling down a StatefulSet does not delete PVCs from volumeClaimTemplates (by design, for data safety), so they accumulate cost indefinitely. Kubernetes 1.27 graduated persistentVolumeClaimRetentionPolicy to beta with whenDeleted/whenScaled knobs, but both still default to Retain. Use whenScaled: Retain / whenDeleted: Delete for production. Separately, the default OrderedReady podManagementPolicy serializes rollouts: if pod N fails readiness, pods 0..N-1 are not updated — a stuck pod deadlocks the rollout until deleted deliberately.
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Delete # reclaim PVCs on teardown
whenScaled: Retain # keep data on scale-in
Set unhealthyPodEvictionPolicy: AlwaysAllow on PDBs to prevent node-drain deadlocks. The default IfHealthyBudget only evicts unhealthy (e.g. crash-looping) pods if the budget is already satisfied, so a node drain can block indefinitely waiting for a broken app to become healthy — a silent incident during maintenance and upgrades. AlwaysAllow evicts already-unhealthy pods regardless of budget while still protecting healthy ones; the docs recommend it. The GA field requires Kubernetes 1.31+ (alpha 1.26, beta 1.27).
PDB v1 empty selector matches ALL pods — a silent inversion from v1beta1. policy/v1beta1 PDB was removed in 1.25; use policy/v1. Critical behavior change: in v1beta1 an empty selector ({}) matched zero pods, but in v1 it matches every pod in the namespace. A migrated manifest still carrying selector: {} silently becomes a namespace-wide constraint blocking all voluntary disruptions. Always set an explicit selector.
Currency (deprecated patterns to avoid)
kubernetes.io/ingress.classannotation → usespec.ingressClassName(deprecated since 1.18).kubectl get componentstatuses(cs) → usekubectl get --raw '/livez?verbose'/'/readyz?verbose'(deprecated since 1.19; reports false "Unhealthy" for scheduler/controller-manager).policy/v1beta1PodDisruptionBudget andPodSecurityPolicy→ both removed in 1.25; usepolicy/v1PDB and PodSecurity admission labels.