Karpenter
Overview
Karpenter is a Kubernetes node autoscaler that provisions right-sized compute resources in response to changing application load. Unlike Cluster Autoscaler which scales predefined node groups, Karpenter provisions nodes based on aggregate pod resource requirements, enabling better bin-packing and cost optimization.
Key Differences from Cluster Autoscaler
- Direct provisioning: Talks directly to cloud provider APIs (no node groups required)
- Fast scaling: Provisions nodes in seconds vs minutes
- Flexible instance selection: Chooses from all available instance types automatically
- Consolidation: Actively replaces nodes with cheaper alternatives
- Spot instance optimization: First-class support (on-demand fallback is opt-in, not automatic — see Spot Instance Management)
When to Use Karpenter
- Running workloads with diverse resource requirements
- Need for fast scaling (sub-minute response)
- Cost optimization with spot instances and Graviton (ARM64)
- Consolidation to reduce cluster waste and over-provisioning
- Clusters with unpredictable or bursty workloads
- Right-sizing infrastructure to actual usage patterns
- Managing mixed capacity types (spot/on-demand) automatically
Instructions
1. Installation and Setup
- Install Karpenter controller in cluster
- Configure cloud provider credentials (IAM roles)
- Set up instance profiles and security groups
- Create NodePools for different workload types
- Define EC2NodeClass (AWS) or equivalent for your provider
2. Design NodePool Strategy
- Separate NodePools for different workload classes
- Define instance type families and sizes
- Configure spot/on-demand mix
- Set resource limits per NodePool
- Plan for multi-AZ distribution
3. Configure Disruption Management
- Set disruption budgets to control churn
- Configure consolidation policies
- Define expiration windows for node lifecycle
- Handle workload-specific disruption constraints
- Test disruption scenarios
4. Optimize for Cost and Performance
- Enable consolidation for cost savings
- Use spot instances with fallback strategies
- Set appropriate resource requests on pods (Karpenter depends on accurate requests)
- Monitor node utilization and waste
- Adjust instance type restrictions based on usage
- Leverage Graviton (ARM64) instances for 20% cost reduction
- Configure capacity-type weighting to prefer spot over on-demand
5. Cost Optimization Strategies
- Spot instances: Configure 70-90% spot mix for fault-tolerant workloads
- Graviton (ARM64): Use c7g, m7g, r7g families for lower costs
- Consolidation: Enable WhenEmptyOrUnderutilized policy to replace expensive nodes
- Instance diversity: Wide instance family selection improves spot availability
- Right-sizing: Let Karpenter bin-pack efficiently instead of over-provisioning
6. Spot Instance Management
- Use wide instance type selection (10+ families) for better spot availability
- Fallback to on-demand is NOT automatic: include both
spotandon-demandin one NodePool'scapacity-type(or run a lower-weight on-demand NodePool) — a spot-only NodePool leaves pods Pending when spot is exhausted - Implement Pod Disruption Budgets to control blast radius
- Set graceful termination handlers in applications (preStop hooks)
- Monitor spot interruption rates and adjust instance selection
- Use diverse availability zones to reduce correlated failures
7. Node Consolidation
- WhenEmptyOrUnderutilized: Replaces nodes with cheaper/smaller alternatives actively (v1 name; the old
WhenUnderutilizedis rejected) - WhenEmpty: Only consolidates completely empty nodes (conservative)
- Configure consolidateAfter delay to prevent churn (30s-600s typical)
- Use disruption budgets to limit consolidation rate (5-20% per window)
- Respect Pod Disruption Budgets during consolidation
- Set expiration windows to force periodic node refresh
Best Practices
- Start Conservative: Begin with restrictive instance types, expand based on observation
- Use Disruption Budgets: Prevent too many nodes from being disrupted simultaneously
- Set Pod Resource Requests: Karpenter relies on accurate requests for scheduling
- Enable Consolidation: Let Karpenter optimize node utilization automatically
- Separate Workload Classes: Use multiple NodePools for different requirements
- Monitor Provisioning: Track provisioning latency and failures
- Test Spot Interruptions: Ensure graceful handling of spot instance terminations
- Use Topology Spread: Combine with pod topology constraints for availability
Examples
Example 1: Basic NodePool with Multiple Instance Types
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
# Template for nodes created by this NodePool
template:
spec:
# Reference to EC2NodeClass (AWS-specific configuration).
# In v1, group AND kind are required alongside name (see Gotchas).
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
# Requirements that constrain instance selection
requirements:
# Use amd64 or arm64 architectures
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
# Prefer instance-category + generation over a fixed family list
# (broader spot pool, auto-adopts new generations — see Idioms).
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["2"]
# Allow a range of instance sizes
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["large", "xlarge", "2xlarge", "4xlarge"]
# Use spot, falling back to on-demand (both types required for fallback)
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
# Spread across availability zones
- key: topology.kubernetes.io/zone
operator: In
values: ["us-west-2a", "us-west-2b", "us-west-2c"]
# NOTE: kubelet config lives on EC2NodeClass.spec.kubelet in v1,
# NOT here on the NodePool (see Corrections). expireAfter is also
# a template field in v1 and is drift-able.
expireAfter: 720h
# Taints and labels
taints:
- key: workload-type
value: general
effect: NoSchedule
# Metadata applied to nodes
metadata:
labels:
workload-type: general
managed-by: karpenter
# Limits for this NodePool
limits:
cpu: 1000
memory: 1000Gi
# Disruption controls
disruption:
# Consolidation policy (v1: WhenEmpty or WhenEmptyOrUnderutilized)
consolidationPolicy: WhenEmptyOrUnderutilized
# Time window for when disruptions are allowed
consolidateAfter: 30s
# Budgets control the rate of (voluntary) disruptions
budgets:
- nodes: "10%"
duration: 5m
# Node weight for scheduling decisions (higher = preferred)
weight: 10
Example 2: EC2NodeClass for AWS-Specific Configuration
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
# AMI selection is REQUIRED in v1 (unless amiFamily is Custom).
# Pin alias family@version in production so AMI rollouts are controlled
# via drift, not triggered automatically on every AWS release. Use al2023
# (or bottlerocket): EKS stopped publishing AL2 AMIs on 2025-11-26 (k8s
# 1.32 was the last version with them).
amiSelectorTerms:
- alias: al2023@v20240807
# Alternative: select by id/tags instead of an alias (cannot combine
# an alias term with other term types):
# amiSelectorTerms:
# - id: ami-0123456789abcdef0
# - tags:
# karpenter.sh/discovery: my-cluster
# Kubelet configuration lives on EC2NodeClass in v1 (moved from NodePool).
# NodePools needing distinct kubelet config each need their own EC2NodeClass.
kubelet:
maxPods: 110
systemReserved:
cpu: 100m
memory: 100Mi
ephemeral-storage: 1Gi
evictionHard:
memory.available: 5%
nodefs.available: 10%
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
# IAM role for nodes (instance profile)
role: KarpenterNodeRole-my-cluster
# Subnet selection - use tags to identify subnets
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
kubernetes.io/role/internal-elb: "1"
# Security group selection
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
- name: my-cluster-node-security-group
# User data for node initialization.
# Do NOT call /etc/eks/bootstrap.sh — Karpenter already injects it (AL2),
# and on AL2023 Karpenter-owned fields (maxPods, labels, taints) override
# userData regardless. Use this only for extra OS-level tuning.
userData: |
#!/bin/bash
echo 'fs.inotify.max_user_watches=524288' >> /etc/sysctl.d/99-custom.conf
sysctl -p /etc/sysctl.d/99-custom.conf
# Block device mappings for EBS volumes
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
iops: 3000
throughput: 125
encrypted: true
deleteOnTermination: true
# Metadata options for IMDS. v1 default hopLimit is 1, which blocks
# non-hostNetwork pods from reaching IMDS — give such pods IRSA/Pod
# Identity instead of raising this to 2 (see Security).
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 1
httpTokens: required
# Detailed monitoring
detailedMonitoring: true
# Tags applied to EC2 instances
tags:
Name: karpenter-node
Environment: production
ManagedBy: karpenter
ClusterName: my-cluster
Example 3: Specialized NodePools for Different Workloads
---
# GPU workload NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu-workloads
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: gpu-nodes
requirements:
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["g5", "g6", "p4", "p5"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # GPU instances typically on-demand
- key: karpenter.k8s.aws/instance-gpu-count
operator: Gt
values: ["0"]
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
metadata:
labels:
workload-type: gpu
nvidia.com/gpu: "true"
limits:
cpu: 500
memory: 2000Gi
nvidia.com/gpu: 16
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 300s
---
# Batch/Spot-heavy NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: batch-workloads
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
requirements:
# Spot-only: NO automatic on-demand fallback — pods stay Pending
# if spot is exhausted. Add "on-demand" here for fallback.
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["c6a", "c6i", "c7i", "m6a", "m6i"] # Compute-optimized
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["2xlarge", "4xlarge", "8xlarge"]
taints:
- key: workload-type
value: batch
effect: NoSchedule
metadata:
labels:
workload-type: batch
spot-interruption-handler: enabled
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 60s
budgets:
- nodes: "20%" # Allow more aggressive disruption for batch
---
# Stateful workload NodePool (on-demand only)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: stateful-workloads
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: stateful-nodes
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # Only on-demand for stability
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["r6i", "r7i"] # Memory-optimized
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["xlarge", "2xlarge", "4xlarge"]
- key: topology.kubernetes.io/zone
operator: In
values: ["us-west-2a", "us-west-2b"]
# kubelet config (e.g. maxPods: 50 for lower density) belongs on the
# referenced EC2NodeClass "stateful-nodes" in v1, not here.
taints:
- key: workload-type
value: stateful
effect: NoSchedule
metadata:
labels:
workload-type: stateful
storage-optimized: "true"
limits:
cpu: 200
memory: 800Gi
disruption:
consolidationPolicy: WhenEmpty # Only consolidate when completely empty
consolidateAfter: 600s # Wait 10 minutes
budgets:
- nodes: "1" # Very conservative disruption
duration: 30m
Example 4: Disruption Budgets and Consolidation Policies
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: production-apps
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["c6i", "m6i", "r6i"]
# Expiration is a template field in v1 and is drift-able: changing it
# rolls existing nodes. Expiration is forceful and NOT budget-limited,
# so cap drain time with terminationGracePeriod.
expireAfter: 720h # 30 days
terminationGracePeriod: 1h
# Advanced disruption configuration
disruption:
# Consolidation policy options (v1):
# - WhenEmptyOrUnderutilized: replace under-used nodes with cheaper/smaller
# - WhenEmpty: only replace completely empty nodes
consolidationPolicy: WhenEmptyOrUnderutilized
# How soon after a node becomes eligible for consolidation
consolidateAfter: 30s
# Multiple budget windows. SCHEDULES ARE UTC-ONLY (no timezone support),
# and when windows overlap Karpenter takes the MINIMUM. NotReady nodes
# also consume budget — see Gotchas.
budgets:
# During business hours: conservative disruption (08:00 UTC)
- nodes: "5%"
duration: 8h
schedule: "0 8 * * MON-FRI"
# During off-hours: more aggressive consolidation
- nodes: "20%"
duration: 16h
schedule: "0 18 * * MON-FRI"
# Weekends: most aggressive
- nodes: "30%"
duration: 48h
schedule: "0 0 * * SAT"
# Default budget (always active)
- nodes: "10%"
Example 5: Pod Scheduling with Karpenter
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-application
spec:
replicas: 5
selector:
matchLabels:
app: my-application
template:
metadata:
labels:
app: my-application
spec:
# Tolerations to allow scheduling on Karpenter nodes
tolerations:
- key: workload-type
operator: Equal
value: general
effect: NoSchedule
# Node selector to target specific NodePool
nodeSelector:
workload-type: general
karpenter.sh/capacity-type: spot # Prefer spot
# Affinity rules for better placement
affinity:
# Spread across zones for availability
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: my-application
topologyKey: topology.kubernetes.io/zone
# Node affinity for instance type preferences
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
# Prefer ARM instances (cheaper)
- weight: 50
preference:
matchExpressions:
- key: kubernetes.io/arch
operator: In
values: ["arm64"]
# Prefer larger instances (better bin-packing)
- weight: 30
preference:
matchExpressions:
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["2xlarge", "4xlarge"]
# Topology spread constraints
topologySpreadConstraints:
# Spread across zones
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: my-application
# Spread across nodes
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: my-application
containers:
- name: app
image: my-app:latest
# CRITICAL: Accurate resource requests for Karpenter
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
# Graceful shutdown for spot interruptions
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- sleep 15 # Allow time for deregistration
# Termination grace period for spot interruptions
terminationGracePeriodSeconds: 30
Example 6: Spot Instance Handling and Fallback
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: spot-with-fallback
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
requirements:
# Both capacity types in ONE NodePool gives on-demand fallback when
# spot is exhausted (spot-only would leave pods Pending).
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
# Wide instance type selection for better spot availability
- key: karpenter.k8s.aws/instance-family
operator: In
values:
- "c5a"
- "c6a"
- "c6i"
- "c7i"
- "m5a"
- "m6a"
- "m6i"
- "m7i"
- "r5a"
- "r6a"
- "r6i"
- "r7i"
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["large", "xlarge", "2xlarge", "4xlarge"]
# Support both architectures for more spot options
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
# Metadata to track spot usage.
# NOTE: spot-to-spot consolidation is a CONTROLLER feature gate
# (settings.featureGates.spotToSpotConsolidation=true via Helm), NOT a
# NodePool annotation — there is no such annotation (see Gotchas).
metadata:
labels:
spot-enabled: "true"
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
# More aggressive for spot since they can be interrupted anyway
budgets:
- nodes: "25%"
# Weight influences Karpenter's NodePool selection
# Higher weight = more preferred
# Use lower weight so other NodePools are tried first
weight: 5
Example 7: Karpenter with Pod Disruption Budget
# Application Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
spec:
replicas: 6
selector:
matchLabels:
app: critical-service
template:
metadata:
labels:
app: critical-service
spec:
tolerations:
- key: workload-type
operator: Equal
value: general
effect: NoSchedule
containers:
- name: app
image: critical-service:latest
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
---
# Pod Disruption Budget to protect during consolidation
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
spec:
minAvailable: 4 # Always keep at least 4 replicas running
selector:
matchLabels:
app: critical-service
# Karpenter respects PDBs during consolidation
# It will not disrupt nodes if doing so would violate the PDB
Example 8: Multi-Architecture NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: multi-arch
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
requirements:
# Support both AMD64 and ARM64
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
# ARM instances (Graviton) - typically 20% cheaper
- key: karpenter.k8s.aws/instance-family
operator: In
values:
# ARM (Graviton2)
- "c6g"
- "m6g"
- "r6g"
# ARM (Graviton3)
- "c7g"
- "m7g"
- "r7g"
# AMD64 alternatives
- "c6i"
- "m6i"
- "r6i"
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
metadata:
labels:
multi-arch: "true"
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 60s
---
# EC2NodeClass with multi-architecture AMI support
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
# amiSelectorTerms is required in v1. The al2023 alias resolves the right
# AMI per architecture automatically; pin @version in production.
amiSelectorTerms:
- alias: al2023@v20240807
role: KarpenterNodeRole-my-cluster
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
Expert Practices: Idioms, Anti-Patterns & Gotchas
Currency (v1 API — Karpenter 1.0+)
-
Use the v1 APIs exclusively. Karpenter 1.0 graduated
NodePooltokarpenter.sh/v1andEC2NodeClasstokarpenter.k8s.aws/v1; 1.1 droppedv1beta1entirely (the conversion webhooks are gone). Av1beta1manifest is rejected on Karpenter >= 1.1 — this is a hard break, not a deprecation warning. The v1 APIs carry a compatibility guarantee across the 1.x line. -
nodeClassRefrequiresgroup+kind+name. v1 renamed the oldapiVersionkey togroup, and as of v1.1.0groupandkindare strictly required alongsidename. A ref with onlynameleaves the NodePool NotReady — there is no default fallback.nodeClassRef: group: karpenter.k8s.aws kind: EC2NodeClass name: default -
kubeletconfig moved from NodePool toEC2NodeClass.spec.kubelet(maxPods, podsPerCore, systemReserved, evictionHard, imageGC thresholds). Akubeletblock left on a NodePool is invalid. Because many NodePools share one EC2NodeClass, NodePools that need distinct kubelet config each need their own EC2NodeClass. Thecompatibility.karpenter.sh/v1beta1-kubelet-conversionmigration annotation was dropped in 1.1, so anything relying on it silently loses kubelet config after the upgrade. -
consolidationPolicy: WhenUnderutilizedwas renamed toWhenEmptyOrUnderutilized(old value rejected). AndexpireAftermoved fromspec.disruptiontospec.template.spec.expireAfterand is now drift-able: changing it triggers Drift and rolling replacement of running nodes (in v1beta1 it was a no-op on existing nodes). Pair it withspec.template.spec.terminationGracePeriod— v1 expiration is forceful and NOT rate-limited by disruption budgets.
Anti-Patterns
-
amiSelectorTermsis required in v1 (unlessamiFamily: Custom); omitting it leaves the EC2NodeClass and every referencing NodePool NotReady. Analiasterm cannot be combined with other term types and must match theamiFamily. Pinalias: family@versionin production —family@latestrolls every node whenever AWS publishes a new EKS-optimized AMI, so an untested AMI can break workloads with no operator action. Use al2023 or bottlerocket for new clusters: k8s 1.32 was the last version with EKS AL2 AMIs, and EKS stopped publishing them on 2025-11-26. (The AL2 base OS itself is supported until 2026-06-30 — it has not reached EOL.) -
Never run the Karpenter controller on a Karpenter-managed node. A spot interruption, consolidation, or expiry can terminate the controller before it provisions its replacement — a circular dependency where no controller is up to launch a node and no node exists to host the controller. Run it on EKS Fargate (a Fargate profile for the
karpenternamespace) or a static managed node group Karpenter does not manage, pinned vianodeSelector/tolerations. -
Make NodePools mutually exclusive or weighted. AWS: "if multiple NodePools are matched, Karpenter will randomly choose which to use, causing unexpected results." Enforce routing with taints on the NodePool + matching tolerations (hard isolation, e.g. GPU pools) or distinct
weightvalues (preference ordering with fallback). -
Do not call
/etc/eks/bootstrap.shin customuserData(AL2) — Karpenter already injects it, so a second call reconfigures an already-running kubelet, init fails, and the node never joins despite appearing to start. On AL2023, userData is merged as NodeConfig and Karpenter-owned fields (maxPods, labels, taints) override userData — set those via native spec fields, not userData.
Gotchas
-
httpPutResponseHopLimitdefaults to 1 in v1 (was 2). This deliberately prevents non-hostNetworkpods from reaching IMDS (169.254.169.254) — the response TTL expires crossing the container netns. Any pod calling IMDS directly (SDK credential chaining, region/AZ detection) then silently fails. Fix with IRSA or EKS Pod Identity, not by raising the hop limit to 2 (that re-exposes IMDS to all containers — a credential-theft surface). Raise to 2 only as a deliberate, scoped exception. -
Set memory
requests=limitswhen consolidation is enabled. Karpenter bin-packs against requests; limits are ignored. AfterWhenEmptyOrUnderutilizedpacks pods tightly, pods whose memory limit exceeds their request can all burst at once and OOM-kill neighbors. Incompressible resources (memory, ephemeral-storage, GPU/hugepages) want requests ≈ working set / equal to limits; CPU is compressible (throttled, not killed), sorequests != limitsis fine there. -
karpenter.sh/do-not-disruptonly blocks voluntary disruption (consolidation, voluntary drift). It does NOT block Expiration, Interruption, Node Repair, or manual deletion. Since v1 made expiration forceful, a long-running pod relying solely on this annotation is still terminated when the node's TTL fires — useterminationGracePeriod+ SIGTERM handling for lifetime guarantees. The value must be empty/"true"or a valid Go duration; an invalid value (e.g."30 minutes") is silently ignored with only a Kubernetes event. -
Disruption budget math subtracts deleting AND NotReady nodes:
allowed = roundup(total * pct) - deleting - notready. A cluster under resource pressure can resolve to 0 allowed disruptions and block all consolidation with nothing intentional in flight. With multiple active budget windows Karpenter takes the minimum. Schedules are UTC-only (no timezone) —0 8 * * MON-FRIfires at 08:00 UTC. Forceful methods (expiration, interruption) are never budget-limited. -
Spot-to-spot consolidation needs the feature gate AND >= 15 instance types. Enable via Helm
settings.featureGates.spotToSpotConsolidation=true(controller-level — there is nokarpenter.sh/spot-to-spot-consolidationNodePool annotation; it is fabricated and does nothing). Even enabled, single-node spot-to-spot consolidation requires >= 15 cheaper qualifying instance types or Karpenter logsrequires 15 cheaper instance type options ... got Nand skips. Over-constraining instance families silently disables the optimization. -
spec.limitsis a soft, eventually-consistent cap — during a burst, parallel provisioning decisions can each see room and all launch, transiently overshooting it. Limits are per NodePool only (no cluster-wide limit). When hit, Karpenter writesresource usage of X exceeds limit of Yto controller logs only (no Kubernetes event) — detect overrun with a CloudWatch Logs metric filter + a billing alarm. Treat limits + billing alarms as the cost guardrail, not a hard spend cap. -
Karpenter treats preferred affinity as required on the first scheduling pass, relaxing preferences one at a time only if requirements can't be met (unlike kube-scheduler, which treats them as soft against existing nodes). A pod with
preferredDuringSchedulingpod-anti-affinity can therefore make Karpenter provision a NEW node instead of using an underutilized one — costly for overprovisioning/headroom placeholders. (This does NOT apply to topology spread.) If spreading is required for correctness, userequiredDuringSchedulingaffinity ortopologySpreadConstraintswithDoNotSchedule.
Idioms
-
Prefer
instance-category+instance-generationover fixedinstance-familylists. The official default NodePool selectsinstance-category In [c, m, r]andinstance-generation Gt 2. This keeps the spot pool broad (Price-Capacity-Optimized draws from the deepest pools → lower interruption risk) and auto-adopts new generations without editing the manifest. A short family list is rigid and narrows the spot pool.requirements: - key: karpenter.k8s.aws/instance-category operator: In values: ["c", "m", "r"] - key: karpenter.k8s.aws/instance-generation operator: Gt values: ["2"] -
Enable native interruption handling via SQS; do not also run Node Termination Handler. Point the controller at an SQS queue fed by EventBridge rules (
--interruption-queue/ Helmsettings.interruptionQueue). It proactively taints/drains/replaces nodes on spot notices, scheduled maintenance, and stop/terminate events, launching a replacement in parallel with the drain on the 2-minute spot notice. Running aws-node-termination-handler alongside it drains the same node twice (conflicting taints, excessive churn) — use one or the other. -
Scope disruption budgets by
reasons(Drifted,Underutilized,Empty; omitted = all voluntary reasons). Rate-limit causes independently — e.g. freeze drift-driven AMI rollouts during business hours while still allowing empty-node cleanup:disruption: consolidationPolicy: WhenEmptyOrUnderutilized consolidateAfter: 30s budgets: - nodes: "0" schedule: "0 9 * * mon-fri" # UTC duration: 8h - nodes: "20%" reasons: ["Empty"] # always allow idle-node removal
Private / Air-Gapped Clusters
- A private cluster needs a regional STS VPC endpoint (Karpenter uses IRSA; missing →
WebIdentityErr: failed to retrieve credentials) and an SSM VPC endpoint (queries SSM for EKS-optimized AMI IDs and to hydrate the launch-template cache; missing →Unable to hydrate the AWS launch template cache). There is no VPC endpoint for the Price List API — Karpenter ships on-demand pricing in its binary and only refreshes it on upgrade (logsretreiving on-demand pricing data ... i/o timeout), so plan upgrade cadence to refresh pricing in air-gapped environments. Only two endpoints are required; pricing degrades gracefully to stale data.
Monitoring and Troubleshooting
Key Metrics to Monitor
# Scheduling / provisioning metrics (v1 — "provisioner" metrics were removed)
karpenter_nodes_created_total
karpenter_nodes_terminated_total
karpenter_scheduler_scheduling_duration_seconds
# Disruption metrics
karpenter_nodepools_allowed_disruptions
karpenter_voluntary_disruption_eligible_nodes
karpenter_voluntary_disruption_decisions_total
# Cost metrics
karpenter_cloudprovider_instance_type_offering_price_estimate
# Pod metrics
karpenter_pods_state (pending, running, etc.)
# Cross-check names against the live /metrics endpoint and
# https://karpenter.sh/docs/reference/metrics/ — they change between releases.
Common Issues and Solutions
Issue: Pods stuck in Pending
- Check NodePool requirements match pod node selectors/tolerations
- Verify cloud provider limits not exceeded
- Check instance type availability in selected zones
- Ensure subnet capacity available
Issue: Excessive node churn
- Adjust consolidation delay (consolidateAfter)
- Review disruption budgets
- Check if pod resource requests are accurate
- Consider using WhenEmpty instead of WhenEmptyOrUnderutilized
Issue: High costs despite using Karpenter
- Enable consolidation if not already active
- Verify spot instances are being used
- Check if pods have unnecessarily large resource requests
- Review instance type selection (allow more variety)
Issue: Spot interruptions causing service disruption
- Implement Pod Disruption Budgets
- Use diverse instance types for better spot availability
- Configure appropriate replica counts
- Implement graceful shutdown in applications
Integration with Terraform
# Install Karpenter via Terraform
resource "helm_release" "karpenter" {
namespace = "karpenter"
create_namespace = true
name = "karpenter"
repository = "oci://public.ecr.aws/karpenter"
chart = "karpenter"
version = "1.1.1" # pin a current 1.x release (v1 APIs)
values = [
<<-EOT
settings:
clusterName: ${var.cluster_name}
clusterEndpoint: ${var.cluster_endpoint}
# Native interruption handling — feed this SQS queue from EventBridge.
# Do NOT also run aws-node-termination-handler (double-drain churn).
interruptionQueue: ${var.interruption_queue_name}
# Spot-to-spot consolidation is controller-level (not per-NodePool):
featureGates:
spotToSpotConsolidation: true
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: ${var.karpenter_irsa_arn}
controller:
resources:
requests:
cpu: 1
memory: 1Gi
limits:
cpu: 2
memory: 2Gi
EOT
]
depends_on = [
aws_iam_role_policy_attachment.karpenter_controller
]
}
# Deploy default NodePool
resource "kubectl_manifest" "karpenter_nodepool_default" {
yaml_body = <<-YAML
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["2"]
expireAfter: 720h
limits:
cpu: 1000
memory: 1000Gi
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
YAML
depends_on = [helm_release.karpenter]
}
Migration from Cluster Autoscaler
-
Plan the migration
- Identify current node groups and their characteristics
- Map workloads to new NodePool configurations
- Plan for coexistence period
-
Deploy Karpenter alongside Cluster Autoscaler
- Install Karpenter in the cluster
- Create NodePools with distinct labels
- Test with non-critical workloads first
-
Migrate workloads incrementally
- Update pod specs with Karpenter tolerations/node selectors
- Monitor provisioning and consolidation behavior
- Validate cost and performance metrics
-
Remove Cluster Autoscaler
- Once all workloads migrated, scale down CA node groups
- Remove Cluster Autoscaler deployment
- Clean up CA-specific resources