Go Performance Optimization
Overview
This skill provides comprehensive guidance for profiling, benchmarking, and optimizing Go applications. Use this skill when working on performance-critical code, investigating bottlenecks, or optimizing production systems.
When to Use This Skill:
- Profiling application performance
- Benchmarking code changes
- Investigating memory leaks or high allocations
- Optimizing hot paths
- Tuning garbage collection
- Reducing latency in production
Core Tools:
- pprof: CPU, memory, and goroutine profiling
- go test -bench: Benchmarking framework
- go build -gcflags: Escape analysis
- GOGC and GOMEMLIMIT: GC tuning
1. Profiling with pprof
1.1 CPU Profiling
Enable CPU Profiling in Code:
import (
"log"
"os"
"runtime/pprof"
)
func main() {
f, err := os.Create("cpu.prof")
if err != nil {
log.Fatal("could not create CPU profile: ", err)
}
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil {
log.Fatal("could not start CPU profile: ", err)
}
defer pprof.StopCPUProfile()
// Your application code here
runApplication()
}
CLI Profiling:
# Profile a test
go test -cpuprofile=cpu.prof -bench=.
# Profile a binary
go test -c
./myapp.test -test.cpuprofile=cpu.prof -test.bench=.
Analysis Commands:
# Interactive web UI (recommended)
go tool pprof -http=:8080 cpu.prof
# Text output - top functions by CPU time
go tool pprof -top cpu.prof
# Top 20 with cumulative time
go tool pprof -top -cum cpu.prof | head -20
# Call graph visualization
go tool pprof -svg cpu.prof > cpu.svg
# Focus on specific function
go tool pprof -focus=processData cpu.prof
# Drop samples whose stacks pass through functions matching the regexp "runtime" (not the whole standard library)
go tool pprof -ignore=runtime cpu.prof
Interpreting CPU Profiles:
- flat: Time spent in the function itself (excludes callees)
- flat%: flat as a percentage of the total sampled time
- sum%: Running total of flat% for this row and all rows above it
- cum: Time spent in the function and its callees
- cum%: cum as a percentage of the total sampled time
Example Output:
Showing nodes accounting for 2.50s, 83.33% of 3.00s total
flat flat% sum% cum cum%
0.80s 26.67% 26.67% 1.20s 40.00% processData
0.60s 20.00% 46.67% 0.90s 30.00% parseJSON
0.50s 16.67% 63.33% 0.50s 16.67% validateInput
Focus optimization on functions with high flat (own time) or cum (total time).
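The percentage columns can be reproduced by hand. The sketch below is not part of pprof; it recomputes flat%, sum%, and cum% from the sample rows above, assuming the 3.00s total shown in the header line. The `row` type and helper names are made up for illustration.

```go
package main

import "fmt"

// row holds one function's sampled times in seconds (values from the example output).
type row struct {
	name string
	flat float64 // time in the function itself
	cum  float64 // time in the function plus its callees
}

// percent returns part as a percentage of total.
func percent(part, total float64) float64 { return part / total * 100 }

// sumPercents returns the sum% column: the running total of flat time
// up to and including each row, as a share of the total sampled time.
func sumPercents(rows []row, total float64) []float64 {
	out := make([]float64, 0, len(rows))
	running := 0.0
	for _, r := range rows {
		running += r.flat
		out = append(out, percent(running, total))
	}
	return out
}

func main() {
	const total = 3.00 // seconds, from the "of 3.00s total" header
	rows := []row{
		{"processData", 0.80, 1.20},
		{"parseJSON", 0.60, 0.90},
		{"validateInput", 0.50, 0.50},
	}
	sums := sumPercents(rows, total)
	for i, r := range rows {
		fmt.Printf("%-14s flat%%=%.2f sum%%=%.2f cum%%=%.2f\n",
			r.name, percent(r.flat, total), sums[i], percent(r.cum, total))
	}
}
```

A function whose cum is much larger than its flat spends most of its time in callees, so the callees are usually the better place to look first.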
1.2 Memory Profiling
Heap Profiling:
import (
"log"
"os"
"runtime"
"runtime/pprof"
)
func captureHeapProfile() {
f, err := os.Create("mem.prof")
if err != nil {
log.Fatal("could not create memory profile: ", err)
}
defer f.Close()
// Force GC before capturing heap
runtime.GC()
if err := pprof.WriteHeapProfile(f); err != nil {
log.Fatal("could not write memory profile: ", err)
}
}
Memory Profiling via CLI:
# Profile memory allocations during test
go test -memprofile=mem.prof -bench=.
# Run each benchmark for longer so the profile collects more samples
go test -memprofile=mem.prof -bench=. -benchtime=10s
Analysis Commands:
# Web UI showing allocation sites
go tool pprof -http=:8080 mem.prof
# Top allocators
go tool pprof -top mem.prof
# Bytes currently live on the heap (inuse_space)
go tool pprof -sample_index=inuse_space -top mem.prof
# Number of objects currently live (inuse_objects)
go tool pprof -sample_index=inuse_objects -top mem.prof
# Total bytes allocated since start, including memory already freed (alloc_space)
go tool pprof -sample_index=alloc_space -top mem.prof
# Compare two profiles (before/after)
go tool pprof -base=before.prof after.prof
Memory Profile Types:
- inuse_space: Memory currently in use (default)
- inuse_objects: Objects currently in use
- alloc_space: Total allocations since start
- alloc_objects: Total object allocations
1.3 Goroutine Profiling
Detect Goroutine Leaks:
import (
"log"
"os"
"runtime/pprof"
)
func captureGoroutineProfile() {
f, err := os.Create("goroutine.prof")
if err != nil {
log.Fatal("could not create goroutine profile: ", err)
}
defer f.Close()
if err := pprof.Lookup("goroutine").WriteTo(f, 0); err != nil {
log.Fatal("could not write goroutine profile: ", err)
}
}
Analysis:
go tool pprof -http=:8080 goroutine.prof
go tool pprof -top goroutine.prof
Goroutine Leak Indicators:
- Steadily increasing goroutine count
- Many goroutines blocked on channel recv/send
- Goroutines without termination mechanism
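A rising goroutine count can be observed directly with `runtime.NumGoroutine`. The following is a minimal sketch rather than a leak detector: the hypothetical `startBlocked` helper launches goroutines that wait on a channel nobody writes to, and reports how much the count grew. Closing the channel stands in for the termination mechanism a leaking goroutine lacks.

```go
package main

import (
	"fmt"
	"runtime"
)

// startBlocked launches n goroutines that block on a receive from release.
// It returns the channel (closing it lets them exit) and the growth in the
// goroutine count measured right after they were started.
func startBlocked(n int) (release chan struct{}, added int) {
	before := runtime.NumGoroutine()
	release = make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			<-release // blocks until release is closed
		}()
	}
	return release, runtime.NumGoroutine() - before
}

func main() {
	release, added := startBlocked(3)
	fmt.Println("goroutines added:", added)

	// Without this close the three goroutines would stay blocked forever
	// and would show up in every later goroutine profile.
	close(release)
}
```

In a long-running service, logging `runtime.NumGoroutine()` periodically (or exporting it as a metric) makes the "steadily increasing" pattern visible before a profile is even taken.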
1.4 HTTP Profiling Endpoint (Production-Safe)
Enable pprof HTTP Server:
import (
"log"
"net/http"
_ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
)
func main() {
// Start pprof server on separate port (localhost only)
go func() {
log.Println("pprof server listening on localhost:6060")
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
// Your application here
runServer()
}
Access Profiles via HTTP:
# CPU profile (30 seconds)
curl "http://localhost:6060/debug/pprof/profile?seconds=30" > cpu.prof
# Heap profile
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Goroutine profile
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof
# Analyze immediately
go tool pprof http://localhost:6060/debug/pprof/profile
# Web UI
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile
Available Endpoints:
- /debug/pprof/: Index of all profiles
- /debug/pprof/profile: CPU profile
- /debug/pprof/heap: Heap profile
- /debug/pprof/goroutine: Goroutine stack traces
- /debug/pprof/threadcreate: Thread creation profile
- /debug/pprof/block: Blocking profile
- /debug/pprof/mutex: Mutex contention profile
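The block and mutex endpoints return empty profiles unless sampling has been switched on in the runtime. A minimal sketch of enabling both follows; the helper names and the rate values are illustrative assumptions, not recommendations.

```go
package main

import (
	"fmt"
	"runtime"
)

// enableContentionProfiles turns on block and mutex sampling.
// blockRateNs: aim to sample one blocking event per blockRateNs nanoseconds spent blocked.
// mutexFraction: report on average 1/mutexFraction of mutex contention events.
// It returns the previous mutex fraction so a caller can restore it later.
func enableContentionProfiles(blockRateNs, mutexFraction int) (prevMutexFraction int) {
	runtime.SetBlockProfileRate(blockRateNs)
	return runtime.SetMutexProfileFraction(mutexFraction)
}

// currentMutexFraction reads the setting without changing it
// (a negative argument only queries the current value).
func currentMutexFraction() int {
	return runtime.SetMutexProfileFraction(-1)
}

func main() {
	prev := enableContentionProfiles(10_000, 100) // illustrative values; tune per workload
	fmt.Println("previous mutex fraction:", prev)
	fmt.Println("current mutex fraction:", currentMutexFraction())
}
```

Both settings add some overhead, so a common approach is to enable them only while investigating contention.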
Production Security:
// Only expose on localhost
http.ListenAndServe("localhost:6060", nil)
// Or use SSH port forwarding
// ssh -L 6060:localhost:6060 user@production-host
// Then access http://localhost:6060/debug/pprof/
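Another way to limit exposure is to serve the profiling handlers from a dedicated mux, so the public listener never routes to them. This is a sketch under assumptions: `newPprofMux` and `probe` are hypothetical helpers, and `probe` uses `httptest` only to exercise the mux in memory. Note that importing `net/http/pprof` still registers its handlers on `http.DefaultServeMux`, so the public server must use its own mux as well.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newPprofMux registers the pprof handlers on a private mux that can be
// served from a localhost-only listener.
func newPprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, block, mutex
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// probe sends an in-memory GET request to h and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newPprofMux()
	fmt.Println("index status:", probe(mux, "/debug/pprof/"))
	fmt.Println("unrelated path status:", probe(mux, "/api/users"))
	// In a real service: go http.ListenAndServe("localhost:6060", mux)
}
```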
2. Benchmarking
2.1 Basic Benchmarks
Simple Benchmark:
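var sink string // package-level sink keeps the result live
func BenchmarkStringConcat(b *testing.B) {
// Use variables: "hello" + " " + "world" written with constants is folded at compile time
hello, world := "hello", "world"
for i := 0; i < b.N; i++ {
sink = hello + " " + world // Assigning to sink prevents dead-code elimination
}
}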
Benchmark with Setup:
func BenchmarkProcessData(b *testing.B) {
data := generateTestData(1000)
b.ResetTimer() // Exclude setup time
for i := 0; i < b.N; i++ {
processData(data)
}
}
Running Benchmarks:
# Run all benchmarks
go test -bench=.
# Run specific benchmark
go test -bench=BenchmarkStringConcat
# Benchmark with memory statistics
go test -bench=. -benchmem
# Repeat each benchmark 5 times to expose run-to-run variance
go test -bench=. -count=5
# Longer benchmark time for accurate results
go test -bench=. -benchtime=10s
# CPU profile during benchmark
go test -bench=. -cpuprofile=cpu.prof
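Benchmarks can also be driven from ordinary code with `testing.Benchmark`, which is handy for a quick experiment outside `go test`. The sketch below is illustrative only: `joinWords`, `benchmarkJoin`, and `runJoinBenchmark` are made-up names, and the package-level `sink` keeps the compiler from discarding the measured work.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink receives every result so the call under test cannot be optimized away.
var sink string

// joinWords is the function being measured.
func joinWords(words []string) string { return strings.Join(words, " ") }

// benchmarkJoin follows the usual benchmark shape: setup, reset timer, loop b.N times.
func benchmarkJoin(b *testing.B) {
	words := []string{"hello", "world"}
	b.ReportAllocs()
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = joinWords(words)
	}
}

// runJoinBenchmark runs the benchmark without the go test command.
func runJoinBenchmark() testing.BenchmarkResult {
	testing.Init() // registers testing flags; has no effect if already called
	return testing.Benchmark(benchmarkJoin)
}

func main() {
	res := runJoinBenchmark()
	fmt.Println(res.String(), res.MemString())
}
```

For comparisons that matter, `go test -bench` with `-count` and benchstat remains the more reliable route, since a single run says nothing about variance.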
2.2 Sub-Benchmarks
Compare Multiple Implementations:
func BenchmarkStringBuilding(b *testing.B) {
items := []string{"hello", "world", "foo", "bar"}
b.Run("Concat", func(b *testing.B) {
for i := 0; i < b.N; i++ {
result := ""
for _, item := range items {
result += item
}
_ = result
}
})
b.Run("StringBuilder", func(b *testing.B) {
for i := 0; i < b.N; i++ {
var sb strings.Builder
for _, item := range items {
sb.WriteString(item)
}
_ = sb.String()
}
})
b.Run("Join", func(b *testing.B) {
for i := 0; i < b.N; i++ {
result := strings.Join(items, "")
_ = result
}
})
}
Output:
BenchmarkStringBuilding/Concat-8 500000 3245 ns/op 96 B/op 5 allocs/op
BenchmarkStringBuilding/StringBuilder-8 2000000 825 ns/op 64 B/op 1 allocs/op
BenchmarkStringBuilding/Join-8 2000000 780 ns/op 48 B/op 1 allocs/op
2.3 Memory Reporting
Track Allocations:
func BenchmarkWithAllocs(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
data := make([]int, 1000)
_ = data
}
}
Output Interpretation:
BenchmarkWithAllocs-8 200000 8234 ns/op 8192 B/op 1 allocs/op
------ ---- ---- ----
iters ns/op bytes/op allocs/op
- ns/op: Nanoseconds per operation
- B/op: Bytes allocated per operation
- allocs/op: Number of allocations per operation
Zero Allocation Goal:
// One allocation: ToUpper builds a new string, TrimSpace only re-slices it
func process(data string) string {
upper := strings.ToUpper(data) // 1 alloc, sized for the untrimmed input
return strings.TrimSpace(upper) // No alloc: returns a substring
}
// Better: trim first so the single allocation covers only the bytes that are kept
func processTrimFirst(data string) string {
trimmed := strings.TrimSpace(data) // No alloc: returns a substring
return strings.ToUpper(trimmed) // 1 alloc, or none if nothing needs upper-casing
}
2.4 Benchmark Analysis with benchstat
Compare Before/After:
# Baseline
go test -bench=. -count=10 > old.txt
# After optimization
go test -bench=. -count=10 > new.txt
# Statistical comparison
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt
Example Output:
name old time/op new time/op delta
StringConcat-8 3.24µs ± 2% 0.82µs ± 1% -74.69% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
StringConcat-8 96.0B ± 0% 64.0B ± 0% -33.33% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
StringConcat-8 5.00 ± 0% 1.00 ± 0% -80.00% (p=0.000 n=10+10)
Interpretation:
- ±2%: Variance across runs
- (p=0.000): Statistical significance (p < 0.05 = significant)
- n=10+10: Number of samples used
3. Memory Optimization
3.1 Pre-allocate Slices
Problem: Repeated Reallocation:
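// Bad: the backing array is reallocated and copied repeatedly as the slice grows
// (the exact count for 10,000 items depends on the Go version's growth policy)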
func inefficient() []int {
var data []int
for i := 0; i < 10000; i++ {
data = append(data, i)
}
return data
}
Solution: Pre-allocate Capacity:
// Good: 1 allocation
func efficient() []int {
data := make([]int, 0, 10000)
for i := 0; i < 10000; i++ {
data = append(data, i)
}
return data
}
Transformation Pattern:
func transformItems(input []string) []Result {
output := make([]Result, 0, len(input))
for _, item := range input {
output = append(output, transform(item))
}
return output
}
Estimated Capacity:
func filterItems(input []string, minLen int) []string {
// Estimate ~50% will pass
output := make([]string, 0, len(input)/2)
for _, item := range input {
if len(item) >= minLen {
output = append(output, item)
}
}
return output
}
Benchmark Impact: up to roughly 5x faster for 10,000 items; the gain varies with Go version and hardware, so confirm it with -benchmem
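The allocation difference can be checked without a full benchmark by using `testing.AllocsPerRun`. This is a small sketch with made-up helper names (`grow`, `prealloc`, `allocsPer`); the package-level sink ensures the slices escape so the allocations are actually counted.

```go
package main

import (
	"fmt"
	"testing"
)

// sinkInts keeps each result reachable so the compiler cannot drop the work.
var sinkInts []int

// grow appends to a nil slice, forcing the runtime to reallocate as it fills.
func grow(n int) []int {
	var data []int
	for i := 0; i < n; i++ {
		data = append(data, i)
	}
	return data
}

// prealloc reserves the full capacity once, so append never reallocates.
func prealloc(n int) []int {
	data := make([]int, 0, n)
	for i := 0; i < n; i++ {
		data = append(data, i)
	}
	return data
}

// allocsPer reports the average number of heap allocations per call of f(n).
func allocsPer(f func(int) []int, n int) float64 {
	return testing.AllocsPerRun(100, func() { sinkInts = f(n) })
}

func main() {
	fmt.Println("grow allocs/op:    ", allocsPer(grow, 10000))
	fmt.Println("prealloc allocs/op:", allocsPer(prealloc, 10000))
}
```

The pre-allocated version should report a single allocation, while the count for the growing version depends on the runtime's growth policy.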
3.2 strings.Builder for Concatenation
Problem: O(N²) String Concatenation:
// Bad: Creates new string on every iteration
func badConcat(items []string) string {
result := ""
for _, item := range items {
result += item // New allocation each time
}
return result
}
Solution: strings.Builder (O(N)):
// Good: A single allocation when Grow reserves the total length up front
func goodConcat(items []string) string {
var sb strings.Builder
// Pre-allocate if size known
totalLen := 0
for _, item := range items {
totalLen += len(item)
}
sb.Grow(totalLen)
for _, item := range items {
sb.WriteString(item)
}
return sb.String()
}
Benchmark: on the order of 50x faster for 100 concatenations; the exact ratio depends on string lengths, Go version, and hardware
Builder Methods:
var sb strings.Builder
sb.WriteString("hello") // Write string
sb.WriteByte('!') // Write single byte
sb.WriteRune('✓') // Write rune (Unicode)
sb.Grow(100) // Pre-allocate capacity
result := sb.String() // Get final string
sb.Reset() // Empty the builder (drops its buffer, so capacity is not retained)
3.3 Escape Analysis
View Escape Decisions:
go build -gcflags='-m -m' main.go 2>&1 | grep "escapes to heap"
Stack vs Heap:
// Stack allocated (fast)
func sumArray() int {
data := [100]int{} // Stack
sum := 0
for _, v := range data {
sum += v
}
return sum
}
// Heap allocated (slower, escapes)
func createData() *Data {
data := &Data{} // Escapes: pointer returned
return data
}
Common Escape Scenarios:
// 1. Returning pointer to local variable
func escape1() *int {
x := 42
return &x // Escapes
}
// 2. Interface conversion
func escape2() interface{} {
x := 42
return x // Escapes (interface)
}
// 3. Storing in a package-level variable (globalVar is declared elsewhere)
func escape3(data interface{}) {
globalVar = data // Escapes
}
// 4. Size too large for stack
func escape4() {
data := make([]byte, 1<<20) // 1MB, escapes
_ = data
}
// 5. Slice append beyond capacity
func escape5() {
data := make([]int, 0, 10)
for i := 0; i < 100; i++ {
data = append(data, i) // Growing past capacity 10 allocates a new backing array on the heap
}
}
Reducing Escapes:
// Before: Escapes to heap
for _, item := range items {
result := &Result{Value: item}
process(result)
}
// After: Stack allocated (if process doesn't store it)
var result Result
for _, item := range items {
result.Value = item
process(&result)
}
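Escape decisions show up as heap allocations at run time, which `testing.AllocsPerRun` can count. The sketch below uses hypothetical names (`point`, `newPointPtr`, `newPointVal`) and marks both constructors `//go:noinline` so inlining does not hide the difference between returning a pointer and returning a value.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// Package-level sinks keep results live so the calls are not optimized away.
var (
	sinkPtr *point
	sinkVal point
)

// newPointPtr returns a pointer to a local value, so the value escapes to the heap.
//
//go:noinline
func newPointPtr(x, y int) *point { return &point{x, y} }

// newPointVal returns the struct by value, which needs no heap allocation.
//
//go:noinline
func newPointVal(x, y int) point { return point{x, y} }

// allocsPtr reports average heap allocations per call of newPointPtr.
func allocsPtr() float64 {
	return testing.AllocsPerRun(100, func() { sinkPtr = newPointPtr(1, 2) })
}

// allocsVal reports average heap allocations per call of newPointVal.
func allocsVal() float64 {
	return testing.AllocsPerRun(100, func() { sinkVal = newPointVal(1, 2) })
}

func main() {
	fmt.Println("pointer return allocs/op:", allocsPtr())
	fmt.Println("value return allocs/op:  ", allocsVal())
}
```

Running `go build -gcflags='-m'` on this file should report the `&point{...}` literal as escaping, matching the measured allocation.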
3.4 Reducing Allocations in Hot Paths
Reuse Buffers:
// Package-level buffer pool
var bufferPool = sync.Pool{
New: func() interface{} {
return new(bytes.Buffer)
},
}
func processData(data []byte) string {
buf := bufferPool.Get().(*bytes.Buffer)
buf.Reset() // Clear previous content
defer bufferPool.Put(buf)
// Use buffer
buf.Write(data)
return buf.String()
}
Pre-allocate Maps:
// Bad: Multiple rehashes
m := make(map[string]Item)
for _, item := range items {
m[item.ID] = item
}
// Good: Size hint reserves space up front and avoids incremental growth
m := make(map[string]Item, len(items))
for _, item := range items {
m[item.ID] = item
}
4. GC Tuning
4.1 GOGC Environment Variable
Default Behavior:
# Default: GC when heap grows 100%
GOGC=100 ./myapp
Tuning Options:
# Less frequent GC (uses more memory, higher throughput)
GOGC=200 ./myapp
# More frequent GC (uses less memory, spends more CPU time in GC)
GOGC=50 ./myapp
# Disable GC (debugging only)
GOGC=off ./myapp
How GOGC Works:
- GOGC=100: GC triggers when heap doubles
- GOGC=200: GC triggers when heap triples
- GOGC=50: GC triggers when heap grows 50%
Example:
- Current heap: 100MB
- GOGC=100: GC at 200MB
- GOGC=200: GC at 300MB
- GOGC=50: GC at 150MB
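The arithmetic behind these figures is target = live heap × (1 + GOGC/100). The sketch below encodes that simplified rule in a hypothetical helper; since Go 1.18 the runtime also adds GC roots (goroutine stacks and globals) to the growth term, so real targets come out slightly higher.

```go
package main

import "fmt"

// nextGCTargetMB returns the heap size in MB at which the next collection
// starts, using the simplified rule target = live + live*GOGC/100.
// It ignores GC roots, which the real runtime also includes.
func nextGCTargetMB(liveMB, gogc int) int {
	return liveMB + liveMB*gogc/100
}

func main() {
	const liveMB = 100 // live heap after the previous collection
	for _, gogc := range []int{50, 100, 200} {
		fmt.Printf("GOGC=%d: next GC at about %dMB\n", gogc, nextGCTargetMB(liveMB, gogc))
	}
}
```

The same setting can be changed at run time with `debug.SetGCPercent`, which returns the previous value.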
4.2 GOMEMLIMIT (Go 1.19+)
Set Memory Limit:
# Via environment variable
GOMEMLIMIT=10GiB ./myapp
# Programmatically (Go code, requires import "runtime/debug")
debug.SetMemoryLimit(10 << 30) // 10 GiB
Units Supported:
- B: Bytes
- KiB: Kibibytes (1024 bytes)
- MiB: Mebibytes (1024² bytes)
- GiB: Gibibytes (1024³ bytes)
- TiB: Tebibytes (1024⁴ bytes)
How it Works:
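- Soft limit (not a hard cap): the runtime may exceed it rather than stall the program
- GC becomes more aggressive as memory use approaches the limit
- Helps avoid OOM kills in containers when set below the container's memory limit
- Works alongside GOGC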
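A soft limit with headroom below the container limit can also be applied from code. This is a minimal sketch assuming a hypothetical 2 GiB container and 10% headroom; the helper names are illustrative, and only `debug.SetMemoryLimit` comes from the standard library.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// limitWithHeadroom returns containerBytes reduced by headroomPct percent,
// leaving room for memory the Go runtime does not account for.
func limitWithHeadroom(containerBytes, headroomPct int64) int64 {
	return containerBytes * (100 - headroomPct) / 100
}

// applyMemoryLimit sets the soft memory limit and returns the previous one.
func applyMemoryLimit(limit int64) int64 {
	return debug.SetMemoryLimit(limit)
}

// currentMemoryLimit reads the limit without changing it
// (a negative argument only queries the current value).
func currentMemoryLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	const containerBytes = 2 << 30 // assumed 2 GiB container limit
	limit := limitWithHeadroom(containerBytes, 10)
	applyMemoryLimit(limit)
	fmt.Println("soft memory limit in bytes:", currentMemoryLimit())
}
```

Setting the limit in code overrides a GOMEMLIMIT value supplied through the environment, so pick one place to configure it.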
4.3 GC Tuning Decision Matrix
| Scenario | GOGC | GOMEMLIMIT | Rationale |
|----------|------|------------|-----------|
| High throughput batch | 200-400 | 80% of RAM | Reduce GC overhead, use available memory |
| Memory-constrained (container) | 50-100 | Limit - 10% | Prevent OOM, more frequent GC |
| Latency-sensitive API | 100 | Not set | Default balance between memory and pause |
| Large heap (>4GB) | 100-200 | 80% of RAM | Reduce GC frequency for large heaps |
| Short-lived processes | 400+ | Not set | Maximize speed, process ends soon |
Example: Container with 2GB RAM:
GOGC=75 GOMEMLIMIT=1800MiB ./myapp
Example: Batch Processing:
GOGC=300 GOMEMLIMIT=24GiB ./batch-processor
Monitoring GC:
import (
"fmt"
"runtime/debug"
)
// Get GC stats
var stats debug.GCStats
debug.ReadGCStats(&stats)
fmt.Printf("Last GC: %v\n", stats.LastGC)
fmt.Printf("Num GC: %d\n", stats.NumGC)
5. Performance Anti-Patterns
5.1 String Concatenation in Loops
Anti-Pattern:
// Bad: O(N²) complexity
func buildString(items []string) string {
result := ""
for _, item := range items {
result += item // New allocation each iteration
}
return result
}
Solution:
// Good: O(N) complexity
func buildString(items []string) string {
var sb strings.Builder
for _, item := range items {
sb.WriteString(item)
}
return sb.String()
}
5.2 Unnecessary Allocations
Anti-Pattern 1: Creating Pointers in Loops:
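// Bad: &item points at a copy of the element; if process lets the pointer escape, each copy is heap-allocated
for _, item := range items {
ptr := &item
process(ptr)
}
// Good: Point directly at the slice element, no copy is made
for i := range items {
process(&items[i])
}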
Anti-Pattern 2: Converting to Interface:
// Bad: Each item is copied and boxed into an interface value, which usually allocates
func printAll(items []MyStruct) {
for _, item := range items {
fmt.Println(item) // Interface conversion
}
}
// Better: Pass a pointer to avoid the copy (note: fmt then prints &{...} instead of {...})
func printAll(items []MyStruct) {
for i := range items {
fmt.Println(&items[i])
}
}
5.3 Defer Overhead in Hot Paths
Anti-Pattern:
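// Bad: The deferred Unlock runs only when the function returns, so with more than
// one item the second mu.Lock() blocks forever (deadlock), on top of the defer overhead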
func processMany(items []Item) {
for _, item := range items {
mu.Lock()
defer mu.Unlock() // Accumulates, never runs until function exits
process(item)
}
}
Solution:
// Good: Manual unlock in loop
func processMany(items []Item) {
for _, item := range items {
mu.Lock()
process(item)
mu.Unlock()
}
}
// Or: Extract to function with defer
func processMany(items []Item) {
for _, item := range items {
processOne(item)
}
}
func processOne(item Item) {
mu.Lock()
defer mu.Unlock()
process(item)
}
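The ordering problem is easy to see by recording events. The sketch below is illustrative (both function names are made up): it compares `defer` placed directly in a loop with `defer` wrapped in a per-iteration function.

```go
package main

import (
	"fmt"
	"strings"
)

// deferInLoop defers cleanup inside the loop body. All cleanups are delayed
// until the enclosing function returns, and they run in reverse order.
func deferInLoop(n int) string {
	var events []string
	func() {
		for i := 0; i < n; i++ {
			events = append(events, fmt.Sprintf("work%d", i))
			defer func(i int) {
				events = append(events, fmt.Sprintf("cleanup%d", i))
			}(i)
		}
	}()
	return strings.Join(events, " ")
}

// deferPerIteration wraps each iteration in its own function, so the
// deferred cleanup runs before the next iteration starts.
func deferPerIteration(n int) string {
	var events []string
	for i := 0; i < n; i++ {
		func() {
			defer func() {
				events = append(events, fmt.Sprintf("cleanup%d", i))
			}()
			events = append(events, fmt.Sprintf("work%d", i))
		}()
	}
	return strings.Join(events, " ")
}

func main() {
	fmt.Println("defer in loop:      ", deferInLoop(2))
	fmt.Println("defer per iteration:", deferPerIteration(2))
}
```

With a mutex in place of the event log, the first shape holds the lock across iterations, which is why the extracted-function form is the safer default.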
Quick Reference
Profiling Commands
# CPU profile
go test -cpuprofile=cpu.prof -bench=.
go tool pprof -http=:8080 cpu.prof
# Memory profile
go test -memprofile=mem.prof -bench=.
go tool pprof -http=:8080 mem.prof
# HTTP profiling (production)
curl "http://localhost:6060/debug/pprof/profile?seconds=30" > cpu.prof
Benchmarking Commands
# Run benchmarks with memory stats
go test -bench=. -benchmem
# Compare before/after
go test -bench=. -count=10 > old.txt
go test -bench=. -count=10 > new.txt # after applying the change
benchstat old.txt new.txt
Optimization Checklist
- [ ] Profile before optimizing (identify hot paths)
- [ ] Pre-allocate slices with known capacity
- [ ] Use strings.Builder for string concatenation
- [ ] Check escape analysis with -gcflags='-m'
- [ ] Reduce allocations in hot loops
- [ ] Reuse buffers with sync.Pool
- [ ] Benchmark changes with -benchmem
- [ ] Tune GOGC/GOMEMLIMIT for workload
Related Skills:
- golang: Core Go idioms and patterns
- database-patterns: Database performance optimization
- api-design: API performance best practices
Sources:
- Go Diagnostics Guide: https://go.dev/doc/diagnostics
- Go Blog: Profiling Go Programs
- runtime/pprof package documentation
- Go 1.19 Memory Limit blog post
- benchstat tool documentation