Go Runtime Performance
Lesson, slides, and applied problem sets.
Go Runtime Performance: Allocations, GC, Goroutines, Memory Model, Generics
This module is a deep, practical reference for performance‑critical Go code. It focuses on runtime behavior (allocations, garbage collection, scheduler), safety constraints (memory model), and tradeoffs (generics vs interface vs monomorphized patterns). It also includes a measurement playbook.
Measurement playbook (do this first)
- Reproduce: isolate the workload in a benchmark or minimal harness.
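- Profile: collect a CPU profile, an allocation profile, and an execution trace, and inspect the profiles with pprof (CPU shows where time goes, allocation profiles show churn, the trace shows scheduler/GC/I/O behavior).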
- Classify: determine if the bottleneck is CPU, allocation/GC, lock contention, syscalls, or I/O.
- Hypothesize: form 1–3 candidate fixes (algorithmic, data layout, batching, caching).
- Verify: implement one change, re‑measure. Keep the fastest correct version.
- Guard: lock in performance with micro‑benchmarks and allocation checks.
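The Guard step can be sketched as an allocation check built on `testing.AllocsPerRun`. This is a minimal illustration, not a prescribed harness: `appendID` and the zero-allocation budget are hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// appendID is a hypothetical hot-path helper: it appends the decimal form
// of id to dst and returns the extended slice.
func appendID(dst []byte, id int64) []byte {
	return strconv.AppendInt(dst, id, 10)
}

// allocsPerCall reports the average number of heap allocations per call
// to appendID when the destination buffer already has enough capacity.
func allocsPerCall() float64 {
	buf := make([]byte, 0, 32) // allocated once, outside the measured function
	return testing.AllocsPerRun(1000, func() {
		buf = appendID(buf[:0], 1234567890)
	})
}

func main() {
	// Assumed budget for this helper; in a real test file the comparison
	// would be a t.Fatalf guard instead of a print.
	const budget = 0
	got := allocsPerCall()
	fmt.Printf("allocs per call: %.0f (budget %d)\n", got, budget)
}
```

Placed in a `_test.go` file, the same check fails the build pipeline as soon as a change reintroduces an allocation on this path.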
Tooling: concrete commands and what they show
- `go test -bench . -benchmem` → allocs/op and bytes/op for every benchmark.
- `go test -run none -bench BenchmarkFoo -benchmem -cpuprofile cpu.out -memprofile mem.out` → capture CPU + heap profiles.
- `go tool pprof -http=:0 cpu.out` → visualize hot functions and inlining.
- `go test -run none -bench BenchmarkFoo -trace trace.out` + `go tool trace trace.out` → goroutine scheduling, GC, syscalls.
- `GODEBUG=gctrace=1` → GC lines like: `gc 15 @1.234s 0%: 0.45+1.2+0.02 ms clock, 4->6->3 MB, 8 MB goal`
- Watch heap size (live set) and GC pause; high allocation rate + large live set = trouble.
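For reference, here is a sketch of the kind of benchmark these commands run. To keep it runnable as a single program it drives the benchmark through `testing.Benchmark`. In a `_test.go` file the same body would sit in a `func BenchmarkFoo(b *testing.B)`. `joinFields` is a hypothetical function under test.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinFields is a hypothetical function under test.
func joinFields(fields []string) string {
	return strings.Join(fields, ",")
}

// sink is package-level so the compiler cannot discard the result.
var sink string

// benchJoin has the same body a BenchmarkFoo function would have in a
// _test.go file.
func benchJoin(b *testing.B) {
	fields := []string{"alpha", "beta", "gamma"}
	b.ReportAllocs() // report allocs/op and B/op, as -benchmem does
	b.ResetTimer()   // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		sink = joinFields(fields)
	}
}

func main() {
	res := testing.Benchmark(benchJoin)
	// String gives iterations and ns/op; MemString gives B/op and allocs/op.
	fmt.Println(res.String(), res.MemString())
}
```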
Allocations and escape analysis
- Goal: keep hot‑path data on the stack and reuse buffers to avoid GC pressure.
- Use `go test -bench . -benchmem` to track allocs/op and bytes/op.
- Common allocation sources:
  - Converting `[]byte` to `string` (copy) and `string` to `[]byte` (copy)
  - `fmt.Sprintf` / `fmt.Errorf`
  - `append` growth when capacity is insufficient
  - map growth on insert, and collecting map keys or values into new slices while iterating
  - `interface{}` / generics boxing when values escape
- Use preallocation: `make([]T, 0, n)` for slices, `make(map[K]V, n)` for maps.
- Prefer slicing into existing buffers over new allocations.
- Investigate escapes with `go build -gcflags=all=-m` and look for “escapes to heap”.
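The preallocation advice above can be checked with a small comparison. This is a sketch under the assumption that the final length is known in advance. `squaresGrow` and `squaresPrealloc` are illustrative names, and the allocation counts are measured rather than stated because they depend on the runtime's growth policy.

```go
package main

import (
	"fmt"
	"testing"
)

// squaresGrow appends into a nil slice, so append must reallocate and
// copy each time the capacity runs out.
func squaresGrow(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// squaresPrealloc reserves the final capacity up front, so append never
// has to grow the backing array.
func squaresPrealloc(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// sink keeps results reachable so the slices are heap-allocated.
var sink []int

func main() {
	const n = 1024
	grow := testing.AllocsPerRun(100, func() { sink = squaresGrow(n) })
	pre := testing.AllocsPerRun(100, func() { sink = squaresPrealloc(n) })
	fmt.Printf("growing: %.0f allocs, preallocated: %.0f allocs\n", grow, pre)
}
```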
Garbage collector (GC) essentials
- Go uses a concurrent, tri‑color, mark‑and‑sweep GC with a pacer.
- The cost of each GC cycle scales with the live heap (what must be marked) rather than with total allocations; the allocation rate determines how often cycles run.
- Two critical metrics: allocation rate and live heap size.
- Tactics:
  - Reduce long‑lived heap objects (move hot data to stack, or reuse buffers).
  - Prefer compact, contiguous data structures to reduce pointer‑tracing.
  - Avoid per‑request object graphs in hot paths.
- Tuning knobs:
  - `GOGC` (default 100): higher → fewer GCs, more memory; lower → more GCs, less memory.
  - `GOMEMLIMIT` (Go 1.19+): soft memory limit; GC becomes more aggressive as you approach it.
- Heap size matters more than churn; shrinking the live set is often the biggest win.
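Both knobs can also be set from code through `runtime/debug`. The sketch below assumes Go 1.19 or later. `applyGCSettings` is a hypothetical helper, and the values 200 and 1 GiB are placeholders rather than recommendations.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyGCSettings sets the GC percent and the memory limit at run time and
// returns the previous values so callers can restore them later.
func applyGCSettings(percent int, limitBytes int64) (prevPercent int, prevLimit int64) {
	prevPercent = debug.SetGCPercent(percent)    // run-time equivalent of GOGC
	prevLimit = debug.SetMemoryLimit(limitBytes) // run-time equivalent of GOMEMLIMIT
	return prevPercent, prevLimit
}

func main() {
	// Assumed values for illustration only.
	prevPercent, prevLimit := applyGCSettings(200, 1<<30)
	fmt.Println("previous GC percent:", prevPercent)
	fmt.Println("previous memory limit (bytes):", prevLimit)
}
```

Setting these in code is mainly useful when the right values depend on something discovered at startup, such as a container's memory allowance. Otherwise the environment variables are simpler to audit.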
Goroutines and scheduler
- Goroutines are cheap but not free. Each has a stack that grows on demand.
- Scheduler contention and synchronization are frequent real bottlenecks.
- Guidelines:
  - Use coarse‑grained goroutines; avoid spawning one per small task.
  - Batch work to amortize synchronization.
  - Minimize shared mutable state; shard state if possible.
  - Measure with `go test -run Test -bench .` and `go test -trace trace.out` for scheduler insights.
- Mental model: M/P/G (OS threads, logical processors, goroutines). A blocking syscall ties up its M while the runtime hands the P to another M; cgo calls also hold an OS thread for their duration.
- For CPU‑bound work, use worker pools; for I/O‑bound work, goroutine‑per‑request is usually fine.
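The guidelines on coarse‑grained goroutines and batching can be sketched as a fixed set of workers, each handling one contiguous chunk. `sumSquares` is a hypothetical example, and sizing the pool to `runtime.GOMAXPROCS(0)` is a common starting point rather than a rule.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumSquares splits nums into one contiguous chunk per worker. Each worker
// accumulates locally and writes a single result, so synchronization is
// paid once per chunk rather than once per element.
func sumSquares(nums []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	partial := make([]int, workers) // one slot per worker; no shared counter
	chunk := (len(nums) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		if lo >= len(nums) {
			break // fewer chunks than workers
		}
		hi := lo + chunk
		if hi > len(nums) {
			hi = len(nums)
		}
		wg.Add(1)
		go func(w int, part []int) {
			defer wg.Done()
			sum := 0
			for _, v := range part {
				sum += v * v
			}
			partial[w] = sum // each goroutine writes only its own slot
		}(w, nums[lo:hi])
	}
	wg.Wait() // Wait orders the workers' writes before the reads below

	total := 0
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	nums := make([]int, 1000)
	for i := range nums {
		nums[i] = i
	}
	fmt.Println(sumSquares(nums, runtime.GOMAXPROCS(0)))
}
```

The per‑worker slots are a small instance of sharding: no lock or atomic is needed because no two goroutines touch the same element.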
Memory model (correctness first)
- The Go memory model guarantees that a read observes a write made by another goroutine only when synchronization orders the write before the read (a happens‑before relationship).
- Use `sync.Mutex`, `sync/atomic`, channels, or other synchronization to establish happens‑before.
- Avoid data races even if they "seem to work"; races invalidate compiler and CPU assumptions.
- If performance requires lock‑free patterns, validate with `-race` and micro‑benchmarks.
- Example (racy):
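    var done bool
    var x int
    go func() {
        x = 1
        done = true // plain write: no synchronization with the reader
    }()
    for !done {} // plain read: the loop may never observe the write
    fmt.Println(x) // data race; no happens-before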
- Example (safe):
    var x int
    done := make(chan struct{})
    go func() {
        x = 1
        close(done) // close happens-before the receive that observes it
    }()
    <-done // blocks until the goroutine has written x
    fmt.Println(x) // safe
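The same flag pattern can also be made safe with `sync/atomic` instead of a channel. The sketch below assumes Go 1.19 or later for `atomic.Bool`. `publishAndRead` is a hypothetical name, and the spin loop is kept only to mirror the racy example; a channel or `sync.WaitGroup` is usually the clearer choice.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// publishAndRead writes x in one goroutine and reads it in another,
// using an atomic flag as the synchronization point.
func publishAndRead() int {
	var x int
	var done atomic.Bool

	go func() {
		x = 1
		done.Store(true) // the write to x is sequenced before this store
	}()

	for !done.Load() { // a load that observes true synchronizes with the store
		runtime.Gosched() // yield rather than burn CPU while waiting
	}
	return x
}

func main() {
	fmt.Println(publishAndRead()) // prints 1
}
```

Running this under `go run -race` should report no race, while the racy version above is flagged.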
Generics performance notes
- Generics reduce code duplication but can hide costs if values escape.
- Performance depends on constraints and inlining:
  - Constrained types can enable inlining and avoid interface boxing.
  - Use `~` type sets to allow optimized operations on underlying types.
- Beware of generic functions that return interface values or store into `any`.
- For hot paths, measure both generic and hand‑specialized versions; choose the faster and clearer one.
- Prefer constraints like `constraints.Integer` / `constraints.Float` (from `golang.org/x/exp/constraints`) when you need ops without interface boxing.
- Avoid returning `any` from generic helpers; it usually forces an interface allocation for non‑pointer values.
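To make the contrast concrete, here is a sketch of a constrained generic function next to an `any`‑based counterpart. `Number` is a local constraint defined for this example so that it needs no external package. `Sum`, `SumAny`, and `Celsius` are likewise illustrative. Which version is faster for a given workload should be measured, not assumed.

```go
package main

import "fmt"

// Number is a local constraint defined for this sketch. The ~ form admits
// named types whose underlying type is listed, such as Celsius below.
type Number interface {
	~int | ~int64 | ~float64
}

// Sum operates on the concrete element type; no element is converted to
// an interface along the way.
func Sum[T Number](xs []T) T {
	var total T
	for _, x := range xs {
		total += x
	}
	return total
}

// SumAny is the interface-based counterpart: every element is stored as
// any and must be type-asserted before use. Non-float64 values are skipped.
func SumAny(xs []any) float64 {
	var total float64
	for _, x := range xs {
		if f, ok := x.(float64); ok {
			total += f
		}
	}
	return total
}

// Celsius has underlying type float64, so it satisfies Number via ~float64.
type Celsius float64

func main() {
	fmt.Println(Sum([]int{1, 2, 3}))
	fmt.Println(Sum([]Celsius{20.5, 21.5}))
	fmt.Println(SumAny([]any{20.5, 21.5}))
}
```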
Zero‑allocation patterns
- Accept `[]byte` inputs and write into caller‑provided buffers.
- Avoid building intermediate `string`s; parse from byte slices.
- Compute output size early and return `ErrShortBuffer` if needed.
- Use arenas or buffer pools for batch processing.
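A minimal sketch of the caller‑provided buffer pattern follows. It assumes the `ErrShortBuffer` mentioned above is the standard library's `io.ErrShortBuffer`, and it uses hex encoding only because the output size is easy to compute up front. `encodeHexInto` is a hypothetical helper.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"io"
)

// encodeHexInto writes the hex form of src into dst and returns the number
// of bytes written. The caller owns the buffer; the function itself does
// not allocate.
func encodeHexInto(dst, src []byte) (int, error) {
	need := hex.EncodedLen(len(src)) // output size is known before any work
	if len(dst) < need {
		return 0, io.ErrShortBuffer
	}
	return hex.Encode(dst, src), nil
}

func main() {
	buf := make([]byte, 64) // allocated once, reused across calls
	for _, msg := range [][]byte{[]byte("go"), []byte("gc")} {
		n, err := encodeHexInto(buf, msg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s\n", buf[:n])
	}
}
```

Checking the size before writing means a short buffer is reported without partially filling it, so the caller can grow the buffer and retry.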
Common pitfalls
- Mixing `bytes.Buffer` with repeated `String()` conversions.
- Using `fmt` on hot paths (parsing/formatting allocate).
- Escaping pointers by storing them in interfaces or capturing them in closures.
- Excessively fine‑grained locks around small data.
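As one illustration of the `fmt` pitfall, the sketch below formats the same key/value pair with `fmt.Sprintf` and with `strconv.AppendInt` into a reused buffer. `formatSprintf` and `appendKV` are hypothetical helpers. The append‑style version is only worth the extra code where profiling shows formatting on the hot path.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatSprintf is the convenient version: fmt receives its arguments as
// interface values and builds a new string on each call.
func formatSprintf(key string, n int64) string {
	return fmt.Sprintf("%s=%d", key, n)
}

// appendKV produces the same bytes by appending to a caller-owned buffer.
func appendKV(dst []byte, key string, n int64) []byte {
	dst = append(dst, key...)
	dst = append(dst, '=')
	return strconv.AppendInt(dst, n, 10)
}

func main() {
	buf := make([]byte, 0, 64)
	for i := int64(0); i < 3; i++ {
		buf = appendKV(buf[:0], "id", i) // reuse the same backing array
		fmt.Println(string(buf) == formatSprintf("id", i))
	}
}
```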