Go Runtime Performance

Lesson, slides, and applied problem sets.


Lesson

Go Runtime Performance: Allocations, GC, Goroutines, Memory Model, Generics

This module is a deep, practical reference for performance‑critical Go code. It focuses on runtime behavior (allocations, garbage collection, scheduler), safety constraints (memory model), and tradeoffs (generics vs interface vs monomorphized patterns). It also includes a measurement playbook.

Measurement playbook (do this first)

  1. Reproduce: isolate the workload in a benchmark or minimal harness.
  2. Profile: capture CPU, allocation, and execution-trace profiles with pprof and go tool trace (CPU shows where time goes, alloc profiles show churn, the trace shows scheduler/GC/syscall behavior).
  3. Classify: determine if the bottleneck is CPU, allocation/GC, lock contention, syscalls, or I/O.
  4. Hypothesize: form 1–3 candidate fixes (algorithmic, data layout, batching, caching).
  5. Verify: implement one change, re‑measure. Keep the fastest correct version.
  6. Guard: lock in performance with micro‑benchmarks and allocation checks.

Tooling: concrete commands and what they show

  • go test -bench . -benchmem → allocs/op and bytes/op for every benchmark.
  • go test -run none -bench BenchmarkFoo -benchmem -cpuprofile cpu.out -memprofile mem.out → capture CPU + heap profiles.
  • go tool pprof -http=:0 cpu.out → browse hot functions, call graphs, and flame graphs in the web UI.
  • go test -run none -bench BenchmarkFoo -trace trace.out + go tool trace trace.out → goroutine scheduling, GC, syscalls.
  • GODEBUG=gctrace=1 → GC lines like: gc 15 @1.234s 0%: 0.45+1.2+0.02 ms clock, 4->6->3 MB, 8 MB goal
    • Watch heap size (live set) and GC pause; high allocation rate + large live set = trouble.

Allocations and escape analysis

  • Goal: keep hot‑path data on the stack and reuse buffers to avoid GC pressure.
  • Use go test -bench . -benchmem to track allocs/op and bytes/op.
  • Common allocation sources:
    • Converting []byte to string (copy) and string to []byte (copy)
    • fmt.Sprintf / fmt.Errorf
    • append growth when capacity is insufficient
    • map growth when no size hint is given
    • interface{} / generics boxing when values escape
  • Use preallocation: make([]T, 0, n) for slices, make(map[K]V, n) for maps.
  • Prefer slicing into existing buffers over new allocations.
  • Investigate escapes with go build -gcflags=all=-m and look for “escapes to heap”.
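Several of these tactics combine in one pattern: preallocate a byte buffer and use append-style formatting instead of fmt. A sketch (joinIDs is an illustrative name, not a library function):

```go
package main

import (
	"fmt"
	"strconv"
)

// joinIDs builds "1,22,333" from a slice of ints. It preallocates a
// byte buffer and uses strconv.AppendInt, avoiding per-element
// fmt.Sprintf allocations; the only copy is the final string().
func joinIDs(ids []int) string {
	buf := make([]byte, 0, len(ids)*4) // rough capacity guess up front
	for i, id := range ids {
		if i > 0 {
			buf = append(buf, ',')
		}
		buf = strconv.AppendInt(buf, int64(id), 10)
	}
	return string(buf) // one copy at the end
}

func main() {
	fmt.Println(joinIDs([]int{1, 22, 333})) // 1,22,333
}
```

Check whether buf stays on the stack with `go build -gcflags=all=-m` on this file.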

Garbage collector (GC) essentials

  • Go uses a concurrent, tri‑color, mark‑and‑sweep GC with a pacer.
  • Mark cost scales with the live heap size (pointers to trace); GC frequency scales with the allocation rate.
  • Two critical metrics: allocation rate and live heap size.
  • Tactics:
    • Reduce long‑lived heap objects (move hot data to stack, or reuse buffers).
    • Prefer compact, contiguous data structures to reduce pointer‑tracing.
    • Avoid per‑request object graphs in hot paths.
  • Tuning knobs:
    • GOGC (default 100): higher → fewer GCs, more memory; lower → more GCs, less memory.
    • GOMEMLIMIT (Go 1.19+): a soft memory limit; the GC runs more aggressively as the heap approaches it.
  • Heap size matters more than churn; shrinking the live set is often the biggest win.

Goroutines and scheduler

  • Goroutines are cheap but not free. Each has a stack that grows on demand.
  • Scheduler contention and synchronization are frequent real bottlenecks.
  • Guidelines:
    • Use coarse‑grained goroutines; avoid spawning per small task.
    • Batch work to amortize synchronization.
    • Minimize shared mutable state; shard state if possible.
    • Measure with go test -run none -bench . -benchmem; add -trace trace.out for scheduler insights.
  • Mental model: M/P/G (machine threads, processors, goroutines). Blocking syscalls park an M; cgo can pin threads.
  • For CPU‑bound work, use worker pools; for I/O‑bound work, goroutine‑per‑request is usually fine.
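The worker-pool guideline can be sketched as follows; squares is an illustrative task, and the pool is sized to GOMAXPROCS since the work is CPU-bound:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// squares fans indices out to a fixed pool of workers, one per
// logical CPU: coarse-grained goroutines instead of one per task.
func squares(nums []int) []int {
	out := make([]int, len(nums))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.GOMAXPROCS(0); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = nums[i] * nums[i] // each index written once: no lock needed
			}
		}()
	}
	for i := range nums {
		jobs <- i
	}
	close(jobs) // lets the range loops in the workers finish
	wg.Wait()
	return out
}

func main() {
	fmt.Println(squares([]int{1, 2, 3, 4})) // [1 4 9 16]
}
```

For real workloads, send batches of indices per channel operation to amortize synchronization, per the batching guideline above.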

Memory model (correctness first)

  • The Go memory model guarantees that a read observes a write only when synchronization establishes a happens‑before relationship between them; without one, there is no such guarantee.
  • Use sync.Mutex, sync/atomic, channels, or other synchronization to establish happens‑before.
  • Avoid data races even if they "seem to work"; races invalidate compiler and CPU assumptions.
  • If performance requires lock‑free patterns, validate with -race and micro‑benchmarks.
  • Example (racy):
var done bool
var x int
go func() {
    x = 1
    done = true
}()
for !done {} // may spin forever: nothing forces the write to become visible
fmt.Println(x) // data race; no happens-before orders x = 1 before this read
  • Example (safe):
var x int
done := make(chan struct{})
go func() {
    x = 1
    close(done)
}()
<-done
fmt.Println(x) // safe

Generics performance notes

  • Generics reduce code duplication but can hide costs if values escape.
  • Performance depends on constraints and inlining:
    • Constrained types can enable inlining and avoid interface boxing.
    • Use ~ type sets to allow optimized operations on underlying types.
  • Beware of generic functions that return interface values or store into any.
  • For hot paths, measure both generic and hand‑specialized versions; choose the faster and clearer one.
  • Prefer constraints like constraints.Integer/constraints.Float when you need ops without interface boxing.
  • Avoid returning any from generic helpers; it forces interface allocation for non‑pointer values.
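A minimal sketch of a ~ type-set constraint; Number, Sum, and Celsius are illustrative names:

```go
package main

import "fmt"

// Number uses ~ type sets, so Sum also accepts named types whose
// underlying type is listed — without interface boxing.
type Number interface {
	~int | ~int64 | ~float64
}

// Sum returns T, not any, so non-pointer results are not forced
// into an interface allocation.
func Sum[T Number](xs []T) T {
	var total T
	for _, x := range xs {
		total += x
	}
	return total
}

type Celsius float64 // named type: admitted by the ~float64 term

func main() {
	fmt.Println(Sum([]int{1, 2, 3}))      // 6
	fmt.Println(Sum([]Celsius{1.5, 2.5})) // 4
}
```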

Zero‑allocation patterns

  • Accept []byte inputs and write into caller‑provided buffers.
  • Avoid building intermediate strings; parse from byte slices.
  • Compute the required output size early and return io.ErrShortBuffer if the caller's buffer is too small.
  • Use arenas or buffer pools for batch processing.

Common pitfalls

  • Mixing bytes.Buffer with repeated String() conversions.
  • Using fmt on hot paths (parsing/formatting allocate).
  • Forcing escapes by storing pointers in interfaces or capturing them in closures.
  • Excessively fine‑grained locks around small data.
