Writing faster Go: profile first, then kill allocations
Performance work goes wrong when it starts with a guess. Someone decides a function “feels slow”, rewrites it, and moves on without ever checking whether it mattered. In Go you never have to guess: the tooling is good enough to tell you. This is the loop I use, in order: measure, find the real cost, change one thing, measure again.
Measure first: benchmarks
Go’s testing package has benchmarks built in. A benchmark is just a function that runs your
code b.N times:
func BenchmarkParse(b *testing.B) {
	input := loadFixture()
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = Parse(input)
	}
}
Run it with -benchmem, which adds the allocation columns for every benchmark (b.ReportAllocs() turns them on for just this one):
go test -bench=Parse -benchmem ./...
BenchmarkParse-10 248511 4815 ns/op 2208 B/op 37 allocs/op
Read those columns right to left. allocs/op (heap allocations per call) and B/op (bytes
allocated) are usually the best lead on a slow ns/op, because every heap allocation is
work now and garbage-collection work later.
Two practical notes. Use b.ResetTimer() after expensive setup so it isn’t counted. And on Go
1.24+ you can write for b.Loop() instead of the b.N loop; it stops the compiler from
optimising your loop body away, which is the classic way to get a benchmark that measures
nothing.
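Here is a minimal sketch of the b.Loop() form, assuming Go 1.24 or later. Parse and loadFixture are the names from the benchmark above, but their bodies here are hypothetical stand-ins (a whitespace tokenizer and a fixed string) so the file compiles. testing.Benchmark is used only so the sketch runs as an ordinary program; in a _test.go file you would keep the func BenchmarkParse(b *testing.B) signature and run it with go test.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Parse is a hypothetical stand-in for the function under test.
func Parse(input string) []string {
	return strings.Fields(input)
}

// loadFixture is a hypothetical stand-in for expensive setup.
func loadFixture() string {
	return strings.Repeat("alpha beta gamma ", 64)
}

// benchParse uses b.Loop (Go 1.24+). Setup that runs before the loop is
// excluded from the timing, so there is no b.ResetTimer call.
func benchParse(b *testing.B) {
	input := loadFixture()
	b.ReportAllocs()
	for b.Loop() {
		Parse(input)
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test".
	res := testing.Benchmark(benchParse)
	fmt.Println(res.String(), res.MemString())
}
```

The loop body calls Parse without assigning the result; with b.Loop() the call is still kept, which is the point of the newer form.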
Profile: find the real cost
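A benchmark tells you a function is slow. A profile tells you where the time goes. Profile one package at a time: go test rejects -cpuprofile when the pattern matches more than one package.
go test -bench=Parse -cpuprofile=cpu.out .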
go tool pprof -http=:8080 cpu.out
That opens pprof’s web UI in your browser, with a flame graph available from the View menu.
For a running service, import net/http/pprof and read it live from /debug/pprof/. For
allocations specifically, take a memory profile and sort by allocation count, which maps
directly to GC pressure:
go test -bench=Parse -memprofile=mem.out .
go tool pprof -alloc_objects mem.out
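For the running-service case mentioned above, this sketch shows what the net/http/pprof import actually does: it registers handlers on http.DefaultServeMux as a side effect. The throwaway httptest server and the pprofIndexStatus helper are assumptions made so the example can run and exit; a real service would serve the default mux on a private address of its own choosing.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/ on http.DefaultServeMux
)

// pprofIndexStatus serves the default mux on a temporary local server,
// fetches the pprof index page, and returns the HTTP status code.
func pprofIndexStatus() (int, error) {
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/debug/pprof/")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	status, err := pprofIndexStatus()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("GET /debug/pprof/ ->", status)
}
```

Against a long-running process you would point go tool pprof at the same path on the service's address, for example at /debug/pprof/profile?seconds=30 for a CPU profile or /debug/pprof/heap for memory.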
Why allocations dominate
When a value escapes to the heap you pay twice: once to allocate it, and again later when the garbage collector has to scan and free it. Keeping a value on the stack avoids both. The compiler decides this through escape analysis, and it will show you what it decided:
go build -gcflags='-m' ./...
./parse.go:42:9: &buf escapes to heap
./parse.go:51:13: make([]token, 0, n) does not escape
You can’t override escape analysis, but you can avoid forcing escapes. The usual causes: returning a pointer to a local, storing a pointer inside an interface, capturing a variable in a closure that outlives the call, or a slice or map the compiler can’t size.
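To see the first cause on that list in isolation, here is a small sketch that counts allocations with testing.AllocsPerRun. The point type, both functions, and the sink variables are hypothetical; the noinline directives are there only so the compiler cannot inline the calls and remove the escape.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// Package-level sinks keep the results reachable.
var (
	ptrSink *point
	intSink int
)

// newPoint returns a pointer to a local, so p has to outlive the call and
// is moved to the heap.
//
//go:noinline
func newPoint(x, y int) *point {
	p := point{x, y}
	return &p
}

// sumPoint builds the same struct but only returns an int, so p can stay
// on the stack.
//
//go:noinline
func sumPoint(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}

func main() {
	escaping := testing.AllocsPerRun(1000, func() { ptrSink = newPoint(1, 2) })
	local := testing.AllocsPerRun(1000, func() { intSink = sumPoint(1, 2) })
	fmt.Printf("newPoint: %.0f allocs/call, sumPoint: %.0f allocs/call\n", escaping, local)
}
```

Building this file with -gcflags='-m' should report the escape in newPoint and nothing for sumPoint, which is a quick way to check that the compiler's diagnostics and the measured allocations agree.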
Five changes that usually pay off
1. Preallocate slices and maps
If you know the size, say so:
out := make([]Result, 0, len(rows)) // capacity up front
for _, r := range rows {
	out = append(out, transform(r))
}
Growing a slice reallocates and copies; growing a map rehashes. One make with capacity
replaces a handful of hidden allocations.
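A rough way to see those hidden allocations is to count them. In this sketch grow, prealloc, and sliceSink are hypothetical names, and the element count of 1000 is arbitrary; the exact number for the growing version depends on the runtime's growth policy, so treat it as indicative.

```go
package main

import (
	"fmt"
	"testing"
)

// sliceSink keeps the results reachable so both versions allocate on the
// heap and the comparison is fair.
var sliceSink []int

// grow appends n values starting from a nil slice, so append has to
// reallocate and copy every time capacity runs out.
func grow(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

// prealloc does the same work with the capacity requested up front.
func prealloc(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	const n = 1000
	grown := testing.AllocsPerRun(100, func() { sliceSink = grow(n) })
	sized := testing.AllocsPerRun(100, func() { sliceSink = prealloc(n) })
	fmt.Printf("grow: %.0f allocs, prealloc: %.0f allocs\n", grown, sized)
}
```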
2. Reuse buffers with sync.Pool
For short-lived objects on a hot path (byte buffers, encoders), a sync.Pool lets you reuse
instead of reallocate:
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func render(v Value) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() { buf.Reset(); bufPool.Put(buf) }()
	writeInto(buf, v)
	return buf.String()
}
Reset the object before putting it back, and never assume Get hands you a clean one.
3. Stop converting between string and []byte
[]byte(s) and string(b) both copy in the general case (the compiler elides the copy only in
a few narrow patterns). In a hot loop that adds up fast. Operate on []byte and convert once
at the boundary, and use the Append family to write into a buffer you own:
buf = strconv.AppendInt(buf, n, 10) // no intermediate string
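As a slightly fuller sketch of the same idea, the function below formats a whole record into a caller-owned buffer. appendRecord and its "id=… ok=…" layout are made up for illustration; strconv.AppendInt and strconv.AppendBool are the standard library calls doing the work.

```go
package main

import (
	"os"
	"strconv"
)

// appendRecord writes "id=<id> ok=<ok>" onto dst and returns the extended
// slice. No intermediate strings are created along the way.
func appendRecord(dst []byte, id int64, ok bool) []byte {
	dst = append(dst, "id="...)
	dst = strconv.AppendInt(dst, id, 10)
	dst = append(dst, " ok="...)
	dst = strconv.AppendBool(dst, ok)
	return dst
}

func main() {
	buf := make([]byte, 0, 64) // one buffer, reused for every record
	for id := int64(1); id <= 3; id++ {
		buf = appendRecord(buf[:0], id, id%2 == 1)
		buf = append(buf, '\n')
		os.Stdout.Write(buf) // bytes go straight out, no string conversion
	}
}
```

Taking dst and returning the extended slice mirrors the Append functions themselves, so calls chain naturally and the caller decides when, or whether, the buffer is reallocated.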
4. Watch interface boxing
Putting a value into an interface can allocate, because the interface needs a pointer to the
value. That includes handing an int to fmt. In hot paths, prefer concrete types and typed
helpers:
BenchmarkSprintf-10 8912340 134 ns/op 16 B/op 2 allocs/op
BenchmarkItoa-10 72944011 16 ns/op 0 B/op 0 allocs/op
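In this benchmark strconv.Itoa(n) does the same job as fmt.Sprintf("%d", n) with zero allocations. Itoa can still allocate its result string for larger values, so measure with your own inputs.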
5. Build strings with strings.Builder
Concatenating with + inside a loop allocates a fresh string every iteration. strings.Builder
writes into one growing buffer, and Grow lets you size it once:
var b strings.Builder
b.Grow(len(parts) * 8)
for _, p := range parts {
	b.WriteString(p)
}
return b.String()
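The len(parts) * 8 above is a guess at the average part length. When the parts are already in hand, one alternative is to sum their lengths first so Grow reserves exactly what is needed. A minimal sketch, with concat as a hypothetical name:

```go
package main

import (
	"fmt"
	"strings"
)

// concat joins parts with no separator. It sums the lengths first so that
// Grow can reserve the exact number of bytes in one step.
func concat(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}

	var b strings.Builder
	b.Grow(total)
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	fmt.Println(concat([]string{"pro", "file", " first"}))
}
```

The extra pass over parts is cheap next to a reallocation, but for a fixed separator strings.Join already does this for you.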
Tune the garbage collector, last
Once the allocation profile is flat, the GC itself becomes a knob. Two environment variables do most of the work:
GOGC (default 100) sets how much the heap grows between collections. Higher means fewer, larger collections: more memory, less CPU.
GOMEMLIMIT (Go 1.19+) is a soft memory ceiling. This is the one for containers: set it a little below the container’s memory limit and the runtime collects harder as the heap approaches it, which makes an OOM kill less likely.
Reach for these after you’ve cut allocations, not instead of it. A GC knob trades memory for CPU; removing an allocation costs nothing.
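Both knobs can also be set from inside the program through runtime/debug, which is handy when the value is computed at startup. In this sketch tune is a hypothetical helper and the numbers (GOGC 200, a 900 MiB limit for an assumed 1 GiB container) are placeholders, not recommendations.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// tune applies a GC percentage and a soft memory limit in bytes, and
// returns the previous values so a caller can restore them.
func tune(gcPercent int, limitBytes int64) (prevGC int, prevLimit int64) {
	prevGC = debug.SetGCPercent(gcPercent)       // same knob as GOGC
	prevLimit = debug.SetMemoryLimit(limitBytes) // same knob as GOMEMLIMIT (Go 1.19+)
	return prevGC, prevLimit
}

func main() {
	// Placeholder values: an assumed 1 GiB container with some headroom left.
	prevGC, prevLimit := tune(200, 900<<20)
	fmt.Printf("GC percent was %d, memory limit was %d bytes\n", prevGC, prevLimit)
}
```

Setting the environment variables is usually simpler for deployment; the programmatic form mostly earns its place when the limit has to be derived from something the process discovers at runtime.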
The loop, again
That’s the whole thing: a benchmark to know if you’re faster, a profile to know where to look, escape analysis to understand why, and one change at a time so you can tell what actually worked. Most of the wins are allocations. Measure, and let the numbers pick the fight.