The Green Tea Garbage Collector
Go 1.25 includes a new experimental garbage collector called Green Tea,
available by setting GOEXPERIMENT=greenteagc at build time.
Many workloads spend around 10% less time in the garbage collector, but some
workloads see a reduction of up to 40%!
It's production-ready and already in use at Google, so we encourage you to
try it out.
We know some workloads don't benefit as much, or even at all, so your feedback
is crucial to helping us move forward.
Based on the data we have now, we plan to make it the default in Go 1.26.
To report back with any problems, file a new issue.
To report back with any successes, reply to the existing Green Tea issue.
What follows is a blog post based on Michael Knyszek's GopherCon 2025 talk.
Tracing garbage collection
Before we discuss Green Tea, let's get us all on the same page about garbage
collection.
Objects and pointers
The purpose of garbage collection is to automatically reclaim and reuse memory
no longer used by the program.
To this end, the Go garbage collector concerns itself with objects and
pointers.
In the context of the Go runtime, objects are Go values whose underlying
memory is allocated from the heap.
Heap objects are created when the Go compiler can't figure out how else to allocate
memory for a value.
For example, the following code snippet allocates a single heap object: the backing
store for a slice of pointers.
var x = make([]*int, 10) // global
The Go compiler can't allocate the slice backing store anywhere except the heap,
since it's very hard, and maybe even impossible, for it to know how long x will
refer to the object.
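Another common way for a value to end up on the heap is for a function to return a pointer
to one of its local variables. Here's a small sketch (the function name is made up for
illustration):

	// count escapes to the heap: the returned pointer outlives the call,
	// so the compiler cannot leave the variable on the stack.
	// (Run `go build -gcflags=-m` to see the compiler's escape decisions.)
	func newCounter() *int {
		count := 0
		return &count
	}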
Pointers are just numbers that indicate the location of a Go value in memory,
and they're how a Go program references objects.
For example, to get the pointer to the beginning of the object allocated in the
last code snippet, we can write:
&x[0] // 0xc000104000
The mark-sweep algorithm
Go's garbage collector follows a strategy broadly referred to as tracing garbage
collection, which just means that the garbage collector follows, or traces, the
pointers in the program to identify which objects the program is still using.
More specifically, the Go garbage collector implements the mark-sweep algorithm.
This is much simpler than it sounds.
Imagine objects and pointers as a sort of graph, in the computer science sense.
Objects are nodes, pointers are edges.
The mark-sweep algorithm operates on this graph, and as the name might suggest,
proceeds in two phases.
In the first phase, the mark phase, it walks the object graph from well-defined
source edges called roots.
Think global and local variables.
Then, it marks everything it finds along the way as visited, to avoid going in
circles.
This is analogous to your typical graph flood algorithm, like a depth-first or
breadth-first search.
Next is the sweep phase.
Whatever objects were not visited in our graph walk are unused, or unreachable,
by the program.
We call this state unreachable because it is impossible with normal safe Go code
to access that memory anymore, simply through the semantics of the language.
To complete the sweep phase, the algorithm simply iterates through all the
unvisited nodes and marks their memory as free, so the memory allocator can reuse
it.
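To make the two phases concrete, here's a minimal sketch of mark-sweep over a toy object
graph. This is for intuition only; the real collector works on raw memory with per-object
metadata, runs concurrently with the program, and is parallelized:

	// A toy heap: objects are nodes, children are the outgoing pointers.
	type object struct {
		children []*object
		marked   bool // the "visited" bit
	}

	// markSweep returns the unreachable objects, whose memory the
	// allocator could then reuse.
	func markSweep(roots, heap []*object) (freed []*object) {
		// Mark phase: flood the graph starting from the roots.
		var work []*object
		for _, r := range roots {
			if r != nil && !r.marked {
				r.marked = true
				work = append(work, r)
			}
		}
		for len(work) > 0 {
			obj := work[len(work)-1] // pop: a depth-first flood
			work = work[:len(work)-1]
			for _, child := range obj.children {
				if child != nil && !child.marked {
					child.marked = true
					work = append(work, child)
				}
			}
		}
		// Sweep phase: anything still unmarked is unreachable.
		for _, obj := range heap {
			if !obj.marked {
				freed = append(freed, obj)
			}
			obj.marked = false // reset for the next cycle
		}
		return freed
	}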
That's it?
You may think I'm oversimplifying a bit here.
Garbage collectors are frequently referred to as magic, and black boxes.
And you'd be partially right: there are more complexities.
For example, this algorithm is, in practice, executed concurrently with your
regular Go code.
Walking a graph that's mutating underneath you brings challenges.
We also parallelize this algorithm, which is a detail that'll come up again
later.
But trust me when I tell you that these details are mostly separate from the
core algorithm.
It really is just a simple graph flood at the center.
Graph flood example
Let's walk through an example.
Navigate through the slideshow below to follow along.

Let's break it down, piece by piece.

These are global variables x and y.
They will be the starting point of our graph walk.
Since they're marked blue, according to our handy legend in the bottom left, they're currently on our work list.
Everything in our heap is grayed out because we haven't visited any of it yet.

Each object is labeled with its type.
One of them is an object of type T, whose type definition is on the top left.
It's got a pointer to an array of children, and some value.
We can surmise that this is some kind of recursive tree data structure.
The arrays of pointers are pointed to by the "children" field of objects of type T.
A square with a dot is a pointer.
If it has an arrow, it is a non-nil pointer pointing to some other object.

Each outlined region of the heap represents a page, which is a contiguous
block of fixed-size, aligned memory.
In Go, pages are 8 KiB (regardless of the hardware virtual
memory page size).
These pages are labeled A, B, C, and D, and I'll refer to them that way.
Like in the real implementation, each page here only contains objects of a certain size.
This is just how the Go heap is organized.
Next to page A you can see seven boxes, each corresponding to one of the seven object slots in the page.
This is actually how the real runtime manages whether an object has been visited, and it'll be an important detail later.

This will all come into play later.
For now, let's just see how our graph flood applies to this picture.

When we process an entry on the work list, we mark it red to indicate that it's now active.
Following our legend, we draw each object it points to in blue to indicate that it's on our work list,
and we set the seen bit corresponding to that object in our metadata.
Then we take an object off the work list and repeat.
By the way, we call walking the pointers of an object "scanning" the object.

The walk continues this way, following pointers wherever they lead; some objects turn out
to be reachable and some don't, and we don't know ahead of time if they will be.
Notice that the scan jumps around the heap in no particular order.
This is intentional, and reflects the actual graph flood algorithm in the Go runtime.

Once the work list is empty, the mark phase is done.
Every object drawn in black is reachable, and every object drawn in gray is unreachable.
Let's sweep the unreachable objects, all in one go.
The problem
After all that, I think we have a handle on what the Go garbage collector is actually doing.
This process seems to work well enough today, so what's the problem?
Well, it turns out we can spend a lot of time executing this particular algorithm in some
programs, and it adds substantial overhead to nearly every Go program.
It's not that uncommon to see Go programs spending 20% or more of their CPU time in the
garbage collector.
Let's break down where that time is being spent.
Garbage collection costs
At a high level, there are two parts to the cost of the garbage collector.
The first is how often it runs, and the second is how much work it does each time it runs.
Multiply those two together, and you get the total cost of the garbage collector.
Over the years we've tackled both terms in this equation, and for more on how often the garbage
collector runs, see Michael's GopherCon EU talk from 2022
about memory limits.
The guide to the Go garbage collector also has a lot to say about this topic,
and is worth a look if you want to dive deeper.
But for now let's focus only on the second part, the cost per cycle.
From years of poring over CPU profiles to try to improve performance, we know two big things
about Go's garbage collector.
The first is that about 90% of the cost of the garbage collector is spent marking,
and only about 10% is sweeping.
Sweeping turns out to be much easier to optimize than marking,
and Go has had a very efficient sweeper for many years.
The second is that, of that time spent marking, a substantial portion, usually at least 35%, is
simply spent stalled on accessing heap memory.
This is bad enough on its own, but it also gums up the works of what makes modern CPUs
actually fast.
"A microarchitectural disaster"
What does "gum up the works" mean in this context?
The specifics of modern CPUs can get pretty complicated, so let's use an analogy.
Imagine the CPU driving down a road, where that road is your program.
The CPU wants to ramp up to a high speed, and to do that it needs to be able to see far ahead of it,
and the way needs to be clear.
But the graph flood algorithm is like driving through city streets for the CPU.
The CPU can't see around corners and it can't predict what's going to happen next.
To make progress, it constantly has to slow down to make turns, stop at traffic lights, and avoid
pedestrians.
It hardly matters how fast your engine is because you never get a chance to get going.
Let's make that more concrete by looking at our example again.
I've overlaid the heap here with the path that we took.
Each left-to-right arrow represents a piece of scanning work that we did
and the dashed arrows show how we jumped around between bits of scanning work.

Notice that we were jumping all over memory doing tiny bits of work in each place.
In particular, we're frequently jumping between pages, and between different parts of pages.
Modern CPUs do a lot of caching.
Going to main memory can be up to 100x slower than accessing memory that's in our cache.
CPU caches are populated with memory that's been recently accessed, and with memory that's near
recently accessed memory.
But there's no guarantee that any two objects that point to each other will also be close to each
other in memory.
The graph flood doesnât take this into account.
Quick side note: if we were just stalling on fetches to main memory, it might not be so bad.
CPUs issue memory requests asynchronously, so even slow ones could overlap if the CPU could see
far enough ahead.
But in the graph flood, every bit of work is small, unpredictable, and highly dependent on the
last, so the CPU is forced to wait on nearly every individual memory fetch.
And unfortunately for us, this problem is only getting worse.
There's an adage in the industry: "wait two years and your code will get faster."
But Go, as a garbage-collected language that relies on the mark-sweep algorithm, risks the opposite:
"wait two years and your code will get slower."
The trends in modern CPU hardware are creating new challenges for garbage collector performance:
Non-uniform memory access.
For one, memory now tends to be associated with subsets of CPU cores.
Accesses by other CPU cores to that memory are slower than before.
In other words, the cost of a main memory access depends on which CPU core is accessing
it.
It's non-uniform, so we call this non-uniform memory access, or NUMA for short.
Reduced memory bandwidth.
Available memory bandwidth per CPU is trending downward over time.
This just means that while we have more CPU cores, each core can submit relatively fewer
requests to main memory, forcing non-cached requests to wait longer than before.
Ever more CPU cores.
Above, we looked at a sequential marking algorithm, but the real garbage collector performs this
algorithm in parallel.
This scales well to a limited number of CPU cores, but the shared queue of objects to scan becomes
a bottleneck, even with careful design.
Modern hardware features.
New hardware has fancy features like vector instructions, which let us operate on a lot of data at once.
While this has the potential for big speedups, it's not immediately clear how to make that work for
marking, because marking consists of so many irregular and often tiny pieces of work.
Green Tea
Finally, this brings us to Green Tea, our new approach to the mark-sweep algorithm.
The key idea behind Green Tea is astonishingly simple:
Work with pages, not objects.
Sounds trivial, right?
And yet, it took a lot of work to figure out how to order the object graph walk and what we needed to
track to make this work well in practice.
More concretely, this means:
- Instead of scanning objects we scan whole pages.
- Instead of tracking objects on our work list, we track whole pages.
- We still need to mark objects at the end of the day, but we'll track marked objects locally to each
page, rather than across the whole heap.
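Here's a minimal sketch of that idea, with invented names (page, markObject, scanPage) and toy
data structures standing in for real memory; the actual implementation in the runtime is
considerably more involved:

	const objectsPerPage = 8 // illustrative; real pages hold many more slots

	// ref identifies an object by its page and slot, standing in for a pointer.
	type ref struct {
		page *page
		slot int
	}

	type page struct {
		seen    [objectsPerPage]bool  // object has been discovered
		scanned [objectsPerPage]bool  // object's pointers have been walked
		queued  bool                  // page is already on the work list
		ptrs    [objectsPerPage][]ref // outgoing pointers per slot
	}

	// markObject records a newly discovered object and, if needed, enqueues
	// its page. Unlike the classic flood, where an object is enqueued at
	// most once, a page may be re-queued many times during a mark phase.
	func markObject(work *[]*page, r ref) {
		if r.page.seen[r.slot] {
			return
		}
		r.page.seen[r.slot] = true
		if !r.page.queued {
			r.page.queued = true
			*work = append(*work, r.page) // FIFO: new pages go on the back
		}
	}

	// scanPage scans, in memory order, every object on p that has been seen
	// but not yet scanned, so work that accumulated while p sat on the
	// queue is all processed in one pass.
	func scanPage(work *[]*page, p *page) {
		p.queued = false
		for slot := range p.ptrs {
			if p.seen[slot] && !p.scanned[slot] {
				p.scanned[slot] = true
				for _, r := range p.ptrs[slot] {
					markObject(work, r)
				}
			}
		}
	}

The mark phase then marks the roots with markObject and drains the queue from the front,
calling scanPage on each page until the queue is empty.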
Green Tea example
Let's see what this means in practice by looking at our example heap again, but this time
running Green Tea instead of the straightforward graph flood.
As above, navigate through the annotated slideshow to follow along.

Each page now carries two rows of metadata bits.
Again, each bit, or box, corresponds to one of the object slots in the page.
In total, we now have fourteen bits that correspond to the seven slots in page A.
The top row is the set from before; I'll call these the "seen" bits.
The bottom set of bits are new.
These "scanned" bits track whether or not we've scanned the object.
Remember, the work list now tracks pages, not objects.
We still need to track objects at some level, and that's the purpose of these bits.

When the graph walk reaches an object, we set its seen bit and put a whole page, in this case
page A, on the work list, indicated by shading the whole page blue.
Note that the object's blue hue directly reflects the metadata in page A.
Its corresponding seen bit is set, but its scanned bit is not.

We take page A off the work list and scan it.
As a result, we add page B to the work list, since the first object in page A points to an object in page B.

Next we take page C off the work list.
Scanning it turns up another pointer to an object in page B.
Page B is already on the work list, so we don't need to add anything to the work list.
We simply have to set the seen bit for the target object.
By the time we take page B off the work list, we've accumulated two objects to scan on it,
and we can process both of these objects in a row, in memory order!

Scanning page B turns up a pointer back into page A.
Page A was previously on the work list, but isn't at this point, so we put it back on the work list.
Unlike the original mark-sweep algorithm, where any given object is added to the work list at
most once per whole mark phase, in Green Tea, a given page can reappear on the work list several times
during a mark phase.
This second time around, we scan only the remaining seen objects on page A,
since we already scanned the first object.
We know which objects to scan by looking at the difference between the "seen" and "scanned" bits.

Once the work list is empty, marking is done.
Notice that the metadata now all lines up nicely, since all reachable objects were both seen and scanned.

Two details of the order we did things in are worth calling out.
Where the graph flood had a last-in-first-out, or stack-like, order, here we're using a first-in-first-out, or queue-like, order for the pages on our work list.
And we let seen objects accumulate on each page while the page sits on the queue, so we can process as many as we can at once.
That's how we were able to hit so many objects on page A at once.
Sometimes laziness is a virtue.

Getting on the highway
Let's come back around to our driving analogy.
Are we finally getting on the highway?
Let's recall our graph flood picture from before.

We jumped around a whole lot, doing little bits of work in different places.
The path taken by Green Tea looks very different.

Green Tea, in contrast, makes fewer, longer left-to-right passes over pages A and B.
The longer these arrows, the better, and with bigger heaps, this effect can be much stronger.
That's the magic of Green Tea.
It's also our opportunity to ride the highway.
This all adds up to a better fit with the microarchitecture.
We can now scan objects closer together with much higher probability, so
there's a better chance we can make use of our caches and avoid main memory.
Likewise, per-page metadata is more likely to be in cache.
Tracking pages instead of objects means work lists are smaller,
and less pressure on work lists means less contention and fewer CPU stalls.
And speaking of the highway, we can now shift our metaphorical engine into gears we've never been
able to use before, since now we can use vector hardware!
Vector acceleration
If you're only vaguely familiar with vector hardware, you might be confused as to how we can use it here.
But besides the usual arithmetic and trigonometric operations,
recent vector hardware supports two things that are valuable for Green Tea:
very wide registers, and sophisticated bit-wise operations.
Most modern x86 CPUs support AVX-512, which has 512-bit wide vector registers.
This is wide enough to hold all of the metadata for an entire page in just two registers,
right on the CPU. (For example, a page of 16-byte objects has 512 slots, so its 512 "seen" bits
fill one register and its 512 "scanned" bits another.) This enables Green Tea to work on an
entire page in just a few straight-line instructions.
Vector hardware has long supported basic bit-wise operations on whole vector registers, but starting
with AMD Zen 4 and Intel Ice Lake, it also supports a new bit vector "Swiss army knife" instruction
that enables a key step of the Green Tea scanning process to be done in just a few CPU cycles.
Together, these allow us to turbo-charge the Green Tea scan loop.
This wasn't even an option for the graph flood, where we'd be jumping between scanning objects that
are all sorts of different sizes.
Sometimes you needed two bits of metadata and sometimes you needed ten thousand.
There simply wasn't enough predictability or regularity to use vector hardware.
If you want to nerd out on some of the details, read along!
Otherwise, feel free to skip ahead to the evaluation.
AVX-512 scanning kernel
To get a sense of what AVX-512 GC scanning looks like, take a look at the diagram below.
There's a lot going on here and we could probably fill an entire blog post just on how this works.
For now, let's just break it down at a high level:
1. First we fetch the "seen" and "scanned" bits for a page.
   Recall, these are one bit per object in the page, and all objects in a page have the same size.
2. Next, we compare the two bit sets.
   Their union becomes the new "scanned" bits, while their difference is the "active objects" bitmap,
   which tells us which objects we need to scan in this pass over the page (versus previous passes).
3. We take the difference of the bitmaps and "expand" it, so that instead of one bit per object,
   we have one bit per word (8 bytes) of the page.
   We call this the "active words" bitmap.
   For example, if the page stores 6-word (48-byte) objects, each bit in the active objects bitmap
   will be copied to 6 bits in the active words bitmap.
   Like so:

   0 0 1 1 ...
         ↓
   000000 000000 111111 111111 ...

4. Next we fetch the pointer/scalar bitmap for the page.
   Here, too, each bit corresponds to a word (8 bytes) of the page, and it tells us whether that word
   stores a pointer.
   This data is managed by the memory allocator.
5. Now, we take the intersection of the pointer/scalar bitmap and the active words bitmap.
   The result is the "active pointer bitmap": a bitmap that tells us the location of every
   pointer in the entire page contained in any live object we haven't scanned yet.
6. Finally, we can iterate over the memory of the page and collect all the pointers.
   Logically, we iterate over each set bit in the active pointer bitmap,
   load the pointer value at that word, and write it back to a buffer that
   will later be used to mark objects seen and add pages to the work list.
   Using vector instructions, we're able to do this 64 bytes at a time,
   in just a couple instructions.
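To make the data flow concrete, here's a scalar Go sketch of steps 1 through 5 for a toy page
with just 8 object slots, using invented names; the real kernel does all of this on 512-bit
vector registers in a handful of instructions:

	// scanKernel runs one pass of the scanning pipeline for a toy page of
	// 8 objects, each wordsPerObj words long (so wordsPerObj*8 <= 64).
	// It returns the updated "scanned" bits and the active pointer bitmap.
	func scanKernel(seen, scanned uint8, ptrBitmap uint64, wordsPerObj int) (newScanned uint8, activePtrs uint64) {
		activeObjs := seen &^ scanned // difference: objects to scan this pass
		newScanned = seen | scanned   // union: everything seen is now scanned

		// Expand one bit per object into wordsPerObj bits, one per word.
		var activeWords uint64
		for i := 0; i < 8; i++ {
			if activeObjs&(1<<i) != 0 {
				objMask := uint64(1)<<wordsPerObj - 1
				activeWords |= objMask << (i * wordsPerObj)
			}
		}

		// Intersect with the allocator's pointer/scalar bitmap: only words
		// holding pointers inside unscanned live objects survive.
		activePtrs = activeWords & ptrBitmap
		return
	}

Step 6 then visits each set bit of activePtrs (for example, by repeatedly clearing the lowest
set bit with bm &= bm - 1 and finding each word index with bits.TrailingZeros64), loading the
pointer stored at each such word.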
Part of what makes this fast is the VGF2P8AFFINEQB instruction,
part of the "Galois Field New Instructions" x86 extension,
and the bit manipulation Swiss army knife we referred to above.
It's the real star of the show, since it lets us do step (3) in the scanning kernel very, very
efficiently.
It performs a bit-wise affine transformation,
treating each byte in a vector as itself a mathematical vector of 8 bits
and multiplying it by an 8×8 bit matrix.
This is all done over the Galois field GF(2),
which just means multiplication is AND and addition is XOR.
The upshot of this is that we can define a few 8×8 bit matrices for each
object size that perform exactly the 1:n bit expansion we need.
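As a rough illustration, here's the same operation modeled in Go, with a simplified bit-ordering
convention rather than the instruction's exact semantics, along with a matrix that performs a
1:2 bit expansion:

	import "math/bits"

	// gf2p8affine models the per-byte operation: output bit j is the parity
	// of (matrix row j AND the input byte), a GF(2) dot product in which
	// multiplication is AND and addition is XOR.
	func gf2p8affine(b byte, rows [8]byte) byte {
		var out byte
		for j := 0; j < 8; j++ {
			out |= byte(bits.OnesCount8(rows[j]&b)&1) << j
		}
		return out
	}

	// expand2 builds a matrix that copies input bit j/2 to output bit j, so
	// the low 4 bits of an "active objects" byte become 8 "active words"
	// bits for 2-word objects: 0b0011 becomes 0b00001111.
	func expand2() [8]byte {
		var rows [8]byte
		for j := 0; j < 8; j++ {
			rows[j] = 1 << (j / 2)
		}
		return rows
	}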
For the full assembly code, see this
file.
The "expanders" use different matrices and different permutations for each size class,
so they're in a separate file
that's written by a code generator.
Aside from the expansion functions, it's really not a lot of code.
Most of it is dramatically simplified by the fact that we can perform most of the above
operations on data that sits purely in registers.
And hopefully, this assembly code will soon be replaced with Go code!
Credit to Austin Clements for devising this process.
It's incredibly cool, and incredibly fast!
Evaluation
So that's it for how it works.
How much does it actually help?
It can be quite a lot.
Even without the vector enhancements, we see reductions in garbage collection CPU costs
between 10% and 40% in our benchmark suite.
For example, if an application spends 10% of its time in the garbage collector, then that
would translate to between a 1% and 4% overall CPU reduction, depending on the specifics of
the workload.
A 10% reduction in garbage collection CPU time is roughly the modal improvement.
(See the GitHub issue for some of these details.)
We've rolled Green Tea out inside Google, and we see similar results at scale.
We're still rolling out the vector enhancements,
but benchmarks and early results suggest this will net an additional 10% GC CPU reduction.
While most workloads benefit to some degree, there are some that don't.
Green Tea is based on the hypothesis that we can accumulate enough objects to scan on a
single page in one pass to counteract the costs of the accumulation process.
This is clearly the case if the heap has a very regular structure: objects of the same size at a
similar depth in the object graph.
But there are some workloads that often require us to scan only a single object per page at a time.
This is potentially worse than the graph flood because we might be doing more work than before while
trying to accumulate objects on pages and failing.
The implementation of Green Tea has a special case for pages that have only a single object to scan.
This helps reduce regressions, but doesn't completely eliminate them.
However, it takes a lot less per-page accumulation to outperform the graph flood
than you might expect.
One surprising result of this work was that scanning a mere 2% of a page at a time
can yield improvements over the graph flood.
Availability
Green Tea is already available as an experiment in the recent Go 1.25 release and can be enabled
by setting the environment variable GOEXPERIMENT to greenteagc at build time.
This doesn't include the aforementioned vector acceleration.
We expect to make it the default garbage collector in Go 1.26, but you'll still be able to opt out
with GOEXPERIMENT=nogreenteagc at build time.
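Concretely, both settings are applied at build time, for example:

	GOEXPERIMENT=greenteagc go build ./...   # opt in on Go 1.25
	GOEXPERIMENT=nogreenteagc go build ./... # opt out once Green Tea is the default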
Go 1.26 will also add vector acceleration on newer x86 hardware, and include a whole bunch of
tweaks and improvements based on feedback weâve collected so far.
If you can, we encourage you to try it out at Go tip-of-tree!
If you prefer to use Go 1.25, we'd still love your feedback.
See this GitHub comment for some details on what diagnostics we'd be interested in seeing, if you
can share them, and the preferred channels for reporting feedback.
The journey
Before we wrap up this blog post, let's take a moment to talk about the journey that got us here:
the human element of the technology.
The core of Green Tea may seem like a single, simple idea, like a spark of inspiration that just
one person had.
But that's not true at all.
Green Tea is the result of work and ideas from many people over several years.
Several people on the Go team contributed to the ideas, including Michael Pratt, Cherry Mui, David
Chase, and Keith Randall.
Microarchitectural insights from Yves Vandriessche, who was at Intel at the time, also really helped
direct the design exploration.
There were a lot of ideas that didn't work, and a lot of details that needed figuring out, just to
make this single, simple idea viable.
It was a long road to where we are today.
The seeds of this idea go all the way back to 2018.
What's funny is that everyone on the team thinks someone else thought of this initial idea.
Green Tea got its name in 2024 when Austin worked out a prototype of an earlier version while cafe
crawling in Japan and drinking LOTS of matcha!
This prototype showed that the core idea of Green Tea was viable.
And from there we were off to the races.
Throughout 2025, as Michael implemented and productionized Green Tea, the ideas evolved and changed even
further.
This took so much collaborative exploration because Green Tea is not just an algorithm, but an
entire design space, one that we don't think any of us could've navigated alone.
It's not enough to just have the idea; you need to figure out the details and prove it.
And now that we've done it, we can finally iterate.
The future of Green Tea is bright.
Once again, please try it out by setting GOEXPERIMENT=greenteagc and let us know how it goes!
We're really excited about this work and want to hear from you!
