Dissecting the CPU-memory relationship in garbage collection (OpenJDK 26)

(norlinder.nu)

31 points | by jonasn 1 day ago

4 comments

jonasn 1 day ago
Hi HN, I'm the author of this post and a JVM engineer working on OpenJDK.
I've spent the last few years researching GC for my PhD and realized that the ecosystem lacked standard tools to quantify GC CPU overhead—especially with modern concurrent collectors where pause times don't tell the whole story.
To fix this blind spot, I built a new telemetry framework into OpenJDK 26. This post walks through the CPU-memory trade-off and shows how to use the new API to measure exactly what your GC is costing you.
I'll be around and am happy to answer any questions about the post or the implementation!
[-]
- spockz 1 hour ago
  Thank you for this interface! It will definitely help in tracking down GC related performance issues or in selecting optimal settings.
  One thing that I still struggle with, is to see how much penalty our application threads suffer from other work, say GC. In the blog you mention that GC is not only impacting by cpu doing work like traversing and moving (old/live) objects but also the cost of thread pauses and other barriers.
  How can we detect these? Is there a way we can share the data in some way like with OpenTelemetry?
  Currently I do it by running a load on an application and retaining its memory resources until the point where it CPU skyrockets because of the strongly increasing GC cycles and then comparing the cpu utilisation and ratio between cpu used/work.
  Edit: it would be interesting to have the GC time spent added to a span. Even though that time is shared across multiple units of work, at least you can use it as a datapoint that the work was (significantly?) delayed by the GC occurring, or waiting for the required memory to be freed.
  [-]
  - jonasn 1 hour ago
    Thanks for reading! Your current method, pushing the load until the GC spirals and then comparing the CPU utilization, is exactly the painful, trial-and-error approach I'm hoping this new API helps alleviate.
    You've hit on the exact next frontier of GC observability. The API in JDK 26 tracks the explicit GC cost (the work done by the actual GC threads). Tracking the implicit costs, like the overhead of ZGC's load barriers or G1's write barriers executing directly inside your application threads, along with the cache eviction penalties, is essentially the holy grail of GC telemetry.
    I have spent a lot of time thinking about how to isolate those costs as part of my research. The challenge is that instrumenting those barrier events in a production VM without destroying application throughput (and creating observer effects) is incredibly difficult. It is absolutely an area of future research I am actively thinking about, but there isn't a silver bullet for it in standard HotSpot just yet.
    Something that you could look at there are some support to analyze with regards to thread pauses is time to safepoint.
    Regarding OpenTelemetry. MemoryMXBean.getTotalGcCpuTime() is exposed via the standard Java Management API, so it should be able to hook into this.
cogman10 1 hour ago
At my work, one thing that I've often had to explain to devs is that the Parallel collector (and even the serial collector) are not bad just because they are old or simple. They aren't always the right tool, but for us who do a lot of batch data processing, it's the best collector around for that data pipeline.
Devs keep on trying to sneak in G1GC or ZGC because they hyper focus on pause time as being the only metric of value. Hopefully this new log:cpu will give us a better tool for doing GC time and computational costs. And for me, will make for a better way to argue that "it's ok that the parallel collector had a 10s pause in a 2 hour run".
[-]
- jonasn 53 minutes ago
  Every GC algorithm in HotSpot is designed with a specific set of trade-offs in mind.
  ZGC and G1 are fantastic engineering achievements for applications that require low latency and high responsiveness. However, if you are running a pure batch data pipeline where pause times simply don't matter, Parallel GC remains an incredibly powerful tool and probably the one I would pick for that scenario. By accepting the pauses, you get the benefit of zero concurrent overhead, dedicating 100% of the CPU to your application threads while they are running.
  [-]
  - cogman10 49 minutes ago
    Gotta be honest, I have a hard time arguing for G1 over ZGC. It seems to me like any situation you'd want G1 you probably want ZGC instead. That default 200ms target latency is already pretty long. If you've made that tradeoff for G1 because you wanted lower latency, you probably are going to be happier with ZGC.
    I also find that the parallel collector is often better than G1, particularly for small heaps. With modern CPUs, parallel is really fast. Those 200ms pauses are pretty easy to achieve if you have something like a 4gb heap and 4 cores.
    The other benefit of the parallel collector is the off heap memory allocation is quiet low. It was a nasty surprise to us with G1 how much off heap memory was required (with java 11, I know that's gotten a lot better).
cvoss 1 hour ago
> This freed programmers from managing complex lifecycle management.
It also deceived programmers into failing to manage complex lifecycles. Debugging wasted memory consumption is a huge pain.
firefly2000 1 hour ago
Are there plans to elucidate implicit GC costs as well?
[-]
- jonasn 1 hour ago
  Great question! I actually just touched on this in another thread that went up right around the same time you asked this. It is clearly the next big frontier!
  The short answer is: It's something I'm actively thinking about, but instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult.