Accelerating Gemma 4: faster inference with multi-token prediction drafters

(blog.google)

131 points | by amrrs 1 hour ago

16 comments

zdw 1 hour ago
MTP support is being addedto llama.cpp, at least for the Qwen models ( https://github.com/ggml-org/llama.cpp/pull/20533) and I'd imagine Gemma 4 will come soon.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
[-]
- tarruda 41 minutes ago
  There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673
- WhitneyLand 53 minutes ago
  Yeah important conceptually to remember MTP is kind of just more weights, but speculative decoding is the runtime algorithm that’s a significant add to whatever code is serving the model.
  [-]
  - HumanOstrich 37 minutes ago
    That is.. inaccurate.
- EGreg 57 minutes ago
  How does this get added in practice?
  [-]
  - flakiness 51 minutes ago
    According to the linked PR, the original model does come with MTP which is another "head" (=output path) in the same model and (supposedly) runs faster.
    The current implementation ignores that head but the PR let the tool recognize it, plus does proper integration (run the MTP while running the slower main path then compare the result, I believe.)
- dakolli 1 hour ago
  yet, still mostly useless.
skybrian 1 hour ago
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
[-]
- macNchz 42 minutes ago
  This is something I've been thinking about for a while...the current state of things really does feel kind of like the dialup era, wondering what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a jpeg load a few rows of pixels at a time, and the various different loading and connecting animations that applications implemented before things got fast enough to make them less relevant.
  Some of the work in that direction like Cerebras or Taalas have been doing is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder about what might be possible if even current state of the art models were available at like, a million tokens per second at a very low cost.
- garciasn 15 minutes ago
  You're right about it being reminiscent of the dial-up area, but I don't believe it's 300 to 1200; it's more like 4800:
  Modem vs Claude according to Claude:
  300 @ 2368 characters - 1m 19s
  1200 @ 2368 characters - 19.7s
  2400 @ 2368 characters - 9.9s
  14.4K @ 2368 characters - 1.6s
  33.6K @ 2368 characters - 705 ms
  56K @ 2368 characters - 447 ms
  Claude @ 2368 characters - 7.9s
- jeffhuys 21 minutes ago
  Check chatjimmy.ai
- MagicMoonlight 7 minutes ago
  There was a startup posted here which built custom hardware that let the AI respond instantly. Thousands of tokens per second.
  [-]
  - zargon 1 minute ago
    Groq.
julianlam 11 minutes ago
Really excited to try this once it is merged into llama.cpp.
Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5 speedup is tantalizing.
Have tried draft models to limited success (the smaller 3B draft model in addition to a dense 14B Ministral model introduced too much overhead already)
msp26 6 minutes ago
Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic.
However, it is a little painful to try to fit the best possible version into 24GB vram with vision + this drafter soon. My build doesn't support any more GPUs and I believe I would want another 4090 (overpriced) for best performance or otherwise just replace it altogether.
AbuAssar 3 minutes ago
these are the updated models:
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-E2B-it-assistant
these 1 hour ago
Has anyone managed to get this to work in LM Studio? They've got a option in the UI, but it never seems to allow me to enable it.
[-]
- dvt 1 hour ago
  It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while.
  [1] https://github.com/ml-explore/mlx-lm/pull/990
  [2] https://github.com/ggml-org/llama.cpp/pull/22673
- Havoc 53 minutes ago
  Normally when LM Studio doesn't like it it's because of the presence of mmproj files in the folder. Sometimes removing them helps it show up.
  They're somehow connected to vision & block speculative decode...don't ask me how/why though
  For gemma specifically had more luck with speculative using the llama-server route than lm studio
- AlphaSite 50 minutes ago
  Yes. Make sure you’re not using the Gemma sparse models since they don’t have a small model to use. Also I removed all the image models from the workspace.
- svachalek 1 hour ago
  I've gotten it to work with other models. They've got to be perfectly aligned usually, in terms of provider, quantization etc. Might be a bit before you can get a matched set.
christina97 41 minutes ago
I recently set up the 26B A4B model up on vLLM on an RTX3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for sub-$1k investment.
I tried first with Qwen but it was unstable and had ridiculously long thinning traces!
[-]
- jszymborski 38 minutes ago
  The A4B model is blazing fast and the model is super good at general inquiries. Notably worse than Qwen 3.6 for coding tasks but that says more about the Qwen model.
recsv-heredoc 35 minutes ago
CloudFlare offers excellent service for many of the open-weights models. It's fast, cheap and simple to set up. Can highly suggest as an LLM provider.
They serve gemma-4-26b-a4b-it.
[-]
- andruby 27 minutes ago
  They do indeed. See https://developers.cloudflare.com/workers-ai/models/ They seem to allow some free usage without user account. Do they list limits anywhere?
pu_pe 46 minutes ago
So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?
[-]
- tarruda 34 minutes ago
  They also published draft models for E4B and E2B. For those, the draft models are only 78m parameters: https://huggingface.co/google/gemma-4-E4B-it-assistant
- coder543 29 minutes ago
  MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.
  [-]
  - a_e_k 6 minutes ago
    From the linked post, it didn't read like a separate KV cache was needed:
    > The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
    [-]
    - coder543 3 minutes ago
      [delayed]
disiplus 1 hour ago
nice, will run it later agains qwen3.6 27b, the speed was one of the reasons why in was running qwen and not gemma. the difference was big, there is some magic that happpens when you have more then 100tps.
mchusma 1 hour ago
I find it puzzling Google doesn’t actively promote its own cloud for inference of Gemma 4. Open source is great, love it. But shouldn’t Google want me to be able to use and pay for it through Gemini and vertex?
[-]
- Havoc 51 minutes ago
  There is a decent yt here going through what google's logic with gemma overall might be
  https://www.youtube.com/watch?v=sXgZhGzqPmU
  As for why cloud offer it - think it's just an effort to promote the brand. The gemmas are pretty small so they can host it without it being a major drain on the company. They have the infra anyway
- Farmadupe 1 hour ago
  I wonder if for a model that small with a permissive license it might not be worth their time to host a commercial grade inference stack?
  Might be easier to chuck it over the fence and let other providers handle it as it'll run in almost any commercial grade card?
  Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini flashlight depending on serving cost and quality of outputs?
  As a comparison, despite being SotA for their size, the smallest qwen models on openrouter (27b and 35b) are not at all worth using, as there are way bigger and better models for less oricemon a per token basis
  [-]
  - disiplus 1 hour ago
    i dont know what are you talking about, i replaced an older gpt4o with a finetuned qwen. there is a huge amount of "AI, that can be done with those models, or partly by those models." Huge amount of people would not notice the difference. And if you prepare the context correctly, even bigger slice of people would not notice.
    [-]
    - dakolli 1 hour ago
      Genuinely curious, what are you "fine tuning" these smaller models to do reliably? I hear this talked about a lot but very few people actually cough up examples, and I'd love to actually hear of one.
      [-]
      - disiplus 41 minutes ago
        depends, a super small one finetuned to do function calling instead sending it to big model and waiting, instead, you ask for a revenue in last month, i do a small llm function call -> show results. some bigger ones, analysis, summary, classification. what is great with smaller ones, and im looking at 2b, 4b is you can get a huge throughput with just vllm and a couple of consumer gpus. what i usually do is basically distillation of a big one onto smaller one.
- nolist_policy 27 minutes ago
  What do you mean? It just works with Google AI Studio.
- seamossfet 1 hour ago
  [dead]
nalinidash 52 minutes ago
technical details are here: https://x.com/googlegemma/status/2051694045869879749
deskamess 38 minutes ago
Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.
shay_ker 54 minutes ago
curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
[-]
- zargon 49 minutes ago
  They're using the term speculative decoding but doing MTP. It's the same thing as Nemotron, but Google removed the MTP heads from the original safetensora release. (They were not removed from the LiteRM format.)
brcmthrowaway 38 minutes ago
Is Google's local model strategy tuned to pegging down big AI cloud labs a notch?
m3kw9 37 minutes ago
ok so? Anyone got a verdict/review?