Unsloth Dynamic 2.0 GGUFs

(unsloth.ai)

62 points | by tosh 3 hours ago

10 comments

Maxious 2 hours ago
ICYMI unsloth has had some major breakthroughs today with the Qwen3.5 local models https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX 5080 16GB.
- danielhanchen 27 minutes ago
  Oh I didn't expect this to be on HN haha - but yes for our new benchmarks for Qwen3.5, we devised a slightly different approach for quantization which we plan to roll out to all new models from now on!
- Kayou 2 hours ago
  Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible. I was always restricting myself to models smaller than the VRAM I had.
  - Koffiepoeder 37 minutes ago
    The A3B part in the name stands for `Active 3B`, so for the inference jobs a core 3B is used in conjunction with another subpart of the model, based on the task (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.
  - segmondy 2 hours ago
    llama.cpp is designed for partial offloading: the most important part of the model is loaded onto the GPU and the rest into system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.
  - Maxious 1 hour ago
    Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe
    There are also some experiments with removing or merging experts post-training to shrink models even more https://bknyaz.github.io/blog/2026/moe/
  - nurettin 38 minutes ago
    This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.
- mirekrusin 1 hour ago
  2x RTX 4090, Q8, 256k context, 110 t/s
- jychang 2 hours ago
  Not really breakthroughs, more like bugfixes for their broken first batch.
  - danielhanchen 12 minutes ago
    No, this is false - unsure if you saw our new blog post, https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks, which shows SOTA on nearly all bits, and we shared all our research as well
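The size question in the thread above (a Q4 file over 20GB on a 16GB card) comes down to simple arithmetic, which can be sketched as follows. This is a rough illustration, not how llama.cpp actually plans its offload: the 4.8 bits per weight, the 4 GB held back for KV cache, and the helper names `quantized_size_gb` and `offload_plan` are all assumptions made for the example. In typical MoE implementations the router picks experts per token, so every expert still has to be stored somewhere (VRAM or system RAM); the "active" figure describes how many weights are read per token, not how many must be stored.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring KV cache and overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def offload_plan(total_gb: float, vram_gb: float, reserve_gb: float = 0.0) -> dict:
    """Split the weights between GPU and system RAM for a given VRAM budget."""
    usable = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(total_gb, usable)
    return {"gpu_gb": on_gpu, "cpu_gb": total_gb - on_gpu}


# Assumed figures: 35e9 total and 3e9 active parameters (read off the model name),
# 4.8 bits per weight as a stand-in for a Q4-class quant (not a measured value).
total = quantized_size_gb(35e9, 4.8)    # all experts must be stored somewhere
active = quantized_size_gb(3e9, 4.8)    # weights actually read for one token
plan = offload_plan(total, vram_gb=16.0, reserve_gb=4.0)  # 4 GB reserve is an assumption

print(f"total={total:.1f} GB, active per token={active:.1f} GB, plan={plan}")
```

Under these assumptions the full file does not fit in VRAM, but the per-token working set is small, which is consistent with partial offloading staying usable for sparse models.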
qskousen 1 hour ago
This is pretty interesting. Based on the blog post, it seems they are using a technique similar to what I have been using to generate "layer sensitivity" data in my (still pretty beta) ggufy project, which is aimed more at diffusion (image) models. https://github.com/qskousen/ggufy
tenpa0000 1 hour ago
I run Llama 3.2 3B locally for latency-sensitive classification (sub-50ms, so no room for bigger models). At that scale Q2_K vs Q4_K_M isn't just smaller — Q2 starts flipping yes/no answers that Q4 gets right. Not often, but enough to notice in production.
So the KL divergence numbers here are more useful to me than the MMLU tables honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.
Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.
- zozbot234 1 hour ago
  For a simple classification task you generally want to prioritize regularization over more sophisticated behavior, so fewer parameters with larger quantization makes sense. For more generic chat-like purposes, Q2 of a larger model may often be preferable to Q4 of a smaller one.
- am17an 1 hour ago
  What do you use for sub-50ms inference?
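The distinction tenpa0000 draws above, between an answer flipping and the output distribution merely drifting, can be sketched with a toy yes/no classifier. This is a minimal illustration in plain Python: the logits are made-up numbers, not outputs from any real model or quant, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; p is the reference distribution, q the quantized one."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# Toy logits over the labels ["yes", "no"]; invented for illustration only.
reference = softmax([2.0, 1.0])    # full-precision model prefers "yes"
mild_drift = softmax([1.8, 1.2])   # still "yes", slightly less confident
flipped = softmax([1.4, 1.6])      # now prefers "no"

for name, dist in [("mild_drift", mild_drift), ("flipped", flipped)]:
    same_answer = argmax(dist) == argmax(reference)
    print(name, round(kl_divergence(reference, dist), 4), same_answer)
```

An accuracy-style benchmark only registers the second case, while KL divergence is non-zero in both, which is one way to read the observation that MMLU can hold steady while the distribution shifts.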
electroglyph 1 hour ago
Cheers Daniel and Mike and team, keep up the good work!
- danielhanchen 27 minutes ago
  Thank you!
Havoc 2 hours ago
Advances in this space are always welcome.
I see the change in KLD values is pretty modest vs the prior version. Does anyone know how that translates to the real world? Is it more of a linear type of situation, or exponential, etc.?
- danielhanchen 18 minutes ago
  Yes, the new blog post https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks has some benchmarks from community members comparing our quants vs others on LiveCodeBench, for example!
dyl000 1 hour ago
So q6 is practically perfect, and q3 is meaningfully decent. Very impressive!
jychang 2 hours ago
What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
- danielhanchen 26 minutes ago
  Didn't expect this on HN again either, haha - probably related to Qwen3.5
- tosh 2 hours ago
  Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5
  https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
  - jychang 1 hour ago
    I'm aware of that, but that's not the link of the post. The post is linking to their UD 2.0 quants from a few months back.
    Also, the benchmarks are because they messed up the first version of their quants for Qwen 3.5 by quanting some tensors to mxfp4 that should have been in higher quality, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.
    - danielhanchen 19 minutes ago
      Didn't expect this to be on HN haha - but HN does sometimes have older posts come up.
      No your conclusion is false - only the old Q4_K_XL had slightly higher perplexity, all other quants are fine. We uploaded 9TB of research artifacts to https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-G... for the community.
      If you read our blog, it says KLD and PPL are actually sometimes counterintuitive - for example, on MiniMax some of our quants do worse on PPL and KLD than AesSedai's, but AesSedai's does worse on LiveCodeBench by a lot, see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-...
      The reason is covered at https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-... - although bit widths are in general monotonic, i.e. q2_k < q3_k < q4_k < q5_k etc., we find KLD and PPL are not always monotonic, i.e. q3_k can actually have BETTER PPL than q4_k.
      So the main point is bad luck on quantization - sometimes lower bits might get lower PPL and KLD, but this is misleading, since on actual real-world tasks the lower-bit quant is worse.
  - lostmsu 1 hour ago
    Looking at their benchmarks, there doesn't appear to be a meaningful difference between their quants and bartowski's quants.
    - danielhanchen 25 minutes ago
      No, our new Qwen3.5 ones show the opposite; see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
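The non-monotonicity point danielhanchen makes in the thread above (a lower-bit quant occasionally scoring better PPL than a higher-bit one) can be checked mechanically once perplexities are in hand. The sketch below uses the standard definition of perplexity plus a hypothetical `inversions` helper; the numbers in `toy` are placeholders chosen only to exercise the check and are not measurements from the blog post or from any model.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def inversions(ppl_by_quant):
    """Find adjacent pairs where the lower-bit quant has the better (lower) PPL.

    ppl_by_quant: list of (name, ppl) tuples ordered from fewest to most bits.
    """
    return [
        (low_name, high_name)
        for (low_name, low_ppl), (high_name, high_ppl) in zip(ppl_by_quant, ppl_by_quant[1:])
        if low_ppl < high_ppl
    ]


# Placeholder values, NOT real benchmark results.
toy = [("q2_k", 9.0), ("q3_k", 7.0), ("q4_k", 7.5), ("q5_k", 6.8)]
print(inversions(toy))

# A model that assigns probability 0.5 to every token has perplexity of about 2.
print(perplexity([math.log(0.5)] * 4))
```

A non-empty result only says that PPL ordering and bit-width ordering disagree; per the comment above, whether the lower-bit quant is actually better would still need a task-level benchmark.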
aichen_dev 2 hours ago
[dead]
MarcLore 2 hours ago
[dead]
shablulman 3 hours ago
[dead]