Qwen 3.6 27B is the sweet spot for local development

(quesma.com)

143 points | by stared 1 hour ago

27 comments

Otternonsenz 1 minute ago
Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?
I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).
And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
bensyverson 38 minutes ago
The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
[-]
- dofm 27 minutes ago
  The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.
  I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
  I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
  Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
  The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
  [-]
  - ddalex 6 minutes ago
    I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you
  - rusk 17 minutes ago
    > I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled.
    I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
- Catloafdev 30 minutes ago
  The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.
  [-]
  - thewebguyd 20 minutes ago
    I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.
    If you want to run unquantized, you definitely need 128GB.
    [-]
    - Catloafdev 15 minutes ago
      Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.
- porphyra 18 minutes ago
  You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).
  In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
  [1] https://x.com/MiaAI_lab/status/2070859135399182444
  [2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM
  [-]
  - esperent 15 minutes ago
    > 48GB of VRAM with, say, two 3090s
    So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
- nozzlegear 27 minutes ago
  Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.
- stymaar 7 minutes ago
  > The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
  Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
  Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
- dannyw 24 minutes ago
  I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.
  [-]
  - doodlesdev 11 minutes ago
    How much does one of those cost in the US? Here in Brazil, your notebook is worth as much as a used Honda Fit, which seems absolutely insane. For comparison, the ThinkPad I'm currently running cost me 1/20 of how much this MBP costs here, leaving me with over $8.000 to spend with LLM inference (if I actually spent money with that).
    [-]
    - dannyw 0 minutes ago
      I purchased mine for approximately $4400 AUD before the price hikes. That unit is now ~$5100 AUD.
      I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
- cyanydeez 0 minutes ago
  AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k.
  I think you might be a little to into the stew here.
- organsnyder 18 minutes ago
  I run Qwen 3.6 on my Framework Desktop 128GB, and it's very performant. I know Framework has had to raise the price since I preordered mine, but they're still well under half the cost of that Macbook.
  [-]
  - andy99 13 minutes ago
    I get ~55 Tok/s on my framework desktop with the 35B A3B q8 model, and so far am also very happy with the coding performance.
- dvduval 12 minutes ago
  Absolutely for the average developer the token speed is just going to be too slow for it to be workable. I think we’re looking at 2028 when memory becomes cheaper again and they’ll be a lot more people using local models.
- georgeven 23 minutes ago
  I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)
- Insanity 37 minutes ago
  But you have to factor in that this device will last you 5-10 years. That said, I wouldn't spend almost $7k USD on this macbook lol.
  [-]
  - petilon 33 minutes ago
    Memory requirements of newer models will increase, so while the hardware may last 10 years it won't be able to run the latest models for 10 years.
    [-]
    - roadside_picnic 22 minutes ago
      My experience working in the open model space pretty deeply (both LLMs and diffusion models) for years now is that it is not quite as simple as that.
      In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few eyars. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
      Similarly the standard "big" but localish open model back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
      So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
      The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
    - Insanity 29 minutes ago
      You raise a fair point, but I'm not convinced it'll offer a meaningful difference in performance as long as we're stuck with the current AI paradigm.
    - bluGill 23 minutes ago
      Will they? Or will we find ways to optimize models and need less? Only time will tell.
    - simonw 27 minutes ago
      It can't run the latest models today - GLM-5.2 class models already need 1TB+ of RAM.
      ... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
  - someperson 34 minutes ago
    In 5-10 years, incremental cloud tokens will be far cheaper (likely but not guaranteed).
  - jubilanti 11 minutes ago
    > But you have to factor in that this device will last you 5-10 years.
    Hahhahahahhahahahhahahhahaha
- oldfuture 28 minutes ago
  a lot of credits? we can’t predict any price change for them
- AnimalMuppet 19 minutes ago
  How many credits would it buy? How long would it take to use them up? What's the payback period?
  From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
  [-]
  - eli 0 minutes ago
    Are you comparing the cost of hosted Opus to running Qwen 3.6 locally? That doesn't really seem fair.
- h4ny 32 minutes ago
  What kind of narrative are you trying to push?
  Do you know how much VRAM/unified is needed for the 27B model, which is generally regarded as better between the two compared in the article, is needed with little to no KLD loss and at 256k context?
  Also, once you worked out how much memory is needed for that, maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?
  And when you have answered that, can you tell us how much privacy costs? Maybe also tell us how private OpenRouter is?
  Edit: looking at other replies that are basically pointing out the same thing I did, I guess it's my wording. It's frustrating that people who misinform others in some nicely packaged ways or just simply uninformed get to keep doing that if they sound nice. Thanks.
  [-]
  - kllrnohj 28 minutes ago
    > maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?
    Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
    But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"
    [-]
    - h4ny 22 minutes ago
      That's my point. You can run Qwen3.6 27B with MTP and whatever else you want to bolt onto it at 256k context for much less than even a Ryzen AI Max 395+ with 128GB would cost. Even unquantized you don't need 128 GB so given your comment and the downvotes maybe I didn't word my original comment properly for this?
blopker 7 minutes ago
I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.
However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.
Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.
Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.
Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.
While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.
Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
[-]
- iwontberude 1 minute ago
  > I don't think coding is one
  Certainly this is falsifiable easily by any of us doing it on a regular basis
  > Qwen stuck in thought loops
  This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this
onion2k 43 minutes ago
None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
[-]
- sosodev 15 minutes ago
  In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There's just too many decisions to be made and they're not good at that.
  Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
- h4ny 27 minutes ago
  > In my limited experiments Qwen 3.5 (maybe 3.6 is loads better)
  1. Maybe you should tell us what those limited experiments are.
  2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
  3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
jjcm 3 minutes ago
I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.
Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong
SkitterKherpi 1 minute ago
27-30B in general seems to be the level where you actually start having decent models. I just wish consumer hardware hadn't stagnated so much that we can't easily go higher than that, and that even running those requires a $5k machine now.
doodlesdev 3 minutes ago
I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?
(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
markdog12 2 minutes ago
I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.
beastman82 30 minutes ago
FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.
QAT, MTP, 128k context.
I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
[-]
- kofu 24 minutes ago
  My experience also aligns with this. I'm running gemma4 31B on a 4090 through llm.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality in the first try (Rust, C++ and C#). so it needs less revisions to be at a level I'm comfortable for merging.
- accrual 25 minutes ago
  Nice. I flip flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plan or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.
0x0000000 43 minutes ago
> ... on my Macbook Max M5 128 GB
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
[-]
- kllrnohj 33 minutes ago
  You don't need nearly that much RAM to run Qwen 3.6 27B, though. qwen3.6:27b-q4_K_M is only 17GB, for example.
  [-]
  - DanHulton 4 minutes ago
    This is what I run on an M5 MacBook Air 32GB. Works great.
    I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
    Definitely the sweet spot for me.
- mr_mitm 8 minutes ago
  Think commercial. My company invested in a local rig since privacy is important to our customers and sometimes I want to use these models on private data.
- wpm 41 minutes ago
  It wasn't $10k a month ago
- rhdunn 27 minutes ago
  A 27B model can fit easily on a 32GB VRAM card (e.g. 5090) or a 32GB computer in RAM at FP8/Q8 (unsloth have 28.6GB Q8 files).
  For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
- __s 28 minutes ago
  I'm on 128GB ram strix halo, bought framework desktop for a few thousand CAD back when everyone was calling framework desktop overpriced
- spike021 33 minutes ago
  Certainly won't work on my M4 Pro with 24GB lol
  [-]
  - whynotmaybe 23 minutes ago
    I feel you!
    Sent from my 8gb M2 Mac mini.
rhgraysonii 49 minutes ago
I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?
[-]
- dofm 46 minutes ago
  You might be interested in Ornith 1.0 9B, which is a new intriguing post-training of Qwen 3.5 9B.
  Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.
  I don't know about 48GB but 64GB should be enough.
  [-]
  - simonw 19 minutes ago
    I've been trying Ornith 1.0 35B, I'm pretty impressed with it: https://simonwillison.net/2026/Jun/29/ornith/
  - rhgraysonii 44 minutes ago
    Thanks! I was thinking of doing the 128gb to have some future proofing. I figure at this point, it's akin to a mechanic keeping great tools around, when it comes to having this sort of homelab and exposing it for your own uses. And great practice for building the next era of user facing computing that will be around as this proliferates.
    [-]
    - dofm 37 minutes ago
      I would not buy a 64GB model again, probably, if this were to remain particularly important to me. But I gather memory bandwidth is pretty important here.
      So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
      There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
      The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
      (And I must reiterate that my understanding of this stuff is pretty naïve.)
kpw94 45 minutes ago
> What it does:
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
RedCinnabar 39 minutes ago
Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.
[-]
- giancarlostoro 38 minutes ago
  You need it to run in about 8 GB so you have extra space for the context window.
- Catloafdev 37 minutes ago
  Hello, it's the internet calling, today is that day.
  https://github.com/ikawrakow/ik_llama.cpp
  Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
seemaze 31 minutes ago
I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.
https://pi-local-coding-bench.dev
dmezzetti 1 minute ago
Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.
aand16 55 minutes ago
I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!
[-]
- lor_louis 53 minutes ago
  Do no give me hope like that.
- layer8 30 minutes ago
  Are RAM prices down?
- mendeza 48 minutes ago
  I am eagerly waiting!
mbgerring 22 minutes ago
Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?
blobbers 47 minutes ago
How does llama.cpp use the GPU efficiently as opposed to MLX?
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
[-]
- dannyw 30 minutes ago
  Llama.cpp uses the GPU very effectively because inference of LLMs is very rudimentary and basically as simple as your GPU memory bandwidth. That's essentially the baseline performance ceiling, with model-specific optimisations like MTP potentially increasing it.
  The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.
  The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
HotGarbage 55 minutes ago
And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.
[-]
- dofm 49 minutes ago
  It will run (somewhat slowly) on a five year old M1 Max with 64GB RAM.
  Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.
  I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.
  I would really like to see a 12B or 16B Qwen 3.6.
  I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.
  Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.
  [-]
  - sleepyeldrazi 32 minutes ago
    I need to ask, since I have desperately wanted to make Gemma 4 12B work, but im not sure if its the quant (i usually up it to q8, which is a lot higher than iq4_nl that i use for 3.6 27B) or the model itself, but it just starts confusing itself really quickly when I give it coding tasks. And quickly starts failing tool calls.
    I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).
    How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.
anonym29 41 minutes ago
Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.
[-]
- BoredomIsFun 3 minutes ago
  > I get hallucinated tool call parameters and bizarre invocations
  tweaking sampler might help
cat_plus_plus 11 minutes ago
Gemma4 31B with MTP enabled is faster and I feel a bit stronger at coding. Either one can run in 32GB VRAM or unified RAM with some tuning (3 bit weights, 8 bit kv cache)
mikert89 42 minutes ago
none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model
[-]
- jlongr 40 minutes ago
  skill issue
verdverm 15 minutes ago
Qwen's new AgentWorld model is good too: https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
ascii0eks84 44 minutes ago
Very capable lora adapters are surfacing but it seems they are very niche.
[-]
- DenisM 29 minutes ago
  Can you share more? It’s the first I hear of lora outside research papers. Practical applications would be great to see.
  Lora if effective could be a great reason to run local models.
CurbStomper 2 minutes ago
[dead]
rusk 52 minutes ago
Spent a week trying to get sensible results out of llama 3. At one point it even simulated doing the work, log output and everything and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
[-]
- culi 48 minutes ago
  You might find this helpful. llama is not anywhere near the Pareto distribution (performance vs cost)
  https://arena.ai/leaderboard/code/webdev/pareto?license=open...
  https://arena.ai/leaderboard/text/pareto?license=open-source
  [-]
  - k__ 38 minutes ago
    Llama3.1 instruct seems to be doing okay on that page, mostly because it's dirt cheap.
- am17an 51 minutes ago
  llama 3? Are you from 2023?
217 57 minutes ago
This is kind of like saying grass is green to be honest
[-]
- madduci 48 minutes ago
  Like everybody got 128 GB RAM..
  [-]
  - sleepyeldrazi 38 minutes ago
    I've been running it almost since launch on a 3090 (24gb vram), you really don't need that much. Second hand those are really cheap and i get 50-70 t/s (with MTP at 2), full ctx. IQ4_NL (unsloth) on this model seems suspiciously competent, and after the (by now not so recent) updates to q4 KV on llama.cpp, I just keep going back to it after dsv4pro disappointed me for the 100th time because it gave up on a task.
  - dofm 45 minutes ago
    Doesn't need it at Q4 at least; it'll run in 64GB.