Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop

Dear Hackers, I’m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:

Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) you’re using?

What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?

What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).

I'm conducting my own investigation, which I will be happy to share as well when over.

Thanks! Andrea.

323 points | by threeturn 1 day ago

69 comments

lreeves 1 day ago
I sometimes still code with a local LLM but can't imagine doing it on a laptop. I have a server that has GPUs and runs llama.cpp behind llama-swap (letting me switch between models quickly). The best local coding setup I've been able to do so far is using Aider with gpt-oss-120b.
I guess you could get a Ryzen AI Max+ with 128GB RAM to try and do that locally but non-nVidia hardware is incredibly slow for coding usage since the prompts become very large and take exponentially longer but gpt-oss is a sparse model so maybe it won't be that bad.
Also just to point it out, if you use OpenRouter with things like Aider or roocode or whatever you can also flag your account to only use providers with a zero-data retention policy if you are truly concerned about anyone training on your source code. GPT5 and Claude are infinitely better, faster and cheaper than anything I can do locally and I have a monster setup.
[-]
- fm2606 1 day ago
  gpt-oss-120b is amazing. I created a RAG agent to hold most of GCP documentation (separate download, parsing, chunking, etc). ChatGPT finished a 50 question quiz in 6 min with a score of 46 / 50. gpt-oss-120b took over an hour but got 47 / 50. All the other local LLMs I tried were small and performed way worse, like less than 50% correct.
  I ran this on an i7 with 64gb of RAM and an old nvidia card with 8g of vram.
  EDIT: Forgot to say what the RAG system was doing which was answering a 50 question multiple choice test about GCP and cloud engineering.
  [-]
  - embedding-shape 1 day ago
    > gpt-oss-120b is amazing
    Yup, I agree, easily best local model you can run today on local hardware, especially when reasoning_effort is set to "high", but "medium" does very well too.
    I think people missed out on how great it was because a bunch of the runners botched their implementations at launch, and it wasn't until 2-3 weeks after launch that you could properly evaluate it, and once I could run the evaluations myself on my own tasks, it really became evident how much better it is.
    If you haven't tried it yet, or you tried it very early after the release, do yourself a favor and try it again with updated runners.
  - giorgioz 14 hours ago
    on what hardware you manate to run gpt-oss-120b locally?
  - whatreason 1 day ago
    What do you use to run gpt-oss here? ollama, vLLM, etc
    [-]
    - embedding-shape 15 hours ago
      Not parent, but frequent user of GPT-OSS, tried all different ways of running it. Choice goes something like this:
      - Need batching + highest total throughoutput? vLLM, complicated to deploy and install though, need special versions for top performance with GPT-OSS
      - Easiest to manage + fast enough: llama.cpp, easier to deploy as well (just a binary) and super fast, getting ~260 tok/s on a RTX Pro 6000 for the 20B version
      - Easiest for people not used to running shell commands or need a GUI and don't care much for performance: Ollama
      Then if you really wanna go fast, try to get TensorRT running on your setup, and I think that's pretty much the fastest GPT-OSS can go currently.
  - gkfasdfasdf 10 hours ago
    What were you using for RAG? Did you build your own or some off the shelf solution (e.g. openwebui)
    [-]
    - fm2606 4 hours ago
      I used pg vector chunking on paragraphs. For the answers I saved in a flat text file and then parsed to what I needed.
      For parsing and vectorizing of the GCP docs I used a Python script. For reading each quiz question, getting a text embedding and submitting to an LLM, I used Spring AI.
      It was all roll your own.
      But like I stated in my original post I deleted it without backup or vcs. It was the wrong directory that I deleted. Rookie mistake for which I know better.
  - rovr138 1 day ago
    > I created a RAG agent to hold most of GCP documentation (separate download, parsing, chunking, etc)
    If you share the scripts to gather the GCP documentation this, that'd be great. Because I have had an idea to do something like this, and the part I don't want to deal with is getting the data
    [-]
    - fm2606 4 hours ago
      I tried scripts but got blocked. I used wget to download tthem
  - lacoolj 1 day ago
    you can run the 120b model on an 8GB GPU? or are you running this on CPU with the 64GB RAM?
    I'm about to try this out lol
    The 20b model is not great, so I'm hoping 120b is the golden ticket.
    [-]
    - ThatPlayer 1 day ago
      With MoE models like gpt-oss, you can run some layers on the CPU (and some on GPU): https://github.com/ggml-org/llama.cpp/discussions/15396
      Mentions 120b is runnable on 8GB VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"
    - gunalx 1 day ago
      I have in many cases had better results with the 20b model, over the 120b model. Mostly because it is faster and I can iterate prompts quicker to choerce it to follow instructions.
      [-]
      - embedding-shape 15 hours ago
        > had better results with the 20b model, over the 120b model
        The difference of quality and accuracy of the responses between the two is vastly different though, if tok/s isn't your biggest priority, especially when using reasoning_effort "high". 20B works great for small-ish text summarization and title generation, but for even moderately difficult programming tasks, 20B fails repeatedly while 120B gets it right on the first try.
    - fm2606 1 day ago
      Everything I run, even the small models, some amount goes to the GPU and the rest to RAM.
    - fm2606 1 day ago
      Hmmm...now that you say that, it might have been the 20b model.
      And like a dumbass I accidentally deleted the directory and didn't have a back up or under version control.
      Either way, I do know for a fact that the gpt-oss-XXb model beat chatgpt by 1 answer and it was 46/50 at 6 minutes and 47/50 at 1+ hour. I remember because I was blown away that I could get that type of result running locally and I had texted a friend about it.
      I was really impressed but disappointed at the huge disparity between time the two.
  - adastra22 22 hours ago
    What quantization settings?
- neilv 1 day ago
  https://github.com/mostlygeek/llama-swap
egberts1 1 day ago
Ollama, 16-CPU Xenon E6320 (old), 1.9Ghz, 120GB DDRAM4, 240TB RAID5 SSDs, on Dell Precision T710 ("The Beast"). NO GPU. 20b (n oooooot f aah st at all). Pure CPU bound. Tweaked for 256KB chunking into RAG.
Ingested election laws of 50 states, territories and Federal.
Goal. Mapping out each feature of the election and deal with (in)consistent terminologies sprouted by different university-trained public administration. This is the crux of hallunications: getting a diagram of ballot handling and their terminologies.
Then maybe tackle the multitude ways of election irregularities, or at least point out integrity gaps at various locales.
https://figshare.com/articles/presentation/Election_Frauds_v...
[-]
- banku_brougham 1 day ago
  bravo, this is a great use of talent for society
gcr 1 day ago
For new folks, you can get a local code agent running on your Mac like this:
1. $ npm install -g @openai/codex
2. $ brew install ollama; ollama serve
3. $ ollama pull gpt-oss:20b
4. $ codex --oss -m gpt-oss:20b
This runs locally without Internet. Idk if there’s telemetry for codex, but you should be able to turn that off if so.
You need an M1 Mac or better with at least 24GB of GPU memory. The model is pretty big, about 16GB of disk space in ~/.ollama
Be careful - the 120b model is 1.5× better than this 20b variant, but takes 5× higher requirements.
[-]
- windexh8er 1 day ago
  I've been really impressed by OpenCode [0]. The limitations of all the frontier TUI is removed and it is feature complete and performant compared to Codex or Claude Code.
  [0] https://opencode.ai/
  [-]
  - embedding-shape 15 hours ago
    > OpenCode will be available on desktop soon
    Anyone happen to know what that means exactly? The install instructions at the top seems to indicate it already is available on desktop?
    [-]
    - windexh8er 11 hours ago
      It's a terminal only (TUI) tool today. They're releasing a graphical (GUI) version in the future.
      [-]
      - embedding-shape 11 hours ago
        > It's a terminal only (TUI) tool today.
        But to use that TUI you need a desktop, or at least a laptop I guess, but that distinction doesn't make sense. Are they referring to the GUI being the "Desktop Version"? Never heard it put that way before if so.
- nickthegreek 1 day ago
  have you been able to build or reiterate anything of value using just 20b to vibe code?
- abacadaba 1 day ago
  As much as I've been using llms via api all day every day, being able to run it locally on my mba and talk to my laptop still feels like magic
- giancarlostoro 1 day ago
  LM Studio is even easier, and things like JetBrains IDEs will sync to LM Studio, same with Zed.
dust42 1 day ago
On a Macbook pro 64GB I use Qwen3-Coder-30B-A3B Q4 quant with llama.cpp.
For VSCode I use continue.dev as it allows to set my own (short) system prompt. I get around 50token/sec generation and prompt processing 550t/s.
When giving well defined small tasks, it is as good as any frontier model.
I like the speed and low latency and the availability while on the plane/train or off-grid.
Also decent FIM with the llama.cpp VSCode plugin.
If I need more intelligence my personal favourites are Claude and Deepseek via API.
[-]
- redblacktree 1 day ago
  Would you use a different quant with a 128 GB machine? Could you link the specific download you used on huggingface? I find a lot of the options there to be confusing.
  [-]
  - dust42 1 day ago
    I usually use unsloth quants, in this case https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-... - the Q4_K_M variant.
    On 128GB I would definitely run a larger model, probably with ~10B active parameters. All depends how many tokens per second is comfortable for you.
    To get an idea of the speed difference, there is a benchmark page for llama.cpp on Apple silicon here: https://github.com/ggml-org/llama.cpp/discussions/4167
    About quant selection: https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...
    And my workaround for 'shortening' prompt processing time: I load the files I want to work on (usually 1-3) into context with the instruction: read the code and wait. And while the LLM is doing the prompt processing I write my instructions of what I want to have done. Usually the LLM is long finished with PP before I am finished with writing instructions. Due to KV caching the LLM then gives almost instantly the answer.
  - tommy_axle 1 day ago
    Not the OP but yes you can definitely get a bigger quant like Q6 if it makes a difference but you also can go with a bigger param model like gpt oss 120B. A 70B would probably be great for a 128GB machine, which I don't think qwen has. You can search for the model you're interested in on hugging face often with "gguf" to get it ready to go (e.g. https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/tree/main). Otherwise it's not a big deal to quantize yourself using llama.cpp.
- Xenograph 1 day ago
  Have you tried continue.dev's new open completion model [1]? How does it compare to llama.vscode FIM with qwen?
  [1] https://blog.continue.dev/instinct/
- codingbear 1 day ago
  how are you running qwen3 with llama-vscode? I am still using qwen-2.5-7b.
  There is an open issue about adding support for Qwn3 which I have been monitoring, would love to use Qwen3 if possible. Issue - https://github.com/ggml-org/llama.vscode/issues/55
softfalcon 1 day ago
For anyone who wants to see some real workstations that do this, you may want to check out Alex Ziskind's channel on YouTube:
https://www.youtube.com/@AZisk
At this point, pretty much all he does is review workstations for running LLM's and other machine-learning adjacent tasks.
I'm not his target demographic, but because I'm a dev, his videos are constantly recommended to me on YouTube. He's a good presenter and his advice makes a lot of sense.
[-]
- zargon 1 day ago
  His reviews are very amateurish and slapdash. He has no test methodology and just throws together random datapoints that can't be compared to anything. In his Pro 6000 video he compares it with an M3 128GB with empty context. Then he runs a large context (a first for him!) on the 6000 and notes how long prompt processing takes, and never mentions the M3 again!
  [-]
  - embedding-shape 15 hours ago
    Seems pretty spot on for what you'd expect from YouTube developers, most of the channels are like that to be honest. Something about "developer" + "youtube" that just seems to spawn low effort content that tries to create drama rather than fair replacements for blog posts that teach you something.
- fm2606 1 day ago
  > I'm not his target demographic Me either and I am a dev as well
  > He's a good presenter and his advice makes a lot of sense. Agree
  Not that I think he forms his answers on who is sponsoring him, but I feel he couldn't do a lot of the stuff he does without sponsors. If the sponsors aren't supplying him with all that hardware then, in my opinion, he is taking a significant risk in buying all of it out of pocket and hoping that the money he makes from YT covers it (which I am sure it does, several times over). But there is no guarantee that the money he makes from YT will cover the costs, is the point I'm making.
  But, then again, he does use the hardware in other videos so the it isn't like he is banking on a single video to cover the costs.
- hereme888 1 day ago
  Dude... what a good YT channel. The guy is no nonsense, straight to the point. Thanks.

jetsnoc 1 day ago

  Models
    gpt-oss-120b, Meta Llama 3.2, or Gemma
    (just depends on what I’m doing)

  Hardware
    - Apple M4 Max (128 GB RAM)
      paired with a GPD Win 4 running Ubuntu 24.04 over USB-C networking

  Software
    - Claude Code
    - RA.Aid
    - llama.cpp

  For CUDA computing, I use an older NVIDIA RTX 2080 in an old System76 workstation.

  Process

    I create a good INSTRUCTIONS.md for Claude/Raid that specifies a task & production process with a task list it maintains. I use Claude Agents with an Agent Organizer that helps determine which agents to use. It creates the architecture, prd and security design, writes the code, and then lints, tests and does a code review.

[-]

Infernal 1 day ago
What does the GPD Win 4 do in this scenario? Is there a step w/ Agent Organizer that decides if a task can go to a smaller model on the Win 4 vs a larger model on your Mac?

altcognito 1 day ago

What sorts of token/s are you getting with each model?

[-]

jetsnoc 1 day ago

Model performance summary:

  **openai/gpt-oss-120b** — MLX (MXFP4), ~66 tokens/sec @ Hugging Face: `lmstudio-community/gpt-oss-120b-MLX-8bit`

  **google/gemma-3-27b** — MLX (4-bit), ~27 tokens/sec @ Hugging Face: `mlx-community/gemma-3-27b-it-qat-4bit`

  **qwen/qwen3-coder-30b** — MLX (8-bit), ~78 tokens/sec @ Hugging Face: `Qwen/Qwen3-Coder-30B-A3B-Instruct`

Will reply back and add Meta Llama performance shortly.

CubsFan1060 1 day ago
What is the Agent Organizer you use?
[-]
- jetsnoc 1 day ago
  It’s a Claude agent prompt. I don’t recall who originally shared it, so I can’t yet attribute the source, but I’ll track that down shortly and add proper attribution here.
  Here’s the Claude agent markdown:
  https://github.com/lst97/claude-code-sub-agents/blob/main/ag...
  Edit: Updated from the old Pastebin link to the GitHub version. Attribution found: lst97 on GitHub
  [-]
  - nicce 1 day ago
    How it looks like Claude agent is written by Claude...

giancarlostoro 1 day ago
If you're going to get a MacBook, get the Pro, it has a built-in fan, you don't want the heat just sitting there on the MacBook Air. Same with the Mac mini, get the studio instead, it has a fan, the Mini does not. I don't know about you but I wouldn't want my brand new laptop / desktop to be heating up the entire time I'm coding with 0 cool off. If you go the Mac route, I recommend getting TG Pro, the default fan settings on the Mac are awful they don't kick in soon enough, TG Pro lets you make it a little more "sensitive" to those temperature shifts, its like $20 for TG Pro if I remember correctly, but worth it.
I have a MacBook Pro with an M4 Pro chip, and 24GB of RAM, I believe only 16 of it is usable by the models, so I can run the GPT OSS 20B model (iirc) but the smaller one. It can do a bit, but the context window fills up quickly, so I do find myself switching context windows often enough. I do wonder if a maxed out MacBook Pro would be able to run larger context windows, then I would easily be able to code all day with it offline.
I do think Macs are phenomenal at running local LLMs if you get the right one.
[-]
- amonroe805-2 1 day ago
  Quick correction: The mac mini does have a fan. Studio is definitely more capable due to bigger, better chips, but my understanding is the mini is generally not at risk of thermal throttling with the chips you can buy it with. The decision for desktop macs really just comes down to how much chip you want to pay for.
- suprjami 1 day ago
  Correction, if you're going to get a Mac then get a Max or Ultra, with as much memory as possible, the increase in RAM bandwidth will make running larger models viable.
- Terretta 1 day ago
  And yes, for context windows / cached context, the MacBook Pro with 128GB memory is a mind boggling laptop.
  The Studio Ultras are surprisingly strong as well for a pretty monitor stand.
- embedding-shape 1 day ago
  > I do think Macs are phenomenal at running local LLMs if you get the right one.
  How does the prompt processing speed look like today? I think it was either M3 or M4 together with 128GB, trying to run even slightly longer prompts took forever for the initial prompt processing so whatever speed gain you get at inference, basically didn't matter. Maybe it works better today?
  [-]
  - giancarlostoro 1 day ago
    I have only ever used the M4 (on my wife's Macbook Air) and M4 Pro (on my Macbook Pro) and it was reasonable speeds, I was able to tie LM Studio with PyCharm and ask it questions about code, but my context Window kept running out, I don't think the 24GB model is the right choice, the key thing you have to also look out for is for example I might hvae 24GB of RAM, but only 16 of it can be used as VRAM, so I'm more competitive than my 3080 in terms of VRAM, though my 3080 could probably run circles around my M4 Pro if it wanted to.
- trailbits 1 day ago
  The default context window using ollama or lmstudio is small, but you can easily quadruple the default size while running gpt-oss-20b on a 24GB Mac.
erikig 1 day ago
Hardware: MacBook Pro M4 Max, 128GB
Platform: LMStudio (primarily) & Ollama
Models:
- qwen/qwen3-coder-30b A3B Instruct 8-bit MLX
- mlx-community/gpt-oss-120b-MXFP4-Q8
For code generation especially for larger projects, these models aren't as good as the cutting edge foundation models. For summarizing local git repos/libraries, generating documentation and simple offline command-line tool-use they do a good job.
I find these communities quite vibrant and helpful too:
- https://www.reddit.com/r/LocalLLM/
- https://www.reddit.com/r/LocalLLaMA/
[-]
- mkagenius 1 day ago
  Since you are on Mac, if you need some kind code execution sandbox, check out Coderunner[1] which is based on Apple container, provides a way execute any LLM generated cod e without risking arbitrary code execution on your machine.
  I have recently added claude skills to it. So, all the claude skills can be executed locally on your mac too.
  1. https://github.com/instavm/coderunner
- shell0x 1 day ago
  I have a Mac Studio with the M4 Max and 128GB RAM
  The Qwen3-coder model you use is pretty good. You can enable the LM Studio API and install the qwen CLI and point to the API endpoint. This basically gives you functionality similar to Claude code.
  I agree that the code quality is not on part with gpt5-codex and Claude. I also haven't tried z.ai's models locally yet. I think on a Mac with that size GLM 4.5 Air should be able to run.
  For README generation I like gemma3-27b-it-qat and gpt-oss-120b.
toogle 10 hours ago
I built a custom code-completion server: <https://github.com/toogle/mlx-dev-server>.
The key advantage is that it cancels generation when you continue typing, so invalidated completions don’t waste time. This makes completion latency predictable (about 1.5 seconds for me).
My setup: - MacBook Pro (M3 Max) - Neovim - https://github.com/huggingface/llm.nvim
Models I typically use: - mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx - mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit
whitehexagon 1 day ago
Qwen3:32b on MBP M1 Pro 32GB running Asahi linux. Mainly command line for some help with armv8 assembly, and some SoC stuff (this week explaining I2C protocol). I couldnt find any good intro on the web-of-ads. It's not much help with Zig, but then nothing seems to keep up with Zig at the moment.
I get a steady stream of tokens, slightly slower than my reading pace, which I find is more than fast enough. In fact I´d only replace with exact same, or maybe M2 + Asahi with enough RAM to run the bigger Qwen3 model.
I saw qwen3-coder mentioned here. I didnt know about that one. Anyone got any thoughts on how that compares to qwen3? Will it also fit in 32GB?
I'm not interested in agents, or tool integration, and especially wont use anything cloud. I like to own my env. and code top-to-bottom. Having also switched to Kate and Fossil it feels like my perfect dev environment.
Currently using an older Ollama, but will switch to llama.cpp now that ollama has pivoted away from offline only. I got llama.cpp installed, but not sure how to reuse my models from ollama, I thought ollama was just a wrapper, but they seems to be different model formats?
[edit] be sure to use it powered, linux is a bit battery heavy, but Qwen3 will pull 60W+ and flatten a battery real fast.
[-]
- suprjami 1 day ago
  Qwen 3 Coder is a 30B model but an MoE with 3B active parameters, so it should be much faster for you. Give it a try.
  It's not as smart as dense 32B for general tasks, but theoretically should be better for the sort of coding tasks from StackExchange.
woile 1 day ago
I just got a AMD AI 9 HX 370 with 128GB RAM from laptopwithlinux.com and I've started using zed + ollama. I'm super happy with the machine and the service.
Here's my ollama config:
https://github.com/woile/nix-config/blob/main/hosts/aconcagu...
I'm not an AI power user. I like to code, and I like the AI to autocomplete snippets that are "logical", I don't use agents, and for that, it's good enough.
[-]
- altcognito 1 day ago
  What sorts of token/s are you getting with qwen/gemma?
juujian 1 day ago
I passed on the machine, but we set up gpt-oss-120b on a 128GB RAM Macbook pro and it is shockingly usable. Personally, I could imagine myself using that instead of OpenAI's web interface. The Ollama UI has web search working, too, so you don't have to worry about the model knowing the latest and greatest about every software package. Maybe one day I'll get the right drivers to run a local model on my Linux machine with AMD's NPU, too, but AMD has been really slow on this.
[-]
- j45 1 day ago
  LM Studio also works well on Mac.
KETpXDDzR 3 hours ago
I tried using oolama to run an LLM on my A6000 for Cursor. It fits completely in VRAM. Nevertheless it was significantly slower than Claude 4.5 opus. Also, the support in Cursor for local models is really bad.
simonw 1 day ago
I'd be very interested to hear from anyone who's finding local models that work well for coding agents (Claude Code, Codex CLI, OpenHands etc).
I haven't found a local model that fits on a 64GB Mac or 128GB Spark yet that appears to be good enough to reliably run bash-in-a-loop over multiple turns, but maybe I haven't tried the right combination of models and tools.
[-]
- embedding-shape 1 day ago
  I've had good luck with GPT-OSS-120b (reasoning_effort set to "high") + Codex + llama.cpp all running locally, but I needed to do some local patches to Codex as they don't allow configuring and setting the right values for temperature and top_p for GPT-OSS. Also heavy prompting via AGENTS.md was needed to get it to have similar workflow to GPT-5, it didn't seem to pick up that by itself, so I'm assuming GPT-5 been trained with Codex in mind while GPT-OSS wasn't.
  [-]
  - Xenograph 1 day ago
    Would love for you to share the Codex patches you needed to make and the AGENTS.md prompting, if you're open to it.
    [-]
    - embedding-shape 1 day ago
      Basically just find the place where the inference call happens, add top_k, top_p and temperature to hard-coded numbers (0, 1.0 and 1.0 for GPT-OSS) and you should be good to go. If you really need it, I could dig out patch from it, but it should be really straightforward today, and my patch might be conflicting with the current master of codex, I've diverged for other reasons since I did this.
      [-]
      - Xenograph 1 day ago
        That makes sense, wasn't sure if it was as simple as tweaking those two numbers or not, thanks for sharing!
        If there's any insight you can share about your AGENTS.md prompting, it may also be helpful for others!
bravetraveler 1 day ago
I'm more local than anything, I guess. A Framework Desktop off in another room. 96G set aside for VRAM though I barely use it.
Kept it simple: ollama, whatever the latest model is in fashion [when I'm looking]. Feel silly to name any one in particular, I make them compete. I usually don't bother: I know the docs I need.
firefax 1 day ago
I've been using Ollama, Gemma3:12b is about all my little air can handle.
If anyone has suggestions on other models, as an experiment I tried asking it to design me a new latex resumé and it struggled for two hours with the request to put my name prominently at the top in a grey box with my email and phone number beside it.
[-]
- james2doyle 1 day ago
  I was playing with the new IBM Granite models. They are quick/small and they do seem accurate. You can even try them online in the browser because they are small enough to be loaded via the filesystem: https://huggingface.co/spaces/ibm-granite/Granite-4.0-Nano-W...
  Not only are they a lot more recent than gemma, they seem really good at tool calling, so probably good for coding tools. I haven’t personally tried it myself for that.
  The actual page is here: https://huggingface.co/ibm-granite/granite-4.0-h-1b
  [-]
  - firefax 1 day ago
    Interesting. Is there a way to load this into Ollama? Doing things in browser is a cool flex, but my interest is specifically in privacy respecting LLMs -- my goal is to run the most powerful one I can on my personal machine, with the end goal being those little queries I used to send to "the cloud" can be done offline, privately.
    [-]
    - fultonn 1 day ago
      > Is there a way to load this into Ollama?
      Yes, the granite 4 models are on ollama:
      https://ollama.com/library/granite4
      > but my interest is specifically in privacy respecting LLMs -- my goal is to run the most powerful one I can on my personal machine
      The HF Spaces demo for granite 4 nano does run on your local machine, using Transformers.js and ONNX. After downloading the model weights you can disconnect from the internet and things should still work. It's all happening in your browser, locally.
      Of course ollama is preferable for your own dev environment. But ONNX and transformers.js is amazingly useful for edge deployment and easily sharing things with non-technical users. When I want to bundle up a little demo for something I typically just do that instead of the old way I did things (bundle it all up on a server and eat the inference cost).
      [-]
      - firefax 14 hours ago
        Thanks for this pointer and explanation, I appreciate it.
        Also my "dev enviornment" is vi -- I come from infosec (so basically a glorified sysadmin) so I'm mostly making little bash and python scripts, so I'm learning a lot of new things about software engineering as I explore this space :-)
        Edit: Hey which of the models on that page were you referring to? I'm grabbing one now that's apparently double digit GB? Or were you saying they're not CPU/ram intensive but still a bit big?
  - brendoelfrendo 1 day ago
    Not the person you replied to, but thanks for this recommendation. These look neat! I'm definitely going to give them a try.
Greenpants 20 hours ago
I got a personal Mac Studio M4 Max with 128GB RAM for a silent, relatively power-efficient yet powerful home server. It runs Ollama + Open WebUI with GPT-OSS 120b as well as GLM4.5-Air (default quantisations). I rarely ever use ChatGPT anymore. Love that all data stays at home. I connect remotely only via VPN (my phone enables this automatically via Tasker).
I'm 50% brainstorming ideas with it, asking critical questions and learning something new. The other half is actual development, where I describe very clearly what I know I'll need (usually in TODOs in comments) and it will write those snippets, which is my preferred way of AI-assistance. I stay in the driver seat; the model becomes the copilot. Human-in-the-loop and such. Worked really well for my website development, other personal projects and even professionally (my work laptop has its own Open WebUI account for separation).
[-]
- mark_l_watson 16 hours ago
  I like your method of adding TODOs in your code, then using a model - I am going to try that. I only have a 32G M2 Mac so I have to use Ollama Cloud to run some of the larger models but that said I am surprised by what I can do ‘all local’ and it really is magical running all on my own hardware, when I can.
  [-]
  - Greenpants 9 hours ago
    The TODOs really help me get my logic sorted out first in pseudocode. Glad to inspire someone else with it!
    I've read that GPT-OSS:20b is still a very powerful model, I believe it fits in your Mac's RAM as well and could still be quite fast to output. For me personally, only the more complex questions require a better model than local ones. And then I'm often wondering if LLMs are the right tool to solve the complexity.
alexfromapex 1 day ago
I have a MacBook M3 Max with 128 GB unified RAM. I use Ollama with Open Web UI. It performs very well with models up to 80B parameters but it does get very hot with models over 20B parameters.
I use it to do simple text-based tasks occasionally if my Internet is down or ChatGPT is down.
I also use it in VS Code to help with code completion using the Continue extension.
I created a Firefox extension so I can use Open WebUI in my browser by pressing Cmd+Shift+Space too when I am browsing the web and want to ask a question: https://addons.mozilla.org/en-US/firefox/addon/foxyai/
embedding-shape 1 day ago
> Which model(s) are you running (e.g., Ollama, LM Studio, or others)
I'm running mainly GPT-OSS-120b/20b depending on the task, Magistral for multimodal stuff and some smaller models I've fine-tuned myself for specific tasks..
All the software is implemented by myself, but I started out with basically calling out to llama.cpp, as it was the simplest and fastest option that let me integrate it into my own software without requiring a GUI.
I use Codex and Claude Code from time to time to do some mindless work too, Codex hooked up to my local GPT-OSS-120b while Claude Code uses Sonnet.
> What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?
Desktop, Ryzen 9 5950X, 128GB of RAM, RTX Pro 6000 Blackwell (96GB VRAM), performs very well and I can run most of the models I use daily all together, unless I want really large context then just GPT-OSS-120B + max context, ends up taking ~70GB of VRAM.
> What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).
Almost anything and everything, but mostly coding. But then general questions, researching topics, troubleshooting issues with my local infrastructure, troubleshooting things happening in my other hobbies and a bunch of other stuff. As long as you give the local LLM access to a search tool (I use YaCy + my own adapter), local models works better for me than the hosted models, mainly because of the speed and I have better control over the inference.
It does fall short on really complicated stuff. Right now I'm trying to do CUDA programming, creating a fused MoE kernel for inference in Rust, and it's a bit tricky as there are a lot of moving parts and I don't understand the subject 100%, and when you get to that point, it's a bit hit or miss. You really need to have a proper understanding of what you use the LLM for, otherwise it breaks down quickly. Divide and conquer as always helps a lot.
[-]
- andai 1 day ago
  gpt-oss-120b keeps stopping for me in Codex. (Also in Crush.)
  I have to say "continue" constantly.
  [-]
  - embedding-shape 1 day ago
    See https://news.ycombinator.com/item?id=45773874 (TLDR, you need to hard-code some inference parameters to be the right ones, otherwise you'd get really bad behaviour + prompting to get the workflow right)
    [-]
    - andai 1 day ago
      Thanks. Did you need to modify Codex's prompt?
dennemark 1 day ago
I have AMD Strix Halo (395) on my work laptop (HP Ultrabook G1A) as well as at home with Framework Desktop.
On both i have setup lemonade-server on system start. At work i use Qwen3 Coder 30B-3A with continue.dev. It serves me well in 90% of cases.
At home i have 128GB RAM. I try a bit GPT120B. I host Open WebUI on it and connect via https and wireguard to it, so i can use it as PWA on my phone. I love not needing to think about where my data goes. But i would like to allow parallel requests, so i need to tinker a bit more. Maybe llama-swap is enough.
I just need to see how to deal with context length. My models stop or go into infinite loop after some messages. But then i often start a new chat.
Lemonade-server runs with llama.cpp, vllm seems to be scaling better thoug, but is not so easy to set up.
Unsloth GGUFs are great resource for models.
Also for Strix Halo check out kyuz0 repositorIES! Also has image gen. I didnt try those yet. But the benchmarks are awesome! Lots to learn from. Framework forum can be useful, too.
https://github.com/kyuz0/amd-strix-halo-toolboxes Also nice: https://llm-tracker.info/ It links to some benchmark site with models by size. I prefer such resources, since it is quite easy to see which one fit in my RAM (even though i have this silly thumbrule Billion Token ≈ GB RAM).
Btw. even a AMD HX 370 with non soldered RAM can get some nice t/s for smaller models. Can be helpful enough when disconnected from internet and you dont know how to style a svg :)
Thanks for opening up this topic! Lots of food :)
[-]
- Balinares 13 hours ago
  Does Qwen3 Coder do a good job invoking its tools as appropriate for you? Under continue.dev at least, I've found I need to remind it constantly.
hacker_homie 1 day ago
Any halo strix laptop, I have been using the hp zbook ultra g1a with 128gb of unified memory. Mostly with the 20B parameters models but it can load larger ones. I find local models (gpt oss 20B) are good quick references but if you want to refactor or something like that you need a bigger model. I’m running llama.cpp directly and using the api it offers for neovim’s avante plugin, or a cli tool like aichat, it comes with a basic web interface as well.
[-]
- zamadatix 1 day ago
  Do you run into hibernation/sleep issues under current mainline Linux kernels by chance? I have this laptop and that's the only thing which isn't working out of the box for me on the Linux side, but it works fine in Windows. I know it's officially supported under the Ubuntu LTS, but I was hoping that wouldn't be needed as I do want a newer+customized kernel.
  [-]
  - hacker_homie 1 day ago
    Under current kernels (6.17) it seems there is an issue with the webcam driver, https://bugzilla.kernel.org/show_bug.cgi?id=220702 . looks like there are still some issues with sleep/webcam at this time, they might be fixed by the 6.18 release.
    I got sleep working by disabling webcam in the bios for now.
sprior 1 day ago
I wanted to dip my toe in the AI waters, so I bought a cheap Dell Precision 3620 Tower i7-7700, upgraded the RAM (sold what it came with on eBay) and ended up upgrading the power supply (this part wasn't planned) so I could install a RTX 3060 GPU. I set it up with Ubuntu server and set it up as a node on my home kubernetes(k3s) cluster. That node is tainted so only approved workloads get deployed to it. I'm running Ollama on that node and OpenWebUI in the cluster. The most useful thing I use it for is AI tagging and summaries for Karakeep, but I've also used it for a bunch of other applications including code I've written in Python to analyze driveway camera footage for delivery vehicles.
sho 1 day ago
Real-world workflows? I'm all for local LLM, tinker with it all the time, but for productive coding use no local LLM approaches cloud and it's not even close. There's no magic trick or combination of pieces, it just turns out that a quarter million dollars worth of H200s is just much, much better than anything a normal person could possibly deploy at home.
Give it time, we'll get there, but not anytime soon.
[-]
- exac 1 day ago
  I thought you would just use another computer in your house for the flows?
  My development flow takes a lot of RAM (and yes I can run it minimally editing in the terminal with language servers turned off), so I wouldn't consider running the local LLM on the same computer.
  [-]
  - sho 18 hours ago
    It's not about which of your computers you run it on, it's about the relative capability of any system you're likely to own vs. what a cloud provider can do. The difference is hilarious - probably 100x. Knowing that, unless you have good reasons (and experimenting/playing around IS a good reason) - not many people would choose to actually base their everyday workflow on an all-local setup.
    It's sort of like doing all your work on an 80386. Can it be made to work? Probably. Are you going to learn a whole lot making it work? Without a doubt! Are you going to be the fastest dev on the team? No.
- starik36 1 day ago
  You are right. This is the current situation. Plus the downside is that your laptop heats up like a furnace if you use the local LLM a lot.
dnel 19 hours ago
I recently picked up a Threadripper 3960x, 256GB DDR4 and RTX2080ti 11GB running Debian 13 and open web-ui w/ ollama.
It runs well, not much difference to Claude etc but still learning the ropes and how to get the best out of it and local llms in general. Having tonnes of memory is nice for switching out models in ollama quickly since everything stays in cache.
The GPU memory is the weak point though so I'm mostly using models up to 18b parameters that can fit in the vram.
scosman 1 day ago
What are folks motivation for using local coding models? Is it privacy and there's no cloud host you trust?
I love local models for some use cases. However for coding there is a big gap between the quality of models you can run at home and those you can't (at least on hardware I can afford) like GLM 4.6, Sonnet 4.5, Codex 5, Qwen Coder 408.
What makes local coding models compelling?
[-]
- realityfactchex 1 day ago
  > compelling
  >> motivation
  It's the only way to be sure it's not being trained on.
  Most people never come up with any truly novel ideas to code. That's fine. There's no point in those people not submitting their projects to LLM providers.
  This lack of creativity is so prevalent, that many people believe that it is not possible to come up with new ideas (variants: it's all been tried before; or: it would inevitably be tried by someone else anyway; or: people will copy anyway).
  Some people do come up with new stuff, though. And (sometimes) they don't want to be trained on. That is the main edge IMO, for running local models.
  In a word: competition.
  Note, this is distinct from fearing copying by humans (or agents) with LLMs at their disposal. This is about not seeding patterns more directly into the code being trained on.
  Most people would say, forget that, just move fast and gain dominance. And they might not be wrong. Time may tell. But the reason can still stand as a compelling motivation, at least theoretically.
  Tangential: IANAL, but I imagine there's some kind of parallel concept around code/concept "property ownership". If you literally send your code to a 3P LLM, I'm guessing they have rights to it and some otherwise handwavy (quasi important) IP ownership might become suspect. We are possibly in a post-IP world (for some decades now depending on who's talking), but not everybody agrees on that currently, AFAICT.
- jckahn 1 day ago
  I don't ever want to be dependent on a cloud service to be productive, and I don't want to have to pay money to experiment with code.
  Paying money for probabilistically generated tokens is effectively gambling. I don't like to gamble.
  [-]
  - nprateem 1 day ago
    Where did you get your free GPU from?
    [-]
    - nicce 1 day ago
      The problem is the same as owning the house vs. renting.
    - jckahn 1 day ago
      I just use my AMD Framework 13 and 24GB M4 Mac mini. They run gpt-oss models, but only the 20b fits on the mini.
    - serf 1 day ago
      GPUs can do other things. Cloud service LLM providers cannot.
- voakbasda 1 day ago
  Zero trust in remote systems run by others with unknowable or questionable motives.
  [-]
  - scosman 1 day ago
    Makes sense that you'd run locally then.
    But really no host you trust to not keep data? Big tech with no-log guarantees and contractual liability? Companies with no-log guarantees and clear inference business model to protect like Together/Fireworks? Motives seem aligned.
    I'd run locally if I could without compromise. But the gap from GLM 4.5 Air to GLM 4.6 is huge for productivity.
    [-]
    - xemdetia 1 day ago
      This really isn't an all or nothing sort of situation. Many of the AI players have a proven record of simply not following existing norms. Until there is a consumer oriented player who is not presuming that training on my private data and ideas is permitted it only makes sense to do some stuff things locally. Beyond that many of the companies providing AI have either weird limits or limitations that interrupt me. I just know as an individual or a fledgling company I am simply not big enough to fight some of these players and win, and the compliance around companies running AI transparently is too new for me to rely on so the rules of engagement are all over the place. Also don't forget in a few years when the dust settles that company with that policy you like is highly likely to be consumed by a company who may not share the same ethics but your data is still held by them.
      Why take a chance?
  - fm2606 1 day ago
    > Zero trust in remote systems run by others with unknowable or questionable motives.
    This all day long.
    Plus I like to see what can be done without relying on big tech (relying on someone to create an LLM that I can use, notwithstanding).
- zargon 1 day ago
  Another reason along with the others is that the output quality of the top commercial models varies wildly with time. They start strong and then deteriorate. The providers keep changing the model and/or its configuration without changing the name. With a local open weights model, you can learn each model's strengths and it can't be taken away with an update.
- brailsafe 1 day ago
  I don't run any locally, but when I was thinking about investing in a setup, it would just be to have the tool offline. I haven't found the online subscription models to be sufficiently and frequently useful enough beyond occasional random tedious implementations that I'd consider investing in either online or offline LLMs long-term, and I've reverted back to normal programming for the most part, since it just keeps me more engaged.
  [-]
  - IanCal 1 day ago
    Something to consider is using a middleman like openrouter, you can buy some credits and then use them at whatever provider through them - no subscription just payg. For a few ad hoc things you can put a few bucks in and not worry about some monthly thing.
- johnisgood 1 day ago
  What setup would you (or other people) recommend for a local model, and which model, if I want something like Claude Sonnet 4.5 (or actually, earlier versions, which seemed to be better)?
  Anyone could chime in! I just want to have working local model that is at least as good as Sonnet 4.5, or 3.x.
  [-]
  - scosman 1 day ago
    Nothing open is quite as good as Sonnet 4.5 and Codex 5. GLM 4.6, MiniMax M2, Deepseek v3.2, Kimi K2 and Qwen Coder 3 are close. But those are hundreds of billions of parameters, so running locally is very very expensive.
    [-]
    - johnisgood 1 day ago
      That is unfortunate. I will never be able to afford such hardware that could run them. :(
- garethsprice 1 day ago
  It's fun for me. This is a good enough reason to do anything.
  I learn a lot about how LLMs work and how to work with them.
  I can also ask my dumbest questions to a local model and get a response faster, without burning tokens that count towards usage limits on the hosted services I use for actual work.
  Definitely a hobby-category activity though, don't feel you're missing out on some big advantage (yet, anyway) unless you feel a great desire to set fire to thousands of dollars in exchange for spending your evenings untangling CUDA driver issues and wondering if that weird smell is your GPU melting. Some people are into that sort of thing, though.
- nprateem 1 day ago
  Deep-seated paranoia, delusions of grandeur, bragging rights, etc, etc.
reactordev 1 day ago
I use LM Studio with GGUF models running on either my Apple MacBook Air M1 (it’s, ok…) or my Alienware x17 R2 with an RTX 3080 on a Core i9 (runs like autocomplete) in VS Code using Continue.dev
My only complaint is agent mode needs good token gen so I only go agent mode on the RTX machine.
I grew up on 9600baud so I’m cool with watching the text crawl.
mjgs 1 day ago
I use podman compose to spin up an Open WebUI container and various Llama.cpp containers, 1 for each model. Nothing fancy like a proxy or anything. Just connect direct. I also use Continue extension inside vscode, and always use devcontainers when I'm working with any LLMs.
I had to create a custom image of llama.cpp compiled with vulkan so the LLMs can access the GPU on my MacBook Air M4 from inside the containers for inference. It's much faster, like 8-10x faster than without.
To be honest so far I've been using mostly cloud models for coding, the local models haven't been that great.
Some more details on the blog: https://markjgsmith.com/posts/2025/10/12/just-use-llamacpp
wongarsu 1 day ago
$work has a GPU server running Ollama, I connect to it using the continue.dev VsCode extension. Just ignore the login prompts and set up models via the config.yaml.
In terms of models, qwen2.5-coder:3b is a good compromise for autocomplete, as agent choose pretty much just the biggest sota model you can run
vinhnx 1 day ago
> Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) you’re using?
Open-source coding assistant: VT Code (my own coding agent -- github.com/vinhnx/vtcode) Model: gpt-oss-120b remote hosted via Ollama cloud experimental
> What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?
Macbook Pro M1
> What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).
All agentic coding workflow (debug, refactor, refine and testing sandbox execution). VT Code is currently in preview and being active developed, but currently it is mostly stable.
[-]
- jdthedisciple 1 day ago
  Wait ollama cloud has a free tier?
  Sounds too good. Where's the catch? And is it private?
  [-]
  - bradfa 1 day ago
    The catch is ollama cloud is likely to increase prices and/or decrease usage limit levels soon. Free tier has more restrictions than their $20/mo tier. They claim to not store anything (https://ollama.com/cloud) but you'll have to clarify what you mean by "private" (your model likely runs on shared hardware with other users).
    [-]
    - vinhnx 1 day ago
      I agree. "Free" usage could mean tradeoff. But for side-project and experiments, to accesss open source model like gpt-oss, as my machine can not run, I think I will accept it.
      [-]
      - bradfa 1 day ago
        My experience with the free tier and qwen3-coder cloud is the hourly limit gets you about 250k tokens input and then your usage is paused till the hour is up. Enough to try something very small.
  - vinhnx 1 day ago
    Yeah, Ollama recently announces Cloud, as I think it still in beta and free, usage is generously too, enough to build and hack on. But I'm not sure about data training, I don't see the settings to turn this off..
BirAdam 1 day ago
Mac Studio, M4 Max
LM Studio + gpt-oss + aider
Works quite quickly. Sometimes I just chat with it via LM Studio when I need a general idea for how to proceed with an issue. Otherwise, I typically use aider to do some pair programming work. It isn't always accurate, but it's often at least useful.
__mharrison__ 1 day ago
I have a MBP with 128GB.
Here's the pull request I made to Aider for using local models:
https://github.com/Aider-AI/aider/issues/4526
sharms 1 day ago
FWIW I bought the M4 max with 128GB and it is useful for local LLMs for OCR, I don't find it as useful for coding (ala Codex / Claude Code) with local LLMs. I find that even with GPT 5 / Claude 4.5 Sonnet that trust is low, and local LLMs can lower that just enough to not be as useful. The heat is also a factor - Apple makes great hardware, but I don't believe it is designed for continuous usage the way a desktop is.
disambiguation 1 day ago
Not my build and not coding, but I've seen some experimental builds (oss 20b on a 32gb mac mini) with Kiwix integration to make what is essentially a highly capable local private search engine.
[-]
- stuxnet79 1 day ago
  Any resources you can share for these experimental builds? This is something I was looking into setting up at some point. I'd love to take a look at examples in the wild to gauge if it's worth my time / money.
  An aside, if we ever reach a point where it's possible to run an OSS 20b model at reasonable inference on a Macbook Pro type of form factor, then the future is definitely here!
  [-]
  - disambiguation 23 hours ago
    In reference to this post i saw a few weeks ago:
    https://lemmy.zip/post/50193734
    (Lemmy is a reddit style forum)
    The author mainly demos their "custom tools" and doesn't elaborate further. But IMO is still an impressive showcase for an offline setup.
    I think the big hint is "open webui" which supports native function calls.
    Some more searching and i found this: https://pypi.org/project/llm-tools-kiwix/
    It's possible the future is now.. assuming you have an M series with enough RAM. My sense is that you need ~1gb of RAM for every 1b paramters, so 32gb should in theory work here. I think macs also get a performance boost over other hardware due to unified memory.
    Spit balling aside, I'm in the same boat, saving my money, waiting for the right time. If it isn't viable already its damn close.
baby_souffle 1 day ago
Good quality still needs more power than what a laptop can do. The local llama subreddit has a lot of people doing well with local rigs, but they are absolutely not laptop size.
codingbear 1 day ago
I use local for code completions only. Which means models supporting FIM tokens.
My current setup is the llama-vscode plugin + llama-server running Qwen/Qwen2.5-Coder-7B-Instruct. It leads to very fast completions, and don't have to worry about internet outages which take me out of the zone.
I do wish qwen-3 released a 7B model supporting FIM tokens. 7B seems to be the sweet spot for fast and usable completions
[-]
- Mostlygeek 1 day ago
  qwen3-coder-30B-A3B supports FIM and should be faster than the 7B if you got the vram.
  I use bartowkski’s Q8 quant over dual 3090s and it gets up to 100tok/sec. The Q4 quant on a single 3090 is very fast and decently smart.
sehugg 1 day ago
I use "aider --commit" sometimes when I can't think of a comment. I often have to edit it because it's too general or it overstates the impact (e.g. "improved the foo", are you sure you improved the foo?) but that's not limited to local models. I like gemma3:12b or qwen2.5-coder:14b, not much luck with reasoning models.
Weloveall 8 hours ago
I use ollama in openwebui and llama3 in model. But I also have an ssh flask where I use an claude api.
nvdnadj92 1 day ago
Laptop: Apple M2 Max, 32GB memory (2023)
Setup:
Terminal:
- Ghostty + Starship for modern terminal experience
- Homebrew to install system packages
IDE:
- Zed (can connect to local models via LM-Studio server)
- also experimenting with warp.dev
LLMs:
- LM-studio as open-source model playground
- GPT-OSS 20B
- QWEN3-Coder-30B-AEB-quantized-4bit
- Gemma3-12B
Other utilities:
- Rectangle.app (window tile manager)
- Wispr.flow - create voice notes
- Obsidian - track markdown notes
lux_sprwhk 1 day ago
I use it to analyze my dreams and mind dumps. Just running it on my local machine, cus it’s not resource intensive, but building a general solution out of it.
I think for stuff that isn’t super private like code and such, it’s not worth the effort
jwpapi 1 day ago
On a side note I really thing latency is still important. Is there some benefit in choosing location for where you get your responses from? Like with Openrouter f.e.
Also I could think that a local model just for autocomplete could help reducing latency for completion suggestions.
[-]
- oofbey 1 day ago
  Latency matters for the autocomplete models. But IMHO those suck and generally just get in the way.
  For the big agentic tasks or reasoned questions, the many seconds or even minutes of LLM time dwarf RTT even to another continent.
  Side note: I recently had GPT5 in Cursor spend fully 45 minutes on one prompt chewing on why a bug was flaky, and it figured it out! Your laptop is not gonna do that anytime soon.
ghilston 1 day ago
I have a m4 max mbp with 128 gb. What model would you folks recommend? I'd ideally like to integrate with a tool that can auto read context like Claude code (via a proxy) or cline. I'm open to any advice
dethos 1 day ago
Ollama, Continue.dev extension for editor/IDE, and Open-WebUI. My hardware is a bit dated, so I only use this setup for some smaller open models.
On the laptop, I don't use any local models. Not powerful enough.

mwambua 1 day ago

Tangential question. What do people use for search? What search engines provide the best quality to cost ratios?

Also are there good solutions for searching through a local collection of documents?

[-]

andai 1 day ago

ddg (python lib) is free and I'd say good enough for most tasks. (I think the endpoint is unofficial, but from what I've heard it's fine for typical usage.)

There's also google, which gives you 100 requests a day or something.

Here's the search.py I use

    import os
    import json
    from req import get

    # https://programmablesearchengine.google.com/controlpanel/create
    GOOGLE_SEARCH_API_KEY = os.getenv('GOOGLE_SEARCH_API_KEY')
    GOOGLE_SEARCH_API_ID = os.getenv('GOOGLE_SEARCH_API_ID')

    url = "https://customsearch.googleapis.com/customsearch/v1"

    def search(query):
        data = {
            "q": query,
            "cx": GOOGLE_SEARCH_API_ID,
            "key": GOOGLE_SEARCH_API_KEY,
        }
        results_json = get(url, data)
        results = json.loads(results_json)
        results = results["items"]
        return results

    if __name__ == "__main__":
        while True:
            query = input('query: ')
            results = search(query)
            print(results)

and the ddg version

    from duckduckgo_search import DDGS

    def search(query, max_results=8):
        results = DDGS().text(query, max_results=max_results)
        return results

[-]

mwambua 1 day ago
Oh, nice! Thanks! This reminds me of the unofficial yahoo finance api.

nickthegreek 12 hours ago
just setup searxng yesterday and mcp for it in lm studio to be able to search the net for answers to simple queries. small ibm granite worked surpassingly well, while oss20b seemed to be looping searches.

loudmax 1 day ago
I have a desktop computer with 128G of RAM and an RTX 3090 with 24G of VRAM. I use this to tinker with different models using llama.cpp and ComfyUI. I manged to get a heavily quantized instance of DeepSeek R1 running on it by following instructions from the Level1 tech forums, but it's far too slow to be useful. GPT-OSS-120b is surprisingly good, though again too quantized and too slow to be more than a toy.
For actual real work, I use Claude.
If you want to use an open weights model to get real work done, the sensible thing would be to rent a GPU in the cloud. I'd be inclined to run llama.cpp because I know it well enough, but vLLM would make more sense for models that runs entirely on the GPU.
Frannky 1 day ago
I wonder how long it will be before 400B+ local models achieving 2K TPS become a reality for under $10K
lovelydata 1 day ago
llama.cpp + Qwen3-4B running on older PC with AMD Radeon GPU (Vulcan). Users connect via web UI. Usually around 30 tokens/sec. Usable.
[-]
- NicoJuicy 1 day ago
  What do they use it for? It's a very small model
  [-]
  - embedding-shape 1 day ago
    Autocomplete words, I'd wager, as yeah, super tiny model that can barely output coherent output in many cases.
packetmuse 1 day ago
Running local LLMs on laptops still feels like early days, but it’s great to see how fast everyone’s improving and sharing real setups.
timenotwasted 1 day ago
I have an old 2080TI that I use to run Ollama and Qdrant. It has been ok, I haven't found it so good that it has replaced using Claude or Codex but there are times where having RAG available locally is a nice setup for more specific queries. I also just enjoy tinkering with random models which this makes super easy.
My daily drivers though are still either Codex or GPT5, Claude Code used to be but it just doesn't deliver the same results as it has previously.
Gracana 1 day ago
I don’t own a laptop. I run DeepSeek-V3 IQ4_XS on a Xeon workstation with lots of RAM and a few RTX A4000s.
It’s not very fast, and I built it up slowly without knowing quite where I was headed. If I could do it over again, I’d go with a recent EPYC with 12 channels of DDR5 and pair it with a single RTX 6000 Pro Blackwell.
ThrowawayTestr 1 day ago
I use the abliterated and uncensored models to generate smut. SwarmUI to generate porn. I can only get a few tokens/s on my machine so not fast enough for quick back and forth stuff.
saubeidl 1 day ago
I think local LLM and laptop is not really compatible, for anything useful. You're gonna want a bigger box and have your laptop connect to that.
mooiedingen 1 day ago
Vim+ollama-vim Start new file with at the top in the comments the instructions needed to follow to become the solution to the problem and let it work like a sort of auto complete... example: # The Following is a python # Script that uses the # libraries requests and # BeautifullSoup to scrape # url_to_scrape = input( # "what url do i need to # fetch?") import ... """autocompletes from here the rest""" anywhere in a script one can # comment ' Instructions this way i find the most effective instead of asking Write me a script for this or that.. take a coding model and finetune it with commonly used snippets of code... This is completely customizable and will stay coherent to your own writing style.. i made embeddings per language, even md. python javascript vimscript lua php html json(however output is json) xml Css ...
kabes 1 day ago
Let's say I have a server with an h200 gpu at home. What's the best open model for coding I can run on it today? And is it somewhat competitive with commercial models like sonnet 4.5?
[-]
- suprjami 1 day ago
  If you have ~$25k to buy a H200 then don't buy one. Rent them out much cheaper and keep renting newer models when your H200 becomes an outdated paperweight.
  Assuming you ran inference for the full working day, you'd need to run your H200 for almost 2 years to break even. Realistically you don't run inference full time so you'll never realise the value of the card before it's obsolete.
  [-]
  - kabes 21 hours ago
    The company I work for is in the defense industry and by contract can't send any code outside their own datacenter. So cloud-rented H200's are a no-go and obviously commercial LLM's as well. so breaking even is not the goal here.
    [-]
    - suprjami 6 hours ago
      In that case I suggest you buy cheaper desktop cards instead of a H200. Two or three 5090s will let you run decent models at very good speed.
- skhameneh 1 day ago
  That's still very limiting when comparing to commercial models. To be truly competitive with commercial offerings the bar is closer to 4-8x that for one node .
  That said, maybe a quantized version of GLM 4.5 Air, but if we're talking no hardware constraints I find some of the responses from LongCat-Chat-Flash to be favorable over Sonnet when playing around with LMArena.
- hamdingers 1 day ago
  If you do, damn bro
  I played around with renting H200s and coding with aider and gpt-oss 120b. It was impressive but not at the level of claude. I decided buying $30k worth of tokens made far more sense than buying 30k worth of one GPU.
justsomedumbass 1 day ago
qwen3 coder 30B with ollama server via continue. Big box Win11 Ryzen9-7900,128GB-DDR5,8TB,5090. Use for all of the above. Works pretty well for simple coding tasks. Haven't given it much in the way of smoke tests yet. It's quite snappy, but of course the limitation is context size if I run it solely on GPU. Haven't plugged OSS into the pipeline yet. Wanted to use code specific model.
gnarlouse 1 day ago
Omarchy ArchLinux+ollama:deepseek-r1+open-webui
On an RTX 3080 Ti+Ryzen 9
itake 1 day ago
Ollama qwen3-coder
- auto git commit message
- auto jira ticket creation from git diff
ge96 1 day ago
I don't, although I'm not a puritan eg. I'll use the AI summary that shows first in browsers
finfun234 1 day ago
lmstudio with local models
brendoelfrendo 1 day ago
I keep mine pretty simple: my desktop at home has an AMD 7900XT with 20gb VRAM. I use Ollama to run local models and point Zed's AI integration at it. Right now I'm mostly running Devstral 24b or an older Qwen 2.5 Coder 14b. Looking at it, I might be able to squeak by running Qwen 3 Coder 30b, so I might give it a try to test it out.
NicoJuicy 1 day ago
Rtx 3090 24gb. Pretty affordable.
Gos-oss:20b and qwen3 coder/instruct, devstrall are my usual.
Ps. Definitely check out open-web ui.
[-]
- ThrowawayTestr 1 day ago
  What's your tokens/s on that?
j45 1 day ago
The M2/3/4 Max CPUs in a Mac Studio or Macbook Pro when paired with enough ram are quite capable.
In more cases than expected, the M1/M2 Ultras are still quite capable, especially performance power per watt of electricity, as well as ability to serve one user.
The Mac Studio has better bang for the buck than the laptop for computational power to price.
Depending on your needs, the M5's might be worth waiting for, but M2 Max onward are quite capable with enough ram. Even the M1 Max continues to be a workhorse.
manishsharan 1 day ago
I am here to hear from folks running LLM on Framework desktop (128GB). Is it usable for agentic coding ?
[-]
- strangattractor 1 day ago
  Just started going down that route myself. For the money it performs well and runs most of the models at reasonable speeds.
  1. Thermal considerations are important due to throttling for thermal protection. Apple seems best at this but $$$$. The Framework (AMD) seems a reasonable compromise (you can have almost 3 for 1 Mini). Laptops will likely not perform as well. NVIDIA seems really bad at thermal/power considerations.
  2. Memory model matters and AMD's APU design is an improvement. NVIDIA GPUs where designed for graphics but where better than CPUs for AI so they got used. Bespoke AI solutions will eventually dominate. That may or may not be NVIDIA in the future.
  My primary interest is AI at the edge.
platevoltage 1 day ago
I've been using qwen2.5-coder for code assistant and code completion which has worked pretty well. I recently started trying mistral:7b-instruct. I use Continue with VS Code. It works ok. I'm limited to 16GB on an M2 MacBook Pro. I definitely wish I had more RAM to play with.
more_corn 1 day ago
My friend uses a 4 gpu server in her office and hits the ollama api over the local network. If you want it to work from anywhere a free tailscale account.
system2 1 day ago
Those who use these can you compare the quality of code compared to Claude Sonnet 4.5 or Opus 4.1?
dboreham 1 day ago
I've run smaller models (I forget which ones, this was about a year ago) on my laptop just to see what happened. I was quite surprised that I could get it to write simple Python programs. Actually very surprised which led me to re-evaluate my thinking on LLMs in general. Anyway, since then I've been using the regular hosted services since for now I don't see a worthwhile tradeoff running models locally. Apart from the hardware needed, I'd expect to be constantly downloading O(100G) model files as they improve on a weekly basis. I don't have the internet capacity to easily facilitate that.
garethsprice 1 day ago
HP G9 Z2 Mini with a 20GB ADA 4000, 96GB RAM, 2TB SSD, Ubuntu. Would get a Macbook with a ton of RAM if I was buying today, a full form factor PC, the mini form factor looks nice but gets real hot and is hard to upgrade.
Tools: LM Studio for playing around with models, the ones I stabilize on for work go into ollama.
Models: Qwen3 Coder 30b is the one I come back to most for coding tasks. It is decent in isolation but not so much at the multi-step, context-heavy agentic work that the hosted frontier models are pushing forward. Which is understandable.
I've found the smaller models (the 7B Qwen coder models, gpt-oss-20B, gemma-7b) extremely useful given they respond so fast (~80t/s for gpt-oss-20B on the above hardware), making them faster to get to an answer than Googling or asking ChatGPT (and fast to see if they're failing to answer so I can move on to something else).
Use cases: Mostly small one-off questions (like 'what is the syntax for X SQL feature on Postgres', 'write a short python script that does Y') where the response comes back quicker than Google, ChatGPT, or even trying to remember it myself.
Doing some coding with Aider and a VS Code plugin (kinda clunky integration), but I quickly end up escalating anything hard to hosted frontier models (Anthropic, OpenAI via their clis or Cursor). I often hit usage limits on the hosted models so it's nice to have a way my dumbest questions don't burn tokens I want to reserve for real work.
Small LLM scripting tasks with dspy (simple categorization, CSV munging type tasks), sometimes larger RAG/agent type things with LangChain but it's a lot of overhead for personal scripts.
My company is building a software product that heavily utilizes LLMs so I often point my local dev environment at my local model (whatever's loaded, usually one of the 7B models), initially I did this not to incur costs but as prices have come down it's now more as it's less latency and I can test interface changes etc faster - especially as new thinking models can take a long time to respond.
It is also helpful to try and build LLM functions that work with small models as it means they run efficiently and portably on larger ones. One technical debt trap I have noticed with building for LLMs is that as large models get better you can get away with stuffing them with crap and still getting good results... up until you don't.
It's remarkable how fast things are moving in the local LLM world, right now the Qwen/gpt-oss models "feel" like gpt-3.5-turbo did a couple of years back which is remarkable given how groundbreaking (and expensive to train) 3.5 was and now you can get similar results on sub-$2k consumer hardware.
However, its very much still in the "tinkerer" phase where it's overall a net productivity loss (and massive financial loss) vs just paying $20/mo for a hosted frontier model.
[-]
- uxcolumbo 20 hours ago
  This is a great break down and thanks for including the type of questions you ask depending what model you use.
  What coding tasks do you use Qwen3 Coder 30b model for? Simple function definitions and / or as autocomplete in VSC?
lloydatkinson 19 hours ago
What a fucking stupid suggestion that only laptops are used here.
bukablokirBWS 1 day ago
[dead]