How to setup a local coding agent on macOS

(ikyle.me)

154 points | by kkm 4 hours ago

21 comments

  • mark_l_watson 11 minutes ago
    Nice writeup, thanks.

    I run something very similar except for directly using pi as the agentic harness I use little-coder that wraps pi with reasonable defaults for running local models. Even though my local setup is a bit slow, it is a thrill to do real work completely locally.

  • Aurornis 2 hours ago
    > The benchmark prompt was:

    > Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

    > Each benchmark generated about 128 tokens.

    Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.

    llama.cpp includes a tool specifically for benchmarking that will sweep the arguments for you so you don't have to restart the server and send it prompts:

    https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...

    EDIT: Also the section about downloading the models should have mentioned that llama.cpp has a "-hf" argument that will download the models for you. I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.

    • reactordev 10 minutes ago
      This is akin to saying “it runs on my machine” without actually examining the problem. Sad. You’re absolutely right that 128 tokens is nothing, it’s a little more than a hello response.
    • liuliu 1 hour ago
      Realistically, you need to experiment with any user prompt + a good amount of system prompt (at least > 1000 tokens, but realistically, in the range of 3000 tokens probably good).

      llama.cpp includes tools for that, what you are looking at is to have a prefill before token generation to measure it properly. Increasingly also, measuring token generation speed at longer context (32k or 64k) is important too.

  • ig0r0 3 hours ago
    I wrote a similar post some time ago just used ollama and opencode https://blog.kulman.sk/running-local-llm-coding-server/
    • takethebus 1 hour ago
      this is the way, given anyone could swap for oh my pi / pi / etc
      • mark_l_watson 4 minutes ago
        yes, whether for home experiments or at work, it is good practice (good hygiene) to be able to swap out both agentic harnesses and models. It is important to have a good strategy for exporting skills, etc.
    • sleepybrett 2 hours ago
      actually useful and the ollama gui could probably even simplify this more.
  • vladgur 2 hours ago
    I have used omlx.ai with great success to both download multiple mlx models (including gemma and qwen) suited for my hardware AND to be able to automagically launch both open-source and close-source (claude code, codex) harnesses using these models. All from a web or desktop UI

    You would not need to follow a blog post with omlx IMHO

    • dofm 27 minutes ago
      FWIW I have not, on a 64GB M1 Max, seen any advantage from oMLX specifically or MLX generally over GGUF with llama.cpp.

      The Gemma 4 MLX builds I have found so far have been slower at the same quantisation and much slower with MTP.

      The built-in web UI for llama.cpp is really quite good once you have chosen your model. Otherwise I quite like LM Studio for tinkering.

      One thing I would say is that both Gemma-4 and Qwen 3.6 simply do not need a large chunk of the typical opencode system prompt. Better off without it.

    • Dotnaught 1 hour ago
      In case anyone is looking for a sandbox to go with oMLX and Pi: https://github.com/Dotnaught/pi-sandbox
      • dofm 25 minutes ago
        This is useful. I'm still tinkering with Multipass VMs because I need the whole VM environment anyway and I'm on Sequoia. But I'd be interested if you did anything like that with Apple's container CLI instead; sooner or later I will have to upgrade to Tahoe because I want to play with the container CLI (and apfel).
    • fridder 2 hours ago
      It truly is the SOTA for local inference on mac. Even when there are regressions the dev(s) are insanely responsive. It is the most impressive opensource project I've seen in a awhile
      • benbojangles 1 hour ago
        Omlx needs to incorporate macos native shortcuts use - macos can almost instantly extract text from pdfs and a bunch of other things using it's ane neural engine keeping unified ram for llm use. The two together would be awesome
  • c-hendricks 3 hours ago
    Not sure you really need huggingface-cli to download anything if you're just using llama.cpp. You can pass `-hf ...` and it will download the models for you. Set `LLAMA_CACHE` to change where the downloads go:

      LLAMA_CACHE="models" ./llama-server \
        -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
        ...
    • dofm 3 hours ago
      Yes.

      -hfd for the draft model.

      • c-hendricks 2 hours ago
        Nice, was wondering if there was a flag for the draft as well.

        Not knocking huggingface-cli, just find it's much easier for people to try out this stuff when they can just

          mise use --global github:ggml-org/llama.cpp
          LLAMA_CACHE="models" llama-server \
            -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
            --host 0.0.0.0 \
            --port 11434 \
            ...
  • reenorap 34 minutes ago
    My biggest pet peeve with all these articles on local AI is the only thing they talk about is tokens per second. No one mentions the quality of the answers. No one. I don't mind waiting a little longer if the quality is better. Quickly serving me slop doesn't make it more useful. Are people really only looking at tokens per second?
    • akman 13 minutes ago
      That's fair. There are even many dimensions to define 'quality' which include use case (coding? writing? multimedia?) and prompt. I suppose if you ask testers to provide benchmarks with their analysis, that might hamper their desire to share.
  • dofm 3 hours ago
    Useful stuff in here that I wish I'd seen a few days ago :-)

    I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.

    Fiddling about with local models has done so much for my conceptual understanding of what is going on.

    FWIW and YMMV but I also found the Gemma 4 MTP head was occasionally breaking markup in Opencode, causing the thinking to display untidily and ultimately in some cases missing the stop token. So I've stopped using MTP there for now.

    Recent Qwen 3.6 models have developer role support so it will occasionally surprise you with a structured multiple choice questionnaire.

    • mark_l_watson 1 minute ago
      when I started using QAT recently, I stopped trying to improve my configuration after that. I will try tuning my local environment again in a few months, but with QAT things are good enough for now.
    • mft_ 2 hours ago
      I found a marginal downside to Qwen3.6-35B-A3B-MTP vs. the non-MTP equivalent on an M1 Max. I’ll maybe experiment with settings further though.
      • freehorse 2 hours ago
        And the upsides of using draft models for MOE models with so low number of active parameters (as here or as in the article) are quite low, compared to dense models where you can get enormous speedups. I would prefer running the dense 27b models with speculative decoding instead.
        • dofm 54 minutes ago
          That is what I have learned, yes. Not tested the dense Qwen yet. IIRC the 31B Gemma was slow enough that I doubt MTP will help me much.
      • dofm 2 hours ago
        Yeah. I think it might speed up time to first token but I am not sure how much that matters.

        I do enjoy their different personalities when they are tackling "explain this" type puzzles, though.

        Gemma writes so well — like a concise code blogger. It makes you understand that the thing we hate about AI slop writing is specifically the cheesy, marketingese sycophantic ChatGPT tone. It's a choice to sound that way.

        Qwen writes more tersely by default, like much english language documentation in Chinese open source projects. A couple of lines, code example, fact, code example, line of blurb.

        I use this prompt every now and then with a new model. It's obviously a classic SQL puzzle but I've asked new web developers this in the past (prompted by discovering that a client's subcontractor didn't understand it and was therefore unable to migrate some code from relying on dodgy pre-MySQL 5.x behaviours)

          I have a MySQL 5 table like this: [id, label, category, score].   It contains a list of items in different categories (text names like cat1, cat2, cat3) with a numerical score. Is there a way I can write a SQL query to find the item in each category that has the highest score, without using a subquery? No two entries in any category share a score.
        

        I enjoy seeing what it deduces from the subtext.

        Without "thinking" mode on, they always initially fail and you need to prompt them to find the answer. With thinking mode, they both produce really nice explanations.

        For me, as an old freelancer who is pretty cynical about vibe coding or "agentic engineering", what I really want is an AI tool that can help me start to solve problems and help me find the right terminology or generate some boilerplate I can tinker with. Both of these models do fine at the kind of "starter" writing that I want when I am trying to untangle an idea.

  • jmkni 2 hours ago
    FYI you can open Claude code in the terminal, point it at this article and just tell it to "do it", if you're feeling extra lazy
    • echelon 2 hours ago
      This is the way.

      I'm not Googling much of anything anymore. 9/10 times the information is awful, it's hard to parse out of whatever other spam it's surrounded by. Meanwhile, Claude will just do the thing one-shot or with a tiny bit of refinement.

      The gateway to knowledge and getting stuff done is the LLM.

      Google Search is a dinosaur.

      It feels like we're living a century into the future. Not even smartphones were this cool.

      • kingofthehill98 1 hour ago
        Yeah, if the future is "Claude, think for me" I'm happy to stay at the good old present.
        • echelon 1 hour ago
          https://en.wikipedia.org/wiki/Is_Google_Making_Us_Stupid%3F

          https://newsletter.pessimistsarchive.org/p/when-educators-mo...

          New decade, same old argument.

          It's not

          > "Claude, think for me"

          It's

          > "Claude, be my subordinate and get this done for me"

          Instead of complaining on the sidelines, I'm getting a shit ton of work done.

          • ultrarunner 1 hour ago
            For what it's worth, even this reply reads like LLM output. It's not "quote describing the scenario", it's "some other linked-in-coded plot twist". If you're the average of the people you spend the most time around, and you spend the most time around a chatbot, do you start to absorb its speech patterns and logic structures?

            Yeah, good ol' present for me too then, thanks.

          • this_user 1 hour ago
            > Instead of complaining on the sidelines, I'm getting a shit ton of work done.

            Nah, you are just producing a bunch of slop and hope that nobody notices.

          • sdevonoes 1 hour ago
            > I'm getting a shit ton of work done.

            It’s weird when people are proud of doing ton of work. Im the opposite, Im proud that Im doing minimal stuff without llms.

      • tobyhinloopen 1 hour ago
        Claude “respond in a friendly way that I agree with this comment”
  • faitswulff 28 minutes ago
    I really liked that this worked on the first try for me (M4 Max MBP):

        ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
    
    This was from Ollama's MLX support announcement blog. I think this updated version may work too:

        ollama launch claude --model qwen3.6:35b-a3b-coding-nvfp4
  • reddit_clone 2 hours ago
    >64 GB

    Thats the rub. I have an M4 with 48G. I wonder if it is worth testing this out.

    My past attempts (with Ollama and various LLMs) were too slow to use.

    • dofm 22 minutes ago
      These models will be a bit of a squeeze at Q4_0 I suspect; almost certainly they will be using CPU.

      But if you just want to play around rather than code, you really might find the Gemma 4 12B model worth mucking about with just so you've gone through the steps. Especially if you want to muck about with image analysis or audio transcription.

      If you're writing PHP I think you could even find it good enough. I've been modestly surprised. You can do that basic fiddling with the Edge AI Gallery app, which can enable thinking and has a customisable system prompt and some agent support.

      You could also try the 14B Deepseek R1.

      Honestly even if it is not good enough, if you are anything like me, I think you'll find that going through this process is really quite educational — it has made a lot of things more concrete for me in a way that I have found reassuring and valuable.

    • codazoda 1 hour ago
      I'm running an M3 on an Air with just 16GB. I can still get useful results without an internet connection in "chat mode". It's a different experience than using Claude, for sure, but it's workable. I typically use the Qwen variants these days.
    • hkchad 2 hours ago
      I have a M5 MAX with 128, local models are toys compared to hosted ones. I've spent a lot of time and money trying to make it work even 1/2 as well.
      • tmh88j 1 minute ago
        I think they're a perfect fit for personal use when your goal is fun over productivity. I want to be doing the thinking and planning, so I often offload repetitive or tedious, but straightforward tasks.
      • iluvcommunism 2 hours ago
        [dead]
    • contingencies 1 hour ago
      M4 24GB here. You'll be fine, if you're anything like me minor latency is acceptable to obtain (a) privacy (b) reliability (c) CI/CD/guardrails (d) network independence (e) future-proofing vs. AIaaS. https://omlx.ai/ gives you intelligent local hardware based model download recommendations. That said it probably depends heavily on your workload, process and polish expectations. See also https://news.ycombinator.com/item?id=48089091
  • hanifbbz 2 hours ago
    Here's a visual post for using LM Studio and VS Code (and Pi): https://blog.alexewerlof.com/p/local-llms-for-agentic-coding

    One way or another local AI is the future. I actually find weaker models more interesting because it keeps me sharp (at the cost of velocity of course).

  • rectang 1 hour ago
    Does anybody run a local agent on a Mac using an outboard GPU?
    • benbojangles 1 hour ago
      I run a second Mac for local llm use and access it remotely using ssh from the first mac
  • cdolan 3 hours ago
    Is there a link to the video? It did not render when I went to the page. Curious about the real-time feel of this
  • metadaemon 2 hours ago
    Has anyone compared a setup like this to just using LM Studio?
    • CharlesW 2 hours ago
      Yes, I can confirm that LM Studio works great for this.
  • attogram 2 hours ago
    8b max on a std 16gb macbook. Anything more and your mac is toast
  • namnnumbr 3 hours ago
    oMLX (https://github.com/jundot/omlx) makes running the mlx inference server quite easy for those interested in UI-based hosting. oMLX also supports mtp or dflash drafting.
    • w10-1 2 hours ago
      Agreed (not sure what you mean by UI-based hosting).

      oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.

  • LoganDark 1 hour ago
    I poured a couple days into custom Burn inference for Qwen3-Coder-Next only to find it doesn't come with a speculative decoder, so on my M4 Max I can't push it much further than 120t/s. That's still kinda slow, though still faster than llama.cpp's 70.9t/s and MLX's 80.6t/s with the same model. Claude Fable 5 is recommending I use the Qwen3 MTP -- I worry that will compromise the quality somewhat, but might give it a try to see if I can get more usable speeds.
  • sleepybrett 2 hours ago
    or you can just load up ollama, have it load a local model and point claude or opencode at it...

    is this article old? It's not. I'm not sure why he went through all the bother of llama.cpp

    • malkosta 2 hours ago
      That was exactly my same question. Then I finished reading the post. The reason is pretty clear, and written in the post: it is faster than ollama+mlx.
  • flowbarai 2 hours ago
    [flagged]
  • aplomb1026 2 hours ago
    [flagged]