⎨ Saurabh Kumar ⎬

My Local LLM On A 32 GB M1 Pro

My pi setup has a fast mode: a slot I wired into my own config for the small, high-volume questions, pointed at Bedrock’s Haiku 4.5. It is a good model. Cheap, responsive, and it answers the kind of small, contained question I throw at it twenty times a day without complaining. It is also not mine. It runs on hardware I do not own, it can be deprecated or quietly swapped the week I have come to lean on it, and every one of those small questions hands a slice of my code to a third party.

None of that is a crisis. Bedrock is fast and it works. But the machine that could do the same job is sitting on my desk, and when Kyle Howells got Gemma 4 26B-A4B running locally on an M1 Max at 72 tokens per second and the Hacker News thread filled up with people on smaller Macs reporting the same setup was workable, I wanted to know whether my own laptop could host that slot.

The catch is that my laptop is an M1 Pro with 32 GB of unified memory: half the RAM and roughly half the memory bandwidth of the machine in Kyle’s post. Copying his commands verbatim was never going to work. I needed to know what to change, in what order, and what to give up. So I did not copy the commands. I gave pi the article and asked it to build me a plan for my machine, and to interrogate me before it wrote a line.

I Handed The Guide To An LLM And Made It Grill Me

The whole thing started with one message. Typos and all, this is what I actually sent:

Act local AI setup expert working on M1 silicon with 32GB ram. Read though [the article] and also pull all the comments on this article on news.ycombinator.com article id 48507020. I’ve ollama and muse already installed there are models installed as well. Prep and action plan to set this up in the most performant and maintaibale way. Not sure quite sure if i need to uninstall previous ollama app and it’s modetls. Create and write a plan after analysing things. /grill-me

The /grill-me at the end is a small skill in my pi setup. It is two paragraphs of instructions that tell the model to walk down the decision tree one node at a time, recommend an answer at each step, and wait for my reply before moving on. It is not magic. It is a way of forcing the conversation I would otherwise have with myself in a notebook, except the model will not let me skip a step.

The interview was mostly the model proposing and me confirming in monosyllables. My replies, verbatim, were things like:

A

i would prefer ~/code/localLLM

A, also note i’ve “llm” cli already installed, and alias would be nice.

let’s go with recommendation, write the plan.

That is the entire texture of it. A simple opening prompt that points at an existing guide and names my constraints, then a handful of one-word answers steering the branches. The model did the reading (the article and the full HN thread) and the arithmetic. I made the calls.

And it pushed back, which is the point. When I leaned toward the 65 K context window from the article, it did the memory math and showed me I would be swapping inside the first generation on 32 GB, so we settled on 32 K. When I waffled on keeping Ollama “just in case,” it pointed out I was at 89 % disk with a thirteen-gigabyte Ollama models directory I was not using. When I asked for an alias, it caught that llm was already taken by Simon Willison’s tool and proposed lcl instead. None of these are insights I lacked. They are checks I would have skipped because I was already mentally on the next step.

By the end there was a written record of every decision and its reason, in the repo, not in my head. That is where it should be.

The Decisions That Fell Out

The interview converged on a short list, and the reasoning behind each one is worth more than the choice itself.

Model: Gemma 4 26B-A4B, the MoE, not a dense model. The single most important fact about local inference on a Mac is not how much RAM you have, it is how fast it can be read. Generation is bottlenecked on memory bandwidth. The M1 Max in Kyle’s article runs at 400 GB/s; my M1 Pro at roughly 200. Everything else being equal, I should expect about half his throughput. His 72 tokens per second becomes my 36. That is still faster than I read. It is also why a mixture-of-experts model is the right answer here: Gemma 4 26B-A4B holds 26 billion parameters but activates only four billion per token. The full weights sit in RAM, but each forward pass streams a fraction of them. The dense 27B sibling has the same disk footprint and meaningfully worse generation rate. It is probably smarter. It does not matter, because it is too slow for the slot I am filling.

Speculative decoding, on. Gemma 4 ships an MTP draft head, a tiny network that predicts the next several tokens cheaply, which the main model accepts or rejects in one pass. When acceptance is high, you get most of those tokens almost for free. Kyle measured +24 % on his machine. The HN thread is full of caveats about acceptance rates and MoE models benefiting less. None of it changes the decision: turn it on, measure, keep what you get.

Context: 32 K, not 65 K. The arithmetic the model made me do. A 65 K KV cache plus 16 GB of weights plus a draft model does not leave enough headroom on 32 GB once Chrome and an IDE are open.

mise, not a shell script. The article uses huggingface-cli to fetch the GGUFs and a hand-written tmux script to launch the server. Both work. Both are also infrastructure you now own and will forget you configured. llama.cpp’s -hf flag does the download without a Python virtualenv. mise’s github: backend installs the prebuilt Metal binary and rolls it forward with mise upgrade. The tmux script becomes a few named tasks in a .mise.toml. The choice is not about elegance, it is about which moving piece you are willing to own. I already own mise.

Ollama goes. I had it from earlier experiments, with gemma4:latest and a couple of other models taking 13 GB, plus the app idling in the background on port 11434. Ollama is a wrapper around llama.cpp. Once you run llama-server directly, the wrapper is in the way. So the app, the models directory, and the port all go, and the 13 GB comes back on a disk that was at 89 %.

The Setup, Start To Finish

This is the reproducible part. You can follow it by hand, or point an LLM at this section and let it do the work. That is roughly how it got built in the first place. The numbers below are the actual results on my M1 Pro, not estimates.

1. Decommission Ollama (skip if you never had it):

osascript -e 'tell application "Ollama" to quit' 2>/dev/null
pkill -f Ollama
rm -rf ~/.ollama /Applications/Ollama.app   # reclaimed 13 GB

2. Install llama.cpp via mise. The prebuilt arm64 binary has Metal and Accelerate baked in:

mise use --global 'github:ggml-org/llama.cpp@latest'
llama-server --help | grep -E -- '--spec-type|-hf'   # confirm MTP + HF download support

The one thing worth verifying is that --spec-type draft-mtp is present, since MTP support landed only recently. On my install (build b9631) it was there. If it ever is not, a source build with -DGGML_METAL=ON is the five-minute fallback.

3. A project with mise tasks at ~/code/localLLM/.mise.toml. The [env] block makes the model cache live with the project; the tasks replace the shell script:

[env]
LLAMA_CACHE = "{{ config_root }}/models"

[tools]
"github:ggml-org/llama.cpp" = "latest"

[tasks.start]
description = "Start llama-server (Gemma 4 26B-A4B + MTP) in tmux"
run = """
tmux has-session -t llama 2>/dev/null && { echo running; exit 0; }
tmux new-session -d -s llama -c "$PWD" \\
  "llama-server \\
    -hf  unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \\
    -hfd unsloth/gemma-4-26B-A4B-it-GGUF:Q8_0-MTP \\
    --spec-type draft-mtp --spec-draft-n-max 3 \\
    -ngl 999 -fa on -c 32768 --parallel 1 \\
    --host 127.0.0.1 --port 8082 \\
    2>&1 | tee -a logs/llama-server.log"
"""

[tasks.stop]
run = "tmux kill-session -t llama 2>/dev/null && echo stopped || echo 'not running'"

[tasks.status]
run = "curl -fsS http://127.0.0.1:8082/v1/models >/dev/null && echo up || echo down"

A note on the port. The article uses 8080. On my machine 8080 and 8081 were already taken by Google cloud-run processes, so the server silently answered on a port that was not mine until I noticed the responses carried a Server: Google Frontend header. I moved everything to 8082. Check lsof -i :8080 before you assume a port is free.

4. A one-key alias in ~/.zshrc. llm was taken, so lcl for “local”:

lcl() { mise -C "$HOME/code/localLLM" run "$@"; }
lcl start    # boot the server (first run downloads the model)
lcl status   # is it up?
lcl stop     # free the ~22 GB of RAM

5. First boot and download. lcl start triggers the download through llama.cpp’s -hf. The real footprint came in smaller than the article’s “~21 GB” estimate. The total was 18 GB: 16 GB for the Q4_K_XL main model, 1.2 GB for the MTP head (an auxiliary head, not a full second model), and a 441 MB multimodal projector that gets pulled automatically even though I do not need images. Add --no-mmproj if you want to skip it.

6. Point pi at it. A local provider in ~/.pi/agent/models.json:

{
  "providers": {
    "local": {
      "name": "Local llama.cpp",
      "baseUrl": "http://127.0.0.1:8082/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "models": [{
        "id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
        "name": "Gemma 4 26B-A4B Q4 + MTP",
        "reasoning": true,
        "input": ["text"],
        "contextWindow": 32768,
        "maxTokens": 8192,
        "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
      }]
    }
  }
}

And a local mode in modes.json pointing at it. Set "reasoning": true. Gemma 4 IT thinks by default and returns its working in reasoning_content, leaving content empty if you cap max_tokens before it finishes thinking. That one caught me out on the first benchmark: a tiny token budget produced an empty answer because the model had spent the whole budget reasoning.

What it actually does, measured on the M1 Pro:

MetricResult
Generation34–36 tok/s (predicted half of Kyle’s 72, got it)
Prompt ingestion77 tok/s
Time to first token~376 ms
MTP acceptance~65 % (e.g. 1350 of 2090 drafted tokens accepted)

Half the bandwidth, half the tokens, just as predicted. The MTP acceptance beat what I feared. The article saw +24 % on a dense workload; here, two thirds of the drafted tokens landed.

How I Actually Use A Fast Model

A fast model in pi is not the thing I have a conversation with. It is the model that does the small, high-volume, latency-sensitive jobs around the edges of the real work, where Opus is overkill and the round-trip cost is what matters. Those are the jobs I would rather keep on my own machine, and the place where 36 tokens per second is plenty.

In my setup, that means:

None of these need a frontier model. They need something that responds now and gets the structure right. That is the bar Gemma 4 has to clear to take the slot from Haiku.

So I gave it the slot. I went back and forth here. The cautious move was to add a separate local mode, keep fast on Haiku, and switch into the local one when I felt like testing it. I know how that goes: I would never remember to switch, and a week later I would have no data. So fast now points at Gemma 4 and the Haiku mode moved to lcl, there if I want to compare. Local models can be brilliant in a chat window and brittle inside an agent loop, and the only way I learn which one Gemma 4 is here is by living with it in the slot that fires twenty times a day. If it holds up, on tool calling, on diff quality, on the ten-line edits that fill my day, it stays. If it does not, the rollback is one line in modes.json.

What I Am Not Doing Yet

I am not putting the server behind launchd. At ~22 GB resident on a 32 GB machine, leaving it running alongside Chrome and an IDE is asking for swap. On-demand lcl start / lcl stop is honest about the cost in a way always-on is not.

I am not running a second model on a second port. Maybe Qwen 3.6 35B-A3B earns a slot for the cases where Gemma is too weak. Maybe one model is enough. I will know in a week.

The Smaller Lesson

The bigger story is bandwidth and quantisation and what fits in 32 GB. The smaller one, which I think is more useful, is about the planning. I have spent two years watching people, myself included, let agents do work the agent should not be doing, because asking felt cheaper than thinking. Handing a guide to a model and making it interrogate my plan is the opposite move: the legwork moved to the model, the agency stayed with me.

The local agent might work or it might not. I will know in a week. The transcript is useful regardless, because it forced me to write down why I made each choice, in a form I can read later when I have forgotten.

<< Previous Post

|

Next Post >>

#Ai #Local-Llm #Macos #Pi