Putting an API key into Pi and using a hosted model is a very boring operation. You select the provider, paste the key, and you are done thinking about how to get tokens. Doing the same thing locally, even when you have a high-end Mac with a lot of memory, is a completely different experience. You choose an inference engine, then a model, then a quantization, then a template, then a context size; then you throw a bunch of JSON configs into different parts of the stack, and then you discover that one of those choices quietly made the model worse or that something just does not work at all.

That is the gap I am interested in. … This is why I am excited about ds4.c. It’s Salvatore Sanfilippo’s deliberately narrow inference engine, built only for DeepSeek V4 Flash on Macs with 128GB+ of RAM. It is not a generic GGUF runner and it is not trying to be a framework. It is a model-specific native engine with a Metal path, model-specific loading, prompt rendering, KV handling, server API glue, and tests. … But if you have the right hardware and you care about local agents, I would love for you to try it within pi:

pi install https://github.com/mitsuhiko/pi-ds4

My hope is that this becomes a useful forcing function to really polish one coding agent experience. But really, the focal point should be ds4.c itself.

Source: Pushing Local Models With Focus And Polish

New local inference incantations

ref: Experiments with Qwen 3.6

Local models give me the same vibes as my love for ruby: I love them for their promise, but managing them is still a nightmare. (oh hai jekyll)

ollama is easy, but its performance on my machine is terrible, especially with qwen / gemma. Heading one level deeper, llama.cpp has great performance, but it needs the right incantation to avoid leaving that performance on the table. After additional research and a better understanding of what each parameter does, here is my new llama.cpp command for running Qwen3.6 at the right level of performance on my machine, an M1 Max with 64GB of memory:

llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q5_K_XL \
  --n-gpu-layers -1 \
  -c 262144 \
  -n 65536 \
  -ctk q8_0 \
  -ctv q8_0 \
  -t 8 \
  --flash-attn on \
  --no-context-shift \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --reasoning-budget 8192 \
  --repeat-penalty 1.00 \
  --presence-penalty 0.00 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  --mlock \
  --chat-template-kwargs '{"preserve_thinking": true}'  

The key unlocks are:

  • -c 262144: raises the context window from 131072 to 262144 tokens
  • --n-gpu-layers -1: offloads as many layers onto the GPU as possible
  • -n 65536: raises the maximum generated tokens to 65536
  • -ctk q8_0 / -ctv q8_0: quantizes the KV cache to q8_0, shrinking its memory footprint for a slight drop in tps
  • --mlock: locks the weights in memory so they are not paged out to the SSD
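
Once the server is up, a quick request against llama-server’s OpenAI-compatible endpoint is a good sanity check. A minimal sketch, assuming the default port of 8080 and a throwaway prompt:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "temperature": 0.6,
    "max_tokens": 64
  }'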

This came after an insane amount of tuning and experimentation, and I am sharing it back with the community through these posts.

I am keeping a close eye on ds4.c even though it doesn’t really apply to my own context (heh).

pi extensions

The biggest unlock in my daily workflow has come from pi extensions. They are the best way to learn a) how to leverage models and develop tools and tool calls that work across different models; b) how to use a specific model for a specific task; and c) how to do all of this in a more deterministic and reliable manner than just stating it in a SKILL.md file.

Note: I still recommend using the SKILL.md and frontmatter structure, especially as it looks like most models are focusing on reading them, so that whatever you are building stays portable across models, harnesses and agents.
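
For reference, a minimal SKILL.md in that structure might look like the sketch below. The name, description and body are placeholders of my own, not from any particular agent’s documentation:

---
name: web-research
description: Use when the user asks for sourced, up-to-date information from the web
---

Describe here how to search, what to cite, and how the final brief should be formatted.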

For example, here’s a sample of my “researcher” subagent:

---
name: researcher
description: Web researcher — searches the web and synthesizes findings
tools: web_search, fetch_content
model: openai-codex/gpt-5.5
reasoning: high
---

You are a research specialist. Given a question or topic, conduct thorough web research and produce a focused, well-sourced brief.

frontmatter + pi extensibility is a really good combination.

multi-model agent and harness

I think we are entering a world of different models (heh) of how harnesses and agents work. Anthropic is clearly on the path of my model, my agent, my harness. To their credit, they are developing a good harness, agent and model. However, they are rushing headlong into a generic UI for everything and everyone. I wish them luck. It would be incredible if they pull it off.

However, history suggests that the browser is the closest we’ve come to a universal interface. The best browsers quickly realized that to be successful they had to let go, and embrace and extend the web.

I find there is a dissonance between what Anthropic wants to do and how they are approaching the agent and the harness.

I will bet on the agent that sits on the side of the user - the user-agent, if you will. To me, that’s not claude or codex or gemini. It’s pi.

Minimal, Extensible, Malleable. Adopting the best of what the world will produce - be it the benefits of skills and plugins, or the determinism of extensions.