experiments with qwen3-30b-a3b
“all roads leads to local ai”
Source: nvk on X
My post yesterday was already circling this idea. This is an updated report on my experiments.
I first tried to make gemma4 work with Ollama. It did not go great. It just did not feel like the right path for the kind of computer use I wanted. The model was promising in places, especially tool calls, but the whole setup felt a little off for the tasks I gave it, even the simple ones: format conversions, using templates for posts, editing frontmatter and YAML.
I was sharing this with Nikhil Ramesh and he pointed me to llama.cpp. It turned out to be an efficient unlock that also let me try a range of different Hugging Face models.
Hugging Face sent me down the quantization rabbit hole. It helped me move from “can I run this model?” to “what shape of model can I actually live with on this machine?” and “what type of model helps me run the system I want?”
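The napkin math behind that question is simple. A rough sketch (numbers are approximate, assuming a ~30B-parameter model like the one below):

# how much unified memory is there to work with? (macOS)
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'

# rough weight sizes for a ~30B-parameter model:
#   8-bit quant: ~30B x 1.0 byte   = ~30 GB
#   5-bit quant: ~30B x 0.625 byte = ~19 GB
# plus a few GB of headroom for the (quantized) KV cache at long context.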
That is how I ended up at Qwen3-30B-A3B.
Right now I am testing unsloth’s 8-bit quantized build on my system. I have enough machine to make this interesting, and it is good enough for most of my tool calls and the tasks I end up running locally.
Here is the command I am using:
llama-server \
-hf unsloth/Qwen3-30B-A3B-GGUF:UD-Q8_K_XL \
--n-gpu-layers 49 \
-c 131072 \
-n 65536 \
--flash-attn on \
--no-context-shift \
-ctk q8_0 \
-ctv q8_0 \
-t 8 \
-b 2048 \
-ub 512 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--reasoning-budget 8192 \
--repeat-penalty 1.00 \
--presence-penalty 0.00 \
--chat-template-kwargs '{"preserve_thinking": true}'
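Once it is up, llama-server serves an OpenAI-compatible API (port 8080 by default), so the mundane tasks become plain HTTP calls. A minimal sketch, with a stand-in prompt:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Convert this JSON to YAML: {\"title\": \"hello\"}"}
    ],
    "temperature": 0.6
  }'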
On this setup, I get about 17 tps for prompt eval, then it jumps to 153.63 tps during generation. That is fast enough to matter. Actually usable fast. Fast enough to spend attention on the work rather than on coaxing the model.
I also tried the 5-bit quantized version. It feels much lighter on the machine, and the quality tradeoff is still acceptable, which surprised me more than the speed did.
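The launch is the same apart from the quant tag (I believe unsloth ships a UD-Q5_K_XL build of this repo, but check the model page for the exact tag); sampler flags trimmed here since they do not change:

llama-server \
-hf unsloth/Qwen3-30B-A3B-GGUF:UD-Q5_K_XL \
--n-gpu-layers 49 -c 131072 -n 65536 --flash-attn on \
-ctk q8_0 -ctv q8_0 -t 8 -b 2048 -ub 512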
Running either of them still causes some weird new issues. ssh sessions get cut off very quickly. It feels like macOS is trying to reclaim RAM :P.
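If it really is memory pressure, macOS has a couple of built-ins that will show it while the model is loaded (a sanity check, not a fix):

# quick look at memory pressure while the server is running
memory_pressure | tail -n 1
vm_stat | head -n 5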
Overall, this is a different experience from gemma4. Gemma never felt this capable as the primary model for running pi. Qwen3-30B-A3B still has familiar problems, overthinking and loops, but with some prompt management it is far more functional.
This cements my usage of pi. I can send mundane work to the local model and save codex for harder tasks or failure cases. That changes the cost curve and the psychology, and it makes local the default.
Maybe being banned from gemini-cli was good for me.
It still feels like a dumb little exile. But it forced me to look at other options. It pushed me toward local AI, toward llama.cpp, toward quantized models, toward a setup that actually fits the machine.
The tinkering continues.