Update: Dan’s latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.
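The reported sizes roughly track simple bits-per-weight arithmetic. A quick sanity check (my own back-of-the-envelope math, not from the project; the small gap to 209GB would come from layers kept at higher precision and file overhead):

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of quantized weights in GB (10^9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# ~397B parameters at 4 bits per weight vs. 2 bits per weight
size_4bit = quantized_size_gb(397, 4)  # ~198.5 GB, close to the 209GB on disk
size_2bit = quantized_size_gb(397, 2)  # ~99 GB, which is why 2-bit was tempting

print(f"4-bit: ~{size_4bit:.0f} GB, 2-bit: ~{size_2bit:.0f} GB")
```

Halving the bits halves the footprint, but as the update notes, 2-bit cost enough precision to break tool calling.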
Source: Autoresearching Apple’s “LLM in a Flash” to run Qwen 397B locally
I love these projects. The hardware is getting good, and I'm excited for this project to improve to the point where running local models is genuinely practical.
However, the $2600 a 48GB MacBook Pro costs would instead buy roughly 13 years of access to the best cloud models at ~$200/year on the mid-tier plans.
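The break-even works out like this (using the prices quoted above):

```python
# Cost comparison: buying local hardware vs. paying for a mid-tier cloud plan.
machine_cost = 2600        # 48GB MacBook Pro, per the figure above
plan_cost_per_year = 200   # mid-tier cloud subscription

breakeven_years = machine_cost / plan_cost_per_year
print(f"Break-even: {breakeven_years:.0f} years of cloud access")
```

And that ignores that the cloud models keep improving over those years while the hardware stays fixed.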
I know it’s “local and private.” But as it stands right now, I am not sure I would trade “fast and latest” for “private.”