Smaller models might not make the best agentic coding assistants, but I have a 128GB RAM headless machine serving llama.cpp with a number of local models that handles various tasks on a daily basis and works great.
- Qwen3-VL:30b > A file watcher on my NAS sends new images to it; it auto-captions each one, writes the text description into a hidden EXIF field in the image, and adds an entry to a Qdrant vector database for fuzzy searching and organization.
- Gemma3:27b > Used for personal translation work (mostly English and Chinese). Haven't had a chance to try out the Gemma4 models yet.
- Llama3.1:8b > Performs sentiment analysis on texts / comments / etc.
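A minimal sketch of the captioning step in that pipeline, assuming llama.cpp is serving Qwen3-VL behind its OpenAI-compatible `/v1/chat/completions` endpoint (the URL, prompt, and function names here are illustrative assumptions, not details from the post):

```python
import base64
import json
import urllib.request

# Assumed llama.cpp server address; adjust to your setup.
LLAMA_URL = "http://localhost:8080/v1/chat/completions"

def build_caption_request(image_bytes: bytes,
                          prompt: str = "Describe this image in one sentence.") -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 128,
    }

def caption_image(image_bytes: bytes) -> str:
    """Send the image to the llama.cpp server and return the generated caption."""
    req = urllib.request.Request(
        LLAMA_URL,
        data=json.dumps(build_caption_request(image_bytes)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

From there, a file watcher on the NAS would call `caption_image` on each new file, embed the returned text into an EXIF field, and upsert a corresponding vector into Qdrant.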
Look into updating to Gemma4 and Qwen3.6; they're good at agentic things. Qwen3.6 MoE with Unsloth's 8-bit quant is my daily driver now.
Have you noticed a gap between 8-bit and 4-bit quants? I've always run 4-bit because it needs less memory.
I run the biggest quant because it's more capable; the Spark has enough memory for two Qwen instances at 8-bit with full context length (roughly 48GB each).
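As a rough back-of-the-envelope check on that kind of sizing (the parameter count and KV-cache dimensions below are illustrative assumptions, not specs from the post):

```python
def model_memory_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory: parameter count x bits per weight, in GB."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb(layers: int, context: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """Approximate KV cache: 2 (K and V) x layers x context x kv_heads x head_dim."""
    return 2 * layers * context * kv_heads * head_dim * bytes_per_val / 1e9

# A ~35B-parameter model at 8 bits per weight is ~35GB of weights alone;
# a long context window then adds several more GB of KV cache, which is
# how an instance lands in the ~48GB ballpark.
weights = model_memory_gb(35, 8)  # ~35.0 GB
```

The exact KV-cache figure depends on the model's layer count, GQA head configuration, and cache precision, so treat these functions as order-of-magnitude estimates.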
I find Gemini/Gemma have become worse at coding. They're better for non-coding tasks, though maybe not even that: hallucinations and instruction following have both degraded, IME.
In my country we write longer and more detailed texts in primary school. That's blogpostmaxxing.
Comparing to Opus is a little unfair; a comparison against Haiku would be fairer. And for a really fast cloud model, I'd be interested to see how latency stacks up against GPT-5.3 Instant or Gemini 3.1 Flash Lite.
I've been using Qwen-3.6-35B-A3B on a Spark for my local coding adventures. It's really quite good, and getting faster (running the PR before it gets merged already gives a 2x token-generation speedup).
I can't imagine paying Opus prices anymore. The little guys are getting good enough, and in another year I expect them to be even better.
This whatever-'-maxxing' nonsense is really 'cringemaxxing'. People should look at the origins of the suffix before using it everywhere.
You need to be less etymology-pilled. Seriously tho, it's a practical word choice in a lot of cases; it puts emphasis on the 'maxxing'. Think of it as claiming the word as your own.