For about half a year I stuck with 7B models and ran a strong 4-bit quantization on them, because I'd had very bad experiences with an old qwen 0.5B model.
But recently I tried running a smaller model like llama3.2 3B with an 8-bit quant, and qwen2.5-1.5B-coder at full 16-bit floating point, and those performed really well too on my 6GB VRAM GPU (GTX 1060).
So now I am wondering: should I pull heavily quantized versions of big models, or light quants/raw fp16 versions of smaller models?
What are your experiences with strong quants? I saw a video by that technovangelist guy on YouTube, and he said that sometimes even 2-bit quants can be perfectly fine.
UPDATE: Whoa, I just tried llama3.1 8B Q4 on ollama again, and what a WORLD of difference compared to llama3.2 3B at fp16!
The difference is massive. The 3B and 1B llama3.2 models seem to be mostly good at summarizing text and maybe generating some JSON based on previous input. But the bigger 3.1 8B model can actually be used in a chat environment! It has a good response length (about 3 lines per message) and it doesn't pad out its answers. It seems like a really good model and I will now use it for more complex tasks.
A 2-bit or 3-bit quantization is quite a trade-off. At 2-bit, it'll probably be worse than a smaller model with a lighter quantization at the same effective size.
There is a sweet spot somewhere between 4 and 8 bit. More than 8-bit seems to be a waste; it's practically indistinguishable from full precision.
General advice seems to be: take the largest model you can fit at around 4-bit or 5-bit.
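To see why a 2-bit big model can lose to a 4-bit smaller one "at the same effective size", here's a rough back-of-the-envelope sketch. The numbers are idealized; real GGUF files differ because of mixed-precision layers, metadata, and per-block scales.

```python
# Rough effective-size comparison of quantized models.
# Idealized arithmetic: real GGUF files are somewhat larger.
def model_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters * bits per weight / 8."""
    return params_billion * bits_per_weight / 8

# A 7B model at 2-bit and a 3B model at 4-bit have a similar footprint,
# but the lightly quantized 3B usually degrades less.
seven_b_q2 = model_size_gb(7, 2)   # ~1.75 GB of weights
three_b_q4 = model_size_gb(3, 4)   # ~1.5 GB of weights
print(f"7B @ 2-bit: {seven_b_q2:.2f} GB, 3B @ 4-bit: {three_b_q4:.2f} GB")
```

So both fit in roughly the same VRAM budget, and the question becomes which one answers better, which is exactly what perplexity comparisons are for.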
The official way to compare options like this is to calculate the perplexity of each one and pick the one with the lowest perplexity that still fits.
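For anyone unfamiliar with the metric: perplexity is just the exponential of the average negative log-likelihood the model assigns to each token of a test text, so lower means the model is less "surprised". A minimal sketch of the math:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# If a model assigned every token probability 0.25, perplexity would be 4:
print(perplexity([math.log(0.25)] * 10))  # -> 4.0
```

In practice you don't compute this by hand; llama.cpp ships a perplexity tool (named `llama-perplexity` in recent builds) that runs a GGUF model over a text file and reports the score, so you can compare your candidate quants directly.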
And by the way: I don’t really use the tiny models like 3B parameters. They can write text, but they don’t seem to be able to store much knowledge, so they can’t handle complex questions and they make a lot of things up. I usually use 7B to 14B parameter models. That’s a proper small model. And I stick to 4-bit or 5-bit quants for llama.cpp.
Your graphics card should be able to run an 8B parameter LLM (4-bit quantized). I’d prefer that to a 3B one; it’ll be way more intelligent.
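A quick sanity check that an 8B model at 4-bit fits in 6 GB: the weights take about params × bits / 8, plus some headroom for the KV cache and activations. The 1.5 GB overhead figure below is a guess and grows with context length, not a fixed constant.

```python
def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=1.5):
    """Rough fit check: quantized weights plus an assumed KV-cache/
    activation overhead (grows with context length) vs. available VRAM."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(8, 4, 6))   # 8B @ 4-bit: ~4 GB weights -> True
print(fits_in_vram(8, 16, 6))  # 8B @ fp16: ~16 GB weights -> False
```

By this estimate a 4-bit 8B model just fits on a 6 GB GTX 1060, while the same model at fp16 is far out of reach, which matches the advice to take the biggest model you can fit at around 4-bit.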