Ah I mean fair enough :) I don’t keep up much with car brands and ownerships, but still TIL haha
Huh, didn’t realize Volvo was primarily owned by a Chinese company, you got me there lol, genuinely always thought they were standalone and therefore a Swedish company
If you’re using text generation webui there’s a bug where if your max new tokens is equal to your prompt truncation length it will remove all input and therefore just generate nonsense since there’s no prompt
Reduce your max new tokens and your prompt should actually get passed to the backend. This is more noticable in models with only 4k context (since a lot of people default max new tokens to 4k)
Colour me intrigued. I want more manufactures that go against the norm. If they put out a generic slab with normal specs at an expected price, I won’t be very interested, but if they do something cool I’m all for it
Except I just noticed the part where it’s developed by Meizu so nevermind probably will be a generic Chinese phone
Stop making me want to buy more graphics cards…
Seriously though this is an impressive result, “beating” gpt3.5 is a huge milestone and I love that we’re continuing the trend. Will need to try out a quant of this to see how it does in real world usage. Hope it gets added to the lmsys arena!
If you go for it and need any help lemme know I’ve had good results with Linux and Nvidia lately :)
Btw I know this is old and you may have already figured out your hardware and setup, but p40s and p100s go for super cheap on eBay.
P40 is an amazing $/GB deal, only issue is the fp16 performance is abysmal so you’ll want to run either full fp32 models or use llama.cpp which is able to cast up to that size
The p100 has less VRAM but really good fp16 performance which makes it ideal for exllamav2 usage. I picked up one of each recently, p40 was failed to deliver and p100 was delivered while I’m away, but once I have both on hand I’ll probably post a comparison to my 3090 for interests sake
Also I run all my stuff on Linux (Ubuntu 22.04) with no issues
You shouldn’t need nvlink, I’m wondering if it’s something to do with AWQ since I know that exllamav2 and llama.cpp both support splitting in oobabooga
Yeah q2 logic is definitely a sore point, I’d highly recommend going with Mistral dolphin 2.6 DPO instead, the answers have been very high quality for a 7b model
But good info for anyone wanting to keep up to date on very low bit rate quants!
I don’t have a lot of experience with either at this time, I’ve used them here and there for programming questions but usually I stick to 7b models because I use them for code completion and I only find that useful if it completes the code before I do lol
That said, I’ve had overall good answers from either whenever I’ve decided to pull them out, it feels like wizard coder should be better since it’s so much newer but overall it hasn’t been that different. Wish phind would release an update :(
I run my Nvidia stuff in containers to not have to deal with all the stupid shenanigans
The 3060 is a nice cheap one for running okay sized models, but if you can find a way to stretch for a 3090 or a 7900 XTX you’ll be able to run these 33B models with decent quant levels
First few quants are up: https://huggingface.co/bartowski/WizardCoder-33B-V1.1-exl2
4.25 should fit nicely into 24gb (3090, 4090)
Smaller sizes still being created, 3.5, 3.0, and 2.4
I use text-generation-webui mostly. If you’re only using GGUF files (llama.cpp), koboldcpp is a really good option
A lot of it is the automatic prompt formatting, there’s probably like 5-10 specific formats that are used, and using the right one for your model is very important to achieve optimal output. TheBloke usually lists the prompt format in his model card which is handy
Rope and yarn refer to extending the default context of a model through hacky (but functional) methods and probably deserve their own write up
Yeah so those are mixed, definitely not putting each individual weight to 2 bits because as you said that’s very small, i don’t even think it averages out to 2 bits but more like 2.56
You can read some details here on bits per weight: https://huggingface.co/TheBloke/LLaMa-30B-GGML/blob/8c7fb5fb46c53d98ee377f841419f1033a32301d/README.md#explanation-of-the-new-k-quant-methods
Unfortunately this is not the whole story either, as they get further combined with other bits per weight, like q2_k is Q4_K for some of the weights and Q2_K for others, resulting in more like 2.8 bits per weight
Generally speaking you’ll want to use Q4_K_M unless going smaller really benefits you (like you can fit the full thing on GPU)
Also, the bigger the model you have (70B vs 7B) the lower you can go on quantization bits before it degrades to complete garbage
If you’re using llama.cpp chances are you’re already using a quantized model, if not then yes you should be. Unfortunately without crazy fast ram you’re basically limited to 7B models if you want any amount of speed (5-10 tokens/s)
Very interesting they wouldn’t let him film the camera bump… it must have some kind of branding on it like Hasselblad? Or maybe they’ve secretly found a way to have no bump! One can dream…
Yeah definitely need to still understand the open source limits, they’re getting pretty dam good at generating code but their comprehension isn’t quite there, I think the ideal is eventually having 2 models, one that determines the problem and what the solution would be, and another that generates the code, so that things like “fix this bug” or more vague questions like “how do I start writing this app” would be more successful
I’ve had decent results with continue, it’s similar to copilot and actually works decently with local models lately:
You can get the resulting PPL but that’s only gonna get you a sanity check at best, an ideal world would have something like lmsys’ chat arena and could compare unquantized vs quantized but that doesn’t yet exist