I'm using Ollama on my server with the WebUI. It has no GPU, so it's not quick to reply, but it's not too slow either.
I'm thinking about removing the VM since I just don't use it. Are there any good uses or integrations with other apps that might convince me to keep it?
I use the Continue VS Code plugin with Ollama to run a couple of different models (deepseek-coder-v2 and starcoder2) and recreate a local-only, GitHub Copilot-style experience for coding. This is on M1 Apple Silicon though. For autocomplete, the generation needs to be pretty brisk - I’m not sure how that would go in a VM without a GPU.
How well does the M1 chip keep up? What size models are you running on it? I'm interested in getting an M1 laptop, so I'm curious.
NAME                        ID            SIZE    MODIFIED
starcoder2:latest           f67ae0f64584  1.7 GB  3 days ago
phi3:latest                 d184c916657e  2.2 GB  3 weeks ago
deepseek-coder-v2:latest    8577f96d693e  8.9 GB  3 weeks ago
llama3:8b-instruct-q8_0     1b8e49cece7f  8.5 GB  3 weeks ago
dolphin-mistral:latest      5dc8c5a2be65  4.1 GB  3 weeks ago
codeqwen:latest             df352abf55b1  4.2 GB  3 weeks ago
llama3:latest               365c0bd3c000  4.7 GB  4 weeks ago
I mostly use starcoder2 with Continue for code autocomplete. The big deepseek-coder-v2 is a bit slow (I can feel it thinking), but it and the regular llama3 are good for chatbot-type programming questions.
I don’t really have anything to compare the M1 performance to. I guess the 8 GB models output text a little slower than the web versions of the same models, and the 4 GB ones at about the same speed. Using ollama in the terminal, there’s sometimes a 0.5-2 second pause before it starts outputting. Not with phi3 though - it’s surprisingly snappy for the quality of answers.
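If anyone wants to put rough numbers on that pause and on output speed (on an M1 or on a CPU-only VM), here is a minimal Python sketch. It assumes Ollama is listening on its default address, http://localhost:11434, and uses the streaming /api/generate endpoint. The function names measure_stream and time_model are just made up for this example, and the timing is wall-clock from the moment the request is sent, so treat the results as ballpark figures rather than a proper benchmark.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Optional


def measure_stream(
    lines: Iterable[bytes],
    start: Optional[float] = None,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a newline-delimited JSON stream from /api/generate and time it.

    Hypothetical helper for illustration; `start` is when the request was sent.
    """
    if start is None:
        start = clock()
    first_chunk_at = None
    parts = []
    final = {}
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines, if any
        chunk = json.loads(raw)
        piece = chunk.get("response", "")
        if piece and first_chunk_at is None:
            first_chunk_at = clock()  # first visible text arrived
        parts.append(piece)
        if chunk.get("done"):
            final = chunk  # the last chunk carries Ollama's own counters
            break
    tokens_per_second = None
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    if final.get("eval_count") and final.get("eval_duration"):
        tokens_per_second = final["eval_count"] / (final["eval_duration"] / 1e9)
    return {
        "text": "".join(parts),
        "seconds_to_first_chunk": (
            None if first_chunk_at is None else first_chunk_at - start
        ),
        "tokens_per_second": tokens_per_second,
    }


def time_model(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one streaming prompt to a local Ollama server (default port assumed)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        return measure_stream(response, start=start)


# Example (needs a running Ollama server with the model already pulled):
#   stats = time_model("phi3", "Explain a Python list comprehension in one line.")
#   print(stats["seconds_to_first_chunk"], stats["tokens_per_second"])
```

The first call after a model has been idle also includes the time to load it into memory, so it is worth running each model twice and comparing the second run.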