In general there's no "best" LLM; all of them have strengths and weaknesses. There are a bunch of good picks; for example:
> DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
Released today; probably the best reasoning model at the 8B size.
> Qwen3 - https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
Recently released. Hybrid thinking/non-thinking models with really great performance and a plethora of sizes for every kind of hardware. The Qwen3-30B-A3B can even run on a CPU at acceptable speeds. Even the tiny 0.6B one is somewhat coherent, which is crazy.
There's no one "best" model; you just try a few, play with parameters, and see which one fits your needs best.
Since you're on HN, I'd recommend skipping Ollama and LM Studio. They can lag behind the latest models, and you typically only get to choose from the ones they've tested. Besides, what fun is it when you don't get to peek under the hood?
llama.cpp can do a lot by itself, and it supports most recently released models (when changes are needed, the maintainers adjust literally within a few days). You can get models from Hugging Face, obviously. I prefer the GGUF format; it saves me some memory since you can use lower quantization (I find 6-bit quants generally satisfactory).
I find that the size of the model's GGUF file roughly tells me whether it'll fit in my VRAM. For example, a 24GB GGUF model will NOT fit in 16GB, whereas a 12GB one likely will. However, the more context you add, the more memory is needed.
Keep in mind that models are trained with a certain context window. If a model has an 8K-token context (like many older models do) and you load it with a 32K context, it won't be much help beyond what it was trained for.
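If you want a rough back-of-the-envelope check before downloading, something like this sketch works (the layer/head numbers below are illustrative assumptions for a ~14B-class model, not any specific model's config — read the real values off the model card):

```python
# Rough "will it fit" estimate: GGUF file size + KV cache for the context you want.
# n_layers / n_kv_heads / head_dim vary per model; these defaults are just examples.

def kv_cache_gb(n_layers=40, n_kv_heads=8, head_dim=128,
                ctx_len=16384, bytes_per_elem=2):  # 2 bytes = f16 cache
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

gguf_file_gb = 9.0  # e.g. a Q4-ish quant of a 14B model
total = gguf_file_gb + kv_cache_gb()
print(f"~{total:.1f} GB needed, plus some overhead -- leave headroom on a 16GB card")
```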
You can run llama.cpp on Linux, Windows, or macOS; you can grab the binaries or compile it locally. It can split the model between VRAM and RAM (if the model doesn't fit in your 16GB). It even has a simple React front-end (llama-server). The same binary provides a REST service whose protocol is similar to (but simpler than) OpenAI's and all the other "big" guys'.
Since it implements the OpenAI REST API, it also works with a lot of front-end tools if you want more functionality (e.g. oobabooga, aka text-generation-webui).
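As a minimal sketch of what that looks like from Python (assuming llama-server on its default port 8080; the model name is just a placeholder, since the server answers with whatever GGUF you loaded):

```python
# pip install openai
from openai import OpenAI

# llama-server exposes an OpenAI-compatible /v1 API; the key just has to be non-empty.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local",  # placeholder; llama-server uses whichever model it loaded
    messages=[{"role": "user", "content": "Explain GGUF quantization in two sentences."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```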
Koboldcpp is another backend you can try if you find llama.cpp too raw (I believe it's still llama.cpp under the hood).
Qwen is pretty good and comes in a variety of sizes. Given your VRAM, I’d suggest Qwen/Qwen3-14B-GGUF at Q4_K_M, run with llama-server or LM Studio (there may be alternatives to LM Studio, but these are generally nice UIs over llama-server). It’ll use around 7-8GB for weights, leaving room for incidentals.
Llama 3.3 could work for you
Devstral is too big at full size, but you could run a quantized version.
Gemma is good but tends to refuse a lot. MedGemma is a nice thing to have around just in case.
“Uncensored” Dolphin models from Eric Hartford and “abliterated” models are what you want if you don’t want them refusing requests. That’s mostly unnecessary for routine use, but sometimes you ask them to write a joke and they won’t do it; or if you do work that touches defense contracting or security research, that kind of thing, it could be handy.
Generally the weights are bf16, so you multiply the number of billions of parameters by two to get the unquantized model size in GB.
Then, to get a model that fits on your rig, you generally want a quantized one. Typically I go for “Q4_K_M”, which is roughly 4 bits per param, so you divide the number of billions of params by two to estimate the VRAM needed for the weights.
I’m not sure about the overhead for activations, but it’s a good idea to leave wiggle room and experiment with sizes well below 16GB.
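A quick worked example of that arithmetic, using Qwen3-14B (the ~4.5 bits/param figure for Q4_K_M is an approximation; real GGUF files run a bit over 4 bits because of the mixed quant types):

```python
def weights_gb(params_billions: float, bits_per_param: float) -> float:
    # size in GB = params * (bits / 8 bytes per param) / 1e9
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("bf16", 16), ("Q8_0", 8), ("Q4_K_M (~4-5 bits)", 4.5)]:
    print(f"Qwen3-14B @ {name}: ~{weights_gb(14, bits):.1f} GB of weights")
# bf16 ~28 GB (no chance), Q8 ~14 GB (too tight), Q4 ~8 GB (comfortable on 16GB)
```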
llama-server is a good way to run models: it serves a GUI on the index route, and the -hf flag downloads models for you.
LM Studio is a good GUI; it installs a llama.cpp-based server for you and can help with managing models.
Make sure you run some server that loads the model once and keeps it resident. You definitely don’t want to load many gigabytes of weights into VRAM for every question if you want fast, realtime answers.
You can even keep track of the quality of the answers over time to help guide your choice.
However, it's heavily censored on political topics because of its Chinese origin. For world knowledge, I'd recommend Gemma3.
This post will be outdated in a month. Check https://livebench.ai and https://aider.chat/docs/leaderboards/ for up-to-date benchmarks.
Going below Q4 isn't worth it IMO. If you want significantly more context, probably drop down to a Q4 quant of Qwen3-8B rather than continuing to lobotomize the 14B.
Some folks have been recommending Qwen3-30B-A3B, but I think 16GB of VRAM is probably not quite enough for that: at Q4 you'd be looking at 15GB for the weights alone. Qwen3-14B should be pretty similar in practice though, despite the lower param count, since it's a dense model rather than a sparse one: dense models are generally smarter-per-param than sparse models, but somewhat slower. Your 5060 should be plenty fast enough for the 14B as long as you keep everything on-GPU and stay away from CPU offloading.
Since you're on a Blackwell-generation Nvidia chip, using LLMs quantized to NVFP4 specifically will provide some speed improvements at some quality cost compared to FP8 (and will be faster than Q4 GGUF, although ~equally dumb). Ollama doesn't support NVFP4 yet, so you'd need to use vLLM (which isn't too hard, and will give better token throughput anyway). Finding pre-quantized models at NVFP4 will be more difficult since there's less-broad support, but you can use llmcompressor [1] to statically compress any FP16 LLM to NVFP4 locally — you'll probably need to use accelerate to offload params to CPU during the one-time compression process, which llmcompressor has documentation for.
I wouldn't reach for this particular power tool until you've decided on an LLM already, and just want faster perf, since it's a bit more involved than just using ollama and the initial quantization process will be slow due to CPU offload during compression (albeit it's only a one-time cost). But if you land on a Q4 model, it's not a bad choice once you have a favorite.
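If you do go that route, the flow looks roughly like the sketch below. Treat it as a sketch only: the exact import paths, the "NVFP4" scheme string, whether your llmcompressor version wants a calibration dataset for it, and the save flags are assumptions you should verify against llmcompressor's examples; the model ID and output directory are placeholders.

```python
# pip install llmcompressor transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-14B"    # placeholder: whichever BF16 model you settled on
OUT_DIR = "Qwen3-14B-NVFP4"

# device_map="auto" lets accelerate offload layers to CPU if they don't fit in VRAM.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Assumed scheme name -- check llmcompressor's NVFP4 example for the current spelling.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained(OUT_DIR)   # save options for compressed formats vary by version
tokenizer.save_pretrained(OUT_DIR)
```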
That said, I'd go with Unsloth's version of Qwen3 30B, running via llama.cpp (don't waste your time with any other inference engine), with the following arguments (documented in Unsloth's docs, but sometimes hard to find): `--threads (number of threads your CPU has) --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20`, along with whatever other arguments you need.
Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF (with 16GB, grab Q3_K_XL: it fits in VRAM and leaves about 3-4GB for the other apps on your desktop and the other allocations llama.cpp needs to make).
Also, why 30B and not the full-fat 235B? You don't have 120-240GB of VRAM. The 14B and smaller ones are also not what you want: more parameters are better, and parameter precision matters far less (which is why Unsloth has their specially crafted <=2-bit versions that are 85%+ as good, yet ridiculously tiny compared to the originals).
Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3
The Qwen3 family from Alibaba seems to be the best set of reasoning models that fit on local hardware right now. Reasoning models on local hardware are annoying when you just want an immediate response, but they vastly outperform non-reasoning models on things where you want the model to be less naive/foolish.
Gemma3 from google is really good at intuition-oriented stuff, but with an obnoxious HR Boy Scout personality where you basically have to add "please don't add any disclaimers" to the system prompt for it to function. Like, just tell me how long you think this sprain will take to heal, I already know you are not a medical professional, jfc.
Devstral from Mistral performs the best on my command line utility where I describe the command I want and it executes that for me (e.g. give me a 1-liner to list the dotfiles in this folder and all subfolders that were created in the last month).
Nemo from Mistral, I have heard (but not tested), is really good for routing-type jobs, where you need something to make a simple multiple-choice decision competently with low latency, and it's easy to fine-tune if you want to get that sophisticated.
I was trying Patricide unslop mell and some of the Qwen ones recently. Up to a point, more params is better than worrying about quantization, but eventually you'll hit a compute wall with high param counts.
KV cache quantization is awesome (I use q4 for a 32k context on a 1080 Ti!), and context shifting is also awesome for long conversations/stories/games. I was using ooba, but recently found that KoboldCPP not only runs faster for the same model/settings, but its context shifting also works much more consistently than Ooba's "streaming_llm" option, which almost always re-evaluates the prompt when hooked up to something like ST.
I'd like to know how many tokens per second you can get out of the larger models especially (using Ollama + Open WebUI on Docker Desktop, or LM Studio, whatever). I'm probably not upgrading my GPU this year, but I'd appreciate an anecdotal benchmark.
- gemma3:12b
- phi4:latest (14b)
- qwen2.5:14b [I get ~3 t/s on all these small models, acceptably slow]
- qwen2.5:32b [this is about my machine's limit; verrry slow, ~1 t/s]
- qwen2.5:72b [beyond my machine's limit, but maybe not yours]
I've found that Qwen3 is generally really good at following instructions, and you can very easily toggle reasoning by adding "/no_think" to the prompt to turn it off.
The reason Qwen3-30B works so well is that it's a MoE, so only a few billion parameters are active per token. I've tested the 14B model and it's noticeably slower because it's a dense model.
It’s like asking what the best pair of shoes is.
Go on Ollama and look at the most popular models. You can decide for yourself what you value.
And start small, these things are GBs in size so you don’t want to wait an hour for a download only to find out a model runs at 1 token / second.
Ollama is the easiest way to get started trying things out IMO: https://ollama.com/
I asked it a question about militias. It thought for a few pages about the answer and whether to tell me, then came back with "I cannot comply".
Nidum is the name of an uncensored Gemma variant; it does a good job most of the time.
Qwen_Qwen3-14B-IQ4_XS.gguf https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF
Gemma3 is a good conversationalist but tends to hallucinate. Qwen3 is very smart but also very stubborn (not very steerable).
And the part I like the most is that there's almost no censorship, at least not in the models I've tried. For me, having an uncensored model is one of the most compelling reasons for running an LLM locally. Jailbreaks are a PITA, and abliteration and other uncensoring fine-tunes tend to make models that censorship has already made dumb even dumber.
It holds its value, so you won't lose much, if anything, when you resell it.
But otherwise, as said, install Ollama and/or llama.cpp and run the model using the --verbose flag.
This will print out the tokens-per-second result after each prompt is returned.
Then find the best model that gives you a tokens-per-second speed you are happy with.
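If you're talking to the model through llama-server's or Ollama's OpenAI-compatible API rather than the CLI, a rough timing sketch like this gives you the same number (the base_url and model name are assumptions for a default local llama-server setup):

```python
# pip install openai
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.time()
resp = client.chat.completions.create(
    model="local",  # placeholder; the server answers with whatever model it loaded
    messages=[{"role": "user", "content": "Write a 200-word summary of GGUF."}],
)
elapsed = time.time() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```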
And as also said, 'abliterated' models are less censored versions of normal ones.
SmolVLM is pretty useful. https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
I realize they aren’t going to be as good… but the whole search during reasoning is pretty great to have.
For 16gb and speed you could try Qwen3-30B-A3B with some offload to system ram or use a dense model Probably a 14B quant
It's slow-ish but still useful, getting 5-10 tokens per second.
I'll give Qwen2.5 a try on the Apple Silicon, thanks.