Who uses open LLMs and coding assistants locally? Share setup and laptop

Question

Dear Hackers, I&rsquo;m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) you&rsquo;re using?What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).I'm conducting my own investigation, which I will be happy to share as well when over.Thanks! Andrea.

embedding-shape · Accepted Answer

> Which model(s) are you running (e.g., Ollama, LM Studio, or others)
I'm running mainly GPT-OSS-120b/20b depending on the task, Magistral for multimodal stuff and some smaller models I've fine-tuned myself for specific tasks..
All the software is implemented by myself, but I started out with basically calling out to llama.cpp, as it was the simplest and fastest option that let me integrate it into my own software without requiring a GUI.
I use Codex and Claude Code from time to time to do some mindless work too, Codex hooked up to my local GPT-OSS-120b while Claude Code uses Sonnet.
> What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?
Desktop, Ryzen 9 5950X, 128GB of RAM, RTX Pro 6000 Blackwell (96GB VRAM), performs very well and I can run most of the models I use daily all together, unless I want really large context then just GPT-OSS-120B + max context, ends up taking ~70GB of VRAM.
> What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).
Almost anything and everything, but mostly coding. But then general questions, researching topics, troubleshooting issues with my local infrastructure, troubleshooting things happening in my other hobbies and a bunch of other stuff. As long as you give the local LLM access to a search tool (I use YaCy + my own adapter), local models works better for me than the hosted models, mainly because of the speed and I have better control over the inference.
It does fall short on really complicated stuff. Right now I'm trying to do CUDA programming, creating a fused MoE kernel for inference in Rust, and it's a bit tricky as there are a lot of moving parts and I don't understand the subject 100%, and when you get to that point, it's a bit hit or miss. You really need to have a proper understanding of what you use the LLM for, otherwise it breaks down quickly. Divide and conquer as always helps a lot.

lreeves · Answer

I sometimes still code with a local LLM but can't imagine doing it on a laptop. I have a server that has GPUs and runs llama.cpp behind llama-swap (letting me switch between models quickly). The best local coding setup I've been able to do so far is using Aider with gpt-oss-120b.
I guess you could get a Ryzen AI Max+ with 128GB RAM to try and do that locally but non-nVidia hardware is incredibly slow for coding usage since the prompts become very large and take exponentially longer but gpt-oss is a sparse model so maybe it won't be that bad.
Also just to point it out, if you use OpenRouter with things like Aider or roocode or whatever you can also flag your account to only use providers with a zero-data retention policy if you are truly concerned about anyone training on your source code. GPT5 and Claude are infinitely better, faster and cheaper than anything I can do locally and I have a monster setup.

sho · Answer

Real-world workflows? I'm all for local LLM, tinker with it all the time, but for productive coding use no local LLM approaches cloud and it's not even close. There's no magic trick or combination of pieces, it just turns out that a quarter million dollars worth of H200s is just much, much better than anything a normal person could possibly deploy at home.Give it time, we'll get there, but not anytime soon.

bravetraveler · Answer

I'm more local than anything, I guess. A Framework Desktop off in another room. 96G set aside for VRAM though I barely use it.Kept it simple: ollama, whatever the latest model is in fashion [when I'm looking]. Feel silly to name any one in particular, I make them compete. I usually don't bother: I know the docs I need.

jwpapi · Answer

On a side note I really thing latency is still important. Is there some benefit in choosing location for where you get your responses from? Like with Openrouter f.e.Also I could think that a local model just for autocomplete could help reducing latency for completion suggestions.

juujian · Answer

I passed on the machine, but we set up gpt-oss-120b on a 128GB RAM Macbook pro and it is shockingly usable. Personally, I could imagine myself using that instead of OpenAI's web interface. The Ollama UI has web search working, too, so you don't have to worry about the model knowing the latest and greatest about every software package. Maybe one day I'll get the right drivers to run a local model on my Linux machine with AMD's NPU, too, but AMD has been really slow on this.

disambiguation · Answer

Not my build and not coding, but I've seen some experimental builds (oss 20b on a 32gb mac mini) with Kiwix integration to make what is essentially a highly capable local private search engine.

reactordev · Answer

I use LM Studio with GGUF models running on either my Apple MacBook Air M1 (it’s, ok…) or my Alienware x17 R2 with an RTX 3080 on a Core i9 (runs like autocomplete) in VS Code using Continue.dev
My only complaint is agent mode needs good token gen so I only go agent mode on the RTX machine.
I grew up on 9600baud so I’m cool with watching the text crawl.

firefax · Answer

I've been using Ollama, Gemma3:12b is about all my little air can handle.If anyone has suggestions on other models, as an experiment I tried asking it to design me a new latex resum&eacute; and it struggled for two hours with the request to put my name prominently at the top in a grey box with my email and phone number beside it.

baby_souffle · Answer

Good quality still needs more power than what a laptop can do. The local llama subreddit has a lot of people doing well with local rigs, but they are absolutely not laptop size.

ThrowawayTestr · Answer

I use the abliterated and uncensored models to generate smut. SwarmUI to generate porn. I can only get a few tokens/s on my machine so not fast enough for quick back and forth stuff.

lovelydata · Answer

llama.cpp + Qwen3-4B running on older PC with AMD Radeon GPU (Vulcan). Users connect via web UI. Usually around 30 tokens/sec. Usable.

NicoJuicy · Answer

Rtx 3090 24gb. Pretty affordable.Gos-oss:20b and qwen3 coder/instruct, devstrall are my usual.Ps. Definitely check out open-web ui.

j45 · Answer

The M2/3/4 Max CPUs in a Mac Studio or Macbook Pro when paired with enough ram are quite capable.
In more cases than expected, the M1/M2 Ultras are still quite capable, especially performance power per watt of electricity, as well as ability to serve one user.
The Mac Studio has better bang for the buck than the laptop for computational power to price.
Depending on your needs, the M5's might be worth waiting for, but M2 Max onward are quite capable with enough ram. Even the M1 Max continues to be a workhorse.

manishsharan · Answer

I am here to hear from folks running LLM on Framework desktop (128GB). Is it usable for agentic coding ?

saubeidl · Answer

I think local LLM and laptop is not really compatible, for anything useful. You're gonna want a bigger box and have your laptop connect to that.

ge96 · Answer

I don't, although I'm not a puritan eg. I'll use the AI summary that shows first in browsers

dust42 · Answer

On a Macbook pro 64GB I use Qwen3-Coder-30B-A3B Q4 quant with llama.cpp.
For VSCode I use continue.dev as it allows to set my own (short) system prompt. I get around 50token/sec generation and prompt processing 550t/s.
When giving well defined small tasks, it is as good as any frontier model.
I like the speed and low latency and the availability while on the plane/train or off-grid.
Also decent FIM with the llama.cpp VSCode plugin.
If I need more intelligence my personal favourites are Claude and Deepseek via API.

softfalcon · Answer

For anyone who wants to see some real workstations that do this, you may want to check out Alex Ziskind's channel on YouTube:
https://www.youtube.com/@AZisk
At this point, pretty much all he does is review workstations for running LLM's and other machine-learning adjacent tasks.
I'm not his target demographic, but because I'm a dev, his videos are constantly recommended to me on YouTube. He's a good presenter and his advice makes a lot of sense.

vinhnx · Answer

> Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) you’re using?
Open-source coding assistant: VT Code (my own coding agent -- github.com/vinhnx/vtcode) Model: gpt-oss-120b remote hosted via Ollama cloud experimental
> What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?
Macbook Pro M1
> What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).
All agentic coding workflow (debug, refactor, refine and testing sandbox execution). VT Code is currently in preview and being active developed, but currently it is mostly stable.

mjgs · Answer

I use podman compose to spin up an Open WebUI container and various Llama.cpp containers, 1 for each model. Nothing fancy like a proxy or anything. Just connect direct. I also use Continue extension inside vscode, and always use devcontainers when I'm working with any LLMs.
I had to create a custom image of llama.cpp compiled with vulkan so the LLMs can access the GPU on my MacBook Air M4 from inside the containers for inference. It's much faster, like 8-10x faster than without.
To be honest so far I've been using mostly cloud models for coding, the local models haven't been that great.
Some more details on the blog: https://markjgsmith.com/posts/2025/10/12/just-use-llamacpp

hacker_homie · Answer

Any halo strix laptop, I have been using the hp zbook ultra g1a with 128gb of unified memory. Mostly with the 20B parameters models but it can load larger ones. I find local models (gpt oss 20B) are good quick references but if you want to refactor or something like that you need a bigger model. I&rsquo;m running llama.cpp directly and using the api it offers for neovim&rsquo;s avante plugin, or a cli tool like aichat, it comes with a basic web interface as well.

simonw · Answer

I'd be very interested to hear from anyone who's finding local models that work well for coding agents (Claude Code, Codex CLI, OpenHands etc).I haven't found a local model that fits on a 64GB Mac or 128GB Spark yet that appears to be good enough to reliably run bash-in-a-loop over multiple turns, but maybe I haven't tried the right combination of models and tools.

BirAdam · Answer

Mac Studio, M4 MaxLM Studio + gpt-oss + aiderWorks quite quickly. Sometimes I just chat with it via LM Studio when I need a general idea for how to proceed with an issue. Otherwise, I typically use aider to do some pair programming work. It isn't always accurate, but it's often at least useful.

Gracana · Answer

I don&rsquo;t own a laptop. I run DeepSeek-V3 IQ4_XS on a Xeon workstation with lots of RAM and a few RTX A4000s.It&rsquo;s not very fast, and I built it up slowly without knowing quite where I was headed. If I could do it over again, I&rsquo;d go with a recent EPYC with 12 channels of DDR5 and pair it with a single RTX 6000 Pro Blackwell.

wongarsu · Answer

$work has a GPU server running Ollama, I connect to it using the continue.dev VsCode extension. Just ignore the login prompts and set up models via the config.yaml.In terms of models, qwen2.5-coder:3b is a good compromise for autocomplete, as agent choose pretty much just the biggest sota model you can run

system2 · Answer

Those who use these can you compare the quality of code compared to Claude Sonnet 4.5 or Opus 4.1?

packetmuse · Answer

Running local LLMs on laptops still feels like early days, but it&rsquo;s great to see how fast everyone&rsquo;s improving and sharing real setups.

dboreham · Answer

I've run smaller models (I forget which ones, this was about a year ago) on my laptop just to see what happened. I was quite surprised that I could get it to write simple Python programs. Actually very surprised which led me to re-evaluate my thinking on LLMs in general. Anyway, since then I've been using the regular hosted services since for now I don't see a worthwhile tradeoff running models locally. Apart from the hardware needed, I'd expect to be constantly downloading O(100G) model files as they improve on a weekly basis. I don't have the internet capacity to easily facilitate that.

brendoelfrendo · Answer

I keep mine pretty simple: my desktop at home has an AMD 7900XT with 20gb VRAM. I use Ollama to run local models and point Zed's AI integration at it. Right now I'm mostly running Devstral 24b or an older Qwen 2.5 Coder 14b. Looking at it, I might be able to squeak by running Qwen 3 Coder 30b, so I might give it a try to test it out.

giancarlostoro · Answer

If you're going to get a MacBook, get the Pro, it has a built-in fan, you don't want the heat just sitting there on the MacBook Air. Same with the Mac mini, get the studio instead, it has a fan, the Mini does not. I don't know about you but I wouldn't want my brand new laptop / desktop to be heating up the entire time I'm coding with 0 cool off. If you go the Mac route, I recommend getting TG Pro, the default fan settings on the Mac are awful they don't kick in soon enough, TG Pro lets you make it a little more "sensitive" to those temperature shifts, its like $20 for TG Pro if I remember correctly, but worth it.
I have a MacBook Pro with an M4 Pro chip, and 24GB of RAM, I believe only 16 of it is usable by the models, so I can run the GPT OSS 20B model (iirc) but the smaller one. It can do a bit, but the context window fills up quickly, so I do find myself switching context windows often enough. I do wonder if a maxed out MacBook Pro would be able to run larger context windows, then I would easily be able to code all day with it offline.
I do think Macs are phenomenal at running local LLMs if you get the right one.

loudmax · Answer

I have a desktop computer with 128G of RAM and an RTX 3090 with 24G of VRAM. I use this to tinker with different models using llama.cpp and ComfyUI. I manged to get a heavily quantized instance of DeepSeek R1 running on it by following instructions from the Level1 tech forums, but it's far too slow to be useful. GPT-OSS-120b is surprisingly good, though again too quantized and too slow to be more than a toy.
For actual real work, I use Claude.
If you want to use an open weights model to get real work done, the sensible thing would be to rent a GPU in the cloud. I'd be inclined to run llama.cpp because I know it well enough, but vLLM would make more sense for models that runs entirely on the GPU.

timenotwasted · Answer

I have an old 2080TI that I use to run Ollama and Qdrant. It has been ok, I haven't found it so good that it has replaced using Claude or Codex but there are times where having RAG available locally is a nice setup for more specific queries. I also just enjoy tinkering with random models which this makes super easy.My daily drivers though are still either Codex or GPT5, Claude Code used to be but it just doesn't deliver the same results as it has previously.

alexfromapex · Answer

I have a MacBook M3 Max with 128 GB unified RAM. I use Ollama with Open Web UI. It performs very well with models up to 80B parameters but it does get very hot with models over 20B parameters.
I use it to do simple text-based tasks occasionally if my Internet is down or ChatGPT is down.
I also use it in VS Code to help with code completion using the Continue extension.
I created a Firefox extension so I can use Open WebUI in my browser by pressing Cmd+Shift+Space too when I am browsing the web and want to ask a question: https://addons.mozilla.org/en-US/firefox/addon/foxyai/

scosman · Answer

What are folks motivation for using local coding models? Is it privacy and there's no cloud host you trust?
I love local models for some use cases. However for coding there is a big gap between the quality of models you can run at home and those you can't (at least on hardware I can afford) like GLM 4.6, Sonnet 4.5, Codex 5, Qwen Coder 408.
What makes local coding models compelling?

jetsnoc · Answer

Models gpt-oss-120b, Meta Llama 3.2, or Gemma (just depends on what I&rsquo;m doing) Hardware - Apple M4 Max (128 GB RAM) paired with a GPD Win 4 running Ubuntu 24.04 over USB-C networking Software - Claude Code - RA.Aid - llama.cpp For CUDA computing, I use an older NVIDIA RTX 2080 in an old System76 workstation. Process I create a good INSTRUCTIONS.md for Claude/Raid that specifies a task & production process with a task list it maintains. I use Claude Agents with an Agent Organizer that helps determine which agents to use. It creates the architecture, prd and security design, writes the code, and then lints, tests and does a code review.

woile · Answer

I just got a AMD AI 9 HX 370 with 128GB RAM from laptopwithlinux.com and I've started using zed + ollama. I'm super happy with the machine and the service.
Here's my ollama config:
https://github.com/woile/nix-config/blob/main/hosts/aconcagu...
I'm not an AI power user. I like to code, and I like the AI to autocomplete snippets that are "logical", I don't use agents, and for that, it's good enough.

sehugg · Answer

I use "aider --commit" sometimes when I can't think of a comment. I often have to edit it because it's too general or it overstates the impact (e.g. "improved the foo", are you sure you improved the foo?) but that's not limited to local models. I like gemma3:12b or qwen2.5-coder:14b, not much luck with reasoning models.

__mharrison__ · Answer

I have a MBP with 128GB.Here's the pull request I made to Aider for using local models:https://github.com/Aider-AI/aider/issues/4526

dethos · Answer

Ollama, Continue.dev extension for editor/IDE, and Open-WebUI. My hardware is a bit dated, so I only use this setup for some smaller open models.On the laptop, I don't use any local models. Not powerful enough.

gnarlouse · Answer

Omarchy ArchLinux+ollama:deepseek-r1+open-webuiOn an RTX 3080 Ti+Ryzen 9

itake · Answer

Ollama qwen3-coder- auto git commit message- auto jira ticket creation from git diff

finfun234 · Answer

lmstudio with local models

lux_sprwhk · Answer

I use it to analyze my dreams and mind dumps. Just running it on my local machine, cus it&rsquo;s not resource intensive, but building a general solution out of it.I think for stuff that isn&rsquo;t super private like code and such, it&rsquo;s not worth the effort

erikig · Answer

Hardware: MacBook Pro M4 Max, 128GBPlatform: LMStudio (primarily) & OllamaModels:- qwen/qwen3-coder-30b A3B Instruct 8-bit MLX- mlx-community/gpt-oss-120b-MXFP4-Q8For code generation especially for larger projects, these models aren't as good as the cutting edge foundation models. For summarizing local git repos/libraries, generating documentation and simple offline command-line tool-use they do a good job.I find these communities quite vibrant and helpful too:- https://www.reddit.com/r/LocalLLM/- https://www.reddit.com/r/LocalLLaMA/

mwambua · Answer

Tangential question. What do people use for search? What search engines provide the best quality to cost ratios?Also are there good solutions for searching through a local collection of documents?

ghilston · Answer

I have a m4 max mbp with 128 gb. What model would you folks recommend? I'd ideally like to integrate with a tool that can auto read context like Claude code (via a proxy) or cline. I'm open to any advice

garethsprice · Answer

HP G9 Z2 Mini with a 20GB ADA 4000, 96GB RAM, 2TB SSD, Ubuntu. Would get a Macbook with a ton of RAM if I was buying today, a full form factor PC, the mini form factor looks nice but gets real hot and is hard to upgrade.
Tools: LM Studio for playing around with models, the ones I stabilize on for work go into ollama.
Models: Qwen3 Coder 30b is the one I come back to most for coding tasks. It is decent in isolation but not so much at the multi-step, context-heavy agentic work that the hosted frontier models are pushing forward. Which is understandable.
I've found the smaller models (the 7B Qwen coder models, gpt-oss-20B, gemma-7b) extremely useful given they respond so fast (~80t/s for gpt-oss-20B on the above hardware), making them faster to get to an answer than Googling or asking ChatGPT (and fast to see if they're failing to answer so I can move on to something else).
Use cases: Mostly small one-off questions (like 'what is the syntax for X SQL feature on Postgres', 'write a short python script that does Y') where the response comes back quicker than Google, ChatGPT, or even trying to remember it myself.
Doing some coding with Aider and a VS Code plugin (kinda clunky integration), but I quickly end up escalating anything hard to hosted frontier models (Anthropic, OpenAI via their clis or Cursor). I often hit usage limits on the hosted models so it's nice to have a way my dumbest questions don't burn tokens I want to reserve for real work.
Small LLM scripting tasks with dspy (simple categorization, CSV munging type tasks), sometimes larger RAG/agent type things with LangChain but it's a lot of overhead for personal scripts.
My company is building a software product that heavily utilizes LLMs so I often point my local dev environment at my local model (whatever's loaded, usually one of the 7B models), initially I did this not to incur costs but as prices have come down it's now more as it's less latency and I can test interface changes etc faster - especially as new thinking models can take a long time to respond.
It is also helpful to try and build LLM functions that work with small models as it means they run efficiently and portably on larger ones. One technical debt trap I have noticed with building for LLMs is that as large models get better you can get away with stuffing them with crap and still getting good results... up until you don't.
It's remarkable how fast things are moving in the local LLM world, right now the Qwen/gpt-oss models "feel" like gpt-3.5-turbo did a couple of years back which is remarkable given how groundbreaking (and expensive to train) 3.5 was and now you can get similar results on sub-$2k consumer hardware.
However, its very much still in the "tinkerer" phase where it's overall a net productivity loss (and massive financial loss) vs just paying $20/mo for a hosted frontier model.

codingbear · Answer

I use local for code completions only. Which means models supporting FIM tokens.
My current setup is the llama-vscode plugin + llama-server running Qwen/Qwen2.5-Coder-7B-Instruct. It leads to very fast completions, and don't have to worry about internet outages which take me out of the zone.
I do wish qwen-3 released a 7B model supporting FIM tokens. 7B seems to be the sweet spot for fast and usable completions

egberts1 · Answer

Ollama, 16-CPU Xenon E6320 (old), 1.9Ghz, 120GB DDRAM4, 240TB RAID5 SSDs, on Dell Precision T710 ("The Beast"). NO GPU. 20b (n oooooot f aah st at all). Pure CPU bound. Tweaked for 256KB chunking into RAG.
Ingested election laws of 50 states, territories and Federal.
Goal. Mapping out each feature of the election and deal with (in)consistent terminologies sprouted by different university-trained public administration. This is the crux of hallunications: getting a diagram of ballot handling and their terminologies.
Then maybe tackle the multitude ways of election irregularities, or at least point out integrity gaps at various locales.
https://figshare.com/articles/presentation/Election_Frauds_v...

gcr · Answer

For new folks, you can get a local code agent running on your Mac like this:
1. $ npm install -g @openai/codex
2. $ brew install ollama; ollama serve
3. $ ollama pull gpt-oss:20b
4. $ codex --oss -m gpt-oss:20b
This runs locally without Internet. Idk if there’s telemetry for codex, but you should be able to turn that off if so.
You need an M1 Mac or better with at least 24GB of GPU memory. The model is pretty big, about 16GB of disk space in ~/.ollama
Be careful - the 120b model is 1.5× better than this 20b variant, but takes 5× higher requirements.

whitehexagon · Answer

Qwen3:32b on MBP M1 Pro 32GB running Asahi linux. Mainly command line for some help with armv8 assembly, and some SoC stuff (this week explaining I2C protocol). I couldnt find any good intro on the web-of-ads. It's not much help with Zig, but then nothing seems to keep up with Zig at the moment.
I get a steady stream of tokens, slightly slower than my reading pace, which I find is more than fast enough. In fact I´d only replace with exact same, or maybe M2 + Asahi with enough RAM to run the bigger Qwen3 model.
I saw qwen3-coder mentioned here. I didnt know about that one. Anyone got any thoughts on how that compares to qwen3? Will it also fit in 32GB?
I'm not interested in agents, or tool integration, and especially wont use anything cloud. I like to own my env. and code top-to-bottom. Having also switched to Kate and Fossil it feels like my perfect dev environment.
Currently using an older Ollama, but will switch to llama.cpp now that ollama has pivoted away from offline only. I got llama.cpp installed, but not sure how to reuse my models from ollama, I thought ollama was just a wrapper, but they seems to be different model formats?
[edit] be sure to use it powered, linux is a bit battery heavy, but Qwen3 will pull 60W+ and flatten a battery real fast.

platevoltage · Answer

I've been using qwen2.5-coder for code assistant and code completion which has worked pretty well. I recently started trying mistral:7b-instruct. I use Continue with VS Code. It works ok. I'm limited to 16GB on an M2 MacBook Pro. I definitely wish I had more RAM to play with.

mooiedingen · Answer

Vim+ollama-vim Start new file with at the top in the comments the instructions needed to follow to become the solution to the problem and let it work like a sort of auto complete... example: # The Following is a python # Script that uses the # libraries requests and # BeautifullSoup to scrape # url_to_scrape = input( # "what url do i need to # fetch?") import ... """autocompletes from here the rest""" anywhere in a script one can # comment ' Instructions this way i find the most effective instead of asking Write me a script for this or that.. take a coding model and finetune it with commonly used snippets of code... This is completely customizable and will stay coherent to your own writing style.. i made embeddings per language, even md. python javascript vimscript lua php html json(however output is json) xml Css ...

more_corn · Answer

My friend uses a 4 gpu server in her office and hits the ollama api over the local network. If you want it to work from anywhere a free tailscale account.

sharms · Answer

FWIW I bought the M4 max with 128GB and it is useful for local LLMs for OCR, I don't find it as useful for coding (ala Codex / Claude Code) with local LLMs. I find that even with GPT 5 / Claude 4.5 Sonnet that trust is low, and local LLMs can lower that just enough to not be as useful. The heat is also a factor - Apple makes great hardware, but I don't believe it is designed for continuous usage the way a desktop is.

kabes · Answer

Let's say I have a server with an h200 gpu at home. What's the best open model for coding I can run on it today? And is it somewhat competitive with commercial models like sonnet 4.5?

dennemark · Answer

I have AMD Strix Halo (395) on my work laptop (HP Ultrabook G1A) as well as at home with Framework Desktop.
On both i have setup lemonade-server on system start. At work i use Qwen3 Coder 30B-3A with continue.dev. It serves me well in 90% of cases.
At home i have 128GB RAM. I try a bit GPT120B. I host Open WebUI on it and connect via https and wireguard to it, so i can use it as PWA on my phone. I love not needing to think about where my data goes. But i would like to allow parallel requests, so i need to tinker a bit more. Maybe llama-swap is enough.
I just need to see how to deal with context length. My models stop or go into infinite loop after some messages. But then i often start a new chat.
Lemonade-server runs with llama.cpp, vllm seems to be scaling better thoug, but is not so easy to set up.
Unsloth GGUFs are great resource for models.
Also for Strix Halo check out kyuz0 repositorIES! Also has image gen. I didnt try those yet. But the benchmarks are awesome! Lots to learn from. Framework forum can be useful, too.
https://github.com/kyuz0/amd-strix-halo-toolboxes Also nice: https://llm-tracker.info/ It links to some benchmark site with models by size. I prefer such resources, since it is quite easy to see which one fit in my RAM (even though i have this silly thumbrule Billion Token ≈ GB RAM).
Btw. even a AMD HX 370 with non soldered RAM can get some nice t/s for smaller models. Can be helpful enough when disconnected from internet and you dont know how to style a svg :)
Thanks for opening up this topic! Lots of food :)

Who uses open LLMs and coding assistants locally? Share setup and laptop

On a side note I really thing latency is still important. Is there some benefit in choosing location for where you get your responses from? Like with Openrouter f.e.
Also I could think that a local model just for autocomplete could help reducing latency for completion suggestions.

Not my build and not coding, but I've seen some experimental builds (oss 20b on a 32gb mac mini) with Kiwix integration to make what is essentially a highly capable local private search engine.

Good quality still needs more power than what a laptop can do. The local llama subreddit has a lot of people doing well with local rigs, but they are absolutely not laptop size.

I use the abliterated and uncensored models to generate smut. SwarmUI to generate porn. I can only get a few tokens/s on my machine so not fast enough for quick back and forth stuff.

llama.cpp + Qwen3-4B running on older PC with AMD Radeon GPU (Vulcan). Users connect via web UI. Usually around 30 tokens/sec. Usable.

Rtx 3090 24gb. Pretty affordable.
Gos-oss:20b and qwen3 coder/instruct, devstrall are my usual.
Ps. Definitely check out open-web ui.

I am here to hear from folks running LLM on Framework desktop (128GB). Is it usable for agentic coding ?

I think local LLM and laptop is not really compatible, for anything useful. You're gonna want a bigger box and have your laptop connect to that.

I don't, although I'm not a puritan eg. I'll use the AI summary that shows first in browsers

Mac Studio, M4 Max
LM Studio + gpt-oss + aider
Works quite quickly. Sometimes I just chat with it via LM Studio when I need a general idea for how to proceed with an issue. Otherwise, I typically use aider to do some pair programming work. It isn't always accurate, but it's often at least useful.

Those who use these can you compare the quality of code compared to Claude Sonnet 4.5 or Opus 4.1?

Running local LLMs on laptops still feels like early days, but it’s great to see how fast everyone’s improving and sharing real setups.

I have a MBP with 128GB.
Here's the pull request I made to Aider for using local models:
https://github.com/Aider-AI/aider/issues/4526

Ollama, Continue.dev extension for editor/IDE, and Open-WebUI. My hardware is a bit dated, so I only use this setup for some smaller open models.
On the laptop, I don't use any local models. Not powerful enough.

Omarchy ArchLinux+ollama:deepseek-r1+open-webui
On an RTX 3080 Ti+Ryzen 9

Ollama qwen3-coder
- auto git commit message
- auto jira ticket creation from git diff

lmstudio with local models

I use it to analyze my dreams and mind dumps. Just running it on my local machine, cus it’s not resource intensive, but building a general solution out of it.
I think for stuff that isn’t super private like code and such, it’s not worth the effort

Tangential question. What do people use for search? What search engines provide the best quality to cost ratios?
Also are there good solutions for searching through a local collection of documents?

I have a m4 max mbp with 128 gb. What model would you folks recommend? I'd ideally like to integrate with a tool that can auto read context like Claude code (via a proxy) or cline. I'm open to any advice

I've been using qwen2.5-coder for code assistant and code completion which has worked pretty well. I recently started trying mistral:7b-instruct. I use Continue with VS Code. It works ok. I'm limited to 16GB on an M2 MacBook Pro. I definitely wish I had more RAM to play with.

My friend uses a 4 gpu server in her office and hits the ollama api over the local network. If you want it to work from anywhere a free tailscale account.

Let's say I have a server with an h200 gpu at home. What's the best open model for coding I can run on it today? And is it somewhat competitive with commercial models like sonnet 4.5?