Yet outside of this community, local LLMs still don’t seem mainstream. My hunch: *great UX and durable apps are still thin on the ground.*
If you are using local models, I’d love to learn from your setup and workflows. Please be specific so others can calibrate:
- **Model(s) & size:** exact name/version, and quantization (e.g., Q4_K_M).
- **Runtime/tooling:** e.g., Ollama, LM Studio.
- **Hardware:** CPU/GPU details (VRAM/RAM), OS. If it's a laptop, edge device, or home server, mention that.
- **Workflows where local wins:** privacy/offline, data security, coding, bulk data extraction, RAG over your files, agents/tools, screen-capture processing. What's actually sticking for you?
- **Pain points:** quality on complex reasoning, context management, tool reliability, long-form coherence, energy/thermals, memory, Windows/Mac/Linux quirks.
- **Favorite app today:** the one you actually open daily (and why).
- **Wishlist:** the app you wish existed.
- **Gotchas/tips:** config flags, quant choices, prompt patterns, or evaluation snippets that made a real difference.
If you’re not using local models yet, what’s the blocker—setup friction, quality, missing integrations, battery/thermals, or just “cloud is easier”? Links are welcome, but what helps most is concrete numbers and anecdotes from real use.
A simple reply template (optional):
```
Model(s):
Runtime/tooling:
Hardware:
Use cases that stick:
Pain points:
Favorite app:
Wishlist:
```
Also curious how people think about privacy and security in practice. Thanks!
Cloud LLMs get to run a trillion parameters and hold all of Python's knowledge in a transparent RAG layer sitting on a 100 Gbit or faster pipe. Of course they'll be the best on the block.
But the new GPT-OSS coding benchmarks are only barely behind Grok 4 or GPT-5 with high reasoning.
>Model(s) & size: exact name/version, and quantization (e.g., Q4_K_M).
My most reliable setup is Devstral + OpenHands: Unsloth Q6_K_XL, 85,000 context, flash attention, K cache and V cache quantized at Q8.
Second most reliable: GPT-OSS-20B + opencode. Default MXFP4. I can only load about 31,000 context or it fails (not sure why; still plenty, but I'm hoping this bug gets fixed), and you can't use flash attention or K/V cache quantization or it becomes dumb as rocks. This Harmony stuff is annoying.
Still preliminary since I only got it working today, but early testing is really good: Qwen3-30B-A3B-Thinking-2507 + Roo Code or Qwen Code, 80,000 context, Unsloth Q4_K_XL, flash attention, K cache and V cache quantized at Q8. (There's a rough code sketch of these flags at the end of this reply.)
>Runtime/tooling: e.g., Ollama, LM studio, etc.
LM Studio. I need Vulkan for my setup; ROCm is just a pain in the ass. They need to support way more Linux distros.
24 GB VRAM.
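For anyone trying to reproduce these knobs outside LM Studio's UI, here is roughly how the same settings look through llama-cpp-python. This is a sketch, not my actual config: the GGUF filename is a placeholder, and it assumes llama-cpp-python's flash-attention and KV-cache-quantization options.

```python
# Rough llama-cpp-python equivalent of the Devstral settings above.
# The model path is a placeholder; tune n_ctx / n_gpu_layers to your VRAM.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="Devstral-Small-Q6_K_XL.gguf",  # placeholder path to an Unsloth Q6_K_XL GGUF
    n_ctx=85_000,                              # 85k context window
    n_gpu_layers=-1,                           # -1 = offload all layers to the GPU
    flash_attn=True,                           # flash attention on
    type_k=llama_cpp.GGML_TYPE_Q8_0,           # quantize the K cache to Q8_0
    type_v=llama_cpp.GGML_TYPE_Q8_0,           # quantize the V cache to Q8_0
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(reply["choices"][0]["message"]["content"])
```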
Now for local LLMs? A new renaissance. All that power without having to rely on a cloud provider's pinky swear that they won't just take your generated code and use it for themselves.
Expect to see some awesome Windows and Mac apps being developed in the coming months and years: 100% on-device, memory-safe, and with a thin resource footprint. The 1990s/2000s are coming back.
Also log into your Claude/OpenAI dashboard and read the logs. Now they log every damn thing that goes through the API and keep it there for a minimum of 30 days without any option to delete (unless you're enterprise). No anonymization or anything. Just raw audit logs.
Currently I am using llama.cpp for an interactive REPL chat. I was previously using Alpaca (a GTK GUI), but got annoyed with how slow it was and with some random crashes. I am transitioning some of this to self-hosted cloud instances for the things that can't run on my laptop.
I am looking to get away from my current interface and write my own, mostly for the experience of deeply integrating agents into a program. If anyone knows a good library for interacting with a local model that doesn't involve standing up a web server, I'm interested :)
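For context, the kind of thing I'm after would look roughly like llama-cpp-python's in-process API, where the model is just a Python object and there's no server to stand up. A minimal REPL sketch of that shape, with the model path as a placeholder:

```python
# Minimal in-process chat REPL with llama-cpp-python; no web server involved.
# The model path is a placeholder; point it at any chat-tuned GGUF.
from llama_cpp import Llama

llm = Llama(model_path="gemma-3n-E4B-it-Q4_K_M.gguf", n_ctx=8192, n_gpu_layers=-1)

history = []
while True:
    user = input("you> ")
    if user.strip().lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user})
    reply = llm.create_chat_completion(messages=history, max_tokens=512)
    text = reply["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": text})
    print(text)
```

History is kept manually in that loop; anything agent-like (tools, planning, file access) would sit on top of it rather than inside the library.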
My daily driver is gemma3n. It's been a nice balance of speed and output quality without spinning up my laptop fans.
I am super interested in local models, partially because there is no friction from managed services, but also because I think as small models become more viable we will see an explosion of apps incorporating them.