Now that local LLMs are gaining traction, I’m wondering what the equivalent stack looks like today.
Models, runtime, hardware, and other tools.
Something that could rival Claude, ChatGPT, Gemini, etc.
Thanks
1. For Heavy, Complex Tasks (Summarization, Code Gen, Creative Work): We don't self-host. The performance of top-tier models is still unmatched. We use Gemini-based models via Google's Vertex AI (rough sketch of the call below). The reliability and raw power for complex reasoning are worth the API cost for these critical features.
2. For Fast, Specific, Private Tasks (Our Self-Hosted Stack): For smaller, high-frequency tasks like classifying feedback types or extracting specific keywords from a conversation, we use a self-hosted stack for speed and cost-efficiency.
Models: We use fine-tuned versions of smaller, open-source models like Llama 3 8B or Mistral 7B. They are incredibly fast and cost-effective for specific, repetitive tasks.
Runtime/Orchestration: We use LangChain for chaining prompts and managing workflows. For serving the model, we're using a simple FastAPI server running in a Docker container (minimal sketch below).
Hardware: We run this on a dedicated GPU instance (like an A10G on AWS/GCP) for inference. The cost is predictable and much lower than using a large model for every small task.
My takeaway: The "go-to stack" in 2025 isn't one-size-fits-all. It's a pragmatic, hybrid approach: best-in-class cloud APIs for the heavy lifting, and fast, fine-tuned open-source models for everything else.
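For the hosted side in point 1, here's a rough sketch of what that kind of Vertex AI call can look like, assuming the vertexai Python SDK (google-cloud-aiplatform). The project, location, model name, and input are placeholders, not our actual setup:

```python
# Hedged sketch: sending a heavy task (summarization) to a Gemini model on Vertex AI.
# Project, location, model name, and the input text are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")  # whichever Gemini model you're on

conversation_text = "..."  # the customer conversation you want summarized
response = model.generate_content(
    "Summarize this customer conversation and list action items:\n\n" + conversation_text
)
print(response.text)
```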
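And for the self-hosted side in point 2, a minimal sketch of the FastAPI serving pattern, assuming a fine-tuned checkpoint exposed through a transformers text-classification pipeline. The checkpoint name and labels are made up:

```python
# Minimal sketch of "small fine-tuned model behind a FastAPI endpoint".
# Checkpoint name and label names are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Hypothetical fine-tuned checkpoint; swap in whatever small model you actually serve.
classifier = pipeline(
    "text-classification",
    model="your-org/feedback-classifier",
    device=0,  # single GPU, e.g. an A10G
)

class FeedbackIn(BaseModel):
    text: str

@app.post("/classify")
def classify(item: FeedbackIn):
    # Returns e.g. {"label": "bug_report", "score": 0.97}
    result = classifier(item.text, truncation=True)[0]
    return {"label": result["label"], "score": result["score"]}
```

Run it with something like `uvicorn app:app --host 0.0.0.0` inside the container, and the rest of the pipeline (LangChain or anything else) just hits that endpoint over HTTP.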
Tbh for coding I just use the smaller ones like CodeQwen 7B. Way faster and good enough for autocomplete. I only fire up the big model when I actually need it to think.
The annoying part is keeping everything updated; a new model drops every week and half of them don't work with whatever you're already running.
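A rough sketch of that "small model first, big model only when it needs to think" habit, assuming both models sit behind OpenAI-compatible endpoints (e.g. a local Ollama or llama.cpp server plus a hosted API). The URLs and model names are placeholders:

```python
# Sketch of routing between a small local model and a bigger hosted one.
# Endpoint URLs and model names are placeholders.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # local small model
remote = OpenAI()  # hosted model, reads OPENAI_API_KEY from the environment

def complete(prompt: str, needs_reasoning: bool = False) -> str:
    # Cheap/fast path by default; escalate only when the task actually needs it.
    client, model = (remote, "gpt-4o") if needs_reasoning else (local, "codeqwen:7b")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```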
The models vary depending on the task. DeepSeek distilled has been a favorite for the past several months.
I use various smaller (~3B) models for simpler tasks.