HACKER Q&A
📣 harsh020

Is LLM training infra still broken enough to build a company around?


We recently ran into something frustrating while training and fine-tuning open-weight TTS models.

Instead of working on the model itself, we spent days dealing with:

- CUDA version mismatches
- Driver / PyTorch conflicts
- OOM crashes when scaling to multi-GPU
- Broken or outdated open-source training scripts
- Gluing together tracking + eval + deployment manually
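The version-mismatch class of failure can at least be caught before a job launches rather than mid-run. A minimal sketch of the kind of preflight check we mean, using only the stdlib; the driver/toolkit compatibility table here is a small illustrative sample, not an authoritative matrix:

```python
# Preflight sanity check: does the installed NVIDIA driver support the CUDA
# toolkit a given PyTorch wheel was built against? The mapping below is
# illustrative only -- a real tool would ship the full support matrix.

# Minimum driver version required per CUDA toolkit release (illustrative).
MIN_DRIVER_FOR_CUDA = {
    (11, 8): (520, 61),
    (12, 1): (530, 30),
    (12, 4): (550, 54),
}

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '550.54.15' into (550, 54, 15) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))

def driver_supports(toolkit: str, driver: str) -> bool:
    """Return True if the driver version meets the toolkit's minimum."""
    key = tuple(int(p) for p in toolkit.split("."))[:2]
    minimum = MIN_DRIVER_FOR_CUDA.get(key)
    if minimum is None:
        raise ValueError(f"Unknown CUDA toolkit {toolkit}")
    return parse_version(driver)[:2] >= minimum

print(driver_supports("12.1", "535.104.05"))  # 535.x meets the 530.30 floor -> True
print(driver_supports("12.4", "535.104.05"))  # needs >= 550.54 -> False
```

Running this (or the equivalent against `nvidia-smi` output) before scheduling a job turns a cryptic runtime crash into a clear error message.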

It felt like we were rebuilding the same orchestration layer every team probably rebuilds:

- Cloud providers give raw GPUs.
- MLOps tools give experiment tracking.
- Open-source gives training scripts.

But the end-to-end workflow (dataset → fine-tune → monitor → evaluate → deploy → retrain) still feels stitched together.

We’re exploring building an opinionated platform that lets you:

1. Select a base model (e.g. Llama/Mistral-style open models)
2. Upload or connect datasets
3. Choose infra tier
4. Launch LoRA/full fine-tuning
5. Monitor loss + cost in real time
6. Run built-in eval
7. Deploy with one click
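The idea is that those seven steps collapse into one declarative job spec. A hypothetical sketch of what that could look like; every class, field, and value name here is invented for illustration and is not a real API:

```python
from dataclasses import dataclass, field

# Hypothetical job spec covering the end-to-end flow described above.
# All names are invented for illustration -- this is not a real SDK.
@dataclass
class FineTuneJob:
    base_model: str                 # step 1: base model
    dataset_uri: str                # step 2: dataset
    infra_tier: str = "1x-a100"     # step 3: infra tier
    method: str = "lora"            # step 4: "lora" or "full"
    max_cost_usd: float = 100.0     # step 5: budget guardrail for monitoring
    evals: list[str] = field(default_factory=lambda: ["perplexity"])  # step 6
    deploy_on_success: bool = True  # step 7: one-click deploy

    def validate(self) -> None:
        """Fail fast on obviously bad configs before any GPU is allocated."""
        if self.method not in {"lora", "full"}:
            raise ValueError(f"unknown fine-tuning method: {self.method}")
        if self.max_cost_usd <= 0:
            raise ValueError("max_cost_usd must be positive")

job = FineTuneJob(
    base_model="mistral-7b",
    dataset_uri="s3://my-bucket/tts-pairs.jsonl",
)
job.validate()
print(job.method, job.infra_tier)  # lora 1x-a100
```

The design question is whether teams accept an opinionated spec like this, or whether every serious team ends up needing escape hatches back to raw scripts.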

Basically: abstract away the CUDA + orchestration layer.

Before we go too deep, I’d love honest feedback:

- Is this still a painful problem at your company?
- Would serious AI teams use this, or do larger companies just build infra in-house?
- Is this doomed to be a hobbyist tool?
- Where would the real wedge be — training, evaluation, or continuous retraining?

We’ve launched a simple landing page and started building, but we’re still early and trying to validate whether this is a real infra gap or just our own frustration.

Would appreciate blunt feedback.


  👤 genxy Accepted Answer ✓
> - CUDA version mismatches
> - Driver / PyTorch conflicts
> - OOM crashes when scaling to multi-GPU
> - Broken or outdated open-source training scripts
> - Gluing together tracking + eval + deployment manually

This shouldn't take days, and CC can already set up all of this using whatever level of rigor you need.

Your business will get replaced with a prompt.