HACKER Q&A
📣 harsh020

Is LLM training infra still broken enough to build a company around?


We recently ran into something frustrating while training and fine-tuning open-weight TTS models.

Instead of working on the model itself, we spent days dealing with:

- CUDA version mismatches
- Driver / PyTorch conflicts
- OOM crashes when scaling to multi-GPU
- Broken or outdated open-source training scripts
- Gluing together tracking + eval + deployment manually
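The version-mismatch class of failure can at least be caught before a job launches rather than mid-run. A minimal sketch of the kind of preflight check we mean, using only the stdlib; the driver/toolkit compatibility table here is a small illustrative sample, not an authoritative matrix:

```python
# Preflight sanity check: does the installed NVIDIA driver support the CUDA
# toolkit a given PyTorch wheel was built against? The mapping below is
# illustrative only -- a real tool would ship the full support matrix.

# Minimum driver version required per CUDA toolkit release (illustrative).
MIN_DRIVER_FOR_CUDA = {
    (11, 8): (520, 61),
    (12, 1): (530, 30),
    (12, 4): (550, 54),
}

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '550.54.15' into (550, 54, 15) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))

def driver_supports(toolkit: str, driver: str) -> bool:
    """Return True if the driver version meets the toolkit's minimum."""
    key = tuple(int(p) for p in toolkit.split("."))[:2]
    minimum = MIN_DRIVER_FOR_CUDA.get(key)
    if minimum is None:
        raise ValueError(f"Unknown CUDA toolkit {toolkit}")
    return parse_version(driver)[:2] >= minimum

print(driver_supports("12.1", "535.104.05"))  # 535.x meets the 530.30 floor -> True
print(driver_supports("12.4", "535.104.05"))  # needs >= 550.54 -> False
```

Running this (or the equivalent against `nvidia-smi` output) before scheduling a job turns a cryptic runtime crash into a clear error message.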

It felt like we were rebuilding the same orchestration layer every team probably rebuilds:

- Cloud providers give raw GPUs.
- MLOps tools give experiment tracking.
- Open-source gives training scripts.

But the end-to-end workflow (dataset → fine-tune → monitor → evaluate → deploy → retrain) still feels stitched together.

We’re exploring building an opinionated platform that lets you:

1. Select a base model (e.g. Llama/Mistral-style open models)
2. Upload or connect datasets
3. Choose infra tier
4. Launch LoRA/full fine-tuning
5. Monitor loss + cost in real time
6. Run built-in eval
7. Deploy with one click
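The idea is that those seven steps collapse into one declarative job spec. A hypothetical sketch of what that could look like; every class, field, and value name here is invented for illustration and is not a real API:

```python
from dataclasses import dataclass, field

# Hypothetical job spec covering the end-to-end flow described above.
# All names are invented for illustration -- this is not a real SDK.
@dataclass
class FineTuneJob:
    base_model: str                 # step 1: base model
    dataset_uri: str                # step 2: dataset
    infra_tier: str = "1x-a100"     # step 3: infra tier
    method: str = "lora"            # step 4: "lora" or "full"
    max_cost_usd: float = 100.0     # step 5: budget guardrail for monitoring
    evals: list[str] = field(default_factory=lambda: ["perplexity"])  # step 6
    deploy_on_success: bool = True  # step 7: one-click deploy

    def validate(self) -> None:
        """Fail fast on obviously bad configs before any GPU is allocated."""
        if self.method not in {"lora", "full"}:
            raise ValueError(f"unknown fine-tuning method: {self.method}")
        if self.max_cost_usd <= 0:
            raise ValueError("max_cost_usd must be positive")

job = FineTuneJob(
    base_model="mistral-7b",
    dataset_uri="s3://my-bucket/tts-pairs.jsonl",
)
job.validate()
print(job.method, job.infra_tier)  # lora 1x-a100
```

The design question is whether teams accept an opinionated spec like this, or whether every serious team ends up needing escape hatches back to raw scripts.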

Basically: abstract away the CUDA + orchestration layer.

Before we go too deep, I’d love honest feedback:

- Is this still a painful problem at your company?
- Would serious AI teams use this, or do larger companies just build infra in-house?
- Is this doomed to be a hobbyist tool?
- Where would the real wedge be — training, evaluation, or continuous retraining?

We’ve launched a simple landing page and started building, but we’re still early and trying to validate whether this is a real infra gap or just our own frustration.

Would appreciate blunt feedback.


  👤 genxy Accepted Answer ✓
> - CUDA version mismatches
> - Driver / PyTorch conflicts
> - OOM crashes when scaling to multi-GPU
> - Broken or outdated open-source training scripts
> - Gluing together tracking + eval + deployment manually

This shouldn't take days, and CC can already set up all of this using whatever level of rigor you need.

Your business will get replaced with a prompt.