Is anybody using llama.cpp for production?
I'm considering the server version of llama.cpp for a commercial use case over bulkier options like vLLM, but I'm wondering whether it's been battle-tested in production environments.
Been running llama.cpp in production on a single $40/mo box for three months and it has more than paid for itself: roughly 20 ms latency at ~1k rps, no drama. Setup: compiled with cuBLAS, pinned a 70B q5_K_M model, put nginx in front as a reverse proxy, and wired health checks to the server's /health endpoint. Battle scars: watch the interaction between mmap and NUMA, and keep the context length under 4k or you'll start swapping hard. Pro tip: turn on --tensor-split if you've got two GPUs; it's roughly a 30% speed bump for free. Rough sketches of the launch commands and the nginx config are below. A side project uses the same stack to batch-generate 60-minute ASMR videos overnight: https://asmrvideo.org
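For anyone evaluating the same path, here's a minimal sketch of that kind of setup, not a drop-in config. It assumes a recent llama.cpp tree with the llama-server binary; the model path, port, context size, and tensor-split ratios are placeholders, and the CUDA build flag has changed across versions (older trees used `make LLAMA_CUBLAS=1`), so adjust for whatever revision you're on.

```bash
# Build with CUDA support (newer trees; older ones used `make LLAMA_CUBLAS=1`)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve a quantized model. Placeholders: model path, port, split ratios.
# -c 4096 keeps the context modest, -ngl 99 offloads all layers to GPU,
# --tensor-split spreads the weights across two GPUs.
./build/bin/llama-server \
  -m ./models/llama-70b-q5_K_M.gguf \
  -c 4096 -ngl 99 --tensor-split 0.5,0.5 \
  --host 0.0.0.0 --port 8080

# Liveness probe against the built-in /health endpoint
curl -f http://127.0.0.1:8080/health
```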
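And a bare-bones nginx reverse-proxy config in the same spirit. The upstream address and timeouts are assumptions; note that open-source nginx only does passive health checking (max_fails/fail_timeout), so the /health endpoint is typically polled by an external monitor or orchestrator rather than by nginx itself.

```nginx
# Hypothetical front end: proxy inference traffic to the local llama-server.
upstream llama_backend {
    server 127.0.0.1:8080 max_fails=3 fail_timeout=10s;  # passive health checking
    keepalive 16;
}

server {
    listen 80;

    location / {
        proxy_pass http://llama_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;  # long generations can stream for a while
    }
}
```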