Is anybody using llama.cpp for production?
I'm considering the server version of llama.cpp for a commercial use case over bulkier options like vLLM, but I'm wondering whether it's been battle-tested in production environments.
Been running llama.cpp in production on a single $40/mo box for three months and it has more than paid for itself: roughly 20 ms latency at ~1k rps, no drama. Setup: compiled with cuBLAS, pinned a 70B q5_K_M model, put nginx in front as a reverse proxy, and wired health checks to the server's /health endpoint. Battle scars: watch the interaction between mmap and NUMA, and keep the context length under 4k or you'll start swapping hard. Pro tip: turn on --tensor-split if you've got two GPUs; it's roughly a 30% speed bump for free. Rough sketches of the launch commands and the nginx config are below. A side project uses the same stack to batch-generate 60-minute ASMR videos overnight: https://asmrvideo.org
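For anyone evaluating the same path, here's a minimal sketch of that kind of setup, not a drop-in config. It assumes a recent llama.cpp tree with the llama-server binary; the model path, port, context size, and tensor-split ratios are placeholders, and the CUDA build flag has changed across versions (older trees used `make LLAMA_CUBLAS=1`), so adjust for whatever revision you're on.

```bash
# Build with CUDA support (newer trees; older ones used `make LLAMA_CUBLAS=1`)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve a quantized model. Placeholders: model path, port, split ratios.
# -c 4096 keeps the context modest, -ngl 99 offloads all layers to GPU,
# --tensor-split spreads the weights across two GPUs.
./build/bin/llama-server \
  -m ./models/llama-70b-q5_K_M.gguf \
  -c 4096 -ngl 99 --tensor-split 0.5,0.5 \
  --host 0.0.0.0 --port 8080

# Liveness probe against the built-in /health endpoint
curl -f http://127.0.0.1:8080/health
```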
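And a bare-bones nginx reverse-proxy config in the same spirit. The upstream address and timeouts are assumptions; note that open-source nginx only does passive health checking (max_fails/fail_timeout), so the /health endpoint is typically polled by an external monitor or orchestrator rather than by nginx itself.

```nginx
# Hypothetical front end: proxy inference traffic to the local llama-server.
upstream llama_backend {
    server 127.0.0.1:8080 max_fails=3 fail_timeout=10s;  # passive health checking
    keepalive 16;
}

server {
    listen 80;

    location / {
        proxy_pass http://llama_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;  # long generations can stream for a while
    }
}
```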