HACKER Q&A
📣 masterofall2612

Cheaper way to do model inference?


Does anyone know of a way to save on GPU compute while a server is idle? Is there a managed service that can shut a pod down and spin it back up when I need it? I'm currently doing model inference, and most of the time I'm just paying for compute without serving any user requests.


  👤 brianjking Accepted Answer ✓
Hugging Face Inference Endpoints can scale to zero and cost nothing while not in use.
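A minimal sketch of what that could look like with the huggingface_hub client, assuming the scale-to-zero settings are exposed via min_replica and scale_to_zero_timeout; the model, vendor, region, and instance values below are placeholders to check against the current Inference Endpoints docs:

    from huggingface_hub import create_inference_endpoint

    # Create a GPU endpoint that scales down to zero replicas after 15 idle
    # minutes, so you stop paying for compute between requests. Repository,
    # vendor, region, and instance values are illustrative placeholders.
    endpoint = create_inference_endpoint(
        "my-inference-endpoint",
        repository="meta-llama/Llama-3.1-8B-Instruct",
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        type="protected",
        instance_size="x1",
        instance_type="nvidia-a10g",
        min_replica=0,             # allow scale-to-zero
        max_replica=1,
        scale_to_zero_timeout=15,  # minutes of inactivity before shutdown
    )

    # Blocks until the endpoint is up. Note that the first request after a
    # scale-to-zero pause will incur a cold start while a replica comes back.
    endpoint.wait()
    print(endpoint.client.text_generation("Hello!"))

The tradeoff is cold-start latency: the first request after an idle period waits for the replica to boot, which is fine for batch or low-traffic workloads but may not suit latency-sensitive ones.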

👤 PaulHoule
Are you running inference on something like an EC2 instance?
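If so, one low-tech option is to stop the instance between requests, since a stopped instance isn't billed for compute (you still pay for attached EBS storage). A rough sketch with boto3; the region and instance ID are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    INSTANCE_ID = "i-0123456789abcdef0"  # placeholder GPU instance ID

    def stop_when_idle():
        # Stopping halts compute billing; EBS volumes still accrue charges.
        ec2.stop_instances(InstanceIds=[INSTANCE_ID])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

    def start_on_demand():
        # Starting takes on the order of a minute or two before the model
        # server is ready to accept requests again.
        ec2.start_instances(InstanceIds=[INSTANCE_ID])
        ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

You'd still need something in front of it (a queue, a small always-on proxy, or a scheduled job) to decide when to start and stop, which is the part managed services like the one above handle for you.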