It sounds like we may have hit similar limits, using slightly different means to get there.
Yes, but it's possible to batch the calls when feeding data through the neural network, so LLM serving libraries generally support that.
See, for example, this article[1], which gives a brief overview of batched inference with vLLM.
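For a rough idea of what that looks like, here's a minimal sketch using vLLM's offline API (the model name is just a small placeholder; any HF-compatible model works). You hand it a list of prompts and it batches them through the model internally:

    from vllm import LLM, SamplingParams

    # Placeholder model; substitute whatever you're actually running.
    llm = LLM(model="facebook/opt-125m")

    prompts = [
        "The capital of France is",
        "The largest planet in the solar system is",
        "Water boils at a temperature of",
    ]
    params = SamplingParams(temperature=0.8, max_tokens=32)

    # Passing the whole list lets vLLM batch the prompts together,
    # instead of looping over one request at a time.
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)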
[1]: https://medium.com/ubiops-tech/how-to-optimize-inference-spe...