What are you using for LLM response testing and benchmarking?

Question

What are you using to test your LLM responses, benchmark them, maybe compare different versions?
I've seen a few YC startups focusing on this but I haven't decided yet if we should build this internally or use an external tool.

tikkun · Accepted Answer

What features are critical for you in your response testing and benchmarking? I think it'll be easier to answer with specifics.

muzani · Answer

This list has been nice: https://lmsys.org/blog/2023-05-25-leaderboard/
And you can participate in the arena, which pits them against each other. I'm surprised that I actually voted for GPT-3.5 over GPT-4 for a lot of my use cases.
