How do you personally evaluate new LLM models?

Question

Hey folks, how do you personally evaluate new LLM models? Vibes? Or do you have some tests you like to run? Or do you just use them in your IDE/text interface for a bit and see how it feels? I know we could probably trust some of the more public benchmarks, but I'm curious about personal evaluation techniques. Thanks!

incomingpain · Accepted Answer

I have some prompts I've saved and keep expanding as needed. Each one lists, say, a dozen key features and a bunch of rules that need to be implemented, so not much is left for the model to imagine. Then it needs to get coding.
I also have one seat-of-my-pants test: 'give me a story', themed around whatever my kid likes lately.
Overall, from my testing, the good players like Claude get it correct on the first go. Amazing. But I don't mind giving feedback; what matters is how many times I need to correct it again. qwen-coder needed an excessive number of corrections.
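
If you want to keep score rather than go from memory, the approach above (saved prompts with features and rules, then counting how many corrections each model needs) could be tracked with something like the sketch below. This is only an illustration, not my actual setup: the class names `PromptSpec` and `CorrectionLog`, the field names, and the idea of averaging correction counts are all assumptions. It does not call any model API; you run the prompt by hand and record the count yourself.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class PromptSpec:
    """A saved prompt: key features plus the rules the code must follow.

    Hypothetical structure, invented for this sketch.
    """
    name: str
    features: list[str]
    rules: list[str]

    def render(self) -> str:
        # Build the plain-text prompt that gets pasted into the model.
        lines = [f"Task: {self.name}", "Features:"]
        lines += [f"- {feature}" for feature in self.features]
        lines.append("Rules:")
        lines += [f"- {rule}" for rule in self.rules]
        return "\n".join(lines)


@dataclass
class CorrectionLog:
    """Stores how many follow-up corrections each model needed per prompt."""
    rounds: dict[str, dict[str, int]] = field(default_factory=dict)

    def record(self, model: str, prompt_name: str, corrections: int) -> None:
        # 0 means the model got it right on the first go.
        if corrections < 0:
            raise ValueError("corrections must be non-negative")
        self.rounds.setdefault(model, {})[prompt_name] = corrections

    def average(self, model: str) -> float:
        # Mean number of corrections across all recorded prompts.
        return mean(self.rounds[model].values())

    def first_try_rate(self, model: str) -> float:
        # Fraction of prompts that needed no corrections at all.
        counts = list(self.rounds[model].values())
        return sum(1 for count in counts if count == 0) / len(counts)

    def ranking(self) -> list[str]:
        # Models ordered from fewest to most corrections on average.
        return sorted(self.rounds, key=self.average)
```

Usage would be along the lines of `log = CorrectionLog()`, then `log.record("model-a", "todo-app", 0)` after each manual run (the model and prompt names here are placeholders), and `log.ranking()` once you have a few entries. Averaging is just one way to summarize; a single disastrous prompt can skew it, so `first_try_rate` is worth a look too.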