Like Claude 3.5 vs. GPT-4o vs. Gemini 2, etc.
What exists beyond our opinions to more objectively measure the quality of the code output from these models?
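(For context, the most common quantitative measure is functional correctness on held-out problems, i.e. HumanEval-style pass@k: generate n samples per problem, run each against unit tests, count how many pass. A minimal sketch of the unbiased pass@k estimator from the HumanEval paper, Chen et al. 2021 — the function name is mine, not from any library:

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021).

        n: total samples generated for a problem
        c: number of samples that passed the unit tests
        k: budget, i.e. probability that at least one of k
           random samples passes
        """
        # If fewer than k samples failed, any k-subset must
        # contain a passing sample.
        if n - c < k:
            return 1.0
        # 1 - C(n-c, k) / C(n, k), computed in a numerically
        # stable product form.
        return 1.0 - math.prod(1 - k / i for i in range(n - c + 1, n + 1))

    # e.g. 50 of 200 samples pass -> pass@1 = 0.25
    print(pass_at_k(200, 50, 1))

Benchmarks like HumanEval, MBPP, and SWE-bench all reduce to some variant of this "does the generated code actually run and pass tests" measurement, which sidesteps subjective judgments of code quality.)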
You may find this useful:
https://www.gitclear.com/coding_on_copilot_data_shows_ais_do...
Or this analysis if you don't want to sign up to download that white paper: