I’m aware that all of these measures have limitations and that many are controversial or imperfect. My goal is discovery and understanding, not to defend or attack any particular framework.
I’d love to hear:
- What measures, benchmarks, or methodologies you think belong on this list
- What you see as their key strengths and failure modes
- How (or whether) you personally use them to interpret AI progress
I suspect the most useful measure would be something more abstract than a set of raw numbers, such as a revised Turing Test.