Why AI benchmark comparisons break down - and how to get reliable answers
In a controlled evaluation I ran between 2024-03-01 and 2024-05-30 across 40 production-ready models, only 4 scored better than a coin flip on a set of deliberately hard questions designed to separate summarization skill from factual knowledge.
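
One way to make "better than a coin flip" concrete is an exact one-sided binomial test against a 50% guessing baseline. The sketch below is illustrative only: the function name `p_value_above_chance`, the 50% baseline, and the example counts are assumptions, not details of the evaluation described above.

```python
from math import comb


def p_value_above_chance(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact one-sided binomial p-value.

    Returns the probability of getting at least `correct` answers right
    out of `total` if every answer were an independent guess that
    succeeds with probability `chance`.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    # Sum the binomial probability mass from `correct` up to `total`.
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts, not taken from the evaluation described above.
print(p_value_above_chance(60, 100))  # 60 of 100 correct
print(p_value_above_chance(52, 100))  # 52 of 100 correct
```

A small p-value suggests the score is unlikely to come from guessing alone. With a short question set, a raw accuracy a few points above 50% may not clear that bar, which is one reason headline benchmark numbers can mislead.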