Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
Conference proceeding   Open access   Peer reviewed

F Momentè, Alessandro Suglia, Mario Giulianelli, Ambra Ferrari, Alexander Koller, Oliver Lemon, David Schlangen, Raquel Fernández and Raffaella Bernardi
Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 20051-20072
Empirical Methods in Natural Language Processing (Suzhou, 04/11/2025–09/11/2025)
2025
Handle:
https://hdl.handle.net/10863/51713

Abstract

We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the first two paradigms, benchmarks or games, is more effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We propose a new evaluation framework for triangulating LLM progress. Our findings highlight the importance of defining evaluation regimes that consider multiple paradigms. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by human ability assessments but designed specifically for LLMs. The code for running the experiments is released at: https://github.com/momentino/playpen_eval/tree/triangulating.
DOI:
https://doi.org/10.18653/v1/2025.findings-emnlp.1092
