Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests

F Momentè; Alessandro Suglia; Mario Giulianelli; Ambra Ferrari; Alexander Koller; Oliver Lemon; David Schlangen; Raquel Fernández; Raffaella Bernardi

doi:10.18653/v1/2025.findings-emnlp.1092

Conference proceeding

Open access

Peer reviewed

Findings of the Association for Computational Linguistics: EMNLP 2025, pp.20051-20072

Empirical Methods in Natural Language Processing (Suzhou, 04/11/2025–09/11/2025)

2025

DOI: https://doi.org/10.18653/v1/2025.findings-emnlp.1092

Handle: https://hdl.handle.net/10863/51713

Abstract

We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two, benchmarks or games, is more effective at discriminating between LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating between models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We propose a new evaluation framework for triangulating LLM progress. Our findings highlight the importance of defining evaluation regimes that consider multiple paradigms. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by human ability assessments but designed specifically for LLMs. The code for running the experiments is released at: https://github.com/momentino/playpen_eval/tree/triangulating.

Details

Title: Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
Creators: F Momentè; Alessandro Suglia; Mario Giulianelli; Ambra Ferrari; Alexander Koller; Oliver Lemon; David Schlangen; Raquel Fernández; Raffaella Bernardi
Publication Details: Findings of the Association for Computational Linguistics: EMNLP 2025, pp.20051-20072
Editor(s): Christodoulopoulos C, Chakraborty T, Rose C, Peng V
ISBN: 9798891763357
Conference: Empirical Methods in Natural Language Processing (Suzhou, 04/11/2025–09/11/2025)
Publisher: Association for Computational Linguistics (ACL), Suzhou
Format: Online
Number of pages: 22
Identifiers: 9798891763357; (UNIBZ)96838740; 991007307259901241
Scopus ID: 2-s2.0-105028946259
Copyright: Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.
Academic Unit: Faculty of Engineering
Language: English
Resource Type: Conference proceeding
Author Names String: Momentè F, Suglia A, Giulianelli M, Ferrari A, Koller A, Lemon O, Schlangen D, Fernández R, Bernardi R
