LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Anna Bavaresco; Raffaella Bernardi; Leonardo Bertolazzi; Desmond Elliott; R Fernandez; Albert Gatt; Esam Ghaleb; Mario Giulianelli; Michael Hanna; Alexander Koller; Andre Martins; Philipp Mondorf; Vera Neplenbroek; Sandro Pezzelle; Barbara Plank; David Schlangen; Alessandro Suglia; Aditya K Surikuchi; Ece Takmaz; Alberto Testoni

doi:10.18653/v1/2025.acl-short.20

Conference proceeding

Open access

Peer reviewed

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp. 238-255

63rd Annual Meeting of the Association for Computational Linguistics (Vienna, 27/07/2025–01/08/2025)

2025

DOI: https://doi.org/10.18653/v1/2025.acl-short.20

Handle: https://hdl.handle.net/10863/51730

Abstract

There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.

Files and links (2)

pdf

judge_2025.acl-short.20 (688.79 kB)

Open Access

url

https://aclanthology.org/2025.acl-short.20/

Details

Title: LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Creators: Anna Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
R Fernandez
Albert Gatt
Esam Ghaleb
Mario Giulianelli
Michael Hanna
Alexander Koller
Andre Martins
Philipp Mondorf
Vera Neplenbroek
Sandro Pezzelle
Barbara Plank
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
Alberto Testoni
Publication Details: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp. 238-255
Editor(s): Che W, Nabende J, Shutova E, Pilehvar MT
ISBN: 9798891762510
Conference: 63rd Annual Meeting of the Association for Computational Linguistics (Vienna, 27/07/2025–01/08/2025)
Publisher: Association for Computational Linguistics (ACL)
Format: Online
Number of pages: 18
Identifiers: 979-8-89176-251-0
(UNIBZ)96839657
991007307260001241
Web of Science ID: 001596031400020
Scopus ID: n.a.
Copyright: Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.
Academic Unit: Faculty of Engineering
Language: English
Resource Type: Conference proceeding
Author Names String: Bavaresco A, Bernardi R, Bertolazzi L, Elliott D, Fernandez R, Gatt A, Ghaleb E, Giulianelli M, Hanna M, Koller A, Martins AFT, Mondorf P, Neplenbroek V, Pezzelle S, Plank B, Schlangen D, Suglia A, Surikuchi AK, Takmaz E, Testoni A
Additional Description: Editors/Supervisors: Che W, Nabende J, Shutova E, Pilehvar MT
