Testing ChatGPT for Stability and Reasoning: A Case Study Using Italian Medical Specialty Tests
Conference proceeding   Open access   Peer reviewed


S Casola, Tiziano Labruna, A Lavelli and B Magnini
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023), Vol.3596, pp.113-119
CEUR Workshop Proceedings, 3596
9th Italian Conference on Computational Linguistics, CLiC-it 2023 (Venice, 30/11/2023–02/12/2023)
2023
Handle: https://hdl.handle.net/10863/51391

Abstract

Keywords: Large language models, ChatGPT, Stability
Although large language models (LLMs) achieve impressive performance in zero- and few-shot learning configurations, their reasoning capacities are still poorly understood. As a step in this direction, we present several experiments on multiple-choice question answering, a setting that allows us to evaluate the stability of the model under different prompts, its capacity to recognize when none of the provided answers is correct, and its ability to reason about specific answering strategies (e.g., recursively eliminating the worst answer). We use the Italian medical specialty tests administered yearly to admit medical doctors to specialty training. Results show that a gpt-3.5-turbo model achieves excellent performance in absolute score (an average of 108 out of 140) while still falling short in certain reasoning capacities, particularly in failing to recognize when none of the provided answers is correct.
PDF: 2-s2.0-85181174374 (1,018.93 kB), Open Access
URL: urn:nbn:de:0074-3596-0
