Abstract
Coherence modeling is an important task in natural language processing (NLP), with potential impact on other NLP tasks such as Natural Language Understanding or Automated Essay Scoring. Automatic approaches to coherence modeling aim to distinguish coherent from incoherent (often synthetically created) texts or to identify the correct continuation of a given text sample, as demonstrated for Italian in the DisCoTex task of EVALITA 2023. While early work on coherence modeling focused on exploring definitions of the phenomenon, evaluating the performance of neural models has dominated the field in recent years. However, coherence modeling can also offer interesting linguistic insights with pedagogical implications. In this article, we target coherence modeling for the Italian language in a strongly domain-specific scenario, namely education. We use a corpus of student essays collected to analyze students’ text coherence, in combination with data perturbation techniques, to examine how various linguistically informed features of incoherent writing affect current coherence modeling strategies used in NLP. Our results show that encoder models can capture features of (in)coherence in a domain-specific scenario, discerning natural from artificially corrupted texts.