Complexity or complexities? A simulation study on lexical complexity in expert and learner texts through the lens of information theory

P Brasolin; Arianna Bienati

Conference presentation

Open access

7th International Conference for Learner Corpus Research (LCR) (Tartu, 26/09/2024 - 28/09/2024)

2024

Handle:

https://hdl.handle.net/10863/44319

Keywords: linguistic complexity; learner corpus research; Gell-Mann complexity; Information Theory

Abstract

The debate about linguistic complexity in general, and lexical complexity in particular, has been extremely lively, exploring both the validity of indices in capturing the construct (e.g., McCarthy and Jarvis, 2010; Kyle et al., 2021; Zenker and Kyle, 2021) and the theoretical foundations of the construct itself (e.g., Bulté and Housen, 2012; Jarvis, 2013; Pallotti, 2015). Intuitively, complexity transcends “the number and variety of an item’s constituent elements” to include “the elaboratedness of their interrelational structure” (Rescher, 2020:1). This echoes the concept of Gell-Mann effective complexity in information theory, which emphasizes the amount of non-random information in a system; this quantity peaks at an intermediate stage between order and disorder. Gell-Mann complexity is often contrasted with Kolmogorov complexity, i.e., the total amount of information in a system, which increases monotonically from maximum order to maximum disorder. This study explores which information-theoretical notion of complexity (Kolmogorov vs. Gell-Mann) is measured by widely used complexity indices, via a simulation study on four Italian corpora representing the spectrum from expert to learner texts. New texts are synthesized from the originals by altering them in two directions: increased order is obtained by repeating increasingly smaller subsections of the original text, whereas increased disorder is obtained by shuffling increasingly smaller fragments of it. Additionally, we generate texts with a uniform word distribution, simultaneously altering both the structure and the original word distribution. For each corpus, the synthetic data allow us to explore the spectrum from total order to total disorder. All texts are analyzed using type-token-ratio-based and surprisal-based metrics, including fluctuation complexity (Bates and Shepard, 1993).
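The three kinds of synthetic text described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of the text's opening tokens as the repeated subsection, the fixed fragment length, and the uniform sampling over the vocabulary are all assumptions made for illustration.

```python
import random


def repeat_subsection(tokens, divisor):
    """Increase order: tile the first len(tokens) // divisor tokens
    until the original length is reached (assumed procedure)."""
    n = len(tokens)
    size = max(1, n // divisor)
    section = tokens[:size]
    repetitions = -(-n // size)  # ceiling division
    return (section * repetitions)[:n]


def shuffle_fragments(tokens, fragment_length, rng):
    """Increase disorder: cut the text into fragments of
    fragment_length tokens and shuffle the fragment order."""
    fragments = [
        tokens[i:i + fragment_length]
        for i in range(0, len(tokens), fragment_length)
    ]
    rng.shuffle(fragments)
    return [token for fragment in fragments for token in fragment]


def uniform_text(tokens, rng):
    """Replace every token with a draw from a uniform distribution
    over the vocabulary (assumed reading of 'uniform word distribution')."""
    vocabulary = sorted(set(tokens))
    return [rng.choice(vocabulary) for _ in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    text = "il gatto dorme e il cane corre nel parco".split()
    print(repeat_subsection(text, 3))
    print(shuffle_fragments(text, 2, rng))
    print(uniform_text(text, rng))
```

Larger values of `divisor` and smaller values of `fragment_length` move a text further toward total order and total disorder, respectively.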
Examining the distribution of the computed values shows that TTR-based metrics, except the moving-average TTR (MATTR), are sensitive to increased order but not to increased disorder. Surprisal-based measures, on the other hand, show either Kolmogorov behavior (entropy) or Gell-Mann behavior (normalized entropy and fluctuation complexity), and combining them enhances their mutual interpretability. Our results indicate that fluctuation complexity in particular could complement linguistic complexity tools, since it captures the intuitive notion of complexity in a text.
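As a rough illustration of the metric families named in the abstract, the sketch below computes TTR, MATTR, entropy, normalized entropy, and a fluctuation-complexity value for a list of tokens. Several choices here are assumptions rather than the authors' setup: surprisal is estimated from unigram relative frequencies within the text, entropy is normalized by the logarithm of the vocabulary size, fluctuation complexity is taken as the standard deviation of surprisal around the entropy, and the window size and function names are hypothetical.

```python
import math
from collections import Counter


def ttr(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over all windows of a fixed size.
    Falls back to plain TTR when the text is shorter than the window."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [
        ttr(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(scores) / len(scores)


def surprisal_metrics(tokens):
    """Return (entropy, normalized entropy, fluctuation complexity)
    in bits, using unigram relative frequencies (an assumption)."""
    counts = Counter(tokens)
    total = len(tokens)
    probabilities = [count / total for count in counts.values()]
    entropy = sum(-p * math.log2(p) for p in probabilities)
    # Normalize by the maximum entropy for this vocabulary size.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    normalized = entropy / max_entropy if max_entropy else 0.0
    # Standard deviation of surprisal around its mean (the entropy).
    fluctuation = math.sqrt(
        sum(p * (-math.log2(p) - entropy) ** 2 for p in probabilities)
    )
    return entropy, normalized, fluctuation


if __name__ == "__main__":
    sample = "il gatto dorme e il cane corre e il gatto guarda".split()
    print(ttr(sample), mattr(sample, window=5))
    print(surprisal_metrics(sample))
```

Under these assumptions the fluctuation value is zero both for a text made of a single repeated word and for a perfectly uniform word distribution, and positive in between.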

Files and links (3)

pdf

Brasolin_Bienati_Complexity-or-complexities (681.42 kB)

CC BY V4.0, Open Access

url

https://lcr2024.ut.ee/

url

https://zenodo.org/records/13842028

Details

Title: Complexity or complexities? A simulation study on lexical complexity in expert and learner texts through the lens of information theory
Creators: P Brasolin; Arianna Bienati
Conference: 7th International Conference for Learner Corpus Research (LCR) (Tartu, 26/09/2024 - 28/09/2024)
Identifiers: (EURAC)28802587; 991006890697401241
Academic Unit: Institute for Applied Linguistics
Language: English
Resource Type: Conference presentation
Description coverage: international
Description audience: Scientific
Local Fields: Scientific
Author Names String: Brasolin P, Bienati A
