Abstract
The debate about linguistic complexity in general, and lexical complexity in particular, has been extremely lively, addressing both the validity of the indices used to capture the construct (e.g., McCarthy and Jarvis, 2010; Kyle et al., 2021; Zenker and Kyle, 2021) and the theoretical foundations of the construct itself (e.g., Bulté and Housen, 2012; Jarvis, 2013; Pallotti, 2015). Intuitively, complexity transcends “the number and variety of an item’s constituent elements” to include “the elaboratedness of their interrelational structure” (Rescher, 2020:1). This echoes the concept of Gell-Mann effective complexity in information theory, which emphasizes the amount of non-random information in a system and peaks at an intermediate stage between order and disorder. Gell-Mann complexity is often contrasted with Kolmogorov complexity, i.e., the total amount of information in a system, which increases monotonically from maximum order to maximum disorder.
This study explores which information-theoretic notion of complexity (Kolmogorov vs. Gell-Mann) is measured by widely used complexity indices, via a simulation study on four Italian corpora representing the spectrum from expert to learner texts. New texts are synthesized from the originals by altering them in two directions: increased order is obtained by repeating increasingly smaller subsections of the original text, whereas increased disorder is obtained by shuffling increasingly smaller fragments of it. Additionally, we generate texts with a uniform word distribution, which alters both the structure and the word distribution of the original at the same time. For each corpus, the synthetic data allow us to explore the spectrum from total order to total disorder. All texts are analyzed using metrics based on the type-token ratio (TTR) and on surprisal, including fluctuation complexity (Bates and Shepard, 1993). The distribution of the computed values shows that TTR-based metrics, except MATTR (the moving-average TTR), are sensitive to increased order but not to increased disorder. Surprisal-based measures, on the other hand, show either Kolmogorov-like behavior (entropy) or Gell-Mann-like behavior (normalized entropy and fluctuation complexity), so that combining them makes each easier to interpret. Our results indicate that fluctuation complexity in particular could complement existing linguistic complexity tools, since it captures the intuitive notion of complexity in a text.
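To make the two manipulations and the two metric families concrete, the following is a minimal Python sketch, not the study's actual pipeline. All function names (`repeat_prefix`, `shuffle_fragments`, `ttr`, `mattr`, `surprisal_metrics`) are hypothetical, and the surprisal-based values are estimated from unigram relative frequencies, which is a simplifying assumption: the abstract does not state how surprisal is estimated. Fluctuation complexity is approximated here as the standard deviation of surprisal around the entropy.

```python
import math
import random
from collections import Counter


def repeat_prefix(tokens, size):
    """Increase order: tile the first `size` tokens up to the original length."""
    reps = -(-len(tokens) // size)  # ceiling division
    return (tokens[:size] * reps)[:len(tokens)]


def shuffle_fragments(tokens, size, seed=0):
    """Increase disorder: cut into fragments of `size` tokens and shuffle them."""
    fragments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    random.Random(seed).shuffle(fragments)
    return [tok for frag in fragments for tok in frag]


def ttr(tokens):
    """Type-token ratio: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over all sliding windows of `window` tokens."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


def surprisal_metrics(tokens):
    """Entropy, normalized entropy, and fluctuation (assumed unigram estimates)."""
    counts = Counter(tokens)
    n = len(tokens)
    probs = [c / n for c in counts.values()]
    surprisals = [-math.log2(p) for p in probs]
    entropy = sum(p * s for p, s in zip(probs, surprisals))
    # Maximum entropy for this vocabulary size; 1.0 avoids dividing by zero.
    max_entropy = math.log2(len(probs)) if len(probs) > 1 else 1.0
    variance = sum(p * (s - entropy) ** 2 for p, s in zip(probs, surprisals))
    return {
        "entropy": entropy,
        "normalized_entropy": entropy / max_entropy,
        "fluctuation": math.sqrt(variance),
    }


if __name__ == "__main__":
    text = "il gatto dorme e il cane dorme ma il gatto non dorme".split()
    for label, variant in [
        ("original", text),
        ("ordered", repeat_prefix(text, 3)),
        ("disordered", shuffle_fragments(text, 2)),
    ]:
        print(label, round(ttr(variant), 3), surprisal_metrics(variant))
```

Under this simplification, the fluctuation value is zero both for a text that repeats a single word and for a text with a perfectly uniform word distribution, and positive in between, which is the kind of intermediate peak the abstract associates with Gell-Mann complexity. Because unigram estimates ignore word order, fragment shuffling alone leaves them unchanged; an order-sensitive estimate (for example, one based on word transitions) would be needed to register that manipulation.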