Abstract
The potential to exploit existing data resources is one of the biggest drivers of research and innovation. Recent advances in artificial intelligence, machine learning and data mining have led to a general interest in the possibilities that arise when these methods are applied to existing datasets from other research fields. A growing number of studies report interesting insights gained from existing data resources. Among those are analyses of textual data, which gives reason to consider such methods for linguistics as well. However, corpus linguistics, defined as the empirical study of real-life language use (McEnery & Hardie, 2012), usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions (Hunston, 2009). Methods of data reutilization and exploitation, such as data mining and knowledge discovery, are rarely considered in corpus linguistics (but see Degaetano-Ortlieb & Fankhauser, 2014; or Pölitz, 2016 for some examples), although the rapid advances made in natural language processing and text mining strengthen the case for the use of machine-learning-based, data-driven analysis in this field.
In this study I aim to shed some light on the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies. In particular, I investigate the possibility of repurposing existing German-language corpora for linguistic inquiry by using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing language corpora with computational methods.
Part I introduces strategies and methods that have already been used on language data, discusses how they can assist corpus linguistic analysis, and refers to available toolkits and software as well as to state-of-the-art research and further references whenever possible, so that the reader can use this overview as an entry point for their own work. Part II evaluates the previously introduced methodological toolset by applying it in two differently shaped corpus studies that utilize existing, readily available language corpora that are of medium size and primarily composed of German texts. Both studies are based on computational linguistic tasks that have evolved over the last few years and are increasingly used for linguistic analyses. Corpus study one explores linguistic correlates of holistic text quality ratings on student essays and is conceptually similar to automated essay scoring or predicting language competence levels, both common tasks in computational linguistics. Study two deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions that are based on previous research in the field. While both studies contribute to the study of the German language by giving linguistic insights that integrate into the current understanding of the investigated phenomena, they are also conceptualized as realistic case studies for testing the methodological toolset introduced in Part I, in order to allow a detailed discussion of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics (cf. Part III).
The results show that using machine-learning-based data mining methods can add value to corpus linguistics. However, repurposing available but relatively small corpora is difficult. Although new methodologies have been developed in order to prepare, select, transform, analyse and interpret data more efficiently, many of these techniques are still experimental, require considerable background knowledge and technical skill, and often depend on tools and resources that were developed for the English language. Furthermore, although strategies exist to extract more information from data or address more complex research questions, small data sizes often do not allow phenomena to be observed at higher resolution, and so reveal few insights beyond the main trends in the data. In terms of methodological rigour and efficiency, however, the methods can be an improvement over previous, mainly manual techniques, even when using existing language corpora of small to medium size.