Abstract
Vernacular writing that reflects local dialects is a commonly observed characteristic of the written technology-mediated communication of some communities (e.g. Alshutayri & Atwell 2019; Ueberwasser & Stark 2017; Frey et al. 2015). While members (or observers) of a community can easily recognize and understand local dialect spellings, it is a methodological challenge to assign non-standard spelling variants empirically to a specific local variety and thus to separate them from misspellings or supra-regional vernacular.
We present a study that addresses this problem using data-driven methods for the analysis of German Facebook texts from the DiDi corpus of South Tyrolean CMC Data (Frey et al. 2016). The DiDi corpus provides access to more than 23,000 mainly German status updates, comments and chat messages written by around 120 writers in 2013 (corpus size in tokens: ca. 374,000). The corpus also provides person-related metadata, such as gender, age and geographic origin, which are relevant variables for language variation (Löffler 2003). By correlating frequently occurring spelling variants of the Standard German -er suffix in the DiDi corpus with geographic, social and situational variables, Glück and Glaznieks (2019) were able to relate one variant (-o) to a specific geographic area (Val Pusteria), with a distribution typical of dialect use confirmed by the variables gender (cf. also Sieburg 1992), age (cf. Vergeiner et al. 2020) and communication type.
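As a rough illustration of what relating a suffix variant to a geographic variable can look like, the following minimal sketch computes the relative frequency of each variant per region. It is not the procedure used by Glück and Glaznieks (2019); the function name `variant_shares`, the record layout and the toy records are hypothetical and serve only to show the idea.

```python
from collections import Counter, defaultdict


def variant_shares(records):
    """Return, per region, the relative frequency of each suffix variant.

    `records` is an iterable of (region, variant) pairs; this layout is an
    assumption made for the sketch, not the DiDi corpus format.
    """
    counts = defaultdict(Counter)
    for region, variant in records:
        counts[region][variant] += 1

    shares = {}
    for region, variant_counts in counts.items():
        total = sum(variant_counts.values())
        shares[region] = {v: n / total for v, n in variant_counts.items()}
    return shares


# Invented toy data, NOT taken from the corpus: each pair is one token
# whose Standard German -er suffix was spelled with the given variant.
toy_records = [
    ("Val Pusteria", "-o"), ("Val Pusteria", "-o"),
    ("Val Pusteria", "-o"), ("Val Pusteria", "-er"),
    ("other area", "-er"), ("other area", "-er"),
    ("other area", "-er"), ("other area", "-o"),
]

shares = variant_shares(toy_records)
for region, by_variant in shares.items():
    print(region, by_variant)
```

The same tabulation could be repeated with gender, age group or communication type in place of region; whether a difference in shares is meaningful would still need an appropriate statistical test.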
In this presentation we extend the approach using methods from natural language processing and social network analysis to look at co-occurring features at the grapheme, word and text level. Through quantitative and subsequent qualitative analyses of the data, we were able not only to identify more dialect features (e.g. the substitution of by ) but also to determine how consistent writers are and whether they clearly distinguish between standard and regional varieties.
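One simple way to look at co-occurring features is to count how often two features appear in the same text and to treat these counts as weighted edges of a feature network. The sketch below assumes that each text has already been reduced to a set of feature labels; the function name `cooccurrence_edges` and the labels are hypothetical placeholders, not the feature inventory or the method of the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(texts):
    """Count, for every pair of features, the number of texts containing both.

    `texts` is an iterable of feature-label collections, one per text.
    The result maps an alphabetically ordered pair to its count and can be
    read as the weighted edge list of an undirected feature network.
    """
    edges = Counter()
    for features in texts:
        # sorted(set(...)) removes duplicates and fixes the order of each pair
        for a, b in combinations(sorted(set(features)), 2):
            edges[(a, b)] += 1
    return edges


# Invented toy data: placeholder labels for features found in three texts.
toy_texts = [
    {"er_to_o", "feature_B"},
    {"er_to_o", "feature_B", "feature_C"},
    {"feature_C"},
]

edges = cooccurrence_edges(toy_texts)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

Grouping the texts by writer before counting would give one such edge list per person, which is one possible starting point for comparing how consistently individual writers combine features.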