Abstract
One commonly observed characteristic of written computer-mediated communication (CMC) is the use of non-standard language. Writers have different motives for making use of the medium’s more liberal writing conventions. When it comes to the use of vernacular languages or switching between languages and varieties, one of the most frequently cited motives is the expression of one’s own linguistic identity and/or of one’s affiliation with a particular social group (e.g. Tagg & Seargeant 2014, Schreiber 2015). However, even though non-standard spellings are easy to identify, and members (or observers) of a social group may easily recognize and understand their local dialect spellings, it remains a methodological challenge to assign those non-standard spelling variants unambiguously to a specific local variety. Linguists often presume that spoken features are repeated in spelling and interpret non-standard variants as a reproduction of spoken dialect if phonetic features are reflected in spelling (e.g. Ziegler 2005, Tophinke 2008), but this approach rests on interpretation rather than on evidence derived from the data.
We present a study that addresses this problem by analysing German Facebook texts from the DiDi corpus of South Tyrolean CMC data (Frey, Glaznieks & Stemle 2016) using data-driven methods. The DiDi corpus provides access to more than 23,000 mainly German status updates, comments and chat messages written by around 120 writers in 2013 (corpus size: ca. 374,000 tokens). In addition, the corpus provides person-related metadata, such as gender, age and geographic origin, which are relevant variables for language variation (Löffler 2003). By correlating frequently occurring spelling variants of the Standard German -er suffix in the DiDi corpus with geographic, social and situational variables, Glück and Glaznieks (2019) were able to relate one variant (-o) to a specific geographic area (Val Pusteria). Its distribution is typical of dialect use, as confirmed by the variables gender (i.e. used more often by males, cf. also Sieburg 1992), age (i.e. used less by people between 30 and 60, cf. Vergeiner et al. in press) and communication type (i.e. used more often in chat messages). In this presentation we will extend this approach with new data by relying on co-occurring features at the grapheme, word and text level. Using methods from natural language processing and social network analysis, we will investigate variants of the most common words that show the -o suffix in a group of writers from Val Pusteria in order to establish other dialect features in the corpus that have to date not been scientifically identified as such. The network analysis methods will furthermore allow us to consider the number of variants a writer shows in their texts, and will enable us to determine how consistent writers are and whether they clearly distinguish between two (or more) varieties (e.g. standard and regional dialect).
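The co-occurrence idea described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the token list is invented for illustration and is not taken from the DiDi corpus, and the function names (`variants_by_writer`, `cooccurrence_edges`, `nonstandard_share`) are hypothetical. The sketch assumes that each token has already been aligned with its Standard German form, links two variants whenever the same writer uses both, and reports a simple per-writer share of non-standard spellings as a rough proxy for consistency.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy, invented data: (writer_id, standard_form, observed_spelling).
# Illustrative only; these tokens are NOT taken from the DiDi corpus.
TOKENS = [
    ("w1", "aber", "obo"),
    ("w1", "oder", "odo"),
    ("w1", "immer", "immo"),
    ("w2", "aber", "aber"),
    ("w2", "oder", "oder"),
    ("w3", "aber", "obo"),
    ("w3", "oder", "odo"),
    ("w3", "immer", "immer"),
]


def variants_by_writer(tokens):
    """Map each writer to the set of non-standard variants they use."""
    result = defaultdict(set)
    for writer, standard, observed in tokens:
        if observed != standard:
            result[writer].add(observed)
    return dict(result)


def cooccurrence_edges(writer_variants):
    """Weight each pair of variants by the number of writers who use both."""
    edges = Counter()
    for variants in writer_variants.values():
        for a, b in combinations(sorted(variants), 2):
            edges[(a, b)] += 1
    return edges


def nonstandard_share(tokens):
    """Per writer: fraction of tokens that differ from the standard form."""
    total, nonstandard = Counter(), Counter()
    for writer, standard, observed in tokens:
        total[writer] += 1
        if observed != standard:
            nonstandard[writer] += 1
    return {writer: nonstandard[writer] / total[writer] for writer in total}


if __name__ == "__main__":
    by_writer = variants_by_writer(TOKENS)
    for pair, weight in cooccurrence_edges(by_writer).most_common():
        print(pair, weight)
    print(nonstandard_share(TOKENS))
```

In such a sketch, heavily weighted edges would point to variants that tend to be used by the same writers, and a share close to 0 or 1 would suggest a writer who keeps to one variety, whereas intermediate values would suggest mixing. Real data would additionally require normalisation of spellings and controls for text length and communication type.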