Compiling Social Media Corpora in the Face of Ever-Changing Social Media Landscapes: A Focus on Reusability

A König; Egon Waldemar Stemle

Social Media communication has become an everyday part of people’s lives across generations, cultures, geographical areas, and social classes. Shaped by the specific social and technical context in which it is produced, synchronous and asynchronous communication has become increasingly participatory, interactive, and multimodal, with services like TikTok, YouTube Shorts and Facebook Reels. The multitude of people, situational contexts and technologies involved, combined with the goal of digitally preserving and archiving such data to guarantee long-term, sustainable access, immediately raises a wide range of considerations regarding licenses, ethics, metadata, data and the technologies involved. Starting from the FAIR principles [1], we find that Findability & Accessibility are already well covered technologically (although actual adoption by users is still lagging) [2]. Interoperability & Reusability, however, have yet to be addressed, to different degrees. For example, since the current version of the TEI encoding framework and guidelines (TEI P5) does not offer any specific models for social media data, the TEI SIG CMC [3] was created to develop a TEI standard for representing the structural and linguistic peculiarities of social media data, including features and instructions for encoding such data, a desideratum in both the digital humanities and computer science. This, in turn, is an excellent technical step towards better Interoperability (and, in part, Accessibility). In any case, we believe the tasks at large are best addressed within a network of interested stakeholders that helps develop and disseminate standards and best practices. Apart from this very conference, a dedicated community exists for “Social Media and CMC Corpora”, which has just celebrated its 10th annual international conference.
Out of (parts of) this community, a CLARIN Knowledge Centre for Social Media and Computer-Mediated Communication Corpora (CKCMC) [4] has been established to act as a focal point for the knowledge developed within the community. The Centre is part of CLARIN, the European research infrastructure for language data, which aims to help develop standards and best practices for documenting, collecting and archiving language data. For this conference, we want to highlight and further discuss one particular Reusability aspect: the dynamics of the platforms themselves, which seem particularly relevant for long-term archiving. Twitter (now known as “X”) in 2023 looks and works differently from Twitter in 2013. A social media corpus collected five or ten years ago might contain data from a platform that functionally (or actually) no longer exists, or data that was produced with different technology. For example, older social media corpora like the Dortmund chat corpus [5] document IRC communication, which is virtually non-existent these days and works very differently from modern chat systems like Discord. Other technological changes include the transition from dial-in internet to (almost) permanent internet, from desktop to mobile, from mobile phone to smartphone, and from T9 keyboards via small hardware keyboards to smart touch keyboards with predictive writing capabilities.
