Towards an infrastructure for FAIR language learner corpora
At: 8th NLP4CALL workshop; Turku; 30.9.2019 - 30.9.2019

In recent years, the reproducibility of scientific research has become increasingly important, both for external stakeholders and for the research communities themselves. Both groups demand that empirical data collected and used for scientific research be managed and preserved in a way that makes research results reproducible. To address this, the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. Natural language processing and its methods play a special role here, since they are an integral part of many other disciplines working with language data. Language corpora are often living objects: they are constantly being improved and revised, and at the same time the processing tools are regularly updated, which can lead to different results for the same processing steps.

In this presentation I will first investigate computer-mediated communication (CMC) corpora, which resemble language learner corpora in some core aspects, with regard to their compliance with the FAIR principles, and discuss to what extent depositing research data in the repositories of data preservation initiatives such as CLARIN, Zenodo or META-SHARE can assist in the provision of FAIR corpora. Second, I will show some modern software technologies and how they make software packaging, installation, and execution and, more importantly, the tracking of corpora throughout their life cycle reproducible. This in turn makes changes to raw data reproducible for many subsequent analyses.
Showing items related by title, author, creator and subject.
Stemle EW; Boyd A; Janssen M; Lindström Tiedemann T; Mikelić Preradović N; Rosen A; Rosén D; Volodina E (2019) In this article we give an overview of first-hand experiences and starting points for best practices from projects in seven European countries dedicated to learner corpus research and the creation of language learner ...
Okinina N; Nicolas L (2018) We present the results of prototypical experiments conducted with the goal of designing a machine translation (MT) based system that assists the annotators of learner corpora in performing orthographic error annotation. ...
Glaznieks A; Anstein S (2011) In this article, we present systematic studies of two kinds of language varieties for their comparison and documentation. To analyse geographical varieties of German, the annotated Korpus Südtirol in the framework of the ...