Abstract
Over the last decades, research data management has become a central task in the scientific enterprise. Research infrastructures such as CLARIN (de Jong et al. 2020) provide services and technologies to improve data sustainability, and many communities have taken important steps to ensure interoperability and reusability of research data (e.g. the CMC community, see Beißwenger & Lüngen 2020). In learner corpus research (LCR), however, research data management has attracted less attention, leaving much room for improvement in the sustainable use of resources and in the comparability and interconnectivity of individual studies (Tracy-Ventura et al. 2021; Stemle et al. 2019).
One area that would benefit significantly from standardization is corpus description, which includes metadata at the level of the learner corpus as a whole and metadata used to describe the individual learners and the task types/registers the corpus is meant to represent. There are a number of reasons why this is important. First, standardized and well-structured metadata increase the findability and usability of existing learner corpora. Second, they should enhance not only the comparability of datasets but also the comparability of LCR studies, provided researchers agree on a common set of definitions. Third, extensive metadata that ideally follow a standardized vocabulary are an important aspect of findable, accessible, interoperable and reusable (FAIR) research data (Wilkinson et al. 2016). Today, however, it is still unclear to what extent metadata can be standardized in LCR, and preliminary work on the topic (Granger & Paquot 2017) shows the complexity of this issue.
To estimate the feasibility of such an approach, we tried to apply the metadata schema proposed by Granger & Paquot (2017) to five learner corpora available for research purposes. In the process, we identified a set of core metadata fields that we consider necessary to describe learner corpora in a consistent and informative way, while also leaving room for optional information. Although all five corpora were collected in the context of school education, they represent a variety of learners and language samples, thus providing a rich testbed.
The main objective of this presentation is to introduce this revised metadata schema for learner corpora, which results from extensive collaboration between a research data infrastructure expert, who is also a member of CLARIN's metadata taskforce, and the data owners of the five resources. In line with Granger & Paquot (2017), our proposed metadata schema is divided into six sections: Corpus metadata, subdivided into administrative metadata (e.g. authors or license) and design metadata (e.g. date and place of collection or type of task); Text metadata (fine-grained per-text information); Author metadata (details about the learners, e.g. age, languages spoken); Annotator metadata (e.g. professional and language background); Transcriber metadata (e.g. native language or language repertoire); and Task metadata (e.g. instructions, time constraints). While basic information about learners (authors) and language samples (texts) is typically included in the metadata associated with a learner corpus, other aspects, such as those related to the annotation or transcription procedure or the specificities of a task, are often documented elsewhere (e.g. in the corpus manual) or are simply absent from currently available learner corpora. In our presentation, we argue in favor of a systematic description of all these aspects as part of core metadata.
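To make the section structure concrete, the following is a minimal Python sketch, not the schema itself. The section names follow the description above, but every field name (including the `text_id` placeholder for the Text section) and the helper `missing_fields` are hypothetical and were chosen only for illustration.

```python
# Sections as described in the abstract. The field names are illustrative
# assumptions drawn from the examples in the text, not the actual schema.
SECTIONS = {
    "corpus_administrative": ["authors", "license"],
    "corpus_design": ["collection_date", "collection_place", "task_type"],
    "text": ["text_id"],  # hypothetical placeholder for per-text information
    "author": ["age", "languages_spoken"],
    "annotator": ["professional_background", "language_background"],
    "transcriber": ["native_language", "language_repertoire"],
    "task": ["instructions", "time_constraints"],
}


def missing_fields(record: dict) -> dict:
    """Return, per section, the example fields a record lacks or leaves empty."""
    gaps = {}
    for section, names in SECTIONS.items():
        provided = record.get(section, {})
        absent = [n for n in names if provided.get(n) in (None, "", [])]
        if absent:
            gaps[section] = absent
    return gaps


# A record that documents only learners and texts, as is typical today.
record = {
    "author": {"age": 14, "languages_spoken": ["German", "Italian"]},
    "text": {"text_id": "t001"},
}
for section, absent in missing_fields(record).items():
    print(f"{section}: missing {', '.join(absent)}")
```

Run against a record that only documents learners and texts, the check reports the annotator, transcriber, task and corpus-level sections as incomplete, which mirrors the gap the schema is meant to close.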
While the metadata schema was initially created in a simple tab-separated format, it is currently being transformed into the CMDI metadata format (Broeder et al. 2012) using the CMDI Core Components (https://clarin-eric.github.io/cmdi-core-components/). The resulting profile will serve as a viable use case for the creators of the core components and as an "off-the-shelf" profile for any researcher seeking one for their learner corpus project.
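The general idea of moving from a flat tab-separated description to a nested XML record can be sketched as follows. This is only an illustration under stated assumptions: the three-column layout (`section`, `field`, `value`), the root element name, and the sample values are all hypothetical, and the output is generic XML, not a valid CMDI record and not based on the actual CMDI Core Components.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tsv_to_xml(tsv_text: str) -> ET.Element:
    """Group (section, field, value) rows into a nested XML tree."""
    root = ET.Element("LearnerCorpusMetadata")  # hypothetical root name
    sections = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        name = row["section"]
        # Create each section element once, in order of first appearance.
        if name not in sections:
            sections[name] = ET.SubElement(root, name)
        ET.SubElement(sections[name], row["field"]).text = row["value"]
    return root


# Hypothetical sample rows; the values are placeholders, not real metadata.
sample = (
    "section\tfield\tvalue\n"
    "Corpus\tlicense\texample-license\n"
    "Task\ttimeConstraints\t45 minutes\n"
)
print(ET.tostring(tsv_to_xml(sample), encoding="unicode"))
```

A real conversion would additionally need the CMDI envelope and a registered profile; the sketch only shows the flat-to-nested step.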
The schema will be made available as a CMDI profile in the CLARIN Component Registry (https://catalog.clarin.eu/ds/ComponentRegistry/) and as a resource in the research data repository of the Eurac Research Clarin Center (ERCC, https://clarin.eurac.edu/), where the corpora used to develop the schema are also available together with their metadata. Additionally, a detailed description of the schema will be provided to the research community on the learner corpus portal PORTA (https://www.porta.eurac.edu/).