Abstract
This poster introduces a core metadata schema for L2 production data, and for learner corpora in particular. The schema is the result of extensive collaboration between learner corpus compilers at the Centre for English Corpus Linguistics (UCLouvain, Belgium) and Eurac Research (Bolzano, Italy) and a research data infrastructure expert who is a member of CLARIN's metadata taskforce.
The project stems from the recognition that L2 data description would benefit significantly from standardization. Such description includes metadata at the level of the dataset as a whole as well as metadata describing the individual learners and the task types/registers the corpus is meant to represent. Standardization matters for several reasons. First, standardized and well-structured metadata increase the findability and usability of existing learner corpora. Second, they should enhance the comparability of datasets and of L2 studies, provided researchers agree on a common set of definitions. Third, extensive metadata that ideally follow a standardized vocabulary are an essential aspect of research data that are findable, accessible, interoperable, and reusable (FAIR; Wilkinson et al., 2016).
In continuation of Granger & Paquot (2017), our proposed metadata schema is divided into the following sections: Corpus metadata, itself subdivided into Administrative metadata (e.g. authors or license) and Corpus design metadata (e.g. date and place of collection or type of task); Text metadata (fine-grained per-text information); Learner metadata (details about the learners, e.g. age, languages spoken); Annotation metadata (e.g. details about manual or automatic annotation); Annotator metadata (e.g. professional and language background); Transcriber metadata (e.g. native language or language repertoire); and Situational and Task metadata (e.g. instructions, time constraints). While basic information about learners (authors) and language samples (texts) is typically included in the metadata associated with a learner corpus, other aspects, such as the annotation or transcription procedure or the specificities of a task, are often documented elsewhere (e.g. in the corpus manual) or are missing altogether from currently available learner corpora. We propose to describe all of these aspects systematically as part of the core metadata.
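To make the sectioned structure more concrete, the following is a minimal sketch of how some of the sections named above could be modelled as nested records. It is not the LC-meta schema itself: all class and field names (for instance `learner_id`, `time_limit_minutes`) and the example values are hypothetical, the Annotation, Annotator and Transcriber sections are left out for brevity, and linking each text to a learner via an identifier is an assumption made for illustration only.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class AdministrativeMetadata:
    # Hypothetical fields, based on the examples "authors or license"
    authors: List[str]
    license: str


@dataclass
class CorpusDesignMetadata:
    # Hypothetical fields, based on "date and place of collection or type of task"
    collection_date: str
    collection_place: str
    task_type: str


@dataclass
class CorpusMetadata:
    # Corpus metadata is itself split into two subsections
    administrative: AdministrativeMetadata
    design: CorpusDesignMetadata


@dataclass
class LearnerMetadata:
    learner_id: str  # assumed identifier, not taken from the schema
    age: Optional[int] = None
    languages_spoken: List[str] = field(default_factory=list)


@dataclass
class TaskMetadata:
    instructions: str
    time_limit_minutes: Optional[int] = None  # None means no time constraint recorded


@dataclass
class TextMetadata:
    text_id: str
    learner_id: str  # assumed link from a text to its author
    task: TaskMetadata


@dataclass
class CorpusRecord:
    corpus: CorpusMetadata
    learners: List[LearnerMetadata]
    texts: List[TextMetadata]


def missing_learners(record: CorpusRecord) -> List[str]:
    """Return the ids of texts whose learner has no Learner metadata entry."""
    known = {learner.learner_id for learner in record.learners}
    return [t.text_id for t in record.texts if t.learner_id not in known]


# Illustrative values only; none of them come from a real corpus.
record = CorpusRecord(
    corpus=CorpusMetadata(
        administrative=AdministrativeMetadata(authors=["A. Compiler"], license="CC BY 4.0"),
        design=CorpusDesignMetadata(
            collection_date="2024",
            collection_place="(unspecified)",
            task_type="argumentative essay",
        ),
    ),
    learners=[LearnerMetadata("L001", age=21, languages_spoken=["French", "English"])],
    texts=[
        TextMetadata("T001", "L001", TaskMetadata("Write an essay.", 60)),
        TextMetadata("T002", "L999", TaskMetadata("Write an essay.")),
    ],
)

print(missing_learners(record))  # → ['T002']
```

Keeping each section as its own record type mirrors the point made above: information such as task instructions or time constraints travels with the corpus as structured metadata rather than living only in a corpus manual, and simple consistency checks become possible.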
A first version of the core metadata schema was tested on a range of learner corpora representing a variety of learner profiles and language samples (Paquot et al., 2023). It was presented at several conferences (e.g. LCR2022, EUROSLA2023) as part of an extensive feedback collection phase. Additional feedback was gathered via mailing lists and an online form. Based on the comments received, we substantially revised the initial proposal and released LC-meta version 2 in 2024 (Paquot et al., 2024a; 2024b). In 2025, a metadata working group was established under the aegis of the Learner Corpus Association, in collaboration with the CLARIN K-centre for Learner Corpora. Its mission is to further develop, maintain, and disseminate the metadata schema.