Abstract
In this presentation, I will discuss the objectives of my PhD, its current state and the ongoing efforts. My research combines language learning activities with implicit crowdsourcing to produce Natural Language Processing (NLP) resources for Italian and for the variety of German spoken in South Tyrol. More precisely, it focuses on creating mechanisms that generate language learning exercises from Linguistic Resources (LRs), crowdsource the answers of language learners to these exercises, and use those answers to correct and extend the LRs.
The ongoing efforts comprise adapting and extending an existing prototype vocabulary trainer that delivers exercises through a Telegram bot, following a comprehensive review of the relevant literature on crowdsourcing NLP LRs.
The adapted trainer combines exercise content automatically generated from an LR called ConceptNet (Speer et al., 2017) with the automated aggregation and evaluation of the input crowdsourced from learners, in order to improve the LR in return (Rodosthenous, 2019; Lyding, 2019). In this type of exercise, learners are asked whether two words are related to one another (Rodosthenous, 2020; Nicolas, 2020) and also to provide words that are related to one another (Rodosthenous, 2020). Because the trainer is still a prototype, some aspects are not yet well covered, such as the automatic profiling of learners, the extension to other languages, and the generation of other types of exercises. There are also challenges in aggregating the crowdsourced answers and in avoiding bias when collecting them. Establishing the correct answer to a given Boolean question is a matter of aggregating answers until a quality threshold is met (e.g., a reliability score above 98%). To investigate the issue of bias, I will calculate the learners’ reliability on the exercises. Finally, as part of exploring and extending this approach within this PhD, I will explain my next planned experiment, which uses an updated version of the prototype to teach vocabulary to L2 students learning Italian while crowdsourcing knowledge about synonymy and other similar semantic relations (e.g., hyponymy, hypernymy).
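As a minimal illustration of the aggregation step, and not the trainer's actual implementation, the following Python sketch shows one way yes/no answers could be accumulated until a confidence threshold is reached. It assumes that answers are independent, that each learner's reliability is estimated as a smoothed share of agreement with already validated items, and that a neutral prior of 0.5 applies; the function names `learner_reliability` and `aggregate` are hypothetical.

```python
def learner_reliability(n_agreeing, n_total):
    """Smoothed share of a learner's answers that agree with validated items.

    Assumption: add-one smoothing keeps the estimate strictly between 0 and 1.
    """
    return (n_agreeing + 1) / (n_total + 2)


def aggregate(answers, threshold=0.98, prior=0.5):
    """Accumulate (answer, reliability) pairs until confidence reaches threshold.

    Returns (decision, confidence, n_used). decision is True or False once the
    threshold is met, or None if the available answers are not sufficient.
    """
    odds = prior / (1 - prior)
    confidence = max(prior, 1 - prior)
    n_used = 0
    for answer, reliability in answers:
        if not 0 < reliability < 1:
            raise ValueError("reliability must lie strictly between 0 and 1")
        n_used += 1
        ratio = reliability / (1 - reliability)
        # A "yes" from a reliable learner raises the odds that the relation
        # holds; a "no" lowers them by the same factor.
        odds *= ratio if answer else 1 / ratio
        p_true = odds / (1 + odds)
        confidence = max(p_true, 1 - p_true)
        if confidence >= threshold:
            return p_true >= 0.5, confidence, n_used
    return None, confidence, n_used


if __name__ == "__main__":
    # Three learners, each assumed to be 80% reliable, all answer "yes".
    decision, confidence, n_used = aggregate([(True, 0.8)] * 3)
    print(decision, round(confidence, 4), n_used)
```

Under these assumptions, agreeing answers from reliable learners reach the threshold quickly, whereas conflicting answers keep the question open so that it can be shown to further learners.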
With respect to the overview of related work, I will briefly discuss a joint effort, in collaboration with a team of six researchers from four countries, to compile a wide-scale overview of past efforts to crowdsource NLP datasets.