Open Corpus Interface for Italian Language Learning
In this article, we present the multi-faceted interface to the open PAISà corpus of Italian. Created within the project PAISà (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati), the corpus is designed ...
High-Accuracy Phrase Translation Acquisition Through Battle-Royale Selection
(RANLP 2011 Organising Committee / ACL, 2013)
In this paper, we report on an unsupervised greedy-style process for acquiring phrase translations from sentence-aligned parallel corpora. Thanks to innovative selection strategies, this process can acquire multiple ...
Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects
The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many ...
Annotating Archaeological Texts: An Example of Domain-Specific Annotation in the Humanities
(Association for Computational Linguistics, 2012)
Developing content extraction methods for Humanities domains raises a number of challenges, from the abundance of non-standard entity types to their complexity to the scarcity of data. Close collaboration with Humani- ...
The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts
The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing ...
Towards high-accuracy bilingual phrase acquisition from parallel corpora
We report on on-going work to derive translations of phrases from parallel corpora. We describe an unsupervised and knowledge-free greedy-style process relying on innovative strategies for choosing and discarding candidate ...
bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)
This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the ...
Structure-Preserving Pipelines for Digital Libraries
(Association for Computational Linguistics, 2011)
Most existing HLT pipelines assume the input is pure text or, at most, HTML and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library ...
Collecting language data of non-public social media profiles
(Universitätsverlag Hildesheim, Germany, 2014)
In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges for building corpora of private, non-public ...
The PAISÀ Corpus of Italian Web Texts
(Association for Computational Linguistics, 2014)
PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.