StirWaC: compiling a diverse corpus based on texts from the web for South Tyrolean German

S Schulz; Verena Lyding; Lionel Nicolas

Conference proceeding

Peer reviewed

Workshop proceedings "Web as Corpus", Corpus Linguistics (WAC) 2013

8th Web as Corpus Workshop (WAC-8) (Lancaster, 22/07/2013 - 22/07/2013)

2013

Handle:

https://hdl.handle.net/10863/8875

Abstract

In this paper, we report on the creation of a web corpus for the variety of German spoken in South Tyrol. We thus provide an example of compiling a corpus for a language variety that has neighboring varieties and whose content on the internet is both sparse and published under various top-level domains. We discuss how we tackled the task of balancing data quantity against data quality. Our aim was twofold: to create a web corpus that is diverse in terms of text types and highly representative of South Tyrolean German. We present our procedure for collecting relevant texts and an approach to enhancing diversity by detecting and filling gaps in a corpus.

Files and links (2)

url

https://www.sigwac.org.uk/wiki/WAC8

url

https://www.sigwac.org.uk/raw-attachment/wiki/WAC8/wac8-proceedings.pdf

Details

Title: StirWaC: compiling a diverse corpus based on texts from the web for South Tyrolean German
Creators: S Schulz; Verena Lyding; Lionel Nicolas
Publication Details: Workshop proceedings "Web as Corpus", Corpus Linguistics (WAC) 2013
Conference: 8th Web as Corpus Workshop (WAC-8) (Lancaster, 22/07/2013 - 22/07/2013)
Publisher: Lancaster
Identifiers: (EURAC)20025938; 991005772153301241
Academic Unit: Institute for Applied Linguistics
Language: English
Resource Type: Conference proceeding
Local Fields: Scientific
Author Names String: Schulz S, Lyding V, Nicolas L
Additional Description: Projected: 4008
