On the Detection of Neologism Candidates as Basis for Language Observation and Lexicographic Endeavours: The STyrLogism Project
The goal of the project STyrLogisms is to semi-automatically extract neologism (new lexemes) candidates for the German standard variety used in South Tyrol. We use a list of manually vetted URLs from news, magazines and blog websites of South Tyrol and regularly crawl their data, clean and process it and compare this new data to reference corpora and additional regional word lists and the formerly crawled data sets. Our reference corpora are DECOW14 with around 60m types, and the South Tyrolean Web Corpus with around 2.4m types; the additional word lists consist of named entities, terminological terms from the region, and specific terms of the German standard variety used in South Tyrol (altogether around 53k unique types). Here, we will report on the employed method, a first round of candidate extraction with an approach for a classification schema for the selected candidates, and some remarks on a second extraction round.