dc.contributor.author: Murphy B
dc.contributor.author: Stemle EW
dc.description.abstract: Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text, and distinguishing it from related languages in the same region can be difficult. For example less dominant variants of English (e.g. New Zealander, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism, or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms, and several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text. [en_US]
dc.publisher: Association for Computational Linguistics [en_US]
dc.relation: First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties ; Edinburgh, Scotland : 31.7.2011 - 31.7.2011
dc.title: PaddyWaC: A Minimally-Supervised Web-Corpus of Hiberno-English [en_US]
dc.type: Book chapter [en_US]
