(Association for Computational Linguistics, 2011) Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding ...
(Association for Computational Linguistics, 2011) Most existing HLT pipelines assume the input is pure text or, at most, HTML and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library ...