NLP challenges in dealing with OCR-ed documents of derogated quality

Michel Généreux; D Spano

Conference proceeding

Open access

Peer reviewed

Workshop proceedings "Replicability and Reproducibility in Natural Language Processing: adaptive methods, resources and software" at IJCAI 2015

Workshop on "Replicability and Reproducibility in Natural Language Processing: adaptive methods, resources and software" at IJCAI 2015 (Buenos Aires, 25/07/2015 - 31/07/2015)

2015

Handle:

https://hdl.handle.net/10863/7981

Abstract

In this paper, we present ongoing experiments for correcting and tagging German historical documents that have been scanned and digitized using optical character recognition (OCR). The documents are to be corrected and annotated with named entities (NE) and part-of-speech (POS) tags. As the content of our collection of OCR-ed documents is seriously degraded, we compare two approaches for the correction of text, both methods sharing features from spelling correction in context using a probabilistic edit-operation error model, and differing mainly in the candidate selection process and speed of execution. Existing tools for NE or POS tagging are either retrained or adapted to cater to the specific data condition. An additional challenge we meet is to process data as they are typically made available in an OCR-ed context, and to deliver output in the same format, which raises issues in data alignment. We stress conditions in which we can obtain optimal reduction of error rates with real data from our collection.

Files and links (1)

url

https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxhZGFwdGl2ZW5scDIwMTV8Z3g6Mzc1ZmNjNDg5ZDUzYTY5MA

Details

Title: NLP challenges in dealing with OCR-ed documents of derogated quality
Creators: Michel Généreux; D Spano
Publication Details: Workshop proceedings "Replicability and Reproducibility in Natural Language Processing: adaptive methods, resources and software" at IJCAI 2015
Conference: Workshop on "Replicability and Reproducibility in Natural Language Processing: adaptive methods, resources and software" at IJCAI 2015 (Buenos Aires, 25/07/2015 - 31/07/2015)
Publisher: Buenos Aires
Number of pages: 6
Identifiers: (EURAC)20183507; 991005772669601241
Academic Unit: Institute for Applied Linguistics
Language: English
Resource Type: Conference proceeding
Local Fields: Scientific
Author Names String: Généreux M, Spano D
