NLP challenges in dealing with OCR-ed documents of degraded quality
In this paper, we present ongoing experiments on correcting and tagging German historical documents that have been scanned and digitized using optical character recognition (OCR). The documents are to be corrected and annotated with named entities (NE) and parts of speech (POS). As the content of our collection of OCR-ed documents is seriously degraded, we compare two approaches to text correction. Both methods share features of in-context spelling correction based on a probabilistic edit-operation error model, and differ mainly in their candidate selection process and speed of execution. Existing tools for NE and POS tagging are either retrained or adapted to cater to the specific condition of the data. An additional challenge is to process data in the format in which they are typically made available in an OCR context, and to deliver output in the same format, which raises data alignment issues. We identify the conditions under which we obtain an optimal reduction of error rates on real data from our collection.
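The correction scheme described above can be illustrated with a minimal noisy-channel sketch: candidates within a bounded edit distance of the OCR-ed token are scored by a lexical prior combined with a crude per-edit channel probability. The names (`LEXICON`, `edit_distance`, `correct_token`) and the uniform `error_rate` are illustrative assumptions, not the paper's actual model, which uses a trained probabilistic edit-operation error model and in-context scoring.

```python
# Minimal sketch of noisy-channel OCR token correction.
# Assumed toy lexicon with unigram priors; a real system would use
# a large lexicon and context-sensitive language-model scores.
LEXICON = {"haus": 0.4, "maus": 0.3, "hans": 0.3}

def edit_distance(a, b):
    # Standard Levenshtein distance with uniform edit costs.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[m][n]

def correct_token(token, max_dist=2, error_rate=0.1):
    # Candidate selection: lexicon entries within max_dist edits.
    # Scoring: prior * error_rate**edits, a flat stand-in for a
    # trained edit-operation error model.
    best, best_score = token, 0.0
    for cand, prior in LEXICON.items():
        dist = edit_distance(token, cand)
        if dist <= max_dist:
            score = prior * (error_rate ** dist)
            if score > best_score:
                best, best_score = cand, score
    return best
```

For example, `correct_token("hqus")` returns `"haus"`, since the single-substitution candidate outscores the two-edit candidates `maus` and `hans` despite their comparable priors. The two approaches compared in the paper differ precisely in how this candidate set is generated and how quickly it can be searched.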