ABSTRACT
When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to recognize typographical attributes precisely and automatically in order to capture the logical structure. For that purpose, we present a method that enables fine-grained typography classification by training an open-source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information to the OCR-recognized text output. As a test case, we used a 19th-century German dictionary (Sander's Wörterbuch der Deutschen Sprache), in which typography carries a particularly complex semantic function. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can deal with frequent typography changes even within lines.
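The two figures quoted above, character error rate and word-level typography label accuracy, can be illustrated with a minimal sketch. It uses the standard definitions (edit distance divided by reference length, and the fraction of words with a matching label), which are not necessarily the exact evaluation protocol used in the paper. The function names and the label values (`bold`, `normal`, `italic`) are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b; insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between ground truth and OCR output, divided by ground-truth length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def word_label_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of words whose predicted typography label equals the gold label.

    Assumes both sequences are already aligned word by word.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Illustrative values only; these are not data from the paper.
cer = character_error_rate("Wörterbuch", "Worterbuch")
acc = word_label_accuracy(
    ["bold", "normal", "italic", "normal"],
    ["bold", "normal", "normal", "normal"],
)
```

Under these definitions, a character error rate below 0.4% corresponds to fewer than 4 character edits per 1,000 ground-truth characters.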
Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache