Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache

Authors:
Christian Reul (Centre for Philology and Digitality, University of Würzburg)
Sebastian Göttel (Berlin-Brandenburg Academy of Sciences and Humanities)
Uwe Springmann (Center for Information and Language Processing, LMU Munich)
Christoph Wick (Chair for Artificial Intelligence, University of Würzburg)
Kay-Michael Würzner (Berlin-Brandenburg Academy of Sciences and Humanities)
Frank Puppe (Chair for Artificial Intelligence, University of Würzburg)

DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, May 2019, Pages 33–38
https://doi.org/10.1145/3322905.3322910

Published: 08 May 2019

ABSTRACT

When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to recognize typographical attributes precisely and automatically in order to capture the logical structure. For that purpose, we present a method that enables fine-grained typography classification by training an open-source OCR engine on both traditional OCR and typography recognition, and we show how to map the resulting typography information onto the OCR text output. As a test case, we used a 19th-century German dictionary (Sander's Wörterbuch der Deutschen Sprache) in which typography carries a particularly complex semantic function. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can deal with frequent typography changes even within lines.
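
The abstract does not spell out how typography labels are mapped onto the recognized text, so the following is only an illustrative sketch, not the authors' procedure. It assumes a hypothetical setup in which the typography model emits one label symbol per character, already aligned with the OCR output of the same line (in practice the two outputs may differ in length and would first need to be aligned). Each word then receives the majority label of its characters. The function name `tag_words` and the label symbols are invented for this example.

```python
from collections import Counter
from typing import List, Tuple


def tag_words(text: str, labels: str) -> List[Tuple[str, str]]:
    """Assign one typography label to each whitespace-separated word.

    `text` is the OCR output for a line; `labels` is a hypothetical
    per-character label string of the same length (one symbol per
    character, e.g. 'f' = Fraktur, 'a' = Antiqua, 'b' = bold).
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must be aligned character by character")

    tagged = []
    start = None  # index where the current word began, or None between words
    for i, ch in enumerate(text + " "):  # sentinel space flushes the last word
        if not ch.isspace():
            if start is None:
                start = i
        elif start is not None:
            # Majority vote over the labels of the word's characters.
            votes = Counter(labels[start:i])
            tagged.append((text[start:i], votes.most_common(1)[0][0]))
            start = None
    return tagged


if __name__ == "__main__":
    for word, label in tag_words("Haus n. domus", "ffff aa bbbbb"):
        print(word, label)
```

Voting per word smooths over isolated character-level label errors while still allowing the label to change from one word to the next within a line, which is the situation the abstract highlights.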

Index Terms

Applied computing → Document management and text processing → Document capture → Optical character recognition

Published in

DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
May 2019
163 pages
ISBN:9781450371940
DOI:10.1145/3322905

Copyright © 2019 Owner/Author
This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
OCR
historical lexica
semantic tagging
typography recognition
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 60 of 86 submissions, 70%