DOI: 10.1145/3322905.3322910
research-article
Open Access

Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache

Published: 08 May 2019

ABSTRACT

When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to recognize typographical attributes precisely and automatically, in order to capture the logical structure. To that end, we present a method that enables fine-grained typography classification by training an open-source OCR engine on both traditional OCR and typography recognition, and we show how to map the resulting typography information onto the OCR text output. As a test case, we used a 19th-century German dictionary (Sander's Wörterbuch der Deutschen Sprache), in which typography serves a particularly complex semantic function. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can handle frequent typography changes, even within a single line.
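
The abstract does not spell out how typography labels are mapped onto the recognized text, so the following is only a minimal sketch of one plausible way to do it, not the authors' implementation. It assumes the typography output has already been aligned character by character with the OCR text, and that each word takes the majority label of its characters. The function name `tag_words` and the single-letter labels (`b` bold, `i` italic, `r` regular) are hypothetical.

```python
from collections import Counter


def tag_words(text: str, labels: str) -> list[tuple[str, str]]:
    """Assign each whitespace-separated word the majority label of its characters.

    Assumption: `labels` holds one typography symbol per character of `text`
    (the symbols under whitespace positions are ignored).
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must be aligned character by character")

    tagged = []
    start = None  # index where the current word began, or None between words
    for i, ch in enumerate(text + " "):  # sentinel space flushes the last word
        if ch.isspace():
            if start is not None:
                counts = Counter(labels[start:i])
                majority_label = counts.most_common(1)[0][0]
                tagged.append((text[start:i], majority_label))
                start = None
        elif start is None:
            start = i
    return tagged


# Hypothetical dictionary line: bold lemma, italic grammar note, regular gloss.
# The third label of "Haus" is deliberately wrong to show the majority vote.
line_text = "Haus n. house"
line_labels = "bbrb ii rrrrr"
result = tag_words(line_text, line_labels)
```

Voting per word rather than trusting each character's label lets a single misclassified character be outvoted by its neighbours, which fits the word-level accuracy the abstract reports. It would not by itself handle a typography change inside a word.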

      Published in
      DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
      May 2019
      163 pages
      ISBN: 9781450371940
      DOI: 10.1145/3322905

      Copyright © 2019 Owner/Author

      This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 60 of 86 submissions, 70%
