DOI: 10.1145/2494266.2494304
research-article

Early modern OCR project (eMOP) at Texas A&M University: using Aletheia to train Tesseract

Published: 10 September 2013

ABSTRACT

Great effort is being made to collect and preserve historic manuscripts from the early modern and eighteenth-century periods. Unfortunately, searching the Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) collections can be extremely difficult for researchers, because current Optical Character Recognition (OCR) engines struggle to read and recognize various historic fonts, especially in manuscripts of declining quality. To address this problem, the Early Modern OCR Project (eMOP) at the Initiative for the Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University seeks to train OCR engines to read historic documents more effectively, in order to make the entirety of these collections accessible to searching. The first step in this project involves using the Aletheia Desktop Tool, developed by the PRImA Research Lab at the University of Salford, to create training sets from documents in the EEBO and ECCO collections. These training sets aid OCR engines, such as Google's Tesseract, in recognizing the special characters, such as ligatures, italics, and blackletter, found within early modern fonts. In the year that the Aletheia team has been working to create these font training libraries, we have overcome several problems, including learning how to select, extract, and deliver the data that best suits Tesseract training requirements. This work with Aletheia is part of a larger scholarly project that endeavors not only to make the EEBO and ECCO collections more accessible to researchers for data mining, but also to make publicly available the methodologies, workflow, and digital tools developed during the eMOP project, to aid libraries, museums, and scholars in other fields in their efforts to preserve and study our combined cultural history.

Published in

DocEng '13: Proceedings of the 2013 ACM symposium on Document engineering
September 2013
582 pages
ISBN: 9781450317894
DOI: 10.1145/2494266

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

DocEng '13 paper acceptance rate: 16 of 50 submissions, 32%
Overall acceptance rate: 178 of 537 submissions, 33%
