ABSTRACT
Great effort is being made to collect and preserve historic manuscripts from the early modern and eighteenth-century periods. Unfortunately, searching the Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) collections can be extremely difficult for researchers, because current Optical Character Recognition (OCR) engines struggle to read and recognize various historic fonts, especially in manuscripts of declining quality. To address this problem, the Early Modern OCR Project (eMOP) at the Initiative for the Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University seeks to train OCR engines to read historic documents more effectively, in order to make the entirety of these collections searchable. The first step in this project uses the Aletheia Desktop Tool, developed by the PRImA Research Lab at the University of Salford, to create training sets from documents in the EEBO and ECCO collections. These training sets help OCR engines, such as Google's Tesseract, recognize the special features found within early modern fonts, such as ligatures, italics, and blackletter. In the year that the Aletheia team has been working to create these font training libraries, we have overcome several problems, including learning how to select, extract, and deliver the data that best suits Tesseract's training requirements. This work with Aletheia is part of a larger scholarly project that endeavors not only to make the EEBO and ECCO collections more accessible to researchers for data mining, but also to make available to the public the methodologies, workflow, and digital tools developed during the eMOP project, in order to aid libraries, museums, and scholars in other fields in their efforts to preserve and study our combined cultural history.
Index Terms
- Early modern OCR project (eMOP) at Texas A&M University: using Aletheia to train Tesseract