Early modern OCR project (eMOP) at Texas A&M University: using Aletheia to train Tesseract

Authors:
Katayoun Torabi, Texas A&M University, College Station, TX, USA
Jessica Durgan, Texas A&M University, College Station, TX, USA
Bryan Tarpley, Texas A&M University, College Station, TX, USA

DocEng '13: Proceedings of the 2013 ACM symposium on Document engineering, September 2013, Pages 23–26
https://doi.org/10.1145/2494266.2494304

Published: 10 September 2013

ABSTRACT

Great effort is being made to collect and preserve historic manuscripts from the early modern and eighteenth-century periods. Unfortunately, searching the Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) collections can be extremely difficult for researchers, because current Optical Character Recognition (OCR) engines struggle to recognize historic fonts, especially in manuscripts of declining quality. To address this problem, the Early Modern OCR Project (eMOP) at the Initiative for the Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University seeks to train OCR engines to read historic documents more effectively, so that the entirety of these collections becomes searchable. The first step in this project uses the Aletheia Desktop Tool, developed by the PRImA Research Lab at the University of Salford, to create training sets from documents in the EEBO and ECCO collections. These training sets help OCR engines, such as Google's Tesseract, recognize the special characters and typefaces found in early modern printing, such as ligatures, italics, and blackletter. In the year that the Aletheia team has been working to create these font training libraries, we have overcome several problems, including learning how to select, extract, and deliver the data that best suits Tesseract's training requirements. This work with Aletheia is part of a larger scholarly project that seeks not only to make the EEBO and ECCO collections more accessible to researchers for data mining, but also to make publicly available the methodologies, workflow, and digital tools developed during eMOP, to aid libraries, museums, and scholars in other fields in their efforts to preserve and study our combined cultural history.
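
The abstract does not describe how glyph data moves from Aletheia to Tesseract, so the following is only an illustrative sketch of one way such a hand-off could look, not the eMOP team's pipeline. It assumes that Aletheia's ground truth is saved as PAGE XML in which each Glyph element carries a Coords element with a `points` attribute and a TextEquiv/Unicode child, and it writes lines in the Tesseract 3.x box-file layout (character, left, bottom, right, top, page, with the origin at the bottom-left of the image). The function name `glyphs_to_box_lines` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def glyphs_to_box_lines(page_xml, image_height, page_index=0):
    """Convert PAGE XML glyph outlines into Tesseract box-file lines.

    Assumption: each Glyph has <Coords points="x1,y1 x2,y2 ..."/> and a
    <TextEquiv><Unicode> child. PAGE uses a top-left origin, while box
    files use a bottom-left origin, so y values are flipped.
    """
    root = ET.fromstring(page_xml)
    lines = []
    for glyph in root.iter():
        if _local(glyph.tag) != "Glyph":
            continue
        coords = next((c for c in glyph if _local(c.tag) == "Coords"), None)
        text = next(
            (u.text for u in glyph.iter() if _local(u.tag) == "Unicode"), None
        )
        if coords is None or not text:
            continue  # skip glyphs without an outline or a transcription
        points = [
            tuple(int(v) for v in pair.split(","))
            for pair in coords.get("points", "").split()
        ]
        if not points:
            continue
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        left, right = min(xs), max(xs)
        bottom = image_height - max(ys)  # flip to bottom-left origin
        top = image_height - min(ys)
        lines.append(f"{text} {left} {bottom} {right} {top} {page_index}")
    return lines


# Hypothetical one-glyph page containing a long s.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageWidth="100" imageHeight="200">
    <TextRegion id="r1"><TextLine id="l1"><Word id="w1">
      <Glyph id="g1">
        <Coords points="10,20 30,20 30,60 10,60"/>
        <TextEquiv><Unicode>\u017f</Unicode></TextEquiv>
      </Glyph>
    </Word></TextLine></TextRegion>
  </Page>
</PcGts>
"""

if __name__ == "__main__":
    for line in glyphs_to_box_lines(SAMPLE_PAGE_XML, image_height=200):
        print(line)
```

Matching on local element names rather than a fixed namespace is a deliberate choice here, since different Aletheia releases may write different PAGE schema versions; files that store coordinates as individual Point elements would need a different parser.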

Published in
DocEng '13: Proceedings of the 2013 ACM symposium on Document engineering
September 2013, 582 pages
ISBN: 9781450317894
DOI: 10.1145/2494266
Conference Chair: Simone Marinai, University of Florence, Italy
Program Chair: Kim Marriott, Monash University, Australia
Copyright © 2013 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags: c#, sql, xml, xslt

Acceptance Rates
DocEng '13 paper acceptance rate: 16 of 50 submissions (32%). Overall acceptance rate: 178 of 537 submissions (33%).