Length Normalization in Degraded Text Collections
April 1995
1995 Technical Report
Publisher:
  • Cornell University
  • PO Box 250, 124 Roberts Place Ithaca, NY
  • United States
Published: 04 April 1995
Abstract

Optical character recognition (OCR) is the most commonly used technique for converting printed material into electronic form. Using OCR, large repositories of machine-readable text can be created in a short time, and an information retrieval system can then be used to search them. Many information retrieval systems use sophisticated term weighting functions to improve search effectiveness. Term weighting schemes can be highly sensitive to errors that the OCR process introduces into the input text. This study examines the effects of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes. It also yields significant improvements in retrieval effectiveness over cosine normalization.
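
As a rough illustration of why cosine normalization may be sensitive to OCR errors, the sketch below divides each term weight by the Euclidean length of the document vector, which is the standard definition of cosine normalization. The term weights, the garbled token "systcm", and the function name cosine_normalize are all hypothetical and are not taken from the report. The sketch does not implement the new normalization method the report proposes, since the abstract does not describe it.

```python
import math


def cosine_normalize(weights):
    """Divide each term weight by the Euclidean norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


# Hypothetical term weights for a cleanly scanned document.
clean = {"retrieval": 3.0, "system": 4.0}

# The same document with one hypothetical OCR-garbled token. A garbled string
# is rare in the collection, so an idf-style weight for it is assumed to be
# large (12.0 here is an illustrative value, not a measured one).
degraded = {"retrieval": 3.0, "system": 4.0, "systcm": 12.0}

clean_norm = cosine_normalize(clean)
degraded_norm = cosine_normalize(degraded)

# The weight of a correctly recognized term shrinks once the garbled token
# inflates the vector length.
print(clean_norm["retrieval"], degraded_norm["retrieval"])
```

Under these assumed weights, the vector length grows from 5 to 13 when the garbled token is added, so the normalized weight of "retrieval" drops from 0.6 to about 0.23 even though the term itself was recognized correctly.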

Contributors
  • Cornell University
  • Cornell University
  • Cornell University
