Length Normalization in Degraded Text Collections

April 1995

1995 Technical Report

Publisher:

Cornell University
PO Box 250, 124 Roberts Place, Ithaca, NY
United States

Published: 04 April 1995

Abstract

Optical character recognition (OCR) is the most commonly used technique to convert printed material into electronic form. Using OCR, large repositories of machine-readable text can be created in a short time. An information retrieval system can then be used to search through the large information bases thus created. Many information retrieval systems use sophisticated term weighting functions to improve the effectiveness of a search. Term weighting schemes can be highly sensitive to the errors introduced into the input text by the OCR process. This study examines the effects of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes. It also yields significant improvements in retrieval effectiveness over cosine normalization.
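
The following sketch is an illustration added to this record, not code from the report. It shows one way cosine normalization can react to OCR noise, assuming plain term-frequency weights and a hypothetical degraded document in which the recognizer has added two spurious tokens. The function name `cosine_normalize` and both sample documents are made up for the example, and the report's proposed replacement scheme is not shown because the abstract does not describe it.

```python
import math
from collections import Counter


def cosine_normalize(tokens):
    """Divide each raw term frequency by the Euclidean norm of the tf vector."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(count * count for count in tf.values()))
    return {term: count / norm for term, count in tf.items()}


# Hypothetical clean document: four distinct terms, each occurring once.
clean = ["length", "normalization", "text", "retrieval"]

# Hypothetical OCR output for the same page: the four real terms plus two
# spurious tokens (assumed noise, e.g. speckles read as characters).
degraded = clean + ["~;", "l1"]

clean_weights = cosine_normalize(clean)
degraded_weights = cosine_normalize(degraded)

# Every spurious token enlarges the norm, so the weight of each correctly
# recognized term shrinks even though its own frequency is unchanged.
for term in clean:
    print(term, round(clean_weights[term], 3), round(degraded_weights[term], 3))
```

In this toy case the weight of each correct term drops from 1/2 to 1/sqrt(6), which is the kind of sensitivity to input errors the abstract refers to. Real systems typically combine term frequency with other factors, so the size of the effect there will differ.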

Contributors

Amit Singhal
Cornell University
- Publication Years: 1994 - 2017
- Publication counts: 34
- Citation count: 1,888
- Available for Download: 16
- Downloads (cumulative): 18,136
- Downloads (12 months): 896
- Downloads (6 weeks): 117
- Average Downloads per Article: 1,134
- Average Citation per Article: 56
Gerard M Salton
Cornell University
- Publication Years: 1959 - 2003
- Publication counts: 164
- Citation count: 14,357
- Available for Download: 78
- Downloads (cumulative): 85,994
- Downloads (12 months): 10,506
- Downloads (6 weeks): 1,639
- Average Downloads per Article: 1,102
- Average Citation per Article: 88
Chris Alan Buckley
Cornell University
- Publication Years: 1982 - 2017
- Publication counts: 76
- Citation count: 6,246
- Available for Download: 36
- Downloads (cumulative): 37,858
- Downloads (12 months): 1,861
- Downloads (6 weeks): 281
- Average Downloads per Article: 1,052
- Average Citation per Article: 82

Recommendations

Adapting pivoted document-length normalization for query size: Experiments in Chinese and English

The vector space model (VSM) is one of the most widely used information retrieval (IR) models in both academia and industry. It was less effective at the Chinese ad hoc retrieval tasks than other retrieval models in the NTCIR-3 evaluation workshop, but ...
Probabilistic methods for searching OCR-degraded Arabic text
Pivoted Document Length Normalization
SIGIR Test-of-Time Awardees 1978-2001

Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that ...

Cited By

Adapting pivoted document-length normalization for query size: Experiments in Chinese and English

Probabilistic methods for searching OCR-degraded Arabic text

Pivoted Document Length Normalization
