Optical character recognition (OCR) is the most commonly used technique for converting printed material into electronic form. With OCR, large repositories of machine-readable text can be created in a short time, and an information retrieval system can then be used to search them. Many information retrieval systems use sophisticated term weighting functions to improve the effectiveness of a search. These weighting schemes can be highly sensitive to the errors that the OCR process introduces into the input text. This study examines the effects of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes. It also yields significant improvements in retrieval effectiveness over cosine normalization.
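As a rough illustration of why cosine normalization can react badly to OCR noise, the sketch below normalizes tf-idf weights for a clean document and for a copy in which one token has been misrecognized. It is a minimal sketch, not the weighting scheme used in the study: the function names, the toy documents, and the idf values are all assumptions made up for this example. The key assumption is that a misrecognized string such as "retrieva1" is rare in the collection and therefore receives a high idf.

```python
import math
from collections import Counter


def tfidf_weights(tokens, idf):
    """Return raw tf * idf weights for each distinct term (hypothetical helper)."""
    counts = Counter(tokens)
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine_normalize(weights):
    """Divide every weight by the Euclidean norm of the document vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


# Illustrative idf values (assumed, not taken from any collection).
# The garbled token is given a high idf because it is assumed to be rare.
idf = {"document": 1.0, "length": 1.0, "retrieval": 2.0, "retrieva1": 6.0}

clean_tokens = ["document", "length", "retrieval", "retrieval"]
# Same document after a simulated OCR error in one occurrence of "retrieval".
noisy_tokens = ["document", "length", "retrieval", "retrieva1"]

clean = cosine_normalize(tfidf_weights(clean_tokens, idf))
noisy = cosine_normalize(tfidf_weights(noisy_tokens, idf))

# Compare the normalized weight of a term that was recognized correctly.
print("document (clean):", clean["document"])
print("document (noisy):", noisy["document"])
```

Under these assumptions, the single garbled token enlarges the norm of the document vector, so the normalized weights of the correctly recognized terms shrink even though those terms are unchanged. This is one plausible way in which OCR errors could interact with cosine normalization. The abstract does not describe the proposed replacement method, so it is not sketched here.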