Abstract
Stemming is a key step in most text mining and information retrieval applications. Information extraction, semantic annotation, and ontology learning are but a few examples where using a stemmer is a must. While the use of light stemmers on Arabic texts has proven highly effective for information retrieval, this class of stemmers falls short of the accuracy required by many text mining applications. This can be attributed to two facts: light stemmers apply their set of rules indiscriminately, and they do not address the stemming of broken plurals at all, even though this class of plurals is very common in Arabic texts. The goal of this work is to overcome these limitations. The evaluation shows that the work significantly improves stemming accuracy. It also shows that, by improving stemming accuracy, tasks such as automatic annotation and keyphrase extraction can be significantly improved as well.
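The two limitations named above can be made concrete with a minimal sketch of a generic light stemmer. The affix lists, the minimum stem length, and the function name `light_stem` are assumptions chosen for illustration; they are not the rule set of this work or of any particular published stemmer.

```python
# A minimal sketch of a generic Arabic light stemmer.
# The affix lists and MIN_STEM_LEN are illustrative assumptions only.

# Longer prefixes are listed first so they are tried before shorter ones.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 3  # assumed lower bound on what must remain after stripping


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, with no lexical check."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[: -len(suffix)]
            break
    return word


if __name__ == "__main__":
    # Regular (sound) plural with a definite article: conflated as intended.
    print(light_stem("المعلمون"), light_stem("معلم"))
    # Indiscriminate rule: the final two letters of this singular noun
    # ("garden") are part of the word, yet they are stripped as a suffix.
    print(light_stem("بستان"))
    # Broken plural: "books" and "book" differ inside the word, so
    # affix stripping leaves them as two distinct stems.
    print(light_stem("كتب"), light_stem("كتاب"))
```

In this sketch the sound plural is conflated with its singular, the singular noun is over-stemmed because the rule fires on any matching ending, and the broken plural is never mapped to its singular because the change is internal to the word rather than at its edges.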