research-article

An accuracy-enhanced light stemmer for Arabic text

Published: 24 February 2010

Abstract

Stemming is a key step in most text mining and information retrieval applications. Information extraction, semantic annotation, as well as ontology learning are but a few examples where using a stemmer is a must. While the use of light stemmers in Arabic texts has proven highly effective for the task of information retrieval, this class of stemmers falls short of providing the accuracy required by many text mining applications. This can be attributed to the fact that light stemmers employ a set of rules that they apply indiscriminately and that they do not address stemming of broken plurals at all, even though this class of plurals is very commonly used in Arabic texts. The goal of this work is to overcome these limitations. The evaluation of the work shows that it significantly improves stemming accuracy. It also shows that by improving stemming accuracy, tasks such as automatic annotation and keyphrase extraction can also be significantly improved.
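
To make the limitation described above concrete, here is a minimal Python sketch of a generic affix-stripping light stemmer. It is not the stemmer proposed in this article: the affix lists, the `light_stem` name, and the `min_len` threshold are illustrative assumptions chosen only to show what applying rules indiscriminately looks like in practice.

```python
# Illustrative affix lists (assumptions, not taken from the article).
# Longer prefixes come first so that "وال" is tried before "ال".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, with no lexical checks.

    An affix is removed whenever it matches and at least `min_len`
    characters remain. Nothing checks whether the matched letters are
    really an affix or part of the stem itself.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[: -len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ("الكتاب", "المعلمون", "امتحان", "كتب"):
        print(w, "->", light_stem(w))
```

Under these assumptions, "الكتاب" (the book) and "المعلمون" (the teachers) reduce to "كتاب" and "معلم" as intended. The word "امتحان" (exam) is truncated to "امتح" because its final "ان" is treated as a suffix even though it belongs to the word, and the broken plural "كتب" (books) passes through unchanged, so it is never conflated with its singular "كتاب".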

  • Published in

    ACM Transactions on Speech and Language Processing, Volume 7, Issue 2
    February 2011
    22 pages
    ISSN: 1550-4875
    EISSN: 1550-4883
    DOI: 10.1145/1921656

    Copyright © 2010 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Received: 1 March 2010
    • Revised: 1 December 2010
    • Accepted: 1 December 2010
    • Published: 24 February 2010

    Qualifiers

    • research-article
    • Research
    • Refereed
