An accuracy-enhanced light stemmer for Arabic text

Authors:
Samhaa R. El-Beltagy (Cairo University, Giza, Egypt)
Ahmed Rafea (The American University in Cairo)

ACM Transactions on Speech and Language Processing, Volume 7, Issue 2, Article No. 2, pp. 1–22
https://doi.org/10.1145/1921656.1921657

Published: 24 February 2010

Abstract

Stemming is a key step in most text mining and information retrieval applications. Information extraction, semantic annotation, as well as ontology learning are but a few examples where using a stemmer is a must. While the use of light stemmers in Arabic texts has proven highly effective for the task of information retrieval, this class of stemmers falls short of providing the accuracy required by many text mining applications. This can be attributed to the fact that light stemmers employ a set of rules that they apply indiscriminately and that they do not address stemming of broken plurals at all, even though this class of plurals is very commonly used in Arabic texts. The goal of this work is to overcome these limitations. The evaluation of the work shows that it significantly improves stemming accuracy. It also shows that by improving stemming accuracy, tasks such as automatic annotation and keyphrase extraction can also be significantly improved.
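
To make the abstract's two criticisms concrete, here is a minimal sketch of a conventional light stemmer. It is not the authors' method. The affix lists, the `min_len` guard, and the name `naive_light_stem` are illustrative assumptions, and the lists are far smaller than any real rule set. The sketch strips a prefix and a suffix whenever they match, without checking whether the result is a valid word, and it has no rule for broken plurals.

```python
# Illustrative affix lists (assumed for this sketch, not taken from the paper).
PREFIXES = ("ال",)                    # definite article "al-"
SUFFIXES = ("ات", "ون", "ين", "ة")    # a few common regular endings


def naive_light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, purely by string matching.

    The rules are applied indiscriminately: nothing checks whether the
    matched letters are really an affix or part of the word itself.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

In this sketch the regular plural المعلمون ("the teachers") reduces to معلم ("teacher") as intended. The word التزام ("commitment"), whose first two letters belong to the word itself, is wrongly cut to تزام. The broken plural كتب ("books") is left unchanged, so it is never conflated with its singular كتاب ("book").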

Published in

ACM Transactions on Speech and Language Processing Volume 7, Issue 2
February 2011
22 pages
ISSN:1550-4875
EISSN:1550-4883
DOI:10.1145/1921656

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Revised: 1 December 2010
- Accepted: 1 December 2010
- Received: 1 March 2010
- Published: 24 February 2010
Author Tags
Arabic
Stemming
broken plurals
heuristic rules
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 13
- Total Downloads: 788
- Downloads (Last 12 months): 4
- Downloads (Last 6 weeks): 0