ABSTRACT
Both Inflectional and derivational morphology lead to multiple surface forms of a word. Stemming reduces these forms back to its stem or root, and is a very useful tool for many applications. There has not been any work reported on Urdu stemming. The current work develops an Urdu stemmer or Assas-Band and improves the performance using more precise affix based exception lists, instead of the conventional lexical lookup employed for developing stemmers in other languages. Testing shows an accuracy of 91.2%. Further enhancements are also suggested.
- Croft, W. B. and Xu, J. 1995. Corpus-Specific Stemming using Word Form Co-occurrences. In Fourth Annual Symposium on Document Analysis and Information Retrieval.Google Scholar
- Krovetz, R. 1993. View Morphology as an Inference Process. In the Proceedings of 5th International Conference on Research and Development in Information Retrieval. Google ScholarDigital Library
- Porter, M. 1980. An Algorithm for Suffix Stripping. Program, 14(3): 130--137.Google ScholarDigital Library
- Thabet, N. 2004. Stemming the Qur'an. In the Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. Google ScholarDigital Library
- Hussain, Sara. 2004. Finite-State Morphological Analyzer for Urdu. Unpublished MS thesis, Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Pakistan.Google Scholar
- Sajjad, H. 2007. Statistical Part-of-Speech for Urdu. Unpublished MS Thesis, Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Pakistan.Google Scholar
- Ijaz, M and Hussain, S. 2007. Corpus Based Urdu Lexicon Development. In the Proceedings of Conference on Language Technology (CLT07), Pakistan.Google Scholar
- Naseem, T., Hussain, S. 2007. Spelling Error Trends in Urdu. In the Proceedings of Conference on Language Technology (CLT07), Pakistan.Google Scholar
- Kumar, M. S. and Murthy, K. N. 2007. Corpus Based Statistical Approach for Stemming Telugu. Creation of Lexical Resources for Indian Language Computing and Processing (LRIL), C-DAC, Mumbai, India.Google Scholar
- Paik, J. H. and Parui, S. K. 2008. A Simple Stemmer for Inflectional Languages. Forum for Information Retrieval Evaluation.Google Scholar
- Islam, M. Z., Uddin, M. N. and Khan, M. 2007. A Light Weight Stemmer for Bengali and Its Use in Spelling Checker. In the Proceedings of 1st Intl. Conf. on Digital Comm. and Computer, Amman, Jordan.Google Scholar
- Sharifloo, A. A. and Shamsfard, M. 2008. A Bottom up Approach to Persian Stemming. In the Proceedings of the Third International Joint Conference on Natural Language Processing. Hyderabad, India.Google Scholar
- Kumar, A. and Siddiqui, T. 2008. An Unsupervised Hindi Stemmer with Heuristics Improvements. In Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data. Google ScholarDigital Library
Index Terms
- Assas-Band, an affix-exception-list based Urdu stemmer
Recommendations
A Fast Corpus-Based Stemmer
Stemming is a mechanism of word form normalization that transforms the variant word forms to their common root. In an Information Retrieval system, it is used to increase the system’s performance, specifically the recall and desirably the precision. ...
The Rule-Based Sundanese Stemmer
Our research proposed an iterative Sundanese stemmer by removing the derivational affixes prior to the inflexional. This scheme was chosen because, in the Sundanese affixation, a confix (one of derivational affix) is applied in the last phase of a ...
Hindi Stemmer @ FIRE-2013
FIRE '12 & '13: Proceedings of the 4th and 5th Annual Meetings of the Forum for Information Retrieval EvaluationThis paper describes a language independent approach for extracting Hindi morpheme from a given list of Hindi words of Morpheme Extraction Task (MET) at FIRE 2013. In this approach list of Hindi word is submitted to the system and it generates stemmed ...
Comments