Abstract
Part-of-speech (POS) tagging is a well-established technology for most Western European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments that use an Igbo corpus as a test bed to identify the POS taggers and machine learning (ML) methods that can achieve good performance with the small dataset available for the language. The experiments use several well-known POS taggers developed for English or other European languages, with different training data styles and sizes. Igbo has a number of language-specific characteristics that present a challenge for effective POS tagging. One interesting case is the wide use of verbs (and nominalizations thereof) that have an inherent noun complement; these form “linked pairs” in the POS tagging scheme but may appear discontinuously. Another issue is Igbo’s highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically inflected OOV words.
Index Terms
- Toward an Effective Igbo Part-of-Speech Tagger