Toward an Effective Igbo Part-of-Speech Tagger

Published: 21 May 2019
Abstract

Part-of-speech (POS) tagging is a well-established technology for most Western European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments that use an Igbo corpus as a test bed for identifying the POS taggers and Machine Learning (ML) methods that can achieve good performance with the small dataset available for the language. The experiments use several well-known POS taggers developed for English or other European languages, with varying training data styles and sizes. Igbo has a number of language-specific characteristics that present a challenge for effective POS tagging. One interesting case is the wide use of verbs (and nominalizations thereof) that have an inherent noun complement; these form “linked pairs” in the POS tagging scheme but may appear discontinuously. Another issue is Igbo’s highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically inflected OOV words.


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 4
December 2019
305 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3327969

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

  • Received: 1 May 2018
  • Revised: 1 October 2018
  • Accepted: 1 February 2019
  • Published: 21 May 2019
