DOI: 10.5555/1564508.1564522

Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography

Published: 31 March 2009

ABSTRACT

Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, which mainly affects the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Setswana tokenisation therefore cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. The challenge of the disjunctive orthography is thus met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, ultimately, proper semantic analysis, which remains the final aim of linguistic analysis and therefore also of computational linguistic analysis.
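
The whitespace problem described in the abstract can be illustrated with a small sketch. This is not the paper's method: the paper combines tokeniser transducers with a finite-state morphological analyser, whereas the code below is a plain-Python stand-in that only shows why whitespace splitting alone breaks up a disjunctively written verb. The particle and stem sets are hypothetical toy data chosen for the single phrase "Ke a go rata" (commonly glossed as "I love you"), and the function names are likewise assumptions made for this illustration.

```python
# Toy data: hypothetical, hand-picked for this one illustration.
# A real system would need full morphological knowledge of the language.
PREFIX_PARTICLES = {"ke", "a", "go"}   # written as separate orthographic words
VERB_STEMS = {"rata"}                  # verb stem with its conjunctive suffix


def whitespace_tokens(text):
    """Naive tokenisation: lowercase and split on whitespace only."""
    return text.lower().split()


def group_verbs(tokens):
    """Merge a run of prefix particles with the verb stem that follows it.

    Particles that are not followed by a known verb stem are left as
    separate tokens, since this sketch cannot decide what they belong to.
    """
    result = []
    pending = []  # prefix particles seen since the last non-particle token
    for tok in tokens:
        if tok in PREFIX_PARTICLES:
            pending.append(tok)
        elif tok in VERB_STEMS and pending:
            result.append(" ".join(pending + [tok]))
            pending = []
        else:
            result.extend(pending)
            pending = []
            result.append(tok)
    result.extend(pending)
    return result


sentence = "Ke a go rata"
print(whitespace_tokens(sentence))               # four orthographic words
print(group_verbs(whitespace_tokens(sentence)))  # one verbal token
```

Under these assumptions, whitespace splitting yields four tokens while the grouping step yields a single verbal token. A fixed lookup list like this cannot resolve particles that serve more than one function, which is the kind of decision the abstract assigns to the morphological analyser.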

Published in

AfLaT '09: Proceedings of the First Workshop on Language Technologies for African Languages
March 2009
131 pages
ISBN: 1932432256
Editors: Guy De Pauw, Gilles-Maurice de Schryver, Lori Levin

Publisher: Association for Computational Linguistics, United States

Qualifiers: research-article

Acceptance Rates: AfLaT '09 paper acceptance rate 9 of 24 submissions, 38%; overall acceptance rate 9 of 24 submissions, 38%