ABSTRACT
Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography that mainly affects the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Setswana tokenisation therefore cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. The challenge of the disjunctive orthography is thus met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, ultimately, proper semantic analysis, which remains the aim of linguistic analysis and therefore also of computational linguistic analysis.
Index Terms
- Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography