ABSTRACT
As information technologies advance, large morphologically annotated corpora have become a prerequisite for higher levels of language computerisation (e.g. automatic syntactic and semantic analysis, information extraction, machine translation). This article presents research on morphological disambiguation and the morphological annotation of the 100-million-word Lithuanian corpus. Statistical methods made it possible to develop an automatic morphological annotation tool for Lithuanian with a disambiguation precision of 94%. Statistical data on the distribution of parts of speech and on the most frequent wordforms and lemmas in the annotated Corpus of the Contemporary Lithuanian Language are also presented.