ABSTRACT
We present the annotation architecture of the National Corpus of Polish and discuss problems identified in the TEI stand-off annotation system, which, in its current version, is still very much unfinished and untested, due to both technical reasons (lack of tools implementing the TEI-defined XPointer schemes) and certain problems concerning data representation. We concentrate on two features that a stand-off system should possess and that are conspicuously missing in the current TEI Guidelines.
Index Terms
- Stand-off TEI annotation: the case of the National Corpus of Polish