ABSTRACT
In this paper we present a thorough evaluation of CETEMPúblico, a 180-million-word Portuguese newspaper corpus that is freely available for R&D in Portuguese processing. We provide information that should be useful to those using the resource and that should lead to considerable improvements in later versions. In addition, we think that the procedures presented can be of interest to the larger NLP community, since corpus evaluation and description are unfortunately not common exercises.