DOI: 10.3115/1073012.1073070
Evaluating CETEMPúblico, a free resource for Portuguese

Published: 06 July 2001

ABSTRACT

In this paper we present a thorough evaluation of a corpus resource for Portuguese, CETEMPúblico, a 180-million-word newspaper corpus free for R&D in Portuguese processing. We provide information that should be useful to those using the resource and that should lead to considerable improvement in later versions. In addition, we think that the procedures presented can be of interest to the larger NLP community, since corpus evaluation and description is unfortunately not a common exercise.


  • Published in

    ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
    July 2001
    562 pages

    Publisher

    Association for Computational Linguistics

    United States

    Acceptance Rates

    Overall Acceptance Rate: 85 of 443 submissions, 19%
