ABSTRACT
Traditional isolated monolingual name taggers tend to yield inconsistent results across two languages. In this paper, we propose two novel approaches to jointly and consistently extract names from parallel corpora. The first approach uses standard linear-chain Conditional Random Fields (CRFs) as the learning framework, incorporating cross-lingual features propagated between the two languages. The second approach uses a joint CRF model to decode sentence pairs jointly, incorporating bilingual factors based on word alignment. Experiments on Chinese-English parallel corpora demonstrated that the proposed methods significantly outperformed monolingual name taggers, were robust to automatic alignment noise, and achieved state-of-the-art performance. With only 20% of the training data, the proposed methods already outperformed the baseline learned from the whole training set.
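As a rough illustration of the first approach, the sketch below shows one way name tags could be propagated across a word alignment and turned into extra features for a linear-chain CRF on the other language. The abstract does not specify the actual feature set, so the function name propagate_tag_features, the feature string format "xling_tag=...", and the example sentence pair are all assumptions made for illustration only.

```python
from collections import defaultdict


def propagate_tag_features(src_tags, alignment, tgt_len):
    """Build cross-lingual features for each target token.

    src_tags  : name tags for the source sentence (e.g. from a monolingual tagger)
    alignment : iterable of (src_index, tgt_index) word-alignment links
    tgt_len   : number of tokens in the target sentence

    Hypothetical helper: the feature template is an assumption,
    not the feature set used in the paper.
    """
    aligned = defaultdict(list)
    for s, t in alignment:
        # Skip links that fall outside either sentence (noisy alignments).
        if 0 <= s < len(src_tags) and 0 <= t < tgt_len:
            aligned[t].append(src_tags[s])

    features = []
    for t in range(tgt_len):
        tags = sorted(set(aligned.get(t, [])))
        if tags:
            features.append(["xling_tag=" + tag for tag in tags])
        else:
            # Unaligned target tokens get an explicit placeholder feature.
            features.append(["xling_tag=NONE"])
    return features


# Made-up example: 4 English tags, 4 Chinese tokens, 4 alignment links.
en_tags = ["B-PER", "I-PER", "O", "B-GPE"]
links = [(0, 0), (1, 0), (2, 1), (3, 2)]
zh_feats = propagate_tag_features(en_tags, links, tgt_len=4)
```

Each inner list in zh_feats could be appended to the monolingual features of the corresponding target token before CRF training. The second approach, joint decoding with bilingual factors, would instead couple the two label sequences in a single model and is not covered by this sketch.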
Index Terms
- Joint bilingual name tagging for parallel corpora