short-paper

Joint bilingual name tagging for parallel corpora

Authors:

Qi Li
City University of New York, New York City, NY, USA

Haibo Li
City University of New York, New York City, NY, USA

Heng Ji
City University of New York, New York City, NY, USA

Wen Wang
SRI International, Menlo Park, CA, USA

Jing Zheng
SRI International, Menlo Park, CA, USA

Fei Huang
IBM T.J. Watson Research Center, Yorktown Heights, USA
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management, October 2012, Pages 1727–1731
https://doi.org/10.1145/2396761.2398506

Published: 29 October 2012

ABSTRACT

Traditional isolated monolingual name taggers tend to yield inconsistent results across two languages. In this paper, we propose two novel approaches to jointly and consistently extract names from parallel corpora. The first approach uses standard linear-chain Conditional Random Fields (CRFs) as the learning framework, incorporating cross-lingual features propagated between the two languages. The second approach uses a joint CRF model to decode sentence pairs jointly, incorporating bilingual factors based on word alignment. Experiments on Chinese-English parallel corpora demonstrated that the proposed methods significantly outperformed monolingual name taggers, were robust to automatic alignment noise, and achieved state-of-the-art performance. With only 20% of the training data, the proposed methods already outperform the baseline trained on the whole training set.
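
The first approach in the abstract can be illustrated with a minimal sketch of how cross-lingual features might be propagated through word alignment before linear-chain CRF training. This is not the authors' feature set: the BIO tag names, the feature names (`xling_tag=...`, `UNALIGNED`), the helper functions, and the toy sentence pair are all assumptions made for illustration. The resulting per-token feature dictionaries are the kind of input a generic CRF toolkit could consume.

```python
from collections import defaultdict


def project_tags(n_tokens, other_tags, alignment):
    """Collect, for each token in this language, the first-pass tags of the
    tokens it is aligned to in the other language.

    alignment is a list of (this_index, other_index) pairs (hypothetical format).
    """
    projected = defaultdict(set)
    for this_i, other_j in alignment:
        projected[this_i].add(other_tags[other_j])
    return [sorted(projected[i]) for i in range(n_tokens)]


def token_features(tokens, i, projected):
    """Monolingual features plus the propagated cross-lingual tag features."""
    feats = {
        "word=" + tokens[i].lower(): 1.0,
        "is_cap=" + str(tokens[i][:1].isupper()): 1.0,
    }
    if i > 0:
        feats["prev_word=" + tokens[i - 1].lower()] = 1.0
    if projected[i]:
        for tag in projected[i]:
            feats["xling_tag=" + tag] = 1.0
    else:
        # Assumed convention: mark tokens that have no alignment link.
        feats["xling_tag=UNALIGNED"] = 1.0
    return feats


def sentence_features(tokens, other_tags, alignment):
    """Feature dictionaries for one sentence, ready for a linear-chain CRF."""
    projected = project_tags(len(tokens), other_tags, alignment)
    return [token_features(tokens, i, projected) for i in range(len(tokens))]


# Toy sentence pair (illustrative data, not from the paper's corpus).
en = ["Obama", "visited", "Beijing"]
zh = ["奥巴马", "访问", "了", "北京"]
zh_tags = ["B-PER", "O", "O", "B-GPE"]      # first-pass Chinese tagger output
alignment = [(0, 0), (1, 1), (2, 3)]        # (English index, Chinese index)

en_feats = sentence_features(en, zh_tags, alignment)
```

In this sketch the English tagger sees the Chinese tagger's decision on each aligned token as an extra feature; running the same procedure in the opposite direction would give the Chinese tagger the corresponding English evidence. Because the propagated tag is only a feature, not a hard constraint, a trained model could in principle learn to discount it when the alignment is noisy.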


Index Terms

Joint bilingual name tagging for parallel corpora
1. Information systems
  1. Information systems applications


Published in
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN: 9781450311564
DOI: 10.1145/2396761
General Chair: Xuewen Chen (Wayne State University, USA)
Program Chairs: Guy Lebanon (Georgia Institute of Technology), Haixun Wang (Microsoft Research Asia), Mohammed J. Zaki (Rensselaer Polytechnic Institute)
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bilingual
joint crfs
name tagging
Qualifiers
- short-paper
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
