article

Collective entity resolution in relational data

Authors:
Indrajit Bhattacharya

University of Maryland, College Park, MD

University of Maryland, College Park, MD
View Profile

,
Lise Getoor

University of Maryland, College Park, MD

University of Maryland, College Park, MD
View Profile

Authors Info & Claims

ACM Transactions on Knowledge Discovery from Data Volume 1 Issue 1pp 5–eshttps://doi.org/10.1145/1217299.1217304

Published:01 March 2007Publication History

ACM Transactions on Knowledge Discovery from Data

Abstract

Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.

References

Adamic, L. and Adar, E. 2003. Friends and neighbors on the Web. Social Networ. 25, 3 (July), 211--230.Google ScholarCross Ref
Ananthakrishna, R., Chaudhuri, S., and Ganti, V. 2002. Eliminating fuzzy duplicates in data warehouses. In The International Conference on Very Large Databases (VLDB). Hong Kong, China. Google ScholarDigital Library
Benjelloun, O., Garcia-Molina, H., Su, Q., and Widom, J. 2005. Swoosh: A generic approach to entity resolution. Tech. rep., Stanford University. (March)Google Scholar
Bhattacharya, I. and Getoor, L. 2004. Iterative record linkage for cleaning and integration. In The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD). Paris, France. Google ScholarDigital Library
Bhattacharya, I. and Getoor, L. 2006a. Mining graph data. In Entity Resolution in Graphs. L. Holder and D. Cook, Eds. John Wiley.Google Scholar
Bhattacharya, I. and Getoor, L. 2006b. A latent dirichlet model for unsupervised entity resolution. In The SIAM Conference on Data Mining (SIAM-SDM). Bethesda, MD.Google Scholar
Bhattacharya, I. and Getoor, L. 2006c. Query-time entity resolution. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Philadelphia, PA. Google ScholarDigital Library
Bilenko, M. and Mooney, R. 2003. Adaptive duplicate detection using learnable string similarity measures. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. Google ScholarDigital Library
Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. 2003. Adaptive name matching in information integration. IEEE Intellig. Syst. 18, 5, 16--23. Google ScholarDigital Library
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In The ACM International Conference on Management of Data (SIGMOD). San Diego, CA. Google ScholarDigital Library
Cohen, W. 2000. Data integration using similarity joins and a word-based information representation language. ACM Trans. Inform. Syst. 18, 288--321. Google ScholarDigital Library
Cohen, W., Ravikumar, P., and Fienberg, S. 2003. A comparison of string distance metrics for name-matching tasks. In The IJCAI Workshop on Information Integration on the Web (IIWeb). Acapulco, Mexico.Google Scholar
Cohen, W. and Richman, J. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Edmonton, Canada. Google ScholarDigital Library
Dong, X., Halevy, A., and Madhavan, J. 2005. Reference reconciliation in complex information spaces. In The ACM International Conference on Management of Data (SIGMOD). Baltimore, MD. Google ScholarDigital Library
Fellegi, I. and Sunter, A. 1969. A theory for record linkage. J. Amer. Statis. Assoc. 64, 1183--1210.Google ScholarCross Ref
Giles, C. L., Bollacker, K., and Lawrence, S. 1998. CiteSeer: An automatic citation indexing system. In The ACM Conference on Digital Libraries. Pittsburgh, PA. Google ScholarDigital Library
Gravano, L., Ipeirotis, P., Koudas, N., and Srivastava, D. 2003. Text joins for data cleansing and integration in an RDBMS. In The IEEE International Conference on Data Engineering (ICDE). Bangalore, India.Google Scholar
Hernández, M. and Stolfo, S. 1995. The merge/purge problem for large databases. In The ACM International Conference on Management of Data (SIGMOD). San Jose, CA. Google ScholarDigital Library
Kalashnikov, D., Mehrotra, S., and Chen, Z. 2005. Exploiting relationships for domain-independent data cleaning. In The SIAM International Conference on Data Mining (SIAM SDM). Newport Beach, CA.Google Scholar
Li, X., Morie, P., and Roth, D. 2005. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special Issue on Semantic Integration 26, 1, 45--58. Google ScholarDigital Library
Liben-Nowell, D. and Kleinberg, J. 2003. The link prediction problem for social networks. In The International Conference on Information and Knowledge Management (CIKM). New Orleans, LA. Google ScholarDigital Library
McCallum, A., Nigam, K., and Ungar, L. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In The International Conference On Knowledge Discovery and Data Mining (SIGKDD). Boston, MA. Google ScholarDigital Library
McCallum, A. and Wellner, B. 2004. Conditional models of identity uncertainty with application to noun coreference. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.Google Scholar
Monge, A. and Elkan, C. 1996. The field matching problem: Algorithms and applications. In The International Conference on Knowledge Discovery and Data Mining (SIGKDD). Portland, ME.Google Scholar
Monge, A. and Elkan, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD). Tuscon, AZ.Google Scholar
Navarro, G. 2001. A guided tour to approximate string matching. ACM Comp. Sur. 33, 1, 31--88. Google ScholarDigital Library
Newcombe, H., Kennedy, J., Axford, S., and James, A. 1959. Automatic linkage of vital records. Science 130, 954--959.Google ScholarCross Ref
Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. 2003. Identity uncertainty and citation matching. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.Google Scholar
Ravikumar, P. and Cohen, W. 2004. A hierarchical graphical model for record linkage. In The Conference on Uncertainty in Artificial Intelligence (UAI). Banff, Canada. Google ScholarDigital Library
Ristad, E. and Yianilos, P. 1998. Learning string edit distance. IEEE Trans. Patt. Anal. Mach. Intell. 20, 5, 522--532. Google ScholarDigital Library
Sarawagi, S. and Bhamidipaty, A. 2002. Interactive deduplication using active learning. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Edmonton, Canada. Google ScholarDigital Library
Singla, P. and Domingos, P. 2004. Multi-relational record linkage. In The ACM SIGKDD Workshop on Multi-Relational Data Mining (MRDM). Seattle, WA.Google Scholar
Tejada, S., Knoblock, C., and Minton, S. 2001. Learning object identification rules for information integration. Inform. Syst. J. 26, 8, 635--656. Google ScholarDigital Library
Winkler, W. 1999. The state of record linkage and current research problems. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.Google Scholar
Winkler, W. 2002. Methods for record linkage and Bayesian networks. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.Google Scholar

Index Terms

Collective entity resolution in relational data
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Unsupervised Graph-Based Entity Resolution for Complex Entities
Entity resolution (ER) is the process of linking records that refer to the same entity. Traditionally, this process compares attribute values of records to calculate similarities and then classifies pairs of records as referring to the same entity or not ...
Read More
Pay-As-You-Go Entity Resolution

Entity resolution (ER) is the problem of identifying which records in a database refer to the same entity. In practice, many applications need to resolve large data sets efficiently, but do not require the ER result to be exact. For example, people data ...
Read More
A Graduate-Level Course on Entity Resolution and Information Quality: A Step toward ER Education
Special Issue on Entity Resolution

This article discusses the topics, approaches, and lessons learned in teaching a graduate-level course covering entity resolution (ER) and its relationship to information quality (IQ). The course surveys a broad spectrum of ER topics and activities ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Knowledge Discovery from Data Volume 1, Issue 1
March 2007
161 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1217299
Issue’s Table of Contents

Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2007
Published in tkdd Volume 1, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Entity resolution
data cleaning
graph clustering
record linkage
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 410
  Total Citations
  View Citations
- 4,898
  Total Downloads
- Downloads (Last 12 months)148
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Unsupervised Graph-Based Entity Resolution for Complex Entities

Pay-As-You-Go Entity Resolution

A Graduate-Level Course on Entity Resolution and Information Quality: A Step toward ER Education

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Unsupervised Graph-Based Entity Resolution for Complex Entities

Pay-As-You-Go Entity Resolution

A Graduate-Level Course on Entity Resolution and Information Quality: A Step toward ER Education

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media