Abstract
Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
- Adamic, L. and Adar, E. 2003. Friends and neighbors on the Web. Social Networ. 25, 3 (July), 211--230.Google ScholarCross Ref
- Ananthakrishna, R., Chaudhuri, S., and Ganti, V. 2002. Eliminating fuzzy duplicates in data warehouses. In The International Conference on Very Large Databases (VLDB). Hong Kong, China. Google ScholarDigital Library
- Benjelloun, O., Garcia-Molina, H., Su, Q., and Widom, J. 2005. Swoosh: A generic approach to entity resolution. Tech. rep., Stanford University. (March)Google Scholar
- Bhattacharya, I. and Getoor, L. 2004. Iterative record linkage for cleaning and integration. In The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD). Paris, France. Google ScholarDigital Library
- Bhattacharya, I. and Getoor, L. 2006a. Mining graph data. In Entity Resolution in Graphs. L. Holder and D. Cook, Eds. John Wiley.Google Scholar
- Bhattacharya, I. and Getoor, L. 2006b. A latent dirichlet model for unsupervised entity resolution. In The SIAM Conference on Data Mining (SIAM-SDM). Bethesda, MD.Google Scholar
- Bhattacharya, I. and Getoor, L. 2006c. Query-time entity resolution. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Philadelphia, PA. Google ScholarDigital Library
- Bilenko, M. and Mooney, R. 2003. Adaptive duplicate detection using learnable string similarity measures. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. Google ScholarDigital Library
- Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. 2003. Adaptive name matching in information integration. IEEE Intellig. Syst. 18, 5, 16--23. Google ScholarDigital Library
- Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In The ACM International Conference on Management of Data (SIGMOD). San Diego, CA. Google ScholarDigital Library
- Cohen, W. 2000. Data integration using similarity joins and a word-based information representation language. ACM Trans. Inform. Syst. 18, 288--321. Google ScholarDigital Library
- Cohen, W., Ravikumar, P., and Fienberg, S. 2003. A comparison of string distance metrics for name-matching tasks. In The IJCAI Workshop on Information Integration on the Web (IIWeb). Acapulco, Mexico.Google Scholar
- Cohen, W. and Richman, J. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Edmonton, Canada. Google ScholarDigital Library
- Dong, X., Halevy, A., and Madhavan, J. 2005. Reference reconciliation in complex information spaces. In The ACM International Conference on Management of Data (SIGMOD). Baltimore, MD. Google ScholarDigital Library
- Fellegi, I. and Sunter, A. 1969. A theory for record linkage. J. Amer. Statis. Assoc. 64, 1183--1210.Google ScholarCross Ref
- Giles, C. L., Bollacker, K., and Lawrence, S. 1998. CiteSeer: An automatic citation indexing system. In The ACM Conference on Digital Libraries. Pittsburgh, PA. Google ScholarDigital Library
- Gravano, L., Ipeirotis, P., Koudas, N., and Srivastava, D. 2003. Text joins for data cleansing and integration in an RDBMS. In The IEEE International Conference on Data Engineering (ICDE). Bangalore, India.Google Scholar
- Hernández, M. and Stolfo, S. 1995. The merge/purge problem for large databases. In The ACM International Conference on Management of Data (SIGMOD). San Jose, CA. Google ScholarDigital Library
- Kalashnikov, D., Mehrotra, S., and Chen, Z. 2005. Exploiting relationships for domain-independent data cleaning. In The SIAM International Conference on Data Mining (SIAM SDM). Newport Beach, CA.Google Scholar
- Li, X., Morie, P., and Roth, D. 2005. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special Issue on Semantic Integration 26, 1, 45--58. Google ScholarDigital Library
- Liben-Nowell, D. and Kleinberg, J. 2003. The link prediction problem for social networks. In The International Conference on Information and Knowledge Management (CIKM). New Orleans, LA. Google ScholarDigital Library
- McCallum, A., Nigam, K., and Ungar, L. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In The International Conference On Knowledge Discovery and Data Mining (SIGKDD). Boston, MA. Google ScholarDigital Library
- McCallum, A. and Wellner, B. 2004. Conditional models of identity uncertainty with application to noun coreference. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.Google Scholar
- Monge, A. and Elkan, C. 1996. The field matching problem: Algorithms and applications. In The International Conference on Knowledge Discovery and Data Mining (SIGKDD). Portland, ME.Google Scholar
- Monge, A. and Elkan, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD). Tuscon, AZ.Google Scholar
- Navarro, G. 2001. A guided tour to approximate string matching. ACM Comp. Sur. 33, 1, 31--88. Google ScholarDigital Library
- Newcombe, H., Kennedy, J., Axford, S., and James, A. 1959. Automatic linkage of vital records. Science 130, 954--959.Google ScholarCross Ref
- Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. 2003. Identity uncertainty and citation matching. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.Google Scholar
- Ravikumar, P. and Cohen, W. 2004. A hierarchical graphical model for record linkage. In The Conference on Uncertainty in Artificial Intelligence (UAI). Banff, Canada. Google ScholarDigital Library
- Ristad, E. and Yianilos, P. 1998. Learning string edit distance. IEEE Trans. Patt. Anal. Mach. Intell. 20, 5, 522--532. Google ScholarDigital Library
- Sarawagi, S. and Bhamidipaty, A. 2002. Interactive deduplication using active learning. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Edmonton, Canada. Google ScholarDigital Library
- Singla, P. and Domingos, P. 2004. Multi-relational record linkage. In The ACM SIGKDD Workshop on Multi-Relational Data Mining (MRDM). Seattle, WA.Google Scholar
- Tejada, S., Knoblock, C., and Minton, S. 2001. Learning object identification rules for information integration. Inform. Syst. J. 26, 8, 635--656. Google ScholarDigital Library
- Winkler, W. 1999. The state of record linkage and current research problems. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.Google Scholar
- Winkler, W. 2002. Methods for record linkage and Bayesian networks. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.Google Scholar
Index Terms
- Collective entity resolution in relational data
Recommendations
Unsupervised Graph-Based Entity Resolution for Complex Entities
Entity resolution (ER) is the process of linking records that refer to the same entity. Traditionally, this process compares attribute values of records to calculate similarities and then classifies pairs of records as referring to the same entity or not ...
Pay-As-You-Go Entity Resolution
Entity resolution (ER) is the problem of identifying which records in a database refer to the same entity. In practice, many applications need to resolve large data sets efficiently, but do not require the ER result to be exact. For example, people data ...
A Graduate-Level Course on Entity Resolution and Information Quality: A Step toward ER Education
Special Issue on Entity ResolutionThis article discusses the topics, approaches, and lessons learned in teaching a graduate-level course covering entity resolution (ER) and its relationship to information quality (IQ). The course surveys a broad spectrum of ER topics and activities ...
Comments