skip to main content
article

Collective entity resolution in relational data

Published:01 March 2007Publication History
Skip Abstract Section

Abstract

Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.

References

  1. Adamic, L. and Adar, E. 2003. Friends and neighbors on the Web. Social Networ. 25, 3 (July), 211--230.Google ScholarGoogle ScholarCross RefCross Ref
  2. Ananthakrishna, R., Chaudhuri, S., and Ganti, V. 2002. Eliminating fuzzy duplicates in data warehouses. In The International Conference on Very Large Databases (VLDB). Hong Kong, China. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Benjelloun, O., Garcia-Molina, H., Su, Q., and Widom, J. 2005. Swoosh: A generic approach to entity resolution. Tech. rep., Stanford University. (March)Google ScholarGoogle Scholar
  4. Bhattacharya, I. and Getoor, L. 2004. Iterative record linkage for cleaning and integration. In The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD). Paris, France. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bhattacharya, I. and Getoor, L. 2006a. Mining graph data. In Entity Resolution in Graphs. L. Holder and D. Cook, Eds. John Wiley.Google ScholarGoogle Scholar
  6. Bhattacharya, I. and Getoor, L. 2006b. A latent dirichlet model for unsupervised entity resolution. In The SIAM Conference on Data Mining (SIAM-SDM). Bethesda, MD.Google ScholarGoogle Scholar
  7. Bhattacharya, I. and Getoor, L. 2006c. Query-time entity resolution. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Philadelphia, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bilenko, M. and Mooney, R. 2003. Adaptive duplicate detection using learnable string similarity measures. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. 2003. Adaptive name matching in information integration. IEEE Intellig. Syst. 18, 5, 16--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In The ACM International Conference on Management of Data (SIGMOD). San Diego, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Cohen, W. 2000. Data integration using similarity joins and a word-based information representation language. ACM Trans. Inform. Syst. 18, 288--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Cohen, W., Ravikumar, P., and Fienberg, S. 2003. A comparison of string distance metrics for name-matching tasks. In The IJCAI Workshop on Information Integration on the Web (IIWeb). Acapulco, Mexico.Google ScholarGoogle Scholar
  13. Cohen, W. and Richman, J. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Edmonton, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Dong, X., Halevy, A., and Madhavan, J. 2005. Reference reconciliation in complex information spaces. In The ACM International Conference on Management of Data (SIGMOD). Baltimore, MD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Fellegi, I. and Sunter, A. 1969. A theory for record linkage. J. Amer. Statis. Assoc. 64, 1183--1210.Google ScholarGoogle ScholarCross RefCross Ref
  16. Giles, C. L., Bollacker, K., and Lawrence, S. 1998. CiteSeer: An automatic citation indexing system. In The ACM Conference on Digital Libraries. Pittsburgh, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Gravano, L., Ipeirotis, P., Koudas, N., and Srivastava, D. 2003. Text joins for data cleansing and integration in an RDBMS. In The IEEE International Conference on Data Engineering (ICDE). Bangalore, India.Google ScholarGoogle Scholar
  18. Hernández, M. and Stolfo, S. 1995. The merge/purge problem for large databases. In The ACM International Conference on Management of Data (SIGMOD). San Jose, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Kalashnikov, D., Mehrotra, S., and Chen, Z. 2005. Exploiting relationships for domain-independent data cleaning. In The SIAM International Conference on Data Mining (SIAM SDM). Newport Beach, CA.Google ScholarGoogle Scholar
  20. Li, X., Morie, P., and Roth, D. 2005. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special Issue on Semantic Integration 26, 1, 45--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Liben-Nowell, D. and Kleinberg, J. 2003. The link prediction problem for social networks. In The International Conference on Information and Knowledge Management (CIKM). New Orleans, LA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. McCallum, A., Nigam, K., and Ungar, L. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In The International Conference On Knowledge Discovery and Data Mining (SIGKDD). Boston, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. McCallum, A. and Wellner, B. 2004. Conditional models of identity uncertainty with application to noun coreference. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.Google ScholarGoogle Scholar
  24. Monge, A. and Elkan, C. 1996. The field matching problem: Algorithms and applications. In The International Conference on Knowledge Discovery and Data Mining (SIGKDD). Portland, ME.Google ScholarGoogle Scholar
  25. Monge, A. and Elkan, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD). Tuscon, AZ.Google ScholarGoogle Scholar
  26. Navarro, G. 2001. A guided tour to approximate string matching. ACM Comp. Sur. 33, 1, 31--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Newcombe, H., Kennedy, J., Axford, S., and James, A. 1959. Automatic linkage of vital records. Science 130, 954--959.Google ScholarGoogle ScholarCross RefCross Ref
  28. Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. 2003. Identity uncertainty and citation matching. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.Google ScholarGoogle Scholar
  29. Ravikumar, P. and Cohen, W. 2004. A hierarchical graphical model for record linkage. In The Conference on Uncertainty in Artificial Intelligence (UAI). Banff, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ristad, E. and Yianilos, P. 1998. Learning string edit distance. IEEE Trans. Patt. Anal. Mach. Intell. 20, 5, 522--532. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Sarawagi, S. and Bhamidipaty, A. 2002. Interactive deduplication using active learning. In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Edmonton, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Singla, P. and Domingos, P. 2004. Multi-relational record linkage. In The ACM SIGKDD Workshop on Multi-Relational Data Mining (MRDM). Seattle, WA.Google ScholarGoogle Scholar
  33. Tejada, S., Knoblock, C., and Minton, S. 2001. Learning object identification rules for information integration. Inform. Syst. J. 26, 8, 635--656. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Winkler, W. 1999. The state of record linkage and current research problems. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.Google ScholarGoogle Scholar
  35. Winkler, W. 2002. Methods for record linkage and Bayesian networks. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.Google ScholarGoogle Scholar

Index Terms

  1. Collective entity resolution in relational data

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Knowledge Discovery from Data
        ACM Transactions on Knowledge Discovery from Data  Volume 1, Issue 1
        March 2007
        161 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/1217299
        Issue’s Table of Contents

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 March 2007
        Published in tkdd Volume 1, Issue 1

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader