DOI: 10.1145/1066157.1066168
Article

Reference reconciliation in complex information spaces

Published: 14 June 2005

ABSTRACT

Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed that references belong to a single class with a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop.

Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
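To make the third feature concrete, the following is a minimal toy sketch of reference enrichment. It is not the algorithm from the paper. The matching rule (two references match when they agree on at least two attributes), the attribute names, and the sample values are all assumptions chosen for illustration. The sketch shows only how merging attribute values can let a sparse reference match a merged cluster even though it matches none of the original references on its own.

```python
from itertools import combinations

def matches(a, b, min_shared=2):
    """Toy rule (an assumption): match when at least `min_shared`
    attributes have a value in common."""
    shared = sum(1 for k in a.keys() & b.keys() if a[k] & b[k])
    return shared >= min_shared

def enrich(a, b):
    """Merge two references by taking the union of their attribute values."""
    return {k: a.get(k, set()) | b.get(k, set()) for k in a.keys() | b.keys()}

def reconcile(references):
    """Repeatedly merge any matching pair until no pair matches."""
    clusters = [{k: set(v) for k, v in r.items()} for r in references]
    members = [[i] for i in range(len(references))]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(clusters)), 2):
            if matches(clusters[i], clusters[j]):
                clusters[i] = enrich(clusters[i], clusters[j])
                members[i] = members[i] + members[j]
                del clusters[j]
                del members[j]
                changed = True
                break  # indices shifted, so restart the scan
    return [sorted(m) for m in members]

# Hypothetical references; every value here is made up.
refs = [
    {"name": {"Ann Lee"}, "email": {"alee@example.org"}, "phone": {"555-0100"}},
    {"name": {"Ann Lee"}, "email": {"alee@example.org"}, "office": {"CSE 101"}},
    {"phone": {"555-0100"}, "office": {"CSE 101"}},
    {"name": {"Bob Ray"}, "phone": {"555-0100"}},
]
groups = reconcile(refs)
print(groups)
```

In this sample, reference 2 shares only one attribute with reference 0 and one with reference 1, so it matches neither directly. Once references 0 and 1 are merged, the enriched cluster carries both the phone and the office value, and reference 2 then meets the two-attribute rule. Reference 3 shares only a phone number and stays separate.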


Published in

    SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data
    June 2005
    990 pages
    ISBN: 1595930604
    DOI: 10.1145/1066157
    Conference Chair: Fatma Ozcan

    Copyright © 2005 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 785 of 4,003 submissions, 20%
