ABSTRACT
Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed that references belong to a single class with a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes, and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop.

Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
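The three features named in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the function `reconcile`, the exact-match scoring, the `weights`, `link_weight`, and `threshold` parameters, and the sample data are all assumptions made for illustration. The sketch scores a pair of clusters by shared attribute values plus shared associated references, treats a set of known-distinct pairs as negative evidence, and merges attribute values after each match so that later comparisons see the enriched reference.

```python
from itertools import combinations


def reconcile(refs, links, weights, distinct=frozenset(),
              threshold=1.0, link_weight=0.5):
    """Toy reconciliation loop (hypothetical, for illustration only).

    refs:     id -> {attribute: set of values}
    links:    iterable of (id, id) associations between references
    weights:  attribute -> score added when two clusters share a value
    distinct: frozensets {id, id} known to be different entities
    """
    parent = {r: r for r in refs}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Per-cluster state, indexed by the cluster's root id.
    attrs = {r: {k: set(v) for k, v in a.items()} for r, a in refs.items()}
    members = {r: {r} for r in refs}
    nbrs = {r: set() for r in refs}
    for a, b in links:
        nbrs[a].add(b)
        nbrs[b].add(a)

    def score(x, y):
        # Attribute evidence: any shared value on a weighted attribute.
        s = sum(w for k, w in weights.items()
                if attrs[x].get(k, set()) & attrs[y].get(k, set()))
        # Association evidence: both clusters link to a common cluster.
        if {find(n) for n in nbrs[x]} & {find(n) for n in nbrs[y]}:
            s += link_weight
        return s

    def blocked(x, y):
        # Negative evidence: some member pair is known to be distinct.
        return any(frozenset((a, b)) in distinct
                   for a in members[x] for b in members[y])

    changed = True
    while changed:  # repeat so earlier merges can enable later ones
        changed = False
        for x, y in combinations(sorted({find(r) for r in refs}), 2):
            x, y = find(x), find(y)
            if x == y or blocked(x, y) or score(x, y) < threshold:
                continue
            parent[y] = x
            # Enrichment: the merged cluster keeps all attribute values.
            for k, v in attrs[y].items():
                attrs[x].setdefault(k, set()).update(v)
            members[x] |= members[y]
            nbrs[x] |= nbrs[y]
            changed = True

    return sorted(sorted(members[r]) for r in {find(r) for r in refs})


# Made-up person references; p3 carries only a name.
refs = {
    "p1": {"name": {"Ann Lee"}, "email": {"ann@example.org"}},
    "p2": {"name": {"A. Lee"}, "email": {"ann@example.org"}},
    "p3": {"name": {"A. Lee"}},
    "p4": {"name": {"Bob Ray"}},
}
links = [("p1", "p4"), ("p3", "p4")]
weights = {"email": 1.0, "name": 0.5}

clusters = reconcile(refs, links, weights)
```

In this made-up data, `p3` shares a name only with `p2` and a contact only with `p1`, so neither comparison alone reaches the threshold; it is matched only after `p1` and `p2` have been merged into one enriched cluster.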