DOI: 10.1145/1321440.1321484
research-article

A strategy for allowing meaningful and comparable scores in approximate matching

Published: 06 November 2007

ABSTRACT

The goal of approximate data matching is to assess whether two distinct data instances represent the same real-world object. This is usually achieved with a similarity function, which returns a score that indicates how similar two data instances are. If this score surpasses a given threshold, the two instances are considered to represent the same real-world object. The scores returned by a similarity function depend on the algorithm that implements it and have no meaning to the user, apart from the fact that a higher score means that two data instances are more similar. In this paper, we propose that, instead of defining the threshold in terms of the scores returned by a similarity function, the user specify the precision expected from the matching process. Precision is a well-known quality measure and has a clear interpretation from the user's point of view. Our approach relies on a mapping between similarity scores and precision values, built from a training data set. Experimental results show that the training can be run on a representative data set and its result reused for other databases from the same domain.
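
The abstract does not say how the mapping between scores and precision is built, so the following Python sketch is only one plausible reading, not the authors' method. It assumes a labeled training set of `(score, is_match)` pairs, computes the precision observed at each candidate threshold, and then looks up the lowest threshold that reaches a user-specified precision. The function names `precision_by_threshold` and `threshold_for_precision` and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def precision_by_threshold(
    scored_pairs: List[Tuple[float, bool]]
) -> List[Tuple[float, float]]:
    """Return (threshold, precision) points, highest threshold first.

    Precision at threshold t is the fraction of true matches among
    all training pairs whose similarity score is >= t.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    curve = []
    true_positives = 0
    for i, (score, is_match) in enumerate(ranked, start=1):
        true_positives += int(is_match)
        # Emit a point only after the last pair of a group of tied scores.
        if i == len(ranked) or ranked[i][0] != score:
            curve.append((score, true_positives / i))
    return curve


def threshold_for_precision(
    curve: List[Tuple[float, float]], target: float
) -> Optional[float]:
    """Lowest threshold whose training precision is >= target, else None."""
    candidates = [t for t, p in curve if p >= target]
    return min(candidates) if candidates else None


# Hypothetical training data: (similarity score, same real-world object?)
training = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)]
curve = precision_by_threshold(training)
print(threshold_for_precision(curve, 0.75))  # 0.7
print(threshold_for_precision(curve, 0.90))  # 0.9
```

Choosing the lowest qualifying threshold keeps as many matches as possible at the requested precision. Because observed precision is not guaranteed to decrease monotonically as the threshold drops, a stricter variant could instead stop at the first threshold where precision falls below the target.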

Published in

CIKM '07: Proceedings of the sixteenth ACM conference on information and knowledge management
November 2007
1048 pages
ISBN: 9781595938039
DOI: 10.1145/1321440

      Copyright © 2007 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall acceptance rate: 1,861 of 8,427 submissions, 22%
