research-article

A strategy for allowing meaningful and comparable scores in approximate matching

Authors:
Carina F. Dorneles (UFRGS, Porto Alegre, Brazil)
Carlos A. Heuser (UFRGS, Porto Alegre, Brazil)
Viviane Moreira Orengo (UFRGS, Porto Alegre, Brazil)
Altigran S. da Silva (UFAM, Manaus, Brazil)
Edleno S. de Moura (UFAM, Manaus, Brazil)

CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management
November 2007
Pages 303–312
https://doi.org/10.1145/1321440.1321484

Published: 06 November 2007

ABSTRACT

The goal of approximate data matching is to assess whether two distinct data instances represent the same real-world object. This is usually achieved with a similarity function, which returns a score that quantifies how similar two data instances are. If this score exceeds a given threshold, both data instances are considered to represent the same real-world object. The score values returned by a similarity function depend on the algorithm that implements the function and have no meaning to the user (apart from the fact that a higher score means that two data instances are more similar). In this paper, we propose that, instead of defining the threshold in terms of the scores returned by a similarity function, the user specify the precision expected from the matching process. Precision is a well-known quality measure and has a clear interpretation from the user's point of view. Our approach relies on a mapping between similarity scores and precision values, built from a training data set. Experimental results show that the training may be executed against a representative data set and reused for other databases from the same domain.
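The idea of mapping similarity scores to precision values can be illustrated with a minimal Python sketch. It is not the authors' procedure, which the abstract does not spell out; it assumes a hypothetical training set of (score, is_match) pairs, computes the precision observed at each candidate threshold, and returns the lowest threshold that still meets the precision the user asks for. The function names and the sample data are invented for illustration.

```python
from typing import List, Optional, Tuple


def precision_by_threshold(
    labeled: List[Tuple[float, bool]]
) -> List[Tuple[float, float]]:
    """Return (threshold, precision) rows, one per distinct score, highest first.

    Precision at threshold t is the fraction of training pairs with
    score >= t that are true matches.
    """
    ranked = sorted(labeled, key=lambda pair: pair[0], reverse=True)
    table = []
    true_matches = 0
    for seen, (score, is_match) in enumerate(ranked, start=1):
        true_matches += int(is_match)
        # Emit a row only after the last pair in a run of tied scores.
        if seen == len(ranked) or ranked[seen][0] != score:
            table.append((score, true_matches / seen))
    return table


def threshold_for_precision(
    table: List[Tuple[float, float]], target: float
) -> Optional[float]:
    """Lowest threshold whose training precision meets the target, or None."""
    qualifying = [threshold for threshold, precision in table if precision >= target]
    return min(qualifying) if qualifying else None


# Hypothetical training data: (similarity score, pair is a true match).
training = [
    (0.95, True),
    (0.90, True),
    (0.85, False),
    (0.80, True),
    (0.60, False),
    (0.40, False),
]
table = precision_by_threshold(training)
print(threshold_for_precision(table, 0.75))  # user asks for 75% precision
```

Observed precision is not monotone in the threshold, so several thresholds may satisfy the same target. Picking the lowest one keeps as many matches as possible, which is one reasonable design choice among several; a different similarity function would yield a different table, while the user-facing precision target stays the same.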

Index Terms

1. Information systems
  1. Data management systems

Published in
CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management
November 2007
1048 pages
ISBN: 9781595938039
DOI: 10.1145/1321440
Co-chair: Alberto H. F. Laender
Conference Chairs: André O. Falcão (Universidade de Lisboa, Portugal), Øystein Haug Olsen
General Chair: Mário J. Silva (Universidade de Lisboa, Portugal)
Program Chairs: Ricardo Baeza-Yates, Deborah L. McGuinness, Bjorn Olstad
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data cleaning
data integration
similarity querying