ABSTRACT
The goal of approximate data matching is to assess whether two distinct data instances represent the same real-world object. This is usually achieved with a similarity function, which returns a score that quantifies how similar two data instances are. If this score surpasses a given threshold, both data instances are considered to represent the same real-world object. The score values returned by a similarity function depend on the algorithm that implements the function and have no meaning to the user (apart from the fact that a higher similarity value means that two data instances are more similar). In this paper, we propose that, instead of defining the threshold in terms of the scores returned by a similarity function, the user specifies the precision that is expected from the matching process. Precision is a well-known quality measure and has a clear interpretation from the user's point of view. Our approach relies on a mapping between similarity scores and precision values that is built from a training data set. Experimental results show that the training can be executed against a representative data set and reused for other databases from the same domain.
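The idea of mapping similarity scores to precision values can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of the training set as (score, is_match) pairs, and the choice of the lowest qualifying threshold are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical training set: (similarity score, whether the pair is a true match).
TRAINING: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.75, True), (0.60, False), (0.40, False),
]


def precision_by_threshold(
    scored_pairs: List[Tuple[float, bool]],
) -> List[Tuple[float, float]]:
    """Return (threshold, precision) rows, highest threshold first.

    Precision at threshold t is the fraction of true matches among
    all training pairs whose score is >= t.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    table = []
    true_matches = 0
    for seen, (score, is_match) in enumerate(ranked, start=1):
        true_matches += is_match
        # Emit one row per distinct score, after the last pair of a tie group.
        if seen == len(ranked) or ranked[seen][0] != score:
            table.append((score, true_matches / seen))
    return table


def threshold_for_precision(
    table: List[Tuple[float, float]], target: float
) -> Optional[float]:
    """Return the lowest threshold whose training precision meets the target.

    Returns None when no threshold reaches the requested precision.
    """
    candidates = [threshold for threshold, precision in table if precision >= target]
    return min(candidates) if candidates else None


table = precision_by_threshold(TRAINING)
print(threshold_for_precision(table, 0.75))
```

Under these assumptions, the user states a target precision and the sketch looks up the score threshold to apply. Picking the lowest qualifying threshold accepts as many pairs as possible while the training precision still meets the target; other selection policies are equally plausible.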