Abstract
Record linkage is used to associate entities from multiple data sources. For example, two organizations contemplating a merger may want to know how common their customer bases are so that they may better assess the benefits of the merger. Another example is a database of people who are forbidden from a certain activity by regulators, may need to be compared to a list of people engaged in that activity. The autonomous entities who wish to carry out the record matching computation are often reluctant to fully share their data; they fear losing control over its subsequent dissemination and usage, or they want to insure privacy because the data is proprietary or confidential, and/or they are cautious simply because privacy laws forbid its disclosure or regulate the form of that disclosure. In such cases, the problem of carrying out the linkage computation without full data exchange has been called private record linkage. Previous private record linkage techniques have made use of a third party. We provide efficient techniques for private record linkage that improve on previous work in that (1) our techniques make no use of a third party, and (2) they achieve much better performance than previous schemes in terms of their execution time while maintaining acceptable quality of output compared to nonprivacy settings. Our protocol consists of two phases. The first phase primarily produces candidate record pairs for matching, by carrying out a very fast (but not accurate) matching between such pairs of records. The second phase is a novel protocol for efficiently computing distances between each candidate pair (without any expensive cryptographic operations such as modular exponentiations). Our experimental evaluation of our approach validates these claims.
- Agrawal, R., Evfimievski, A., and Srikant, R. 2003. Information sharing across private databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 86--97. Google ScholarDigital Library
- Agrawal, R., Asonov, D., Kantarcioglu, M., and Li, Y. 2006. Sovereign joins. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). IEEE Computer Society. Google ScholarDigital Library
- Al-Lawati, A., Lee, D., and McDaniel, P. 2005. Blocking-aware private record linkage. In Proceedings of the 2nd International Workshop on information Quality in Information Systems. ACM, 59--68. Google ScholarDigital Library
- Arkady, M. 2007. Data Quality Assessment. Technics Publications, LLC. Google ScholarDigital Library
- Atallah, M. J., Kerschbaum, F., and Du, W. 2003. Secure and private sequence comparisons. In Proceedings of the ACM Workshop on Privacy in the Electronic Society. 39--44. Google ScholarDigital Library
- Bachteler, T., Schnell, R., and Reiher, J. 2010. An empirical comparison of approaches to approximate string matching in private record linkage. In Proceedings of Statistics Canada Symposium, Social Statistics: The Interplay among Censuses, Surveys and Administrative Data.Google Scholar
- Bourgain, J. 1985. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math. 52, 1--2, 46--52.Google ScholarCross Ref
- Christen, P. 2011. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 99. Google ScholarDigital Library
- Churches, T. and Christen, P. 2004. Some methods for blindfolded record linkage. BMC Med. Inform. Decision Making 4, 9.Google ScholarCross Ref
- Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., and Zhu, M. Y. 2002. Tools for privacy preserving data mining. SIGKDD Explor. 4, 2, 28--34. Google ScholarDigital Library
- Du, W. and Atallah, M. J. 2000. Protocols for secure remote database access with approximate matching. In Proceedings of the 1st Workshop on Security and Privacy in E-Commerce.Google Scholar
- Du, W. and Atallah, M. J. 2001. Privacy-preserving statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference. 102--110. Google ScholarDigital Library
- Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1, 1--16. Google ScholarDigital Library
- Emekci, F., Agrawal, D., Abbadi, A. E., and Gulbeden, A. 2006. Privacy preserving query processing using third parties. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). IEEE Computer Society. Google ScholarDigital Library
- Fellegi, I. P. and Sunter, A. B. 1969. A theory for record linkage. J. Amer. Statist. Assoc. 64, 328, 1183--1210.Google ScholarCross Ref
- Freedman, M. J., Nissim, K., and Pinkas, B. 2004. Effcient private matching and set intersection. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT).Google Scholar
- Goethals, B., Laur, S., Lipmaa, H., and Mielikinen, T. 2004. On private scalar product computation for privacy-preserving data mining. In Proceedings of the 7th Annual International Conference in Information Security and Cryptology. 104--120. Google ScholarDigital Library
- Hernandez, M. A. and Stolfo, S. J. 1998. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining Knowl. Discov. 2, 1, 9--37. Google ScholarDigital Library
- Hjaltason, G. R. and Samet, H. 2003. Properties of embedding methods for similarity searching in metric spaces. IEEE Trans. Pattern Anal. Mach. Intell. 25, 5, 530--549. Google ScholarDigital Library
- Inan, A., Kantarcioglu, M., Bertino, E., and Scannapieco, M. 2008. A hybrid approach to private record linkage. In Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE). 496--505. Google ScholarDigital Library
- Inan, A., Kantarcioglu, M., Ghinita, G., and Bertino, E. 2010. Private record matching using differential privacy. In Proceedings of the 13th International Conference on Extending Database Technology. ACM, 123--134. Google ScholarDigital Library
- Jin, L., Li, C., and Mehrotra, S. 2003. Efficient record linkage in large data sets. In Proceedings of the 8th International Conference on Database Systems for Advanced Applications (DASFAA). IEEE Computer Society, Los Alamitos, CA, 137. Google ScholarDigital Library
- Karakasidis, A. and Verykios, V. 2009. Privacy preserving record linkage using phonetic codes. In Proceedings of the 4th Balkan Conference in Informatics. IEEE, 101--106. Google ScholarDigital Library
- Kissner, L. and Song, D. 2005. Private and threshold set-intersection. Tech. rep. CMU-CS-05-113.Google Scholar
- Koudas, N., Sarawagi, S., and Srivastava, D. 2006. Record linkage: similarity measures and algorithms. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 802--803. Google ScholarDigital Library
- Linial, N., London, E., and Rabinovich, Y. 1995. The geometry of graphs and some of its algorithmic applications. Combinatorica 15, 2, 215--245.Google ScholarCross Ref
- McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 169--178. Google ScholarDigital Library
- Monge, A. E. and Elkan, C. P. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings 2nd ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD). 23--29.Google Scholar
- Paillier, P. 1999. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of International Conference on the Theory and Application of Cryptographic Techniques (EUROCRYPT). 223--238. Google ScholarDigital Library
- Ravikumar, P. and Fienberg, S. E. 2004. A secure protocol for computing string distance metrics. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Workshop on Security Aspects of Data Mining (PSDM).Google Scholar
- Ravikumar, P., Cohen, W., and Fienberg, S. E. 2004. A secure protocol for computing string distance metrics. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Workshop on Security Aspects of Data Mining (PSDM).Google Scholar
- Scannapieco, M., Figotin, I., Bertino, E., and Elmagarmid, A. K. 2007. Privacy preserving schema and data matching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 653--664. Google ScholarDigital Library
- Schneier, B. 1996. Applied Cryptography 2nd Ed. John Wiley & Sons.Google Scholar
- Schnell, R., Bachteler, T., and Reiher, J. 2009. Privacy-preserving record linkage using bloom filters. BMC Med. Inform. Decision Making 9, 1, 41.Google ScholarCross Ref
- Smith, S. W. 1997. The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing. Google ScholarDigital Library
- Sweeney, L. 2002. k-anonymity: A model for protecting privacy. Internat. J. Uncertainty, Fuzziness Knowl.-Based Syst. 10, 5, 557--570. Google ScholarDigital Library
- Yakout, M., Atallah, M. J., and Elmagarmid, A. 2009. Efficient private record linkage. In Proceedings of the 25nd International Conference on Data Engineering (ICDE). IEEE Computer Society. Google ScholarDigital Library
Index Terms
- Efficient and Practical Approach for Private Record Linkage
Recommendations
Efficient Private Record Linkage
ICDE '09: Proceedings of the 2009 IEEE International Conference on Data EngineeringRecord linkage is the computation of the associations among records of multiple databases. It arises in contexts like the integration of such databases, online interactions and negotiations, and many others. The autonomous entities who wish to carry out ...
Hybrid Private Record Linkage: Separating Differentially Private Synopses from Matching Records
Private record linkage protocols allow multiple parties to exchange matching records, which refer to the same entities or have similar values, while keeping the non-matching ones secret. Conventional protocols are based on computationally expensive ...
Pad and Chaff: secure approximate string matching in private record linkage
IIWAS '12: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & ServicesThe promise of high-quality, integrated data sets for policy analysis is great. However, significant impediments remain that make the creation of such data sets very difficult. The identification of processes and algorithms for carrying out approximate ...
Comments