ABSTRACT
The task of linking multiple databases with the aim to identify records that refer to the same entity is occurring increasingly in many application areas. If unique identifiers for the entities are not available in all the databases to be linked, techniques that calculate approximate similarities between records must be used for the identification of matching pairs of records. Often, the records to be linked contain personal information such as names and addresses. In many applications, the exchange of attribute values that contain such personal details between organisations is not allowed due to privacy concerns. The linking of records between databases without revealing the actual attribute values in these records is the research problem known as 'privacy-preserving record linkage' (PPRL). While various approaches have been proposed to deal with privacy within the record linkage process, a viable solution that is well applicable to real-world conditions needs to address the major aspect of scalability of linking very large databases while preserving security and linkage quality.
We propose a novel two-party protocol for PPRL that addresses scalability, security and quality/accuracy. The protocol is based on (1) the use of reference values that are available to both database owners, and allows them to individually calculate the similarities between their attribute values and the reference values; and (2) the binning of these calculated similarity values to allow their secure exchange between the two database owners. Experiments on a real-world database with nearly two million records yield linkage results that have a linear scalability to large databases and high linkage accuracy, allowing for approximate matching in the privacy-preserving context. Since the protocol has a low computational burden and allows quality approximate matching while still preserving the privacy of the databases that are matched, the protocol can be useful for many real-world applications requiring PPRL.
- Agrawal, R., Evfimievski, A. & Srikant, R. (2003), Information sharing across private databases, in 'ACM SIGMOD', ACM, pp. 86--97. Google ScholarDigital Library
- Al-Lawati, A., Lee, D. & McDaniel, P. (2005), Blocking-aware private record linkage, in 'International Workshop on Information Quality in Information Systems', pp. 59--68. Google ScholarDigital Library
- Atallah, M., Kerschbaum, F. & Du, W. (2003), Secure and private sequence comparisons, in 'ACM workshop on Privacy in the Electronic Society', pp. 39--44. Google ScholarDigital Library
- Bachteler, T., Schnell, R. & Reiher, J. (2010), An empirical comparison of approaches to approximate string matching in private record linkage, in 'Proceedings of Statistics Canada Symposium 2010. Social Statistics: The Interplay among Censuses, Surveys and Administrative Data'.Google Scholar
- Christen, P. (2006a), A comparison of personal name matching: Techniques and practical issues, in 'Workshop on Mining Complex Data, held at IEEE ICDM'06', Hong Kong. Google ScholarDigital Library
- Christen, P. (2006b), Privacy-preserving data linkage and geocoding: Current approaches and research directions, in 'Workshop on Privacy Aspects of Data Mining, held at IEEE ICDM'06', Hong Kong, pp. 497--501. Google ScholarDigital Library
- Christen, P. (2011), 'A survey of indexing techniques for scalable record linkage and deduplication', IEEE Transactions on Knowledge and Data Engineering. Google ScholarDigital Library
- Churches, T. & Christen, P. (2004), 'Some methods for blindfolded record linkage', BioMed Central Medical Informatics and Decision Making 4(9).Google Scholar
- Du, W., Atallah, M. & Kerschbaum, F. (2000), Protocols for secure remote database access with approximate matching, in 'Proceedings of the 1st ACM Workshop on Security and Privacy in E-Commerce'.Google Scholar
- Durham, E., Xue, Y., Kantarcioglu, M. & Malin, B. (2011), 'Quantifying the correctness, computational complexity, and security of privacy-preserving string comparators for record linkage', Information Fusion In Press. Google ScholarDigital Library
- Elmagarmid, A. K., Ipeirotis, P. G. & Verykios, V. S. (2007), 'Duplicate record detection: A survey', IEEE Transactions on Knowledge and Data Engineering 19(1), 1--16. Google ScholarDigital Library
- Fellegi, I. P. & Sunter, A. B. (1969), 'A theory for record linkage', Journal of the American Statistical Society 64(328).Google ScholarCross Ref
- Hall, R. & Fienberg, S. (2010), Privacy-preserving record linkage, in 'Privacy in Statistical Databases, Springer LNCS 6344', Corfu, Greece, pp. 269--283. Google ScholarDigital Library
- Hernandez, M. A. & Stolfo, S. J. (1995), The merge/purge problem for large databases, in 'ACM SIGMOD'95', San Jose. Google ScholarDigital Library
- Herzog, T., Scheuren, F. & Winkler, W. (2007), Data quality and record linkage techniques, Springer Verlag. Google ScholarDigital Library
- Inan, A., Kantarcioglu, M., Bertino, E. & Scannapieco, M. (2008), A hybrid approach to private record linkage, in 'IEEE ICDE', pp. 496--505. Google ScholarDigital Library
- Inan, A., Kantarcioglu, M., Ghinita, G. & Bertino, E. (2010), Private record matching using differential privacy, in 'International Conference on Extending Database Technology'. Google ScholarDigital Library
- Karakasidis, A. & Verykios, V. (2009), Privacy preserving record linkage using phonetic codes, in 'Fourth Balkan Conference in Informatics', IEEE, pp. 101--106. Google ScholarDigital Library
- Karakasidis, A. & Verykios, V. (2010), Advances in privacy preserving record linkage, in 'E-activity and Innovative Technology, Advances in Applied Intelligence Technologies Book Series', IGI Global, pp. 22--34.Google Scholar
- Kissner, L. & Song, D. (2005), Private and threshold set-intersection, Technical report, Carnegie Mellon University.Google Scholar
- Krawczyk, H., Bellare, M. & Canetti, R. (1997), HMAC: Keyed-hashing for message authentication, in 'Internet RFCs'. Google ScholarDigital Library
- Manning, C. & Schüütze, H. (1999), Foundations of statistical natural language processing, Vol. 59, MIT Press. Google ScholarDigital Library
- O'Keefe, C., Yung, M., Gu, L. & Baxter, R. (2004), Privacy-preserving data linkage protocols, in 'Proceedings of the 2004 ACM workshop on Privacy in the electronic society', pp. 94--102. Google ScholarDigital Library
- Pang, C., Gu, L., Hansen, D. & Maeder, A. (2009), 'Privacy-preserving fuzzy matching using a public reference table', Intelligent Patient Management pp. 71--89.Google Scholar
- Raghavan, V., Bollmann, P. & Jung, G. (1989), 'A critical investigation of recall and precision as measures of retrieval system performance', ACM Transactions on Information Systems (TOIS) 7(3), 205--229. Google ScholarDigital Library
- Ravikumar, P., Cohen, W. & Fienberg, S. (2004), A secure protocol for computing string distance metrics, in 'Workshop on Privacy and Security Aspects of Data Mining held at IEEE ICDM'04', pp. 40--46.Google Scholar
- Scannapieco, M., Figotin, I., Bertino, E. & Elmagarmid, A. (2007), Privacy preserving schema and data matching, in 'ACM SIGMOD', pp. 653--664. Google ScholarDigital Library
- Schneier, B. (1995), Applied cryptography: Protocols, algorithms, and source code in C, 2nd Edition, John Wiley & Sons, Inc., New York. Google ScholarDigital Library
- Song, D., Wagner, D. & Perrig, A. (2000), 'Practical techniques for searches on encrypted data', Security and Privacy, IEEE Symposium on p. 44. Google ScholarDigital Library
- Trepetin, S. (2008), 'Privacy-preserving string comparisons in record linkage systems: a review', Information Security Journal: A Global Perspective 17(5), 253--266. Google ScholarDigital Library
- Verykios, V., Karakasidis, A. & Mitrogiannis, V. (2009), 'Privacy preserving record linkage approaches', Int. J. of Data Mining, Modelling and Management 1(2), 206--221.Google ScholarCross Ref
- Winkler, W. E. (2006), Overview of record linkage and current research directions, Technical Report RR2006/02, US Bureau of the Census.Google Scholar
- Witten, I. H., Moffat, A. & Bell, T. C. (1999), Managing Gigabytes, 2nd Edition, Morgan Kaufmann.Google Scholar
- Yakout, M., Atallah, M. & Elmagarmid, A. (2009), Efficient private record linkage, in 'IEEE ICDE', pp. 1283--1286. Google ScholarDigital Library
Index Terms
- An efficient two-party protocol for approximate matching in private record linkage
Recommendations
An iterative two-party protocol for scalable privacy-preserving record linkage
AusDM '12: Proceedings of the Tenth Australasian Data Mining Conference - Volume 134Record linkage is the process of identifying which records in different databases refer to the same real-world entities. When personal details of individuals, such as names and addresses, are used to link databases across different organisations, then ...
A taxonomy of privacy-preserving record linkage techniques
The process of identifying which records in two or more databases correspond to the same entity is an important aspect of data quality activities such as data pre-processing and data integration. Known as record linkage, data matching or entity ...
Efficient protocols for private record linkage
SAC '14: Proceedings of the 29th Annual ACM Symposium on Applied ComputingRecord linkage allows data from different sources to be integrated to facilitate data mining tasks. However, in many cases, records have to be linked by personally identifiable information. To prevent privacy breaches, ideally records should be linked ...
Comments