ABSTRACT
Record linkage is the process of identifying which records in different databases refer to the same real-world entities. When personal details of individuals, such as names and addresses, are used to link databases across different organisations, then privacy becomes a major concern. Often it is not permissible to exchange identifying data among organisations. Linking databases in situations where no private or confidential information can be revealed is known as 'privacy-preserving record linkage' (PPRL). We propose a novel protocol for scalable and approximate PPRL based on Bloom filters in a scenario where no third party is available to conduct a linkage.
While two-party protocols are more secure because there is no possibility of collusion between one of the database owners and the third party, these protocols generally require more complex and expensive techniques to ensure that a database owner cannot infer any sensitive information about the other party's data during the linkage process. Our two-party protocol uses an efficient privacy technique called Bloom filters, and conducts an iterative classification of record pairs into matches and non-matches, as selected bits of the Bloom filters are revealed. Experiments conducted on real-world databases that contain nearly two million records, show that our protocol is scalable to large databases while providing sufficient privacy characteristics and achieving high linkage quality.
- Agrawal, R., Evfimievski, A. & Srikant, R. (2003), Information sharing across private databases, in 'ACM SIGMOD', San Diego, pp. 86--97. Google ScholarDigital Library
- Al-Lawati, A., Lee, D. & McDaniel, P. (2005), Blocking-aware private record linkage, in 'Journal of Data and Information Quality', pp. 59--68.Google Scholar
- Atallah, M., Kerschbaum, F. & Du, W. (2003), Secure and private sequence comparisons, in 'ACM Workshop on Privacy in the Electronic Society', pp. 39--44. Google ScholarDigital Library
- Bloom, B. (1970), 'Space/time trade-offs in hash coding with allowable errors', Communications of the ACM 13(7), 422--426. Google ScholarDigital Library
- Broder, A., Mitzenmacher, M. & Mitzenmacher, A. (2002), Network applications of Bloom filters: A survey, in 'Internet Mathematics'.Google Scholar
- Christen, P. (2006), Privacy-preserving data linkage and geocoding: Current approaches and research directions, in 'IEEE ICDM Workshop on Privacy Aspects of Data Mining', Hong Kong. Google ScholarDigital Library
- Christen, P. (2009), Geocode matching and privacy preservation, in 'Workshop on Privacy, Security, and Trust in KDD', Springer, pp. 7--24. Google ScholarDigital Library
- Christen, P. (2011), 'A survey of indexing techniques for scalable record linkage and deduplication', IEEE Transactions on Knowledge and Data Engineering. Google ScholarDigital Library
- Christen, P. (2012), Data Matching, Data-Centric Systems and Applications, Springer.Google Scholar
- Clifton, C., Kantarcioglu, M., Doan, A., Schadow, G., Vaidya, J., Elmagarmid, A. & Suciu, D. (2004), Privacy-preserving data integration and sharing, in 'ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery', pp. 19--26. Google ScholarDigital Library
- Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X. & Zhu, M. (2002), 'Tools for privacy preserving distributed data mining', SIGKDD Explorations 4(2), 28--34. Google ScholarDigital Library
- Durham, E. (2012), A framework for accurate, efficient private record linkage, PhD thesis, Vanderbilt University.Google Scholar
- Durham, E., Xue, Y., Kantarcioglu, M. & Malin, B. (2010), Private medical record linkage with approximate matching, in 'AMIA Annual Symposium Proceedings', p. 182.Google Scholar
- Durham, E., Xue, Y., Kantarcioglu, M. & Malin, B. (2011), 'Quantifying the correctness, computational complexity, and security of privacy-preserving string comparators for record linkage', Information Fusion. Google ScholarDigital Library
- Dusserre, L., Quantin, C. & Bouzelat, H. (1995), 'A one way public key cryptosystem for the linkage of nominal files in epidemiological studies', Medinfo 8, 644--647.Google Scholar
- Freedman, M., Ishai, Y., Pinkas, B. & Reingold, O. (2005), 'Keyword search and oblivious pseudorandom functions', Theory of Cryptography. Google ScholarDigital Library
- Gionis, A., Indyk, P. & Motwani, R. (1999), Similarity search in high dimensions via hashing, in 'Proceedings of the International Conference on Very Large Data Bases', pp. 518--529. Google ScholarDigital Library
- Goldreich, O. (2004), Foundations of cryptography: Basic applications, Vol. 2, Cambridge Univ Press. Google ScholarDigital Library
- Hall, R. & Fienberg, S. (2010), Privacy-preserving record linkage, in 'Privacy in Statistical Databases, Springer LNCS 6344', Corfu, Greece, pp. 269--283. Google ScholarDigital Library
- Inan, A., Kantarcioglu, M., Bertino, E. & Scannapieco, M. (2008), A hybrid approach to private record linkage, in 'IEEE ICDE', Cancun, Mexico, pp. 496--505. Google ScholarDigital Library
- Inan, A., Kantarcioglu, M., Ghinita, G. & Bertino, E. (2010), Private record matching using differential privacy, in 'EDBT'. Google ScholarDigital Library
- Kantarcioglu, M., Jiang, W. & Malin, B. (2008), A privacy-preserving framework for integrating person-specific databases, in 'Privacy in Statistical Databases', Springer, pp. 298--314. Google ScholarDigital Library
- Karakasidis, A. & Verykios, V. (2010), Advances in privacy preserving record linkage, in 'E-activity and Innovative Technology, Advances in Applied Intelligence Technologies Book Series', IGI Global.Google Scholar
- Karakasidis, A. & Verykios, V. (2011), 'Secure blocking+secure matching = secure record linkage', Journal of Computing Science and Engineering 5, 223--235.Google ScholarCross Ref
- Karakasidis, A. & Verykios, V. (2012), Reference table based k-anonymous private blocking, in 'ACM Symposium on Applied Computing', Riva del Garda, Italy. Google ScholarDigital Library
- Karakasidis, A., Verykios, V. & Christen, P. (2011), Fake injection strategies for private phonetic matching, in 'International Workshop on Data Privacy Management', Leuven, Belgium. Google ScholarDigital Library
- Kargupta, H., Datta, S., Wang, Q. & Sivakumar, K. (2003), On the privacy preserving properties of random data perturbation techniques, in 'ICDM', IEEE, pp. 99--106. Google ScholarDigital Library
- Kissner, L. & Song, D. (2004), Private and threshold set-intersection, in 'Technical Report', Carnegie Mellon University.Google Scholar
- Kuzu, M., Kantarcioglu, M., Durham, E. & Malin, B. (2011), A constraint satisfaction cryptanalysis of Bloom filters in private record linkage, in 'Privacy Enhancing Technologies', Springer, pp. 226--245. Google ScholarDigital Library
- Lai, P., Yiu, S., Chow, K., Chong, C. & Hui, L. (2006), An Efficient Bloom filter based Solution for Multiparty Private Matching, in 'International Conference on Security and Management'.Google Scholar
- Lindell, Y. & Pinkas, B. (2009), 'Secure multiparty computation for privacy-preserving data mining', Journal of Privacy and Confidentiality 1(1), 5.Google ScholarCross Ref
- Mohammed, N., Fung, B. & Debbabi, M. (2011), 'Anonymity meets game theory: secure data integration with malicious participants', VLDB 20(4), 567--588. Google ScholarDigital Library
- O'Keefe, C., Yung, M., Gu, L. & Baxter, R. (2004), Privacy-preserving data linkage protocols, in 'ACM Workshop on Privacy in the Electronic Society'. Google ScholarDigital Library
- Pang, C., Gu, L., Hansen, D. & Maeder, A. (2009), 'Privacy-preserving fuzzy matching using a public reference table', Intelligent Patient Management pp. 71--89.Google Scholar
- Ravikumar, P., Cohen, W. & Fienberg, S. (2004), A secure protocol for computing string distance metrics, in 'Workshop on Privacy and Security Aspects of Data Mining held at IEEE ICDM'04', pp. 40--46.Google Scholar
- Scannapieco, M., Figotin, I., Bertino, E. & Elmagarmid, A. (2007), Privacy preserving schema and data matching, in 'ACM SIGMOD', pp. 653--664. Google ScholarDigital Library
- Schnell, R., Bachteler, T. & Reiher, J. (2009), 'Privacy-preserving record linkage using Bloom filters', BMC Medical Informatics and Decision Making 9(1).Google Scholar
- Song, D., Wagner, D. & Perrig, A. (2000), 'Practical techniques for searches on encrypted data', sp. Google ScholarDigital Library
- Trepetin, S. (2008), 'Privacy-preserving string comparisons in record linkage systems: a review', Information Security Journal: A Global Perspective 17(5), 253--266. Google ScholarDigital Library
- Van Eycken, E., Haustermans, K., Buntinx, F., Ceuppens, A., Weyler, J., Wauters, E., VAN, O. et al. (2000), 'Evaluation of the encryption procedure and record linkage in the Belgian National Cancer Registry', Archives of public health 58(6), 281--294.Google Scholar
- Vatsalan, D., Christen, P. & Verykios, V. (2011), An efficient two-party protocol for approximate matching in private record linkage, in 'AusDM, CRPIT 121'. Google ScholarDigital Library
- Vatsalan, D., Christen, P. & Verykios, V. (2013), 'A taxonomy of privacy-preserving record linkage techniques', to appear in Journal of Information Systems. Google ScholarDigital Library
- Verykios, V., Karakasidis, A. & Mitrogiannis, V. (2009), 'Privacy preserving record linkage approaches', Int. J. of Data Mining, Modelling and Management 1(2), 206--221.Google ScholarCross Ref
- Weber, S., Lowe, H., Das, A. & Ferris, T. (2012), 'A simple heuristic for blindfolded record linkage', Journal of the American Medical Informatics Association.Google ScholarCross Ref
- Yakout, M., Atallah, M. & Elmagarmid, A. (2012), 'Efficient and practical approach for private record linkage', Journal of Data and Information Quality 3(3), 5. Google ScholarDigital Library
Recommendations
Scalable Privacy-Preserving Record Linkage for Multiple Databases
CIKM '14: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge ManagementPrivacy-preserving record linkage (PPRL) is the process of identifying records that correspond to the same real-world entities across several databases without revealing any sensitive information about these entities. Various techniques have been ...
An efficient two-party protocol for approximate matching in private record linkage
AusDM '11: Proceedings of the Ninth Australasian Data Mining Conference - Volume 121The task of linking multiple databases with the aim to identify records that refer to the same entity is occurring increasingly in many application areas. If unique identifiers for the entities are not available in all the databases to be linked, ...
A taxonomy of privacy-preserving record linkage techniques
The process of identifying which records in two or more databases correspond to the same entity is an important aspect of data quality activities such as data pre-processing and data integration. Known as record linkage, data matching or entity ...
Comments