Data Quality and Record Linkage Techniques
May 2007
Publisher: Springer Publishing Company, Incorporated
ISBN: 978-0-387-69502-0
Published: 15 May 2007
Pages: 234
Abstract

This book helps practitioners gain a deeper understanding, at an applied level, of the issues involved in improving data quality through editing, imputation, and record linkage. The first part of the book deals with methods and models, focusing on the Fellegi-Holt edit-imputation model, the Little-Rubin multiple-imputation scheme, and the Fellegi-Sunter record linkage model. Brief examples show how these techniques work. In the second part of the book, the authors present real-world case studies in which one or more of these techniques are used. The case studies cover a wide variety of application areas, including mortgage guarantee insurance, medicine, biomedicine, highway safety, and social insurance, as well as the construction of list frames and administrative lists. Readers will find this book a mixture of practical advice, mathematical rigor, management insight, and philosophy. The long list of references at the end of the book enables readers to delve more deeply into the subjects discussed. The authors also discuss the software that has been developed to apply the techniques described in the text.

Cited By

  1. Carvalho T, Moniz N, Faria P and Antunes L (2023). Survey on Privacy-Preserving Techniques for Microdata Publication, ACM Computing Surveys, 55:14s, (1-42), Online publication date: 31-Dec-2024.
  2. Ter Hofstede A, Koschmider A, Marrella A, Andrews R, Fischer D, Sadeghianasl S, Wynn M, Comuzzi M, De Weerdt J, Goel K, Martin N and Soffer P (2023). Process-Data Quality: The True Frontier of Process Mining, Journal of Data and Information Quality, 15:3, (1-21), Online publication date: 30-Sep-2023.
  3. Li S, Schneider M, Yu Y and Gupta S (2023). Reidentification Risk in Panel Data, Information Systems Research, 34:3, (1066-1088), Online publication date: 1-Sep-2023.
  4. Kirielle N, Christen P and Ranbaduge T (2022). Unsupervised Graph-Based Entity Resolution for Complex Entities, ACM Transactions on Knowledge Discovery from Data, 17:1, (1-30), Online publication date: 28-Feb-2023.
  5. Peng G, Liu C, Talaei-Khoei A and Storey V (2023). A Review of the State of the Art of Data Quality in Healthcare, Journal of Global Information Management, 31:1, (1-18), Online publication date: 3-Feb-2023.
  6. Ilyas I and Rekatsinas T (2022). Machine Learning and Data Cleaning: Which Serves the Other?, Journal of Data and Information Quality, 14:3, (1-11), Online publication date: 30-Sep-2022.
  7. Wu J, Hiltabrand R, Soós D and Giles C. Scholarly big data quality assessment, Proceedings of the 22nd ACM Symposium on Document Engineering, (1-4)
  8. Kwiek M and Roszka W (2022). Are female scientists less inclined to publish alone? The gender solo research gap, Scientometrics, 127:4, (1697-1735), Online publication date: 1-Apr-2022.
  9. Hellings J and Sadoghi M (2021). ByShard, Proceedings of the VLDB Endowment, 14:11, (2230-2243), Online publication date: 1-Jul-2021.
  10. Barlaug N and Gulla J (2021). Neural Networks for Entity Matching: A Survey, ACM Transactions on Knowledge Discovery from Data, 15:3, (1-37), Online publication date: 30-Jun-2021.
  11. Abu Ahmad H and Wang H (2020). Automatic weighted matching rectifying rule discovery for data repairing, The VLDB Journal — The International Journal on Very Large Data Bases, 29:6, (1433-1447), Online publication date: 1-Nov-2020.
  12. Koumarelas I, Jiang L and Naumann F (2020). Data Preparation for Duplicate Detection, Journal of Data and Information Quality, 12:3, (1-24), Online publication date: 30-Sep-2020.
  13. Kang N, Kim J, On B and Lee I (2020). A node resistance-based probability model for resolving duplicate named entities, Scientometrics, 124:3, (1721-1743), Online publication date: 1-Sep-2020.
  14. Gupta S, Hellings J, Rahnama S and Sadoghi M (2020). Building high throughput permissioned blockchain fabrics, Proceedings of the VLDB Endowment, 13:12, (3441-3444), Online publication date: 1-Aug-2020.
  15. Wu R, Chaba S, Sawlani S, Chu X and Thirumuruganathan S. ZeroER: Entity Resolution using Zero Labeled Examples, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, (1149-1164)
  16. Powell B and Smith P (2019). Computing expectations and marginal likelihoods for permutations, Computational Statistics, 35:2, (871-891), Online publication date: 1-Jun-2020.
  17. Kimelfeld B and Martens W (2019). Technical Perspective, ACM SIGMOD Record, 48:1, (23-23), Online publication date: 5-Nov-2019.
  18. Zhang Y, Ng K, Churchill T and Christen P. Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases, Proceedings of the 27th ACM International Conference on Information and Knowledge Management, (2213-2221)
  19. Koumarelas I, Kroschk A, Mosley C and Naumann F (2018). Experience, Journal of Data and Information Quality, 10:2, (1-16), Online publication date: 13-Sep-2018.
  20. Hao S, Tang N, Li G, Li J and Feng J (2018). Distilling relations using knowledge bases, The VLDB Journal — The International Journal on Very Large Data Bases, 27:4, (497-519), Online publication date: 1-Aug-2018.
  21. Chen J and Zhang Q. Distinct Sampling on Streaming Data with Near-Duplicates, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, (369-382)
  22. Bertossi L and Milani M (2018). Ontological Multidimensional Data Models and Contextual Data Quality, Journal of Data and Information Quality, 9:3, (1-36), Online publication date: 15-Mar-2018.
  23. Wu J, Sefid A, Ge A and Giles C. A Supervised Learning Approach To Entity Matching Between Scholarly Big Datasets, Proceedings of the 9th Knowledge Capture Conference, (1-4)
  24. Reyes-Galaviz O, Pedrycz W, He Z and Pizzi N (2017). A supervised gradient-based learning algorithm for optimized entity resolution, Data & Knowledge Engineering, 112:C, (106-129), Online publication date: 1-Nov-2017.
  25. Wang J and Tang N (2017). Dependable Data Repairing with Fixing Rules, Journal of Data and Information Quality, 8:3-4, (1-34), Online publication date: 17-Jul-2017.
  26. Ding X, Wang H, Gao Y, Li J and Gao H. Determining the currency of dynamic data, Proceedings of the ACM Turing 50th Celebration Conference - China, (1-6)
  27. Bahmani Z, Bertossi L and Vasiloglou N (2017). ERBlox, International Journal of Approximate Reasoning, 83:C, (118-141), Online publication date: 1-Apr-2017.
  28. Altwaijry H, Kalashnikov D and Mehrotra S (2017). QDA, IEEE Transactions on Knowledge and Data Engineering, 29:2, (402-417), Online publication date: 1-Feb-2017.
  29. Chen D and Zhang Q. Streaming Algorithms for Robust Distinct Elements, Proceedings of the 2016 International Conference on Management of Data, (1433-1447)
  30. Wang H, Li M, Bu Y, Li J, Gao H and Zhang J (2016). Cleanix, ACM SIGMOD Record, 44:4, (35-40), Online publication date: 9-May-2016.
  31. Mesiti M. MergeGraphs, Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services, (1-10)
  32. Fan W (2015). Data Quality, ACM SIGMOD Record, 44:3, (7-18), Online publication date: 3-Dec-2015.
  33. Tiedeken J, Bauer T, Herbst J and Reichert M. Determining the Quality of Product Data Integration, Proceedings of the Confederated International Conferences on On the Move to Meaningful Internet Systems: OTM 2015 Conferences - Volume 9415, (267-284)
  34. Zhang Q. Communication-Efficient Computation on Distributed Noisy Datasets, Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures, (313-322)
  35. Wang H, Li M, Bu Y, Li J, Gao H and Zhang J. Cleanix, Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, (2024-2026)
  36. Winkler W (2014). Matching and record linkage, WIREs Computational Statistics, 6:5, (313-325), Online publication date: 18-Aug-2014.
  37. Wang J and Tang N. Towards dependable data repairing with fixing rules, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, (457-468)
  38. Fan W, Ma S, Tang N and Yu W (2014). Interaction between Record Matching and Data Repairing, Journal of Data and Information Quality, 4:4, (1-38), Online publication date: 1-May-2014.
  39. Baral C and Vo N. Event-Object Reasoning with Curated Knowledge Bases, Proceedings of the 12th International Conference on Logic Programming and Nonmonotonic Reasoning - Volume 8148, (161-167)
  40. Verykios V and Christen P (2013). Privacy-preserving record linkage, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3:5, (321-332), Online publication date: 1-Sep-2013.
  41. Xie H, Wang H, Li J and Gao H. A data cleaning framework based on user feedback, Proceedings of the 14th international conference on Web-Age Information Management, (514-520)
  42. Ramadan B, Christen P, Liang H, Gayler R and Hawking D. Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution, Revised Selected Papers of PAKDD 2013 International Workshops on Trends and Applications in Knowledge Discovery and Data Mining - Volume 7867, (47-58)
  43. Kuzu M, Kantarcioglu M, Inan A, Bertino E, Durham E and Malin B. Efficient privacy-aware record integration, Proceedings of the 16th International Conference on Extending Database Technology, (167-178)
  44. Zhou Y, Nelson E, Kobayashi F and Talburt J (2013). A Graduate-Level Course on Entity Resolution and Information Quality, Journal of Data and Information Quality, 4:2, (1-10), Online publication date: 1-Mar-2013.
  45. Sariyar M and Borg A (2012). Bagging, bumping, multiview, and active learning for record linkage with empirical results on patient identity data, Computer Methods and Programs in Biomedicine, 108:3, (1160-1169), Online publication date: 1-Dec-2012.
  46. Vatsalan D, Christen P and Verykios V. An efficient two-party protocol for approximate matching in private record linkage, Proceedings of the Ninth Australasian Data Mining Conference - Volume 121, (125-136)
  47. Fan W, Li J, Ma S, Tang N and Yu W (2011). CerFix, Proceedings of the VLDB Endowment, 4:12, (1375-1378), Online publication date: 1-Aug-2011.
  48. Fan W, Li J, Ma S, Tang N and Yu W. Interaction between record matching and data repairing, Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, (469-480)
  49. Zhou Y and Talburt J (2011). Staging a realistic entity resolution challenge for students, Journal of Computing Sciences in Colleges, 26:5, (88-95), Online publication date: 1-May-2011.
  50. Fan W and Geerts F (2010). Relative information completeness, ACM Transactions on Database Systems, 35:4, (1-44), Online publication date: 1-Nov-2010.
  51. Hall R and Fienberg S. Privacy-preserving record linkage, Proceedings of the 2010 international conference on Privacy in statistical databases, (269-283)
  52. Fan W, Li J, Ma S, Tang N and Yu W (2010). Towards certain fixes with editing rules and master data, Proceedings of the VLDB Endowment, 3:1-2, (173-184), Online publication date: 1-Sep-2010.
  53. Baumgartner N, Gottesheim W, Mitsch S, Retschitzegger W and Schwinger W. "Same, Same but Different" A Survey on Duplicate Detection Methods for Situation Awareness, Proceedings of the Confederated International Conferences, CoopIS, DOA, IS, and ODBASE 2009 on On the Move to Meaningful Internet Systems: Part II, (1050-1068)
  54. Rodic J and Baranovic M. Generating data quality rules and integration into ETL process, Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP, (65-72)
  55. Sokolova M, El Emam K, Rose S, Chowdhury S, Neri E, Jonker E and Peyton L. Personal health information leak prevention in heterogeneous texts, Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains, (58-69)
  56. Whang S, Menestrina D, Koutrika G, Theobald M and Garcia-Molina H. Entity resolution with iterative blocking, Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, (219-232)
Contributors
  • U.S. Census Bureau
  • U.S. Census Bureau
