Data Cleaning: A Practical Perspective
September 2013
Publisher: Morgan & Claypool Publishers
ISBN: 978-1-60845-677-2
Published: 01 September 2013
Pages: 86
Abstract

Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merging data collected independently across different databases. These errors in data warehouses often result in erroneous upstream reports, and could impact business decisions negatively. Therefore, one of the critical challenges while maintaining large data warehouses is that of ensuring the quality of data in the data warehouse remains high. The process of maintaining high data quality is commonly referred to as data cleaning. In this book, we first discuss the goals of data cleaning. Often, the goals of data cleaning are not well defined and could mean different solutions in different scenarios. Toward clarifying these goals, we abstract out a common set of data cleaning tasks that often need to be addressed. This abstraction allows us to develop solutions for these common data cleaning tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform. The operator-centric approach involves the development of customizable operators that could be used as building blocks for developing common solutions. This is similar to the approach of relational algebra for query processing. The basic set of operators can be put together to build complex queries. Finally, we discuss the development of custom scripts which leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks. 
Table of Contents: Preface / Acknowledgments / Introduction / Technological Approaches / Similarity Functions / Operator: Similarity Join / Operator: Clustering / Operator: Parsing / Task: Record Matching / Task: Deduplication / Data Cleaning Scripts / Conclusion / Bibliography / Authors' Biographies
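
To make the operator-centric idea in the abstract concrete, here is a minimal sketch of how small, customizable operators might compose into a deduplication task. It is an illustration only, not the book's implementation: the function names (`jaccard`, `similarity_join`, `cluster`, `deduplicate`), the token-based Jaccard similarity, the naive all-pairs join, and the connected-components clustering are all assumptions chosen for brevity.

```python
from itertools import combinations


def tokens(s):
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return set(s.lower().split())


def jaccard(a, b):
    """Similarity function: Jaccard overlap of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_join(records, threshold, sim=jaccard):
    """Operator: return index pairs (i, j), i < j, whose similarity is at
    least `threshold`. Naive all-pairs comparison, for illustration only."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if sim(a, b) >= threshold
    ]


def cluster(n, pairs):
    """Operator: group indices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def deduplicate(records, threshold=0.5):
    """Task: compose the join and clustering operators into a tiny script."""
    pairs = similarity_join(records, threshold)
    return [[records[i] for i in group] for group in cluster(len(records), pairs)]
```

Each operator can be swapped independently (a different similarity function, an indexed join, a stricter clustering rule) without touching the others, which is the analogy with relational algebra that the abstract draws.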

Cited By

  1. Clemen T, Ahmady-Moghaddam N, Lenfers U, Ocker F, Osterholz D, Ströbele J and Glake D. Multi-Agent Systems and Digital Twins for Smarter Cities, Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, (45-55)
  2. Glake D, Ritter N, Ocker F, Ahmady-Moghaddam N, Osterholz D, Lenfers U and Clemen T. Hierarchical Semantics Matching For Heterogeneous Spatio-temporal Sources, Proceedings of the 30th ACM International Conference on Information & Knowledge Management, (565-575)
  3. Sintos S, Agarwal P and Yang J (2019). Selecting data to clean for fact checking, Proceedings of the VLDB Endowment, 12:13, (2408-2421), Online publication date: 1-Sep-2019.
  4. Deng D, Kim A, Madden S and Stonebraker M (2017). SilkMoth, Proceedings of the VLDB Endowment, 10:10, (1082-1093), Online publication date: 1-Jun-2017.
  5. Timonin A, Bozhday A and Bershadsky A. Research of filtration methods for reference social profile data, Proceedings of the International Conference on Electronic Governance and Open Society: Challenges in Eurasia, (189-193)
  6. Burdick D, Fagin R, Kolaitis P, Popa L and Tan W (2016). A Declarative Framework for Linking Entities, ACM Transactions on Database Systems, 41:3, (1-38), Online publication date: 8-Aug-2016.
  7. Koehler H and Link S. Qualitative Cleaning of Uncertain Data, Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, (2269-2274)
  8. Fan W (2015). Data Quality, ACM SIGMOD Record, 44:3, (7-18), Online publication date: 3-Dec-2015.
  9. Bergman M, Milo T, Novgorodov S and Tan W. Query-Oriented Data Cleaning with Oracles, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, (1199-1214)
Contributors
  • Google LLC
