Cleaning organizational data of discrepancies in structure and content is important for data warehousing and Enterprise Data Integration (EDI). Current commercial solutions for data cleaning involve many iterations of time-consuming "data quality" analysis to find errors, and long-running transformations to fix them. Users must endure long waits and often write complex transformation programs. We present an interactive framework for data cleaning that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. In the background, the system incrementally searches for discrepancies in the latest transformed version of the data, flagging them as they are found. This allows users to gradually construct a transformation as discrepancies are found, and to clean the data without writing complex programs or enduring long delays. Balancing the goals of power, ease of specification, and interactive application, we choose a set of transforms that can be used for transformations within data records as well as for higher-order transformations. We also present initial work on optimizing a sequence of transforms.
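The interaction described above (add or undo transforms, see the effect immediately on visible records, and flag discrepancies against the latest transformed data) can be illustrated with a minimal sketch. This is not the system's implementation: the class `TransformSession`, the helper `split_column`, and the validity check passed to `discrepancies` are all hypothetical names chosen for illustration, and the sketch assumes records are plain lists of strings.

```python
from typing import Callable, Iterator, List

Record = List[str]
Transform = Callable[[Record], Record]


class TransformSession:
    """Hypothetical session: an ordered list of transforms with undo."""

    def __init__(self, records: List[Record]) -> None:
        self.records = records  # original data, never mutated
        self.transforms: List[Transform] = []

    def add(self, transform: Transform) -> None:
        self.transforms.append(transform)

    def undo(self) -> None:
        # Undo is cheap because the original records are kept untouched.
        if self.transforms:
            self.transforms.pop()

    def apply(self, record: Record) -> Record:
        for transform in self.transforms:
            record = transform(record)
        return record

    def visible(self, start: int, count: int) -> List[Record]:
        # Transform only the rows currently on screen.
        return [self.apply(r) for r in self.records[start:start + count]]

    def discrepancies(
        self, column: int, is_valid: Callable[[str], bool]
    ) -> Iterator[int]:
        # Lazily yield row indices whose transformed value fails the check,
        # so a caller can consume them incrementally.
        for i, record in enumerate(self.records):
            out = self.apply(record)
            if column >= len(out) or not is_valid(out[column]):
                yield i


def split_column(index: int, sep: str) -> Transform:
    """Hypothetical transform: split one column into two at the first `sep`."""

    def transform(record: Record) -> Record:
        parts = record[index].split(sep, 1)
        if len(parts) == 1:
            parts.append("")  # pad so every record gains exactly one column
        return record[:index] + [p.strip() for p in parts] + record[index + 1:]

    return transform


session = TransformSession([["Smith, Ann", "1999"], ["Bob Jones", "n/a"]])
session.add(split_column(0, ","))
print(session.visible(0, 2))
print(list(session.discrepancies(2, str.isdigit)))
```

Because transforms are stored as a sequence and applied on demand, undoing the last step only pops it from the list; the generator in `discrepancies` stands in for a background search that reports problems as it finds them.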