ABSTRACT
Interest in collaborative dataset versioning has emerged due to the complex, ad-hoc, and collaborative nature of data science, and the need to record and reason about data at various stages of pre-processing, cleaning, and analysis. One critical operation for effective collaborative dataset versioning is differentiation: succinctly describing what has changed from one dataset to the next. Differentiation, or diffing, lets users understand the changes between two versions, follow how a dataset has evolved, and merge versions or detect conflicts across them. We demonstrate DataDiff, a practical and concise data-diff tool that provides human-interpretable explanations of changes between datasets without relying on the operations that led to the changes.
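To make the notion of diffing concrete, the following is a minimal sketch of a naive row-level diff between two versions of a table. It is not DataDiff's method, which the abstract does not detail. The function name `naive_diff`, the assumption that every row carries a unique key column, and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def naive_diff(old: List[Row], new: List[Row], key: str) -> dict:
    """Hypothetical helper: compare two table versions row by row.

    Assumes every row has a unique value in the `key` column.
    Uses only the two versions, not the operations that produced them.
    """
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows whose key appears in only one of the two versions.
    inserted = [row for k, row in new_by_key.items() if k not in old_by_key]
    deleted = [row for k, row in old_by_key.items() if k not in new_by_key]

    # Rows present in both versions with at least one differing cell.
    updated = []
    for k, old_row in old_by_key.items():
        new_row = new_by_key.get(k)
        if new_row is None or new_row == old_row:
            continue
        columns = sorted(set(old_row) | set(new_row))
        changed = {
            c: (old_row.get(c), new_row.get(c))
            for c in columns
            if old_row.get(c) != new_row.get(c)
        }
        updated.append((k, changed))

    return {"inserted": inserted, "deleted": deleted, "updated": updated}


# Illustrative data only.
v1 = [
    {"id": 1, "salary": 100},
    {"id": 2, "salary": 200},
    {"id": 3, "salary": 300},
]
v2 = [
    {"id": 1, "salary": 110},
    {"id": 2, "salary": 200},
    {"id": 4, "salary": 400},
]

result = naive_diff(v1, v2, key="id")
print(result)
```

A listing like this grows with the number of changed rows and cells. The abstract's goal of a succinct, human-interpretable explanation is a different and harder target than this cell-by-cell enumeration.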
Index Terms
- DataDiff: User-Interpretable Data Transformation Summaries for Collaborative Data Analysis