ABSTRACT
Data wrangling, the multi-faceted process by which the data required by an application is identified, extracted, cleaned and integrated, is often cumbersome and labor-intensive. In this paper, we present an architecture that supports a complete data wrangling lifecycle, orchestrates components dynamically, builds on automation wherever possible, is informed by whatever data is available, refines automatically produced results in the light of feedback, takes into account the user's priorities, and supports data scientists with diverse skill sets. The architecture is demonstrated in practice for wrangling property sales and open government data.
Index Terms
- The VADA Architecture for Cost-Effective Data Wrangling