DOI: 10.1145/3035918.3058730
short-paper

The VADA Architecture for Cost-Effective Data Wrangling

Published: 09 May 2017

ABSTRACT

Data wrangling, the multi-faceted process by which the data required by an application is identified, extracted, cleaned and integrated, is often cumbersome and labor intensive. In this paper, we present an architecture that supports a complete data wrangling lifecycle, orchestrates components dynamically, builds on automation wherever possible, is informed by whatever data is available, refines automatically produced results in the light of feedback, takes into account the user's priorities, and supports data scientists with diverse skill sets. The architecture is demonstrated in practice for wrangling property sales and open government data.


      Published in
      SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
      May 2017
      1810 pages
      ISBN:9781450341974
      DOI:10.1145/3035918

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
