DOI: 10.1145/2882903.2882957

SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment

Published: 14 June 2016

Editorial Notes

Computationally Replicable. The experimental results of this paper were replicated by a SIGMOD Review Committee and were found to support the central results reported in the paper.

ABSTRACT

We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.



              Published in

              SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
              June 2016, 2300 pages
              ISBN: 9781450335317
              DOI: 10.1145/2882903

              Copyright © 2016 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher: Association for Computing Machinery, New York, NY, United States


              Qualifiers

              • research-article

              Acceptance Rates

              Overall Acceptance Rate: 785 of 4,003 submissions, 20%
