Editorial Notes
Computationally Replicable. The experimental results of this paper were replicated by a SIGMOD Review Committee and were found to support the central results reported in the paper.
ABSTRACT
We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.
Index Terms
- SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment