skip to main content
research-article
Free Access

Reasoning on data partitioning for single-round multi-join evaluation in massively parallel systems

Authors Info & Claims
Published:21 February 2017Publication History
Skip Abstract Section

Abstract

Evaluating queries over massive amounts of data is a major challenge in the big data era. Modern massively parallel systems, such as, Spark, organize query answering as a sequence of rounds each consisting of a distinct communication phase followed by a computation phase. The communication phase redistributes data over the available servers, while in the subsequent computation phase each server performs the actual computation on its local data. There is a growing interest in single-round algorithms for evaluating multiway joins where data is first reshuffled over the servers and then evaluated in a parallel but communication-free way. As the amount of communication induced by a reshuffling of the data is a dominating cost in such systems, we introduce a framework for reasoning about data partitioning to detect when we can avoid the data reshuffling step. Specifically, we formalize the decision problems parallel-correctness and transfer of parallel-correctness, provide semantical characterizations, and obtain tight complexity bounds.

References

  1. Afrati, F.N., Ullman, J.D. Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng. 23, 9 (2011), 1282--1298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Ameloot, T.J., Geck, G., Ketsman, B., Neven, F., Schwentick, T. Parallel-correctness and transferability for conjunctive queries, submitted for journal publication (2015).Google ScholarGoogle Scholar
  3. Arora, S., Barak, B. Computational Complexity -- A Modern Approach. Cambridge University Press, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Beame, P., Koutris, P., Suciu, D. Communication steps for parallel query processing. In Proceedings of the 32nd Symposium on Principles of Database Systems, PODS'13 (2013), 273--284. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Beame, P., Koutris, P., Suciu, D. Skew in parallel query processing. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14 (2014), 212--223. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y. A comparison of join algorithms for log processing in mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, A.K. Elmagarmid and D. Agrawal, eds. (Indianapolis, Indiana, USA, June 6--10, 2010). ACM 975--986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chu, S., Balazinska, M., Suciu, D. From theory to practice: Efficient join query evaluation in a parallel database system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015), 63--78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Dean, J., Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Ganguly, S., Silberschatz, A., Tsur, S. Parallel bottom-up processing of datalog queries. J. Log. Program. 14, 1&2 (1992), 101--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Geck, G., Ketsman, B., Neven, F., Schwentick, T. Parallel-correctness and containment for conjunctive queries with union and negation. In International Conference on Database Theory (2016), 9:1--9:17.Google ScholarGoogle Scholar
  11. Halperin, D., Teixeira de Almeida, V., Choo, L.L., Chu, S., Koutris, P., Moritz, D., Ortiz, J., Ruamviboonsuk, V., Wang, J., Whitaker, A., Xu, S., Balazinska, M., Howe, B., Suciu, D. Demonstration of the Myria big data management service. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD'14 (2014), 881--884. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Koutris, P., Suciu, D. Parallel evaluation of conjunctive queries. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, M. Lenzerini and T. Schwentick, eds. (Athens, Greece, June 12--16, 2011). ACM, 223--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow. 3, 1--2 (Sept. 2010), 330--339. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Mugnier, M., Simonet, G., Thomazo, M. On the complexity of entailment in existential conjunctive first-order logic with atomic negation. Inf. Comput. 215 (2012), 8--31. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Nehme, R., Bruno, N. Automated partitioning design in parallel database systems. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD'11 (2011), 1137--1148. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Ngo, H.Q., Porat, E., Ré, C., Rudra, A. Worst-case optimal join algorithms. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012 (2012), 37--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A. Pig latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, J. Tsong and L. Wang, eds.(Vancouver, BC, Canada, June 10--12, 2008). ACM 1099--1110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Rao, J., Zhang, C., Megiddo, N., Lohman, G. Automating physical database design in a parallel database. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD'02 (2002), 558--569. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Shute, J., Vingralek, R., Samwel, B., Handy, B., Whipkey, C., Rollins, E., Oancea, M., Littlefield, K., Menestrina, D., Ellner, S., Cieslewicz, J., Rae, I., Stancescu, T., Apte, H. F1: A distributed sql database that scales. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1068--1079. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R. Hive: A warehousing solution over a map-reduce framework. PVLDB 2, 2 (2009), 1626--1629. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Ullman, J.D. Information integration using logical views. Theor. Comput. Sci. 239, 2 (2000), 189--210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Veldhuizen, T.L. Triejoin: A simple, worst-case optimal join algorithm. In Proceedings of the 17th International Conference on Database Theory (ICDT) (2014), 96--106.Google ScholarGoogle Scholar
  23. Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD'13 (2013), 13--24. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Reasoning on data partitioning for single-round multi-join evaluation in massively parallel systems

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Communications of the ACM
          Communications of the ACM  Volume 60, Issue 3
          March 2017
          89 pages
          ISSN:0001-0782
          EISSN:1557-7317
          DOI:10.1145/3055102
          • Editor:
          • Moshe Y. Vardi
          Issue’s Table of Contents

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 21 February 2017

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format .

        View HTML Format