Reasoning on data partitioning for single-round multi-join evaluation in massively parallel systems

Authors:
Tom J. Ameloot

Hasselt University and Transnational University of Limburg, Hasselt, Belgium

Hasselt University and Transnational University of Limburg, Hasselt, Belgium
View Profile

,
Gaetano Geck

TU Dortmund University, Dortmund, Germany

TU Dortmund University, Dortmund, Germany
View Profile

,
Bas Ketsman

Hasselt University and Transnational University of Limburg, Hasselt, Belgium

Hasselt University and Transnational University of Limburg, Hasselt, Belgium
View Profile

,
Frank Neven

Hasselt University and Transnational University of Limburg, Hasselt, Belgium

Hasselt University and Transnational University of Limburg, Hasselt, Belgium
View Profile

,
Thomas Schwentick

TU Dortmund University, Dortmund, Germany

TU Dortmund University, Dortmund, Germany
View Profile

Authors Info & Claims

Communications of the ACM Volume 60 Issue 3March 2017pp 93–100https://doi.org/10.1145/3041063

Published:21 February 2017Publication History

Communications of the ACM

Abstract

Evaluating queries over massive amounts of data is a major challenge in the big data era. Modern massively parallel systems, such as, Spark, organize query answering as a sequence of rounds each consisting of a distinct communication phase followed by a computation phase. The communication phase redistributes data over the available servers, while in the subsequent computation phase each server performs the actual computation on its local data. There is a growing interest in single-round algorithms for evaluating multiway joins where data is first reshuffled over the servers and then evaluated in a parallel but communication-free way. As the amount of communication induced by a reshuffling of the data is a dominating cost in such systems, we introduce a framework for reasoning about data partitioning to detect when we can avoid the data reshuffling step. Specifically, we formalize the decision problems parallel-correctness and transfer of parallel-correctness, provide semantical characterizations, and obtain tight complexity bounds.

References

Afrati, F.N., Ullman, J.D. Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng. 23, 9 (2011), 1282--1298. Google ScholarDigital Library
Ameloot, T.J., Geck, G., Ketsman, B., Neven, F., Schwentick, T. Parallel-correctness and transferability for conjunctive queries, submitted for journal publication (2015).Google Scholar
Arora, S., Barak, B. Computational Complexity -- A Modern Approach. Cambridge University Press, 2009. Google ScholarDigital Library
Beame, P., Koutris, P., Suciu, D. Communication steps for parallel query processing. In Proceedings of the 32nd Symposium on Principles of Database Systems, PODS'13 (2013), 273--284. Google ScholarDigital Library
Beame, P., Koutris, P., Suciu, D. Skew in parallel query processing. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14 (2014), 212--223. Google ScholarDigital Library
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y. A comparison of join algorithms for log processing in mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, A.K. Elmagarmid and D. Agrawal, eds. (Indianapolis, Indiana, USA, June 6--10, 2010). ACM 975--986. Google ScholarDigital Library
Chu, S., Balazinska, M., Suciu, D. From theory to practice: Efficient join query evaluation in a parallel database system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015), 63--78. Google ScholarDigital Library
Dean, J., Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
Ganguly, S., Silberschatz, A., Tsur, S. Parallel bottom-up processing of datalog queries. J. Log. Program. 14, 1&2 (1992), 101--126. Google ScholarDigital Library
Geck, G., Ketsman, B., Neven, F., Schwentick, T. Parallel-correctness and containment for conjunctive queries with union and negation. In International Conference on Database Theory (2016), 9:1--9:17.Google Scholar
Halperin, D., Teixeira de Almeida, V., Choo, L.L., Chu, S., Koutris, P., Moritz, D., Ortiz, J., Ruamviboonsuk, V., Wang, J., Whitaker, A., Xu, S., Balazinska, M., Howe, B., Suciu, D. Demonstration of the Myria big data management service. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD'14 (2014), 881--884. Google ScholarDigital Library
Koutris, P., Suciu, D. Parallel evaluation of conjunctive queries. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, M. Lenzerini and T. Schwentick, eds. (Athens, Greece, June 12--16, 2011). ACM, 223--234. Google ScholarDigital Library
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow. 3, 1--2 (Sept. 2010), 330--339. Google ScholarDigital Library
Mugnier, M., Simonet, G., Thomazo, M. On the complexity of entailment in existential conjunctive first-order logic with atomic negation. Inf. Comput. 215 (2012), 8--31. Google ScholarDigital Library
Nehme, R., Bruno, N. Automated partitioning design in parallel database systems. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD'11 (2011), 1137--1148. Google ScholarDigital Library
Ngo, H.Q., Porat, E., Ré, C., Rudra, A. Worst-case optimal join algorithms. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012 (2012), 37--48. Google ScholarDigital Library
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A. Pig latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, J. Tsong and L. Wang, eds.(Vancouver, BC, Canada, June 10--12, 2008). ACM 1099--1110. Google ScholarDigital Library
Rao, J., Zhang, C., Megiddo, N., Lohman, G. Automating physical database design in a parallel database. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD'02 (2002), 558--569. Google ScholarDigital Library
Shute, J., Vingralek, R., Samwel, B., Handy, B., Whipkey, C., Rollins, E., Oancea, M., Littlefield, K., Menestrina, D., Ellner, S., Cieslewicz, J., Rae, I., Stancescu, T., Apte, H. F1: A distributed sql database that scales. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1068--1079. Google ScholarDigital Library
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R. Hive: A warehousing solution over a map-reduce framework. PVLDB 2, 2 (2009), 1626--1629. Google ScholarDigital Library
Ullman, J.D. Information integration using logical views. Theor. Comput. Sci. 239, 2 (2000), 189--210. Google ScholarDigital Library
Veldhuizen, T.L. Triejoin: A simple, worst-case optimal join algorithm. In Proceedings of the 17th International Conference on Database Theory (ICDT) (2014), 96--106.Google Scholar
Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD'13 (2013), 13--24. Google ScholarDigital Library

Index Terms

Reasoning on data partitioning for single-round multi-join evaluation in massively parallel systems
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
2. Theory of computation
  1. Models of computation
  2. Theory and algorithms for application domains
    1. Database theory
      1. Database query languages (principles)

Recommendations

Data partitioning for single-round multi-join evaluation in massively parallel systems

A dominant cost for query evaluation in modern massively distributed systems is the number of communication rounds. For this reason, there is a growing interest in single-round multiway join algorithms where data is first reshuffled over many servers ...
Read More
Query optimization for massively parallel data processing
SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing

MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. Some vendors have ...
Read More
Massively Parallel Join Algorithms

Due to the rapid development of massively parallel data processing systems such as MapReduce and Spark, there have been revived interests in designing algorithms in a massively parallel computational model. Computing multi-way joins, as one of the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Communications of the ACM Volume 60, Issue 3
March 2017
89 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3055102
Editor:
Moshe Y. Vardi
Association for Computing Machinery, New York, NY
Issue’s Table of Contents
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 February 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 4,543
  Total Downloads
- Downloads (Last 12 months)174
- Downloads (Last 6 weeks)20
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

Reasoning on data partitioning for single-round multi-join evaluation in massively parallel systems

Communications of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Data partitioning for single-round multi-join evaluation in massively parallel systems

Query optimization for massively parallel data processing

Massively Parallel Join Algorithms