Mosaics in Big Data: Stratosphere, Apache Flink, and Beyond

Author:
Volker Markl

Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TUB), The Intelligent Analytics for Massive Data Department at the German Research Center for Artificial Intelligence (DFKI), Berlin, Germany

DEBS '18: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems, June 2018, Pages 7–13, https://doi.org/10.1145/3210284.3214344

Published: 25 June 2018

ABSTRACT

The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define "big data", i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on various fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies and augments them with concepts drawn from compilers (e.g., iterations) and data stream processing, as well as (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result of these activities was Apache Flink, an open-source big data analytics platform with a thriving global community of developers and production users. Although much progress has been made, a major challenge for the database research community remains when looking at the overall big data stack: how to maintain ease of use despite the increasing heterogeneity and complexity of data analytics. This complexity stems from the specialized engines used for various aspects of an end-to-end data analytics pipeline (including, among others, graph-based, linear algebra-based, and relational-based algorithms) and from the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to a privileged few coveted experts.

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines

Published in

DEBS '18: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems
June 2018
289 pages
ISBN:9781450357821
DOI:10.1145/3210284

Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Apache Flink
big data
data science
declarative languages
federation
heterogeneous data management
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
DEBS '18 Paper Acceptance Rate: 12 of 31 submissions, 39%. Overall Acceptance Rate: 130 of 553 submissions, 24%.