ABSTRACT
The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define "big data", i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on various fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies and augments them with concepts drawn from compilers (e.g., iterations) and data stream processing, and (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result of these activities was Apache Flink, an open-source big data analytics platform, together with its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for the database research community still remains: how to maintain ease of use despite the increasing heterogeneity and complexity of data analytics, which involves specialized engines for various aspects of an end-to-end data analytics pipeline (including, among others, graph-based, linear algebra-based, and relational algorithms) as well as the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to a privileged few coveted experts.
Index Terms
- Mosaics in Big Data: Stratosphere, Apache Flink, and Beyond