DOI: 10.1145/3210284.3214344

Mosaics in Big Data: Stratosphere, Apache Flink, and Beyond

Published: 25 June 2018

ABSTRACT

The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define "big data", i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on several fronts. Our contributions include: (1) establishing a vision for a database-inspired big data analytics system that unifies the best of database and distributed systems technologies and augments them with concepts drawn from compilers (e.g., iterations) and data stream processing, and (2) forming a community of researchers and institutions to create the Stratosphere platform to realize this vision. One major result of these activities was Apache Flink, an open-source big data analytics platform with a thriving global community of developers and production users. Although much progress has been made, a major challenge for the database research community remains when looking at the overall big data stack: how to maintain ease of use despite the increasing heterogeneity and complexity of data analytics, which involves specialized engines for various aspects of an end-to-end data analytics pipeline (including, among others, graph-based, linear algebra-based, and relational algorithms) as well as increasingly heterogeneous underlying hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to a privileged few coveted experts.


Published in: DEBS '18: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems, June 2018, 289 pages
ISBN: 9781450357821
DOI: 10.1145/3210284

              Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article, Research, Refereed limited

Acceptance Rates

DEBS '18 paper acceptance rate: 12 of 31 submissions, 39%. Overall acceptance rate: 130 of 553 submissions, 24%.
