skip to main content
research-article
Open Access

Incremental, iterative data processing with timely dataflow

Published:22 September 2016Publication History
Skip Abstract Section

Abstract

We describe the timely dataflow model for distributed computation and its implementation in the Naiad system. The model supports stateful iterative and incremental computations. It enables both low-latency stream processing and high-throughput batch processing, using a new approach to coordination that combines asynchronous and fine-grained synchronous execution. We describe two of the programming frameworks built on Naiad: GraphLINQ for parallel graph processing, and differential dataflow for nested iterative and incremental computations. We show that a general-purpose system can achieve performance that matches, and sometimes exceeds, that of specialized systems.

References

  1. Abadi, M., Isard, M. Timely dataflow: A model. In Proc. FORTE (2015), 131--145.Google ScholarGoogle Scholar
  2. Abadi, M., Isard, M. Timely rollback: Specification and verification. In Proc. NASA Formal Methods (April 2015), 19--34.Google ScholarGoogle ScholarCross RefCross Ref
  3. Akidau, T., Balikov, A., Bekiroğlu, K., Chernyak, S., Haberman, J., Lax, R., McVeety, S., Mills, D., Nordstrom, P., Whittle, S. MillWheel: Fault-tolerant stream processing at internet scale. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1033--1044. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Chandramouli, B., Goldstein, J., Maier, D. On-the-fly progress detection in iterative stream queries. Proc. VLDB Endow. 2, 1 (Aug. 2009), 241--252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E. Bigtable: A distributed storage system for structured data. In Proc. OSDI (Nov. 2006), 205--218. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Dean, J., Ghemawat, S. Mapreduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. DeWitt, D., Gray, J. Parallel database systems: The future of high performance database systems. Commun. ACM 35, 6 (June 1992), 85--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Ewen, S., Tzoumas, K., Kaufmann, M., Markl, V. Spinning fast iterative data flows. Proc. VLDB Endow. 5, 11 (July 2012), 1268--1279. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Gog, I., Giceva, J., Schwarzkopf, M., Vaswani, K., Vytiniotis, D., Ramalingam, G., Costa, M., Murray, D.G., Hand, S., Isard, M. Broom: Sweeping out garbage collection from big data systems. In Proc. HotOS (May 2015). Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Gog, I., Schwarzkopf, M., Crooks, N., Grosvenor, M.P., Clement, A., Hand, S. Musketeer: All for one, one for all in data processing systems. In Proc. EuroSys (Apr. 2015). Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proc. OSDI (Oct. 2012), 17--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In Proc. OSDI (Oct. 2014), 599--613. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. EuroSys (Mar. 2007), 59--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Lee, E., Messerschmitt, D.G. Synchronous data flow. Proc. IEEE 75, 9 (1987), 1235--1245.Google ScholarGoogle ScholarCross RefCross Ref
  15. Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proc. OSDI (Oct. 2014), 583--598. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. McSherry, F., Isard, M., Murray, D.G. Scalability! But at what COST? In Proc. HotOS (May 2015). Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. McSherry, F., Murray, D.G., Isaacs, R., Isard, M. Differential dataflow. In Proc. CIDR (Jan. 2013).Google ScholarGoogle Scholar
  18. Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow. Proc. VLDB Endow. 3, 1--2 (Sep. 2010), 330--339. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Murray, D.G., McSherry, F., Isaacs, R., Isard, M., Barham, P., Abadi, M. Naiad: A timely dataflow system. In Proc. SOSP (Nov. 2013), 439--455. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Murray, D.G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., Hand, S. CIEL: A universal execution engine for distributed data-flow computing. In Proc. NSDI (Mar. 2011), 113--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Peng, D., Dabek, F. Large-scale incremental processing using distributed transactions and notifications. In Proc. OSDI (Oct. 2010), 251--264. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Sousa, M., Dillig, I., Vytiniotis, D., Dillig, T., Gkantsidis, C. Consolidation of queries with user-defined functions. In Proc. PLDI (June 2014), 554--564. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Tel, G., Mattern, F. The derivation of distributed termination detection algorithms from garbage collection schemes. ACM Trans. Program. Lang. Syst. 15, 1 (Jan. 1993), 1--35. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Tucker, P.A., Maier, D., Sheard, T., Fegaras, L. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge Data Eng. 15, 3 (2003), 555--568. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Yu, Y., Gunda, P.K., Isard, M. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Proc. SOSP (Oct. 2009), 247--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. OSDI (Dec. 2008), 1--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M., Shenker, S., Stoica, I. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. NSDI (Apr. 2012). Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Incremental, iterative data processing with timely dataflow

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Communications of the ACM
          Communications of the ACM  Volume 59, Issue 10
          October 2016
          85 pages
          ISSN:0001-0782
          EISSN:1557-7317
          DOI:10.1145/3001840
          • Editor:
          • Moshe Y. Vardi
          Issue’s Table of Contents

          Copyright © 2016 Owner/Author

          This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 September 2016

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format .

        View HTML Format