skip to main content
research-article

FlumeJava: easy, efficient data-parallel pipelines

Published:05 June 2010Publication History
Skip Abstract Section

Abstract

MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.

References

  1. Cascading. http://www.cascading.org.Google ScholarGoogle Scholar
  2. Hadoop. http://hadoop.apache.org.Google ScholarGoogle Scholar
  3. Pig. http://hadoop.apache.org/pig.Google ScholarGoogle Scholar
  4. R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1 (2), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Dean. Experiences with MapReduce, an abstraction for large-scale computation. In PACT, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51, no. 1, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. H. Halstead Jr. New ideas in parallel Lisp: Language design, implementation, and programming tools. In Workshop on Parallel Lisp, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. R. Larus. C**: A large-grain, object-oriented, data-parallel programming language. In LCPC, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Lasser and S. M. Omohundro. The essential Star-lisp manual. Technical Report 86.15, Thinking Machines, Inc., 1986.Google ScholarGoogle Scholar
  14. E. Meijer, B. Beckman, and G. Bierman. LINQ: reconciling objects, relations and XML in the .NET framework. In SIGMOD Conference, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. S. Nikhil and Arvind. Implicit Parallel Programming in pH. Academic Press, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD Conference, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13 (4), 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. R. Rose and G. L. Steele Jr. C*: An extended C language. In C Workshop, 1987.Google ScholarGoogle Scholar
  19. H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD Conference, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. FlumeJava: easy, efficient data-parallel pipelines

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 45, Issue 6
        PLDI '10
        June 2010
        496 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/1809028
        Issue’s Table of Contents
        • cover image ACM Conferences
          PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation
          June 2010
          514 pages
          ISBN:9781450300193
          DOI:10.1145/1806596

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 June 2010

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader