Abstract
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
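The abstract's central mechanism can be illustrated with a minimal sketch of deferred evaluation: operations on a collection are recorded into a plan rather than executed, and only when results are demanded is the plan run (a real system like FlumeJava would first optimize the plan, e.g., fusing consecutive steps into a single pass). All class and method names below (`DeferredCollection`, `map`, `run`) are illustrative stand-ins, not FlumeJava's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of a deferred parallel collection: map() records an operation
// into an execution plan; run() executes the accumulated plan.
final class DeferredCollection<T> {
    private final List<?> source;                      // original input data
    private final List<Function<Object, Object>> plan; // deferred operations

    DeferredCollection(List<T> source) {
        this.source = source;
        this.plan = new ArrayList<>();
    }

    private DeferredCollection(List<?> source, List<Function<Object, Object>> plan) {
        this.source = source;
        this.plan = plan;
    }

    // parallelDo-style transform: the function is recorded, not applied.
    @SuppressWarnings("unchecked")
    <R> DeferredCollection<R> map(Function<? super T, ? extends R> f) {
        List<Function<Object, Object>> next = new ArrayList<>(plan);
        next.add(x -> f.apply((T) x));
        return new DeferredCollection<>(source, next);
    }

    // Demand results: a real system would optimize the plan here (e.g.,
    // fuse the recorded maps) before running it on parallel primitives.
    // This sketch already evaluates all recorded ops in one pass per item.
    @SuppressWarnings("unchecked")
    List<T> run() {
        List<T> out = new ArrayList<>();
        for (Object item : source) {
            Object v = item;
            for (Function<Object, Object> op : plan) v = op.apply(v);
            out.add((T) v);
        }
        return out;
    }
}

public class Demo {
    public static void main(String[] args) {
        DeferredCollection<Integer> nums =
            new DeferredCollection<>(List.of(1, 2, 3));
        // Neither map executes anything yet; run() triggers evaluation.
        List<Integer> result = nums.map(x -> x * 10).map(x -> x + 1).run();
        System.out.println(result); // prints [11, 21, 31]
    }
}
```

The key design point mirrored here is that composing operations is cheap and side-effect free, which gives the library freedom to rearrange and fuse them before any data is touched.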