Abstract
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
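The abstract's central mechanism can be illustrated with a minimal sketch of deferred evaluation: operations on a collection are recorded into a plan rather than executed, and only when results are demanded is the plan run (a real system like FlumeJava would first optimize the plan, e.g., fusing consecutive steps into a single pass). All class and method names below (`DeferredCollection`, `map`, `run`) are illustrative stand-ins, not FlumeJava's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of a deferred parallel collection: map() records an operation
// into an execution plan; run() executes the accumulated plan.
final class DeferredCollection<T> {
    private final List<?> source;                      // original input data
    private final List<Function<Object, Object>> plan; // deferred operations

    DeferredCollection(List<T> source) {
        this.source = source;
        this.plan = new ArrayList<>();
    }

    private DeferredCollection(List<?> source, List<Function<Object, Object>> plan) {
        this.source = source;
        this.plan = plan;
    }

    // parallelDo-style transform: the function is recorded, not applied.
    @SuppressWarnings("unchecked")
    <R> DeferredCollection<R> map(Function<? super T, ? extends R> f) {
        List<Function<Object, Object>> next = new ArrayList<>(plan);
        next.add(x -> f.apply((T) x));
        return new DeferredCollection<>(source, next);
    }

    // Demand results: a real system would optimize the plan here (e.g.,
    // fuse the recorded maps) before running it on parallel primitives.
    // This sketch already evaluates all recorded ops in one pass per item.
    @SuppressWarnings("unchecked")
    List<T> run() {
        List<T> out = new ArrayList<>();
        for (Object item : source) {
            Object v = item;
            for (Function<Object, Object> op : plan) v = op.apply(v);
            out.add((T) v);
        }
        return out;
    }
}

public class Demo {
    public static void main(String[] args) {
        DeferredCollection<Integer> nums =
            new DeferredCollection<>(List.of(1, 2, 3));
        // Neither map executes anything yet; run() triggers evaluation.
        List<Integer> result = nums.map(x -> x * 10).map(x -> x + 1).run();
        System.out.println(result); // prints [11, 21, 31]
    }
}
```

The key design point mirrored here is that composing operations is cheap and side-effect free, which gives the library freedom to rearrange and fuse them before any data is touched.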