Abstract
While high-level data-parallel frameworks like MapReduce simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms, and can therefore yield inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction, which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.
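The graph-parallel model the abstraction targets can be illustrated with a small sketch: computation is expressed as an update function applied to a vertex and its neighborhood, and a dynamic scheduler re-runs only vertices whose neighborhoods have changed. The following is a minimal single-threaded Python illustration using dynamic PageRank; the function and variable names are ours for exposition and do not reflect GraphLab's actual C++ API.

```python
from collections import deque

def make_graph():
    # Tiny directed graph: vertex -> list of out-neighbors (illustrative data).
    return {0: [1, 2], 1: [2], 2: [0]}

def pagerank_update(v, rank, out_edges, in_edges, damping=0.85):
    """Recompute one vertex's rank from its in-neighbors' current ranks.
    Assumes every vertex has at least one out-edge (no dangling vertices)."""
    total = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
    return (1 - damping) + damping * total

def run(graph, tol=1e-4):
    in_edges = {v: [u for u in graph if v in graph[u]] for v in graph}
    rank = {v: 1.0 for v in graph}
    queue = deque(graph)           # dynamic scheduler: vertices pending an update
    queued = set(graph)
    while queue:
        v = queue.popleft()
        queued.discard(v)
        new = pagerank_update(v, rank, graph, in_edges)
        if abs(new - rank[v]) > tol:
            # Significant change: reschedule only the out-neighbors,
            # rather than sweeping over all vertices each iteration.
            rank[v] = new
            for w in graph[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
        else:
            rank[v] = new
    return rank
```

The key departure from a synchronous, MapReduce-style formulation is the scheduler: work is generated adaptively where ranks are still changing, which is the kind of asynchronous, dynamic computation the abstraction is built to express.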
We develop graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm, and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show one to two orders of magnitude performance gains over Hadoop-based implementations.
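The idea of expressing the snapshot within the abstraction itself can be sketched in the same vertex-centric style: the first update on a vertex records its local state (the marker has arrived) and schedules its neighbors, while later updates are treated as duplicate markers. This is a deliberately simplified, single-machine Python sketch under our own illustrative names; the full Chandy-Lamport algorithm also records in-flight channel state, which is omitted here.

```python
from collections import deque

def snapshot_vertex_program(graph, vertex_data, initiator):
    """Chandy-Lamport-style snapshot written as a vertex program:
    the first 'update' on a vertex saves its local data and schedules
    its neighbors (marker propagation); repeat updates are duplicate
    markers and are ignored. Channel-state recording is omitted."""
    saved = {}                     # the accumulating global snapshot
    schedule = deque([initiator])  # vertices with a pending update
    while schedule:
        v = schedule.popleft()
        if v in saved:
            continue               # duplicate marker: already recorded
        saved[v] = vertex_data[v]  # record local state on first marker
        for w in graph[v]:         # forward the marker along edges
            if w not in saved:
                schedule.append(w)
    return saved
```

On a connected graph this records every vertex exactly once; in the distributed setting the same update function would simply run on each machine's partition of the graph, which is what makes the abstraction a natural host for the snapshot protocol.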
Distributed GraphLab: a framework for machine learning and data mining in the cloud