
Distributed GraphLab: a framework for machine learning and data mining in the cloud

Published: 01 April 2012

Abstract

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.
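
To make the graph-parallel pattern concrete, the following is a minimal Python sketch of an asynchronous, dynamic vertex-update loop in the spirit of the GraphLab abstraction, using PageRank as the update function. The names, data layout, and sequential scheduler are illustrative assumptions for exposition, not the actual GraphLab API.

from collections import deque

# A minimal sketch of dynamic, graph-parallel computation in the
# spirit of the GraphLab abstraction. Names and the graph layout
# (vertex -> (rank, in-neighbors, out-degree)) are illustrative only.

DAMPING, TOLERANCE = 0.85, 1e-4

def out_neighbors(v, graph):
    return [w for w, (_, in_nbrs, _) in graph.items() if v in in_nbrs]

def pagerank_update(v, graph, schedule):
    # An update function reads the neighborhood of v, writes v's own
    # data, and may dynamically schedule neighbors for recomputation.
    rank, in_nbrs, out_deg = graph[v]
    new_rank = (1 - DAMPING) + DAMPING * sum(
        graph[u][0] / graph[u][2] for u in in_nbrs)
    graph[v] = (new_rank, in_nbrs, out_deg)
    if abs(new_rank - rank) > TOLERANCE:
        for w in out_neighbors(v, graph):
            schedule(w)  # dynamic scheduling: only affected vertices

def run(graph):
    # Sequential stand-in for GraphLab's parallel scheduler.
    pending, queued = deque(graph), set(graph)
    def schedule(w):
        if w not in queued:
            queued.add(w)
            pending.append(w)
    while pending:
        v = pending.popleft()
        queued.discard(v)
        pagerank_update(v, graph, schedule)

# Toy 3-cycle: 0 -> 1 -> 2 -> 0, all ranks starting at 0.
g = {0: (0.0, [2], 1), 1: (0.0, [0], 1), 2: (0.0, [1], 1)}
run(g)
print({v: round(r, 3) for v, (r, _, _) in g.items()})

Unlike a bulk-synchronous sweep over all vertices, the scheduler reconsiders only vertices whose inputs actually changed, which is the dynamic behavior the abstraction is designed to express.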

We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.
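
As a hedged illustration of the snapshot mechanism, here is a minimal Python simulation of the Chandy-Lamport marker protocol. The Process class and the deque-based FIFO channel model are hypothetical scaffolding for exposition, not Distributed GraphLab's implementation.

from collections import defaultdict, deque

MARKER = "MARKER"

class Process:
    def __init__(self, pid, peers):
        # peers act as both in- and out-neighbors in this toy setup
        self.pid, self.peers = pid, peers
        self.state = 0            # local application state
        self.snapshot = None      # recorded local state, if any
        self.recording = {}       # peer -> captured channel messages
        self.done = set()         # channels whose marker has arrived

    def start_snapshot(self, net):
        # Record local state, then flood markers on all out-channels.
        self.snapshot = self.state
        self.recording = {p: [] for p in self.peers}
        for p in self.peers:
            net[(self.pid, p)].append(MARKER)

    def receive(self, sender, msg, net):
        if msg == MARKER:
            if self.snapshot is None:
                self.start_snapshot(net)
                self.recording[sender] = []  # that channel is empty
            self.done.add(sender)
        else:
            self.state += msg     # ordinary application message
            if self.snapshot is not None and sender not in self.done:
                # Message was in flight when the snapshot began:
                # capture it as part of the channel's recorded state.
                self.recording[sender].append(msg)

# Two processes with FIFO channels modeled as deques.
net = defaultdict(deque)
procs = {0: Process(0, peers=[1]), 1: Process(1, peers=[0])}
net[(0, 1)].extend([5, 7])       # application messages in flight
procs[0].start_snapshot(net)     # process 0 initiates the snapshot
while any(net.values()):
    (src, dst), chan = next((k, c) for k, c in net.items() if c)
    procs[dst].receive(src, chan.popleft(), net)
print(procs[0].snapshot, procs[1].snapshot, procs[1].recording)

The recorded local states plus the captured channel contents form a consistent global snapshot from which a computation can be restarted after a failure, which is the property the paper exploits for fault tolerance.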

References

  1. R. Angles and C. Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1--1:39, 2008.
  2. A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In NIPS, pages 81--88, 2008.
  3. D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. iCoseg: Interactive co-segmentation with intelligent scribble guidance. In CVPR, pages 3169--3176, 2010.
  4. D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods. Prentice-Hall, Inc., 1989.
  5. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
  6. K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst., 3(1):63--75, 1985.
  7. R. Chen, X. Weng, B. He, and M. Yang. Large graph processing in the cloud. In SIGMOD, pages 1123--1126, 2010.
  8. C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In NIPS, pages 281--288, 2006.
  9. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, 2004.
  10. B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407--499, 2004.
  11. G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. In UAI, pages 165--173, 2006.
  12. J. Gonzalez, Y. Low, A. Gretton, and C. Guestrin. Parallel Gibbs sampling: From colored fields to thin junction trees. In AISTATS, volume 15, pages 324--332, 2011.
  13. J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In AISTATS, volume 5, pages 177--184, 2009.
  14. J. Gonzalez, Y. Low, C. Guestrin, and D. O'Hallaron. Distributed parallel inference on large factor graphs. In UAI, 2009.
  15. H. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik. Parallel support vector machines: The cascade SVM. In NIPS, pages 521--528, 2004.
  16. D. Gregor and A. Lumsdaine. The Parallel BGL: A generic library for distributed graph computations. In POOSC, 2005.
  17. A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber. Comparative evaluation of latency reducing and tolerating techniques. SIGARCH Comput. Archit. News, 19(3):254--263, 1991.
  18. B. Hindman, A. Konwinski, M. Zaharia, and I. Stoica. A common substrate for cluster computing. In HotCloud, 2009.
  19. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59--72, 2007.
  20. U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system implementation and observations. In ICDM, pages 229--238, 2009.
  21. G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput., 48(1):96--129, 1998.
  22. S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in MapReduce. In SPAA, pages 85--94, 2011.
  23. J. Leskovec. Stanford large network dataset collection. http://snap.stanford.edu/data/index.html, 2011.
  24. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In UAI, pages 340--349, 2010.
  25. G. Malewicz, M. H. Austern, A. J. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010.
  26. J. Misra. Detecting termination of distributed computations using markers. In PODC, pages 290--294, 1983.
  27. R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In ICDM Workshops, pages 349--354, 2007.
  28. R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355--368, 1998.
  29. Neo4j. http://neo4j.org, 2011.
  30. D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, pages 1081--1088, 2007.
  31. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999.
  32. R. Pearce, M. Gokhale, and N. Amato. Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In SC, pages 1--11, 2010.
  33. R. Power and J. Li. Piccolo: building fast, distributed programs with partitioned tables. In OSDI, 2010.
  34. A. G. Siapas. Criticality and parallelism in combinatorial optimization. PhD thesis, Massachusetts Institute of Technology, 1996.
  35. A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. PVLDB, 3(1):703--710, 2010.
  36. S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607--614, 2011.
  37. J. W. Young. A first order approximation to the optimum checkpoint interval. Commun. ACM, 17:530--531, 1974.
  38. M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In HotCloud, 2010.
  39. Y. Zhang, Q. Gao, L. Gao, and C. Wang. PrIter: a distributed framework for prioritized iterative computations. In SOCC, pages 13:1--13:14, 2011.
  40. Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM, pages 337--348, 2008.

• Published in

  Proceedings of the VLDB Endowment, Volume 5, Issue 8
  April 2012, 96 pages

  Publisher

  VLDB Endowment

  Qualifiers

  • research-article
