ABSTRACT
Iterative computations are pervasive among data analysis applications in the cloud, including Web search, online social network analysis, recommendation systems, and so on. These cloud applications typically involve data sets of massive scale. Fast convergence of the iterative computation on the massive data set is essential for these applications. In this paper, we explore the opportunity for accelerating iterative computations and propose a distributed computing framework, PrIter, which enables fast iterative computation by providing the support of prioritized iteration. Instead of performing computations on all data records without discrimination, PrIter prioritizes the computations that help convergence the most, so that the convergence speed of iterative process is significantly improved. We evaluate PrIter on a local cluster of machines as well as on Amazon EC2 Cloud. The results show that PrIter achieves up to 50x speedup over Hadoop for a series of iterative algorithms.
- Amazon ec2. http://aws.amazon.com/ec2/.Google Scholar
- Hadoop. http://hadoop.apache.org/.Google Scholar
- Priter project. http://code.google.com/p/priter/.Google Scholar
- Stanford dataset. http://snap.stanford.edu/data/.Google Scholar
- S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly. Video suggestion and discovery for youtube: taking random walks through the view graph. In WWW '08, pages 895--904, 2008. Google ScholarDigital Library
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW '98, pages 107--117, 1998. Google ScholarDigital Library
- Y. Bu, B. Howe, M. Balazinska, and D. M. Ernst. Haloop: Efficient iterative data processing on large clusters. In VLDB '10, 2010. Google ScholarDigital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation -- Volume 7, OSDI '06, pages 15--15, Berkeley, CA, USA, 2006. USENIX Association. Google ScholarDigital Library
- Chu, Cheng T., Kim, Sang K., Lin, Yi A., Yu, Yuanyuan, Bradski, Gary R., Ng, Andrew Y., and Olukotun, Kunle. Map-Reduce for Machine Learning on Multicore. In NIPS, pages 281--288, 2006.Google Scholar
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI'04, pages 10--10, 2004. Google ScholarDigital Library
- J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative mapreduce. In MapReduce '10, pages 810--818, 2010. Google ScholarDigital Library
- B. He, M. Yang, Z, Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: batched stream processing for data intensive distributed computing. In SoCC '10, pages 63--74, 2010. Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07, pages 59--72. Google ScholarDigital Library
- U. Kang, C. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system implementation and observations. In ICDM '09, pages 229--238, 2009. Google ScholarDigital Library
- L. Katz. A new status index derived from sociometric analysis. Psychometrika, 1953.Google ScholarCross Ref
- D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 58(7):1019--1031, 2007. Google ScholarDigital Library
- D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In SoCC '10, pages 51--62, 2010. Google ScholarDigital Library
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new framework for parallel machine learning. CoRR, abs/1006.4990, 2010.Google Scholar
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD '10, pages 135--146, 2010. Google ScholarDigital Library
- D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: A universal execution engine for distributed data-flow computing. In NSDI'11, 2011. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD '08, pages 1099--1110, 2008. Google ScholarDigital Library
- A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09, pages 165--178, 2009. Google ScholarDigital Library
- D. Peng and F. Dabe. Large-scale incremental processing using distributed transactions and notifications. In OSDI '10: Proceedings of the 9th conference on Symposium on Opearting Systems Design and Implementation, pages 1--15, 2010. Google ScholarDigital Library
- R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI'10, 2010. Google ScholarDigital Library
- N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In SIGIR '02, pages 129--136, 2002. Google ScholarDigital Library
- H. H. Song, T. W. Cho, V. Dave, Y. Zhang, and L. Qiu. Scalable proximity estimation and link prediction in online social networks. In IMC '09, pages 322--335, 2009. Google ScholarDigital Library
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. In VLDB '09, pages 1626--1629, 2009. Google ScholarDigital Library
- C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao. User interactions in social networks and their implications. In EuroSys '09, pages 205--218, 2009. Google ScholarDigital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In OSDI '08, pages 1--14, 2008. Google ScholarDigital Library
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In HotCloud'10, pages 10--10, 2010. Google ScholarDigital Library
- M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving mapreduce performance in heterogeneous environments. In OSDI '08, pages 29--42, 2008. Google ScholarDigital Library
- Y. Zhang, Q. Gao, L. Gao, and C. Wang. imapreduce: A distributed computing framework for iterative computation. In DataCloud '11, 2011. Google ScholarDigital Library
- T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y.-C. Zhang. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107(10):4511--4515, March 2010.Google ScholarCross Ref
Index Terms
- PrIter: a distributed framework for prioritized iterative computations
Recommendations
PrIter: A Distributed Framework for Prioritizing Iterative Computations
Iterative computations are pervasive among data analysis applications, including web search, online social network analysis, recommendation systems, and so on. These applications typically involve data sets of massive scale. Fast convergence of the ...
Set-valued mixed quasi-variational inequalities and implicit resolvent equations
In this paper, we introduce and study a new class of variational inequalities, which is called the set-valued mixed quasi-variational inequality. The resolvent operator technique is used to establish the equivalence among generalized set-valued mixed ...
Primal-dual row-action method for convex programming
We present a primal-dual row-action method for the minimization of a convex function subject to general convex constraints. Constraints are used one at a time, no changes are made in the constraint functions and their Jacobian matrix (thus, the row-...
Comments