Abstract
A MapReduce scheduling algorithm plays a critical role in managing large clusters of hardware nodes and in meeting multiple quality requirements by controlling the order and distribution of user, job, and task execution. A comprehensive and structured survey of the scheduling algorithms proposed so far is presented here, organized by a novel multidimensional classification framework. The framework's dimensions are (i) meeting quality requirements, (ii) scheduling entities, and (iii) adapting to dynamic environments; each dimension has its own taxonomy. An empirical evaluation framework for these algorithms is also recommended. The survey identifies open issues and directions for future research.
Index Terms
- Classification Framework of MapReduce Scheduling Algorithms