Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2
March 2014
Publisher: Addison-Wesley Professional
ISBN: 978-0-321-93450-5
Published: 29 March 2014
Pages: 400
Abstract

"This book is a critically needed resource for the newly released Apache Hadoop 2.0, highlighting YARN as the significant breakthrough that broadens Hadoop beyond the MapReduce paradigm." From the Foreword by Raymie Stata, CEO of Altiscale

The Insider's Guide to Building Distributed, Big Data Applications with Apache Hadoop YARN

Apache Hadoop is helping drive the Big Data revolution. Now, its data processing has been completely overhauled: Apache Hadoop YARN provides resource management at data center scale and easier ways to create distributed applications that process petabytes of data. And now in Apache Hadoop YARN, two Hadoop technical leaders show you how to develop new applications and adapt existing code to fully leverage these revolutionary advances.

YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment. You'll find many examples drawn from the authors' cutting-edge experience, first as Hadoop's earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward and helping customers succeed with it.

Coverage includes:
  • YARN's goals, design, architecture, and components, and how it expands the Apache Hadoop ecosystem
  • Exploring YARN on a single node
  • Administering YARN clusters and Capacity Scheduler
  • Running existing MapReduce applications
  • Developing a large-scale clustered YARN application
  • Discovering new open source frameworks that run under YARN

Cited By

  1. Huang L, Zhao Y, Mestre P, Han L, Wang K, Gao W and Zhang R (2022). Research on Reverse Skyline Query Algorithm Based on Decision Set, Journal of Database Management, 33:1, (1-28), Online publication date: 21-Jul-2022.
  2. Nguyen C, Hwang S and Kim J (2017). Making a case for the on-demand multiple distributed message queue system in a Hadoop cluster, Cluster Computing, 20:3, (2095-2106), Online publication date: 1-Sep-2017.
  3. Lin J, Yu I, Johnsen E and Lee M. ABS-YARN, Proceedings of the 19th International Conference on Fundamental Approaches to Software Engineering - Volume 9633, (49-65)
  4. Kumar M, Rath N and Rath S (2016). Analysis of microarray leukemia data using an efficient MapReduce-based K-nearest-neighbor classifier, Journal of Biomedical Informatics, 60:C, (395-409), Online publication date: 1-Apr-2016.
  5. Kumar M and Kumar Rath S (2015). Classification of microarray using MapReduce based proximal support vector machine classifier, Knowledge-Based Systems, 89:C, (584-602), Online publication date: 1-Nov-2015.
  6. Zafar H, Khan F, Carpenter B, Shafi A and Malik A (2015). MPJ Express Meets YARN, Procedia Computer Science, 51:C, (2678-2682), Online publication date: 1-Sep-2015.
  7. Huang B, Boehm M, Tian Y, Reinwald B, Tatikonda S and Reiss F. Resource Elasticity for Large-Scale Machine Learning, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, (137-152)
  8. Xu L, Li M and Butt A. Gerbil, Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, (627-636)
Contributors
  • Cloudera, Inc

Reviews

Aake Edlund

In the next generation of MapReduce (MRv2, or YARN), the MapReduce of Apache Hadoop 1 (MRv1) has been divided into two components: the cluster resource management capabilities have become YARN (Yet Another Resource Negotiator), and the MapReduce-specific capabilities remain MapReduce. In the MRv1 architecture, the cluster was managed by a service called the JobTracker, with TaskTracker services on each host launching tasks on behalf of jobs and the JobTracker serving information about completed jobs. In MRv2, the functions of the JobTracker have been split among three services. First is the ResourceManager, a persistent YARN service that receives and runs applications on the cluster; it contains the scheduler, which is pluggable. Next, the MapReduce-specific capabilities of the JobTracker have been moved into the MapReduce Application Master, which is started to manage each MapReduce job and terminated when the job completes. Finally, the JobTracker function of serving information about completed jobs has been moved to the JobHistory Server. The TaskTracker, meanwhile, has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host and is responsible for launching containers, each of which can house a map or reduce task.
The authors give a good background on the reasoning behind this move from MRv1 to MRv2, or YARN, and the resulting huge change it brings to the data stack's ecosystem overall. The reader who wants more detail, for example on configuration and tuning or walk-through examples, needs to go to the web. This area is under constant development, and YARN is no exception; this is evident in the scripting parts and links. Practical details aside, the book is very useful for getting an overview of the architecture, its capabilities, feature set, and related frameworks.
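The division of responsibilities described in this review can be illustrated with a toy model. The Python sketch below is not taken from the book and does not use the Hadoop API; the class names simply mirror the services named above, while the slot-based capacity model, the `most_free_first` policy, and all method names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class NodeManager:
    """Per-host service (hypothetical model): tracks capacity and launches containers."""
    host: str
    free_slots: int
    containers: List[str] = field(default_factory=list)

    def launch(self, task: str) -> None:
        if self.free_slots <= 0:
            raise RuntimeError(f"{self.host} has no free capacity")
        self.free_slots -= 1
        self.containers.append(task)

    def release(self, task: str) -> None:
        self.containers.remove(task)
        self.free_slots += 1


# A scheduler is any function that picks a node for the next container.
Scheduler = Callable[[List[NodeManager]], Optional[NodeManager]]


def most_free_first(nodes: List[NodeManager]) -> Optional[NodeManager]:
    """Toy policy (an assumption, not a real YARN scheduler): most free slots wins."""
    candidates = [n for n in nodes if n.free_slots > 0]
    return max(candidates, key=lambda n: n.free_slots) if candidates else None


class ResourceManager:
    """Persistent cluster-wide service with a pluggable scheduler."""

    def __init__(self, nodes: List[NodeManager], scheduler: Scheduler = most_free_first):
        self.nodes = nodes
        self.scheduler = scheduler

    def allocate(self, task: str) -> NodeManager:
        node = self.scheduler(self.nodes)
        if node is None:
            raise RuntimeError("cluster is full")
        node.launch(task)
        return node


class JobHistoryServer:
    """Serves information about completed jobs."""

    def __init__(self) -> None:
        self.completed: Dict[str, Dict[str, str]] = {}

    def record(self, job_id: str, placement: Dict[str, str]) -> None:
        self.completed[job_id] = placement


class MapReduceApplicationMaster:
    """Created per job; it requests containers and is discarded when the job ends."""

    def __init__(self, job_id: str, rm: ResourceManager, history: JobHistoryServer):
        self.job_id = job_id
        self.rm = rm
        self.history = history

    def run(self, num_maps: int, num_reduces: int) -> Dict[str, str]:
        tasks = [f"map-{i}" for i in range(num_maps)]
        tasks += [f"reduce-{i}" for i in range(num_reduces)]
        granted = {task: self.rm.allocate(task) for task in tasks}
        placement = {task: node.host for task, node in granted.items()}
        # The tasks "finish" immediately in this toy model, so free the containers.
        for task, node in granted.items():
            node.release(task)
        self.history.record(self.job_id, placement)
        return placement


if __name__ == "__main__":
    nodes = [NodeManager("host-a", 2), NodeManager("host-b", 2)]
    rm = ResourceManager(nodes)
    history = JobHistoryServer()
    print(MapReduceApplicationMaster("job-1", rm, history).run(2, 1))
```

Passing a different function as `scheduler` changes the placement policy without touching the other classes, which is the sense in which the scheduler is pluggable in this sketch.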
The source code currently provided for the book needs to be updated; doing so, and especially adding the code for all chapters rather than only selected ones, would considerably increase the book's usability. In its current form, the text is less useful for actually testing the deployment and management of YARN. However, the core concepts of YARN are described and explained in a pedagogical way, with an initial focus on the underlying motivations for the evolution toward YARN. The reader is introduced to the core concepts and a functional overview of the YARN components in a stepwise manner.
The installation steps are described in detail, helping users into the machinery of setting up their own YARN environment; however, to actually get it in place, the reader needs to go to the web. Throughout the book, the reader is helped to better understand what is needed, the components' functionality, and what to look for and consider when moving to YARN. A number of installation alternatives are described, and the user gets a good idea of today's existing support for managing and tuning the environment. Further details on administration and monitoring are given, with source code for these specific chapters.
Building on the initial functional descriptions, the authors add a deeper level of insight into the inner workings of YARN in a dedicated section on its architecture. The detailed YARN application would benefit from available (and updated) source code to help the user reproduce the examples as far as possible. The YARN frameworks section gives the user a hint of the importance of YARN, but could be further detailed and extended. Overall, the book is best viewed as a guide to understanding YARN, and less as a hands-on guide to getting the details in place. When the authors update the source code for the book, the reader will find it even more useful.
Online Computing Reviews Service
