Big Data Analytics with R and Hadoop
November 2013
Publisher: Packt Publishing
ISBN: 978-1-78216-328-2
Published: 25 November 2013
Pages: 238
Abstract

Set up an integrated infrastructure of R and Hadoop to turn your data analytics into Big Data analytics.

Overview

  • Write Hadoop MapReduce within R
  • Learn data analytics with R and the Hadoop platform
  • Handle HDFS data within R
  • Understand Hadoop streaming with R
  • Encode and enrich datasets into R

In Detail

Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.

Big Data Analytics with R and Hadoop is focused on the techniques of integrating R and Hadoop by various tools such as RHIPE and RHadoop. A powerful data analytics engine can be built, which can process analytics algorithms over a large scale dataset in a scalable manner. This can be implemented through data analytics operations of R, MapReduce, and HDFS of Hadoop.

You will start with the installation and configuration of R and Hadoop. Next, you will discover information on various practical data analytics examples with R and Hadoop. Finally, you will learn how to import/export from various data sources to R. Big Data Analytics with R and Hadoop will also give you an easy understanding of the R and Hadoop connectors RHIPE, RHadoop, and Hadoop streaming.

What you will learn from this book

  • Integrate R and Hadoop via RHIPE, RHadoop, and Hadoop streaming
  • Develop and run a MapReduce application that runs with R and Hadoop
  • Handle HDFS data from within R using RHIPE and RHadoop
  • Run Hadoop streaming and MapReduce with R
  • Import and export from various data sources to R

Approach

Big Data Analytics with R and Hadoop is a tutorial style book that focuses on all the powerful big data tasks that can be achieved by integrating R and Hadoop.

Who this book is written for

This book is ideal for R developers who are looking for a way to perform big data analytics with Hadoop. This book is also aimed at those who know Hadoop and want to build some intelligent applications over Big data with R packages. It would be helpful if readers have basic knowledge of R.

Cited By

  1. Grulich P, Zeuch S and Markl V (2022). Babelfish, Proceedings of the VLDB Endowment, 15:2, (196-210), Online publication date: 1-Oct-2021.
  2. Ku J (2018). A Study on Prediction Model of Equipment Failure Through Analysis of Big Data Based on RHadoop, Wireless Personal Communications: An International Journal, 98:4, (3163-3176), Online publication date: 1-Feb-2018.
  3. Karim S, Soomro T and Aqil Burney S (2018). Spatiotemporal Aspects of Big Data, Applied Computer Systems, 23:2, (90-100), Online publication date: 1-Dec-2018.
  4. Liang T, Yeh L and Wu C A Visual MapReduce Program Development Environment for Heterogeneous Computing on Clouds Proceedings of the 2018 International Conference on Computing and Data Engineering, (83-87)
  5. Zhong H, Xiao J and Risi M (2017). Enhancing Health Risk Prediction with Deep Learning on Big Data and Revised Fusion Node Paradigm, Scientific Programming, 2017, Online publication date: 1-Jan-2017.
  6. Keka I and Çiço B Big data in electricity-visualization aspect Proceedings of the 16th International Conference on Computer Systems and Technologies, (236-243)
  7. Gheid Z and Challal Y An efficient and privacy-preserving similarity evaluation for big data analytics Proceedings of the 8th International Conference on Utility and Cloud Computing, (281-289)
  8. Páez D, Buenaga Rodríguez M, Sánz E, Villalba M and Gil R Big Data Processing Using Wearable Devices for Wellbeing and Healthy Activities Promotion Proceedings of the 7th International Work-Conference on Ambient Assisted Living. ICT-based Solutions in Real Life Situations - Volume 9455, (196-205)

Adrian Pasculescu

This is a practical introductory book for analytics and software support teams eager to use simple, scalable, yet powerful ways to manipulate and analyze large, distributed sets of data. From the first pages we learn that Hadoop is “an open-source Java framework for processing ... vast amounts of data on large clusters of commodity hardware,” and that R is a powerful, actively developed “open-source software package to perform statistical analysis on data” and “a programming language used by data scientist statisticians.” Thus, it is not surprising to learn that software developer groups and companies have invested in combining R and Hadoop in manageable and efficient data processing/analysis environments.

The content is well organized in a preface, seven chapters, and a reference section, which can be read more or less independently depending on the level of knowledge of the reader. Of course, this independence brings a bit of unavoidable redundancy in terms of references, examples, and notes. There is electronic access to all the exemplified software code, which can be obtained easily not only for the electronic copy but also for the printed version of the book. This, together with a couple of downloadable virtual Hadoop and R environments, pleasantly enhances the reading experience by allowing a simple try-it-yourself approach. The style is simple and clear, with abundant uniform resource locator (URL) resources, notes, code examples, and screenshots. To follow the examples, the reader must have some basic knowledge of R and understand the meaning of its abstract syntax, such as unlist(lapply(result,'[[',1)).

Chapter 1, “Getting Ready to Use R and Hadoop,” describes installing R, the RStudio programming environment, and R packages. It summarizes common R data mining techniques that are exemplified in the next chapters: regression, classification, clustering, and recommendation.
It finishes with installing Hadoop and HDFS (Hadoop's “rack-aware file-system”), and introduces MapReduce, “a programming model for processing large datasets distributed on a large cluster.”

MapReduce concepts are introduced in chapter 2, “Writing Hadoop MapReduce Programs.” Maps process partitioned input data in parallel based on keys, and are followed by shuffling/sorting and reducing (that is, aggregating) operations that summarize the mapped data into the output results. Here we also learn about some limitations of MapReduce, and find simple Java code examples together with ways to monitor and debug MapReduce jobs.

In chapter 3, “Integrating R and Hadoop,” the author focuses on examples of R and Hadoop integration using packages from different vendors or public sources, such as RHIPE, RHadoop, and HadoopStreaming. The MapReduce approach is obviously easiest to apply when the desired operation on big data is associative and commutative (for example, counting by summation). In this case, computing partial sums in parallel on distributed pieces of the data and then combining them provides the same results as a single sequential pass.

In chapter 4, “Using Hadoop Streaming with R,” the example is related to the segmentation of web page visits by geolocation and stresses the advantages of data streaming. We find here detailed descriptions of the functions provided by the R package HadoopStreaming (hsTableReader, hsKeyValReader, hsLineReader), together with examples of how to run Hadoop and how to prepare commands and read the results.

The author describes the data analytics project life cycle using a diverse set of examples, such as categorizing web page popularity, “computing the frequency of stock market change,” and a case study about “predicting the sale price of blue book for bulldozers,” in chapter 5, “Learning Data Analytics with R and Hadoop.” The main R functions used here are glm (generalized linear model) with the Poisson regression family and randomForest.
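
The map, shuffle/sort, and reduce stages described for chapter 2, and the remark for chapter 3 about associative and commutative operations, can be illustrated with a small word count. This is only a sketch in plain Python on in-memory lists, not code from the book and not R or Hadoop code; the function names (map_phase, shuffle, reduce_phase, partial_counts, merge_counts) and the sample lines are hypothetical. It shows that per-partition partial sums, merged afterwards, give the same counts however the input is split.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one input partition."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, sorted by key, like the shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Aggregate each key's values; summation is associative and commutative."""
    return {key: sum(values) for key, values in grouped.items()}


def partial_counts(partition):
    """Run map, shuffle, and reduce on a single partition (a partial sum)."""
    return reduce_phase(shuffle(map_phase(partition)))


def merge_counts(partials):
    """Combine partial sums from all partitions into the final counts."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


def word_count(partitions):
    """Count words over any partitioning of the input lines."""
    return merge_counts(partial_counts(p) for p in partitions)


# Hypothetical sample input, split three different ways.
lines = ["big data with R", "R and Hadoop", "Hadoop and big data"]
one_split = word_count([lines])
two_splits = word_count([lines[:1], lines[1:]])
three_splits = word_count([[line] for line in reversed(lines)])
```

Because addition can be regrouped and reordered freely, one_split, two_splits, and three_splits hold identical counts. An operation without those properties (a median, for instance) could not be merged from partial results this simply.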
Elements of supervised and unsupervised machine learning are introduced in chapter 6, “Understanding Big Data Analysis with Machine Learning.” Linear regression is exemplified on a matrix of random numbers; logistic regression on the iris flowers R data. For unsupervised machine learning, k-means clustering is used. The user-based and item-based recommendation algorithms presented are based on another publication [1].

Chapter 7, “Importing and Exporting Data from Various DBs,” briefly shows how to use different types of external data in R: for example, manipulating different types of file formats (.csv, .txt, .RDATA, .rda, and .xlsx) and accessing SQL-based databases, such as MySQL, SQLite, and PostgreSQL, or NoSQL-based distributed document data storage systems, such as MongoDB, Hive (“a Hadoop-based data warehousing-like framework”), and HBase (“a distributed big data store for Hadoop”). The book ends with a rich list of resources and an index.

As always when the predominant focus is application as opposed to theory, one peril is unavoidable: imperfect reproducibility of examples in practice (because of obsolete packages and versions or differences in syntax and parameters), and this book is no exception. Overall, the well-written content suggests that Hadoop and MapReduce, this simple and structured type of divide-and-conquer approach in big data analysis, might continue to have a place in the toolbox of any data scientist, especially in combination with R.

Online Computing Reviews Service
