Big Data Science & Analytics: A Hands-On Approach
April 2016
Publisher: VPT
ISBN: 978-0-9960255-3-9
Published: 15 April 2016
Pages: 542
Abstract

We are living at the dawn of what has been termed the "Fourth Industrial Revolution", which is marked by the emergence of "cyber-physical systems" where software interfaces seamlessly over networks with physical systems, such as sensors, smartphones, vehicles, power grids or buildings, to create a new world of the Internet of Things (IoT). Data and information are the fuel of this new age, in which powerful analytics algorithms burn this fuel to generate decisions that are expected to create a smarter and more efficient world for all of us to live in. This new area of technology has been defined as Big Data Science and Analytics, and the industrial and academic communities are recognizing it as a competitive technology that can generate significant new wealth and opportunity.

Big data is defined as collections of datasets whose volume, velocity or variety is so large that it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools. Big data science and analytics deals with the collection, storage, processing and analysis of massive-scale data. Industry surveys, by Gartner and e-Skills, for instance, predict that there will be over 2 million job openings for engineers and scientists trained in the area of data science and analytics alone, and that the job market in this area is growing at a 150 percent year-over-year growth rate.

We have written this textbook, as part of our expanding "A Hands-On Approach"(TM) series, to meet this need at colleges and universities, and also for big data service providers who may be interested in offering a broader perspective of this emerging field to accompany their customer and developer training programs. The typical reader is expected to have completed a couple of college-level courses in programming using traditional high-level languages, and to be either a senior or a beginning graduate student in one of the science, technology, engineering or mathematics (STEM) fields.
An accompanying website for this book contains additional support for instruction and learning (www.big-data-analytics-book.com).

The book is organized into three main parts, comprising a total of twelve chapters. Part I provides an introduction to big data, applications of big data, and big data science and analytics patterns and architectures. A novel data science and analytics application system design methodology is proposed and its realization through the use of open-source big data frameworks is described. This methodology describes big data analytics applications as realizations of the proposed Alpha, Beta, Gamma and Delta models, which comprise tools and frameworks for collecting and ingesting data from various sources into the big data analytics infrastructure, distributed filesystems and non-relational (NoSQL) databases for data storage, and processing frameworks for batch and real-time analytics. This new methodology forms the pedagogical foundation of this book.

Part II introduces the reader to various tools and frameworks for big data analytics, and the architectural and programming aspects of these frameworks, with examples in Python. We describe Publish-Subscribe messaging frameworks (Kafka & Kinesis), Source-Sink connectors (Flume), Database Connectors (Sqoop), Messaging Queues (RabbitMQ, ZeroMQ, RestMQ, Amazon SQS) and custom REST, WebSocket and MQTT-based connectors. The reader is introduced to data storage, batch and real-time analysis, and interactive querying frameworks including HDFS, Hadoop, MapReduce, YARN, Pig, Oozie, Spark, Solr, HBase, Storm, Spark Streaming, Spark SQL, Hive, Amazon Redshift and Google BigQuery. Also described are serving databases (MySQL, Amazon DynamoDB, Cassandra, MongoDB) and the Django Python web framework.

Part III introduces the reader to various machine learning algorithms with examples using the Spark MLlib and H2O frameworks, and visualizations using frameworks such as Lightning, Pygal and Seaborn.

Contributors
  • Georgia Institute of Technology
  • Georgia Institute of Technology

Reviews

Simon Berkovich

Devoted to the problem of "big data," which has become an important business in many areas of modern life, this book addresses the state of affairs that is termed the "Fourth Industrial Revolution." In simplified terms, the big data situation is described as dealing with "the collections of datasets whose volume, velocity, or variety is so large that it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools." Some industry surveys mentioned in the book "predict that there will be over 2 million job openings for engineers and scientists trained in the area of data science and analytics alone, and that the job market in this area is growing at a 150 percent year-over-year growth rate." This book is written as a manual that expands the "A Hands-On Approach" series, to meet the instructional need at colleges and universities. It may also interest big data service providers who want to offer a broader perspective of this emerging field to accompany training programs for their customers and developers.

The book is organized into three main parts, comprising a total of 12 chapters that cover essentially all aspects of big data. Part 1 is an introduction to big data and to big data analytics patterns and architectures. According to the authors, the suggested "methodology forms the pedagogical foundation of [the] book." A novel data science and analytics application system design is considered, and its realization through the use of open-source big data frameworks is described. This description comprises tools and frameworks for collecting data from various sources, distributed file systems and nonrelational (NoSQL) databases for data storage, and frameworks for batch and real-time processing. Part 2 contains various tools and frameworks for big data analytics with examples in Python.
The reader is introduced to data storage, batch and real-time analysis, and interactive querying frameworks including HDFS, Hadoop, MapReduce, YARN, Pig, Oozie, Spark, Solr, HBase, Storm, Spark Streaming, Spark SQL, Hive, Amazon Redshift, and Google BigQuery. Also described are serving databases (MySQL, Amazon DynamoDB, Cassandra, MongoDB) and the Django Python web framework. Part 3 presents advanced topics related to various machine learning techniques including clustering, classification, regression, and recommendation. The examples use the Spark MLlib and the H2O machine learning frameworks. This part also includes methods of data visualization using frameworks such as Lightning, Pygal, and Seaborn.

In summary, this book presents a comprehensive reference source on the basic aspects of big data analytics. A qualified reader can effectively use it for practical work on big data systems.

Yet it is doubtful that simply intensifying the processing of large volumes of information could by itself lead to the anticipated industrial revolution. The main objective of big data analysis is the formation of knowledge. Naively, one may assume that accumulating vast amounts of data is a necessary condition for this purpose. In fact, it is just an imitation of productive activity. Many big data projects, especially in biology, have been criticized for their cost and lack of results. Formation of knowledge requires something beyond regular statistical inference; it is a haphazard process that involves serendipity. Thus, successful use of big data requires a qualitatively different approach to the organization of processing. At this time, "big data" developments are focused mainly on the technical issues of adapting stupendous information processing requirements to the conventional facilities of common information technology. This book presents a good, comprehensive reference source for these efforts.
Online Computing Reviews Service
