Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data
July 2010
Publisher:
  • Springer Publishing Company, Incorporated
ISBN:978-1-84882-259-7
Published:07 July 2010
Pages: 397
Abstract

Each passing year bears witness to the development of ever more powerful computers, increasingly fast and cheap storage media, and ever higher bandwidth data connections. This makes it easy to believe that we can now, at least in principle, solve any problem we are faced with so long as we have enough data. Yet this is not the case. Although large databases allow us to retrieve many different single pieces of information and to compute simple aggregations, general patterns and regularities often go undetected. Furthermore, it is exactly these patterns, regularities and trends that are often most valuable. To avoid the danger of drowning in information but starving for knowledge, the branch of research known as data analysis has emerged, and a considerable number of methods and software tools have been developed. However, it is not these tools alone but the intelligent application of human intuition in combination with computational power, of sound background knowledge with computer-aided modeling, and of critical reflection with convenient automatic model construction, that results in successful intelligent data analysis projects. Guide to Intelligent Data Analysis provides a hands-on instructional approach to many basic data analysis techniques, and explains how these are used to solve data analysis problems.
Topics and features: guides the reader through the process of data analysis, following the interdependent steps of project understanding, data understanding, data preparation, modeling, and deployment and monitoring; equips the reader with the necessary information in order to obtain hands-on experience of the topics under discussion; provides a review of the basics of classical statistics that support and justify many data analysis methods, and a glossary of statistical terms; includes numerous examples using R and KNIME, together with appendices introducing the open source software; integrates illustrations and case-study-style examples to support pedagogical exposition. This practical and systematic textbook/reference for graduate and advanced undergraduate students is also essential reading for all professionals who face data analysis problems. Moreover, it is a book to be used following one's exploration of it.
Dr. Michael R. Berthold is Nycomed-Professor of Bioinformatics and Information Mining at the University of Konstanz, Germany. Dr. Christian Borgelt is Principal Researcher at the Intelligent Data Analysis and Graphical Models Research Unit of the European Centre for Soft Computing, Spain. Dr. Frank Höppner is Professor of Information Systems at Ostfalia University of Applied Sciences, Germany. Dr. Frank Klawonn is a Professor in the Department of Computer Science and Head of the Data Analysis and Pattern Recognition Laboratory at Ostfalia University of Applied Sciences, Germany. He is also Head of the Bioinformatics and Statistics group at the Helmholtz Centre for Infection Research, Braunschweig, Germany.

Cited By

  1. Hüsing S Epistemic Programming - An insight-driven programming concept for Data Science Proceedings of the 21st Koli Calling International Conference on Computing Education Research, (1-3)
  2. Kochegurova E and Martynova Y (2020). Aspects of Continuous User Identification Based on Free Texts and Hidden Monitoring, Programming and Computing Software, 46:1, (12-24), Online publication date: 1-Jan-2020.
  3. Heinemann B, Opel S, Budde L, Schulte C, Frischemeier D, Biehler R, Podworny S and Wassong T Drafting a Data Science Curriculum for Secondary Schools Proceedings of the 18th Koli Calling International Conference on Computing Education Research, (1-5)
  4. Batista N, Brandão M, Pinheiro M, Dalip D and Moro M Dealing with Data from Multiple Web Sources Proceedings of the 24th Brazilian Symposium on Multimedia and the Web, (3-6)
  5. Grossi V, Monreale A, Nanni M, Pedreschi D and Turini F Clustering Formulation Using Constraint Optimization Revised Selected Papers of the SEFM 2015 Collocated Workshops on Software Engineering and Formal Methods - Volume 9509, (93-107)
  6. Holvitie J and Leppänen V RefUTU Proceedings of the 16th International Conference on Computer Systems and Technologies, (176-183)
  7. Anaya I, Simko V, Bourcier J, Plouzeau N and Jézéquel J A prediction-driven adaptation approach for self-adaptive sensor networks Proceedings of the 9th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, (145-154)
  8. Klawonn F, Lechner W and Grigull L Case-Centred multidimensional scaling for classification visualisation in medical diagnosis Proceedings of the second international conference on Health Information Science, (137-148)
  9. Ince K and Klawonn F Handling Different Levels of Granularity within Naive Bayes Classifiers Proceedings of the 14th International Conference on Intelligent Data Engineering and Automated Learning - IDEAL 2013 - Volume 8206, (521-528)
  10. Klawonn F, Crull K, Kukita A and Pessler F Median polish with power transformations as an alternative for the analysis of contingency tables with patient data Proceedings of the First international conference on Health Information Science, (25-35)
  11. Klawonn F, Höppner F and Jayaram B What are Clusters in High Dimensions and are they Difficult to Find? Revised Selected Papers of the First International Workshop on Clustering High-Dimensional Data - Volume 7627, (14-33)
  12. Kosina P and Gama J Very Fast Decision Rules for multi-class problems Proceedings of the 27th Annual ACM Symposium on Applied Computing, (795-800)
  13. Klawonn F, Höppner F and May S An alternative to ROC and AUC analysis of classifiers Proceedings of the 10th international conference on Advances in intelligent data analysis X, (210-221)
Contributors
  • University of Konstanz
  • University of Salzburg
  • Helmholtz Centre for Infection Research (HZI)

Reviews

Corrado Mencar

The clear and complete exposition of arguments, the attention to formalization, and the balanced number of bibliographic references make this book a bright introduction to intelligent data analysis. It is an excellent choice for graduate or advanced undergraduate courses, as well as for researchers and professionals who want to get acquainted with this field of study. Intelligent data analysis is the complex process of acquiring useful knowledge from massive amounts of real data (data collected from real-world processes). Such data is possibly incomplete, distributed among several sources, and polluted by noise. Intelligent data analysis is similar to knowledge discovery in data (KDD), but it places more emphasis on the role of the analyst, who intelligently applies available tools to analyze data and design models. After an introduction to general data analysis concepts, the authors reserve the next chapter for playfully but effectively comparing two approaches to data analysis. In the first situation, they apply a number of tools, almost mechanically. In the second situation, they apply an intelligent approach. The chapter clearly shows the risks of a naive approach to data analysis: extracting no useful knowledge from data, or, even more dangerously, extracting false knowledge. The structure of the book takes the reader through each of the stages required for intelligent data analysis. The authors adopt the Cross Industry Standard Process for Data Mining (CRISP-DM) model as a guideline for the description of the various steps. Two chapters examine project and data understanding, the two key stages of CRISP-DM necessary for making the most critical choices in the subsequent stages (or for deciding to abandon the project). The chapter on data understanding, in particular, shows a number of techniques for assessing the quality of available data, including data visualization, descriptive statistics, outlier detection, and missing value analysis.
Chapter 5 does not describe any stage of the CRISP-DM process. Instead, it is devoted to the basic principles of correct model design. The chapter covers general topics such as model fitting strategies and criteria, analysis of the possible sources of errors, and model validation. This chapter prepares the reader for the next part of the book, which presents and discusses several models. Starting with chapter 6, the next part of the book describes the basic techniques for data preparation. The subsequent three chapters focus on the three main objectives of data analysis: finding patterns, finding explanations, and finding predictors. Patterns are regularities hidden in data; one can use exploratory techniques to extract them. The book illustrates cluster and deviation analysis, self-organizing maps, and association rules. Explanatory techniques described in the book include rule-based models, decision trees, regression models, and Bayes classifiers. Finally, the book outlines the most basic predictive models, including the k-nearest neighbor algorithm (k-NN), neural networks (a brief summary), and support vector machines (SVMs). It also briefly describes ensemble methods. The final chapter briefly covers the last two stages of the CRISP-DM process: evaluation and deployment. Since this is an introductory book, it does not cover advanced topics such as multi-relational data mining, fuzzy models, and structured datasets. Each chapter, however, includes a well-balanced number of references that are useful for investigating advanced topics. The chapters that contain technical content end with a section that illustrates how to apply the described techniques in R (an open-source statistical tool) and the Konstanz Information Miner (KNIME), free software for setting up and running knowledge discovery workflows. These two sections, although quite short, are useful for understanding how to concretely apply the described techniques. The book ends with three appendices.
The first appendix is a well-written summary of statistics, which is useful for recalling basic notions and properties from descriptive and inferential statistics and from probability theory. The second appendix is an introduction to the R language. Though short, it is sufficient for following the examples in the chapters. Similarly, the last appendix briefly describes KNIME. Overall, the authors hit their target of producing a textbook that aids in understanding the basic processes, methods, and issues for intelligent data analysis. The level of detail is not homogeneous throughout the book (some sections provide only a big picture of the topics they describe, while others offer more detail), and there are a few typographical errors, but the rigorous and impartial exposition, the use of a uniform notation, the consistent use of the same dataset (Iris) to show the examples, and the adequate bibliography make this book a good selection for the target audience.
Online Computing Reviews Service
