The fundamental algorithms in data mining and analysis form the basis for the emerging field of data science, which includes automated methods to analyze patterns and models for all kinds of data, with applications ranging from scientific discovery to business intelligence and analytics. This textbook for senior undergraduate and graduate data mining courses provides a broad yet in-depth overview of data mining, integrating related concepts from machine learning and statistics. The main parts of the book include exploratory data analysis, pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. With its comprehensive coverage, algorithmic perspective, and wealth of examples, this book offers solid guidance in data mining for students, researchers, and practitioners alike.
Key features:
- Covers both core methods and cutting-edge research
- Algorithmic approach with open-source implementations
- Minimal prerequisites: all key mathematical concepts are presented, as is the intuition behind the formulas
- Short, self-contained chapters with class-tested examples and exercises allow for flexibility in designing a course and for easy reference
- Supplementary website with lecture slides, videos, project ideas, and more
Index Terms
- Data Mining and Analysis: Fundamental Concepts and Algorithms