Text Mining: Classification, Clustering, and Applications


Publisher: Chapman & Hall/CRC
ISBN: 978-1-4200-5940-3
Published: 15 June 2009
Pages: 328

Abstract

Giving a broad perspective of the field from numerous vantage points, Text Mining focuses on statistical methods for text mining and analysis. It examines methods to automatically cluster and classify text documents and applies these methods in a variety of areas, including adaptive information filtering, information distillation, and text search. The book begins with the classification of documents into predefined categories and then describes novel methods for clustering documents into groups that are not predefined. It concludes with various text mining applications that have significant implications for future research and industrial use.

Contributors

Ashok Srivastava
- Publication Years: 2009 - 2009
- Publication counts: 1
- Citation count: 22
- Available for Download: 0
- Downloads (cumulative): 0
- Downloads (12 months): 0
- Downloads (6 weeks): 0
- Average Downloads per Article: 0
- Average Citation per Article: 22
Mehran Sahami
Stanford University
- Publication Years: 1993 - 2023
- Publication counts: 74
- Citation count: 2,956
- Available for Download: 45
- Downloads (cumulative): 53,387
- Downloads (12 months): 2,129
- Downloads (6 weeks): 337
- Average Downloads per Article: 1,186
- Average Citation per Article: 40

Index Terms

Text Mining: Classification, Clustering, and Applications
1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Recommendations

Text mining: classification & clustering of articles related to sports
ACM-SE 43: Proceedings of the 43rd annual Southeast regional conference - Volume 1

Identification of articles related to a particular domain is addressed by Text Mining. This paper demonstrates the benefits of combining classification and clustering towards achieving the goal of grouping very closely related articles/documents. ...

Text association mining with cross-sentence inference, structure-based document model and multi-relational text mining

Mining Text Using Keyword Distributions

Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work ...

Reviews

Reviewer: Sithu D. Sudarsan

Data mining now includes unstructured data mining capabilities, and the mining of unstructured text is known as text mining. Since over 80 percent of existing data is unstructured, researchers in this area are working overtime. The book provides a very good overview of some state-of-the-art capabilities. It has ten chapters, each contributed by a group of researchers, primarily from academia. As part of the introduction, the editors provide a very quick overview of the progress made in text mining, which also serves to outline each chapter's topic.

In chapter 1, "Analysis of Text Patterns Using Kernel Methods," the authors prove that the computational complexity of pattern analysis remains polynomial, irrespective of feature space dimensionality. They provide an example of kernel function evaluation, using a language dataset that contains as many as 42 languages.

Chapter 2, "Detection of Bias in Media Outlets with Statistical Learning Methods," is a case study that uses the contents of four online media outlets: Cable News Network (CNN), Al Jazeera, the International Herald Tribune, and The Detroit News. Most of the chapter deals with identifying the outlet behind a given news item, based on its choice of terms and stories. For their experiments, the authors use support vector machines (SVMs), kernel canonical correlation analysis, and multidimensional scaling. While the presentation is clear and easy to follow, the articles they compare are essentially related to the Middle East, and the outlets present the stories from different views, which makes the identification straightforward. As a result, it is unclear how their methods would perform in less-distinct cases.

Chapter 3, "Collective Classification for Text Classification," is another case study. In their experiments, the authors use two bibliographic datasets, Cora and CiteSeer, and a hypertext dataset, WebKB.
Essentially, they try to identify or infer missing labels or metadata by positively identifying links between documents. Their approximate inference approaches are based on both local conditional classifiers and global formulations. They conclude with a discussion of the performance of the classifiers, based on the test results.

In chapter 4, the authors identify topics based on a document collection. They use JSTOR to demonstrate the effectiveness of latent Dirichlet allocation (LDA), and describe dynamic topic models and correlated topic models.

In chapter 5, the authors use Enron email sets to track discussions in email communications. The study uses three- and four-way nonnegative matrix and tensor factorization to reveal discussions, something that could not have been accomplished using two-way tensors. The authors take parallel factor analysis (PARAFAC) for three-way arrays and extend it to four-way arrays. Finally, they present a few more details on term weighting and visualization of their clustering.

In the first half of chapter 6, "Text Clustering with Mixture of von Mises-Fisher Distributions," the authors provide the necessary mathematical background, including high-dimensional text datasets. The second half of the chapter presents the experiments and discussions. The authors use multiple sample datasets, ranging from simulated to public-domain data, in tests with four clustering algorithms. They conclude that, while the results are encouraging, more evaluation is needed before conclusions can be drawn.

In chapter 7, "Constrained Partitional Clustering of Text Data: An Overview," the authors discuss constraint-based and distance-based clustering approaches, using variants of k-means clustering and constrained vector quantization error techniques. The authors use three different datasets to demonstrate the algorithms' performance characteristics, and confirm the use of the cosine distance function for clustering.
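
For readers unfamiliar with the cosine distance mentioned for chapter 7, the following is a minimal sketch of the general idea, not code from the book or the review. It assumes documents are represented as term-count vectors; the four-term vocabulary, the sample vectors, the centroids, and the function names are all hypothetical and chosen only for illustration. It shows that cosine distance ignores document length, and how a k-means-style assignment step could use it.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an empty document is
        # maximally distant from everything.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def assign_to_nearest(doc, centroids):
    """Return the index of the centroid closest to doc in cosine distance."""
    return min(range(len(centroids)),
               key=lambda i: cosine_distance(doc, centroids[i]))


# Hypothetical term counts over the vocabulary (goal, match, vote, party).
short_sports = [2, 1, 0, 0]
long_sports = [20, 10, 0, 0]  # same term proportions, ten times longer
politics = [0, 1, 3, 2]

# Hypothetical cluster centroids: one "sports-like", one "politics-like".
centroids = [[1, 1, 0, 0], [0, 0, 1, 1]]

labels = [assign_to_nearest(d, centroids)
          for d in (short_sports, long_sports, politics)]
```

Because only the direction of each vector matters, the short and long sports documents are treated as nearly identical and land in the same cluster, which a Euclidean distance on raw counts would not guarantee.
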
Chapter 8 presents adaptive information filtering in text data. More specifically, filtering techniques may make it possible to identify the relevancy of information as it arrives. The author provides a clear description of adaptive filtering, with respect to retrieval, collaborative filtering, and topic detection and tracking; Zhang also provides an overview of the models for each.

Chapter 9 discusses a utility-based information distillation approach. The approach highlights the limitations of current solutions, such as adaptive filtering, and provides a system that utilizes certain parts of the existing approaches in order to address the requirements comprehensively. To validate the approach, the authors outline a set of evaluation methods before providing their experimental results. They end the chapter by presenting the limitations and challenges that still need to be addressed.

In chapter 10, "Text Search Enhanced with Types and Entities," the authors address the need to interpret queries appropriately before performing the search, rather than just using the query terms as search tokens. They propose a scheme to address question-answering tasks. Their algorithm adds indexes to be used at the time the query is executed, to fit the additional information needed with normal tokens while indexing. The key requirement is to define types of questions or relations while indexing the document corpus. The authors conclude that future question-answering applications need to include entities and relationships in text indexes.

In summary, the book provides several algorithms for text mining classification, clustering, and applications, including both mathematical background and experimental observations. For readers interested in specific areas, there are several useful references. Researchers can use this book to learn more about the text mining field.

Online Computing Reviews Service
