Text Mining: Classification, Clustering, and Applications
June 2009
Publisher: Chapman & Hall/CRC
ISBN: 978-1-4200-5940-3
Published: 15 June 2009
Pages: 328
Abstract

Giving a broad perspective of the field from numerous vantage points, Text Mining focuses on statistical methods for text mining and analysis. It examines methods to automatically cluster and classify text documents and applies these methods in a variety of areas, including adaptive information filtering, information distillation, and text search. The book begins with the classification of documents into predefined categories and then describes novel methods for clustering documents into groups that are not predefined. It concludes with various text mining applications that have significant implications for future research and industrial use.

Contributors
  • Stanford University


Reviews

Sithu D. Sudarsan

Data mining now includes unstructured data mining capabilities, and the mining of unstructured text is known as text mining; since over 80 percent of existing data is unstructured, researchers in this area are working overtime. The book provides a very good overview of some state-of-the-art capabilities. It has ten chapters, each contributed by a group of researchers, primarily from academia. As part of the introduction, the editors provide a very quick overview of the progress made in text mining, which also serves to outline each chapter's topic.

In chapter 1, "Analysis of Text Patterns Using Kernel Methods," the authors prove that the computational complexity of pattern analysis remains polynomial, irrespective of feature space dimensionality. They provide an example of kernel function evaluation, with a language dataset that contains as many as 42 languages.

Chapter 2, "Detection of Bias in Media Outlets with Statistical Learning Methods," is a case study that uses the contents of four online media outlets: Cable News Network (CNN), Al Jazeera, the International Herald Tribune, and The Detroit News. Most of the chapter deals with identifying the outlet of a given news item, based on its choice of terms and stories. For their experiments, the authors use support vector machines (SVMs), kernel canonical correlation analysis, and multidimensional scaling. While the presentation is clear and easy to follow, the articles they compare are essentially related to the Middle East; the outlets they compare present the stories from different views, making the identification straightforward. As a result, it is unclear how their methods would perform in less-distinct cases.

Chapter 3, "Collective Classification for Text Classification," is another case study. In their experiments, the authors use two bibliographic datasets (Cora and CiteSeer) and a hypertext dataset (WebKB).
Essentially, they try to identify or infer missing labels or metadata, by positively identifying links between documents. Their approximate inference approaches are based on both local conditional classifiers and global formulations. They conclude with a discussion of the performance of the classifiers, based on the test results.

In chapter 4, the authors identify topics based on a document collection. They use JSTOR to demonstrate the effectiveness of latent Dirichlet allocation (LDA), and describe dynamic topic models and correlated topic models.

In chapter 5, the authors use Enron email sets to track discussions in email communications. The study uses three- and four-way nonnegative matrix and tensor factorization to reveal discussions, something that could not have been accomplished using two-way tensors. The authors take parallel factor analysis (PARAFAC) for three-way arrays and extend it to four-way arrays. Finally, they present a few more details on term weighting and visualization of their clustering.

In the first half of chapter 6, "Text Clustering with Mixture of von Mises-Fisher Distributions," the authors provide the necessary mathematical background, including high-dimensional text datasets. The second half of the chapter presents the experiments and discussions. The authors use multiple sample datasets, ranging from simulated to public-domain data, and test four clustering algorithms. They conclude that, while the results are encouraging, more evaluation is needed before conclusions can be drawn.

In chapter 7, "Constrained Partitional Clustering of Text Data: An Overview," the authors discuss constraint-based and distance-based clustering approaches, using variants of k-means clustering and constrained vector quantization error techniques. The authors use three different datasets to demonstrate the algorithms' performance characteristics, and confirm the use of the cosine distance function for clustering.
Chapter 8 presents adaptive information filtering in text data. More specifically, filtering techniques may make it possible to identify the relevancy of information as it arrives. The author provides a clear description of adaptive filtering, with respect to retrieval, collaborative filtering, and topic detection and tracking; Zhang also provides an overview of the models for each.

Chapter 9 discusses a utility-based information distillation approach. The approach highlights the limitations of current solutions, such as adaptive filtering, and provides a system that utilizes certain parts of the existing approaches in order to address the requirements comprehensively. To validate the approach, the authors outline a set of evaluation methods, before providing their experimental results. They end the chapter by presenting the limitations and challenges that still need to be addressed.

In chapter 10, "Text Search Enhanced with Types and Entities," the authors address the need to interpret queries appropriately before performing the search, rather than just using the query terms as search tokens. They propose a scheme to address question-answering tasks. Their algorithm adds indexes to be used at the time the query is executed, to fit the additional information needed with normal tokens while indexing. The key requirement is to define types of questions or relations while indexing the document corpus. The authors conclude that future question-answering applications need to include entities and relationships in text indexes.

In summary, the book provides several algorithms for text mining classification, clustering, and applications, including both mathematical background and experimental observations. For readers interested in specific areas, there are several useful references. Researchers can use this book to learn more about the text mining field.

Online Computing Reviews Service
