ABSTRACT
We compare several document clustering techniques, including K-means, an SVD-based method, and a graph-based approach, and evaluate their performance on short text data collected from Twitter. We define a measure for evaluating the cluster error of these techniques. Our observations show that the graph-based approach using affinity propagation performs best in clustering short text data, with minimal cluster error.
Index Terms
- Comparative study of clustering techniques for short text documents