ABSTRACT
Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real world applications, however, the databases dynamically change due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm to detect the changes in the clustering structure and update the uncovered data patterns following such deletions and insertions is prohibitively expensive for large fast changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that do not compress well their underlying data points after certain insertions and deletions. Only these data bubbles are re-built using efficient split and merge operations. An extensive experimental evaluation shows that the incremental data bubbles provide significantly faster data summarization than completely re-building the data bubbles after a certain number of insertions and deletions, and are effective in preserving (and in some cases even improving) the quality of the data summarization.
- Aggarwal, C. C., Han, J., Wang, J., Yu. P. S. A Framework for Clustering Evolving Data Streams. In VLDB'03, 81--92, 2003. Google ScholarDigital Library
- Ankerst, M., Breuing, M., Kriegel, H-P., Sander, J. OPTICS: Ordering Points to Identify the Clustering Structure. In SIGMOD'99, 49--60, 1999. Google ScholarDigital Library
- Braunmueller, M. E. B., and Kriegel, H-P. Efficiently Supporting Multiple Similarity Queries for Mining in Metric Database. In ICDE'00, 256--267, 2000. Google ScholarDigital Library
- Barbara, D. Requirements for Clustering Data Streams. SIGKDD Explorations, 3:23--27, 2002. Google ScholarDigital Library
- Breuing, M., Kriegel, H-P, Kroger, P., Sander, J. Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering. In SIGMOD'01, 79--90, 2001. Google ScholarDigital Library
- Charikar, M., Chekuri, C., Feder, T., Motwani, R. Incremental Clustering and Dynamic Information Retrieval. In 29th Symposium on Theory of Computing, 626--635, 1997. Google ScholarDigital Library
- Chen, C., Hwang, S., Oyang, Y. An Incremental Hierarchical Data Clustering Algorithm Based on Gravity Theory. In 6th Pacific Asia Conference on Knowledge Discovery and Data Mining, 2002. Google ScholarDigital Library
- Elkan, C. Using the Triangle Inequality to Accelerate k-means. In ICML'03, 2003.Google Scholar
- Ester, M., Kriegel, H-P., Sander, J., Xu, X. A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD'96, 226--231, 1996.Google Scholar
- Ester, M., Kriegel, H-P., Sander, J. Wimmer, M., Xu, X. Incremental Clustering for Mining in a Data Warehousing Environment. VLDB'98, 323--333, 1998. Google ScholarDigital Library
- Filho, R. F. S., Traina, A. J. M., Triana Jr., C., Faloutsos, C. Similarity Search without Tears: The OMNI Family of All-purpose Access Methods. ICDE'01, 623--630. Google ScholarDigital Library
- Ganti, V., Gehrke, J., Ramakrishnan, R. Mining Data Streams under Block Evolution. SIGKDD Explorations, 3:1--10, 2002. Google ScholarDigital Library
- Larsen, B., Aone, C. Fast and Effective Text Mining Using Linear-time Document Clustering. In KDD'99, 16--22, 1999. Google ScholarDigital Library
- MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symp. Math. Statist. Prob., 281--297, 1967.Google Scholar
- O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., Motwani, R. Streaming-data Algorithms for High-quality Clustering. ICDE'02, 685--704, 2002. Google ScholarDigital Library
- Sander, J., Qin, X., Lu. Z., Niu, N. Kovarsky, A. Automated Extraction of Clusters from Hierarchical Clustering Representations. PAKDD'03. Google ScholarDigital Library
- Sibson, R. SLINK: An Optimally Efficient Algorithm for the Single-link Cluster Method. The Computer Journal, 16(1):30--34, 1973.Google Scholar
- Triola, M. F., Goodman, W. M., Law, R. Elementary Statistics First Canadian Edition, Addison-Wesley, Don Mills, Ontario, 1999. Google ScholarDigital Library
- Widyantoro, D. H., Ioerger, T. R., Yen, J. An Incremental Approach to Building a Cluster Hierarchy. ICDM'02, 705--708, 2002. Google ScholarDigital Library
- Zhang, T., Ramakrishnan, R., Linvy, M. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD'96, 103--114, 1996 Google ScholarDigital Library
- Incremental and effective data summarization for dynamic hierarchical clustering
Recommendations
Data Summarization with Social Contexts
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementWhile social data is being widely used in various applications such as sentiment analysis and trend prediction, its sheer size also presents great challenges for storing, sharing and processing such data. These challenges can be addressed by data ...
Document update summarization using incremental hierarchical clustering
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementDocument summarization has become a hot topic in recent years. However, most of existing summarization methods work on a batch of documents and do not consider that documents may arrive in a sequence and the corresponding summaries need to be updated in ...
Effective data summarization for hierarchical clustering in large datasets
Cluster analysis in a large dataset is an interesting challenge in many fields of Science and Engineering. One important clustering approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. The single-link ...
Comments