Article

Incremental and effective data summarization for dynamic hierarchical clustering

Authors:
Samer Nassar

University of Alberta, Edmonton, Alberta, Canada

University of Alberta, Edmonton, Alberta, Canada
View Profile

,
Jörg Sander

University of Alberta, Edmonton, Alberta, Canada

University of Alberta, Edmonton, Alberta, Canada
View Profile

,
Corrine Cheng

University of Alberta, Edmonton, Alberta, Canada

University of Alberta, Edmonton, Alberta, Canada
View Profile

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of dataJune 2004Pages 467–478https://doi.org/10.1145/1007568.1007621

Published:13 June 2004Publication History

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Pages 467–478

ABSTRACT

Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real world applications, however, the databases dynamically change due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm to detect the changes in the clustering structure and update the uncovered data patterns following such deletions and insertions is prohibitively expensive for large fast changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that do not compress well their underlying data points after certain insertions and deletions. Only these data bubbles are re-built using efficient split and merge operations. An extensive experimental evaluation shows that the incremental data bubbles provide significantly faster data summarization than completely re-building the data bubbles after a certain number of insertions and deletions, and are effective in preserving (and in some cases even improving) the quality of the data summarization.

References

Aggarwal, C. C., Han, J., Wang, J., Yu. P. S. A Framework for Clustering Evolving Data Streams. In VLDB'03, 81--92, 2003. Google ScholarDigital Library
Ankerst, M., Breuing, M., Kriegel, H-P., Sander, J. OPTICS: Ordering Points to Identify the Clustering Structure. In SIGMOD'99, 49--60, 1999. Google ScholarDigital Library
Braunmueller, M. E. B., and Kriegel, H-P. Efficiently Supporting Multiple Similarity Queries for Mining in Metric Database. In ICDE'00, 256--267, 2000. Google ScholarDigital Library
Barbara, D. Requirements for Clustering Data Streams. SIGKDD Explorations, 3:23--27, 2002. Google ScholarDigital Library
Breuing, M., Kriegel, H-P, Kroger, P., Sander, J. Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering. In SIGMOD'01, 79--90, 2001. Google ScholarDigital Library
Charikar, M., Chekuri, C., Feder, T., Motwani, R. Incremental Clustering and Dynamic Information Retrieval. In 29th Symposium on Theory of Computing, 626--635, 1997. Google ScholarDigital Library
Chen, C., Hwang, S., Oyang, Y. An Incremental Hierarchical Data Clustering Algorithm Based on Gravity Theory. In 6th Pacific Asia Conference on Knowledge Discovery and Data Mining, 2002. Google ScholarDigital Library
Elkan, C. Using the Triangle Inequality to Accelerate k-means. In ICML'03, 2003.Google Scholar
Ester, M., Kriegel, H-P., Sander, J., Xu, X. A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD'96, 226--231, 1996.Google Scholar
Ester, M., Kriegel, H-P., Sander, J. Wimmer, M., Xu, X. Incremental Clustering for Mining in a Data Warehousing Environment. VLDB'98, 323--333, 1998. Google ScholarDigital Library
Filho, R. F. S., Traina, A. J. M., Triana Jr., C., Faloutsos, C. Similarity Search without Tears: The OMNI Family of All-purpose Access Methods. ICDE'01, 623--630. Google ScholarDigital Library
Ganti, V., Gehrke, J., Ramakrishnan, R. Mining Data Streams under Block Evolution. SIGKDD Explorations, 3:1--10, 2002. Google ScholarDigital Library
Larsen, B., Aone, C. Fast and Effective Text Mining Using Linear-time Document Clustering. In KDD'99, 16--22, 1999. Google ScholarDigital Library
MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symp. Math. Statist. Prob., 281--297, 1967.Google Scholar
O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., Motwani, R. Streaming-data Algorithms for High-quality Clustering. ICDE'02, 685--704, 2002. Google ScholarDigital Library
Sander, J., Qin, X., Lu. Z., Niu, N. Kovarsky, A. Automated Extraction of Clusters from Hierarchical Clustering Representations. PAKDD'03. Google ScholarDigital Library
Sibson, R. SLINK: An Optimally Efficient Algorithm for the Single-link Cluster Method. The Computer Journal, 16(1):30--34, 1973.Google Scholar
Triola, M. F., Goodman, W. M., Law, R. Elementary Statistics First Canadian Edition, Addison-Wesley, Don Mills, Ontario, 1999. Google ScholarDigital Library
Widyantoro, D. H., Ioerger, T. R., Yen, J. An Incremental Approach to Building a Cluster Hierarchy. ICDM'02, 705--708, 2002. Google ScholarDigital Library
Zhang, T., Ramakrishnan, R., Linvy, M. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD'96, 103--114, 1996 Google ScholarDigital Library

Incremental and effective data summarization for dynamic hierarchical clustering
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining

Recommendations

Data Summarization with Social Contexts
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

While social data is being widely used in various applications such as sentiment analysis and trend prediction, its sheer size also presents great challenges for storing, sharing and processing such data. These challenges can be addressed by data ...
Read More
Document update summarization using incremental hierarchical clustering
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Document summarization has become a hot topic in recent years. However, most of existing summarization methods work on a batch of documents and do not consider that documents may arrive in a sequence and the corresponding summaries need to be updated in ...
Read More
Effective data summarization for hierarchical clustering in large datasets

Cluster analysis in a large dataset is an interesting challenge in many fields of Science and Engineering. One important clustering approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. The single-link ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
June 2004
988 pages
ISBN:1581138598
DOI:10.1145/1007568
Conference Chairs:
Arnd Christian König
Microsoft Research
,
Stefan Dessloch
University of Kaiserslautern, Germany
,
General Chair:
Patrick Valduriez
INRIA, France
,
Program Chair:
Gerhard Weikum
University of the Saarland
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 June 2004
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
clustering
data summarization
incremental data bubbles
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 28
  Total Citations
  View Citations
- 1,963
  Total Downloads
- Downloads (Last 12 months)18
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Incremental and effective data summarization for dynamic hierarchical clustering

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Recommendations

Data Summarization with Social Contexts

Document update summarization using incremental hierarchical clustering

Effective data summarization for hierarchical clustering in large datasets