skip to main content
10.1145/1007568.1007621acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Incremental and effective data summarization for dynamic hierarchical clustering

Published:13 June 2004Publication History

ABSTRACT

Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real world applications, however, the databases dynamically change due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm to detect the changes in the clustering structure and update the uncovered data patterns following such deletions and insertions is prohibitively expensive for large fast changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that do not compress well their underlying data points after certain insertions and deletions. Only these data bubbles are re-built using efficient split and merge operations. An extensive experimental evaluation shows that the incremental data bubbles provide significantly faster data summarization than completely re-building the data bubbles after a certain number of insertions and deletions, and are effective in preserving (and in some cases even improving) the quality of the data summarization.

References

  1. Aggarwal, C. C., Han, J., Wang, J., Yu. P. S. A Framework for Clustering Evolving Data Streams. In VLDB'03, 81--92, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Ankerst, M., Breuing, M., Kriegel, H-P., Sander, J. OPTICS: Ordering Points to Identify the Clustering Structure. In SIGMOD'99, 49--60, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Braunmueller, M. E. B., and Kriegel, H-P. Efficiently Supporting Multiple Similarity Queries for Mining in Metric Database. In ICDE'00, 256--267, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Barbara, D. Requirements for Clustering Data Streams. SIGKDD Explorations, 3:23--27, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Breuing, M., Kriegel, H-P, Kroger, P., Sander, J. Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering. In SIGMOD'01, 79--90, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Charikar, M., Chekuri, C., Feder, T., Motwani, R. Incremental Clustering and Dynamic Information Retrieval. In 29th Symposium on Theory of Computing, 626--635, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chen, C., Hwang, S., Oyang, Y. An Incremental Hierarchical Data Clustering Algorithm Based on Gravity Theory. In 6th Pacific Asia Conference on Knowledge Discovery and Data Mining, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Elkan, C. Using the Triangle Inequality to Accelerate k-means. In ICML'03, 2003.Google ScholarGoogle Scholar
  9. Ester, M., Kriegel, H-P., Sander, J., Xu, X. A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD'96, 226--231, 1996.Google ScholarGoogle Scholar
  10. Ester, M., Kriegel, H-P., Sander, J. Wimmer, M., Xu, X. Incremental Clustering for Mining in a Data Warehousing Environment. VLDB'98, 323--333, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Filho, R. F. S., Traina, A. J. M., Triana Jr., C., Faloutsos, C. Similarity Search without Tears: The OMNI Family of All-purpose Access Methods. ICDE'01, 623--630. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Ganti, V., Gehrke, J., Ramakrishnan, R. Mining Data Streams under Block Evolution. SIGKDD Explorations, 3:1--10, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Larsen, B., Aone, C. Fast and Effective Text Mining Using Linear-time Document Clustering. In KDD'99, 16--22, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symp. Math. Statist. Prob., 281--297, 1967.Google ScholarGoogle Scholar
  15. O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., Motwani, R. Streaming-data Algorithms for High-quality Clustering. ICDE'02, 685--704, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Sander, J., Qin, X., Lu. Z., Niu, N. Kovarsky, A. Automated Extraction of Clusters from Hierarchical Clustering Representations. PAKDD'03. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Sibson, R. SLINK: An Optimally Efficient Algorithm for the Single-link Cluster Method. The Computer Journal, 16(1):30--34, 1973.Google ScholarGoogle Scholar
  18. Triola, M. F., Goodman, W. M., Law, R. Elementary Statistics First Canadian Edition, Addison-Wesley, Don Mills, Ontario, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Widyantoro, D. H., Ioerger, T. R., Yen, J. An Incremental Approach to Building a Cluster Hierarchy. ICDM'02, 705--708, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Zhang, T., Ramakrishnan, R., Linvy, M. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD'96, 103--114, 1996 Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Incremental and effective data summarization for dynamic hierarchical clustering

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
          June 2004
          988 pages
          ISBN:1581138598
          DOI:10.1145/1007568

          Copyright © 2004 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 June 2004

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate785of4,003submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader