skip to main content
10.5555/2388996.2389081acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Published:10 November 2012Publication History

ABSTRACT

DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of Dbscan is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency.

We present a new parallel Dbscan algorithm (Pdsdbscan) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of Dbscan. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory.

Using data sets containing up to several hundred million high-dimensional points, we show that Pdsdbscan significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.

References

  1. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI magazine, vol. 17, no. 3, p. 37, 1996.Google ScholarGoogle Scholar
  2. J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1. USA, 1967, pp. 281--297.Google ScholarGoogle Scholar
  3. H. Park and C. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Systems with Applications, vol. 36, no. 2, pp. 3336--3341, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in ACM SIGMOD Record, vol. 25(2). ACM, 1996, pp. 103--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, vol. 1996. AAAI Press, 1996, pp. 226--231.Google ScholarGoogle Scholar
  6. W. Wang, J. Yang, and R. Muntz, "STING: A statistical information grid approach to spatial data mining," in Proceedings of the International Conference on Very Large Data Bases. IEEE, 1997, pp. 186--195. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: a wavelet-based clustering approach for spatial data in very large databases," The VLDB Journal, vol. 8, no. 3, pp. 289--304, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Mukhopadhyay and U. Maulik, "Unsupervised satellite image segmentation by combining SA based fuzzy clustering with support vector machine," in Proceedings of 7th ICAPR'09. IEEE, 2009, pp. 381--384. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. Birant and A. Kut, "ST-DBSCAN: An algorithm for clustering spatial-temporal data," Data & Knowledge Engineering, vol. 60, no. 1, pp. 208--221, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Surdeanu, J. Turmo, and A. Ageno, "A hybrid unsupervised approach for document clustering," in Proceedings of the 11th ACM SIGKDD. ACM, 2005, pp. 685--690. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Madeira and A. Oliveira, "Biclustering algorithms for biological data analysis: a survey," Computational Biology and Bioinformatics, IEEE/ACM Transactions on, vol. 1, no. 1, pp. 24--45, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques. Morgan Kaufmann, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. H. Kargupta and J. Han, Next generation of data mining. Chapman & Hall/CRC, 2009, vol. 7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Brecheisen, H. Kriegel, and M. Pfeifle, "Parallel density-based clustering of complex objects," Advances in Knowledge Discovery and Data Mining, pp. 179--188, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. Arlia and M. Coppola, "Experiments in parallel clustering with DBSCAN," in Euro-Par 2001 Parallel Processing. Springer, LNCS, 2001, pp. 326--331. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Chen, X. Gao, and H. Li, "Parallel DBSCAN with priority r-tree," in Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on. IEEE, 2010, pp. 508--511.Google ScholarGoogle Scholar
  17. M. Coppola and M. Vanneschi, "High-performance data mining with skeleton-based structured parallel programming," Parallel Computing, vol. 28, no. 5, pp. 793--813, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Y. Fu, W. Zhao, and H. Ma, "Research on parallel DBSCAN algorithm design based on mapreduce," Advanced Materials Research, vol. 301, pp. 1133--1138, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  19. X. Xu, J. Jäger, and H. Kriegel, "A fast parallel clustering algorithm for large spatial databases," High Performance Data Mining, pp. 263--290, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu, "Approaches for scaling DBSCAN algorithm to large spatial databases," Journal of computer science and technology, vol. 15, no. 6, pp. 509--526, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. T. Cormen, Introduction to algorithms. The MIT press, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. M. Patwary, J. Blair, and F. Manne, "Experiments on union-find algorithms for the disjoint-set data structure," in Proceedings of the 9th International Symposium on Experimental Algorithms (SEA 2010). Springer, LNCS 6049, 2010, pp. 411--423. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. M. A. Patwary, P. Refsnes, and F. Manne, "Multi-core spanning forest algorithms using the disjoint-set data structure," in Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2012). IEEE, 2012, to appear. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. B. Kennel, "KDTREE 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space," 2004, institute for Nonlinear Science, University of California.Google ScholarGoogle Scholar
  25. N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The r*-tree: an efficient and robust access method for points and rectangles," Proceedings of the 1990 ACM SIGMOD international conference on Management of data, vol. 19, no. 2, pp. 322--331, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509--517, 1975. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. R. Tarjan, "A class of algorithms which require nonlinear time to maintain disjoint sets," Journal of computer and system sciences, vol. 18, no. 2, pp. 110--127, 1979.Google ScholarGoogle ScholarCross RefCross Ref
  28. B. Galler and M. Fisher, "An improved equivalence algorithm," Communications of the ACM, vol. 7, pp. 301--303, 1964. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. Patwary, A. Gebremedhin, and A. Pothen, "New multithreaded ordering and coloring algorithms for multicore architectures," in Proceedings of 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011). Springer, LNCS 6853, 2011, pp. 250--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. F. Manne and M. Patwary, "A scalable parallel union-find algorithm for distributed memory computers," in Parallel Processing and Applied Mathematics. Springer, LNCS, 2010, pp. 186--195. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. "CLUTO - clustering high-dimensional datasets," 2006, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/.Google ScholarGoogle Scholar
  32. "Parallel K-means data clustering," 2005, http://users.eecs.northwestern.edu/wkliao/Kmeans/.Google ScholarGoogle Scholar
  33. R. Agrawal and R. Srikant, "Quest synthetic data generator," IBM Almaden Research Center, 1994.Google ScholarGoogle Scholar
  34. J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi, "NU-MineBench 3.0," Technical Report CUCIS-2005-08-01, Northwestern University, Tech. Rep., 2010.Google ScholarGoogle Scholar
  35. G. Lemson and the Virgo Consortium, "Halo and galaxy formation histories from the millennium simulation: Public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony," Arxiv preprint astro-ph/0608019, 2006.Google ScholarGoogle Scholar
  36. V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly et al., "Simulations of the formation, evolution and clustering of galaxies and quasars," Nature, vol. 435, no. 7042, pp. 629--636, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  37. S. Bertone, G. De Lucia, and P. Thomas, "The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model," Monthly Notices of the Royal Astronomical Society, vol. 379, no. 3, pp. 1143--1154, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  38. G. De Lucia and J. Blaizot, "The hierarchical formation of the brightest cluster galaxies," Monthly Notices of the Royal Astronomical Society, vol. 375, no. 1, pp. 2--14, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  39. R. Bower, A. Benson, R. Malbon, J. Helly, C. Frenk, C. Baugh, S. Cole, and C. Lacey, "Breaking the hierarchy of galaxy formation," Monthly Notices of the Royal Astronomical Society, vol. 370, no. 2, pp. 645--655, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  40. Y. Liu, W.-k. Liao, and A. Choudhary, "Design and evaluation of a parallel HOP clustering algorithm for cosmological simulation," in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 2003). Washington, DC, USA: IEEE, 2003, p. 82.1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
          November 2012
          1161 pages
          ISBN:9781467308045

          Publisher

          IEEE Computer Society Press

          Washington, DC, United States

          Publication History

          • Published: 10 November 2012

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          SC '12 Paper Acceptance Rate100of461submissions,22%Overall Acceptance Rate1,516of6,373submissions,24%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader