ABSTRACT
DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of Dbscan is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency.
We present a new parallel Dbscan algorithm (Pdsdbscan) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of Dbscan. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory.
Using data sets containing up to several hundred million high-dimensional points, we show that Pdsdbscan significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.
- U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI magazine, vol. 17, no. 3, p. 37, 1996.Google Scholar
- J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1. USA, 1967, pp. 281--297.Google Scholar
- H. Park and C. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Systems with Applications, vol. 36, no. 2, pp. 3336--3341, 2009. Google ScholarDigital Library
- T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in ACM SIGMOD Record, vol. 25(2). ACM, 1996, pp. 103--114. Google ScholarDigital Library
- M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, vol. 1996. AAAI Press, 1996, pp. 226--231.Google Scholar
- W. Wang, J. Yang, and R. Muntz, "STING: A statistical information grid approach to spatial data mining," in Proceedings of the International Conference on Very Large Data Bases. IEEE, 1997, pp. 186--195. Google ScholarDigital Library
- G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: a wavelet-based clustering approach for spatial data in very large databases," The VLDB Journal, vol. 8, no. 3, pp. 289--304, 2000. Google ScholarDigital Library
- A. Mukhopadhyay and U. Maulik, "Unsupervised satellite image segmentation by combining SA based fuzzy clustering with support vector machine," in Proceedings of 7th ICAPR'09. IEEE, 2009, pp. 381--384. Google ScholarDigital Library
- D. Birant and A. Kut, "ST-DBSCAN: An algorithm for clustering spatial-temporal data," Data & Knowledge Engineering, vol. 60, no. 1, pp. 208--221, 2007. Google ScholarDigital Library
- M. Surdeanu, J. Turmo, and A. Ageno, "A hybrid unsupervised approach for document clustering," in Proceedings of the 11th ACM SIGKDD. ACM, 2005, pp. 685--690. Google ScholarDigital Library
- S. Madeira and A. Oliveira, "Biclustering algorithms for biological data analysis: a survey," Computational Biology and Bioinformatics, IEEE/ACM Transactions on, vol. 1, no. 1, pp. 24--45, 2004. Google ScholarDigital Library
- J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques. Morgan Kaufmann, 2011. Google ScholarDigital Library
- H. Kargupta and J. Han, Next generation of data mining. Chapman & Hall/CRC, 2009, vol. 7. Google ScholarDigital Library
- S. Brecheisen, H. Kriegel, and M. Pfeifle, "Parallel density-based clustering of complex objects," Advances in Knowledge Discovery and Data Mining, pp. 179--188, 2006. Google ScholarDigital Library
- D. Arlia and M. Coppola, "Experiments in parallel clustering with DBSCAN," in Euro-Par 2001 Parallel Processing. Springer, LNCS, 2001, pp. 326--331. Google ScholarDigital Library
- M. Chen, X. Gao, and H. Li, "Parallel DBSCAN with priority r-tree," in Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on. IEEE, 2010, pp. 508--511.Google Scholar
- M. Coppola and M. Vanneschi, "High-performance data mining with skeleton-based structured parallel programming," Parallel Computing, vol. 28, no. 5, pp. 793--813, 2002. Google ScholarDigital Library
- Y. Fu, W. Zhao, and H. Ma, "Research on parallel DBSCAN algorithm design based on mapreduce," Advanced Materials Research, vol. 301, pp. 1133--1138, 2011.Google ScholarCross Ref
- X. Xu, J. Jäger, and H. Kriegel, "A fast parallel clustering algorithm for large spatial databases," High Performance Data Mining, pp. 263--290, 2002. Google ScholarDigital Library
- A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu, "Approaches for scaling DBSCAN algorithm to large spatial databases," Journal of computer science and technology, vol. 15, no. 6, pp. 509--526, 2000. Google ScholarDigital Library
- T. Cormen, Introduction to algorithms. The MIT press, 2001. Google ScholarDigital Library
- M. Patwary, J. Blair, and F. Manne, "Experiments on union-find algorithms for the disjoint-set data structure," in Proceedings of the 9th International Symposium on Experimental Algorithms (SEA 2010). Springer, LNCS 6049, 2010, pp. 411--423. Google ScholarDigital Library
- M. M. A. Patwary, P. Refsnes, and F. Manne, "Multi-core spanning forest algorithms using the disjoint-set data structure," in Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2012). IEEE, 2012, to appear. Google ScholarDigital Library
- M. B. Kennel, "KDTREE 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space," 2004, institute for Nonlinear Science, University of California.Google Scholar
- N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The r*-tree: an efficient and robust access method for points and rectangles," Proceedings of the 1990 ACM SIGMOD international conference on Management of data, vol. 19, no. 2, pp. 322--331, 1990. Google ScholarDigital Library
- J. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509--517, 1975. Google ScholarDigital Library
- R. Tarjan, "A class of algorithms which require nonlinear time to maintain disjoint sets," Journal of computer and system sciences, vol. 18, no. 2, pp. 110--127, 1979.Google ScholarCross Ref
- B. Galler and M. Fisher, "An improved equivalence algorithm," Communications of the ACM, vol. 7, pp. 301--303, 1964. Google ScholarDigital Library
- M. Patwary, A. Gebremedhin, and A. Pothen, "New multithreaded ordering and coloring algorithms for multicore architectures," in Proceedings of 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011). Springer, LNCS 6853, 2011, pp. 250--262. Google ScholarDigital Library
- F. Manne and M. Patwary, "A scalable parallel union-find algorithm for distributed memory computers," in Parallel Processing and Applied Mathematics. Springer, LNCS, 2010, pp. 186--195. Google ScholarDigital Library
- "CLUTO - clustering high-dimensional datasets," 2006, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/.Google Scholar
- "Parallel K-means data clustering," 2005, http://users.eecs.northwestern.edu/wkliao/Kmeans/.Google Scholar
- R. Agrawal and R. Srikant, "Quest synthetic data generator," IBM Almaden Research Center, 1994.Google Scholar
- J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi, "NU-MineBench 3.0," Technical Report CUCIS-2005-08-01, Northwestern University, Tech. Rep., 2010.Google Scholar
- G. Lemson and the Virgo Consortium, "Halo and galaxy formation histories from the millennium simulation: Public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony," Arxiv preprint astro-ph/0608019, 2006.Google Scholar
- V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly et al., "Simulations of the formation, evolution and clustering of galaxies and quasars," Nature, vol. 435, no. 7042, pp. 629--636, 2005.Google ScholarCross Ref
- S. Bertone, G. De Lucia, and P. Thomas, "The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model," Monthly Notices of the Royal Astronomical Society, vol. 379, no. 3, pp. 1143--1154, 2007.Google ScholarCross Ref
- G. De Lucia and J. Blaizot, "The hierarchical formation of the brightest cluster galaxies," Monthly Notices of the Royal Astronomical Society, vol. 375, no. 1, pp. 2--14, 2007.Google ScholarCross Ref
- R. Bower, A. Benson, R. Malbon, J. Helly, C. Frenk, C. Baugh, S. Cole, and C. Lacey, "Breaking the hierarchy of galaxy formation," Monthly Notices of the Royal Astronomical Society, vol. 370, no. 2, pp. 645--655, 2006.Google ScholarCross Ref
- Y. Liu, W.-k. Liao, and A. Choudhary, "Design and evaluation of a parallel HOP clustering algorithm for cosmological simulation," in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 2003). Washington, DC, USA: IEEE, 2003, p. 82.1. Google ScholarDigital Library
- A new scalable parallel DBSCAN algorithm using the disjoint-set data structure
Recommendations
Scalable parallel OPTICS data clustering using graph algorithmic techniques
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and AnalysisOPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a ...
A new scalable parallel DBSCAN algorithm using the disjoint-set data structure
SC '12: Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and AnalysisDBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of Dbscan is challenging as it exhibits an inherent sequential data access order. Moreover, ...
Pardicle: parallel approximate density-based clustering
SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and AnalysisDbscan is a widely used isodensity-based clustering algorithm for particle data well-known for its ability to isolate arbitrarily-shaped clusters and to filter noise data. The algorithm is super-linear (O(nlogn)) and computationally expensive for large ...
Comments