research-article

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Authors:
Mostofa Ali Patwary

Northwestern University, Evanston, IL

Northwestern University, Evanston, IL
View Profile

,
Diana Palsetia

Northwestern University, Evanston, IL

Northwestern University, Evanston, IL
View Profile

,
Ankit Agrawal

Northwestern University, Evanston, IL

Northwestern University, Evanston, IL
View Profile

,
Wei-keng Liao

Northwestern University, Evanston, IL

Northwestern University, Evanston, IL
View Profile

,
Fredrik Manne

University of Bergen, Norway

University of Bergen, Norway
View Profile

,
Alok Choudhary

Northwestern University, Evanston, IL

Northwestern University, Evanston, IL
View Profile

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and AnalysisNovember 2012Article No.: 62Pages 1–11

Published:10 November 2012Publication History

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Pages 1–11

ABSTRACT

DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of Dbscan is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency.

We present a new parallel Dbscan algorithm (Pdsdbscan) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of Dbscan. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory.

Using data sets containing up to several hundred million high-dimensional points, we show that Pdsdbscan significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.

References

U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI magazine, vol. 17, no. 3, p. 37, 1996.Google Scholar
J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1. USA, 1967, pp. 281--297.Google Scholar
H. Park and C. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Systems with Applications, vol. 36, no. 2, pp. 3336--3341, 2009. Google ScholarDigital Library
T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in ACM SIGMOD Record, vol. 25(2). ACM, 1996, pp. 103--114. Google ScholarDigital Library
M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, vol. 1996. AAAI Press, 1996, pp. 226--231.Google Scholar
W. Wang, J. Yang, and R. Muntz, "STING: A statistical information grid approach to spatial data mining," in Proceedings of the International Conference on Very Large Data Bases. IEEE, 1997, pp. 186--195. Google ScholarDigital Library
G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: a wavelet-based clustering approach for spatial data in very large databases," The VLDB Journal, vol. 8, no. 3, pp. 289--304, 2000. Google ScholarDigital Library
A. Mukhopadhyay and U. Maulik, "Unsupervised satellite image segmentation by combining SA based fuzzy clustering with support vector machine," in Proceedings of 7th ICAPR'09. IEEE, 2009, pp. 381--384. Google ScholarDigital Library
D. Birant and A. Kut, "ST-DBSCAN: An algorithm for clustering spatial-temporal data," Data & Knowledge Engineering, vol. 60, no. 1, pp. 208--221, 2007. Google ScholarDigital Library
M. Surdeanu, J. Turmo, and A. Ageno, "A hybrid unsupervised approach for document clustering," in Proceedings of the 11th ACM SIGKDD. ACM, 2005, pp. 685--690. Google ScholarDigital Library
S. Madeira and A. Oliveira, "Biclustering algorithms for biological data analysis: a survey," Computational Biology and Bioinformatics, IEEE/ACM Transactions on, vol. 1, no. 1, pp. 24--45, 2004. Google ScholarDigital Library
J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques. Morgan Kaufmann, 2011. Google ScholarDigital Library
H. Kargupta and J. Han, Next generation of data mining. Chapman & Hall/CRC, 2009, vol. 7. Google ScholarDigital Library
S. Brecheisen, H. Kriegel, and M. Pfeifle, "Parallel density-based clustering of complex objects," Advances in Knowledge Discovery and Data Mining, pp. 179--188, 2006. Google ScholarDigital Library
D. Arlia and M. Coppola, "Experiments in parallel clustering with DBSCAN," in Euro-Par 2001 Parallel Processing. Springer, LNCS, 2001, pp. 326--331. Google ScholarDigital Library
M. Chen, X. Gao, and H. Li, "Parallel DBSCAN with priority r-tree," in Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on. IEEE, 2010, pp. 508--511.Google Scholar
M. Coppola and M. Vanneschi, "High-performance data mining with skeleton-based structured parallel programming," Parallel Computing, vol. 28, no. 5, pp. 793--813, 2002. Google ScholarDigital Library
Y. Fu, W. Zhao, and H. Ma, "Research on parallel DBSCAN algorithm design based on mapreduce," Advanced Materials Research, vol. 301, pp. 1133--1138, 2011.Google ScholarCross Ref
X. Xu, J. Jäger, and H. Kriegel, "A fast parallel clustering algorithm for large spatial databases," High Performance Data Mining, pp. 263--290, 2002. Google ScholarDigital Library
A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu, "Approaches for scaling DBSCAN algorithm to large spatial databases," Journal of computer science and technology, vol. 15, no. 6, pp. 509--526, 2000. Google ScholarDigital Library
T. Cormen, Introduction to algorithms. The MIT press, 2001. Google ScholarDigital Library
M. Patwary, J. Blair, and F. Manne, "Experiments on union-find algorithms for the disjoint-set data structure," in Proceedings of the 9th International Symposium on Experimental Algorithms (SEA 2010). Springer, LNCS 6049, 2010, pp. 411--423. Google ScholarDigital Library
M. M. A. Patwary, P. Refsnes, and F. Manne, "Multi-core spanning forest algorithms using the disjoint-set data structure," in Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2012). IEEE, 2012, to appear. Google ScholarDigital Library
M. B. Kennel, "KDTREE 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space," 2004, institute for Nonlinear Science, University of California.Google Scholar
N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The r*-tree: an efficient and robust access method for points and rectangles," Proceedings of the 1990 ACM SIGMOD international conference on Management of data, vol. 19, no. 2, pp. 322--331, 1990. Google ScholarDigital Library
J. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509--517, 1975. Google ScholarDigital Library
R. Tarjan, "A class of algorithms which require nonlinear time to maintain disjoint sets," Journal of computer and system sciences, vol. 18, no. 2, pp. 110--127, 1979.Google ScholarCross Ref
B. Galler and M. Fisher, "An improved equivalence algorithm," Communications of the ACM, vol. 7, pp. 301--303, 1964. Google ScholarDigital Library
M. Patwary, A. Gebremedhin, and A. Pothen, "New multithreaded ordering and coloring algorithms for multicore architectures," in Proceedings of 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011). Springer, LNCS 6853, 2011, pp. 250--262. Google ScholarDigital Library
F. Manne and M. Patwary, "A scalable parallel union-find algorithm for distributed memory computers," in Parallel Processing and Applied Mathematics. Springer, LNCS, 2010, pp. 186--195. Google ScholarDigital Library
"CLUTO - clustering high-dimensional datasets," 2006, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/.Google Scholar
"Parallel K-means data clustering," 2005, http://users.eecs.northwestern.edu/wkliao/Kmeans/.Google Scholar
R. Agrawal and R. Srikant, "Quest synthetic data generator," IBM Almaden Research Center, 1994.Google Scholar
J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi, "NU-MineBench 3.0," Technical Report CUCIS-2005-08-01, Northwestern University, Tech. Rep., 2010.Google Scholar
G. Lemson and the Virgo Consortium, "Halo and galaxy formation histories from the millennium simulation: Public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony," Arxiv preprint astro-ph/0608019, 2006.Google Scholar
V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly et al., "Simulations of the formation, evolution and clustering of galaxies and quasars," Nature, vol. 435, no. 7042, pp. 629--636, 2005.Google ScholarCross Ref
S. Bertone, G. De Lucia, and P. Thomas, "The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model," Monthly Notices of the Royal Astronomical Society, vol. 379, no. 3, pp. 1143--1154, 2007.Google ScholarCross Ref
G. De Lucia and J. Blaizot, "The hierarchical formation of the brightest cluster galaxies," Monthly Notices of the Royal Astronomical Society, vol. 375, no. 1, pp. 2--14, 2007.Google ScholarCross Ref
R. Bower, A. Benson, R. Malbon, J. Helly, C. Frenk, C. Baugh, S. Cole, and C. Lacey, "Breaking the hierarchy of galaxy formation," Monthly Notices of the Royal Astronomical Society, vol. 370, no. 2, pp. 645--655, 2006.Google ScholarCross Ref
Y. Liu, W.-k. Liao, and A. Choudhary, "Design and evaluation of a parallel HOP clustering algorithm for cosmological simulation," in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 2003). Washington, DC, USA: IEEE, 2003, p. 82.1. Google ScholarDigital Library

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining

Recommendations

Scalable parallel OPTICS data clustering using graph algorithmic techniques
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a ...
Read More
A new scalable parallel DBSCAN algorithm using the disjoint-set data structure
SC '12: Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis

DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of Dbscan is challenging as it exhibits an inherent sequential data access order. Moreover, ...
Read More
Pardicle: parallel approximate density-based clustering
SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Dbscan is a widely used isodensity-based clustering algorithm for particle data well-known for its ability to isolate arbitrarily-shaped clusters and to filter noise data. The algorithm is super-linear (O(nlogn)) and computationally expensive for large ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2012
1161 pages
ISBN:9781467308045
General Chair:
Jeffrey K. Hollingsworth
University of Maryland
Sponsors
In-Cooperation
Publisher
IEEE Computer Society Press
Washington, DC, United States
Publication History
- Published: 10 November 2012
Check for updates
Author Tags
density based clustering
disjoint-set data structure
union-find algorithm
Qualifiers
- research-article
Conference

Acceptance Rates
SC '12 Paper Acceptance Rate100of461submissions,22%Overall Acceptance Rate1,516of6,373submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 32
  Total Citations
  View Citations
- 790
  Total Downloads
- Downloads (Last 12 months)9
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Recommendations

Scalable parallel OPTICS data clustering using graph algorithmic techniques

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Pardicle: parallel approximate density-based clustering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Recommendations

Scalable parallel OPTICS data clustering using graph algorithmic techniques

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Pardicle: parallel approximate density-based clustering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media