ABSTRACT
Clustering algorithms in the field of data-mining are used to aggregate similar objects into common groups. One of the best-known of these algorithms is called DBSCAN. Its distinct design enables the search for an apriori unknown number of arbitrarily shaped clusters, and at the same time allows to filter out noise. Due to its sequential formulation, the parallelization of DBSCAN renders a challenge. In this paper we present a new parallel approach which we call HPDBSCAN. It employs three major techniques in order to break the sequentiality, empower workload-balancing as well as speed up neighborhood searches in distributed parallel processing environments i) a computation split heuristic for domain decomposition, ii) a data index preprocessing step and iii) a rule-based cluster merging scheme.
As a proof-of-concept we implemented HPDBSCAN as an OpenMP/MPI hybrid application. Using real-world data sets, such as a point cloud from the old town of Bremen, Germany, we demonstrate that our implementation is able to achieve a significant speed-up and scale-up in common HPC setups. Moreover, we compare our approach with previous attempts to parallelize DBSCAN showing an order of magnitude improvement in terms of computation time and memory consumption.
- H. Abdi. Coefficient of variation. Encyclopedia of research design, pages 169--171, 2010.Google Scholar
- G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the spring joint computer conference 1967, pages 483--485. ACM, 1967. Google ScholarDigital Library
- D. Arlia and M. Coppola. Experiments in parallel clustering with dbscan. In Euro-Par 2001 Parallel Processing, pages 326--331. Springer, 2001. Google ScholarDigital Library
- J. L. Bentley and J. H. Friedman. Data structures for range searching. ACM Computing Surveys (CSUR), 11(4):397--409, 1979. Google ScholarDigital Library
- S. Brecheisen, H.-P. Kriegel, and M. Pfeifle. Parallel density-based clustering of complex objects. In Advances in Knowledge Discovery and Data Mining, pages 179--188. Springer, 2006. Google ScholarDigital Library
- C. Cantrell. Modern mathematical methods for physicists and engineers. CUP, 2000. Google ScholarDigital Library
- M. Chen, X. Gao, and H. Li. Parallel DBSCAN with Priority R-Tree. In Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference, pages 508--511. IEEE, 2010.Google ScholarCross Ref
- L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46--55, 1998. Google ScholarDigital Library
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226--231, 1996.Google ScholarDigital Library
- Forschungszentrum Jülich GmbH. Juelich Dedicated GPU Environment. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUDGE/JUDGE_node.html.Google Scholar
- Forschungszentrum Jülich GmbH. Juelich Storage Cluster. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Datamanagement/OnlineStorage/JUST/JUST_node.html.Google Scholar
- I. Foster. Designing and building parallel programs. Addison Wesley Publishing Company, 1995. Google ScholarDigital Library
- Y. X. Fu, W. Z. Zhao, and H. F. Ma. Research on parallel dbscan algorithm design based on mapreduce. Advanced Materials Research, 301:1133--1138, 2011.Google ScholarCross Ref
- Götz, Markus and Bodenstein, Christian. HPDBSCAN Benchmark test files. http://hdl.handle.net/11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f.Google Scholar
- W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the message-passing interface, volume 1. MIT press, 1999. Google ScholarDigital Library
- S. Habib, V. Morozov, H. Finkel, A. Pope, K. Heitmann, K. Kumaran, T. Peterka, J. Insley, D. Daniel, P. Fasel, et al. The universe at extreme scale: multi-petaflop sky simulation on the bg/q. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 4. IEEE Computer Society Press, 2012. Google ScholarDigital Library
- J. Han, M. Kamber, and J. Pei. Data Mining: concepts and techniques - 3rd ed. Morgan Kaufmann, 2011. Google ScholarDigital Library
- HDF Group. Hierachical Data Format 5. http://www.hdfgroup.org/HDF5.Google Scholar
- Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, and J. Fan. MR-DBSCAN: an efficient parallel density-based clustering algorithm using mapreduce. In Parallel and Distributed Systems, 2011 IEEE 17th International Conference, pages 473--480, 2011. Google ScholarDigital Library
- Jacobs University Bremen. 3D Scan Repository. http://kos.informatik.uni-osnabrueck.de/3Dscans/.Google Scholar
- A. Knöpfel, B. Gröne, and P. Tabeling. Fundamental modeling concepts, volume 154. Wiley, UK, 2005.Google Scholar
- T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and R. Rooholamini. An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution, 2002.Google Scholar
- Markus Götz. HPDBSCAN implementation. https://bitbucket.org/markus.goetz/hpdbscan.Google Scholar
- Northwestern University. Parallel netCDF. http://cucis.ece.northwestern.edu/projects/PnetCDF/.Google Scholar
- M. M. A. Patwary, D. Palsetia, A. Agrawal, et al. A new scalable parallel dbscan algorithm using the disjoint-set data structure. In High Performance Computing, Networking, Storage and Analysis (SC), International Conference for, pages 1--11. IEEE, 2012. Google ScholarDigital Library
- D. Šidlauskas, S. Šaltenis, C. W. Christiansen, J. M. Johansen, and D. Šaulys. Trees or grids?: indexing moving objects in main memory. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 236--245. ACM, 2009. Google ScholarDigital Library
- X. XU, J. JAGER, and H.-P. KRIEGEL. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3:263--290, 1999. Google ScholarDigital Library
- A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu. Approaches for scaling dbscan algorithm to large spatial databases. Journal of computer science and technology, 15(6):509--526, 2000. Google ScholarDigital Library
Index Terms
HPDBSCAN: highly parallel DBSCAN
Recommendations
Parallel multigrid solvers using OpenMP/MPI hybrid programming models on multi-core/multi-socket clusters
VECPAR'10: Proceedings of the 9th international conference on High performance computing for computational scienceOpenMP/MPI hybrid parallel programming models were implemented to 3D finite-volume based simulation code for groundwater flow problems through heterogeneous porous media using parallel iterative solvers with multigrid preconditioning. Performance and ...
A cluster-based solution for high performance hmmpfam using EARTH execution model
Hmmpfam is a widely used computation-intensive bioinformatics software for sequence classification. The contribution of this paper is the first largely scalable and robust cluster-based solution of parallel hmmpfam based on EARTH (Efficient Architecture ...
A massively parallel distributed n-body application implemented with HPX
ScalA '16: Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale SystemsOne of the major challenges in parallelization is the difficulty of improving application scalability with conventional techniques. HPX provides efficient scalable parallelism by significantly reducing node starvation and effective latencies while ...
Comments