skip to main content
10.1145/2834892.2834894acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

HPDBSCAN: highly parallel DBSCAN

Authors Info & Claims
Published:15 November 2015Publication History

ABSTRACT

Clustering algorithms in the field of data-mining are used to aggregate similar objects into common groups. One of the best-known of these algorithms is called DBSCAN. Its distinct design enables the search for an apriori unknown number of arbitrarily shaped clusters, and at the same time allows to filter out noise. Due to its sequential formulation, the parallelization of DBSCAN renders a challenge. In this paper we present a new parallel approach which we call HPDBSCAN. It employs three major techniques in order to break the sequentiality, empower workload-balancing as well as speed up neighborhood searches in distributed parallel processing environments i) a computation split heuristic for domain decomposition, ii) a data index preprocessing step and iii) a rule-based cluster merging scheme.

As a proof-of-concept we implemented HPDBSCAN as an OpenMP/MPI hybrid application. Using real-world data sets, such as a point cloud from the old town of Bremen, Germany, we demonstrate that our implementation is able to achieve a significant speed-up and scale-up in common HPC setups. Moreover, we compare our approach with previous attempts to parallelize DBSCAN showing an order of magnitude improvement in terms of computation time and memory consumption.

References

  1. H. Abdi. Coefficient of variation. Encyclopedia of research design, pages 169--171, 2010.Google ScholarGoogle Scholar
  2. G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the spring joint computer conference 1967, pages 483--485. ACM, 1967. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. D. Arlia and M. Coppola. Experiments in parallel clustering with dbscan. In Euro-Par 2001 Parallel Processing, pages 326--331. Springer, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. L. Bentley and J. H. Friedman. Data structures for range searching. ACM Computing Surveys (CSUR), 11(4):397--409, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Brecheisen, H.-P. Kriegel, and M. Pfeifle. Parallel density-based clustering of complex objects. In Advances in Knowledge Discovery and Data Mining, pages 179--188. Springer, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C. Cantrell. Modern mathematical methods for physicists and engineers. CUP, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. Chen, X. Gao, and H. Li. Parallel DBSCAN with Priority R-Tree. In Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference, pages 508--511. IEEE, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  8. L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46--55, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226--231, 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Forschungszentrum Jülich GmbH. Juelich Dedicated GPU Environment. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUDGE/JUDGE_node.html.Google ScholarGoogle Scholar
  11. Forschungszentrum Jülich GmbH. Juelich Storage Cluster. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Datamanagement/OnlineStorage/JUST/JUST_node.html.Google ScholarGoogle Scholar
  12. I. Foster. Designing and building parallel programs. Addison Wesley Publishing Company, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Y. X. Fu, W. Z. Zhao, and H. F. Ma. Research on parallel dbscan algorithm design based on mapreduce. Advanced Materials Research, 301:1133--1138, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  14. Götz, Markus and Bodenstein, Christian. HPDBSCAN Benchmark test files. http://hdl.handle.net/11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f.Google ScholarGoogle Scholar
  15. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the message-passing interface, volume 1. MIT press, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Habib, V. Morozov, H. Finkel, A. Pope, K. Heitmann, K. Kumaran, T. Peterka, J. Insley, D. Daniel, P. Fasel, et al. The universe at extreme scale: multi-petaflop sky simulation on the bg/q. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 4. IEEE Computer Society Press, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Han, M. Kamber, and J. Pei. Data Mining: concepts and techniques - 3rd ed. Morgan Kaufmann, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. HDF Group. Hierachical Data Format 5. http://www.hdfgroup.org/HDF5.Google ScholarGoogle Scholar
  19. Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, and J. Fan. MR-DBSCAN: an efficient parallel density-based clustering algorithm using mapreduce. In Parallel and Distributed Systems, 2011 IEEE 17th International Conference, pages 473--480, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Jacobs University Bremen. 3D Scan Repository. http://kos.informatik.uni-osnabrueck.de/3Dscans/.Google ScholarGoogle Scholar
  21. A. Knöpfel, B. Gröne, and P. Tabeling. Fundamental modeling concepts, volume 154. Wiley, UK, 2005.Google ScholarGoogle Scholar
  22. T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and R. Rooholamini. An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution, 2002.Google ScholarGoogle Scholar
  23. Markus Götz. HPDBSCAN implementation. https://bitbucket.org/markus.goetz/hpdbscan.Google ScholarGoogle Scholar
  24. Northwestern University. Parallel netCDF. http://cucis.ece.northwestern.edu/projects/PnetCDF/.Google ScholarGoogle Scholar
  25. M. M. A. Patwary, D. Palsetia, A. Agrawal, et al. A new scalable parallel dbscan algorithm using the disjoint-set data structure. In High Performance Computing, Networking, Storage and Analysis (SC), International Conference for, pages 1--11. IEEE, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. D. Šidlauskas, S. Šaltenis, C. W. Christiansen, J. M. Johansen, and D. Šaulys. Trees or grids?: indexing moving objects in main memory. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 236--245. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. X. XU, J. JAGER, and H.-P. KRIEGEL. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3:263--290, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu. Approaches for scaling dbscan algorithm to large spatial databases. Journal of computer science and technology, 15(6):509--526, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. HPDBSCAN: highly parallel DBSCAN

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        MLHPC '15: Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments
        November 2015
        40 pages
        ISBN:9781450340069
        DOI:10.1145/2834892

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 15 November 2015

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        MLHPC '15 Paper Acceptance Rate5of7submissions,71%Overall Acceptance Rate5of7submissions,71%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader