skip to main content
10.1145/2723372.2737792acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Authors Info & Claims
Published:27 May 2015Publication History

ABSTRACT

DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method's vast success is the research community's quest for its efficient computation. The original KDD'96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim; and that algorithm actually requires O(n2) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found. Looking for a fix for dimensionality d ≥ 3 is currently an important open problem.

In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n4/3) time to solve, unless very significant breakthroughs---ones widely believed to be impossible---could be made in theoretical computer science. This (i) explains why the community's search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice. Surprisingly, we show that the running time can be dramatically brought down to O(n) in expectation regardless of the dimensionality d, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-approximate DBSCAN, which we believe should replace DBSCAN on big data due to the latter's computational intractability.

References

  1. P. K. Agarwal, H. Edelsbrunner, and O. Schwarzkopf. Euclidean minimum spanning trees and bichromatic closest pairs. Discrete & Computational Geometry, 6:407--422, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In SIGMOD, pages 49--60, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Arya and D. M. Mount. Approximate range searching. Computational Geometry, 17(3--4):135--152, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. K. Bache and M. Lichman. UCI machine learning repository, 2013.Google ScholarGoogle Scholar
  5. C. Böhm, K. Kailing, P. Kröger, and A. Zimek. Computing clusters of correlation connected objects. In SIGMOD, pages 455--466, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. Borah and D. K. Bhattacharyya. An improved sampling-based DBSCAN for large spatial databases. In Proceedings of Intelligent Sensing and Information Processing, pages 92--96, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  7. V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. Sparcl: Efficient and effective shape-based clustering. In ICDM, pages 93--102, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Erickson. On the relative complexities of some geometric problems. In CCCG, pages 85--90, 1995.Google ScholarGoogle Scholar
  9. J. Erickson. New lower bounds for Hopcroft's problem. Discrete & Computational Geometry, 16(4):389--418, 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, pages 226--231, 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Gunawan. A faster algorithm for DBSCAN. Master';s thesis, Technische University Eindhoven, March 2013.Google ScholarGoogle Scholar
  12. J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Klusch, S. Lodi, and G. Moro. Distributed clustering based on sampling local density estimates. In IJCAI, pages 485--490, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Z. Li, B. Ding, J. Han, and R. Kays. Swarm: Mining relaxed temporal moving object clusters. PVLDB, 3(1):723--734, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. B. Liu. A fast density-based clustering algorithm for large databases. In Proceedings of International Conference on Machine Learning and Cybernetics, pages 996--1000, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  16. E. H.-C. Lu, V. S. Tseng, and P. S. Yu. Mining cluster-based temporal mobile sequential patterns in location-based service environments. TKDE, 23(6):914--927, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Mahran and K. Mahar. Using grid for accelerating density-based clustering. In CIT, pages 35--40, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  18. J. Matousek. Range searching with efficient hiearchical cutting. Discrete & Computational Geometry, 10:157--182, 1993.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. B. L. Milenova and M. M. Campos. O-Cluster: Scalable clustering of large high dimensional data sets. In ICDM, pages 290--297, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. S. K. Pal and P. Mitra. Pattern Recognition Algorithms for Data Mining. Chapman and Hall/CRC, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. T. Pei, A.-X. Zhu, C. Zhou, B. Li, and C. Qin. A new approach to the nearest-neighbour method to discover cluster features in overlaid spatial point processes. International Journal of Geographical Information Science, 20(2):153--168, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  22. A. Reiss and D. Stricker. Introducing a new benchmarked dataset for activity monitoring. In International Symposium on Wearable Computers, pages 108--109, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. Roy and D. K. Bhattacharyya. An approach to find embedded clusters using density based techniques. In Proceedings of Distributed Computing and Internet Technology, pages 523--535, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A wavelet based clustering approach for spatial data in very large databases. VLDB J., 8(3-4):289--304, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. C.-F. Tsai and C.-T. Wu. GF-DBSCAN: A new efficient and effective data clustering technique for large databases. In Proceedings of International Conference on Multimedia Systems and Signal Processing, pages 231--236, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In CVPR, pages 691--698, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  28. W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In VLDB, pages 186--195, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. J.-R. Wen, J.-Y. Nie, and H. Zhang. Query clustering using user logs. TOIS, 20(1):59--81, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. C. Zhou, D. Frankowski, P. J. Ludford, S. Shekhar, and L. G. Terveen. Discovering personally meaningful places: An interactive clustering approach. TOIS, 25(3), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
        May 2015
        2110 pages
        ISBN:9781450327589
        DOI:10.1145/2723372

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 May 2015

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader