research-article

DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Authors:
Junhao Gan

Chinese University of Hong Kong, New Territories , Hong Kong

Chinese University of Hong Kong, New Territories , Hong Kong
View Profile

,
Yufei Tao

Chinese University of Hong Kong, New Territories , Hong Kong

Chinese University of Hong Kong, New Territories , Hong Kong
View Profile

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataMay 2015Pages 519–530https://doi.org/10.1145/2723372.2737792

Published:27 May 2015Publication History

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 519–530

ABSTRACT

DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method's vast success is the research community's quest for its efficient computation. The original KDD'96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim; and that algorithm actually requires O(n²) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found. Looking for a fix for dimensionality d ≥ 3 is currently an important open problem.

In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n4/³) time to solve, unless very significant breakthroughs---ones widely believed to be impossible---could be made in theoretical computer science. This (i) explains why the community's search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice. Surprisingly, we show that the running time can be dramatically brought down to O(n) in expectation regardless of the dimensionality d, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-approximate DBSCAN, which we believe should replace DBSCAN on big data due to the latter's computational intractability.

References

P. K. Agarwal, H. Edelsbrunner, and O. Schwarzkopf. Euclidean minimum spanning trees and bichromatic closest pairs. Discrete & Computational Geometry, 6:407--422, 1991. Google ScholarDigital Library
M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In SIGMOD, pages 49--60, 1999. Google ScholarDigital Library
S. Arya and D. M. Mount. Approximate range searching. Computational Geometry, 17(3--4):135--152, 2000. Google ScholarDigital Library
K. Bache and M. Lichman. UCI machine learning repository, 2013.Google Scholar
C. Böhm, K. Kailing, P. Kröger, and A. Zimek. Computing clusters of correlation connected objects. In SIGMOD, pages 455--466, 2004. Google ScholarDigital Library
B. Borah and D. K. Bhattacharyya. An improved sampling-based DBSCAN for large spatial databases. In Proceedings of Intelligent Sensing and Information Processing, pages 92--96, 2004.Google ScholarCross Ref
V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. Sparcl: Efficient and effective shape-based clustering. In ICDM, pages 93--102, 2008. Google ScholarDigital Library
J. Erickson. On the relative complexities of some geometric problems. In CCCG, pages 85--90, 1995.Google Scholar
J. Erickson. New lower bounds for Hopcroft's problem. Discrete & Computational Geometry, 16(4):389--418, 1996.Google ScholarDigital Library
M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, pages 226--231, 1996.Google ScholarDigital Library
A. Gunawan. A faster algorithm for DBSCAN. Master';s thesis, Technische University Eindhoven, March 2013.Google Scholar
J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2012. Google ScholarDigital Library
M. Klusch, S. Lodi, and G. Moro. Distributed clustering based on sampling local density estimates. In IJCAI, pages 485--490, 2003. Google ScholarDigital Library
Z. Li, B. Ding, J. Han, and R. Kays. Swarm: Mining relaxed temporal moving object clusters. PVLDB, 3(1):723--734, 2010. Google ScholarDigital Library
B. Liu. A fast density-based clustering algorithm for large databases. In Proceedings of International Conference on Machine Learning and Cybernetics, pages 996--1000, 2006.Google ScholarCross Ref
E. H.-C. Lu, V. S. Tseng, and P. S. Yu. Mining cluster-based temporal mobile sequential patterns in location-based service environments. TKDE, 23(6):914--927, 2011. Google ScholarDigital Library
S. Mahran and K. Mahar. Using grid for accelerating density-based clustering. In CIT, pages 35--40, 2008.Google ScholarCross Ref
J. Matousek. Range searching with efficient hiearchical cutting. Discrete & Computational Geometry, 10:157--182, 1993.Google ScholarDigital Library
B. L. Milenova and M. M. Campos. O-Cluster: Scalable clustering of large high dimensional data sets. In ICDM, pages 290--297, 2002. Google ScholarDigital Library
S. K. Pal and P. Mitra. Pattern Recognition Algorithms for Data Mining. Chapman and Hall/CRC, 2004. Google ScholarDigital Library
T. Pei, A.-X. Zhu, C. Zhou, B. Li, and C. Qin. A new approach to the nearest-neighbour method to discover cluster features in overlaid spatial point processes. International Journal of Geographical Information Science, 20(2):153--168, 2006.Google ScholarCross Ref
A. Reiss and D. Stricker. Introducing a new benchmarked dataset for activity monitoring. In International Symposium on Wearable Computers, pages 108--109, 2012. Google ScholarDigital Library
S. Roy and D. K. Bhattacharyya. An approach to find embedded clusters using density based techniques. In Proceedings of Distributed Computing and Internet Technology, pages 523--535, 2005. Google ScholarDigital Library
G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A wavelet based clustering approach for spatial data in very large databases. VLDB J., 8(3-4):289--304, 2000. Google ScholarDigital Library
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson, 2006. Google ScholarDigital Library
C.-F. Tsai and C.-T. Wu. GF-DBSCAN: A new efficient and effective data clustering technique for large databases. In Proceedings of International Conference on Multimedia Systems and Signal Processing, pages 231--236, 2009. Google ScholarDigital Library
M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In CVPR, pages 691--698, 2003.Google ScholarCross Ref
W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In VLDB, pages 186--195, 1997. Google ScholarDigital Library
J.-R. Wen, J.-Y. Nie, and H. Zhang. Query clustering using user logs. TOIS, 20(1):59--81, 2002. Google ScholarDigital Library
C. Zhou, D. Frankowski, P. J. Ludford, S. Shekhar, and L. G. Terveen. Discovering personally meaningful places: An interactive clustering approach. TOIS, 25(3), 2007. Google ScholarDigital Library

Index Terms

DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

On the Hardness and Approximation of Euclidean DBSCAN
Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence

DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has received extensive applications. Its computational hardness is still unsolved to this date. The original KDD‚96 paper claimed an algorithm of O(n log n) ”average ...
Read More
AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities

Clustering is a typical data mining technique that partitions a dataset into multiple subsets of similar objects according to similarity metrics. In particular, density-based algorithms can find clusters of different shapes and sizes while remaining ...
Read More
A new hybrid method based on partitioning-based DBSCAN and ant clustering

Clustering problem is an unsupervised learning problem. It is a procedure that partition data objects into matching clusters. The data objects in the same cluster are quite similar to each other and dissimilar in the other clusters. Density-based ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
General Chair:
Timos Sellis
RMIT University, Australia
,
Program Chairs:
Susan B. Davidson
University of Pennsylvania, USA
,
Zack Ives
University of Pennsylvania, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
algorithm
dbscan
density-based clustering
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 124
  Total Citations
  View Citations
- 3,252
  Total Downloads
- Downloads (Last 12 months)304
- Downloads (Last 6 weeks)47
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

On the Hardness and Approximation of Euclidean DBSCAN

AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities

A new hybrid method based on partitioning-based DBSCAN and ant clustering