ABSTRACT
DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method's vast success is the research community's quest for its efficient computation. The original KDD'96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim; and that algorithm actually requires O(n2) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found. Looking for a fix for dimensionality d ≥ 3 is currently an important open problem.
In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n4/3) time to solve, unless very significant breakthroughs---ones widely believed to be impossible---could be made in theoretical computer science. This (i) explains why the community's search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice. Surprisingly, we show that the running time can be dramatically brought down to O(n) in expectation regardless of the dimensionality d, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-approximate DBSCAN, which we believe should replace DBSCAN on big data due to the latter's computational intractability.
- P. K. Agarwal, H. Edelsbrunner, and O. Schwarzkopf. Euclidean minimum spanning trees and bichromatic closest pairs. Discrete & Computational Geometry, 6:407--422, 1991. Google ScholarDigital Library
- M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In SIGMOD, pages 49--60, 1999. Google ScholarDigital Library
- S. Arya and D. M. Mount. Approximate range searching. Computational Geometry, 17(3--4):135--152, 2000. Google ScholarDigital Library
- K. Bache and M. Lichman. UCI machine learning repository, 2013.Google Scholar
- C. Böhm, K. Kailing, P. Kröger, and A. Zimek. Computing clusters of correlation connected objects. In SIGMOD, pages 455--466, 2004. Google ScholarDigital Library
- B. Borah and D. K. Bhattacharyya. An improved sampling-based DBSCAN for large spatial databases. In Proceedings of Intelligent Sensing and Information Processing, pages 92--96, 2004.Google ScholarCross Ref
- V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. Sparcl: Efficient and effective shape-based clustering. In ICDM, pages 93--102, 2008. Google ScholarDigital Library
- J. Erickson. On the relative complexities of some geometric problems. In CCCG, pages 85--90, 1995.Google Scholar
- J. Erickson. New lower bounds for Hopcroft's problem. Discrete & Computational Geometry, 16(4):389--418, 1996.Google ScholarDigital Library
- M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, pages 226--231, 1996.Google ScholarDigital Library
- A. Gunawan. A faster algorithm for DBSCAN. Master';s thesis, Technische University Eindhoven, March 2013.Google Scholar
- J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2012. Google ScholarDigital Library
- M. Klusch, S. Lodi, and G. Moro. Distributed clustering based on sampling local density estimates. In IJCAI, pages 485--490, 2003. Google ScholarDigital Library
- Z. Li, B. Ding, J. Han, and R. Kays. Swarm: Mining relaxed temporal moving object clusters. PVLDB, 3(1):723--734, 2010. Google ScholarDigital Library
- B. Liu. A fast density-based clustering algorithm for large databases. In Proceedings of International Conference on Machine Learning and Cybernetics, pages 996--1000, 2006.Google ScholarCross Ref
- E. H.-C. Lu, V. S. Tseng, and P. S. Yu. Mining cluster-based temporal mobile sequential patterns in location-based service environments. TKDE, 23(6):914--927, 2011. Google ScholarDigital Library
- S. Mahran and K. Mahar. Using grid for accelerating density-based clustering. In CIT, pages 35--40, 2008.Google ScholarCross Ref
- J. Matousek. Range searching with efficient hiearchical cutting. Discrete & Computational Geometry, 10:157--182, 1993.Google ScholarDigital Library
- B. L. Milenova and M. M. Campos. O-Cluster: Scalable clustering of large high dimensional data sets. In ICDM, pages 290--297, 2002. Google ScholarDigital Library
- S. K. Pal and P. Mitra. Pattern Recognition Algorithms for Data Mining. Chapman and Hall/CRC, 2004. Google ScholarDigital Library
- T. Pei, A.-X. Zhu, C. Zhou, B. Li, and C. Qin. A new approach to the nearest-neighbour method to discover cluster features in overlaid spatial point processes. International Journal of Geographical Information Science, 20(2):153--168, 2006.Google ScholarCross Ref
- A. Reiss and D. Stricker. Introducing a new benchmarked dataset for activity monitoring. In International Symposium on Wearable Computers, pages 108--109, 2012. Google ScholarDigital Library
- S. Roy and D. K. Bhattacharyya. An approach to find embedded clusters using density based techniques. In Proceedings of Distributed Computing and Internet Technology, pages 523--535, 2005. Google ScholarDigital Library
- G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A wavelet based clustering approach for spatial data in very large databases. VLDB J., 8(3-4):289--304, 2000. Google ScholarDigital Library
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson, 2006. Google ScholarDigital Library
- C.-F. Tsai and C.-T. Wu. GF-DBSCAN: A new efficient and effective data clustering technique for large databases. In Proceedings of International Conference on Multimedia Systems and Signal Processing, pages 231--236, 2009. Google ScholarDigital Library
- M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In CVPR, pages 691--698, 2003.Google ScholarCross Ref
- W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In VLDB, pages 186--195, 1997. Google ScholarDigital Library
- J.-R. Wen, J.-Y. Nie, and H. Zhang. Query clustering using user logs. TOIS, 20(1):59--81, 2002. Google ScholarDigital Library
- C. Zhou, D. Frankowski, P. J. Ludford, S. Shekhar, and L. G. Terveen. Discovering personally meaningful places: An interactive clustering approach. TOIS, 25(3), 2007. Google ScholarDigital Library
Index Terms
- DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation
Recommendations
On the Hardness and Approximation of Euclidean DBSCAN
Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical CorrespondenceDBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has received extensive applications. Its computational hardness is still unsolved to this date. The original KDD‚96 paper claimed an algorithm of O(n log n) ”average ...
AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities
Clustering is a typical data mining technique that partitions a dataset into multiple subsets of similar objects according to similarity metrics. In particular, density-based algorithms can find clusters of different shapes and sizes while remaining ...
A new hybrid method based on partitioning-based DBSCAN and ant clustering
Clustering problem is an unsupervised learning problem. It is a procedure that partition data objects into matching clusters. The data objects in the same cluster are quite similar to each other and dissimilar in the other clusters. Density-based ...
Comments