Abstract
Data represented geometrically in high-dimensional vector spaces arises in many applications. Images and videos are often represented by assigning a dimension to every pixel (and, for video, every frame). Text documents may be represented in a vector space in which each dictionary word contributes a dimension. The need to manipulate such data in huge corpora such as the web, and to support various query types, gives rise to the question of how to represent the data in a lower-dimensional space so as to allow more space- and time-efficient computation. Linear mappings are an attractive approach to this problem because the mapped input can be readily fed into popular algorithms that operate on linear spaces (such as principal component analysis, PCA) while avoiding the curse of dimensionality.
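As a small illustrative sketch (not taken from the article), such a linear map can be realized as multiplication by a dense random matrix; the dimensions, seed, and scaling below are arbitrary choices for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 500  # original and reduced dimensions (illustrative values)

# One shared random matrix: the same linear map is applied to every point,
# scaled by 1/sqrt(k) so that squared lengths are preserved in expectation.
R = rng.normal(size=(d, k)) / np.sqrt(k)

x = rng.normal(size=d)
y = rng.normal(size=d)

# With high probability the distance between the mapped points is close
# to the original distance, so the ratio below concentrates near 1.
ratio = np.linalg.norm(x @ R - y @ R) / np.linalg.norm(x - y)
```

Because the map is linear, it can be applied to an entire data matrix at once (`X @ R`), and the reduced points can then be fed directly into PCA, clustering, or nearest-neighbor routines.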
The fact that such mappings even exist became known in computer science following seminal work by Johnson and Lindenstrauss in the early 1980s. The underlying technique is often called "random projection." The complexity of the mapping itself, essentially the product of a vector with a dense matrix, attracted little attention until recently. In 2006, we discovered a way to "sparsify" the matrix via a computational version of Heisenberg's uncertainty principle. This leads to a significant speedup while retaining the practical simplicity of the standard Johnson-Lindenstrauss projection. We describe the improvement in this article, together with some of its applications.
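To give a flavor of the preconditioning idea, the sketch below flips random signs and applies a fast Walsh-Hadamard transform to "spread out" the input vector before reducing it; coordinate sampling is used here as a crude, illustrative stand-in for the sparse projection matrix of the actual fast transform, and all dimensions and seeds are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)

def fwht(v):
    """Fast Walsh-Hadamard transform in O(d log d); d must be a power of 2.
    Normalized by 1/sqrt(d) so that vector norms are preserved."""
    v = v.copy()
    d = len(v)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v / np.sqrt(d)

def fjlt_sketch(x, k, rng):
    """Illustrative fast projection: random signs (D), Hadamard mixing (H),
    then sampling of k coordinates (a stand-in for a sparse projection)."""
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)
    mixed = fwht(x * signs)                       # spreads out the mass of x
    idx = rng.choice(d, size=k, replace=False)    # keep k mixed coordinates
    return mixed[idx] * np.sqrt(d / k)            # rescale to preserve norms

d, k = 1024, 256
x = rng.normal(size=d)
y = fjlt_sketch(x, k, rng)
ratio = np.linalg.norm(y) / np.linalg.norm(x)  # concentrates near 1
```

The sign-flip plus Hadamard step is the computational analogue of the uncertainty principle mentioned above: after mixing, no single coordinate can carry most of the vector's energy, which is what makes a sparse (here, sampled) projection safe.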