Abstract
Given a set of $n$ $d$-dimensional Boolean vectors with the promise that the vectors are chosen uniformly at random, with the exception of two vectors that have Pearson correlation coefficient $\rho$ (Hamming distance $d \cdot \frac{1-\rho}{2}$), how quickly can one find the two correlated vectors? We present an algorithm which, for any constants $\epsilon > 0$ and $\rho > 0$, runs in expected time $O\big(n^{\frac{5-\omega}{4-\omega}+\epsilon} + nd\big) < O(n^{1.62} + nd)$, where $\omega < 2.4$ is the exponent of matrix multiplication. This is the first subquadratic-time algorithm for this problem in which $\rho$ does not appear in the exponent of $n$; it improves upon the $O(n^{2-O(\rho)})$ runtime of Paturi et al. [1989], the Locality Sensitive Hashing approach of Indyk and Motwani [1998], and the Bucketing Codes approach of Dubiner [2008].
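To make the core idea concrete, here is a minimal sketch (in Python/NumPy, with illustrative parameters and the hypothetical helper name light_bulb) of the vector-aggregation trick underlying this line of algorithms: sum the vectors within each group, locate the pair of groups with an anomalously large inner product via a single matrix product, and brute-force only inside that pair. The paper's actual algorithm additionally uses an XOR-based amplification step and handles a planted pair that lands inside one group; this sketch assumes the pair falls in two different groups.

```python
import numpy as np

def light_bulb(X, group_size):
    """Sketch: find a planted correlated pair among random +/-1 vectors.

    Sum the vectors in each group, compute all group-pair inner products
    with one matrix product, and brute-force only inside the pair of
    groups whose inner product stands out. (Assumes the planted pair
    falls in two different groups.)
    """
    n, d = X.shape
    g = group_size
    m = n // g                                      # number of groups
    S = X[:m * g].reshape(m, g, d).sum(axis=1).astype(float)

    G = S @ S.T                                     # m x m group inner products
    np.fill_diagonal(G, -np.inf)                    # ignore within-group entries
    a, b = np.unravel_index(np.argmax(G), G.shape)

    # Brute-force search inside the flagged pair of groups.
    pairs = ((i, j) for i in range(a * g, (a + 1) * g)
                    for j in range(b * g, (b + 1) * g))
    return max(pairs, key=lambda p: X[p[0]] @ X[p[1]])

# Demo: plant a pair with correlation rho = 0.5 among random vectors.
rng = np.random.default_rng(0)
n, d, rho = 1000, 10000, 0.5
X = rng.choice([-1, 1], size=(n, d))
agree = rng.random(d) < (1 + rho) / 2               # agree on ~(1+rho)/2 coords
X[123] = np.where(agree, X[7], -X[7])               # plant the correlated pair
print(light_bulb(X, group_size=5))                  # expect (7, 123)
```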
Applications and extensions of this basic algorithm yield significantly improved algorithms for several other problems.
Approximate Closest Pair. For any sufficiently small constant $\epsilon > 0$, given $n$ $d$-dimensional vectors, there exists an algorithm that returns a pair of vectors whose Euclidean (or Hamming) distance differs from that of the closest pair by a factor of at most $1+\epsilon$, and runs in time $O(n^{2-\Theta(\sqrt{\epsilon})})$. The best previous algorithms (including Locality Sensitive Hashing) have runtime $O(n^{2-O(\epsilon)})$.
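For reference, the quadratic baseline that such approximation algorithms improve on can itself be phrased as one matrix product, via the identity that for $u, v \in \{-1,+1\}^d$ the Hamming distance equals $(d - \langle u, v\rangle)/2$. A minimal sketch (the function name closest_pair_hamming and the $\pm 1$ encoding are illustrative, not from the paper):

```python
import numpy as np

def closest_pair_hamming(X):
    """Exact closest pair of +/-1 vectors via one Gram matrix.

    For u, v in {-1,+1}^d, Hamming distance = (d - <u, v>) / 2, so all
    n^2 pairwise distances come from a single matrix product. This is
    the quadratic baseline that the subquadratic approximation beats.
    """
    n, d = X.shape
    H = (d - X @ X.T) // 2               # pairwise Hamming distances
    H[np.tril_indices(n)] = d + 1        # mask the diagonal and duplicates
    i, j = np.unravel_index(np.argmin(H), H.shape)
    return i, j, H[i, j]
```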
Learning Sparse Parities with Noise. Given samples from an instance of the learning parities with noise problem in which each example has length $n$, the true parity set has size at most $k \ll n$, and the noise rate is $\eta$, there exists an algorithm that identifies the set of $k$ indices in time $n^{\frac{\omega+\epsilon}{3}k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big) < n^{0.8k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big)$. This is the first algorithm with no dependence on $\eta$ in the exponent of $n$, aside from the trivial $O\big(\binom{n}{k}\big) \approx O(n^k)$ brute-force algorithm, and for large noise rates ($\eta > 0.4$) it improves upon the results of Grigorescu et al. [2011], which give a runtime of $n^{(1+(2\eta)^2 + o(1))\frac{k}{2}} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big)$.
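The trivial brute-force algorithm mentioned above is simple to state: try every size-$k$ index set and keep the parity that agrees with the most labels. A sketch under the standard problem setup (function and parameter names are illustrative; for noise rate $\eta < 1/2$ and sufficiently many samples, the true parity set wins with high probability):

```python
import numpy as np
from itertools import combinations

def brute_force_sparse_parity(X, y, k):
    """The trivial O(n^k)-time learner the abstract compares against.

    X: m x n 0/1 example matrix; y: length-m 0/1 labels, each equal to
    the parity of an unknown size-k index set flipped with prob. eta.
    Returns the size-k subset whose parity agrees with the most labels.
    """
    m, n = X.shape
    best_set, best_agree = None, -1
    for S in combinations(range(n), k):
        agree = np.sum(X[:, list(S)].sum(axis=1) % 2 == y)
        if agree > best_agree:
            best_set, best_agree = S, agree
    return best_set

# Demo: n = 12 variables, true parity on {2, 5, 9}, noise rate 0.2.
rng = np.random.default_rng(1)
m, n, eta = 2000, 12, 0.2
X = rng.integers(0, 2, size=(m, n))
y = (X[:, [2, 5, 9]].sum(axis=1) + (rng.random(m) < eta)) % 2
print(brute_force_sparse_parity(X, y, k=3))   # expect (2, 5, 9)
```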
Learning k-Juntas with Noise. Given uniformly random length-$n$ Boolean vectors, together with a label which is some function of just $k \ll n$ of the bits, perturbed by noise rate $\eta$, return the set of relevant indices. Leveraging the reduction of Feldman et al. [2009], our result for learning sparse parities implies an algorithm for this problem with runtime $n^{\frac{\omega+\epsilon}{3}k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big) < n^{0.8k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big)$, which is the first runtime for this problem of the form $n^{ck}$ with an absolute constant $c < 1$.
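The reason a sparse-parity learner helps here is Fourier-analytic: any non-constant $k$-junta has nonzero correlation with at least one parity over its relevant variables, so a (noisy) sparse-parity learner can expose relevant indices. A small numerical sanity check of that fact (noiseless for simplicity; all names and parameters are illustrative, and this is not the Feldman et al. reduction itself):

```python
import numpy as np
from itertools import combinations, chain

# Fact behind the reduction: a non-constant k-junta correlates with
# some parity over its relevant variables.
rng = np.random.default_rng(2)
m, n, relevant = 20000, 10, (1, 4, 7)          # k = 3 relevant indices
truth_table = rng.integers(0, 2, size=8)       # random 3-junta

X = rng.integers(0, 2, size=(m, n))
cells = X[:, list(relevant)] @ np.array([4, 2, 1])
y = truth_table[cells]                         # junta labels (no noise here)

# Estimate |correlation| of y with every parity of size <= 3.
def parities(n, k):
    return chain.from_iterable(combinations(range(n), s)
                               for s in range(1, k + 1))

best = max(parities(n, 3),
           key=lambda S: abs(np.mean((-1.0) ** (X[:, list(S)].sum(1) + y))))
# W.h.p. (assuming the random junta is non-constant), best is a
# non-empty subset of the relevant indices (1, 4, 7).
print(best)
```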
Learning k-Juntas without Noise. Given uniformly random length-$n$ Boolean vectors, together with a label which is some function of $k \ll n$ of the bits, return the set of relevant indices. Using a modification of the algorithm of Mossel et al. [2004], and employing our algorithm for learning sparse parities with noise via the reduction of Feldman et al. [2009], we obtain an algorithm for this problem with runtime $n^{\frac{\omega+\epsilon}{4}k} \cdot \mathrm{poly}(n) < n^{0.6k} \cdot \mathrm{poly}(n)$, which improves on the previous best of $n^{\frac{\omega}{\omega+1}k} \approx n^{0.7k} \cdot \mathrm{poly}(n)$ of Mossel et al. [2004].
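For concreteness, the numerical exponents quoted above follow by substituting known bounds on $\omega$: with $\omega \le 2.3727$ [Vassilevska Williams 2012] and $\epsilon$ sufficiently small,

$$\frac{5-\omega}{4-\omega} \le \frac{2.6273}{1.6273} \approx 1.615 < 1.62, \qquad \frac{\omega+\epsilon}{3} < 0.8, \qquad \frac{\omega+\epsilon}{4} < 0.6, \qquad \frac{\omega}{\omega+1} \approx \frac{2.3727}{3.3727} \approx 0.70.$$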
References
- M. Ajtai, R. Kumar, and D. Sivakumar. 2001. A sieve algorithm for the shortest lattice vector problem. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 601--610.
- N. Alon and A. Naor. 2004. Approximating the cut-norm via Grothendieck's inequality. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 72--80.
- A. Andoni and P. Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS). 459--468.
- A. Andoni and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51, 1, 117--122.
- S. Arora and R. Ge. 2011. New algorithms for learning in presence of errors. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 403--415.
- J. L. Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509--517.
- A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. 1994. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 253--262.
- A. Blum, A. Kalai, and H. Wasserman. 2003. Noise-tolerant learning, the parity problem, and the statistical query model. J. ACM 50, 4, 507--519.
- Z. Brakerski and V. Vaikuntanathan. 2011. Efficient fully homomorphic encryption from (standard) LWE. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).
- M. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the ACM Symposium on Theory of Computing (STOC).
- K. Clarkson. 1988. A randomized algorithm for closest-point queries. SIAM J. Comput. 17, 4, 830--847.
- D. Coppersmith. 1997. Rectangular matrix multiplication revisited. J. Complex. 13, 1, 42--49.
- M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry (SoCG). 253--262.
- M. Dubiner. 2008. Bucketing coding and information theory for the statistical high dimensional nearest neighbor problem. CoRR abs/0810.4182.
- V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. 2009. On agnostic learning of parities, monomials and halfspaces. SIAM J. Comput. 39, 2, 606--645.
- E. Grigorescu, L. Reyzin, and S. Vempala. 2011. On noise-tolerant learning of sparse parities and related problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT).
- N. J. Hopper and M. Blum. 2001. Secure human identification protocols. In Proceedings of ASIACRYPT. 52--66.
- R. Impagliazzo and D. Zuckerman. 1989. How to recycle random bits. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS). 248--253.
- P. Indyk and R. Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the ACM Symposium on Theory of Computing (STOC).
- M. Kearns. 1998. Efficient noise-tolerant learning from statistical queries. J. ACM 45, 6, 983--1006.
- E. Kushilevitz, R. Ostrovsky, and Y. Rabani. 2000. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30, 2, 457--474.
- V. Lyubashevsky. 2005. The parity problem in the presence of noise, decoding random linear codes, and the subset sum problem. In Proceedings of RANDOM. 378--389.
- J. Marchini, P. Donnelly, and L. R. Cardon. 2005. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 37, 4, 413--417.
- S. Meiser. 1993. Point location in arrangements of hyperplanes. Inf. Comput. 106, 2, 286--303.
- E. Mossel, R. O'Donnell, and R. Servedio. 2004. Learning functions of k relevant variables. J. Comput. System Sci. 69, 3, 421--434.
- R. Motwani, A. Naor, and R. Panigrahy. 2006. Lower bounds on locality sensitive hashing. In Proceedings of the ACM Symposium on Computational Geometry (SoCG). 154--157.
- R. O'Donnell, Y. Wu, and Y. Zhou. 2011. Optimal lower bounds for locality sensitive hashing (except when q is tiny). In Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS). 275--283.
- R. Pagh. 2012. Compressed matrix multiplication. In Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS).
- R. Panigrahy. 2006. Entropy-based nearest neighbor search in high dimensions. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA).
- R. Paturi, S. Rajasekaran, and J. H. Reif. 1989. The light bulb problem. In Proceedings of the Conference on Learning Theory (COLT). 261--268.
- C. Peikert. 2009. Public-key cryptosystems from the worst-case shortest vector problem. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 333--342.
- O. Regev. 2009. On lattices, learning with errors, random linear codes, and cryptography. J. ACM 56, 6, 1--40.
- O. Regev. 2010. The learning with errors problem. In Proceedings of the IEEE Conference on Computational Complexity (CCC) (invited survey).
- T. J. Rivlin. 1974. The Chebyshev Polynomials. Wiley.
- H. Samet. 2006. Foundations of Multidimensional and Metric Data Structures. Elsevier.
- I. J. Schoenberg. 1942. Positive definite functions on spheres. Duke Math. J. 9, 1, 96--108.
- G. Szegö. 1975. Orthogonal Polynomials, 4th Ed. American Mathematical Society Colloquium Publications 23, Providence, RI.
- G. Valiant. 2012. Finding correlations in subquadratic time, with applications to learning parities and juntas. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).
- L. Valiant. 1988. Functionality in neural nets. In Proceedings of the 1st Workshop on Computational Learning Theory. 28--39.
- K. A. Verbeurgt. 1990. Learning DNF under the uniform distribution in quasipolynomial time. In Proceedings of the Conference on Learning Theory (COLT). 314--326.
- X. Wan, C. Yang, H. Xue, N. Tang, and W. Yu. 2010. Detecting two-locus associations allowing for interactions in genome-wide association studies. Bioinformatics 26, 20, 2517--2525.
- R. Weber, H. J. Schek, and S. Blott. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Databases (VLDB).
- V. Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith--Winograd. In Proceedings of the ACM Symposium on Theory of Computing (STOC).