Abstract
Given a set of $n$ $d$-dimensional Boolean vectors with the promise that the vectors are chosen uniformly at random, with the exception of two vectors that have Pearson correlation coefficient $\rho$ (Hamming distance $d \cdot \frac{1-\rho}{2}$), how quickly can one find the two correlated vectors? We present an algorithm which, for any constants $\epsilon > 0$ and $\rho > 0$, runs in expected time $O\big(n^{\frac{5-\omega}{4-\omega}+\epsilon} + nd\big) < O(n^{1.62} + nd)$, where $\omega < 2.4$ is the exponent of matrix multiplication. This is the first subquadratic-time algorithm for this problem in which $\rho$ does not appear in the exponent of $n$; it improves upon the $O(n^{2-O(\rho)})$ runtime of Paturi et al. [1989], the Locality Sensitive Hashing approach of Indyk and Motwani [1998], and the Bucketing Codes approach of Dubiner [2008].
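To make the core idea concrete, here is a minimal sketch (in Python/NumPy, with illustrative parameters and the hypothetical helper name light_bulb) of the vector-aggregation trick underlying this line of algorithms: sum the vectors within each group, locate the pair of groups with an anomalously large inner product via a single matrix product, and brute-force only inside that pair. The paper's actual algorithm additionally uses an XOR-based amplification step and handles a planted pair that lands inside one group; this sketch assumes the pair falls in two different groups.

```python
import numpy as np

def light_bulb(X, group_size):
    """Sketch: find a planted correlated pair among random +/-1 vectors.

    Sum the vectors in each group, compute all group-pair inner products
    with one matrix product, and brute-force only inside the pair of
    groups whose inner product stands out. (Assumes the planted pair
    falls in two different groups.)
    """
    n, d = X.shape
    g = group_size
    m = n // g                                      # number of groups
    S = X[:m * g].reshape(m, g, d).sum(axis=1).astype(float)

    G = S @ S.T                                     # m x m group inner products
    np.fill_diagonal(G, -np.inf)                    # ignore within-group entries
    a, b = np.unravel_index(np.argmax(G), G.shape)

    # Brute-force search inside the flagged pair of groups.
    pairs = ((i, j) for i in range(a * g, (a + 1) * g)
                    for j in range(b * g, (b + 1) * g))
    return max(pairs, key=lambda p: X[p[0]] @ X[p[1]])

# Demo: plant a pair with correlation rho = 0.5 among random vectors.
rng = np.random.default_rng(0)
n, d, rho = 1000, 10000, 0.5
X = rng.choice([-1, 1], size=(n, d))
agree = rng.random(d) < (1 + rho) / 2               # agree on ~(1+rho)/2 coords
X[123] = np.where(agree, X[7], -X[7])               # plant the correlated pair
print(light_bulb(X, group_size=5))                  # expect (7, 123)
```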
Applications and extensions of this basic algorithm yield significantly improved algorithms for several other problems.
Approximate Closest Pair. For any sufficiently small constant $\epsilon > 0$, given $n$ $d$-dimensional vectors, there exists an algorithm that returns a pair of vectors whose Euclidean (or Hamming) distance differs from that of the closest pair by a factor of at most $1+\epsilon$, and runs in time $O(n^{2-\Theta(\sqrt{\epsilon})})$. The best previous algorithms (including Locality Sensitive Hashing) have runtime $O(n^{2-O(\epsilon)})$.
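For reference, the quadratic baseline that such approximation algorithms improve on can itself be phrased as one matrix product, via the identity that for $u, v \in \{-1,+1\}^d$ the Hamming distance equals $(d - \langle u, v\rangle)/2$. A minimal sketch (the function name closest_pair_hamming and the $\pm 1$ encoding are illustrative, not from the paper):

```python
import numpy as np

def closest_pair_hamming(X):
    """Exact closest pair of +/-1 vectors via one Gram matrix.

    For u, v in {-1,+1}^d, Hamming distance = (d - <u, v>) / 2, so all
    n^2 pairwise distances come from a single matrix product. This is
    the quadratic baseline that the subquadratic approximation beats.
    """
    n, d = X.shape
    H = (d - X @ X.T) // 2               # pairwise Hamming distances
    H[np.tril_indices(n)] = d + 1        # mask the diagonal and duplicates
    i, j = np.unravel_index(np.argmin(H), H.shape)
    return i, j, H[i, j]
```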
Learning Sparse Parities with Noise. Given samples from an instance of the learning parities with noise problem in which each example has length $n$, the true parity set has size at most $k \ll n$, and the noise rate is $\eta$, there exists an algorithm that identifies the set of $k$ indices in time $n^{\frac{\omega+\epsilon}{3}k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big) < n^{0.8k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big)$. This is the first algorithm with no dependence on $\eta$ in the exponent of $n$, aside from the trivial $O\big(\binom{n}{k}\big) \approx O(n^k)$ brute-force algorithm, and for large noise rates ($\eta > 0.4$) it improves upon the results of Grigorescu et al. [2011], which give a runtime of $n^{(1+(2\eta)^2 + o(1))\frac{k}{2}} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big)$.
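The trivial brute-force algorithm mentioned above is simple to state: try every size-$k$ index set and keep the parity that agrees with the most labels. A sketch under the standard problem setup (function and parameter names are illustrative; for noise rate $\eta < 1/2$ and sufficiently many samples, the true parity set wins with high probability):

```python
import numpy as np
from itertools import combinations

def brute_force_sparse_parity(X, y, k):
    """The trivial O(n^k)-time learner the abstract compares against.

    X: m x n 0/1 example matrix; y: length-m 0/1 labels, each equal to
    the parity of an unknown size-k index set flipped with prob. eta.
    Returns the size-k subset whose parity agrees with the most labels.
    """
    m, n = X.shape
    best_set, best_agree = None, -1
    for S in combinations(range(n), k):
        agree = np.sum(X[:, list(S)].sum(axis=1) % 2 == y)
        if agree > best_agree:
            best_set, best_agree = S, agree
    return best_set

# Demo: n = 12 variables, true parity on {2, 5, 9}, noise rate 0.2.
rng = np.random.default_rng(1)
m, n, eta = 2000, 12, 0.2
X = rng.integers(0, 2, size=(m, n))
y = (X[:, [2, 5, 9]].sum(axis=1) + (rng.random(m) < eta)) % 2
print(brute_force_sparse_parity(X, y, k=3))   # expect (2, 5, 9)
```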
Learning k-Juntas with Noise. Given uniformly random length-$n$ Boolean vectors, together with a label which is some function of just $k \ll n$ of the bits, perturbed by noise rate $\eta$, return the set of relevant indices. Leveraging the reduction of Feldman et al. [2009], our result for learning sparse parities implies an algorithm for this problem with runtime $n^{\frac{\omega+\epsilon}{3}k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big) < n^{0.8k} \cdot \mathrm{poly}\big(\frac{1}{1-2\eta}\big)$, which is the first runtime for this problem of the form $n^{ck}$ with an absolute constant $c < 1$.
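The reason a sparse-parity learner helps here is Fourier-analytic: any non-constant $k$-junta has nonzero correlation with at least one parity over its relevant variables, so a (noisy) sparse-parity learner can expose relevant indices. A small numerical sanity check of that fact (noiseless for simplicity; all names and parameters are illustrative, and this is not the Feldman et al. reduction itself):

```python
import numpy as np
from itertools import combinations, chain

# Fact behind the reduction: a non-constant k-junta correlates with
# some parity over its relevant variables.
rng = np.random.default_rng(2)
m, n, relevant = 20000, 10, (1, 4, 7)          # k = 3 relevant indices
truth_table = rng.integers(0, 2, size=8)       # random 3-junta

X = rng.integers(0, 2, size=(m, n))
cells = X[:, list(relevant)] @ np.array([4, 2, 1])
y = truth_table[cells]                         # junta labels (no noise here)

# Estimate |correlation| of y with every parity of size <= 3.
def parities(n, k):
    return chain.from_iterable(combinations(range(n), s)
                               for s in range(1, k + 1))

best = max(parities(n, 3),
           key=lambda S: abs(np.mean((-1.0) ** (X[:, list(S)].sum(1) + y))))
# W.h.p. (assuming the random junta is non-constant), best is a
# non-empty subset of the relevant indices (1, 4, 7).
print(best)
```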
Learning k-Juntas without Noise. Given uniformly random length-$n$ Boolean vectors, together with a label which is some function of $k \ll n$ of the bits, return the set of relevant indices. Using a modification of the algorithm of Mossel et al. [2004], and employing our algorithm for learning sparse parities with noise via the reduction of Feldman et al. [2009], we obtain an algorithm for this problem with runtime $n^{\frac{\omega+\epsilon}{4}k} \cdot \mathrm{poly}(n) < n^{0.6k} \cdot \mathrm{poly}(n)$, which improves on the previous best of $n^{\frac{\omega}{\omega+1}k} \approx n^{0.7k} \cdot \mathrm{poly}(n)$ of Mossel et al. [2004].
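For concreteness, the numerical exponents quoted above follow by substituting known bounds on $\omega$: with $\omega \le 2.3727$ [Vassilevska Williams 2012] and $\epsilon$ sufficiently small,

$$\frac{5-\omega}{4-\omega} \le \frac{2.6273}{1.6273} \approx 1.615 < 1.62, \qquad \frac{\omega+\epsilon}{3} < 0.8, \qquad \frac{\omega+\epsilon}{4} < 0.6, \qquad \frac{\omega}{\omega+1} \approx \frac{2.3727}{3.3727} \approx 0.70.$$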
References
- M. Ajtai, R. Kumar, and D. Sivakumar. 2001. A sieve algorithm for the shortest lattice vector problem. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 601--610.
- N. Alon and A. Naor. 2004. Approximating the cut-norm via Grothendieck's inequality. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 72--80.
- A. Andoni and P. Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS). 459--468.
- A. Andoni and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51, 1, 117--122.
- S. Arora and R. Ge. 2011. New algorithms for learning in presence of errors. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 403--415.
- J. L. Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509--517.
- A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. 1994. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 253--262.
- A. Blum, A. Kalai, and H. Wasserman. 2003. Noise-tolerant learning, the parity problem, and the statistical query model. J. ACM 50, 4, 507--519.
- Z. Brakerski and V. Vaikuntanathan. 2011. Efficient fully homomorphic encryption from (standard) LWE. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).
- M. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the ACM Symposium on Theory of Computing (STOC).
- K. Clarkson. 1988. A randomized algorithm for closest-point queries. SIAM J. Comput. 17, 4, 830--847.
- D. Coppersmith. 1997. Rectangular matrix multiplication revisited. J. Complex. 13, 1, 42--49.
- M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry (SoCG). 253--262.
- M. Dubiner. 2008. Bucketing coding and information theory for the statistical high dimensional nearest neighbor problem. CoRR abs/0810.4182.
- V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. 2009. On agnostic learning of parities, monomials and halfspaces. SIAM J. Comput. 39, 2, 606--645.
- E. Grigorescu, L. Reyzin, and S. Vempala. 2011. On noise-tolerant learning of sparse parities and related problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT).
- N. J. Hopper and M. Blum. 2001. Secure human identification protocols. In Proceedings of ASIACRYPT. 52--66.
- R. Impagliazzo and D. Zuckerman. 1989. How to recycle random bits. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS). 248--253.
- P. Indyk and R. Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the ACM Symposium on Theory of Computing (STOC).
- M. Kearns. 1998. Efficient noise-tolerant learning from statistical queries. J. ACM 45, 6, 983--1006.
- E. Kushilevitz, R. Ostrovsky, and Y. Rabani. 2000. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30, 2, 457--474.
- V. Lyubashevsky. 2005. The parity problem in the presence of noise, decoding random linear codes, and the subset sum problem. In Proceedings of RANDOM. 378--389.
- J. Marchini, P. Donnelly, and L. R. Cardon. 2005. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 37, 4, 413--417.
- S. Meiser. 1993. Point location in arrangements of hyperplanes. Inf. Comput. 106, 2, 286--303.
- E. Mossel, R. O'Donnell, and R. Servedio. 2004. Learning functions of k relevant variables. J. Comput. System Sci. 69, 3, 421--434.
- R. Motwani, A. Naor, and R. Panigrahy. 2006. Lower bounds on locality sensitive hashing. In Proceedings of the ACM Symposium on Computational Geometry (SoCG). 154--157.
- R. O'Donnell, Y. Wu, and Y. Zhou. 2011. Optimal lower bounds for locality sensitive hashing (except when q is tiny). In Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS). 275--283.
- R. Pagh. 2012. Compressed matrix multiplication. In Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS).
- R. Panigrahy. 2006. Entropy-based nearest neighbor search in high dimensions. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA).
- R. Paturi, S. Rajasekaran, and J. H. Reif. 1989. The light bulb problem. In Proceedings of the Conference on Learning Theory (COLT). 261--268.
- C. Peikert. 2009. Public-key cryptosystems from the worst-case shortest vector problem. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 333--342.
- O. Regev. 2009. On lattices, learning with errors, random linear codes, and cryptography. J. ACM 56, 6, 1--40.
- O. Regev. 2010. The learning with errors problem. In Proceedings of the IEEE Conference on Computational Complexity (CCC) (invited survey).
- T. J. Rivlin. 1974. The Chebyshev Polynomials. Wiley.
- H. Samet. 2006. Foundations of Multidimensional and Metric Data Structures. Elsevier.
- I. J. Schoenberg. 1942. Positive definite functions on spheres. Duke Math. J. 9, 1, 96--108.
- G. Szegö. 1975. Orthogonal Polynomials, 4th Ed. American Mathematical Society Colloquium Publications 23, Providence, RI.
- G. Valiant. 2012. Finding correlations in subquadratic time, with applications to learning parities and juntas. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).
- L. Valiant. 1988. Functionality in neural nets. In Proceedings of the 1st Workshop on Computational Learning Theory. 28--39.
- K. A. Verbeurgt. 1990. Learning DNF under the uniform distribution in quasipolynomial time. In Proceedings of the Conference on Learning Theory (COLT). 314--326.
- X. Wan, C. Yang, H. Xue, N. Tang, and W. Yu. 2010. Detecting two-locus associations allowing for interactions in genome-wide association studies. Bioinformatics 26, 20, 2517--2525.
- R. Weber, H. J. Schek, and S. Blott. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Databases (VLDB).
- V. Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith--Winograd. In Proceedings of the ACM Symposium on Theory of Computing (STOC).