Abstract
We propose a novel method of dimensionality reduction for supervised learning problems. Given a regression or classification problem in which we wish to predict a response variable Y from an explanatory variable X, we treat the problem of dimensionality reduction as that of finding a low-dimensional "effective subspace" for X which retains the statistical relationship between X and Y. We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we establish a general nonparametric characterization of conditional independence using covariance operators on reproducing kernel Hilbert spaces. This characterization allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods for dimensionality reduction in supervised learning, the proposed method requires neither assumptions on the marginal distribution of X nor a parametric model of the conditional distribution of Y. We present experiments that compare the performance of the method with conventional methods.
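As a concrete illustration of the estimation problem the abstract describes, the following is a minimal Python sketch of a kernel-based contrast for an effective subspace spanned by the columns of an orthonormal matrix B. The helper names, the Gaussian kernel with fixed bandwidths, the regularization constant eps, and the specific trace form Tr[G_Y (G_Z + n·eps·I)^{-1}] are illustrative assumptions (this trace form appears in later kernel dimension reduction work), not the paper's exact contrast function.

```python
# A minimal sketch of a KDR-style contrast for an "effective subspace".
# The helper names, the Gaussian kernel, and the trace form
# Tr[Gy (Gz + n*eps*I)^{-1}] are illustrative assumptions, not the
# paper's exact estimator.
import numpy as np

def centered_gauss_gram(Z, sigma):
    """Centered Gaussian Gram matrix K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum(Z ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    K = np.exp(-dists / (2.0 * sigma ** 2))
    n = Z.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix: returns HKH
    return H @ K @ H

def kdr_contrast(B, X, Y, sigma_x=1.0, sigma_y=1.0, eps=1e-3):
    """Empirical conditional-covariance-style contrast; smaller values mean
    Y is closer to being conditionally independent of X given B^T X."""
    n = X.shape[0]
    Gz = centered_gauss_gram(X @ B, sigma_x)  # Gram matrix of projected data
    Gy = centered_gauss_gram(Y, sigma_y)
    return np.trace(Gy @ np.linalg.inv(Gz + n * eps * np.eye(n)))

# Toy check: Y depends on X only through its first coordinate, so the
# true one-dimensional effective subspace should score lower (better).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = (np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)).reshape(-1, 1)
B_true = np.array([[1.0], [0.0], [0.0], [0.0]])
B_rand, _ = np.linalg.qr(rng.normal(size=(4, 1)))
print(kdr_contrast(B_true, X, Y), kdr_contrast(B_rand, X, Y))
```

In the full method, estimation would proceed by minimizing such a contrast over orthonormal matrices B (i.e., over the Stiefel manifold), which yields the estimated effective subspace without a parametric model of the conditional distribution of Y.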