Why Does Unsupervised Pre-training Help Deep Learning?

Published: 1 March 2010

Abstract

Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.
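The two-phase procedure the abstract refers to can be summarized concretely: each layer is first initialized by unsupervised learning on the representation produced by the layers below it, and the resulting weights then initialize a deep network that is fine-tuned with supervised backpropagation. The following is a minimal illustrative NumPy sketch, not the authors' code; the tied-weight sigmoid auto-encoders, layer sizes, learning rate, and number of epochs are assumptions chosen only to show the structure of greedy layer-wise pre-training.

```python
# Illustrative sketch of greedy layer-wise unsupervised pre-training with
# auto-encoders (supervised fine-tuning is only described, not implemented).
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(X, n_hidden, lr=0.1, epochs=10):
    """Train one tied-weight auto-encoder on X by stochastic gradient descent
    on the squared reconstruction error; return the encoder parameters."""
    n_visible = X.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden (encoder) bias
    c = np.zeros(n_visible)  # visible (decoder) bias
    for _ in range(epochs):
        for x in X:
            h = sigmoid(x @ W + b)        # encode
            x_hat = sigmoid(h @ W.T + c)  # decode with tied weights
            # Backprop through decoder and encoder for 0.5 * ||x_hat - x||^2.
            d_out = (x_hat - x) * x_hat * (1.0 - x_hat)
            d_hid = (d_out @ W) * h * (1.0 - h)
            W -= lr * (np.outer(x, d_hid) + np.outer(d_out, h))
            b -= lr * d_hid
            c -= lr * d_out
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Pre-train a stack of auto-encoders: each layer is trained, without
    labels, on the representation produced by the layers trained before it."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)
    return params

# Toy usage on synthetic data: the returned weights would then initialize a
# deep supervised network (e.g. with a softmax output layer) before
# fine-tuning by ordinary backpropagation, which is omitted here.
X = rng.rand(200, 20)
stack = greedy_pretrain(X, layer_sizes=[15, 10])
```

In the paper's terms, the unsupervised phase determines the starting point of supervised optimization; the experiments examine how that initialization restricts which basin of attraction the fine-tuning ends up in.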

Published in

The Journal of Machine Learning Research, Volume 11, March 2010
3637 pages
ISSN: 1532-4435
EISSN: 1533-7928
Publisher: JMLR.org
