Why Does Unsupervised Pre-training Help Deep Learning?

Published: 1 March 2010

Abstract

Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.
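The two-phase procedure the abstract refers to can be summarized concretely: each layer is first initialized by unsupervised learning on the representation produced by the layers below it, and the resulting weights then initialize a deep network that is fine-tuned with supervised backpropagation. The following is a minimal illustrative NumPy sketch, not the authors' code; the tied-weight sigmoid auto-encoders, layer sizes, learning rate, and number of epochs are assumptions chosen only to show the structure of greedy layer-wise pre-training.

```python
# Illustrative sketch of greedy layer-wise unsupervised pre-training with
# auto-encoders (supervised fine-tuning is only described, not implemented).
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(X, n_hidden, lr=0.1, epochs=10):
    """Train one tied-weight auto-encoder on X by stochastic gradient descent
    on the squared reconstruction error; return the encoder parameters."""
    n_visible = X.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden (encoder) bias
    c = np.zeros(n_visible)  # visible (decoder) bias
    for _ in range(epochs):
        for x in X:
            h = sigmoid(x @ W + b)        # encode
            x_hat = sigmoid(h @ W.T + c)  # decode with tied weights
            # Backprop through decoder and encoder for 0.5 * ||x_hat - x||^2.
            d_out = (x_hat - x) * x_hat * (1.0 - x_hat)
            d_hid = (d_out @ W) * h * (1.0 - h)
            W -= lr * (np.outer(x, d_hid) + np.outer(d_out, h))
            b -= lr * d_hid
            c -= lr * d_out
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Pre-train a stack of auto-encoders: each layer is trained, without
    labels, on the representation produced by the layers trained before it."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)
    return params

# Toy usage on synthetic data: the returned weights would then initialize a
# deep supervised network (e.g. with a softmax output layer) before
# fine-tuning by ordinary backpropagation, which is omitted here.
X = rng.rand(200, 20)
stack = greedy_pretrain(X, layer_sizes=[15, 10])
```

In the paper's terms, the unsupervised phase determines the starting point of supervised optimization; the experiments examine how that initialization restricts which basin of attraction the fine-tuning ends up in.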

Published in

The Journal of Machine Learning Research, Volume 11, March 2010
3637 pages
ISSN: 1532-4435
EISSN: 1533-7928
Publisher: JMLR.org
