ABSTRACT
A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate good generalization compared with other learning algorithms.
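To make the abstract concrete, the following Python fragment is a minimal sketch of maximal-margin training in its dual form, where the solution is a linear combination of supporting patterns and swapping the kernel yields linear (Perceptron-like), polynomial, or Radial Basis Function decision rules. It is not the authors' implementation: the names (`train_optimal_margin`, `polynomial_kernel`), the omission of the bias term, and the projected-gradient solver are simplifying assumptions made for the sketch; the paper solves the full constrained quadratic program.

```python
# Minimal sketch (assumed names and solver, not the authors' code) of
# maximal-margin training in dual form. The bias term is omitted and the
# dual quadratic program is solved by projected gradient ascent; in
# practice a proper constrained QP solver would be used.
import numpy as np

def polynomial_kernel(a, b, degree=2):
    """K(a, b) = (a . b + 1)^degree, one of the decision-function families."""
    return (a @ b.T + 1.0) ** degree

def train_optimal_margin(X, y, kernel=polynomial_kernel, lr=0.01, steps=10000):
    """Maximize the dual objective sum(alpha) - 0.5 * alpha' Q alpha,
    with Q_ij = y_i y_j K(x_i, x_j), subject to alpha >= 0 (bias-free
    variant). lr and steps are illustrative values."""
    Q = np.outer(y, y) * kernel(X, X)
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - Q @ alpha                      # gradient of the dual
        alpha = np.maximum(0.0, alpha + lr * grad)  # project onto alpha >= 0
    return alpha, alpha > 1e-6                      # weights, supporting mask

def predict(X_train, y, alpha, X_new, kernel=polynomial_kernel):
    """Decision rule: sign of a linear combination of supporting patterns."""
    return np.sign(kernel(X_new, X_train) @ (alpha * y))

# Example: XOR, separable with the degree-2 polynomial kernel.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha, support = train_optimal_margin(X, y)
print(predict(X, y, alpha, X))  # -> [-1.  1.  1. -1.]
```

On a toy problem like XOR every pattern lies on the margin, so all four patterns receive nonzero weight; on larger separable problems only the patterns nearest the boundary keep nonzero alpha, which is what makes the paper's leave-one-out bound (expected error rate at most the expected fraction of supporting patterns) informative.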