Abstract
While classical kernel-based learning algorithms use a single kernel, in practice it is often desirable to combine multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that this problem can be rewritten as a semi-infinite linear program that can be solved efficiently by recycling standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm scales to hundreds of thousands of examples or hundreds of kernels to be combined, and that it aids automatic model selection, improving the interpretability of the learning result. In the second part, we discuss general speed-up mechanisms for SVMs, especially for the sparse feature maps that arise with string kernels, allowing us to train a string kernel SVM on a 10-million-example real-world splice data set from computational biology. We integrated multiple kernel learning into our machine learning toolbox SHOGUN, whose source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun.
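The wrapper idea from the first part can be made concrete with a short sketch. The following is a minimal, illustrative rendering of the semi-infinite LP (SILP) column-generation loop under stated assumptions: scikit-learn's SVC with a precomputed kernel stands in for the recycled standard SVM solver, scipy's linprog solves the restricted master LP, and the function names and tolerances are ours. It illustrates the technique; it is not the SHOGUN implementation.

```python
# Minimal sketch of the SILP wrapper for multiple kernel learning (MKL).
# Assumes labels y in {-1, +1} and a list of precomputed kernel matrices.
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC

def solve_svm(K, y, C):
    """Standard single-kernel SVM on a precomputed kernel; returns alpha."""
    svc = SVC(C=C, kernel="precomputed").fit(K, y)
    alpha = np.zeros(len(y))
    alpha[svc.support_] = np.abs(svc.dual_coef_[0])  # dual_coef_ = y_i * alpha_i
    return alpha

def mkl_silp(kernels, y, C=1.0, max_iter=100, tol=1e-5):
    M = len(kernels)
    beta = np.full(M, 1.0 / M)          # start with uniform kernel weights
    theta, cuts = -np.inf, []           # cutting planes S(alpha_t)
    yy = np.outer(y, y)
    for _ in range(max_iter):
        K = sum(b * Km for b, Km in zip(beta, kernels))
        alpha = solve_svm(K, y, C)      # SVM step: most violated constraint
        # Per-kernel terms S_m(alpha) = 1/2 alpha'(yy * K_m)alpha - 1'alpha
        S = np.array([0.5 * alpha @ (yy * Km) @ alpha - alpha.sum()
                      for Km in kernels])
        if cuts and beta @ S >= theta - tol:   # no violated constraint: done
            break
        cuts.append(S)
        # Restricted master LP over (beta, theta):
        #   max theta  s.t.  sum(beta) = 1, beta >= 0,
        #                    beta @ S_t >= theta  for all cuts t
        c = np.r_[np.zeros(M), -1.0]                       # minimize -theta
        A_ub = np.hstack([-np.array(cuts), np.ones((len(cuts), 1))])
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(cuts)),
                      A_eq=np.r_[np.ones(M), 0.0].reshape(1, -1), b_eq=[1.0],
                      bounds=[(0, None)] * M + [(None, None)])
        beta, theta = res.x[:M], res.x[-1]
    return beta, alpha
```

Any stock SVM trainer that exposes its dual coefficients (e.g., SVMlight or an SMO solver) can be swapped in for solve_svm; that interchangeability is the "recycling" the abstract refers to.

For the second part, the sparse feature maps arising from string kernels can be pictured with the spectrum kernel of Leslie et al. (2002): a sequence maps to the counts of its length-k substrings, so a kernel evaluation reduces to a sparse dot product. The dictionary below is a toy stand-in for the trie-based data structures (Fredkin, 1960) and the linadd-style speed-ups discussed in the paper.

```python
# Toy sketch of a sparse feature map for the spectrum kernel:
# a string becomes a sparse vector of k-mer counts, and the kernel
# is the dot product of two such vectors.
from collections import Counter

def spectrum_features(seq, k=3):
    """Sparse k-mer count vector of a string, stored as a dict."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=3):
    """Dot product of the two spectra; only shared k-mers contribute."""
    fa, fb = spectrum_features(seq_a, k), spectrum_features(seq_b, k)
    small, large = (fa, fb) if len(fa) < len(fb) else (fb, fa)
    return sum(count * large.get(kmer, 0) for kmer, count in small.items())

print(spectrum_kernel("ACGTACGT", "ACGTTT"))  # shared 3-mers ACG, CGT -> 4
```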
References
- F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In C. E. Brodley, editor, Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.
- K. P. Bennett, M. Momma, and M. J. Embrechts. MARK: A boosting algorithm for heterogeneous kernel models. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 24-31. ACM, 2002.
- J. Bi, T. Zhang, and K. P. Bennett. Column-generation boosting methods for mixture of kernels. In W. Kim, R. Kohavi, J. Gehrke, and W. DuMouchel, editors, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 521-526. ACM, 2004.
- S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
- O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159, 2002.
- C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
- J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. Technical Report #1551, University of Wisconsin, Madison, January 2006.
- T. Fawcett. ROC graphs: Notes and practical considerations for data mining researchers. Technical Report HPL-2003-4, HP Laboratories, Palo Alto, CA, USA, January 2003.
- E. Fredkin. Trie memory. Communications of the ACM, 3(9):490-499, 1960.
- Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 553-560, Cambridge, MA, 2003. MIT Press.
- R. Hettich and K. O. Kortanek. Semi-infinite programming: Theory, methods and applications. SIAM Review, 35(3):380-429, 1993.
- T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, ECML '98: Proceedings of the 10th European Conference on Machine Learning, Lecture Notes in Computer Science, pages 137-142, Berlin/Heidelberg, 1998. Springer-Verlag.
- T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods -- Support Vector Learning, pages 169-184, Cambridge, MA, USA, 1999. MIT Press.
- G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626-2635, 2004.
- C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 564-575, Kaua'i, Hawaii, 2002.
- C. Leslie, R. Kuang, and E. Eskin. Inexact matching string kernels for protein classification. In Kernel Methods in Computational Biology, MIT Press Series on Computational Molecular Biology, pages 95-112. MIT Press, 2004.
- R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. Smola, editors, Proceedings of the First Machine Learning Summer School in Canberra, LNCS, pages 119-184. Springer, 2003.
- C. E. Metz. Basic principles of ROC analysis. Seminars in Nuclear Medicine, VIII(4), October 1978.
- C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 478-485, Cambridge, MA, 2003. MIT Press.
- J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods -- Support Vector Learning, pages 185-208, Cambridge, MA, USA, 1999. MIT Press.
- G. Rätsch. Robust Boosting via Convex Optimization. PhD thesis, University of Potsdam, Potsdam, Germany, 2001.
- G. Rätsch and S. Sonnenburg. Accurate splice site prediction for Caenorhabditis elegans, pages 277-298. MIT Press Series on Computational Molecular Biology. MIT Press, 2004.
- G. Rätsch and M. K. Warmuth. Efficient margin maximization with boosting. Journal of Machine Learning Research, 6:2131-2152, 2005.
- G. Rätsch, A. Demiriz, and K. Bennett. Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48(1-3):193-221, 2002. Special issue on new methods for model selection and model combination; also NeuroCOLT2 Technical Report NC-TR-2000-085.
- G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: Recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21:i369-i377, 2005.
- B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
- S. Sonnenburg, G. Rätsch, and C. Schäfer. Learning interpretable SVMs for biological sequence classification. In S. Miyano, J. P. Mesirov, S. Kasif, S. Istrail, P. A. Pevzner, and M. Waterman, editors, Research in Computational Molecular Biology, 9th Annual International Conference, RECOMB 2005, volume 3500, pages 389-407. Springer-Verlag, Berlin/Heidelberg, 2005a.
- S. Sonnenburg, G. Rätsch, and B. Schölkopf. Large scale genomic sequence SVM classifiers. In L. De Raedt and S. Wrobel, editors, ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 849-856, New York, NY, USA, 2005b. ACM Press.
- M. K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning. ACM Press, 2006.