Do we need hundreds of classifiers to solve real world classification problems?

Published: 01 January 2014

Abstract

We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multivariate adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, comprising the whole UCI database (excluding the large-scale problems) plus other real-world problems of our own, in order to draw significant conclusions about classifier behavior that do not depend on the data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% on 84.3% of the data sets. However, the difference from the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy, is not statistically significant. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). Random forest is clearly the best family of classifiers (three of the five best classifiers are RF variants), followed by SVM (four classifiers in the top 10), neural networks and boosting ensembles (five and three members in the top 20, respectively).
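The headline metric above — a classifier's accuracy expressed as a percentage of the maximum accuracy any classifier achieved on each data set, averaged over data sets — can be sketched as follows. This is a minimal illustration, not the paper's code; the tiny accuracy table and the classifier names in it are hypothetical, invented only to show how the metric is computed.

```python
# Hypothetical per-data-set test accuracies for three classifiers.
accuracy = {
    "rf_caret":   {"iris": 0.96, "wine": 0.98, "glass": 0.75},
    "svm_libsvm": {"iris": 0.97, "wine": 0.97, "glass": 0.70},
    "one_rule":   {"iris": 0.92, "wine": 0.80, "glass": 0.50},
}

datasets = sorted(next(iter(accuracy.values())))

# Best accuracy achieved by ANY classifier on each data set.
best = {d: max(acc[d] for acc in accuracy.values()) for d in datasets}

for name, acc in accuracy.items():
    # Accuracy as a percentage of the per-data-set maximum.
    pct_of_max = [100.0 * acc[d] / best[d] for d in datasets]
    mean_pct = sum(pct_of_max) / len(pct_of_max)
    # Fraction of data sets where the classifier reaches >= 90% of the maximum.
    frac_over_90 = sum(p >= 90.0 for p in pct_of_max) / len(pct_of_max)
    print(f"{name}: {mean_pct:.1f}% of max accuracy, "
          f">=90% of max on {100 * frac_over_90:.0f}% of data sets")
```

Under this scheme a classifier that is best everywhere scores 100%, so the 94.1% reported for the top random forest measures how close it stays to the per-data-set winner on average.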

