Abstract
We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian methods, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multivariate adaptive regression splines, and other methods), implemented in Weka, R (with and without the caret package), C, and Matlab, covering all the relevant classifiers available today. We use 121 data sets, comprising the whole UCI database (excluding the large-scale problems) together with our own real-world problems, so that the conclusions about classifier behavior are significant and do not depend on the particular data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy and exceeds 90% of it on 84.3% of the data sets. However, its difference from the second best, an SVM with Gaussian kernel implemented in C using LibSVM (92.3% of the maximum accuracy), is not statistically significant. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0, and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). Random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks, and boosting ensembles (5 and 3 members of the top 20, respectively).
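For context, here is a minimal sketch of the kind of per-data-set comparison behind these numbers, assuming caret plus the mlbench package's Sonar data as a stand-in for a single UCI problem. In caret, method = "rf" wraps the randomForest package and "svmRadial" wraps kernlab's Gaussian-kernel SVM; the study's actual protocol, parameter grids, and LibSVM-in-C access differ, so this is an illustration rather than a reproduction:

```r
## A minimal sketch (not the paper's protocol) comparing a random
## forest and a Gaussian-kernel SVM through caret on one data set.
## Sonar (from mlbench) stands in for a UCI problem.
library(caret)
library(mlbench)
data(Sonar)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

rf_fit  <- train(Class ~ ., data = Sonar, method = "rf",
                 trControl = ctrl, tuneLength = 3)
svm_fit <- train(Class ~ ., data = Sonar, method = "svmRadial",
                 trControl = ctrl, tuneLength = 3)

## Cross-validated accuracy of the best tuning for each model
accs <- c(rf = max(rf_fit$results$Accuracy),
          svm = max(svm_fit$results$Accuracy))

## The paper's headline figure expresses each classifier's accuracy
## as a percentage of the best accuracy obtained on that data set,
## then averages over all 121 data sets.
round(100 * accs / max(accs), 1)
```

Repeating this loop over the full data set collection, and taking each classifier's accuracy relative to the best accuracy achieved by any classifier on that data set, yields the "% of the maximum accuracy" statistic quoted above.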