Do we need hundreds of classifiers to solve real world classification problems?

Published: 01 January 2014

Abstract

We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multivariate adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, comprising the whole UCI database (excluding the large-scale problems) plus other real-world problems of our own, in order to draw significant conclusions about classifier behavior that do not depend on the data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% on 84.3% of the data sets. However, the difference from the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy, is not statistically significant. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). Random forest is clearly the best family of classifiers (three of the five best classifiers are RF variants), followed by SVM (four classifiers in the top 10), neural networks and boosting ensembles (five and three members in the top 20, respectively).
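The headline metric above — a classifier's accuracy expressed as a percentage of the maximum accuracy any classifier achieved on each data set, averaged over data sets — can be sketched as follows. This is a minimal illustration, not the paper's code; the tiny accuracy table and the classifier names in it are hypothetical, invented only to show how the metric is computed.

```python
# Hypothetical per-data-set test accuracies for three classifiers.
accuracy = {
    "rf_caret":   {"iris": 0.96, "wine": 0.98, "glass": 0.75},
    "svm_libsvm": {"iris": 0.97, "wine": 0.97, "glass": 0.70},
    "one_rule":   {"iris": 0.92, "wine": 0.80, "glass": 0.50},
}

datasets = sorted(next(iter(accuracy.values())))

# Best accuracy achieved by ANY classifier on each data set.
best = {d: max(acc[d] for acc in accuracy.values()) for d in datasets}

for name, acc in accuracy.items():
    # Accuracy as a percentage of the per-data-set maximum.
    pct_of_max = [100.0 * acc[d] / best[d] for d in datasets]
    mean_pct = sum(pct_of_max) / len(pct_of_max)
    # Fraction of data sets where the classifier reaches >= 90% of the maximum.
    frac_over_90 = sum(p >= 90.0 for p in pct_of_max) / len(pct_of_max)
    print(f"{name}: {mean_pct:.1f}% of max accuracy, "
          f">=90% of max on {100 * frac_over_90:.0f}% of data sets")
```

Under this scheme a classifier that is best everywhere scores 100%, so the 94.1% reported for the top random forest measures how close it stays to the per-data-set winner on average.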

