Abstract
Cross-validation is a mainstay for measuring performance and progress in machine learning. There are subtle differences in how exactly to compute accuracy, F-measure, and Area Under the ROC Curve (AUC) in cross-validation studies, yet these details are rarely discussed in the literature, and incompatible methods are used by various papers and software packages, leading to inconsistency across the research literature. Anomalies in performance calculations for particular folds and situations go undiscovered when they are buried in results aggregated over many folds and datasets, without anyone ever inspecting the intermediate performance measurements. This research note clarifies and illustrates the differences, and it provides guidance on how best to measure classification performance under cross-validation. In particular, several divergent methods are used for computing F-measure, which is often recommended as a performance measure under class imbalance, e.g., in text classification domains and in one-vs.-all reductions of datasets with many classes. We show by experiment that all but one of these computation methods lead to biased measurements, especially under high class imbalance. This paper is of particular interest to those designing machine learning software libraries and to researchers focused on high class imbalance.
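The kind of divergence at issue can be illustrated with a minimal sketch contrasting two common ways of aggregating F-measure across cross-validation folds: averaging the per-fold F-measures versus computing a single F-measure from the merged (pooled) confusion-matrix counts of all folds. The function names, the round-robin fold assignment, and the convention of scoring an undefined F-measure as zero are illustrative assumptions here, not the paper's notation.

```python
def f1_from_counts(tp, fp, fn):
    """F-measure computed from confusion-matrix counts.

    By the (assumed) zero convention, an undefined F-measure --
    a fold with no true positives, false positives, or false
    negatives -- is scored as 0.0.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cross_val_f1(y_true, y_pred, n_folds=10):
    """Contrast two aggregation schemes for F-measure under CV.

    Returns (avg_f1, pooled_f1):
      avg_f1    -- mean of the per-fold F-measures
      pooled_f1 -- one F-measure over counts merged across folds
    Folds are assigned round-robin for simplicity.
    """
    folds = [list(range(i, len(y_true), n_folds)) for i in range(n_folds)]
    per_fold = []
    total = {"tp": 0, "fp": 0, "fn": 0}
    for idx in folds:
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        per_fold.append(f1_from_counts(tp, fp, fn))
        total["tp"] += tp
        total["fp"] += fp
        total["fn"] += fn
    avg_f1 = sum(per_fold) / n_folds
    pooled_f1 = f1_from_counts(total["tp"], total["fp"], total["fn"])
    return avg_f1, pooled_f1
```

Under high class imbalance, some folds may contain no positives at all; averaging then mixes in zero (or undefined) fold scores, so even a classifier that predicts every example correctly can receive an averaged F-measure well below 1.0, while the pooled computation is unaffected:

```python
# Perfect predictions; with n_folds=2, one fold gets all the positives.
avg, pooled = cross_val_f1([1, 0, 1, 0, 0, 0],
                           [1, 0, 1, 0, 0, 0], n_folds=2)
# avg == 0.5, pooled == 1.0
```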