Abstract
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive data sets, because the size of the data set can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive data sets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.
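To make the "pasting small votes" idea concrete, below is a minimal single-machine sketch: many decision trees are each trained on a small random subset (a "bite") of the training data and then combined by majority vote. This is only an illustration of the general idea, not the authors' distributed framework; the function names (paste_small_votes, vote), the use of scikit-learn decision trees, the bite size of 500, and the synthetic data set are all assumptions introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def paste_small_votes(X, y, n_classifiers=100, bite_size=500, seed=0):
    """Train many trees, each on a small random subset ("bite") of the data.

    Illustrative sketch only; bite_size and n_classifiers are arbitrary choices.
    """
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        # Sample one small bite without replacement from the full training set.
        idx = rng.choice(len(X), size=min(bite_size, len(X)), replace=False)
        tree = DecisionTreeClassifier()
        tree.fit(X[idx], y[idx])
        ensemble.append(tree)
    return ensemble

def vote(ensemble, X):
    """Combine the ensemble by simple majority vote over predicted class labels."""
    preds = np.stack([clf.predict(X) for clf in ensemble])  # (n_classifiers, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, preds
    )

if __name__ == "__main__":
    # Synthetic stand-in for a large data set, purely for demonstration.
    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    ensemble = paste_small_votes(X_tr, y_tr)
    acc = (vote(ensemble, X_te) == y_te).mean()
    print(f"majority-vote accuracy: {acc:.3f}")
```

Because each classifier sees only a small bite, training can in principle be spread across workers that each hold a partition of the data, with only the final class votes combined at the end; that distribution step is what the proposed framework addresses.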