
Learning Ensembles from Bites: A Scalable and Accurate Approach

Published: 01 December 2004

Abstract

Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. On massive data sets, however, both techniques are limited because the size of the data set itself becomes a bottleneck. Voting many classifiers built on small subsets of the data ("pasting small votes") is a promising approach to learning from massive data sets, one that retains the power of bagging and boosting. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show that this approach is fast, accurate, and scalable.
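
To make the idea concrete, the sketch below trains many base classifiers, each on one small random subset ("bite") of a large training set, and combines them by simple majority vote. It is a minimal, single-machine sketch and not the paper's distributed framework; it assumes scikit-learn's DecisionTreeClassifier as the base learner, and the names paste_small_votes, vote, bite_size, and n_classifiers are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def paste_small_votes(X, y, n_classifiers=100, bite_size=1000, seed=0):
        """Train many base classifiers, each on one small random bite of the data.

        Illustrative sketch only: bite_size is assumed to be much smaller than len(X).
        """
        rng = np.random.default_rng(seed)
        ensemble = []
        for _ in range(n_classifiers):
            # draw one small bite without replacement
            idx = rng.choice(len(X), size=bite_size, replace=False)
            ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return ensemble

    def vote(ensemble, X):
        """Combine member predictions by simple majority vote (integer class labels assumed)."""
        preds = np.stack([clf.predict(X) for clf in ensemble]).astype(int)  # (n_classifiers, n_samples)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

Because each member sees only a small bite, individual classifiers are weak but cheap to train, and accuracy comes from voting many of them; in a distributed setting the loop over classifiers can be farmed out to separate processors.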



Published in: The Journal of Machine Learning Research, Volume 5 (December 2004), 1571 pages. ISSN 1532-4435, EISSN 1533-7928. Publisher: JMLR.org.