DOI: 10.1145/2908812.2908931
research-article
Public Access

Evolving Probabilistically Significant Epistatic Classification Rules for Heterogeneous Big Datasets

Published: 20 July 2016

ABSTRACT

We develop an algorithm to evolve sets of probabilistically significant multivariate feature interactions, with co-evolved feature ranges, for classification in large, complex datasets. The datasets may include nominal, ordinal, and/or continuous features, missing data, imbalanced classes, and other complexities. Our age-layered evolutionary algorithm generates conjunctive clauses to model multivariate interactions in datasets that are too large to be analyzed using traditional methods such as logistic regression. Using a novel hypergeometric probability mass function for fitness evaluation, the algorithm automatically archives conjunctive clauses that are probabilistically significant at a given threshold, thus identifying strong complex multivariate interactions. The method is validated on two synthetic epistatic datasets and applied to a complex real-world survey dataset aimed at determining the drivers of household infestation for an insect that transmits Chagas disease. We identify a set of 178,719 predictive feature interactions that are associated with household infestation, thus dramatically reducing the size of the search space for future analysis.
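The abstract does not spell out the fitness function, so the following is only a minimal sketch of how a hypergeometric probability mass function could score one conjunctive clause. It assumes that a clause maps each feature to either a closed numeric range or a set of allowed nominal values, that a row with a missing value for a feature in the clause does not match, and that a lower probability means a less likely chance association. The names `hypergeom_pmf`, `matches`, and `clause_fitness`, and the toy household data, are hypothetical and not taken from the paper.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k target-class rows among n rows drawn
    without replacement from N rows, of which K are in the target class."""
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def matches(clause, row):
    """True if the row satisfies every condition in the conjunctive clause."""
    for feature, allowed in clause.items():
        value = row.get(feature)
        if value is None:
            # Assumption: a missing value never satisfies a condition.
            return False
        if isinstance(allowed, tuple):
            low, high = allowed  # closed range for ordinal/continuous features
            if not (low <= value <= high):
                return False
        elif value not in allowed:  # set of allowed nominal values
            return False
    return True


def clause_fitness(clause, rows, labels, target):
    """Hypergeometric PMF of the observed class count among matched rows."""
    N = len(rows)
    K = sum(1 for y in labels if y == target)
    matched = [y for row, y in zip(rows, labels) if matches(clause, row)]
    n = len(matched)
    k = sum(1 for y in matched if y == target)
    return hypergeom_pmf(k, N, K, n)


# Toy data: label 1 = infested household, 0 = not infested.
rows = [
    {"roof": "thatch", "age": 20},
    {"roof": "thatch", "age": 35},
    {"roof": "thatch", "age": 12},
    {"roof": "tin", "age": 40},
    {"roof": "tin", "age": 5},
    {"roof": "tin", "age": 60},
    {"roof": "thatch", "age": None},  # missing value: does not match
    {"roof": "tile", "age": 25},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Clause: roof is thatch AND house age is between 10 and 50.
clause = {"roof": {"thatch"}, "age": (10, 50)}
fitness = clause_fitness(clause, rows, labels, target=1)

# Hypothetical archiving step: keep clauses at or below a chosen threshold.
threshold = 0.1
archive = [clause] if fitness <= threshold else []
```

In this toy case the clause matches three rows, all of them infested, out of eight rows with four infested, so the score is C(4,3)·C(4,0)/C(8,3) = 4/56. Exact integer binomials keep the sketch simple; for datasets of the size the abstract describes, a log-gamma formulation would likely be preferable.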


    • Published in

      GECCO '16: Proceedings of the Genetic and Evolutionary Computation Conference 2016
      July 2016
      1196 pages
      ISBN: 9781450342063
      DOI: 10.1145/2908812

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      GECCO '16 Paper Acceptance Rate: 137 of 381 submissions, 36%. Overall Acceptance Rate: 1,669 of 4,410 submissions, 38%.
