ABSTRACT
We develop an algorithm to evolve sets of probabilistically significant multivariate feature interactions, with co-evolved feature ranges, for classification in large, complex datasets. The datasets may include nominal, ordinal, and/or continuous features, missing data, imbalanced classes, and other complexities. Our age-layered evolutionary algorithm generates conjunctive clauses to model multivariate interactions in datasets that are too large to be analyzed using traditional methods such as logistic regression. Using a novel hypergeometric probability mass function for fitness evaluation, the algorithm automatically archives conjunctive clauses that are probabilistically significant at a given threshold, thus identifying strong complex multivariate interactions. The method is validated on two synthetic epistatic datasets and applied to a complex real-world survey dataset aimed at determining the drivers of household infestation for an insect that transmits Chagas disease. We identify a set of 178,719 predictive feature interactions that are associated with household infestation, thus dramatically reducing the size of the search space for future analysis.
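As a rough illustration of the fitness idea described above, the following minimal sketch scores one conjunctive clause with the standard hypergeometric probability mass function. It is not the authors' implementation: the abstract does not give the exact form of their fitness function, and the clause representation, the treatment of missing values as non-matching, and the names `clause_matches`, `hypergeom_pmf`, and `clause_fitness` are all assumptions made for this example.

```python
from math import comb


def clause_matches(clause, sample):
    """Return True if the sample satisfies every condition in the clause.

    Hypothetical representation: `clause` maps a feature name to either a
    (low, high) tuple for ordinal/continuous features or a set of allowed
    values for nominal features.
    """
    for feature, allowed in clause.items():
        value = sample.get(feature)
        if value is None:
            # Assumption: a missing value never satisfies a condition.
            return False
        if isinstance(allowed, tuple):
            low, high = allowed
            if not (low <= value <= high):
                return False
        elif value not in allowed:
            return False
    return True


def hypergeom_pmf(N, K, n, k):
    """Standard hypergeometric PMF: probability of exactly k target-class
    samples among n drawn without replacement from N samples, K of which
    belong to the target class."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def clause_fitness(clause, samples, labels, target):
    """Probability of the observed class count among matched samples arising
    by chance; smaller values suggest a stronger association."""
    N = len(samples)
    K = sum(1 for y in labels if y == target)
    matched = [y for x, y in zip(samples, labels) if clause_matches(clause, x)]
    n = len(matched)
    k = sum(1 for y in matched if y == target)
    return hypergeom_pmf(N, K, n, k)


# Toy data (fabricated purely for illustration, not from the paper).
samples = [{"age": i, "roof": "thatch" if i % 2 else "tin"} for i in range(10)]
labels = [1 if i >= 5 else 0 for i in range(10)]
clause = {"age": (5, 9)}
print(clause_fitness(clause, samples, labels, target=1))
```

Under this sketch, a clause would be archived when its fitness falls below a chosen significance threshold; the threshold value and the archiving rule are left to the reader, since the abstract does not specify them.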
Index Terms
- Evolving Probabilistically Significant Epistatic Classification Rules for Heterogeneous Big Datasets