research-article

Where are we now?: a large benchmark study of recent symbolic regression methods

Authors:
Patryk Orzechowski, University of Pennsylvania
William La Cava, University of Pennsylvania
Jason H. Moore, University of Pennsylvania

GECCO '18: Proceedings of the Genetic and Evolutionary Computation Conference, July 2018, Pages 1183–1190. https://doi.org/10.1145/3205455.3205539

Published: 02 July 2018


ABSTRACT

In this paper we provide a broad benchmarking of recent genetic programming approaches to symbolic regression in the context of state-of-the-art machine learning approaches. We use a set of nearly 100 regression benchmark problems culled from open source repositories across the web. We conduct a rigorous benchmarking of four recent symbolic regression approaches as well as nine machine learning approaches from scikit-learn. The results suggest that symbolic regression performs strongly compared to state-of-the-art gradient boosting algorithms, although in terms of running time it is among the slowest of the available methodologies. We discuss the results in detail and point to future research directions that may allow symbolic regression to gain wider adoption in the machine learning community.
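The protocol the abstract describes (several methods, many regression problems, held-out error and running time recorded for each pair) can be illustrated with a minimal sketch. This is not the authors' code: the datasets here are synthetic one-dimensional problems, the two "methods" are toy stand-ins (a mean predictor and ordinary least squares) rather than symbolic regression or scikit-learn estimators, and all function names (`make_dataset`, `fit_mean`, `fit_linear`, `benchmark`) are hypothetical.

```python
import random
import statistics
import time


def make_dataset(seed, n=200, noise=0.1):
    """Toy stand-in for a benchmark problem: y = a*x + b + Gaussian noise."""
    rng = random.Random(seed)
    a, b = rng.uniform(-2, 2), rng.uniform(-1, 1)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [a * x + b + rng.gauss(0, noise) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    """Baseline method: always predict the training mean of y."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Ordinary least squares for a single feature (closed form)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def benchmark(methods, seeds, train_frac=0.75):
    """Fit every method on every dataset; record test MSE and fit time."""
    results = {name: {"mse": [], "seconds": []} for name in methods}
    for seed in seeds:
        xs, ys = make_dataset(seed)
        cut = int(len(xs) * train_frac)  # samples are i.i.d., so an index split suffices
        for name, fit in methods.items():
            start = time.perf_counter()
            model = fit(xs[:cut], ys[:cut])
            elapsed = time.perf_counter() - start
            results[name]["mse"].append(mse(model, xs[cut:], ys[cut:]))
            results[name]["seconds"].append(elapsed)
    return results


METHODS = {"mean_baseline": fit_mean, "linear_least_squares": fit_linear}
RESULTS = benchmark(METHODS, seeds=range(10))

for name, r in RESULTS.items():
    print(name,
          "median test MSE:", statistics.median(r["mse"]),
          "median fit seconds:", statistics.median(r["seconds"]))
```

Summarizing each method by its median across problems, as done in the final loop, is one common choice for this kind of comparison; the paper itself may aggregate results differently.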


Index Terms

1. Computing methodologies
  1. Machine learning

Published in
GECCO '18: Proceedings of the Genetic and Evolutionary Computation Conference
July 2018
1578 pages
ISBN: 9781450356183
DOI: 10.1145/3205455
Editor: Hernan Aguirre, Shinshu University
General Chair: Keiki Takadama, The University of Electro-Communications
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
benchmarking
genetic programming
machine learning
symbolic regression