Evolutionary Computation Model Selection Project
Dr. James P. Hoffmann
Chris D. Ellingwood
Department of Plant Biology
Burlington, VT 05405
Willimantic, CT 06226
Sponsored by: DOE EPSCoR Computational Biology and USDA Hatch grants
Summary:
We are pursuing a new approach to model selection that combines information theory and evolutionary algorithms (both genetic algorithms and genetic programming) to help us build optimally specified models of complex biological systems. Our ultimate goal is to better understand causality in these systems. Specifically, we use the measured system behavior to drive the evolution of appropriate mechanistically based models. To avoid model overfitting, we integrate the Akaike Information Criterion into our fitness function to implement the principle of parsimony. We are currently using this approach to model invasive species dynamics and aquatic systems; however, we expect this technique to be useful in modeling many other complex systems.
Overview of our genetic algorithm method:
A community of candidate models is randomly initialized. Some gene loci (model parameters) are effectively turned off (0 indicates an inactive gene), depending on the initialized state of an evolvable internal switch. Therefore, although all models have the same fixed genome length, their structure, as evaluated by the fitness function, is functionally variable. The observations are divided into two data subsets: a training set used by the fitness function to evaluate the candidate models, and a test set used for model validation.
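The switch mechanism described above can be sketched in C as follows. This is only an illustration under stated assumptions: the `Genome` layout, the genome length `N_LOCI`, and the function `express_genome` are hypothetical names chosen here, not taken from our PGAPack-based code.

```c
#include <stddef.h>

#define N_LOCI 6  /* hypothetical fixed genome length */

/* Hypothetical genome layout: one value and one on/off switch per locus. */
typedef struct {
    double value[N_LOCI];   /* candidate parameter values */
    int    active[N_LOCI];  /* evolvable switch: 1 = expressed, 0 = off */
} Genome;

/* Copy the expressed parameters into params[]; inactive loci become 0.
   Returns the number of active loci, i.e. the effective model size. */
size_t express_genome(const Genome *g, double params[N_LOCI])
{
    size_t k = 0;
    for (size_t i = 0; i < N_LOCI; ++i) {
        if (g->active[i]) {
            params[i] = g->value[i];
            ++k;
        } else {
            params[i] = 0.0;  /* 0 indicates an inactive gene */
        }
    }
    return k;
}
```

Every genome has the same length, but two genomes with different switch states yield models with different numbers of effective parameters, which is what makes the structure functionally variable.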
We have modified PGAPack, a public-domain parallel genetic algorithm library available from Argonne National Laboratory, to include a GUI, additional output metrics, 2D and 3D visualization, and other features.
[Screenshots of our software interface]
The fitness function and models are coded in C, optimized, and parallelized for SMP (symmetric multiprocessing) systems.
Method: We chose a "correct" model from among the set of possible models and used that model to generate the "true" data (with or without added Gaussian noise). The predictions of all evolving models are compared to the "true" data. Using this synthetic-data approach, we tested our modified genetic algorithm on dynamic physiological-ecology models that we built to simulate some of the biochemistry and biophysics that occur in a leaf undergoing photosynthesis.
Note: the genetic algorithm does not "know" which model produced the data. Like evolution, it operates blindly, and the fittest models tend to survive and reproduce their structure.
A Leaf Photosynthesis Test Model: These experiments used a set of complex models that simulate the physiological ecology of a leaf undergoing photosynthesis, specifically the dynamics of the carbon, water, and heat budgets of the leaf over time. Several sub-models simulate the leaf's response to variation in different environmental factors. Soil water potential, herbivory, and ozone effects are also included in the models.
The model comprised six ordinary differential equations that describe the state variables and fluxes. External forcing functions accounted for the influence of light intensity and duration, temperature, humidity, and wind velocity, and feedback loops linked the various model subcomponents together. The nonlinearities and interdependencies in the model produced complex behaviors in leaf temperature, heat content, and water and carbon content. Here is the complete model structure depicted as a STELLA diagram:
Effect of Parsimony (P) and Noise (N) on the Success of the GA Evolving the Correct Data-Generating Model.
Treatment | - P - N | - P + N | + P - N | + P + N
Correct models / replicate runs | 0/100 | 0/100 | 96/100 | 93/100
Note: the numerator is the number of correct models evolved and the denominator is the total number of replicate runs.
These preliminary results show that our approach succeeds in evolving the correct model structure when the Akaike Information Criterion (+ P) is used to ensure parsimonious models, even in the presence of noisy data. Without the Akaike Information Criterion, the evolved models are misspecified: they are overly complex, incorrect models that overfit the data.
Here we describe some of our results with this new approach, using both synthetic test data and real field data from the zebra mussel invasion of Lake Champlain (see #6).
For information on our 2003 Evolutionary Computation workshop with Dr. David Goldberg go here.
Last updated: December 4, 2006