Title: Computational Discovery of Communicable Knowledge
1Computational Discovery of Communicable
Scientific Knowledge
Pat Langley
Institute for the Study of Learning and Expertise, Palo Alto, California
and Center for the Study of Language and Information, Stanford University, Stanford, California
http://www.isle.org/langley
langley_at_isle.org
Thanks to S. Bay, V. Brooks, S. Klooster, A.
Pohorille, C. Potter, K. Saito, J. Shrager, M.
Schwabacher, and A. Torregrosa.
2Motivations for Computational Discovery
Humans strive to discover new knowledge from
experience so that they can
- better predict and control future events
- understand both previous and future events
- communicate that understanding to others
Computational techniques should let us automate
and/or assist this discovery process.
Recent research on computer-aided discovery has
focused on some of these issues but downplayed
others.
3The Data Mining Paradigm
One computational discovery paradigm, known as
data mining or KDD, can be best characterized as:
- emphasizing the availability of vast amounts of data
- drawing on heuristic search methods to find regularities in these data
- using formalisms like decision trees, association rules, and Bayes nets to describe those regularities.
Thus, most KDD researchers favor their own
formalisms over those used by scientists and
engineers. As a result, their discoveries are
seldom very communicable to members of those
communities.
4Myths About Understandability
Within the data mining paradigm, one quite
popular myth is that:
- decision trees and rules are inherently understandable
- because logical formalisms are easier to interpret than other notations.
However, Kononenko found that doctors felt that
naïve Bayesian classifiers were easier to
interpret than decision trees.
Conclusion: Any formalism's understandability depends on the
interpreter's familiarity with that formalism.
5Myths About Understandability
Another popular myth in the data mining community
is that:
- connectionist methods produce results that are opaque
- because the set of weights they learn cannot be easily interpreted.
However, Saito and Nakano (1997) have shown that
one can use such methods to discover explicit
numeric equations.
Conclusion: Understandability depends on the resulting formalism, not on the
search method used to discover knowledge.
6Computational Scientific Discovery
An older paradigm, computational scientific
discovery, can be characterized as:
- drawing on heuristic search to find regularities in scientific data, either historical or novel
- using formalisms like numeric laws, structural models, and reaction pathways to describe regularities.
Thus, researchers in this framework favor
representations used by scientists and
engineers. As a result, their systems'
discoveries are more communicable to members of
those communities.
7Time Line for Research on Computational
Scientific Discovery
8Successes of Computational Scientific Discovery
Over the past decade, systems of this type have
helped discover new knowledge in many scientific
fields:
- stellar taxonomies from infrared spectra (Cheeseman et al., 1989)
- qualitative chemical factors in mutagenesis (King et al., 1996)
- quantitative laws of metallic behavior (Sleeman et al., 1997)
- qualitative conjectures in number theory (Colton et al., 2000)
- temporal laws of ecological behavior (Todorovski et al., 2000)
- reaction pathways in catalytic chemistry (Valdes-Perez, 1994, 1997)
Each of these has led to publications in the
refereed literature of the relevant scientific
field (see Langley, 2000).
9The Developer's Role in Computational Discovery
problem formulation
representation engineering
data manipulation
algorithm manipulation
filtering and interpretation
algorithm invocation
10Themes of the Research
We aim to extend previous approaches to
computational scientific discovery by:
- generating explanations that involve hidden objects/variables
- revising existing models rather than starting from scratch
- drawing on domain knowledge to constrain the search process
- developing interactive discovery tools for use by scientists
As in earlier work, the notation for discovered
knowledge will be the same as that used by domain
scientists.
Two promising fields in which to pursue this
research agenda are Earth science and molecular
biology.
11Some Interesting Questions in Earth Science
- What environmental variables determine the production of carbon and the generation of various gases?
- What functional forms relate these predictive variables to the ones they influence?
- How do extreme values of these variables affect behavior of the ecosystem?
- Are the Earth ecosystem parameters constant or have values changed in recent years?
12The Task of Ecological Model Revision
Given: A model of Earth's ecosystem (CASA) stated
as equations that involve observable and hidden
variables.
Given: Inferred values for global parameters and
intrinsic properties associated with discrete
variables (e.g., ground cover).
Given: Observations about numeric variables
(rainfall, sunlight, temperature, NPPc) as they
change over space and time.
Find: A revised ecosystem model with altered
equations and/or parametric values that fits the
data better.
13The NPPc Portion of CASA
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt − 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^{0.2 (Topt − Tempc − 10)}) (1 + e^{0.3 (Tempc − Topt − 10)})]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 − 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min((SR-FAS − 1.08) / SR(UMD-VEG), 0.95)
SR-FAS = −(Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI − 1000)
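The core of this submodel can be written out as a few short functions. The following is a minimal sketch rather than the CASA implementation: it assumes the usual operators where the transcript dropped symbols (for example, T1 = 0.8 + 0.02 Topt − 0.0005 Topt^2), it takes IPAR, EET, and PET as given inputs, and the function names are illustrative.

```python
import math

def t1(topt):
    """Temperature stress term T1 (operators assumed as described above)."""
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    """Temperature stress term T2: a product of two logistic factors."""
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))

def water_stress(eet, pet):
    """Water stress term W; this sketch assumes pet > 0."""
    return 0.5 + 0.5 * eet / pet

def efficiency(topt, tempc, eet, pet):
    """Photosynthetic efficiency E = 0.56 * T1 * T2 * W."""
    return 0.56 * t1(topt) * t2(topt, tempc) * water_stress(eet, pet)

def nppc(monthly):
    """Sum max(E * IPAR, 0) over an iterable of (E, IPAR) pairs, one per month."""
    return sum(max(e * ipar, 0.0) for e, ipar in monthly)

# Hypothetical inputs, chosen only to exercise the functions.
e_example = efficiency(topt=20.0, tempc=20.0, eet=50.0, pet=100.0)
total_example = nppc([(0.4, 100.0), (-0.1, 50.0)])  # the negative month is clipped to 0
```

With Topt = 20 the T1 term evaluates to exactly 1.0, which makes it easy to see how T2 and W alone scale the base efficiency of 0.56.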
14The NPPc Portion of CASA
[Figure: dependency graph among the variables of the NPPc submodel: NPPc, IPAR, E, T1, T2, W, e_max, SOLAR, FPAR, PET, EET, Topt, A, SR, Tempc, NDVI, AHI, PETTWM, VEG.]
15Improving the NPPc Portion of CASA
One way to improve the NPPc model's fit to
observed data is to:
1. Transform the model into a multilayer neural
network that makes the same predictions.
2. Identify portions of the model that are
candidates for revision.
3. Use an error-driven connectionist learning
algorithm to revise those portions of the model.
4. Transform the revised multilayer network back
into numeric equations using the improved
components.
This approach is similar to Towell's (1991)
method for revising qualitative models.
16The RF6 Discovery Algorithm
Saito and Nakano (2000) describe RF6, a
discovery system that
1. Creates a multilayer neural network that links
predictive with predicted variables using
additive and product units.
2. Invokes the BPQ algorithm to search through
the weight space defined by this network.
3. Transforms the resulting network into a
polynomial equation of the form y = Σ_i c_i Π_j x_j^{d_ij}.
They have shown this approach can discover an
impressive class of numeric equations from noisy
data.
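To make the polynomial form concrete, the sketch below evaluates y = Σ_i c_i Π_j x_j^{d_ij} the way a network of product units would, by summing weighted logarithms and exponentiating. It is an illustration of the equation form only, not RF6 or BPQ; the function names and the example coefficients are assumptions, and inputs must be positive for the logarithm to exist.

```python
import math

def product_unit(exponents, inputs):
    """Compute prod_j x_j ** d_j as exp(sum_j d_j * ln x_j); requires x_j > 0."""
    return math.exp(sum(d * math.log(x) for d, x in zip(exponents, inputs)))

def predict(coefficients, exponent_rows, inputs):
    """Additive output unit: y = sum_i c_i * product_unit(d_i, x)."""
    return sum(c * product_unit(row, inputs)
               for c, row in zip(coefficients, exponent_rows))

def to_equation(coefficients, exponent_rows, names):
    """Read the weights back out as an explicit equation string."""
    terms = []
    for c, row in zip(coefficients, exponent_rows):
        factors = [f"{n}^{d:g}" for n, d in zip(names, row) if d != 0]
        terms.append(" * ".join([f"{c:g}"] + factors))
    return "y = " + " + ".join(terms)

# Hypothetical weights: y = 2 * x1 * x2^2 + 3 * x1^0.5
coefficients = [2.0, 3.0]
exponent_rows = [[1.0, 2.0], [0.5, 0.0]]
y_example = predict(coefficients, exponent_rows, [4.0, 3.0])
equation = to_equation(coefficients, exponent_rows, ["x1", "x2"])
```

Because the exponents are ordinary network weights, they can take real values, which is what lets a weight-space search recover non-integer powers.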
17Three Facets of Model Revision
We have adapted RF6 to revise an existing
quantitative model in three distinct ways:
- Altering the value of parameters in a specified equation
- Changing the associated values for an intrinsic property
- Replacing the equation for a term with another expression.
Rather than initializing weights randomly, the
system starts with weights based on parameters in
the original model. We have applied this strategy
to revise six different portions of the NPPc
submodel.
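The key idea, starting the search from the existing model's parameters rather than from random weights, can be sketched as follows. This is not the BPQ algorithm: it swaps in plain gradient descent with a finite-difference gradient and step halving, and the model form, data, and all names are hypothetical.

```python
def rmse(predict, params, data):
    """Root mean squared error of predict(params, x) against targets y."""
    return (sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)) ** 0.5

def revise_parameters(predict, initial_params, data, step=0.01, n_iters=200, eps=1e-6):
    """Descend on RMSE, starting from the original model's parameters."""
    params = list(initial_params)
    best = rmse(predict, params, data)
    for _ in range(n_iters):
        # Finite-difference estimate of the gradient of RMSE.
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += eps
            grad.append((rmse(predict, bumped, data) - best) / eps)
        candidate = [p - step * g for p, g in zip(params, grad)]
        err = rmse(predict, candidate, data)
        if err < best:
            params, best = candidate, err   # accept only improving steps
        else:
            step /= 2                       # otherwise shrink the step
    return params, best

# Hypothetical one-term model y = c * x ** d and synthetic observations.
model = lambda p, x: p[0] * x ** p[1]
data = [(x, 0.6 * x ** 0.9) for x in (1.0, 2.0, 3.0, 4.0, 5.0)]
initial_params = [0.5, 1.0]                 # parameters of the "existing" model
initial_err = rmse(model, initial_params, data)
revised_params, revised_err = revise_parameters(model, initial_params, data)
```

Starting near a model that domain scientists already trust means the revised parameters stay comparable to the originals, so any change is easy to report in the same notation.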
18Altering Parameters in the NPPc Model
Initial model: T2 = 1.18 / [(1 + e^{0.2 (Topt − Tempc − 10)}) (1 + e^{0.3 (Tempc − Topt − 10)})]
Cross-validated RMSE: 467.910
Behavior: Gaussian-like function of temperature difference.
Revised model: T2 = 1.80 / [(1 + e^{0.05 (Topt − Tempc − 10.8)}) (1 + e^{0.3 (Tempc − Topt − 90.33)})]
Cross-validated RMSE: 461.466 (one percent reduction)
Behavior: nearly flat function in actual range of temperature difference.
Conclusion: The T2 temperature stress term contributes little to
the overall predictive ability of the NPPc
submodel.
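For readers unfamiliar with the error measure, the sketch below shows one common way to compute a cross-validated RMSE and the relative reduction between two scores. The number of folds and the fold assignment used in the study are not stated on the slide, so the contiguous k-fold split here is an assumption, and the mean-predictor `fit_mean` is only a stand-in model.

```python
def cross_validated_rmse(xs, ys, fit, k):
    """k-fold cross-validation with contiguous folds (an assumed scheme)."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train = [(xs[i], ys[i]) for i in range(n) if i not in test_idx]
        model = fit(train)                  # fit on the training folds only
        squared_errors += [(model(xs[i]) - ys[i]) ** 2 for i in sorted(test_idx)]
    return (sum(squared_errors) / n) ** 0.5

def relative_reduction(before, after):
    """Fractional drop in error from `before` to `after`."""
    return (before - after) / before

def fit_mean(train):
    """Stand-in model: always predict the mean training target."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

cv_example = cross_validated_rmse([0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6], fit_mean, k=3)
t2_reduction = relative_reduction(467.910, 461.466)   # RMSE values from the slide
```

Applied to the two RMSE values above, the relative reduction is about 0.014, which the slide rounds to one percent.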
19Revising Intrinsic Values in the Model
The NPPc submodel includes one intrinsic
property, SR, associated with the variable for
vegetation type, UMD-VEG. The corresponding RF6
network includes one hidden node for SR and one
dummy input variable for each vegetation type.
Veg type   A     B     C     D     E     F     G     H     I     J     K
Initial    3.06  4.35  4.35  4.05  5.09  3.06  4.05  4.05  4.05  5.09  4.05
Revised    2.57  4.77  2.20  3.99  3.70  3.46  2.34  0.34  2.72  3.46  1.60
RMSE: 467.910 for the original model; 448.376 for the revised model, an
improvement of four percent.
Observation: Nearly all intrinsic values are lower in the revised
model.
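The dummy-variable encoding can be illustrated in a few lines: each vegetation type becomes a one-hot input, and the hidden node for SR is simply the weighted sum of those inputs, so the learned weights are the intrinsic values themselves. This is a sketch of the encoding only; the function names are illustrative, and the values are those listed on the slide.

```python
VEG_TYPES = list("ABCDEFGHIJK")
INITIAL_SR = [3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05]
REVISED_SR = [2.57, 4.77, 2.20, 3.99, 3.70, 3.46, 2.34, 0.34, 2.72, 3.46, 1.60]

def one_hot(veg_type):
    """One dummy input per vegetation type: 1 for the active type, else 0."""
    return [1.0 if v == veg_type else 0.0 for v in VEG_TYPES]

def sr_node(weights, veg_type):
    """Hidden node for SR: weighted sum of the dummy inputs."""
    return sum(w * d for w, d in zip(weights, one_hot(veg_type)))

# Vegetation types whose intrinsic value dropped after revision.
lowered = [v for v, old, new in zip(VEG_TYPES, INITIAL_SR, REVISED_SR) if new < old]
```

Since exactly one dummy input is active per observation, an error-driven update to the SR node changes only the weight for that observation's vegetation type.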
20Revising Equations in the NPPc Model
Initial model: E = 0.56 · T1 · T2 · W
Cross-validated RMSE: 467.910
Behavior: Each stress term decreases the photosynthetic efficiency E.
Revised model: E = 0.521 · T1^0.00 · T2^0.03 · W^0.00
Cross-validated RMSE: 446.270 (five percent reduction)
Behavior: T1 and W have no effect on E, and T2 has only a minor effect.
Conclusion: The stress terms are not
useful to the NPPc model, most likely because of
recent improvements in NDVI measures.
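A quick numerical check shows why the revised exponents amount to dropping the stress terms. The sketch below treats E as a single product unit with a coefficient and one exponent per stress term; the stress values passed in are hypothetical, chosen only to compare the two forms.

```python
def efficiency(coefficient, exponents, t1, t2, w):
    """E = c * T1**d1 * T2**d2 * W**d3, a single product unit."""
    d1, d2, d3 = exponents
    return coefficient * t1 ** d1 * t2 ** d2 * w ** d3

def initial_e(t1, t2, w):
    return efficiency(0.56, (1.0, 1.0, 1.0), t1, t2, w)

def revised_e(t1, t2, w):
    return efficiency(0.521, (0.00, 0.03, 0.00), t1, t2, w)

# Halving every stress term cuts the initial E by a factor of eight ...
initial_drop = initial_e(0.5, 0.5, 0.5) / initial_e(1.0, 1.0, 1.0)
# ... but changes the revised E by only about two percent.
revised_drop = revised_e(0.5, 0.5, 0.5) / revised_e(1.0, 1.0, 1.0)
```

An exponent of 0.00 makes a factor identically 1, so T1 and W drop out entirely, while an exponent of 0.03 leaves T2 with only a slight influence.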
21Future Work on Ecological Model Revision
- Apply the revision method to other parts of the NPPc submodel and other static parts of the CASA model.
- Extend the revision method to improve parts of CASA that involve difference equations.
- Develop software for visualizing both spatial and temporal anomalies, as well as relating them to the model.
- Implement an interactive system that lets scientists direct high-level search for improved ecosystem models.
22Visualizing an Improved Model
One way to visualize a model involves plotting
its rules spatially.
Our Earth science collaborators found this
useful, as regions often correspond to
recognizable ecological zones.
23Some Interesting Biological Questions
- How do organisms acclimate to increased temperature or ultraviolet radiation?
- Why do we observe bleaching of plant cells under high light conditions?
- What differences in biological processes exist between a mutant organism and the original?
- What are the effects on an organism's biological processes when one of its important genes is removed?
24Modeling Microarray Results on Photosynthesis
Given: Qualitative knowledge about reactions and
regulations for Cyanobacteria in a high light
situation.
Given: Knowledge about the genes in Cyanobacteria
relevant to the photosynthetic process.
Given: Observed expression levels, over time, of
the organism's genes in the presence of high
ultraviolet light.
Find: A revised model with altered reactions
and regulations that explains the expression
levels and bleaching.
25A Model of Photosynthesis Regulation
How do plants modify their photosynthetic
apparatus in high light?
[Figure: initial regulatory model, drawn as signed links among the nodes Light, RR, NBLR, NBLA, PBS, DFR, psbA1, psbA2, cpcB, Photo, and Health.]
26Collecting Data on Photosynthetic Processes
www.affymetrix.com/
Microarray Trace
/wwwscience.murdoch.edu.au/teach
Continuous Culture (Chemostat)
Stress (e.g., High Light)
Adaptation Period
Sampling mRNA/cDNA
Health of Culture
Equilibrium Period
Time
27Microarray Data on Photosynthetic Regulation
28Revising a Model of Gene Regulation
Our approach carries out heuristic search through
the model space, guided by candidates' abilities
to explain the data:
Starting state: Initial model proposed by the biologist
Operators: Add a link, delete a link, determine sign on a link
Control: Greedy search for N steps to determine link structure;
exhaustive search to determine best signs on links
Evaluation: Agreement with predicted relations among partial
correlations, similar to those used in Tetrad
To reduce variance, the system repeats this
process using bootstrap sampling and only makes
changes that occur in 75% of the models.
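The control structure described above can be sketched as a greedy toggle search wrapped in a bootstrap vote. This is an illustration under loud assumptions, not the authors' system: the evaluation function is a toy that rewards links between strongly correlated variables (the real system checks predicted relations among partial correlations), link signs are omitted, and the variable names, data, step count, and number of bootstrap samples are all hypothetical. Only the 75% acceptance threshold comes from the slide.

```python
import random
from collections import Counter
from itertools import combinations

def pearson(rows, a, b):
    """Pearson correlation of columns a and b; 0.0 if either is constant."""
    n = len(rows)
    ma = sum(r[a] for r in rows) / n
    mb = sum(r[b] for r in rows) / n
    cov = sum((r[a] - ma) * (r[b] - mb) for r in rows)
    va = sum((r[a] - ma) ** 2 for r in rows)
    vb = sum((r[b] - mb) ** 2 for r in rows)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

def toy_score(links, rows):
    """Placeholder evaluation: each link gains |r| - 0.5 for its endpoints."""
    return sum(abs(pearson(rows, a, b)) - 0.5 for a, b in links)

def greedy_search(initial_links, candidates, rows, n_steps):
    """At each step, toggle (add or delete) the single best-scoring link."""
    model = set(initial_links)
    for _ in range(n_steps):
        current = toy_score(model, rows)
        best_score, best_link = max(
            (toy_score(model ^ {link}, rows), link) for link in candidates)
        if best_score <= current:
            break                           # no toggle improves the model
        model ^= {best_link}
    return model

def bootstrap_revision(initial_links, candidates, rows,
                       n_steps=5, n_boot=20, threshold=0.75, seed=0):
    """Keep only changes proposed in at least `threshold` of the resampled runs."""
    rng = random.Random(seed)
    initial = set(initial_links)
    votes = Counter()
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]       # resample with replacement
        votes.update(greedy_search(initial, candidates, sample, n_steps) ^ initial)
    accepted = {link for link, count in votes.items() if count / n_boot >= threshold}
    return initial ^ accepted

# Toy data: y tracks x closely, z alternates independently of both.
rows = [{"x": i, "y": 2 * i + (i % 3), "z": (-1) ** i} for i in range(40)]
candidates = list(combinations(["x", "y", "z"], 2))
revised = bootstrap_revision({("x", "z")}, candidates, rows)
```

Voting across resampled data sets means a link is added or deleted only when the change is stable, which guards against revisions driven by a handful of noisy measurements.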
29Greedy Search Through a Space of Models
30A Revised Model of Photosynthesis Regulation
Changes to the model improve its match to the
expression data.
[Figure: revised regulatory model, drawn as signed links among the nodes Light, RR, NBLR, NBLA, PBS, DFR, psbA1, psbA2, cpcB, Photo, and Health.]
Similar changes adapt the model to expression
data from mutants.
31Future Work on Biological Modeling
- Add more knowledge about biochemical pathways and use it to interpret other microarray data (e.g., rat metabolism, cancer).
- Introduce taxonomic knowledge to limit the search process and improve final models.
- Expand modeling formalism to support biological mechanisms in addition to abstract processes.
- Implement an interactive system that lets scientists direct high-level search for improved biological process models.
32Concluding Remarks
In summary, unlike work in the data mining
paradigm, our research on computational discovery:
- attempts to move beyond description and prediction to both explanation and understanding
- uses domain knowledge to initialize search and to characterize how the revised model differs from the initial one
- presents the new knowledge in some communicable notation that is familiar to domain experts.
This approach seems especially appropriate for
manipulating and understanding complex scientific
and engineering data.
33In Memoriam
Earlier this year, computational scientific
discovery lost two of its founding fathers:
- Herbert A. Simon (1916-2001)
- Jan M. Zytkow (1945-2001)
Both contributed to the field in many ways:
posing new problems, inventing methods, training
students, and organizing meetings. Moreover, both
were interdisciplinary researchers who
contributed to computer science, psychology,
philosophy, and statistics. Herb Simon and Jan
Zytkow were excellent role models that we should
aim to emulate.
34A Closing Quotation
We would like to imagine that the great
discoverers, the scientists whose behavior we are
trying to understand, would be pleased with this
interpretation of their activity as normal
(albeit high-quality) human thinking. . . But
science is concerned with the way the world is,
not with how we would like it to be. So we must
continue to try new experiments, to be guided by
new evidence, in a heuristic search that is never
finished but always fascinating.
Herbert A. Simon, Envoi to Scientific Discovery,
1987.
36Visualizing Errors in the Model
We can easily plot an improved model's errors in
spatial terms.
Such displays can help suggest causes for
prediction errors and thus ways to further
improve the model.
37Related Research on Discovery
Our approach to computational scientific
discovery borrows ideas from earlier work on
- equation discovery (Langley et al., 1983; Zytkow et al., 1990; Washio & Motoda, 1998; Todorovski & Dzeroski, 1997)
- revision of qualitative models (Ourston & Mooney, 1990; Towell, 1991)
- revision of quantitative models (Glymour et al., 1987; Chown & Dietterich, 2000).
However, our work combines these ideas in novel
ways to produce a discovery system with new
functionality.