Title: Thoughts on Evaluation
2Thoughts on Evaluation
ShuoYong Shi, Ruslan Sadreyev, Lisa Kinch,
others David Baker and Nick V. Grishin
http://prodata.swmed.edu/CASP8
Howard Hughes Medical Institute, Department of
Biochemistry, University of Texas Southwestern
Medical Center at Dallas
4Listen to your data!
Cutoffs, changes, strategies should
come naturally from the data you have.
5
1. Domains or not domains
2. Target categories
3. Scores and evaluation
4. Results
6Why domains?
Traditionally, CASP targets are evaluated as
domains, i.e. each target structure is parsed
into domains, and model quality is computed for
each domain separately. This strategy makes sense
for two reasons:
1. Domains can be mobile, and their relative
packing can be influenced by ligand presence,
crystal packing for X-ray structures, or be
semi-random in NMR structures. Thus even a perfect
prediction algorithm will not be able to cope with
this adequately, for instance in the absence of
knowledge about the ligand presence or crystal
symmetry.
2. Predictions may be better or worse for
individual domains than for their assembly. This
happens when domains differ in predictability,
e.g. one has a close template but the other does
not. Even if domains of a target are of equal
prediction difficulty, it is possible that the
mutual domain arrangement in the target structure,
while predictable in principle, differs from the
template, and thus is modeled incorrectly by
predictors.
Comparison of the whole-chain evaluation with the
domain-based evaluation dissects the problem of
'individual domain' vs. 'domain assembly'
modeling and should help in development of
prediction methods.
7How domains?
Evolutionary domains correspond to structurally
compact evolutionary modules.
http://prodata.swmed.edu/CASP8/evaluation/DomainDefinition.htm
Ago protein T0487 consists of 5 domains
8Do we need domains?
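125 targets, 181 evolutionary domains: do we
need that many?
Server predictions help us reduce the number
of domains: if whole-chain prediction quality is
not much different from domain prediction
quality, domain evaluation is not necessary.
Compare the length-weighted sum of domain GDT-TS scores with the whole-chain GDT-TS:
\[
\frac{\sum_{i=1}^{\text{Number of domains}} \text{Length}(\text{domain } i) \cdot \text{GDT-TS}(\text{domain } i)}
     {\sum_{i=1}^{\text{Number of domains}} \text{Length}(\text{domain } i)}
\quad \text{vs.} \quad \text{GDT-TS}(\text{whole chain})
\]
http://prodata.swmed.edu/CASP8/evaluation/Domains.htm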
9Whole chain is not the whole content of the
PDB file.
NMR models: disordered regions removed! (3.5 Å
root mean atomic displacement in THESEUS maximum
likelihood minimum RMSD superposition)
474: Arc repressor
10 467: OB-fold
468: OB-fold
11T0490: correlation between whole chain and
domain predictions
Correlation between the residue-weighted sum of
GDT-TS scores from domain-based evaluation
(y, vertical axis) and whole-chain GDT-TS
(x, horizontal axis).
12Two parameters to describe correlation between
whole chain and domain predictions
1. The root mean square (RMS) difference between
the weighted sum of GDT_TS on domains and GDT_TS
on the whole chain (RMS of y-x) measures the
absolute GDT-TS difference.
2. The slope of the best-fit line with intercept
set to 0 (slope) measures the relative GDT-TS
difference.
These parameters are computed on the top 10
predictions (ranked by the weighted sum).
Each point represents a first server model. Green,
gray and black points are the top 10, the bottom 25
and the rest of the prediction models. The blue line
is the best-fit slope line (intercept 0) to the top
10 server models. The red line is the diagonal.
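The two parameters can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function names, the input layout (one whole-chain score and one weighted domain score per server model) and the example numbers are hypothetical, and the slope is taken as the ordinary least-squares fit through the origin.

```python
import numpy as np

def weighted_domain_gdt(lengths, gdts):
    """Length-weighted average of per-domain GDT-TS scores (the y value)."""
    lengths = np.asarray(lengths, dtype=float)
    gdts = np.asarray(gdts, dtype=float)
    return float(np.sum(lengths * gdts) / np.sum(lengths))

def chain_vs_domain_stats(x_whole, y_domains, top=10):
    """RMS of (y - x) and zero-intercept slope on the top models ranked by y."""
    x = np.asarray(x_whole, dtype=float)
    y = np.asarray(y_domains, dtype=float)
    order = np.argsort(y)[::-1][:top]          # best weighted sums first
    xt, yt = x[order], y[order]
    rms = float(np.sqrt(np.mean((yt - xt) ** 2)))
    slope = float(np.sum(xt * yt) / np.sum(xt * xt))  # least squares through origin
    return rms, slope

# Hypothetical scores for three server models of one two-domain target.
rms, slope = chain_vs_domain_stats([50.0, 40.0, 30.0], [65.0, 52.0, 39.0])
print(round(rms, 3), round(slope, 3))
```

A slope near 1 with a small RMS would suggest that domain-based evaluation adds little for that target; a slope well above 1 would suggest the opposite.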
13T0504 needs domain evaluation
Correlation between the residue-weighted sum of
GDT-TS scores from domain-based evaluation
(y, vertical axis) and whole-chain GDT-TS
(x, horizontal axis).
14T0447 does not need domain evaluation
Correlation between the residue-weighted sum of
GDT-TS scores from domain-based evaluation
(y, vertical axis) and whole-chain GDT-TS
(x, horizontal axis).
15Domain swaps!
5 out of 122 targets (4%!!!!) exhibit domain swaps, e.g.
Ribbon diagram of 459 (3df8 chain A, rainbow) with its
symmetry mate (white).
16Swapped domain in T0459
Ribbon diagram of 459 (3df8 chain A) with a
swapped N-terminal β-hairpin from its symmetry
mate chain (rainbow) and the symmetry mate chain
with its swapped hairpin (white).
17Domain-swapped 459: 3df8 chain B -2-22 plus
chain A 23-106
459 with domain-swapped segment removed: 3df8
chain A 23-106
Whole chain 459: 3df8 chain A
Correlation plots for the two domain definitions
(swapped and swapped segment removed) of this
single-domain target reveal differences.
18All targets: correlation between the RMS of the
difference between GDT_TS on domains and GDT_TS
on the whole chain (vertical axis) and the slope
of the best-fit line (horizontal axis), both
computed on the top 10 server predictions.
19All targets: correlation between the RMS of the
difference between GDT_TS on domains and GDT_TS
on the whole chain (vertical axis) and the slope
of the best-fit line (horizontal axis), both
computed on the top 10 server predictions.
20Summary: Comparison of domain-based predictions
with whole chain predictions revealed a natural,
data-dictated cutoff (slope of the zero-intercept
best-fit line above 1.3) to select targets
that require domain-based evaluation. These 17
targets are T0397, T0405, T0407, T0409, T0416,
T0419, T0429, T0443, T0457, T0462, T0472, T0478,
T0487, T0496, T0501, T0504, T0510. Predictions
for other targets follow the general trend and are
of more similar quality for 'domain' and 'whole
chain', so domain-based evaluation may not
be necessary for them. Not 181 evolutionary
domains, but 146 evaluation domains.
21
1. Domains or not domains
2. Target categories
3. Scores and evaluation
4. Results
22Historic categories
NF: new folds
FR: fold recognition
CM: comparative modeling
Grouping targets into categories of approximately
equal prediction difficulty brings out the
flavors of how each method deals with different
target types.
Idea: categories should depend on predictions, and
boundaries between categories should come out
naturally from the data.
Lisa Kinch, Dylan Chivian, David Kim, David
Baker, Nick Grishin
23Modern categories
Let's see what predictions tell us.
Gaussian kernel density estimation!
http://en.wikipedia.org/wiki/Kernel_density_estimation
24Superfamily relatives of different function
Family relatives of similar function
Domains: Gaussian kernel density estimation of
domain GDT-TS scores for the first model, with GDT-TS
averaged over the top 10 servers, plotted at various
bandwidths (standard deviations). These average
GDT-TS scores for domains are shown as a spectrum
along the horizontal axis: each bar represents a
domain. The bars are colored according to the
category suggested by this analysis: black - FM,
red - FR, green - CM_H, cyan - CM_M, blue - CM_E.
The family of curves with varying bandwidth is
shown. Bandwidth varies from 0.3 to 8.2 GDT-TS
units with a step of 0.1, which corresponds to
the color ramp from magenta through blue to cyan.
Thicker curves (red, yellow-framed brown and
black) correspond to bandwidths 1, 2 and 4,
respectively.
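A Gaussian kernel density estimate of this kind can be sketched as follows. The scores below are made up for illustration, and the function names are hypothetical; the idea is only that local minima of the density, if they persist across bandwidths, are candidate category boundaries.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Density on `grid`: mean of Gaussians of width `bandwidth` centred on samples."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

def local_minima(grid, density):
    """Grid positions where the density is strictly lower than both neighbours."""
    inner = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
    return np.asarray(grid)[1:-1][inner]

# Made-up average top-10 GDT-TS values for six domains (not CASP8 data).
scores = [20.0, 22.0, 24.0, 70.0, 72.0, 74.0]
grid = np.linspace(0.0, 100.0, 1001)
for bandwidth in (1.0, 2.0, 4.0):   # the three highlighted bandwidths
    density = gaussian_kde(scores, grid, bandwidth)
    print(bandwidth, local_minima(grid, density))
```

Small bandwidths resolve individual scores and produce many minima; larger ones merge neighbouring scores, so only the most robust gaps survive.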
25
Marks on the GDT-TS axis: 30, 52, 67, 81
FM: Free modeling
FR: Fold recognition
CM_H: Comparative modeling hard
CM_M: Comparative modeling medium
CM_E: Comparative modeling easy
27Whole chains: Gaussian kernel density estimation
of domain GDT-TS scores for the first model, with
GDT-TS averaged over the top 10 servers, plotted at
various bandwidths (standard deviations). These
average GDT-TS scores for domains are shown as a
spectrum along the horizontal axis: each bar
represents a domain. The bars are colored
according to the category suggested by this
analysis: black - FM, red - FR, green - CM_H,
cyan - CM_M, blue - CM_E. The family of curves
with varying bandwidth is shown. Bandwidth varies
from 0.3 to 8.2 GDT-TS units with a step of
0.1, which corresponds to the color ramp from
magenta through blue to cyan. Thicker curves
(red, yellow-framed brown and black) correspond to
bandwidths 1, 2 and 4, respectively.
28Summary
1. Server predictions reveal natural
target groups that can be used as CASP
categories.
2. Two major categories are
apparent: Hard (35, 25%) and Easier (112,
75%).
3. To enrich evaluation, it is possible
to divide Hard into 2 and Easier into 3
categories, so the final evaluation can be done
with 5 categories: FM, FR, CM_H, CM_M, CM_E.
4. Average top 10 server GDT-TS is a robust
category indicator for domains and whole chains,
with the boundaries between categories around
30, 50, 70 and 80.
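As a minimal sketch, assigning a category from the average top 10 server GDT-TS is a lookup against the approximate boundaries quoted above. The function name is hypothetical, and placing a score that falls exactly on a boundary into the easier category is an assumption, not something the slides specify.

```python
from bisect import bisect_right

CATEGORIES = ("FM", "FR", "CM_H", "CM_M", "CM_E")
BOUNDARIES = (30.0, 50.0, 70.0, 80.0)   # approximate values from the summary

def categorize(avg_top10_gdt_ts):
    """Map an average top-10 server GDT-TS (0-100) to a difficulty category."""
    # Assumption: a score equal to a boundary goes to the easier category.
    return CATEGORIES[bisect_right(BOUNDARIES, avg_top10_gdt_ts)]

for score in (25.0, 40.0, 60.0, 75.0, 90.0):
    print(score, categorize(score))
```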
29
1. Domains or not domains
2. Target categories
3. Scores and evaluation
4. Results
30Three scores: TS, TR and CS
ShuoYong Shi, Ruslan Sadreyev, Jing Tong, David
Baker and Nick V. Grishin
http://prodata.swmed.edu/CASP8
Howard Hughes Medical Institute, Department of
Biochemistry, University of Texas Southwestern
Medical Center at Dallas
31GDT-TS: the best single score
Why? Because it is 4 scores in one, from 4
different superpositions.
\[ \text{GDT-TS} = \frac{N(1) + N(2) + N(4) + N(8)}{4N} \]
where N(r) is the number of superimposed residue pairs with
the Cα-Cα distance within r Å, and N is the number of
residues in the target.
Since many approaches are trained to produce
models scoring better according to some
evaluation method, flaws in the evaluation method
will result in better-scoring models that will
not represent real protein structure in any
better way. One such danger is compression
of coordinates, which decreases the gyration
radius and may increase some scores based on
Cartesian superpositions.
http://prodata.swmed.edu/CASP8/evaluation/Scores.htm
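The GDT-TS formula can be sketched as below. This assumes the superpositions have already been done elsewhere (e.g. by LGA) and that the caller supplies, for each cutoff, the Cα-Cα distances of the aligned residue pairs under the superposition chosen for that cutoff; the function name, the input layout and the use of "less than or equal" for "within" are assumptions.

```python
import numpy as np

def gdt_ts(distances_by_cutoff, n_target):
    """GDT-TS as a fraction: [N(1) + N(2) + N(4) + N(8)] / (4 N)."""
    total = 0
    for cutoff, distances in distances_by_cutoff.items():
        # N(r): aligned pairs whose CA-CA distance is within the cutoff r
        total += int(np.sum(np.asarray(distances, dtype=float) <= cutoff))
    return total / (4.0 * n_target)

# Toy example: the same four distances reused for every cutoff.
d = [0.5, 1.5, 3.0, 9.0]
print(gdt_ts({1.0: d, 2.0: d, 4.0: d, 8.0: d}, n_target=4))
```

In practice each cutoff would have its own distance list, because the superposition that maximizes N(1) generally differs from the one that maximizes N(8).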
32Compression is bad, but GDT-type scores may favor
it
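To make "compression" concrete, here is a small sketch that scales coordinates toward their centroid and reports the radius of gyration. The coordinates and function names are hypothetical; the sketch only shows that uniform scaling by a factor shrinks the gyration radius by the same factor, not how any particular score responds.

```python
import numpy as np

def radius_of_gyration(coords):
    """Root mean square distance of the atoms from their centroid."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

def compress(coords, factor):
    """Scale coordinates about their centroid (factor < 1 compresses)."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)
    return center + factor * (coords - center)

# Four made-up CA positions, in Angstroms.
ca = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [7.6, 3.8, 0.0]]
print(radius_of_gyration(ca), radius_of_gyration(compress(ca, 0.9)))
```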
33Attraction and repulsion in scores
The GDT-TS score measures the fraction of residues in
a model within a certain distance from the same
residues in the structure after a superposition.
This approach is based on a "reward". Taking an
analogy with physical forces, such a score is
only the "attraction" part of a potential, and
there is no "repulsion" component in GDT-TS. It
might have been reasonable a few years ago, when
predictions were quite poor. It was important to
detect any positive feature of a model, since
there were more negatives about a model than
positives. Today, many models reflect
structures well. When the positives start to
outweigh the negatives, it becomes important to
pay attention to the negatives. Thus we
introduced a "repulsion" component into the
GDT-TS score. When a residue is close to its
"correct" residue, GDT-TS rewards it, and if a
residue is too close to "incorrect" residues
(other than the residue that is modeled), we
subtract a penalty from the GDT-TS score.
TR score, i.e. 'The Repulsion'. The TR score, in
addition to rewarding close superposition of
corresponding model and target residues,
penalizes close placement of other residues.
34TR score is calculated as follows:
1. Superimpose the model with the target using LGA in the
sequence-dependent mode, maximizing the number of
aligned residue pairs within a distance cutoff of 4 Å.
2. For each aligned residue pair, calculate a
GDT-TS-like score
\[ S_0(R_1, R_2) = \tfrac{1}{4}\,[N(1) + N(2) + N(4) + N(8)], \]
where N(r) is the number of superimposed residue pairs with
the Cα-Cα distance within r Å.
3. [...] residues in both structures. For each residue R,
choose residues in the other structure that are
spatially close to R, excluding the residue
aligned with R and its immediate neighbors in the
chain. Count the numbers of such residues with Cα-Cα
distance to R within cutoffs of 1, 2, and 4 Å. (As
opposed to GDT-TS, we do not use the cutoff of 8 Å,
as it is too inclusive.)
4. The average of these counts defines the penalty
assigned to a given residue R:
\[ P(R) = \tfrac{1}{3}\,[N(1) + N(2) + N(4)]. \]
5. For each aligned residue pair (R1, R2), the
average of the penalties for each residue,
\[ P(R_1, R_2) = \tfrac{1}{2}\,[P(R_1) + P(R_2)], \]
is weighted and subtracted from the GDT-TS score for
this pair. The final score is prohibited from being negative:
\[ S(R_1, R_2) = \max[\,S_0(R_1, R_2) - w\,P(R_1, R_2),\ 0\,]. \]
Among tested values of the weight w, we found that w = 1.0
produced the scores that were most consistent
with the evaluation of model abnormalities by
human experts.
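The per-pair reward and penalty can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes the model is already superimposed on the target (the LGA step is not reproduced), that residues are aligned by index, that a single superposition is used for all cutoffs, and that "within r" means less than or equal to r. The function name is hypothetical.

```python
import numpy as np

def tr_pair_scores(target_ca, model_ca, w=1.0):
    """Per-pair TR-like scores for superimposed, index-aligned CA coordinates."""
    target_ca = np.asarray(target_ca, dtype=float)
    model_ca = np.asarray(model_ca, dtype=float)
    n = len(target_ca)
    # cross[i, j]: distance between target residue i and model residue j
    cross = np.linalg.norm(target_ca[:, None, :] - model_ca[None, :, :], axis=2)
    aligned = np.diag(cross)
    # Reward S0: fraction of the cutoffs 1, 2, 4, 8 that the aligned pair satisfies
    s0 = sum((aligned <= r).astype(float) for r in (1.0, 2.0, 4.0, 8.0)) / 4.0
    # Exclude the aligned partner and its immediate chain neighbours
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) > 1

    def penalty(dist):
        # Average count of "incorrect" residues of the other structure
        # within 1, 2 and 4 of each residue
        return sum(((dist <= r) & allowed).sum(axis=1) for r in (1.0, 2.0, 4.0)) / 3.0

    p_target = penalty(cross)    # model residues crowding each target residue
    p_model = penalty(cross.T)   # target residues crowding each model residue
    p_pair = 0.5 * (p_target + p_model)
    return np.maximum(s0 - w * p_pair, 0.0)

# Toy chain along the x axis, and a model collapsed onto a single point.
target = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
print(tr_pair_scores(target, target))
print(tr_pair_scores(target, np.zeros_like(target)))
```

In the collapsed model the first residue coincides with its target position, so it would receive the full reward, but the other model residues sit on top of it and the penalty removes that reward.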
35Segments of superimposed structure (black) and
model (red), with a 1 Å distance cutoff. The superposition
does not look very good, but assume that only
segments of larger structures are shown, and the
rest of the structures looks better.
Scale: 0.6 Å
36GDT-TS calculation for 1 Å: find the number of
corresponding atoms within 1 Å. Total GDT-TS
contribution: 0 + 1 + 0 = 1
Distances shown: 1.34 Å, 1.34 Å, 0.6 Å
Scale: 0.6 Å
37Penalty calculation for 1 Å: for those residue
pairs that contribute to GDT-TS, find incorrect
atoms within 1 Å. Residue pairs (1,1) and (3,3) do
not contribute to the penalty, as they do not
contribute to GDT-TS.
Distances shown: 1.34 Å, 1.34 Å
Scale: 0.6 Å
38Penalty calculation for 1 Å: for those residue
pairs that contribute to GDT-TS, find incorrect
atoms within 1 Å. Residue pair (2,2) may
contribute to the penalty.
Distance shown: 0.6 Å
Scale: 0.6 Å
39Penalty calculation for 1 Å. Step 1, from the
structure (black): which incorrect residues in
the model are within 1 Å of residue 2 in the
structure?
Scale: 0.6 Å
40Penalty calculation for 1 Å. Step 1, from the
structure (black): which incorrect residues in
the model are within 1 Å of residue 2 in the
structure? It is residue 1 (0.84 Å), as residue 2
is correct, and residue 3 is 1.2 Å away.
Distances shown: 1.2 Å, 0.6 Å, 0.84 Å
Scale: 0.6 Å
41Penalty calculation for 1 Å. Step 1, from the
structure (black): from the structure, only residue
1 of the model contributes a count of 1 to the
penalty.
Distances shown: 1.2 Å, 0.6 Å, 0.84 Å
Scale: 0.6 Å
42Penalty calculation for 1 Å. Step 2, from the
model (red): which incorrect residues in the
structure are within 1 Å of residue 2 in the model?
Scale: 0.6 Å
43Penalty calculation for 1 Å. Step 2, from the
model (red): which incorrect residues in the
structure are within 1 Å of residue 2 in the
model? They are residue 1 (0.84 Å) and residue 3
(0.84 Å), as residue 2 is correct.
Distances shown: 0.84 Å, 0.84 Å, 0.6 Å
Scale: 0.6 Å
44Penalty calculation for 1 Å. Step 2, from the
model (red): from the model, residues 1 and 3
contribute a total count of 2 to the penalty.
Distances shown: 0.84 Å, 0.84 Å, 0.6 Å
Scale: 0.6 Å
45Penalty calculation for 1 Å. Step 3: averaging
penalty contributions from the structure and the
model. Total penalty = (penalty from the structure +
penalty from the model)/2 = (1 + 2)/2 = 1.5. The
penalty for 1 Å is 1.5.
46TR contribution for 1 Å. Compute it as TR = GDT -
weight × penalty. Check if TR > 0.
Weight   TR
0.25     1 - 0.25 × 1.5 = 0.625 > 0
0.5      1 - 0.5 × 1.5 = 0.25 > 0
1        1 - 1 × 1.5 = -0.5 < 0, so TR is set to 0
Compute these for 2 Å, 4 Å and 8 Å, average, and
divide by the structure length.
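The arithmetic of this worked example can be checked with a few lines. The function name is hypothetical; the clamp at zero follows the rule that the score is prohibited from being negative.

```python
def tr_contribution(gdt, penalty, weight):
    """TR contribution at one cutoff: reward minus weighted penalty, floored at 0."""
    return max(gdt - weight * penalty, 0.0)

# Values from the 1 A example: GDT contribution 1, penalty 1.5.
for weight in (0.25, 0.5, 1.0):
    print(weight, tr_contribution(1.0, 1.5, weight))
```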
47R = 0.991
Correlation between TR score
(vertical axis) and GDT-TS (horizontal axis).
Scores for the top 10 first server models were
averaged for each domain, shown by its number
positioned at a point with the coordinates equal
to these averaged scores. Domain numbers are
colored according to the difficulty category
suggested by our analysis: black - FM (free
modeling), red - FR (fold recognition), green -
CM_H (comparative modeling hard), cyan - CM_M
(comparative modeling medium), blue - CM_E
(comparative modeling easy).
48Comparison of remote homologs: compressing one
homolog can increase GDT_TS
Sample of N = 2050 pairs of SCOP domains
sharing a superfamily
Lower third by DALI Z (2.0 [...] 2.0 [...] N = 680
N = 540
49In over 40% of pairs of remote homologs, GDT_TS
increases with compression
Domain pairs where compression causes GDT_TS
growth (N = 239 of 540)
50Compression of FR models can cause GDT_TS growth
SAM-T08 server
All 108 models of FR targets
Models of FR targets where compression causes
GDT_TS growth (N = 43)
51Contact score CS
Scores comparing intramolecular distances between
a model and a structure (contact scores) have
different properties than intermolecular distance
scores based on optimal superposition. One
advantage of such scores is that superpositions,
and thus arguments about their optimality, are
not involved. The problems with developing a
good contact score are 1) contact definition and
2) mathematical expressions converting distance
differences to scores.
CS score is calculated as follows:
1. A contact between residues is defined by a
distance within 8.4 Å between their Cα atoms.
2. The difference between such distances
in a model and a structure is computed and used
as a fraction of the distance in the structure.
3. Fractional distances above 1 (distance
difference above the distance itself) are
discarded, and an exponential is used to convert
distances to scores (0-1). The factor in the
exponent is chosen to maximize the correlation
between contact scores and GDT-TS scores.
4. These residue pair scores are averaged over all
pairs of contacting residues.
We call this score CS, i.e. 'contact score', for short.
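The steps above can be sketched as follows. Several details are assumptions made only for illustration: contacts are taken from the structure (target) with "less than or equal to 8.4", all residue pairs i < j are considered, discarded pairs contribute a score of 0 to the average, and the exponent factor `decay` is a placeholder, since the fitted value is not given here. The function name is hypothetical.

```python
import numpy as np

def contact_score(target_ca, model_ca, cutoff=8.4, decay=5.0):
    """Superposition-free score from CA-CA distance differences over target contacts."""
    t = np.asarray(target_ca, dtype=float)
    m = np.asarray(model_ca, dtype=float)
    i, j = np.triu_indices(len(t), k=1)            # all residue pairs i < j
    d_target = np.linalg.norm(t[i] - t[j], axis=1)
    d_model = np.linalg.norm(m[i] - m[j], axis=1)
    contact = d_target <= cutoff                   # contacts defined on the structure
    if not np.any(contact):
        return 0.0
    frac = np.abs(d_model[contact] - d_target[contact]) / d_target[contact]
    # Assumption: fractional differences above 1 are "discarded" by scoring 0
    scores = np.where(frac > 1.0, 0.0, np.exp(-decay * frac))
    return float(scores.mean())

# Toy chain; a shifted copy scores 1, a copy compressed by half scores lower.
target = np.array([[3.8 * k, 0.0, 0.0] for k in range(5)])
print(contact_score(target, target + np.array([10.0, -3.0, 2.0])))
print(contact_score(target, 0.5 * target))
```

Because only internal distances enter, moving the model rigidly does not change the score, which is the property the slide highlights.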
52R = 0.962
Correlation between contact score CS
(vertical axis) and GDT-TS (horizontal axis).
Scores for the top 10 first server models were
averaged for each domain, shown by its number
positioned at a point with the coordinates equal
to these averaged scores. Domain numbers are
colored according to the difficulty category
suggested by our analysis: black - FM (free
modeling), red - FR (fold recognition), green -
CM_H (comparative modeling hard), cyan - CM_M
(comparative modeling medium), blue - CM_E
(comparative modeling easy).
53Server rankings on all targets in domains for
three scores. On 143 domains, ranking does not
change much with score, illustrating that 1)
scores correlate with each other and 2) the
ranking is robust.
54Server rankings on FR domains for three
Z-scores. On 28 FR domains, ranking shows small
variations, illustrating the differences between
individual scores and between servers.
55Summary
1. A single score is not enough for
model evaluation.
2. Do not train your method on a single score.
3. Introduction of repulsion
terms in the score is useful, as it penalizes
compression and may help improve
alignments.
4. Superposition-independent contact
scores are fast and easy to compute, accurate, and
correlate well with superposition-based scores.
56
1. Domains or not domains
2. Target categories
3. Scores and evaluation
4. Results
57ALL 67 human/server targets in domains for first
models, LGA GDT-TS
Sum of GDT-TS scores
Sum of Z-scores
58ALL 67 human/server targets in domains for first
models, GDT-TS and TR scores
Sum of GDT-TS scores
Sum of TR-scores
59ALL 67 human/server targets in domains for first
models, LGA GDT-TS
Sum of TR scores
Sum of Z-scores from TR
60Bootstrap for statistics
61Acknowledgement
Our group: Shuoyong Shi, Jing Tong, Ruslan Sadreyev,
Lisa Kinch, Jimin Pei, Ming Tang, Sasha Safronova,
Yuan Qi, Hua Cheng, Jamie Wrabl, Indraneel Majumdar,
Erik Nelson, Yong Wang, S. Sri Krishna,
Bong-Hyun Kim, Dorothee Staber
Collaborators: David Baker (U. Washington), Kimmen
Sjölander (UC Berkeley), William Noble (U. Washington)
HHMI, NIH, UTSW, The Welch Foundation