Title: Machine learning and protein structure
1Machine learning and protein structure
- Describing structural principles underlying fold space
- Core element prediction
- Folding pathways
2Describing structural principles
- Several hundred protein folds already known
- Proteins distributed unevenly amongst different folds
- Many proteins adopt one of a limited number of fold types (superfolds)
- Most folds are adopted by only a small number of proteins
- Understanding this requires knowledge of protein structure principles in the wider context of folding, function and evolution
- Number of protein structures expected to increase with structural genomics projects
- The total number of folds in biota is predicted to be between 1000 and 10000.
- Automated methods of analysis needed
- Can machine learning strategies be used to learn and describe the structural principles underpinning fold space?
3Protein structure classification
- Several classification schemes currently exist
- SCOP (manual)
- CATH (semi-automatic)
- FSSP (automatic)
- Largely similar in assignment but significantly different
- Classification alone does not explain why some folds are more prevalent than others
- Need to understand how folds differ in terms of fundamental structural properties
- Folds are usually described in terms of spatial and topological arrangements of secondary structure elements
- SCOP includes detailed (manual) descriptions of folds
4SCOP
- All-β class
- Immunoglobulin-like beta-sandwich (13): sandwich; 7 strands in 2 sheets; greek-key; some members of the fold have additional strands
- Common fold of diphtheria toxin/transcription factors/cytochrome f (7): sandwich; 9 strands in 2 sheets; greek-key; subclass of immunoglobulin-like fold
- Prealbumin-like (4): sandwich; 7 strands in 2 sheets, greek-key; variations: some members have additional 1-2 strands to common fold
- alpha-Amylase inhibitor tendamistat (1): sandwich; 6 strands in 2 sheets
- Cupredoxins (1): sandwich; 7 strands in 2 sheets, greek-key; variations: some members have additional 1-2 strands
- C2 domain-like (4): sandwich; 8 strands in 2 sheets; greek-key
- TRAF domain (1): sandwich; 8 strands in 2 sheets; greek-key
5SCOP
- α+β class (preliminary)
- Sulfite oxidase, middle catalytic domain (1): unusual fold; contains 3 layers of beta-sheet structure and a beta-grasp like motif
- Fumarylacetoacetate hydrolase, FAH (1): unusual fold; contains 3 layers of beta-sheet structure and an SH3-like motif
- Aromatic aminoacid monoxygenases, catalytic and oligomerization domains (1): unusual fold
- Substrate-binding domain of HMG-CoA reductase (1): unusual fold
- Conserved core of transcriptional regulatory protein vp16 (1): unusual fold
- Insert subdomain of RNA polymerase alpha subunit N-terminal domain (1): unusual fold; contains a left-handed beta-alpha-beta unit
- Baseplate structural protein gp11 (1): unusual fold; trimer
- Non-globular alpha+beta subunits of globular proteins (1)
6Wish list
- We would like to generate expert-like rules
- Automatically
- Objectively
- That are easily interpreted
- That discriminate between folds
7Learning Rules with Inductive Logic Programming (ILP)
- Rules were learnt using the Progol-4.4 ILP system.
- Progol learns rules automatically from examples and background knowledge.
- Progol learns rules in such a way as to maximise the coverage of positive examples and minimise the coverage of negative examples.
- Weighted to favour shorter rules given similar coverage
- Examples, background knowledge and the final rules are represented as logic programs
- Can incorporate relational data
- Rules can be interpreted easily
8Global v local features
- Previous ILP study could only learn local fold features
- E.g. rules for the Rossmann fold
- Previous ILP rule (local): The 1st strand is followed by a helix; the two elements are separated by a coil of about one residue. The 6th strand is followed by a helix.
- SCOP (global): core: 3 layers, α/β/α; parallel beta-sheet of 6 strands, order 321456
- Insertions and deletions inhibit learning of global features: it is difficult to identify structurally equivalent (core) secondary structure elements
- We avoid this problem by using multiple structure alignments
9 10Multiple alignment → Core elements
11Background knowledge
- Background knowledge is defined in terms of the properties of, and relationships between, core elements
- number_helices(Lo < D < Hi)
- sheet(D, A, Stype)
- helix(D, B, Htype, Core)
- strand_position(A, B, N)
- adjacent(B, C)
- coil(B, C, N)
- contact(B, C)
- antiparallel(B, C)
- parallel(B, C)
- end_strand_distance(A, B, C, Dist)
- pair(B, C, Bloc, Cloc)
- helix_angle(B, C, Angle)
- has_n_strands(A, N)
- barrel(A)
- bifurcated(A)
- sheet_top_X(A, N1, N2, ..., NX)
- contains(B, AA, Loc)
- contains(B, AA)
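As a rough illustration of how such background knowledge can be held as relational facts, the sketch below stores a few predicates from the list above as Python tuples and matches them against a pattern. The domain, sheet and helix identifiers (`d1`, `s1`, `h1`, `h2`) and all fact values are invented for the example, and `query` is a hypothetical helper, not part of Progol.

```python
# Toy fact base: each fact is (predicate, arg1, arg2, ...).
# All identifiers and values below are invented for illustration.
FACTS = {
    ("sheet", "d1", "s1", "para"),      # sheet(D, A, Stype)
    ("has_n_strands", "s1", 6),         # has_n_strands(A, N)
    ("helix", "d1", "h1", "h", "g"),    # helix(D, B, Htype, Core)
    ("helix", "d1", "h2", "h", "i"),
    ("contact", "h1", "h2"),            # contact(B, C)
    ("parallel", "h1", "h2"),           # parallel(B, C)
}


def query(facts, pattern):
    """Return the facts matching `pattern`; None acts as a wildcard."""
    matches = [
        fact for fact in facts
        if len(fact) == len(pattern)
        and all(p is None or p == f for p, f in zip(pattern, fact))
    ]
    # Sort on the text form so mixed argument types never get compared.
    return sorted(matches, key=repr)


if __name__ == "__main__":
    # All helices of type "h" in domain d1, whatever their core position.
    for fact in query(FACTS, ("helix", "d1", None, "h", None)):
        print(fact)
```

In a real ILP run these facts would be Prolog clauses; the tuple form only shows that each predicate relates named elements to one another or to a property value.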
12ILP run
- Rules are constructed from examples and background knowledge
- Progol builds steadily more specific rules, optimising compression f
- f = p - n - c, where
- p = number of positive examples covered
- n = number of negative examples covered
- c = length of rule
- fold(A,'Rossmann-fold').
- fold(A,'Rossmann-fold') :- sheet(A,B,para).
- fold(A,'Rossmann-fold') :- sheet(A,B,para), has_n_strands(B,6).
- ...
- fold(A,'Rossmann-fold') :- sheet(A,B,para), helix(A,C,h,g), helix(A,D,h,i), helix_angle(C,D,para), sheet_top_6(B,3,2,1,4,5,6).
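The general-to-specific search on this slide can be sketched in Python. The score is read here as f = p - n - c, which is an assumption about the slide's formula, and c is taken to be the number of body literals. The example domains are invented, and the two tests stand in for sheet(A,B,para) and has_n_strands(B,6). This is not how Progol itself is implemented.

```python
def compression(p, n, c):
    """Compression score, read from the slide as f = p - n - c."""
    return p - n - c


def covers(body, example):
    """A rule covers an example if every body literal holds for it."""
    return all(literal(example) for literal in body)


def score(body, positives, negatives):
    p = sum(covers(body, e) for e in positives)
    n = sum(covers(body, e) for e in negatives)
    return compression(p, n, len(body))  # assumption: c = number of body literals


def parallel_sheet(e):
    return e["sheet"] == "para"


def six_strands(e):
    return e["strands"] == 6


# Invented toy data: 4 positive and 5 negative example domains.
POSITIVES = [{"sheet": "para", "strands": 6} for _ in range(4)]
NEGATIVES = [
    {"sheet": "para", "strands": 8},
    {"sheet": "para", "strands": 8},
    {"sheet": "anti", "strands": 6},
    {"sheet": "anti", "strands": 7},
    {"sheet": "mixed", "strands": 5},
]

# Successively more specific rule bodies, most general first.
REFINEMENTS = [[], [parallel_sheet], [parallel_sheet, six_strands]]
SCORES = [score(body, POSITIVES, NEGATIVES) for body in REFINEMENTS]

if __name__ == "__main__":
    print(SCORES)
```

On this toy data each added literal removes more negative examples than it costs in rule length, so the score rises with each refinement. A literal that excluded no negatives would lower the score, which is the sense in which shorter rules are favoured.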
13Fold Rules
- Rules were learnt for 45 of the most popular folds in the SCOP database.
- Rules learnt were compared to descriptions given by SCOP.
- Cross-validation testing.
14Rossmann fold
Has between 3 and 4 helices. Has α-helix B at core position "b". B contains a glycine in both its middle and N-terminal regions.
OR
Has a parallel sheet B of six strands with topology 321456. Has α-helices C and D at core positions "g" and "i" respectively. C and D are in contact and parallel.
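To make the two-clause structure of the Rossmann rule concrete, the sketch below writes each clause as a Python predicate over a simplified domain record and joins them with OR. The record layout (`n_helices`, `core_helices`, `sheets`, `contacts`, `parallel_pairs`) is a hypothetical encoding chosen for the example. It ignores helix type and assumes a six-character topology string implies six strands.

```python
def clause_1(domain):
    """3-4 helices, with a helix at core position "b" that contains
    glycine in both its middle and N-terminal regions."""
    helix_b = domain["core_helices"].get("b")
    return (
        3 <= domain["n_helices"] <= 4
        and helix_b is not None
        and {"middle", "n_term"} <= helix_b["glycine_regions"]
    )


def clause_2(domain):
    """Parallel six-stranded sheet with topology 321456, plus helices at
    core positions "g" and "i" that are in contact and parallel."""
    has_sheet = any(
        s["type"] == "para" and s["topology"] == "321456"
        for s in domain["sheets"]
    )
    pair = frozenset(("g", "i"))
    return (
        has_sheet
        and "g" in domain["core_helices"]
        and "i" in domain["core_helices"]
        and pair in domain["contacts"]
        and pair in domain["parallel_pairs"]
    )


def is_rossmann(domain):
    """The learnt rule is the disjunction of its two clauses."""
    return clause_1(domain) or clause_2(domain)


# Invented example domains, not real structures.
ROSSMANN_LIKE = {
    "n_helices": 6,
    "core_helices": {"g": {"glycine_regions": set()},
                     "i": {"glycine_regions": set()}},
    "sheets": [{"type": "para", "topology": "321456"}],
    "contacts": {frozenset(("g", "i"))},
    "parallel_pairs": {frozenset(("g", "i"))},
}
OTHER = {
    "n_helices": 4,
    "core_helices": {"b": {"glycine_regions": {"middle"}}},
    "sheets": [{"type": "anti", "topology": "1234"}],
    "contacts": set(),
    "parallel_pairs": set(),
}

if __name__ == "__main__":
    print(is_rossmann(ROSSMANN_LIKE), is_rossmann(OTHER))
```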
15SCOP
ILP (old)
ILP (new)
- Global property (sheet topology) learnt
- Part of a conserved G-X-G-X-X-G sequence motif involved in nucleotide binding
16Immunoglobulin fold
Has antiparallel sheets B and C. B has 3 strands, topology 123. C has 4 strands, topology 2134.
17
Immunoglobulin
SCOP: sandwich; 7 strands in 2 sheets; greek-key; some members of the fold have additional strands
ILP: Has antiparallel sheets B and C. B has 3 strands, topology 123. C has 4 strands, topology 2134.
Prealbumin
SCOP: sandwich; 7 strands in 2 sheets, greek-key; variations: some members have additional 1-2 strands to common fold
ILP: Has a mixed sheet B. B has 3 strands with topology 213.
- Can learn rules to discriminate between folds with similar SCOP descriptions
18TIM barrel
Has between 5 and 9 helices. Has a parallel sheet of 8 strands.
19
- SCOP: contains parallel beta-sheet barrel, closed; n=8, S=8; strand order 12345678; the first six superfamilies have similar phosphate-binding sites
- ILP: Has between 5 and 9 helices. Has a parallel sheet of 8 strands.
- ILP does not give as many details as SCOP
- Fails to find topology or barrel structure
- Many TIM barrels are not closed
- Not necessary to include many details as few alternative folds have 8 core strands in a parallel sheet
20Cross-validation
- Each of the 45 folds was subject to a 5-fold cross-validation procedure.
- The overall accuracy was quite high (97%), although this is, in part, due to the large number of negative examples used. A large number of negative examples were included in the learning procedure to limit the learning of spurious rules. A largest class prediction (that is, predicting that every example was NOT an example of the fold of interest) would give an accuracy of 95%.
- The overall accuracy was found to be statistically significant (Pearson's χ² = 58.5, p << 0.01) where the reference state was based on such a largest class prediction.
- The overall precision and recall were 77% and 55% respectively. The recall was relatively low for those folds with few examples, largely due to the difficulties in producing stable multiple structure alignments.
- For the 10 best represented folds examined here, the precision and recall were 83% and 69% respectively.
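The relationship between accuracy, the largest class baseline, precision and recall can be sketched as follows. The confusion-matrix counts are invented to show why accuracy is dominated by the negative examples; they are not the study's data.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, largest-class baseline, precision and recall from
    confusion-matrix counts (true/false positives and negatives)."""
    total = tp + fp + tn + fn
    positives = tp + fn
    negatives = tn + fp
    return {
        "accuracy": (tp + tn) / total,
        # Accuracy obtained by always predicting the larger class.
        "baseline": max(positives, negatives) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / positives if positives else 0.0,
    }


# Hypothetical counts: 1000 examples, of which only 50 are positive.
EXAMPLE = metrics(tp=28, fp=8, tn=942, fn=22)

if __name__ == "__main__":
    for name, value in EXAMPLE.items():
        print(f"{name}: {value:.2f}")
```

With so few positives, predicting "not this fold" for every example already scores 0.95 in this toy case, so precision and recall say more about rule quality than accuracy does.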
21Conclusions - Fold rules
- Protein structure principles underpinning much of fold space can be automatically described using ILP.
- The rules are objective and discriminate between different fold classes.
- The rules learnt are readily interpretable to human protein structure experts.
- The rules learnt here describe global fold properties as well as local features.
- The rules learnt by ILP are, in many cases, comparable to expert principles given in the SCOP database.
- The procedure applied here to the manual SCOP fold classification could also be applied to any other protein structure classification scheme. Used in conjunction with a fully automatic structure classification (such as DALI), ILP could be used to derive the principles underlying fold space from coordinates in an automatic and objective fashion.
- Given the increasing emphasis on high-throughput experimental projects in biology, automatic methods of analysis such as ILP are going to become increasingly important in deriving principles from data.
22Problems
- Difficult to learn rules for sparsely populated folds
- Fewer examples to learn from
- Difficult to validate rules
- Multiple structure alignments less reliable
- Can we predict core secondary structure elements without multiple structure alignments?
23Core element prediction
- Can we predict structurally variable secondary structure elements in the absence of a reliable multiple structure alignment? E.g. singleton
- Machine learning can be used to learn rules for core and non-core elements from previously generated multiple structure alignments.
- Comparison of machine learning techniques.
24Well populated folds
Machine learning
Rules
Sparsely populated folds
Predict core elements
25Machine learning techniques
- C5.0 (decision-tree)
- SVM (support vector machine)
- ILP
- ILP (with additional relational information)
- Representation includes attributes such as
- Relative distance from centre of domain
- Number of contacts with other elements
- Hydrophobicity
- Length of element ...
- ILP (relational) includes extra information such as sequential or spatial adjacency of secondary structure elements
26Core prediction summary
Cross-validated accuracy (%) by fold class
Class    C5.0    SVM     ILP     ILP (relational)
-------------------------------------------------
All-α    79.20   74.80   68.40   66.80
All-β    80.50   82.50   79.00   81.00
α/β      80.40   74.00   65.60   58.80
α+β      80.80   85.60   79.20   78.80
-------------------------------------------------
All      80.26   79.65   73.83   72.61
- Number of elements tested: 1150
- Default accuracy (largest class prediction): 60%
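One way to read this table is to compare, for each method, the spread of accuracy across the four fold classes, and to pick the best method per class. The short Python sketch below does this using the per-class figures from the table. The spread measure (maximum minus minimum) is a choice made for this example, not a statistic reported on the slides.

```python
# Per-class accuracies (%) copied from the summary table.
ACCURACY = {
    "C5.0": {"All-α": 79.20, "All-β": 80.50, "α/β": 80.40, "α+β": 80.80},
    "SVM": {"All-α": 74.80, "All-β": 82.50, "α/β": 74.00, "α+β": 85.60},
    "ILP": {"All-α": 68.40, "All-β": 79.00, "α/β": 65.60, "α+β": 79.20},
    "ILP (relational)": {"All-α": 66.80, "All-β": 81.00, "α/β": 58.80, "α+β": 78.80},
}


def spread(by_class):
    """Range of accuracy across fold classes (max minus min)."""
    return round(max(by_class.values()) - min(by_class.values()), 2)


def best_method(fold_class):
    """Method with the highest accuracy for one fold class."""
    return max(ACCURACY, key=lambda method: ACCURACY[method][fold_class])


SPREADS = {method: spread(by_class) for method, by_class in ACCURACY.items()}

if __name__ == "__main__":
    for method, value in SPREADS.items():
        print(f"{method}: spread {value} percentage points")
    for fold_class in ACCURACY["C5.0"]:
        print(f"{fold_class}: best method {best_method(fold_class)}")
```

C5.0 varies by under two percentage points across classes, while the other methods vary by more than ten.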
27Conclusions - Core prediction
- Can predict core secondary structure elements in the absence of a reliable multiple structure alignment
- ILP worse than SVM and C5.0 for core prediction
- C5.0 cross-validated accuracy comparable to SVM but more consistent over main fold classes
- Does core prediction help ILP learn fold rules for sparsely populated folds?
- No.
- When can it help?
28Folding pathways
- Recent interest in experimentally probing early folding events in proteins
- Phi values, calculated from mutation data, can indicate which parts of a (two-state) protein form the transition state when folding.
- One or more folding nuclei?
- What role do kinetics play in evolution?
- Are early folding sections of a protein conserved in sequence?
- Two schools of thought
- Yes
- No
- Do proteins with similar structures fold in a similar way?
- Are structurally invariant parts of a protein fold related to early folding?
- Correlation between folding in distantly related proteins has been seen
- Two globins with substantially different sequences shown to fold differently
- Applied core prediction to see if there is any correlation between core elements and early folding
29Phi-values
[Energy diagram; labels: U, T, F, WT, Mutant, ΔΔG(T-U), ΔΔG(F-U)]
- Φ = 0: Residue has disordered structure in the transition state
- Φ = 1: Residue has native-like structure in the transition state
- 0 < Φ < 1: Indeterminate; mixture of native/disordered or partially unfolded?
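A phi value is conventionally the ratio of the two free energy changes labelled on the diagram: the effect of a mutation on the transition state relative to the unfolded state, divided by its effect on the folded state relative to the unfolded state. The Python sketch below computes that ratio and applies the three-way reading from this slide. The tolerance `tol` and the energy values in the usage lines are assumptions made for the example.

```python
def phi_value(ddg_t_u, ddg_f_u):
    """Phi = ddG(T-U) / ddG(F-U), both in the same energy units."""
    if ddg_f_u == 0:
        raise ValueError("ddG(F-U) is zero: phi is undefined for this mutation")
    return ddg_t_u / ddg_f_u


def interpret(phi, tol=0.05):
    """Three-way reading of phi; `tol` is an illustrative tolerance."""
    if abs(phi) <= tol:
        return "disordered in the transition state"
    if abs(phi - 1.0) <= tol:
        return "native-like in the transition state"
    if 0.0 < phi < 1.0:
        return "indeterminate"
    return "outside the 0 to 1 range"


if __name__ == "__main__":
    # Hypothetical mutation: destabilises T by 1.0 and F by 2.0 (same units).
    phi = phi_value(1.0, 2.0)
    print(phi, interpret(phi))
```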
30Comparing phi values with core elements - Why use core prediction?
- Ideally, to compare structural invariance across a fold class to folding parameters one needs to generate a multiple structure alignment of unrelated proteins within that class
- Of those few proteins for which experimental phi-values are available, not all have more than 1-2 protein families in the same fold class
- Folding parameters could be compared to predicted core-ness of secondary structure elements for all proteins
- Predicted core-ness of elements compared to average phi-values for that element
31Phi-value test set
Protein   Elems   Phis   Res   Coverage (%)
-------------------------------------------
AcP       7       17     68    25
ACBP      4       19     57    33
CheY      10      25     90    28
CI2       5       23     32    72
FKBP12    6       19     41    46
FNFn10    7       15     45    33
SH3       5       19     23    83
villin    5       16     51    31
WW        2       8      10    80
-------------------------------------------
Total     51      161    417   39
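The Coverage column appears to be the number of phi-values as a percentage of the Res column. That reading is an inference from the numbers, not something stated on the slide. The Python sketch below recomputes it from the table.

```python
# (protein, elements, phi-values, residues, reported coverage %) from the table.
TEST_SET = [
    ("AcP", 7, 17, 68, 25),
    ("ACBP", 4, 19, 57, 33),
    ("CheY", 10, 25, 90, 28),
    ("CI2", 5, 23, 32, 72),
    ("FKBP12", 6, 19, 41, 46),
    ("FNFn10", 7, 15, 45, 33),
    ("SH3", 5, 19, 23, 83),
    ("villin", 5, 16, 51, 31),
    ("WW", 2, 8, 10, 80),
]


def coverage(phis, residues):
    """Phi-values per residue, as a whole-number percentage."""
    return round(100 * phis / residues)


TOTALS = tuple(sum(row[i] for row in TEST_SET) for i in (1, 2, 3))

if __name__ == "__main__":
    for name, _, phis, residues, reported in TEST_SET:
        print(f"{name}: computed {coverage(phis, residues)}, reported {reported}")
    elements, phis, residues = TOTALS
    print(f"Total: {elements} elements, {phis} phi-values, {residues} residues, "
          f"coverage {coverage(phis, residues)}")
```

Every row and the totals reproduce the tabulated values under this reading.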
34Wishful thinking?
- Can we reliably predict relatively early/late folding elements with confident/less-confident predictions of core-ness?
- We can test binary classification via leave-one-out cross validation.
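A minimal sketch of such a test in Python is given below, under several assumptions that the slides do not confirm: one protein is held out at a time, the core-score and phi cutoffs are the medians of the remaining proteins' elements, and a prediction counts as correct when "more likely core" coincides with "earlier folding". The proteins `P1` to `P3` and their values are invented.

```python
from statistics import median


def leave_one_protein_out(data):
    """data maps protein -> list of (core_score, mean_phi) per element.
    Returns the fraction of elements classified consistently per protein."""
    results = {}
    for held_out, test_elements in data.items():
        # Pool the elements of every other protein as the training set.
        train = [
            element
            for name, elements in data.items() if name != held_out
            for element in elements
        ]
        # Assumption: cutoffs are training-set medians.
        core_cut = median(score for score, _ in train)
        phi_cut = median(phi for _, phi in train)
        correct = sum(
            (score >= core_cut) == (phi >= phi_cut)
            for score, phi in test_elements
        )
        results[held_out] = correct / len(test_elements)
    return results


# Invented toy data: (predicted core-ness score, average phi-value).
TOY = {
    "P1": [(95, 0.6), (70, 0.1)],
    "P2": [(90, 0.5), (60, 0.2), (85, 0.1)],
    "P3": [(92, 0.7), (65, 0.05)],
}
RESULTS = leave_one_protein_out(TOY)

if __name__ == "__main__":
    for protein, accuracy in RESULTS.items():
        print(f"{protein}: {accuracy:.2f}")
```

Deriving the cutoffs only from the proteins not being tested keeps the held-out protein from influencing its own classification.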
38Cross-validation summary
protein   elements   acc   expect   recall   prec   phi_cut   c50_cut
---------------------------------------------------------------------
AcP       7          71    57       67       67     0.34      88
ACBP      4          25    50       0        0      0.34      89
CheY      10         50    70       100      38     0.34      89
CI2       5          100   100      0        0      0.34      88
FKBP12    6          83    67       75       100    0.34      89
FNFn10    7          71    57       50       100    0.34      90
SH3       5          80    60       100      75     0.34      89
villin    5          60    60       67       67     0.35      89
WW        2          100   100      100      100    0.34      88
---------------------------------------------------------------------
Overall   51         69    53       71       65
- Statistical significance: χ² = 5.04 (p = 0.02)
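The overall accuracy can be checked against the per-protein rows by converting each percentage back to a count of correctly classified elements and pooling the counts. The Python sketch below does this. Rounding each product to the nearest whole element is an assumption about how the percentages were originally derived.

```python
# (protein, number of elements, accuracy %) from the summary table.
ROWS = [
    ("AcP", 7, 71),
    ("ACBP", 4, 25),
    ("CheY", 10, 50),
    ("CI2", 5, 100),
    ("FKBP12", 6, 83),
    ("FNFn10", 7, 71),
    ("SH3", 5, 80),
    ("villin", 5, 60),
    ("WW", 2, 100),
]


def pooled_accuracy(rows):
    """Recover per-protein correct counts and pool them over all elements."""
    correct = sum(round(n * acc / 100) for _, n, acc in rows)
    total = sum(n for _, n, _ in rows)
    return correct, total, round(100 * correct / total)


POOLED = pooled_accuracy(ROWS)

if __name__ == "__main__":
    correct, total, percent = POOLED
    print(f"{correct} of {total} elements correct ({percent}%)")
```

The pooled figure agrees with the Overall row, which suggests the overall accuracy is computed over all 51 elements and is not an average of the per-protein percentages.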
39AcP
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
40ACBP
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
41CheY
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
42CI2
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
43FKBP12
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
44FNFn10
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
45SH3
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
46villin
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
47WW
[Figure. Experiment: earlier folding to later folding. Core prediction: more likely core to less likely core.]
48Fin
- People
- Mike Sternberg
- Stephen Muggleton
- Funding
- BBSRC