Machine learning and protein structure - PowerPoint PPT Presentation


Provided by: SBG6


Transcript and Presenter's Notes

Title: Machine learning and protein structure

1
Machine learning and protein structure

Describing structural principles underlying fold
space
Core element prediction
Folding pathways

2
Describing structural principles

Several hundred protein folds already known
Proteins distributed unevenly amongst different
folds
Many proteins adopt one of a limited number of
fold types (superfolds)
Most folds are adopted by only a small number of
proteins
Understanding this requires knowledge of protein
structure principles in the wider context of
folding, function and evolution
Number of protein structures expected to increase
with structural genomics projects
The total number of folds in biota is predicted
to be between 1000 and 10000.
Automated methods of analysis needed
Can machine learning strategies be used to learn
and describe the structural principles
underpinning fold space?

3
Protein structure classification

Several classification schemes currently exist
SCOP (manual)
CATH (semi-automatic)
FSSP (automatic)
Largely similar in their assignments, but with
significant differences
Classification alone does not explain why some
folds are more prevalent than others
Need to understand how folds differ in terms of
fundamental structural properties
Folds are usually described in terms of spatial
and topological arrangements of secondary
structure elements
SCOP includes detailed (manual) descriptions of
folds

4
SCOP

All-beta class
Immunoglobulin-like beta-sandwich (13)
sandwich; 7 strands in 2 sheets; greek-key; some
members of the fold have additional strands
Common fold of diphtheria toxin/transcription
factors/cytochrome f (7)
sandwich; 9 strands in 2 sheets; greek-key;
subclass of immunoglobulin-like fold
Prealbumin-like (4)
sandwich; 7 strands in 2 sheets, greek-key;
variations: some members have additional 1-2
strands to common fold
alpha-Amylase inhibitor tendamistat (1)
sandwich; 6 strands in 2 sheets
Cupredoxins (1)
sandwich; 7 strands in 2 sheets, greek-key;
variations: some members have additional 1-2
strands
C2 domain-like (4)
sandwich; 8 strands in 2 sheets; greek-key
TRAF domain (1)
sandwich; 8 strands in 2 sheets; greek-key

5
SCOP

a+b class (preliminary)
Sulfite oxidase, middle catalytic domain (1)
unusual fold; contains 3 layers of beta-sheet
structure and a beta-grasp like motif
Fumarylacetoacetate hydrolase, FAH (1)
unusual fold; contains 3 layers of beta-sheet
structure and a SH3-like motif
Aromatic aminoacid monoxygenases, catalytic and
oligomerization domains (1)
unusual fold
Substrate-binding domain of HMG-CoA reductase (1)
unusual fold
Conserved core of transcriptional regulatory
protein vp16 (1)
unusual fold
Insert subdomain of RNA polymerase alpha subunit
N-terminal domain (1)
unusual fold; contains a left-handed
beta-alpha-beta unit
Baseplate structural protein gp11 (1)
unusual fold; trimer
Non-globular alpha+beta subunits of globular
proteins (1)

6
Wish list

We would like to generate expert-like rules
Automatically
Objectively
That are easily interpreted
That discriminate between folds

7
Learning Rules with Inductive Logic Programming
(ILP)

Rules were learnt using the Progol-4.4 ILP
system.
Progol learns rules automatically from examples
and background knowledge.
Progol learns rules in such a way as to maximise
the coverage of positive examples and minimise
the coverage of negative examples.
Weighted to favour shorter rules given similar
coverage
Examples, background knowledge and the final
rules are represented as logic programs
Can incorporate relational data
Rules can be interpreted easily

8
Global v local features

Previous ILP study could only learn local fold
features
E.g. rules for the Rossmann fold
Previous ILP rule (local): The 1st strand is
followed by a helix; the two elements are
separated by a coil of about one residue. The 6th
strand is followed by a helix.
SCOP (global): core: 3 layers, a/b/a; parallel
beta-sheet of 6 strands, order 321456
Insertions and deletions inhibit learning of
global features: it is difficult to identify
structurally equivalent (core) secondary
structure elements
We avoid this problem by using multiple structure
alignments

10
Multiple alignment -> Core elements
11
Background knowledge

Background knowledge is defined in terms of the
properties of, and relationships between, core
elements

number_helices(Lo < D < Hi)
sheet(D, A, Stype)
helix(D, B, Htype, Core)
strand_position(A, B, N)
adjacent(B, C)
coil(B, C, N)
contact(B, C)
antiparallel(B, C)
parallel(B, C)

end_strand_distance(A, B, C, Dist)
pair(B, C, Bloc, Cloc)
helix_angle(B, C, Angle)
has_n_strands(A, N)
barrel(A)
bifurcated(A)
sheet_top_X(A, N1, N2, ..., NX)
contains(B, AA, Loc)
contains(B, AA)

12
ILP run

Rules are constructed from examples and background
knowledge
Progol builds steadily more specific rules,
optimising compression f
f = p - n - c, where
p = positive examples covered
n = negative examples covered
c = length of rule
fold(A,'Rossmann-fold').
fold(A,'Rossmann-fold') :- sheet(A,B,para).
fold(A,'Rossmann-fold') :- sheet(A,B,para),
    has_n_strands(B,6).
...
fold(A,'Rossmann-fold') :- sheet(A,B,para),
    helix(A,C,h,g), helix(A,D,h,i),
    helix_angle(C,D,para),
    sheet_top_6(B,3,2,1,4,5,6).
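
The search above can be illustrated with a small Python sketch. It reads the compression score as f = p - n - c (positives covered, minus negatives covered, minus rule length), which is one interpretation of the slide and a simplification of what Progol actually computes. The coverage counts are invented for illustration and are not results from the study.

```python
def compression(p, n, c):
    """Simplified compression: positives covered - negatives covered - rule length."""
    return p - n - c


# (rule, positives covered, negatives covered, rule length in literals).
# All counts are hypothetical, chosen only to show the general-to-specific trend.
CANDIDATES = [
    ("fold(A,'Rossmann-fold').", 20, 200, 1),
    ("... :- sheet(A,B,para).", 20, 60, 2),
    ("... :- sheet(A,B,para), has_n_strands(B,6).", 19, 15, 3),
    ("... :- sheet, helix g, helix i, helix_angle, sheet_top_6.", 18, 0, 6),
]


def best_rule(candidates):
    """Return the candidate (rule, score) with the highest compression."""
    scored = [(rule, compression(p, n, c)) for rule, p, n, c in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for rule, p, n, c in CANDIDATES:
        print(f"{compression(p, n, c):5d}  {rule}")
    print("best:", best_rule(CANDIDATES)[0])
```

With these made-up counts the most specific rule wins: it gives up two positive examples but excludes every negative, which outweighs the penalty for its extra literals.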

13
Fold Rules

Rules were learnt for 45 of the most popular
folds in the SCOP database.
Rules learnt were compared to descriptions given
by SCOP.
Cross-validation testing.

14
Rossmann fold
Has between 3 and 4 helices.
Has alpha-helix B at core position "b".
B contains a glycine in both its middle and
N-terminal regions.
OR
Has a parallel sheet B of six strands with
topology 321456.
Has alpha-helices C and D at core positions "g"
and "i" respectively.
C and D are in contact and parallel.
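
To make the two-clause structure of this rule concrete, here is a minimal Python sketch that tests a domain against it. The dictionary layout (`n_helices`, `helices`, `sheets`, `parallel_contacts`) is a hypothetical representation invented for this example; the study itself encodes domains as logic-program facts.

```python
ROSSMANN_TOPOLOGY = (3, 2, 1, 4, 5, 6)


def matches_rossmann(domain):
    """Return True if either clause of the learnt Rossmann rule holds."""
    helices = domain.get("helices", {})  # core position -> helix record

    # Clause 1: 3-4 helices, and the helix at core position "b" has a
    # glycine in both its middle and N-terminal regions.
    b = helices.get("b")
    clause_1 = (
        3 <= domain.get("n_helices", 0) <= 4
        and b is not None
        and {"middle", "n_term"} <= b["glycine_regions"]
    )

    # Clause 2: a parallel six-stranded sheet with topology 321456, plus
    # helices at core positions "g" and "i" that are in contact and parallel.
    has_sheet = any(
        s["type"] == "parallel" and tuple(s["topology"]) == ROSSMANN_TOPOLOGY
        for s in domain.get("sheets", [])
    )
    clause_2 = (
        has_sheet
        and "g" in helices
        and "i" in helices
        and frozenset(("g", "i")) in domain.get("parallel_contacts", set())
    )
    return clause_1 or clause_2


# Hypothetical domain that satisfies the second clause only.
EXAMPLE = {
    "n_helices": 5,
    "helices": {"g": {"glycine_regions": set()}, "i": {"glycine_regions": set()}},
    "sheets": [{"type": "parallel", "topology": [3, 2, 1, 4, 5, 6]}],
    "parallel_contacts": {frozenset(("g", "i"))},
}
```

Calling `matches_rossmann(EXAMPLE)` returns `True` through the second clause, while an empty domain `{}` matches neither clause.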
15
SCOP
ILP (old)
ILP (new)

Global property (sheet topology) learnt
Part of a conserved G-X-G-X-X-G sequence motif
involved in nucleotide binding

16
Immunoglobulin fold
Has antiparallel sheets B and C.
B has 3 strands, topology 123.
C has 4 strands, topology 2134.
17
Immunoglobulin
SCOP: sandwich; 7 strands in 2 sheets; greek-key;
some members of the fold have additional strands
ILP: Has antiparallel sheets B and C. B has 3
strands, topology 123. C has 4 strands, topology
2134.
Prealbumin
SCOP: sandwich; 7 strands in 2 sheets,
greek-key; variations: some members have
additional 1-2 strands to common fold
ILP: Has a mixed sheet B. B has 3 strands with
topology 213.

Can learn rules to discriminate between folds
with similar SCOP descriptions

18
TIM barrel
Has between 5 and 9 helices Has a parallel sheet
of 8 strands.
19

SCOP: contains parallel beta-sheet barrel, closed;
n=8, S=8; strand order 12345678; the first
six superfamilies have similar phosphate-binding
sites
ILP: Has between 5 and 9 helices. Has a parallel
sheet of 8 strands.
ILP does not give as many details as SCOP
Fails to find topology or barrel structure
Many TIM barrels are not closed
Not necessary to include many details as few
alternative folds have 8 core strands in a
parallel sheet

20
Cross-validation

Each of the 45 folds was subject to a 5-fold
cross-validation procedure.
The overall accuracy was quite high (97%),
although this is, in part, due to the large
number of negative examples used. A large number
of negative examples were included in the
learning procedure to limit the learning of
spurious rules. A largest class prediction (that
is, predicting that every example was NOT an
example of the fold of interest) would give an
accuracy of 95%.
The overall accuracy was found to be
statistically significant (Pearson's χ² = 58.5,
p << 0.01), where the reference state was based on
such a largest class prediction.
The overall precision and recall were 77% and 55%
respectively. The recall was relatively low for
those folds with few examples, largely due to the
difficulties in producing stable multiple
structure alignments.
For the 10 best represented folds examined here,
the precision and recall were 83% and 69%
respectively.
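
The point about negative examples inflating accuracy can be seen in a short worked example. The confusion-matrix counts below are hypothetical, chosen only so that negatives make up 95% of the data. They are not the counts from the study.

```python
def summarise(tp, fp, fn, tn):
    """Accuracy, largest-class baseline, precision and recall from raw counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Baseline: predict "NOT this fold" for every example, so every
        # true negative example is right and every positive is wrong.
        "baseline": (fp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts: 40 positives and 760 negatives (95% negative).
STATS = summarise(tp=22, fp=7, fn=18, tn=753)

if __name__ == "__main__":
    for name, value in STATS.items():
        print(f"{name:9s} {100 * value:5.1f}%")
```

Here accuracy is only about two points above the baseline even though nearly half the positives are missed, which is why precision and recall are reported alongside it.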

21
Conclusions Fold rules

Protein structure principles underpinning much of
fold space can be automatically described using
ILP.
The rules are objective and discriminate between
different fold classes.
The rules learnt are readily interpretable to
human protein structure experts.
The rules learnt here describe global fold
properties as well as local features.
The rules learnt by ILP are, in many cases,
comparable to expert principles given in the SCOP
database.
The procedure applied here to the manual SCOP
fold classification could also be applied to any
other protein structure classification scheme.
Used in conjunction with a fully automatic
structure classification (such as DALI), ILP
could be used to derive the principles underlying
fold space from coordinates in an automatic and
objective fashion.
Given the increasing emphasis on high-throughput
experimental projects in biology, automatic
methods of analysis such as ILP are going to
become increasingly important in deriving
principles from data.

22
Problems

Difficult to learn rules for sparsely populated
folds
Fewer examples to learn from
Difficult to validate rules
Multiple structure alignments less reliable
Can we predict core secondary structure elements
without multiple structure alignments?

23
Core element prediction

Can we predict structurally variable secondary
structure elements in the absence of a reliable
multiple structure alignment? E.g. for singletons
Machine learning can be used to learn rules for
core and non-core elements from previously
generated multiple structure alignments.
Comparison of machine learning techniques.

24
Well populated folds
Machine learning
Rules
Sparsely populated folds
Predict core elements
25
Machine learning techniques

C5.0 (decision-tree)
SVM (support vector machine)
ILP
ILP (with additional relational information)
Representation includes attributes such as
Relative distance from centre of domain
Number of contacts with other elements
Hydrophobicity
Length of element, etc.
ILP (relational) includes extra information such
as sequential or spatial adjacency of secondary
structure elements

26
Core prediction summary

Class        C5.0     SVM      ILP      ILP (relational)
--------------------------------------------------------
All-alpha    79.20    74.80    68.40    66.80
All-beta     80.50    82.50    79.00    81.00
a/b          80.40    74.00    65.60    58.80
a+b          80.80    85.60    79.20    78.80
--------------------------------------------------------
All          80.26    79.65    73.83    72.61
Values are accuracy (%)
Number of elements tested: 1150
Default accuracy (largest class prediction): 60%

27
Conclusions - Core prediction

Can predict core secondary structure elements in
the absence of a reliable multiple structure
alignment
ILP worse than SVM and C5.0 for core prediction
C5.0 cross-validated accuracy comparable to SVM
but more consistent over main fold classes
Does core prediction help ILP learn fold rules
for sparsely populated folds?
No.
When can it help?

28
Folding pathways

Recent interest in experimentally probing early
folding events in proteins
Phi values, calculated from mutation data, can
indicate which parts of a (two-state) protein
form the transition state when folding.
One or more folding nuclei?
What role do kinetics play in evolution?
Are early folding sections of a protein conserved
in sequence?
Two schools of thought
Yes
No
Do proteins with similar structures fold in a
similar way?
Are structurally invariant parts of a protein
fold related to early folding?
Correlation between folding in distantly related
proteins has been seen
Two globins with substantially different
sequences shown to fold differently
Applied core prediction to see if there is any
correlation between core elements and early
folding

29
Phi-values
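[Figure: free-energy diagram for wild type (WT) and
mutant, showing the unfolded (U), transition (T)
and folded (F) states, with the energy differences
ΔΔG(T-U) and ΔΔG(F-U)]

Φ = 0: Residue has disordered structure
in the transition state
Φ = 1: Residue has native-like structure
in the transition state
0 < Φ < 1: Indeterminate; mixture of
native/disordered or partially unfolded?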
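
The quantities labelled on this slide combine in the standard definition of a phi-value: the ratio of the mutation-induced change in transition-state stability, ΔΔG(T-U), to the change in native-state stability, ΔΔG(F-U). A minimal Python sketch follows. The tolerance used to call a value "close to" 0 or 1 is an arbitrary assumption for illustration, not a threshold from the presentation.

```python
def phi_value(ddg_t_u, ddg_f_u):
    """Phi = ddG(T-U) / ddG(F-U) for one mutation (both in the same units)."""
    if ddg_f_u == 0:
        raise ValueError("ddG(F-U) is zero: phi is undefined for this mutation")
    return ddg_t_u / ddg_f_u


def interpret(phi, tol=0.1):
    """Map a phi-value to the three cases on the slide (tol is an assumption)."""
    if abs(phi) <= tol:
        return "disordered in the transition state"
    if abs(phi - 1.0) <= tol:
        return "native-like in the transition state"
    return "indeterminate"


if __name__ == "__main__":
    # Hypothetical energy changes in kcal/mol.
    for ddg_t, ddg_f in [(0.0, 1.5), (1.5, 1.5), (0.6, 1.5)]:
        phi = phi_value(ddg_t, ddg_f)
        print(f"phi = {phi:.2f}: {interpret(phi)}")
```

The guard against a zero denominator matters in practice, because mutations that barely change native-state stability give unreliable ratios.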

30
Comparing phi values with core elements - Why use
core prediction?

Ideally, to compare structural invariance across
a fold class to folding parameters one needs to
generate a multiple structure alignment of
unrelated proteins within that class
Of those few proteins for which experimental
phi-values are available, not all have more than
1-2 protein families in the same fold class
Folding parameters could be compared to predicted
core-ness of secondary structure elements for
all proteins
Predicted core-ness of elements compared to
average phi-values for that element

31
Phi-value test set

Protein   Elems   Phis   Res   Coverage (%)
-------------------------------------------
AcP         7      17     68       25
ACBP        4      19     57       33
CheY       10      25     90       28
CI2         5      23     32       72
FKBP12      6      19     41       46
FNFn10      7      15     45       33
SH3         5      19     23       83
villin      5      16     51       31
WW          2       8     10       80
-------------------------------------------
Total      51     161    417       39
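
The Coverage column appears to be the number of phi-values divided by the number of residues, expressed as a percentage. That reading is an inference from the numbers rather than something the slide states, and the sketch below checks it against every row of the table.

```python
# protein: (elements, phi-values, residues, coverage % as given on the slide)
TEST_SET = {
    "AcP": (7, 17, 68, 25),
    "ACBP": (4, 19, 57, 33),
    "CheY": (10, 25, 90, 28),
    "CI2": (5, 23, 32, 72),
    "FKBP12": (6, 19, 41, 46),
    "FNFn10": (7, 15, 45, 33),
    "SH3": (5, 19, 23, 83),
    "villin": (5, 16, 51, 31),
    "WW": (2, 8, 10, 80),
}


def coverage_percent(phis, residues):
    """Assumed definition: phi-values per residue, as a rounded percentage."""
    return round(100 * phis / residues)


def totals(test_set):
    """Column sums for elements, phi-values and residues."""
    elems = sum(row[0] for row in test_set.values())
    phis = sum(row[1] for row in test_set.values())
    residues = sum(row[2] for row in test_set.values())
    return elems, phis, residues


if __name__ == "__main__":
    for name, (_, phis, residues, given) in TEST_SET.items():
        print(f"{name:7s} computed {coverage_percent(phis, residues):3d}  given {given:3d}")
    print("totals:", totals(TEST_SET))
```

Every row and the Total line agree with this definition, which supports reading "Res" as the residue count over which phi-values could have been measured.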

34
Wishful thinking?

Can we reliably predict relatively early/late
folding elements with confident/less-confident
predictions of core-ness?
We can test binary classification via
leave-one-out cross validation.
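
One way such a test could be set up is sketched below in Python, assuming that one protein is held out at a time, that a phi cut-off splits elements into earlier and later folding, and that a confidence cut-off is chosen on the remaining proteins. All three are assumptions, and the data values and the fixed `PHI_CUT` are invented for illustration.

```python
# Hypothetical (protein, predicted core confidence %, mean phi-value) triples.
ELEMENTS = [
    ("P1", 95, 0.60), ("P1", 80, 0.10), ("P1", 92, 0.45),
    ("P2", 85, 0.20), ("P2", 97, 0.70),
    ("P3", 90, 0.40), ("P3", 70, 0.05), ("P3", 88, 0.30),
]
PHI_CUT = 0.34  # assumed split between "earlier" and "later" folding


def best_cut(train):
    """Choose the confidence cut-off that best separates early from late elements."""
    candidates = sorted({conf for _, conf, _ in train})

    def agreement(cut):
        return sum((conf >= cut) == (phi >= PHI_CUT) for _, conf, phi in train)

    return max(candidates, key=agreement)


def leave_one_protein_out(elements):
    """Hold out each protein in turn; return (correct, total) on held-out elements."""
    correct = total = 0
    for held_out in sorted({name for name, _, _ in elements}):
        train = [e for e in elements if e[0] != held_out]
        test = [e for e in elements if e[0] == held_out]
        cut = best_cut(train)  # the cut-off never sees the held-out protein
        for _, conf, phi in test:
            correct += (conf >= cut) == (phi >= PHI_CUT)
            total += 1
    return correct, total


if __name__ == "__main__":
    correct, total = leave_one_protein_out(ELEMENTS)
    print(f"{correct}/{total} held-out elements classified correctly")
```

Holding out whole proteins rather than single elements keeps elements from the same structure out of both the training and test sides of a fold.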

38
Cross-validation summary

protein   elements   acc   expect   recall   prec   phi_cut   c50_cut
---------------------------------------------------------------------
AcP           7       71     57       67      67     0.34       88
ACBP          4       25     50        0       0     0.34       89
CheY         10       50     70      100      38     0.34       89
CI2           5      100    100        0       0     0.34       88
FKBP12        6       83     67       75     100     0.34       89
FNFn10        7       71     57       50     100     0.34       90
SH3           5       80     60      100      75     0.34       89
villin        5       60     60       67      67     0.35       89
WW            2      100    100      100     100     0.34       88
---------------------------------------------------------------------
Overall      51       69     53       71      65
Statistical significance: χ² = 5.04 (p = 0.02)
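
The overall figures can be cross-checked from the per-protein rows. The sketch below recovers the number of correctly classified elements from each accuracy, then compares correct and incorrect counts against a largest-class reference using a one-degree-of-freedom chi-squared test. How the original statistic was computed is not stated on the slide, so treating 53% of 51 elements as the reference is an assumption.

```python
import math

# protein: (elements, accuracy % as given on the slide)
RESULTS = {
    "AcP": (7, 71), "ACBP": (4, 25), "CheY": (10, 50), "CI2": (5, 100),
    "FKBP12": (6, 83), "FNFn10": (7, 71), "SH3": (5, 80), "villin": (5, 60),
    "WW": (2, 100),
}


def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


TOTAL = sum(n for n, _ in RESULTS.values())
CORRECT = sum(round(n * acc / 100) for n, acc in RESULTS.values())
OVERALL_ACC = round(100 * CORRECT / TOTAL)

# Assumed reference: the "expect" value of 53% applied to all elements.
BASELINE_CORRECT = round(0.53 * TOTAL)
CHI2 = chi_squared(
    [CORRECT, TOTAL - CORRECT],
    [BASELINE_CORRECT, TOTAL - BASELINE_CORRECT],
)
# For one degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
P_VALUE = math.erfc(math.sqrt(CHI2 / 2))

if __name__ == "__main__":
    print(f"{CORRECT}/{TOTAL} correct, overall accuracy {OVERALL_ACC}%")
    print(f"chi-squared = {CHI2:.2f}, p = {P_VALUE:.3f}")
```

Under these assumptions the rows give 35 of 51 elements correct (69%) and a chi-squared of about 5.04, matching the Overall line, with a p-value between 0.02 and 0.03.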

39-47
AcP, ACBP, CheY, CI2, FKBP12, FNFn10, SH3, villin, WW
[Figures, one slide per protein: Experiment
(earlier folding / later folding) compared with
Core Prediction (more likely core / less likely
core)]
48
Fin