Machine learning and protein structure

Provided by: SBG6
1
Machine learning and protein structure
  • Describing structural principles underlying fold
    space
  • Core element prediction
  • Folding pathways

2
Describing structural principles
  • Several hundred protein folds already known
  • Proteins distributed unevenly amongst different
    folds
  • Many proteins adopt one of a limited number of
    fold types (superfolds)
  • Most folds are adopted by only a small number of
    proteins
  • Understanding this requires knowledge of protein
    structure principles in the wider context of
    folding, function and evolution
  • Number of protein structures expected to increase
    with structural genomics projects
  • The total number of folds in biota is predicted
    to be between 1000 and 10000.
  • Automated methods of analysis needed
  • Can machine learning strategies be used to learn
    and describe the structural principles
    underpinning fold space?

3
Protein structure classification
  • Several classification schemes currently exist
  • SCOP (manual)
  • CATH (semi-automatic)
  • FSSP (automatic)
  • Largely similar in their assignments, but with
    significant differences
  • Classification alone does not explain why some
    folds are more prevalent than others
  • Need to understand how folds differ in terms of
    fundamental structural properties
  • Folds are usually described in terms of spatial
    and topological arrangements of secondary
    structure elements
  • SCOP includes detailed (manual) descriptions of
    folds

4
SCOP
  • All-β class
  • Immunoglobulin-like beta-sandwich (13):
    sandwich; 7 strands in 2 sheets; greek-key; some
    members of the fold have additional strands
  • Common fold of diphtheria toxin/transcription
    factors/cytochrome f (7): sandwich; 9 strands
    in 2 sheets; greek-key; subclass of
    immunoglobulin-like fold
  • Prealbumin-like (4): sandwich; 7 strands in 2
    sheets, greek-key; variations: some members have
    additional 1-2 strands to common fold
  • alpha-Amylase inhibitor tendamistat (1):
    sandwich; 6 strands in 2 sheets
  • Cupredoxins (1): sandwich; 7 strands in 2
    sheets, greek-key; variations: some members have
    additional 1-2 strands
  • C2 domain-like (4): sandwich; 8 strands in 2
    sheets; greek-key
  • TRAF domain (1): sandwich; 8 strands in 2
    sheets; greek-key

5
SCOP
  • α+β class (preliminary)
  • Sulfite oxidase, middle catalytic domain (1):
    unusual fold; contains 3 layers of beta-sheet
    structure and a beta-grasp like motif
  • Fumarylacetoacetate hydrolase, FAH (1):
    unusual fold; contains 3 layers of beta-sheet
    structure and an SH3-like motif
  • Aromatic amino acid monooxygenases, catalytic and
    oligomerization domains (1): unusual fold
  • Substrate-binding domain of HMG-CoA reductase (1):
    unusual fold
  • Conserved core of transcriptional regulatory
    protein vp16 (1): unusual fold
  • Insert subdomain of RNA polymerase alpha subunit
    N-terminal domain (1): unusual fold; contains
    a left-handed beta-alpha-beta unit
  • Baseplate structural protein gp11 (1): unusual
    fold; trimer
  • Non-globular alpha+beta subunits of globular
    proteins (1)

6
Wish list
  • We would like to generate expert-like rules
  • Automatically
  • Objectively
  • That are easily interpreted
  • That discriminate between folds

7
Learning Rules with Inductive Logic Programming
(ILP)
  • Rules were learnt using the Progol-4.4 ILP
    system.
  • Progol learns rules automatically from examples
    and background knowledge.
  • Progol learns rules in such a way as to maximise
    the coverage of positive examples and minimise
    the coverage of negative examples.
  • Weighted to favour shorter rules given similar
    coverage
  • Examples, background knowledge and the final
    rules are represented as logic programs
  • Can incorporate relational data
  • Rules can be interpreted easily

8
Global v local features
  • Previous ILP study could only learn local fold
    features
  • E.g. rules for the Rossmann fold
  • Previous ILP rule (local): The 1st strand is
    followed by a helix; the two elements are
    separated by a coil of about one residue. The 6th
    strand is followed by a helix.
  • SCOP (global): core: 3 layers, α/β/α; parallel
    beta-sheet of 6 strands, order 321456
  • Insertions and deletions inhibit learning of
    global features: it is difficult to identify
    structurally equivalent (core) secondary
    structure elements
  • We avoid this problem by using multiple structure
    alignments

9

10
Multiple alignment → Core elements
11
Background knowledge
  • Background knowledge is defined in terms of the
    properties of, and relationships between, core
    elements
  • number_helices(Lo < D < Hi)
  • sheet(D, A, Stype)
  • helix(D, B, Htype, Core)
  • strand_position(A, B, N)
  • adjacent(B, C)
  • coil(B, C, N)
  • contact(B, C)
  • antiparallel(B, C)
  • parallel(B, C)
  • end_strand_distance(A, B, C, Dist)
  • pair(B, C, Bloc, Cloc)
  • helix_angle(B, C, Angle)
  • has_n_strands(A, N)
  • barrel(A)
  • bifurcated(A)
  • sheet_top_X(A, N1, N2, ..., NX)
  • contains(B, AA, Loc)
  • contains(B, AA)

12
ILP run
  • Rules are constructed from an example's background
    knowledge
  • Progol builds steadily more specific rules,
    optimising compression f
  • f = p - n - c, where
    p = positive examples covered
    n = negative examples covered
    c = length of rule
  • fold(A,'Rossmann-fold').
  • fold(A,'Rossmann-fold') :- sheet(A,B,para).
  • fold(A,'Rossmann-fold') :- sheet(A,B,para),
    has_n_strands(B,6).
  • ...
  • fold(A,'Rossmann-fold') :- sheet(A,B,para),
    helix(A,C,h,g), helix(A,D,h,i),
    helix_angle(C,D,para),
    sheet_top_6(B,3,2,1,4,5,6).
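
As a rough illustration of how such a score trades coverage against rule length, the sketch below assumes the compression measure has the simple form f = p - n - c, with p, n and c as defined on this slide. The coverage counts are invented, and the actual Progol search heuristic may include further terms.

```python
def compression(p, n, c):
    """Score f = p - n - c: positives covered, minus negatives covered,
    minus rule length (assumed form, see the text above)."""
    return p - n - c


# (rule body, p, n, c) for increasingly specific rules; counts are hypothetical.
candidates = [
    ("sheet(A,B,para)", 20, 30, 1),
    ("sheet(A,B,para), has_n_strands(B,6)", 18, 6, 2),
    ("sheet(A,B,para), helix(A,C,h,g), helix(A,D,h,i), "
     "helix_angle(C,D,para), sheet_top_6(B,3,2,1,4,5,6)", 17, 0, 5),
]

for body, p, n, c in candidates:
    print(f"f = {compression(p, n, c):4d}  fold(A,'Rossmann-fold') :- {body}.")

# The rule with the highest compression is preferred.
best = max(candidates, key=lambda rule: compression(*rule[1:]))
```

With these made-up counts the most specific rule wins because it excludes every negative example while losing few positives, which outweighs its extra length.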

13
Fold Rules
  • Rules were learnt for 45 of the most popular
    folds in the SCOP database.
  • Rules learnt were compared to descriptions given
    by SCOP.
  • Cross-validation testing.

14
Rossmann fold
Has between 3 and 4 helices. Has α-helix B at
core position "b". B contains a glycine in both
its middle and N-terminal regions.
OR
Has a parallel sheet B of six strands with
topology 321456. Has α-helices C and D at core
positions "g" and "i" respectively. C and D are
in contact and parallel.
15
SCOP
ILP (old)
ILP (new)
  • Global property (sheet topology) learnt
  • Part of a conserved G-X-G-X-X-G sequence motif
    involved in nucleotide binding

16
Immunoglobulin fold
Has antiparallel sheets B and C. B has 3
strands, topology 123. C has 4 strands, topology
2134.
17
Immunoglobulin
SCOP: sandwich; 7 strands in 2 sheets; greek-key;
some members of the fold have additional strands
ILP: Has antiparallel sheets B and C. B has 3
strands, topology 123. C has 4 strands, topology
2134.
Prealbumin
SCOP: sandwich; 7 strands in 2 sheets, greek-key;
variations: some members have additional 1-2
strands to common fold
ILP: Has a mixed sheet B. B has 3 strands with
topology 213.
  • Can learn rules to discriminate between folds
    with similar SCOP descriptions

18
TIM barrel
Has between 5 and 9 helices. Has a parallel sheet
of 8 strands.
19
  • SCOP: contains parallel beta-sheet barrel, closed;
    n=8, S=8; strand order 12345678; the first
    six superfamilies have similar phosphate-binding
    sites
  • ILP: Has between 5 and 9 helices. Has a parallel
    sheet of 8 strands.
  • ILP does not give as many details as SCOP
  • Fails to find topology or barrel structure
  • Many TIM barrels are not closed
  • Not necessary to include many details as few
    alternative folds have 8 core strands in a
    parallel sheet

20
Cross-validation
  • Each of the 45 folds was subject to a 5-fold
    cross-validation procedure.
  • The overall accuracy was quite high (97%),
    although this is, in part, due to the large
    number of negative examples used. A large number
    of negative examples were included in the
    learning procedure to limit the learning of
    spurious rules. A largest class prediction (that
    is, predicting that every example was NOT an
    example of the fold of interest) would give an
    accuracy of 95%.
  • The overall accuracy was found to be
    statistically significant (Pearson's χ² = 58.5,
    p << 0.01) where the reference state was based on
    such a largest class prediction.
  • The overall precision and recall were 77% and 55%
    respectively. The recall was relatively low for
    those folds with few examples, largely due to the
    difficulties in producing stable multiple
    structure alignments.
  • For the 10 best represented folds examined here,
    the precision and recall were 83% and 69%
    respectively.
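
To see why a high accuracy can coexist with a modest recall when negatives dominate, the sketch below computes the usual measures from a confusion matrix. The counts are hypothetical, chosen only so that the accuracy and the largest-class baseline land near the figures quoted above; they are not the study's data.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and the largest-class baseline
    for a binary confusion matrix."""
    total = tp + fp + fn + tn
    positives, negatives = tp + fn, fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Always predicting the more common class (here "not this fold").
        "baseline": max(positives, negatives) / total,
    }


# Hypothetical counts: 50 members of the fold among 1000 examples.
result = metrics(tp=30, fp=10, fn=20, tn=940)
for name, value in result.items():
    print(f"{name:9s} {value:.2f}")
```

Here the accuracy exceeds the baseline by only two percentage points, so precision and recall are the more informative measures of how well the rules pick out members of the fold.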

21
Conclusions - Fold rules
  • Protein structure principles underpinning much of
    fold space can be automatically described using
    ILP.
  • The rules are objective and discriminate between
    different fold classes.
  • The rules learnt are readily interpretable to
    human protein structure experts.
  • The rules learnt here describe global fold
    properties as well as local features.
  • The rules learnt by ILP are, in many cases,
    comparable to expert principles given in the SCOP
    database.
  • The procedure applied here to the manual SCOP
    fold classification could also be applied to any
    other protein structure classification scheme.
    Used in conjunction with a fully automatic
    structure classification (such as DALI), ILP
    could be used to derive the principles underlying
    fold space from coordinates in an automatic and
    objective fashion.
  • Given the increasing emphasis on high-throughput
    experimental projects in biology, automatic
    methods of analysis such as ILP are going to
    become increasingly important in deriving
    principles from data.

22
Problems
  • Difficult to learn rules for sparsely populated
    folds
  • Fewer examples to learn from
  • Difficult to validate rules
  • Multiple structure alignments less reliable
  • Can we predict core secondary structure elements
    without multiple structure alignments?

23
Core element prediction
  • Can we predict structurally variable secondary
    structure elements in the absence of a reliable
    multiple structure alignment? E.g. singletons
  • Machine learning can be used to learn rules for
    core and non-core elements from previously
    generated multiple structure alignments.
  • Comparison of machine learning techniques.

24
Well populated folds
Machine learning
Rules
Sparsely populated folds
Predict core elements
25
Machine learning techniques
  • C5.0 (decision-tree)
  • SVM (support vector machine)
  • ILP
  • ILP (with additional relational information)
  • Representation includes attributes such as
  • Relative distance from centre of domain
  • Number of contacts with other elements
  • Hydrophobicity
  • Length of element ...
  • ILP (relational) includes extra information such
    as sequential or spatial adjacency of secondary
    structure elements

26
Core prediction summary
  Accuracy (%)
  Class    C5.0     SVM      ILP      ILP (relational)
  ----------------------------------------------------
  All-α    79.20    74.80    68.40    66.80
  All-β    80.50    82.50    79.00    81.00
  α/β      80.40    74.00    65.60    58.80
  α+β      80.80    85.60    79.20    78.80
  ----------------------------------------------------
  All      80.26    79.65    73.83    72.61
  • Number of elements tested: 1150
  • Default accuracy (largest class prediction): 60%

27
Conclusions - Core prediction
  • Can predict core secondary structure elements in
    the absence of a reliable multiple structure
    alignment
  • ILP worse than SVM and C5.0 for core prediction
  • C5.0 cross-validated accuracy comparable to SVM
    but more consistent over main fold classes
  • Does core prediction help ILP learn fold rules
    for sparsely populated folds?
  • No.
  • When can it help?

28
Folding pathways
  • Recent interest in experimentally probing early
    folding events in proteins
  • Phi values, calculated from mutation data, can
    indicate which parts of a (two-state) protein
    form the transition state when folding.
  • One or more folding nuclei?
  • What role do kinetics play in evolution?
  • Are early folding sections of a protein conserved
    in sequence?
  • Two schools of thought
  • Yes
  • No
  • Do proteins with similar structures fold in a
    similar way?
  • Are structurally invariant parts of a protein
    fold related to early folding?
  • Correlation between folding in distantly related
    proteins has been seen
  • Two globins with substantially different
    sequences shown to fold differently
  • Applied core prediction to see if there is any
    correlation between core elements and early
    folding

29
Phi-values
[Figure: energy diagram with states U, T and F for WT and Mutant, annotated with ΔΔG(T-U) and ΔΔG(F-U)]
  • Φ = 0: Residue has disordered structure
    in the transition state
  • Φ = 1: Residue has native-like structure
    in the transition state
  • 0 < Φ < 1: Indeterminate; mixture of
    native/disordered, or partially unfolded?
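
A phi-value is conventionally defined as the ratio of the mutation-induced free energy change of the transition state to that of the folded state, both measured relative to the unfolded state: Φ = ΔΔG(T-U) / ΔΔG(F-U). The sketch below encodes that ratio together with the three-way reading given on this slide. The function names, the tolerance `tol` and the example energies are hypothetical choices made for illustration.

```python
def phi_value(ddg_t_u, ddg_f_u):
    """Phi = ddG(T-U) / ddG(F-U), both in the same energy units."""
    if ddg_f_u == 0:
        raise ValueError("ddG(F-U) is zero: phi is undefined for this mutation")
    return ddg_t_u / ddg_f_u


def interpret(phi, tol=0.1):
    """Map a phi-value onto the three cases on the slide.
    `tol` is an arbitrary tolerance, not a value from the presentation."""
    if abs(phi) <= tol:
        return "disordered in the transition state"
    if abs(phi - 1.0) <= tol:
        return "native-like in the transition state"
    return "indeterminate"


# Hypothetical mutation: destabilises T by 0.5 and F by 2.0 (same units).
phi = phi_value(0.5, 2.0)
print(phi, interpret(phi))
```

In practice the measurement error on ΔΔG(F-U) matters: when it is close to zero the ratio becomes unreliable, which is why the sketch refuses to divide by zero rather than return a value.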

30
Comparing phi values with core elements - Why use
core prediction?
  • Ideally, to compare structural invariance across
    a fold class to folding parameters one needs to
    generate a multiple structure alignment of
    unrelated proteins within that class
  • Of those few proteins for which experimental
    phi-values are available, not all have more than
    1-2 protein families in the same fold class
  • Folding parameters could be compared to predicted
    core-ness of secondary structure elements for
    all proteins
  • Predicted core-ness of elements compared to
    average phi-values for that element

31
Phi-value test set
  Protein   Elems   Phis   Res   Coverage (%)
  -------------------------------------------
  AcP         7      17     68       25
  ACBP        4      19     57       33
  CheY       10      25     90       28
  CI2         5      23     32       72
  FKBP12      6      19     41       46
  FNFn10      7      15     45       33
  SH3         5      19     23       83
  villin      5      16     51       31
  WW          2       8     10       80
  -------------------------------------------
  Total      51     161    417       39
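
The Coverage column appears to be the number of phi-values divided by the number of residues (Res), expressed as a percentage. That reading is an inference rather than something the slide states, but the short check below shows it reproduces every tabulated value, including the total.

```python
# (elements, phi-values, residues, reported coverage) copied from the table.
test_set = {
    "AcP": (7, 17, 68, 25),
    "ACBP": (4, 19, 57, 33),
    "CheY": (10, 25, 90, 28),
    "CI2": (5, 23, 32, 72),
    "FKBP12": (6, 19, 41, 46),
    "FNFn10": (7, 15, 45, 33),
    "SH3": (5, 19, 23, 83),
    "villin": (5, 16, 51, 31),
    "WW": (2, 8, 10, 80),
}


def coverage(phis, residues):
    """Percentage of residues with a phi-value, rounded to an integer."""
    return round(100 * phis / residues)


for protein, (_, phis, residues, reported) in test_set.items():
    print(f"{protein:7s} computed {coverage(phis, residues):3d}  reported {reported:3d}")

total_elements = sum(row[0] for row in test_set.values())
total_phis = sum(row[1] for row in test_set.values())
total_residues = sum(row[2] for row in test_set.values())
total_coverage = coverage(total_phis, total_residues)
```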

34
Wishful thinking?
  • Can we reliably predict relatively early/late
    folding elements with confident/less-confident
    predictions of core-ness?
  • We can test binary classification via
    leave-one-out cross validation.
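
One way to set up such a test is sketched below. For each held-out protein, cut-offs on phi and on core-prediction confidence are chosen from the remaining proteins, and an element counts as correct when "early folding" (phi at or above its cut-off) agrees with "confidently core" (confidence at or above its cut-off). How the cut-offs were actually chosen is not stated in the slides; using the median of the training elements is an assumption, and the protein names and numbers are invented.

```python
from statistics import median

# Hypothetical data: protein -> list of (mean phi, core confidence %) per element.
data = {
    "protA": [(0.6, 95), (0.1, 70)],
    "protB": [(0.5, 90), (0.2, 60)],
    "protC": [(0.7, 65), (0.1, 92)],
}


def leave_one_out(data):
    """Per-protein accuracy when cut-offs come only from the other proteins."""
    results = {}
    for held_out, elements in data.items():
        train = [e for name, es in data.items() if name != held_out for e in es]
        phi_cut = median(phi for phi, _ in train)      # assumed cut-off rule
        core_cut = median(core for _, core in train)   # assumed cut-off rule
        correct = sum(
            (phi >= phi_cut) == (core >= core_cut) for phi, core in elements
        )
        results[held_out] = correct / len(elements)
    return results


results = leave_one_out(data)
for protein, acc in results.items():
    print(f"{protein}: accuracy {acc:.2f}")
```

Deriving the cut-offs from the training proteins only keeps the held-out protein from influencing its own evaluation, which is the point of the leave-one-out design.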

38
Cross-validation summary
  protein   elements   acc   expect   recall   prec   phi_cut   c50_cut
  ---------------------------------------------------------------------
  AcP           7       71     57       67      67     0.34       88
  ACBP          4       25     50        0       0     0.34       89
  CheY         10       50     70      100      38     0.34       89
  CI2           5      100    100        0       0     0.34       88
  FKBP12        6       83     67       75     100     0.34       89
  FNFn10        7       71     57       50     100     0.34       90
  SH3           5       80     60      100      75     0.34       89
  villin        5       60     60       67      67     0.35       89
  WW            2      100    100      100     100     0.34       88
  ---------------------------------------------------------------------
  Overall      51       69     53       71      65
  • Statistical significance: χ² = 5.04 (p = 0.02)

39
AcP
[Figure: Experiment (earlier folding to later folding) compared with Core Prediction (more likely core to less likely core)]
40
ACBP
[Figure: as for AcP]
41
CheY
[Figure: as for AcP]
42
CI2
[Figure: as for AcP]
43
FKBP12
[Figure: as for AcP]
44
FNFn10
[Figure: as for AcP]
45
SH3
[Figure: as for AcP]
46
villin
[Figure: as for AcP]
47
WW
[Figure: as for AcP]
48
Fin
  • People
  • Mike Sternberg
  • Stephen Muggleton
  • Funding
  • BBSRC