Bioinformatics CS 40400 presentation


1
Bioinformatics CS 40400

Gianluca Pollastri
office CS A1.07
email gianluca.pollastri_at_ucd.ie

2
Lecture notes

http://gruyere.ucd.ie/2007_courses/40400/
confidential..

3
Recommended/useful readings

No book is actually required
Introduction to Computational Molecular Biology
Setubal, Meidanis
Introduction to Bioinformatics
Lesk
Bioinformatics: The Machine Learning Approach
Baldi, Brunak
Biological Sequence Analysis (but this is a tough one)
Durbin, Eddy, Krogh, Mitchison

5
The course so far..

Introduction: proteins, RNA, DNA. GenBank,
SWISS-PROT, PDB, ExPASy.
Sequence comparison: Needleman-Wunsch and
Smith-Waterman algorithms. Semiglobal comparison.
Variations on the basic algorithms. Approximate
algorithms: BLAST. Multiple sequence alignments.
Approximate algorithms: ClustalW.
Molecular phylogenetics: distance-based
algorithms (UPGMA, Neighbour Joining). Maximum
parsimony. Maximum likelihood. Rooting trees.
Estimating times. Bootstrapping.

6
What's next?

Protein structure prediction: comparative
modelling, threading, de novo. Introduction to
artificial neural networks. Prediction of protein
structural features by machine learning
algorithms: secondary structure, solvent
accessibility, contact maps.

7
Protein Structure Prediction and Structural
Genomics
Loosely based on

David Baker and Andrej Sali

MKTLVHVASV EKGRSYEDFQ KVYNAIALKL REDDEYENYI GYGDDLVRLA
WHISGTWDKH DNTGGSYGGT YRFKKEFNDP SNAGLQNGFK FLEPIHKEFP
WISSGDLFSL GGVTAVQEMQ GPKIPWRCGR VDTPEDTTPD NGRLPDADKD
AGYVRTFFQR LNMNDREVVA LMGAHALGKT HLKNSGYEGP WGAANNVFTN
EFYLNLLNED WKLEKNDANN EQWDSKSGYM MLPTDYSLIQ DPKYLSIVKE
YANDQDKFFK DFSKAFEKLL ENGITFPKDA PSPFIFKTLE EQGL

10
Why?

1.7M protein sequences, a cheap product of genome
sequencing projects.
29k high-resolution protein structures,
determined by X-ray crystallography or NMR:
painful, costly and time-consuming.
Sequence determines structure, structure
determines function. This is why we want to know
the structure.

12
ATOM      1  N   LEU A   4      12.803  88.583  75.298  1.00 25.27           N
ATOM      2  CA  LEU A   4      12.284  89.166  74.064  1.00 25.73           C
ATOM      3  C   LEU A   4      11.896  88.062  73.094  1.00 24.57           C
ATOM      4  O   LEU A   4      12.740  87.441  72.459  1.00 21.83           O
ATOM      5  CB  LEU A   4      13.283  90.098  73.387  1.00 25.28           C
ATOM      6  CG  LEU A   4      12.714  91.348  72.710  1.00 30.06           C
ATOM      7  CD1 LEU A   4      13.446  91.644  71.405  1.00 23.16           C
ATOM      8  CD2 LEU A   4      11.221  91.221  72.456  1.00 27.13           C
ATOM      9  N   VAL A   5      10.588  87.839  72.988  1.00 19.24           N
ATOM     10  CA  VAL A   5      10.180  86.742  72.108  1.00 22.98           C
ATOM     11  C   VAL A   5       9.286  87.293  71.005  1.00 25.18           C
ATOM     12  O   VAL A   5       8.388  88.103  71.215  1.00 19.54           O
ATOM     13  CB  VAL A   5       9.527  85.607  72.915  1.00 36.54           C
ATOM     14  CG1 VAL A   5       8.876  86.145  74.185  1.00 68.19           C
ATOM     15  CG2 VAL A   5       8.518  84.844  72.075  1.00 42.84           C
ATOM     16  N   HIS A   6       9.594  86.832  69.801  1.00 19.98           N
ATOM     17  CA  HIS A   6       8.898  87.164  68.570  1.00 14.76           C
ATOM     18  C   HIS A   6       8.153  85.933  68.072  1.00 13.19           C
ATOM     19  O   HIS A   6       8.794  85.029  67.536  1.00 12.12           O
ATOM     20  CB  HIS A   6       9.900  87.636  67.521  1.00 15.61           C
ATOM     21  CG  HIS A   6      10.488  88.969  67.851  1.00 16.42           C
ATOM     22  ND1 HIS A   6      11.808  89.287  67.631  1.00 17.91           N
ATOM     23  CD2 HIS A   6       9.916  90.073  68.382  1.00 12.99           C
ATOM     24  CE1 HIS A   6      12.036  90.531  68.009  1.00 10.96           C
ATOM     25  NE2 HIS A   6      10.904  91.032  68.472  1.00 17.40           N
ATOM     26  N   VAL A   7       6.839  85.922  68.277  1.00 10.72           N
ATOM     27  CA  VAL A   7       6.048  84.781  67.851  1.00 11.90           C
ATOM     28  C   VAL A   7       5.539  85.014  66.423  1.00 17.11           C
ATOM     29  O   VAL A   7       4.938  86.053  66.131  1.00  8.14           O
ATOM     30  CB  VAL A   7       4.833  84.488  68.746  1.00 12.98           C
ATOM     31  CG1 VAL A   7       4.223  83.146  68.336  1.00 11.94           C
ATOM     32  CG2 VAL A   7       5.188  84.475  70.218  1.00 14.19           C
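Each ATOM record above lists a serial number, atom name, residue name, chain, residue number, x/y/z coordinates (in angstroms), occupancy, temperature factor and element. The following is a minimal sketch of reading such records, assuming whitespace-separated fields as they appear in this transcript. Real PDB files use fixed columns, so a robust parser would slice by column instead. The `Atom` type and the `parse_atom` function are illustrative names, not part of any library.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record (assumed to have 12 fields)."""
    f = line.split()
    if len(f) != 12 or f[0] != "ATOM":
        raise ValueError(f"not a 12-field ATOM record: {line!r}")
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]),
                float(f[6]), float(f[7]), float(f[8]),
                float(f[9]), float(f[10]), f[11])


records = [
    "ATOM 1 N LEU A 4 12.803 88.583 75.298 1.00 25.27 N",
    "ATOM 2 CA LEU A 4 12.284 89.166 74.064 1.00 25.73 C",
    "ATOM 10 CA VAL A 5 10.180 86.742 72.108 1.00 22.98 C",
]
atoms = [parse_atom(r) for r in records]
# Keep only alpha carbons: one point per residue, a common backbone trace.
ca_trace = [(a.res_name, a.res_seq, (a.x, a.y, a.z)) for a in atoms if a.name == "CA"]
print(ca_trace)
```

Filtering on the atom name `CA` is the usual first step when a method only needs one coordinate per residue.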
14
Simulating nature?

We probably don't know the physics well enough
(or rather, we know it well only on an intractably
small scale).
Ugly landscapes to search.
An enormous number of time steps needed.
Computationally intractable.
We need to:
start close to the solution
approximate/simplify

15
Methods for 3D prediction

If there are proteins of known structure that
look like the one I want to model, comparative
modelling (CM) or threading/fold recognition
methods are available (start close to the solution
and possibly simplify).
If there aren't, we use de novo (or ab initio)
methods (we can't start close to the solution, so we
need to simplify/approximate).

16
Comparative Modelling (CM)

Find proteins of known structure whose sequence
looks like my sequence (templates).
Align sequence and template(s)
Build a model
Figure out if the model makes sense

17
Structure more conserved than sequence

If two sequences are more than 30% similar,
strong structural similarity is almost guaranteed.
(Average similarity of unrelated sequences is around
7%.)

18
CM

Find proteins of known structure whose sequence
looks at least 30% like my sequence (templates).
Align sequence and template(s)
..

19
Sequence similarity

Query    3  FEFHGYARSGVIMNDSGASTKSGAYITPAGETGGAIGRLGNQADTYVEMNLEHKQTLDNG  62
            FEFH YAR  V MND  A  K  AY  PA E   A  RL NQAD YVEMNLEHKQ LDN
Sbjct    3  FEFHHYARCHVHMNDCHACCKCHAYHCPAHECHHAHHRLHNQADCYVEMNLEHKQCLDNH  62

Query   63  ATTRFKVMVADGQTSYNDWTASTSDLNVRQAFVELGNLPTFAGPFKGSTLWAGKRFDRDN  122
            A  RFKVMVAD Q  YNDW A   DLNVRQAFVEL NLP FA PFK   LWA KRFDRDN
Sbjct   63  ACCRFKVMVADHQCCYNDWCACCCDLNVRQAFVELHNLPCFAHPFKH--LWAHKRFDRDN  120

Query  123  FDIHWIDSDVVFLAGTGGGIYDVKWNDGLRSNFSLYGRNFGDIDDSSNSVQNYILTMNHF  182
            FD HW D DVVFLA      YDVKWND LR NF LY RNF D DD  N VQNY L MNHF
Sbjct  121  FDHHWHDCDVVFLAHCHHHHYDVKWNDHLRCNFCLYHRNFHDHDDCCNCVQNYHLCMNHF  180
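The similarity of two aligned sequences can be measured as percent identity, the fraction of alignment columns in which both rows hold the same residue. The sketch below assumes the denominator is the full alignment length (other conventions divide by the shorter sequence or by the number of ungapped columns). The function name `percent_identity` is illustrative.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percentage of aligned columns holding the same residue in both rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    if not a:
        raise ValueError("empty alignment")
    # A column counts as a match only if both rows agree and it is not a gap.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * matches / len(a)


# First block of the alignment above (query and subject residues 3-62).
query = "FEFHGYARSGVIMNDSGASTKSGAYITPAGETGGAIGRLGNQADTYVEMNLEHKQTLDNG"
sbjct = "FEFHHYARCHVHMNDCHACCKCHAYHCPAHECHHAHHRLHNQADCYVEMNLEHKQCLDNH"
print(f"{percent_identity(query, sbjct):.1f}% identical")
```

Under this definition the pair sits well above the 30% mark quoted earlier, which is the kind of case where comparative modelling is expected to apply.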

20
CM: finding templates, aligning sequence and
templates

Sequence comparison methods.
Exact version (complexity O(nm), where n and m are
the sequence lengths): a bit demanding.
Linear approximations (BLAST, PSI-BLAST):
aligning a sequence against 1M sequences takes tens of
seconds.
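The O(nm) cost of the exact methods comes from filling a dynamic-programming table with one cell per pair of positions. Here is a minimal sketch of the Needleman-Wunsch global alignment score, assuming a simple illustrative scoring scheme (match +1, mismatch -1, linear gap -1) rather than a real substitution matrix. Only two rows of the table are kept, since just the score is returned.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score by dynamic programming, O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align a[i-1] with b[j-1], gap in b, gap in a.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]


print(nw_score("GATTACA", "GCATGCU"))
```

Recovering the alignment itself, not just its score, requires keeping the full table (or traceback pointers), which is where the memory cost for long sequences comes from.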

21
CM: building/assessing model

Copy the template or parts thereof (to start close to
the solution).
Fondle it a bit and assess the fondling by physical
energy or pseudo-energy.

22
Threading/Fold Recognition

If no sequence similarity is detected:
Find proteins of known structure (templates) by
some other method (not sequence comparison)
Align sequence and template(s)
Build a model
Figure out if the model makes sense

23
Threading: finding templates

Libraries of folds.
Thread the sequence into each of the folds and
check if it has low energy in one or more of them.

24
Threading

Energy computations are constrained by the folds.
This is a lot simpler (quicker) than an
unconstrained search.
Still, there are 1-2k folds to try.

25
Threading: building/assessing model

Copy the template or parts thereof (to start close to
the solution).
Fondle it a bit more than in CM and assess the
fondling by physical energy or pseudo-energy.

26
De novo prediction

No sequence similarity with proteins of known
structure detected.
No fold where threading is possible at acceptable
energy levels.

27
De novo prediction: note

Sequence similarity methods are not perfect.
Threading methods are far from perfect
(especially energy functions).
Often de novo methods are used for proteins whose
structure does resemble a known one.

28
De novo: how does it work?

It usually does not.
(Neither does threading.)
Search for a minimum of some energy function. Key
actors:
How we search the space of 3D configurations.
The energy function we use.
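The two ingredients named above, a way of moving through configuration space and an energy function to score each configuration, can be illustrated on a toy problem. The sketch below runs a Metropolis Monte Carlo search over a short vector of torsion-like angles. The energy `toy_energy` is entirely made up (a broad bowl with ripples) to mimic a rugged landscape. It is not a physical force field, and the step size, temperature and step count are arbitrary assumptions.

```python
import math
import random


def toy_energy(angles):
    """Made-up rugged landscape: a broad bowl plus ripples (not a physical energy)."""
    return sum(0.01 * a * a + 1.0 - math.cos(3.0 * a) for a in angles)


def metropolis(energy, start, steps=20000, step_size=0.5, temperature=0.5, seed=0):
    """Random single-angle moves; uphill moves are accepted with Boltzmann probability."""
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step_size, step_size)
        e_trial = energy(trial)
        delta = e_trial - e_current
        # Always accept downhill moves; accept uphill ones occasionally
        # so the search can escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


start = [3.0, -4.0, 5.0, -2.5]
best, e_best = metropolis(toy_energy, start)
print(f"start energy {toy_energy(start):.3f}, best found {e_best:.3f}")
```

Even on four variables the ripples create many local minima. A real protein has hundreds of degrees of freedom and a far more expensive energy function, which is why both the move set and the energy are simplified in practice.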

29
De novo: simplify

An all-atom model is computationally heavy
Only some atoms are modelled (e.g. backbone
atoms).

30
De novo: simplify more

An all-atom model is computationally heavy
Whole stretches of atoms are modelled together.

31
Choosing the stretches

Regular local structures (helices, strands) are a
natural modelling unit.
We don't know where they are.
Machine learning.
Huge field.

34
The energy function

Purely physical functions are not accurate enough
for all-atom models, are very inaccurate for
coarser models, and generally don't provide
decent landscapes to search.
Pseudo-energies: huge room for machine learning,
e.g. contact map prediction.

35
Contact maps

Amino acid adjacency map.
Invariant to rotations and translations, unlike
xyz coordinates.
Maps with 50-60% uniform random noise are still compatible
with the correct 3D structure.
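The invariance property is easy to check numerically. The sketch below builds a binary contact map from C-alpha coordinates, assuming two residues are "in contact" when they lie within 8 angstroms (a common but arbitrary threshold, not one given in the slides). It then confirms that the map is unchanged after a rotation and a translation. The function names are illustrative.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary map: 1 where two residues lie within `threshold` angstroms."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= threshold else 0
             for j in range(n)] for i in range(n)]


def rotate_z_and_shift(coords, angle, shift):
    """Rigid-body motion: rotate about the z axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


# C-alpha coordinates of residues 4-7 from the ATOM records shown earlier.
ca = [(12.284, 89.166, 74.064), (10.180, 86.742, 72.108),
      (8.898, 87.164, 68.570), (6.048, 84.781, 67.851)]
original = contact_map(ca)
moved = contact_map(rotate_z_and_shift(ca, angle=1.0, shift=(5.0, -3.0, 12.0)))
print(original == moved)
```

Because the map depends only on pairwise distances, any rigid-body motion leaves it unchanged, whereas the raw xyz coordinates all change.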

36
Do CM, threading, de novo work?

CM works, so long as the template is correct (it
often is)
Threading works, so long as the template is
correct (it never is)
De novo in some cases produces one model out of 5
that is correct over a short stretch of amino
acids (80?).

