Bioinformatics CS 40400 - PowerPoint PPT Presentation

Provided by: gruye

1
Bioinformatics CS 40400
  • Gianluca Pollastri
  • office: CS A1.07
  • email: gianluca.pollastri_at_ucd.ie

2
Lecture notes
  • http://gruyere.ucd.ie/2007_courses/40400/
  • confidential..

3
Recommended/useful readings
  • No book is actually required
  • Introduction to Computational Molecular Biology
    (Setubal, Meidanis)
  • Introduction to Bioinformatics (Lesk)
  • Bioinformatics: The Machine Learning Approach
    (Baldi, Brunak)
  • Biological Sequence Analysis (Eddy, Durbin, Krogh,
    Mitchison), but this is a tough one

4
(No Transcript)
5
The course so far..
  • Introduction: proteins, RNA, DNA. GenBank,
    SWISS-PROT, PDB, ExPASy.
  • Sequence comparison. Needleman-Wunsch and
    Smith-Waterman algorithms. Semiglobal comparison.
    Variations to basic algorithms. Approximate
    algorithms: BLAST. Multiple sequence alignments.
    Approximate algorithms: ClustalW.
  • Molecular phylogenetics. Distance-based
    algorithms: UPGMA, Neighbour Joining. Maximum
    parsimony. Maximum likelihood. Rooting trees.
    Estimating times. Bootstrapping.

6
What's next?
  • Protein structure prediction: comparative
    modelling, threading, de novo. Introduction to
    artificial neural networks. Prediction of protein
    structural features by machine learning
    algorithms: secondary structure, solvent
    accessibility, contact maps.

7
Protein Structure Prediction and Structural
Genomics
Loosely based on
  • David Baker and Andrej Sali

8
  • MKTLVHVASV EKGRSYEDFQ KVYNAIALKL REDDEYENYI
    GYGDDLVRLA
  • WHISGTWDKH DNTGGSYGGT YRFKKEFNDP SNAGLQNGFK
    FLEPIHKEFP
  • WISSGDLFSL GGVTAVQEMQ GPKIPWRCGR VDTPEDTTPD
    NGRLPDADKD
  • AGYVRTFFQR LNMNDREVVA LMGAHALGKT HLKNSGYEGP
    WGAANNVFTN
  • EFYLNLLNED WKLEKNDANN EQWDSKSGYM MLPTDYSLIQ
    DPKYLSIVKE
  • YANDQDKFFK DFSKAFEKLL ENGITFPKDA PSPFIFKTLE EQGL

9
(No Transcript)
10
Why?
  • 1.7M protein sequences, a cheap product of genome
    sequencing projects.
  • 29k high-resolution protein structures,
    determined by X-ray crystallography or NMR:
    painful, costly and time consuming.
  • Sequence determines structure, structure
    determines function. This is why we want to know
    the structure.

11
(No Transcript)
12
ATOM      1  N   LEU A   4      12.803  88.583  75.298  1.00 25.27           N
ATOM      2  CA  LEU A   4      12.284  89.166  74.064  1.00 25.73           C
ATOM      3  C   LEU A   4      11.896  88.062  73.094  1.00 24.57           C
ATOM      4  O   LEU A   4      12.740  87.441  72.459  1.00 21.83           O
ATOM      5  CB  LEU A   4      13.283  90.098  73.387  1.00 25.28           C
ATOM      6  CG  LEU A   4      12.714  91.348  72.710  1.00 30.06           C
ATOM      7  CD1 LEU A   4      13.446  91.644  71.405  1.00 23.16           C
ATOM      8  CD2 LEU A   4      11.221  91.221  72.456  1.00 27.13           C
ATOM      9  N   VAL A   5      10.588  87.839  72.988  1.00 19.24           N
ATOM     10  CA  VAL A   5      10.180  86.742  72.108  1.00 22.98           C
ATOM     11  C   VAL A   5       9.286  87.293  71.005  1.00 25.18           C
ATOM     12  O   VAL A   5       8.388  88.103  71.215  1.00 19.54           O
ATOM     13  CB  VAL A   5       9.527  85.607  72.915  1.00 36.54           C
ATOM     14  CG1 VAL A   5       8.876  86.145  74.185  1.00 68.19           C
ATOM     15  CG2 VAL A   5       8.518  84.844  72.075  1.00 42.84           C
ATOM     16  N   HIS A   6       9.594  86.832  69.801  1.00 19.98           N
ATOM     17  CA  HIS A   6       8.898  87.164  68.570  1.00 14.76           C
ATOM     18  C   HIS A   6       8.153  85.933  68.072  1.00 13.19           C
ATOM     19  O   HIS A   6       8.794  85.029  67.536  1.00 12.12           O
ATOM     20  CB  HIS A   6       9.900  87.636  67.521  1.00 15.61           C
ATOM     21  CG  HIS A   6      10.488  88.969  67.851  1.00 16.42           C
ATOM     22  ND1 HIS A   6      11.808  89.287  67.631  1.00 17.91           N
ATOM     23  CD2 HIS A   6       9.916  90.073  68.382  1.00 12.99           C
ATOM     24  CE1 HIS A   6      12.036  90.531  68.009  1.00 10.96           C
ATOM     25  NE2 HIS A   6      10.904  91.032  68.472  1.00 17.40           N
ATOM     26  N   VAL A   7       6.839  85.922  68.277  1.00 10.72           N
ATOM     27  CA  VAL A   7       6.048  84.781  67.851  1.00 11.90           C
ATOM     28  C   VAL A   7       5.539  85.014  66.423  1.00 17.11           C
ATOM     29  O   VAL A   7       4.938  86.053  66.131  1.00  8.14           O
ATOM     30  CB  VAL A   7       4.833  84.488  68.746  1.00 12.98           C
ATOM     31  CG1 VAL A   7       4.223  83.146  68.336  1.00 11.94           C
ATOM     32  CG2 VAL A   7       5.188  84.475  70.218  1.00 14.19           C
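The ATOM records on this slide each describe one atom: serial number, atom name, residue name, chain, residue number, x/y/z coordinates, occupancy, temperature factor and element. A minimal sketch of reading them in Python follows. It assumes whitespace-separated fields, as they appear in this transcript; real PDB files use fixed columns, so fields can run together and a robust parser should slice by column instead. The names `Atom` and `parse_atom` are hypothetical helpers, not part of any library.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record (hypothetical helper)."""
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        res_name=fields[3],
        chain=fields[4],
        res_seq=int(fields[5]),
        x=float(fields[6]),
        y=float(fields[7]),
        z=float(fields[8]),
    )


# Three records copied from the slide.
records = [
    "ATOM 1 N LEU A 4 12.803 88.583 75.298 1.00 25.27 N",
    "ATOM 2 CA LEU A 4 12.284 89.166 74.064 1.00 25.73 C",
    "ATOM 10 CA VAL A 5 10.180 86.742 72.108 1.00 22.98 C",
]
atoms = [parse_atom(r) for r in records]

# Keep only alpha carbons: one point per residue, a common coarse trace.
ca_trace = [(a.res_name, a.res_seq, (a.x, a.y, a.z)) for a in atoms if a.name == "CA"]
print(ca_trace)
```

Keeping only the CA atoms gives one point per residue, which is the kind of simplified representation the later de novo slides rely on.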
13
(No Transcript)
14
Simulating nature?
  • We probably don't know the physics well enough
    (or rather, we know it well only on an intractably
    small scale).
  • Ugly landscapes to search.
  • An enormous number of time steps needed.
  • Computationally intractable.
  • We need to:
  • start close to the solution
  • approximate/simplify
15
Methods for 3D prediction
  • If there are proteins of known structure that
    look like the one I want to model, comparative
    modelling (CM) or threading/fold recognition
    methods are available (we start close to the
    solution and possibly simplify).
  • If there aren't, we use de novo (or ab initio)
    methods (we can't start close to the solution, so
    we need to simplify/approximate).

16
Comparative Modelling (CM)
  • Find proteins of known structure whose sequence
    looks like my sequence (templates).
  • Align sequence and template(s)
  • Build a model
  • Figure out if the model makes sense

17
Structure more conserved than sequence
  • If two sequences are more than 30% similar,
    strong structural similarity is almost guaranteed.
  • (Average similarity of unrelated sequences is
    around 7%.)

18
CM
  • Find proteins of known structure whose sequence
    looks at least 30% like my sequence (templates).
  • Align sequence and template(s)
  • ..

19
Sequence similarity
Query 3    FEFHGYARSGVIMNDSGASTKSGAYITPAGETGGAIGRLGNQADTYVEMNLEHKQTLDNG 62
           FEFH YAR  V MND  A  K  AY  PA E   A  RL NQAD YVEMNLEHKQ LDN
Sbjct 3    FEFHHYARCHVHMNDCHACCKCHAYHCPAHECHHAHHRLHNQADCYVEMNLEHKQCLDNH 62

Query 63   ATTRFKVMVADGQTSYNDWTASTSDLNVRQAFVELGNLPTFAGPFKGSTLWAGKRFDRDN 122
           A  RFKVMVAD Q  YNDW A   DLNVRQAFVEL NLP FA PFK   LWA KRFDRDN
Sbjct 63   ACCRFKVMVADHQCCYNDWCACCCDLNVRQAFVELHNLPCFAHPFKH--LWAHKRFDRDN 120

Query 123  FDIHWIDSDVVFLAGTGGGIYDVKWNDGLRSNFSLYGRNFGDIDDSSNSVQNYILTMNHF 182
           FD HW D DVVFLA      YDVKWND LR NF LY RNF D DD  N VQNY L MNHF
Sbjct 121  FDHHWHDCDVVFLAHCHHHHYDVKWNDHLRCNFCLYHRNFHDHDDCCNCVQNYHLCMNHF 180
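The similarity level of an alignment like the one on this slide can be checked directly. The sketch below computes percent identity for the first Query/Sbjct block (positions 3 to 62), assuming identity is counted as identical non-gap columns divided by the alignment length; other definitions (for example dividing by the shorter sequence length) exist and give slightly different numbers. `percent_identity` is a hypothetical helper name.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percentage of aligned columns holding the same residue in both rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * matches / len(a)


# First alignment block from the slide (query and subject rows).
query = "FEFHGYARSGVIMNDSGASTKSGAYITPAGETGGAIGRLGNQADTYVEMNLEHKQTLDNG"
sbjct = "FEFHHYARCHVHMNDCHACCKCHAYHCPAHECHHAHHRLHNQADCYVEMNLEHKQCLDNH"

identity = percent_identity(query, sbjct)
print(f"{identity:.1f}% identical over {len(query)} columns")
```

For this block, 38 of the 60 columns match, which is well above the 30% level quoted on the earlier slide.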

20
CM: finding templates, aligning sequence and
templates
  • Sequence comparison methods
  • Exact version (complexity O(nm), where n and m are
    the sequence lengths): a bit demanding.
  • Linear approximations (BLAST, PSI-BLAST):
    aligning a sequence vs 1M sequences takes tens of
    seconds.
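To make the O(nm) cost of the exact version concrete, here is a minimal sketch of the Needleman-Wunsch global alignment score. It fills an (n+1) by (m+1) dynamic programming table one row at a time, which is where the n times m work comes from. The scoring values (match +1, mismatch -1, linear gap penalty -2) are illustrative assumptions; real protein alignment would use a substitution matrix such as BLOSUM62 and usually affine gaps.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b (score only, no traceback)."""
    n, m = len(a), len(b)
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[m]


print(needleman_wunsch_score("ACGT", "ACGT"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Only two rows are kept in memory because the score alone is returned; recovering the alignment itself needs the full table (or a divide-and-conquer scheme) for the traceback.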

21
CM: building/assessing model
  • Copy template or parts thereof (to start close to
    the solution)..
  • Tweak it a bit and assess the tweaks by physical
    energy or pseudo-energy.

22
Threading/Fold Recognition
  • If no sequence similarity is detected
  • Find proteins of known structure (templates) by
    some other method (not sequence comparison)
  • Align sequence and template(s)
  • Build a model
  • Figure out if the model makes sense

23
Threading: finding templates
  • Libraries of folds.
  • Thread the sequence into each of the folds and
    check if it has low energy in one or more of them.
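The idea of threading a sequence into each fold and comparing energies can be shown with a toy example. Everything in this sketch is an assumption made for illustration: each fold is reduced to a list of residue-position pairs that are in contact, the sequence is placed on the fold without gaps, and the pseudo-energy simply rewards hydrophobic-hydrophobic contacts and penalises hydrophobic-polar ones. Real threading methods use far richer energy functions and allow gaps in the sequence-to-structure alignment.

```python
# Toy hydrophobic alphabet (an assumption, not a standard scale).
HYDROPHOBIC = set("AVLIMFWC")


def contact_energy(a: str, b: str) -> float:
    """Toy pseudo-energy for one contact between residues a and b."""
    ha, hb = a in HYDROPHOBIC, b in HYDROPHOBIC
    if ha and hb:
        return -1.0  # buried hydrophobic pair: favourable
    if ha != hb:
        return 0.5   # hydrophobic next to polar: unfavourable
    return 0.0       # polar pair: neutral


def thread(sequence: str, contacts: list[tuple[int, int]]) -> float:
    """Energy of the sequence placed, ungapped, on a fold's contact list."""
    return sum(contact_energy(sequence[i], sequence[j]) for i, j in contacts)


# Hypothetical two-fold library; each fold is a list of contacting positions.
fold_library = {
    "fold_A": [(0, 3), (1, 4), (2, 5)],
    "fold_B": [(0, 5), (1, 3), (2, 4)],
}

sequence = "LKVLEV"
energies = {name: thread(sequence, contacts) for name, contacts in fold_library.items()}
best_fold = min(energies, key=energies.get)
print(energies, best_fold)
```

The search is cheap because the fold fixes which positions are in contact; only the assignment of residues to positions is scored.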

24
Threading
  • Energy computations constrained by folds.
  • This is a lot simpler (quicker) than
    unconstrained search.
  • Still 1-2k folds

25
Threading: building/assessing model
  • Copy template or parts thereof (to start close to
    the solution)..
  • Tweak it a bit more than in CM and assess the
    tweaks by physical energy or pseudo-energy.

26
De novo prediction
  • No sequence similarity with proteins of known
    structure detected.
  • No fold where threading is possible at acceptable
    energy levels.

27
De novo prediction: note
  • Sequence similarity methods are not perfect.
  • Threading methods are far from perfect
    (especially energy functions).
  • Often de novo methods are used for proteins whose
    structure does resemble a known one.

28
De novo: how does it work?
  • It usually does not.
  • (neither does threading)
  • Search for a minimum of some energy function. Key
    actors:
  • How we search the space of 3D configurations.
  • The energy function we use.
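The two ingredients named on this slide, a search strategy and an energy function, can be sketched with a Metropolis Monte Carlo search. The example is deliberately a toy: the "conformation" is a single number x and `energy` is an invented rugged function, not a protein energy. The step size, temperature, step count and seed are arbitrary assumptions chosen for illustration.

```python
import math
import random


def energy(x: float) -> float:
    # Toy rugged landscape (an assumption for illustration only):
    # a broad bowl with many local minima superimposed.
    return x * x + 3.0 * math.sin(5.0 * x)


def metropolis(start: float, steps: int = 5000, step_size: float = 0.5,
               temperature: float = 1.0, seed: int = 0) -> tuple[float, float]:
    """Return the lowest-energy point visited and its energy."""
    rng = random.Random(seed)
    x, e = start, energy(start)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-dE / T) so the search can leave local minima.
        if e_candidate <= e or rng.random() < math.exp(-(e_candidate - e) / temperature):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = metropolis(start=4.0)
print(best_x, best_e)
```

Occasionally accepting uphill moves is what lets the search cross barriers in an ugly landscape; how well it works still depends on how good the energy function is.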

29
De novo: simplify
  • An all-atom model is computationally heavy
  • Only some atoms are modelled (e.g. backbone
    atoms).

30
De novo: simplify more
  • An all-atom model is computationally heavy
  • Whole stretches of atoms are modelled together.

31
Choosing the stretches
  • Regular local structures (helices, strands) are a
    natural modelling unit.
  • We don't know where they are.
  • Machine learning.
  • Huge field.

32
(No Transcript)
33
(No Transcript)
34
The energy function
  • Purely physical functions are not accurate enough
    for all-atom models, are very inaccurate for
    coarser models, and generally don't provide
    decent landscapes to search.
  • Pseudo-energies: huge room for machine learning,
    e.g. contact map prediction.

35
Contact maps
  • Amino acid adjacency map.
  • Invariant to rotations and translations, unlike
    xyz coordinates.
  • Maps with 50-60% uniform random noise are still
    compatible with the correct 3D structure.
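A contact map can be computed directly from coordinates, and its invariance to rotations and translations is easy to check. The sketch below uses the four CA coordinates from the ATOM records shown earlier in the deck (LEU 4, VAL 5, HIS 6, VAL 7). The 8 Å distance threshold is an assumption; thresholds and the choice of atom (CA or CB) vary between methods. `contact_map` and `rotate_z_and_shift` are hypothetical helper names.

```python
import math

Point = tuple[float, float, float]


def contact_map(coords: list[Point], threshold: float = 8.0) -> list[list[int]]:
    """Binary matrix: 1 where two residues are closer than the threshold."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)] for i in range(n)]


def rotate_z_and_shift(coords: list[Point], angle: float, shift: Point) -> list[Point]:
    """Rotate all points about the z axis, then translate them."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


# CA atoms of residues 4 to 7 from the ATOM records in this deck.
ca_coords = [
    (12.284, 89.166, 74.064),  # LEU 4
    (10.180, 86.742, 72.108),  # VAL 5
    (8.898, 87.164, 68.570),   # HIS 6
    (6.048, 84.781, 67.851),   # VAL 7
]

original = contact_map(ca_coords)
moved = contact_map(rotate_z_and_shift(ca_coords, angle=1.0, shift=(5.0, -3.0, 2.0)))
print(original)
print(original == moved)
```

Only inter-residue distances enter the map, so any rigid motion of the whole structure leaves it unchanged, whereas the xyz coordinates themselves all change.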

36
Do CM, threading, de novo work?
  • CM works, so long as the template is correct (it
    often is)
  • Threading works, so long as the template is
    correct (it never is)
  • De novo in some cases produces one model out of 5
    that is correct over a short stretch of amino
    acids (80?).

37
(No Transcript)