Introduction to Bioinformatics - PowerPoint PPT Presentation

Title: Introduction to Bioinformatics


1
Giles Velarde velarde_at_bioinf.man.ac.uk
Introduction to Bioinformatics: Which Craft is best?
Sponsored by
2
AIM
  • To give biologists a flavour of sequence analysis
  • protein sequence and structure databases (DBs)
    available
  • the software used to mine and analyse these
  • pros and cons
  • No time today for
  • genome DBs
  • proteomics

3
The Plan
  • Introduction
  • why analyse sequences?
  • sequence alignment
  • The Craft
  • sequence similarity searches
  • pattern searches
  • structure modelling
  • structure prediction
  • 2 example test cases
  • Conclusions: which Craft is best?

4
Why analyse sequences?
  • An evolutionary relationship between two proteins
    often means similar structure and function
  • orphan genes
  • at time of genome publication, up to 1/3 of
    predicted ORFs are orphan
  • judged using sequence similarity
  • sequence-structure deficit

5
Sequence analysis in the real world
  • Gene prediction is still in its infancy
  • problems with De novo prediction
  • alignment with homologous sequences most
    successful
  • genome projects → big business
  • Structure prediction is still in its infancy
  • problems with Ab initio prediction
  • alignment with homologous sequences most
    successful
  • protein structure → big business

6
Sequence Alignment
  • Tabular description of the relationships between
    proteins
  • rows represent individual sequences
  • columns the residue positions
  • Brought into vertical register by introducing
    gaps
  • the relative position of residues within the
    alignment is preserved
  • The result
  • an expression of the similarities and
    dissimilarities between the sequences

7
Online multiple sequence alignment tools
8
Alignment - only a model
  • Allows sequences to be compared and their
    evolutionary relationships assessed.
  • The more divergent the sequences, the more
    distant the structural and functional
    relationships, and the more difficult it is to
    perform the alignment.
  • Percentage identity is an important indicator of
    the level of evolutionary divergence and
    functional/structural similarity between compared
    sequences.
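
As a minimal sketch of how percentage identity can be read off two rows of an alignment: the function name and the choice of denominator (columns where neither row has a gap) are assumptions made here, and other conventions, such as dividing by the shorter sequence length, are also in use.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of gap-free aligned columns that hold identical residues."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    # Keep only columns where both sequences have a residue (no gap).
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


# Five gap-free columns, four of them identical.
print(percent_identity("ACDE-G", "ACDQFG"))
```

Because the denominator changes the number, any quoted identity should say which convention was used.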

9
Different alignment methods have different areas
of optimum application
10
Sequence similarity searches
  • sequence databases

11
Sequence similarity searches in sequence DBs
  • BLAST
  • FASTA
  • BLAT
  • PSI-BLAST
  • Many others
  • Pros
  • good for identifying close homologues
  • often used to identify your cloned sequence
  • quick: one easy step
  • Cons
  • not so good at distant relationships

12
BLAST
  • Heuristic Pairwise alignments
  • Scans through sequence database
  • seeks words of length W that score at least T
    when aligned with the query and scored with a
    substitution matrix
  • hits are extended in both directions to find a
    locally optimal ungapped alignment or HSP (high
    scoring pair)
  • later gaps are added to improve score
  • Provides
  • ranked matches on the basis of the alignment
    scores and statistics
  • alignments
  • links back to the sequence DBs
  • www.ebi.ac.uk/blastall/
  • www.ncbi.nlm.nih.gov/BLAST/
  • www.ch.embnet.org/software/bBLAST.html
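
The seed-and-extend idea described on this slide can be sketched roughly as follows. This is a toy illustration, not the BLAST implementation: the scoring function (+5 match, -1 mismatch) stands in for a real substitution matrix such as PAM or BLOSUM, and the function names, the threshold of 9 and the `x_drop` cut-off are all assumptions chosen for the example. Gapped extension is left out.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def score(a, b, match=5, mismatch=-1):
    """Toy substitution score (assumption; real searches use PAM/BLOSUM)."""
    return match if a == b else mismatch


def neighbourhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    return {
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    }


def find_seeds(query, subject, w=3, threshold=9):
    """Return (query_pos, subject_pos) pairs where a word hit occurs."""
    index = {}
    for i in range(len(query) - w + 1):
        for word in neighbourhood(query[i:i + w], threshold):
            index.setdefault(word, []).append(i)
    return [
        (qi, sj)
        for sj in range(len(subject) - w + 1)
        for qi in index.get(subject[sj:sj + w], [])
    ]


def extend(query, subject, qi, sj, w=3, x_drop=5):
    """Ungapped extension of a seed in both directions.

    Returns (score, query_start, subject_start, length) of the best segment.
    """
    seed = sum(score(query[qi + k], subject[sj + k]) for k in range(w))
    best_r = run = right = 0
    k = w
    while qi + k < len(query) and sj + k < len(subject):
        run += score(query[qi + k], subject[sj + k])
        if run > best_r:
            best_r, right = run, k - w + 1
        elif best_r - run > x_drop:
            break  # score has dropped too far below the best seen so far
        k += 1
    best_l = run = left = 0
    k = 1
    while qi - k >= 0 and sj - k >= 0:
        run += score(query[qi - k], subject[sj - k])
        if run > best_l:
            best_l, left = run, k
        elif best_l - run > x_drop:
            break
        k += 1
    return seed + best_l + best_r, qi - left, sj - left, left + w + right


query, subject = "MKTAYIAKQR", "GGKTAYIAKQW"
print(find_seeds(query, subject)[:3])
print(extend(query, subject, 1, 2))
```

With the toy scores, a threshold of 9 for three-letter words admits any word that matches in at least two positions, which is what lets a seed tolerate one substitution.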

13
BLAST
www.ncbi.nlm.nih.gov/Education/
14
BLAST result statistics
  • Bit score
  • the sum of substitution (e.g. PAM or BLOSUM) and
    gap scores
  • normalised with respect to the scoring system
    used
  • E-value
  • the number of hits expected
  • by chance
  • to obtain a score as good as or better than the
    match
  • in a database of the same size as the one
    searched
  • for a sequence of the same length as the query
  • the lower the better
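
The relationship between raw score, bit score and E-value can be sketched with the standard Karlin-Altschul formulas, E = K·m·n·e^(-λS) and S' = (λS - ln K) / ln 2. The values of λ and K below are illustrative assumptions only (they depend on the scoring system and are not taken from the slides), and real BLAST also corrects the sequence lengths for edge effects, which this sketch ignores.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw alignment score with respect to the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `raw_score`."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Same quantity, computed from the bit score alone."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters (assumptions, not values from the talk).
LAMBDA, K = 0.318, 0.134
m, n = 422, 40_000_000  # query length and total database length

for raw in (50, 100, 200):
    bits = bit_score(raw, LAMBDA, K)
    print(raw, round(bits, 1), e_value(raw, m, n, LAMBDA, K))
```

Because the bit score absorbs λ and K, bit scores from searches with different scoring systems can be compared directly, while the E-value still depends on the query and database sizes.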

15
Pattern Searches
  • the use of patterns in grouping proteins
    together into families

16
Single motif methods
  • Fuzzy regex (eMOTIF)
  • Exact regex (PROSITE)
Full domain alignment methods
  • Profiles (PROFILE LIBRARY)
  • HMMs (Pfam)
Multiple motif methods
  • Identity matrices (PRINTS)
  • Weight matrices (BLOCKS)
17
Single motif methods
  • Original pattern DB approach
  • Use alignments to build RegExps which define
    permitted and not-permitted amino acids in a
    motif e.g.
  • LIVMFYC-SA-SAPGLVFYKQH-G-DENQMW-KRQASPCLIMFW-KRNQSTAVM-KRACLVM-LIVMFYPAN-PHY-LIVMFW-SAGCLIVP-FYWHP-KRHP-LIVMFYWSTA
  • Numbers of false-positive and false-negative hits
    often important
  • if you don't know what you're looking for, you'll
    never know you missed it!
  • www.expasy.ch/prosite/
  • motif.stanford.edu/emotif/
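
To make the regular-expression idea concrete, here is a minimal sketch that translates PROSITE pattern syntax into a Python regular expression. It covers only the common elements: `[...]` for permitted residues, `{...}` for not-permitted residues, `x` for any residue, `(n)` or `(n,m)` for repeats, and `<` / `>` as terminal anchors. The function name and the example pattern are hypothetical, chosen for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):    # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # not-permitted residues
        # "[ABC]" and single letters are already valid regex syntax.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Hypothetical pattern: A or C, any two residues, anything but P, then G.
regex = prosite_to_regex("[AC]-x(2)-{P}-G")
print(regex)
print(bool(re.search(regex, "MMCWWLGKK")), bool(re.search(regex, "MMCWWPGKK")))
```

A pattern either matches or it does not, so a single substitution at a not-permitted position loses the hit entirely. That all-or-nothing behaviour is what drives the false-negative counts mentioned above.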

18
Full domain alignment methods
Alignment (rows are sequences, columns are positions 1-10):
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C

Profile derived from the alignment (rows are residues, columns are positions 1-10):
     1   2   3   4   5   6   7   8   9  10
A  -18 -10  -1  -8   8  -3   3 -10  -2  -8
C  -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D  -35   0 -32 -33  -7   6 -17 -34 -31   0
E  -27  15 -25 -26  -9  23  -9 -24 -23  -1
F   60 -30  12  14 -26 -29 -15   4  12 -29
G  -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H  -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I    3 -27  21  25 -29 -23  -8  33  19 -23
K  -26  25 -25 -27  -6   4 -15 -27 -26   0
L   14 -28  19  27 -27 -20  -9  33  26 -21
M    3 -15  10  14 -17 -10  -9  25  12 -11
N  -22  -6 -24 -27   1   8 -15 -24 -24  -4
P  -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q  -32   5 -25 -26  -9  24 -16 -17 -23   7
R  -18   9 -22 -22 -10   0 -18 -23 -22  -4
S  -22  -8 -16 -21  11   2  -1 -24 -19  -4
T  -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V    0 -25  22  25 -19 -26   6  19  16 -16
W    9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y   34 -18  -1   1 -23 -12 -19   0   0 -18
  • Use profiles (PSSMs or HMMs)
  • More powerful descriptors than RegExps
  • Applicable to whole domains
  • hits.isb-sib.ch/cgi-bin/PFSCAN
  • www.expasy.ch/prosite/
  • pfam.wustl.edu/
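
As a rough sketch of how a profile is used in a search, the code below scores ungapped 10-residue windows against the profile shown on this slide by summing one matrix entry per position. The function names are assumptions made for illustration, and real profile searches also allow insertions and deletions with position-specific penalties, which are omitted here.

```python
# Profile values copied from the slide: one row per residue, one column per position.
PROFILE_TEXT = """
A -18 -10 -1 -8 8 -3 3 -10 -2 -8
C -22 -33 -18 -18 -22 -26 22 -24 -19 -7
D -35 0 -32 -33 -7 6 -17 -34 -31 0
E -27 15 -25 -26 -9 23 -9 -24 -23 -1
F 60 -30 12 14 -26 -29 -15 4 12 -29
G -30 -20 -28 -32 28 -14 -23 -33 -27 -5
H -13 -12 -25 -25 -16 14 -22 -22 -23 -10
I 3 -27 21 25 -29 -23 -8 33 19 -23
K -26 25 -25 -27 -6 4 -15 -27 -26 0
L 14 -28 19 27 -27 -20 -9 33 26 -21
M 3 -15 10 14 -17 -10 -9 25 12 -11
N -22 -6 -24 -27 1 8 -15 -24 -24 -4
P -30 24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32 5 -25 -26 -9 24 -16 -17 -23 7
R -18 9 -22 -22 -10 0 -18 -23 -22 -4
S -22 -8 -16 -21 11 2 -1 -24 -19 -4
T -10 -10 -6 -7 -5 -8 2 -10 -7 -11
V 0 -25 22 25 -19 -26 6 19 16 -16
W 9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y 34 -18 -1 1 -23 -12 -19 0 0 -18
"""

PSSM = {
    fields[0]: [int(v) for v in fields[1:]]
    for fields in (line.split() for line in PROFILE_TEXT.strip().splitlines())
}
WIDTH = 10


def score_window(window: str) -> int:
    """Sum the position-specific score of each residue in a 10-residue window."""
    return sum(PSSM[residue][pos] for pos, residue in enumerate(window))


def best_window(sequence: str):
    """Return (score, start) of the highest-scoring ungapped window."""
    return max(
        (score_window(sequence[i:i + WIDTH]), i)
        for i in range(len(sequence) - WIDTH + 1)
    )


print(score_window("FKLLSHCLLV"))
print(best_window("AAAFKLLSHCLLVAAA"))
```

Unlike a regular expression, an unusual residue at one position only lowers the total score, so a sequence can still be detected if the rest of the window fits the profile well. Residues outside the 20 standard letters would raise a `KeyError` in this sketch.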

19
Hidden Markov Models
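[Figure: profile HMM architecture. A chain runs from "begin" to "end" through match states M1-M6, with insert states I0-I6 and delete states D1-D6.]
Example alignment: GGWdygLFP IGWyntGFP D-WqlnGFP GEWksvGIP
Emission probabilities for M1: G 2/4 = 0.5, I 1/4 = 0.25
  • main (match) states: conserved regions
  • insert states: variable regions
  • delete states: deletions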
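
The emission probabilities quoted for M1 can be reproduced by simple counting. The sketch below assumes the usual convention that upper-case columns are match states, lower-case residues are insertions and "-" marks a deletion; the function names are hypothetical, and real profile HMMs (as in Pfam) add pseudocounts and sequence weighting rather than using raw frequencies.

```python
from collections import Counter

# Example alignment from the slide.
ALIGNMENT = ["GGWdygLFP", "IGWyntGFP", "D-WqlnGFP", "GEWksvGIP"]


def match_columns(alignment):
    """Indices of columns treated as match states (no lower-case residues)."""
    return [
        i for i in range(len(alignment[0]))
        if not any(row[i].islower() for row in alignment)
    ]


def emission_probabilities(alignment):
    """Raw residue frequencies for each match state, ignoring deletions."""
    table = []
    for col in match_columns(alignment):
        residues = [row[col] for row in alignment if row[col] != "-"]
        counts = Counter(residues)
        table.append({res: n / len(residues) for res, n in counts.items()})
    return table


for state, probs in enumerate(emission_probabilities(ALIGNMENT), start=1):
    print(f"M{state}", probs)
```

The six upper-case columns map onto the six match states M1-M6, and the first of them gives G = 2/4 and I = 1/4 as on the slide.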
20
PSI-BLAST
  • BLASTs your sequence against sequence DB
    (pairwise)
  • allows automatic creation of a PSSM from results
    (multiple)
  • BLASTs the PSSM
  • iterations further refine the PSSM adding
    sequences that BLAST alone might have missed
  • if no false positives introduced, this improves
    sensitivity
  • else reduction in selectivity

iterative
www.ncbi.nlm.nih.gov/BLAST/
21
Multiple motif methods
  • PRINTS and BLOCKS
  • bioinf.man.ac.uk/dbbrowser/PRINTS/
  • www.blocks.fhcrc.org/

22
What is PRINTS? (not the best thing since sliced
bread, but....)
  • A DB of diagnostic fingerprints that characterise
    proteins
  • family analysis is hierarchical, allowing
    fine-grained diagnoses
  • Fingerprints are groups of conserved motifs, used
    for iterative DB searching
  • iteration refines the fingerprint
  • potency is gained from the mutual context of
    motif neighbours
  • results are biologically more meaningful than
    from single motifs
  • results are manually annotated prior to inclusion
    in the DB
  • PRINTS has many applications, e.g.
  • basis of BLOCKS and eMOTIF
  • EditToTrEMBL - to annotate TrEMBL
  • provide annotation and hierarchical protein
    classification in InterPro

23
BLOCKS
  • 2 Blocks databases
  • one derived automatically from sequence families
    classified in InterPro,
  • created automatically using 2 different
    motif-detection algorithms
  • the other derived directly from motifs stored in
    PRINTS
  • Calibrated against SWISS-PROT
  • obtain a measure of the chance distribution of
    matches, and hence provide a measure of their
    diagnostic potential
  • With BLOCKS
  • sequence weights calculated using a scoring
    matrix
  • reduce the tendency for over-represented
    sequences to dominate stacks
  • PRINTS uses unweighted residue frequencies

24
InterPro www.ebi.ac.uk/interpro/ Release 5.1,
July 2002
DATABASE     VERSION   ENTRIES   DATE
SWISS-PROT   40.22     110823    24-JUN-2002
TREMBL       21.2      671586    05-JUN-2002
PROSITE      17.5      1565      21-JUN-2002
PREFILE      N/A       252       18-JUL-2001
PFAM         7.3       3865      17-MAY-2002
PRINTS       33.0      1650      24-JAN-2002
PRODOM       20001.3   1346      28-JAN-2002
SMART        3.1       509       16-NOV-2000
TIGRFAMs     1.2       814       03-AUG-2001

25
Typical InterPro Results
  • PRINTS
  • PROSITE Profiles
  • PROSITE Patterns
  • Pfam
  • ProDom
  • SMART

26
Why bother with pattern DBs?
  • Seq searches won't always allow outright
    diagnosis
  • BLAST and FASTA are not infallible
  • BLAST, in particular, often can't assign
    significant scores
  • results may be complicated by the presence of
    modules, or compositionally-biased regions
  • annotations of retrieved hits may be incorrect
  • Pattern DBs contain potent descriptors
  • so, distant relationships missed by BLAST may be
    captured by one or more of the family or
    functional site distillations

27
Test Case 1: Two GPCRs from the SWISS-PROT
annotation
Question: how are they related?
28
SWISS-PROT annotation: 5-hydroxytryptamine 1A
Receptor - GPCR 7TM Topology
[Figure: 7TM topology diagram, residues 0 to 422]
29
BLAST Them!
  • Region matched corresponds to first 5 TM helices
  • 24% identity in matched region
  • → lower identity overall
  • E-value of 4 x 10^-3
  • → match statistics not convincing

30
Searching Prosite patterns, Prosite profiles
and Pfam using Motif Scan
5H1A
  ? pat:G_PROTEIN_RECEP_F1_1    pos. 122 - 138   n.a.
  ! prf:G_PROTEIN_RECEP_F1_2    pos. 53 - 400    E-value 2.1 x 10^-39
  ! pfam:7TM_1                  pos. 53 - 400    E-value 5 x 10^-110
Orphan GPCR
  ? pat:ATP_GTP_A               pos. 324 - 331   n.a.
  ? pat:G_PROTEIN_RECEP_F1_1    pos. 107 - 123   n.a.
  ! prf:G_PROTEIN_RECEP_F1_2    pos. 38 - 286    E-value 2.1 x 10^-28
  ! pfam:7TM_1                  pos. 38 - 286    E-value 2.5 x 10^-50
31
Prosite pattern and profile statistics
  • pat:G_PROTEIN_RECEP_F1_1
  • [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM]
  • pattern for the majority of GPCRs (95% detected)
  • Precision (true hits / (true hits + false
    positives)): 94.29%
  • Recall (true hits / (true hits + false
    negatives)): 90.42%
  • for better sensitivity, user is directed to the
    Profile
  • prf:G_PROTEIN_RECEP_F1_2
  • better sensitivity
  • Precision: 99.92%
  • Recall: 100.00%
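
The two statistics are straightforward to compute once the hits have been classified. The sketch below uses made-up counts purely for illustration; the slides give only the resulting percentages, not the underlying numbers of true hits, false positives and false negatives.

```python
def precision(true_hits: int, false_positives: int) -> float:
    """Percentage of reported hits that are genuine family members."""
    return 100.0 * true_hits / (true_hits + false_positives)


def recall(true_hits: int, false_negatives: int) -> float:
    """Percentage of genuine family members that were reported."""
    return 100.0 * true_hits / (true_hits + false_negatives)


# Hypothetical counts, not taken from PROSITE.
print(precision(true_hits=90, false_positives=10))
print(recall(true_hits=90, false_negatives=30))
```

A pattern can score well on one measure and poorly on the other, which is why both are reported for each PROSITE entry.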

32
FingerPRINTScan 5H1A
33
FingerPRINTScan Orphan GPCR
34
PRINTS GraphScan
  • 5H1A vs. GPCRRHODOPSN
  • Orphan GPCR vs. GPCRRHODOPSN

35
PRINTS GraphScan
  • Orphan GPCR vs. OGR1RECEPTOR
  • 5H1A vs. 5HT1ARECEPTR

36
From the PRINTS annotation
  • The orphan receptor OGR1 has been identified as a
    high affinity receptor for the lysophospholipid,
    sphingosylphosphorylcholine (SPC).
  • Upon activation by SPC, OGR1 couples to both Gi
    proteins, causing increases in intracellular
    calcium, and Gq proteins, to activate MAP
    kinases, inhibiting proliferation. SPC also
    causes regulation of ion channel activity,
    binding of activator protein-1 to DNA, and
    expression of cell adhesion molecule-1 and
    interleukin-6.

37
InterPro www.ebi.ac.uk/interpro/
5H1A
Orphan GPCR
38
BLAST vs. pattern searches: summary
  • Pairwise BLAST (5H1A vs. Orphan GPCR): 4 x 10^-3

Method                                    5H1A            Orphan GPCR
Pattern G_PROTEIN_RECEP_F1_1              n.a.            n.a.
Profile G_PROTEIN_RECEP_F1_2              2.1 x 10^-39    2.1 x 10^-28
Pfam 7TM_1                                5 x 10^-110     2.5 x 10^-50
PRINTS GPCRRHODOPSN                       1 x 10^-54      1.7 x 10^-28
BLOCKS Rhodopsin-like GPCR superfamily    2.8 x 10^-28    1.1 x 10^-16

39
Structure prediction
  • areas of application

40
Classical secondary structure prediction
  • empirical statistical methods
  • parameters from known structures
  • machine learning methods
  • trained using known secondary structures
  • assumptions
  • all the information for folding is contained in
    the sequence.
  • many proteins require chaperones to achieve their
    correct fold
  • disulphide bonds and/or PTMs that affect folding
    are also dependent on cellular conditions
  • examining short sequence windows (e.g., 10-20
    residues) is sufficient to provide robust
    predictions
  • elements of sequence not adjacent in 2D are often
    close in 3D, and are hence likely to influence
    the final fold
  • jura.ebi.ac.uk:8888/
  • npsa-pbil.ibcp.fr

41
The importance of consensus
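
One simple way to combine several secondary-structure predictions is a per-residue majority vote. The sketch below is an assumption about how such a consensus could be formed, not the method used by any particular server; it uses the h/e/c (helix/strand/coil) letters and marks positions without a strict majority with "?".

```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        # Require a strict majority; otherwise flag the position as uncertain.
        result.append(state if votes > len(predictions) / 2 else "?")
    return "".join(result)


# Three hypothetical predictions for the same 8-residue stretch.
print(consensus(["cchhhhee", "cchhhcee", "chhhhhec"]))
```

Positions where the methods disagree are often the ends of helices and strands, so the "?" marks are themselves informative.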
42
Homology Modelling
  • geno3d-pbil.ibcp.fr
  • www.expasy.ch/swissmod/

43
Homology modelling limitations: 3DCrunch
44
Fold recognition
  • Low ID
  • when neither simple sequence alignment nor
    homology modelling possible
  • Relies on
  • limited number of folds
  • secondary structure being more conserved than
    sequence
  • Seeks compatible folds for a sequence within fold
    template databases
  • Results
  • alignment with fold and secondary structure
    prediction
  • bioinf.cs.ucl.ac.uk/psipred/
  • www.sbg.bio.ic.ac.uk/3dpssm/

45
Fold recognition - GenTHREADER
[Figure: GenTHREADER neural network.
Input layer: pairwise energy, solvation energy, alignment score, alignment length, length of structure, length of sequence.
Hidden layer.
Output layer: proteins related, proteins unrelated.]
46
Fold Recognition 3D-PSSM
  • List of "Master Proteins" (SCOP-defined domains
    and whole PDB chains)
  • 1D-PSSMs
  • searched iteratively against NRPROT with
    PSIBLAST, and the hits are aligned to create a
    1D-PSSM
  • 3D-PSSMs
  • built by aligning, in 3D, protein structures classed
    in the same superfamily (provided they align well
    enough) using the SAP program.
  • initially, the closest fitting (lowest RMS)
    structure is added to the alignment
  • built in a hierarchical fashion, progressively
    adding alignments that are closest to an existing
    member of the alignment
  • Crude 3D models

47
Test case 2 - uncharacterised DNA fragment
  • PhD student at the ExtraCellular Matrix unit
    isolates an unknown clone
  • It interacts with other proteins he is interested
    in
  • BLAST identifies it as CD14 or LPS-receptor
  • Student needs to find out as much as possible
    about its structure to develop assays

48
CD14 SWISSPROT feature table
49
When little is known
  • InterProScan - few results
  • "It is, however, now clear that all major classes
    of LRR have curved horseshoe structures with a
    parallel β-sheet on the concave side and mostly
    helical elements on the convex side. The concave
    face and the adjacent loops are the most common
    protein interaction surfaces on LRR proteins."
    (from the annotation)

50
BLASTx against PIR-NRL3D
  • No good matches
  • What to do?

51
→ No homology model possible
  • Welcome to the SWISS-MODEL Protein Modelling
    Server

  • Your modeling request could not be carried out.
  • Please look at the TraceLog file issued by the
    server.
  • The degree of similarity of your sequence with
    proteins of
  • known 3D structure may be too low.
  • At present, SWISS-MODEL will generate models for
    sequences
  • which respond to these criteria
  • BLAST search P value
  • Global degree of sequence identity (SIM) 25
  • Minimal projected model length 25 aa.

  • Length of target sequence 375 residues
  • Searching sequences of known 3D structures

52
Fold recognition - GenTHREADER
  • Medium-confidence hits to several PDBs
  • Not picked up by BLAST
  • 1yrg
  • Gtpase-activating protein rna1_schpo
  • 1fqv
  • scf ubiquitin ligase Skp2
  • 1dfjI
  • Ribonuclease a.
  • 1a4y
  • Ribonuclease inhibitor
  • 1d0b
  • Internalin B Leucine Rich Repeat Domain
  • Common theme
  • Protein-protein interactions

53
Fold recognition - GenTHREADER
54
3D-PSSM manages to generate crude models
  • Viewed in rasmol
  • www.bernstein-plus-sons.com/software/rasmol/

55
Test Case 2: Summary
  • LRR annotation
  • parallel β-sheet on the concave side and mostly
    helical elements on the convex side
  • Similar results with 3D-PSSM
  • ccccceeecccccchhhhhhhh repeats
  • Able to supply collaborator with a
    medium-confidence secondary-structure
    prediction
  • Help develop an assay

56
Structure prediction summary
  • Classical secondary structure prediction
  • Consensus is important
  • Homology modelling
  • Useful if template structure is at least 30-40%
    ID to query
  • Fold Recognition
  • If all else fails
  • But more sensitive than classical secondary
    structure prediction
  • Can sometimes produce simple 3D models
  • A model can look very convincing

57
Conclusions
  • Different application areas
  • sequence DB searches
  • BLAST useful for finding close homologues
    (pairwise alignments)
  • PSI-BLAST can improve sensitivity (multiple
    alignments and PSSMs)
  • But not necessarily selectivity: danger of
    false positives!
  • pattern DB searches
  • into the twilight zone
  • group (and sometimes classify) related proteins
    together based on sequence similarity
  • contain more potent descriptors of conserved
    protein features
  • improves the signal to noise by searching the
    interesting bits
  • structure prediction (structure/fold DB searches)
  • from the twilight zone into the midnight zone
  • helps address the sequence-structure deficit,
    whenever the structure of the query is not known
  • how well it can do this depends on the method and
    on how similar the closest structure is

58
Online tutorials
  • Bioactivity
  • online now
  • www.bioinf.man.ac.uk/dbbrowser/bioactivity
  • European Multimedia Bioinformatics Educational
    Resource
  • online by April
  • www.bioinf.man.ac.uk/ember
  • accompanied by a book and CD-ROM

59
Thanks to
  • Neil Maudling, Jane Mabey, Ioannis Selimas, Anna
    Gaulton, Grigoris Amoutzias, George Moulton
  • for advice on colour schemes and useful
    discussions!
  • Crispin Miller
  • for use of his alignment editing software!
  • Terri Attwood
  • for letting me have a holid-er give this talk in
    Crete!
  • everyone_at_UMBER the Manchester University
    EMBnet Node
  • for being a great bunch to work with!
  • Sarah Blackford
  • for organising everything!

60
Giles Velarde velarde_at_bioinf.man.ac.uk
Thank you!
Sponsored by