Title: Introduction to Bioinformatics
1 Giles Velarde velarde_at_bioinf.man.ac.uk
Introduction to Bioinformatics: Which Craft is best?
Sponsored by
2 AIM
- To give biologists a flavour of sequence analysis
  - protein sequence and structure databases (DBs) available
  - the software used to mine and analyse these
  - pros and cons
- No time today for
  - genome DBs
  - proteomics
3 The Plan
- Introduction
  - why analyse sequences?
  - sequence alignment
- The Craft
  - sequence similarity searches
  - pattern searches
  - structure modelling
  - structure prediction
- 2 example test cases
- Conclusions: which Craft is best?
4 Why analyse sequences?
- Evolutionary relationship between two proteins often means similar structure and function
- orphan genes
  - at time of genome publication, up to 1/3 of predicted ORFs are orphan
  - judged using sequence similarity
- sequence-structure deficit
5 Sequence analysis in the real world
- Gene prediction is still in its infancy
  - problems with de novo prediction
  - alignment with homologous sequences most successful
  - genome projects -> big business
- Structure prediction is still in its infancy
  - problems with ab initio prediction
  - alignment with homologous sequences most successful
  - protein structure -> big business
6 Sequence Alignment
- Tabular description of the relationships between proteins
  - rows represent individual sequences
  - columns the residue positions
- Brought into vertical register by introducing gaps
  - the relative position of residues within the alignment is preserved
- The result
  - an expression of the similarities and dissimilarities between the sequences
7 Online multiple sequence alignment tools
8 Alignment - only a model
- Allows sequences to be compared and their evolutionary relationships assessed.
- The more divergent the sequences, the more distant the structural and functional relationships, and the more difficult it is to perform the alignment.
- Percentage identity is an important indicator of the level of evolutionary divergence and functional/structural similarity between compared sequences.
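As a rough illustration of how identity is read off an alignment, the sketch below counts identical residues over the aligned columns of two gapped rows. The function name `percent_identity` and the choice to divide by the number of columns in which neither row has a gap are assumptions made for this example; other tools divide by the full alignment length or by the shorter sequence, which gives different figures. The two rows are illustrative strings only.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    compared = 0
    identical = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == gap or b == gap:
            continue  # a gapped column carries no residue pair to compare
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Two rows of an alignment, already brought into register with a gap.
row_1 = "GGWDYGLFP"
row_2 = "D-WQLNGFP"
print(percent_identity(row_1, row_2))
```

Whichever denominator is chosen, it should be stated alongside the figure, since the same pair of sequences can otherwise appear more or less similar.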
9 Different alignment methods have different areas of optimum application
10 Sequence similarity searches
11 Sequence similarity searches in sequence DBs
- BLAST
- FASTA
- BLAT
- PSI-BLAST
- Many others
- Pros
  - good for identifying close homologues
  - often used to identify your cloned sequence
  - quick: one easy step
- Cons
  - not so good at distant relationships
12 BLAST
- Heuristic pairwise alignments
- Scans through sequence database
  - seeks words of length W that score at least T when aligned with the query and scored with a substitution matrix
  - hits are extended in both directions to find a locally optimal ungapped alignment or HSP (high-scoring segment pair)
  - later gaps are added to improve score
- Provides
  - ranked matches on the basis of the alignment scores and statistics
  - alignments
  - links back to the sequence DBs
- www.ebi.ac.uk/blastall/
- www.ncbi.nlm.nih.gov/BLAST/
- www.ch.embnet.org/software/bBLAST.html
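The seed-and-extend idea on this slide can be sketched in a few lines. This is a toy under stated assumptions, not BLAST itself: `pair_score` is a hypothetical match/mismatch stand-in for a real substitution matrix, the seed search compares every word pair directly rather than using an index of neighbourhood words, and the extension stops with a simple drop-off rule (`x_drop`). All function names and parameter values are invented for the illustration.

```python
def pair_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Stand-in for a substitution matrix lookup (assumed values, not PAM/BLOSUM)."""
    return match if a == b else mismatch


def word_score(query: str, subject: str, i: int, j: int, w: int) -> int:
    """Score of the ungapped word pair query[i:i+w] vs subject[j:j+w]."""
    return sum(pair_score(query[i + k], subject[j + k]) for k in range(w))


def find_seeds(query: str, subject: str, w: int = 3, t: int = 15):
    """All (i, j) word pairs of length w whose score is at least t."""
    return [
        (i, j)
        for i in range(len(query) - w + 1)
        for j in range(len(subject) - w + 1)
        if word_score(query, subject, i, j, w) >= t
    ]


def _extend_one_way(query, subject, qi, sj, step, start_score, x_drop):
    """Walk along the diagonal; return (best score, number of residues kept)."""
    best = running = start_score
    kept = walked = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        running += pair_score(query[qi], subject[sj])
        walked += 1
        if running > best:
            best, kept = running, walked
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen so far
        qi += step
        sj += step
    return best, kept


def extend_seed(query, subject, i, j, w=3, x_drop=10):
    """Ungapped extension of a seed in both directions.

    Returns (score, query_start, query_end, subject_start); query_end is exclusive.
    """
    score = word_score(query, subject, i, j, w)
    score, right = _extend_one_way(query, subject, i + w, j + w, +1, score, x_drop)
    score, left = _extend_one_way(query, subject, i - 1, j - 1, -1, score, x_drop)
    return score, i - left, i + w + right, j - left


query = "MKTAYIAKQR"
subject = "GGKTAYLAKQW"
for i, j in find_seeds(query, subject):
    print((i, j), extend_seed(query, subject, i, j))
```

With a real substitution matrix, similar but non-identical words can also reach the threshold T, which is what lets the method seed alignments between diverged sequences.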
13 BLAST
www.ncbi.nlm.nih.gov/Education/
14 BLAST result statistics
- Bit score
  - the sum of substitution (e.g. PAM or BLOSUM) and gap scores
  - normalised with respect to the scoring system used
- E-value
  - the number of hits expected by chance with a score as good as or better than the match
  - in a database of the same size as the one searched
  - for a sequence of the same length as the query
  - the lower the better
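The link between the two statistics can be made concrete with the relations usually quoted for BLAST: a raw score S is normalised to a bit score S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S'), where m and n are the query and database lengths. The sketch below assumes these relations; the function names are invented, and the lambda and K values in the usage line are illustrative placeholders for the constants that depend on the scoring system.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalise a raw alignment score with the scoring-system constants lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bits` in this search space."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative numbers only: a 300-residue query against 10^8 database residues.
bits = bit_score(raw_score=100, lam=0.267, k=0.041)
print(round(bits, 1), e_value(bits, query_len=300, db_len=10**8))
```

Because E scales with m * n, the same bit score becomes less significant as the database grows, which is why E-values from searches against different databases are not directly comparable.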
15 Pattern Searches
- the use of patterns in grouping proteins together into families
16 Single motif methods
- Fuzzy regex (eMOTIF)
- Exact regex (PROSITE)
Full domain alignment methods
- Profiles (PROFILE LIBRARY)
- HMMs (Pfam)
Multiple motif methods
- Identity matrices (PRINTS)
- Weight matrices (BLOCKS)
17 Single motif methods
- Original pattern DB approach
- Use alignments to build RegExps which define permitted and not-permitted amino acids in a motif, e.g.
  - LIVMFYC-SA-SAPGLVFYKQH-G-DENQMW-KRQASPCLIMFW-KRNQSTAVM-KRACLVM-LIVMFYPAN-PHY-LIVMFW-SAGCLIVP-FYWHP-KRHP-LIVMFYWSTA
- Numbers of false-positive and false-negative hits often important
  - if you don't know what you're looking for, you'll never know you missed it!
- www.expasy.ch/prosite/
- motif.stanford.edu/emotif/
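To show how a PROSITE-style pattern behaves as a regular expression, here is a minimal converter sketch. It assumes the usual PROSITE notation (square brackets for permitted residues, braces for not-permitted residues, x for any residue, a count or range in parentheses) and handles only that subset. The function name `prosite_to_regex` is invented for the example, the test pattern is the short ATP/GTP-binding site motif A (P-loop) consensus, and the sequence is just an example fragment.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    tokens = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # not-permitted residues
        else:
            token = core                       # single residue or [permitted set]
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        tokens.append(token)
    return "".join(tokens)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
print(p_loop)

fragment = "MTEYKLVVVGAGGVGKSALTIQ"  # example sequence fragment
hit = re.search(p_loop, fragment)
print(hit.start(), hit.group())
```

A match is all-or-nothing: a single unexpected residue at one position makes the whole pattern miss, which is the source of the false negatives mentioned on this slide.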
18 Full domain alignment methods
Aligned sequences (7 sequences, 10 positions):
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C
Profile built from the alignment (rows: residues; columns: positions 1-10):
A  -18 -10  -1  -8   8  -3   3 -10  -2  -8
C  -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D  -35   0 -32 -33  -7   6 -17 -34 -31   0
E  -27  15 -25 -26  -9  23  -9 -24 -23  -1
F   60 -30  12  14 -26 -29 -15   4  12 -29
G  -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H  -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I    3 -27  21  25 -29 -23  -8  33  19 -23
K  -26  25 -25 -27  -6   4 -15 -27 -26   0
L   14 -28  19  27 -27 -20  -9  33  26 -21
M    3 -15  10  14 -17 -10  -9  25  12 -11
N  -22  -6 -24 -27   1   8 -15 -24 -24  -4
P  -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q  -32   5 -25 -26  -9  24 -16 -17 -23   7
R  -18   9 -22 -22 -10   0 -18 -23 -22  -4
S  -22  -8 -16 -21  11   2  -1 -24 -19  -4
T  -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V    0 -25  22  25 -19 -26   6  19  16 -16
W    9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y   34 -18  -1   1 -23 -12 -19   0   0 -18
- Use profiles (PSSMs or HMMs)
  - More powerful descriptors than RegExps
  - Applicable to whole domains
- hits.isb-sib.ch/cgi-bin/PFSCAN
- www.expasy.ch/prosite/
- pfam.wustl.edu/
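A profile is used by sliding it along a sequence and summing, position by position, the score of whichever residue is found there. The sketch below does this with the 20 x 10 profile shown on this slide, transcribed into a dictionary. The names `PROFILE`, `score_window` and `best_window` are invented for the example, and gap handling is deliberately left out, so this is an ungapped simplification rather than a full profile search.

```python
# Position-specific scores from the slide: one row per residue, 10 positions.
PROFILE = {
    "A": [-18, -10, -1, -8, 8, -3, 3, -10, -2, -8],
    "C": [-22, -33, -18, -18, -22, -26, 22, -24, -19, -7],
    "D": [-35, 0, -32, -33, -7, 6, -17, -34, -31, 0],
    "E": [-27, 15, -25, -26, -9, 23, -9, -24, -23, -1],
    "F": [60, -30, 12, 14, -26, -29, -15, 4, 12, -29],
    "G": [-30, -20, -28, -32, 28, -14, -23, -33, -27, -5],
    "H": [-13, -12, -25, -25, -16, 14, -22, -22, -23, -10],
    "I": [3, -27, 21, 25, -29, -23, -8, 33, 19, -23],
    "K": [-26, 25, -25, -27, -6, 4, -15, -27, -26, 0],
    "L": [14, -28, 19, 27, -27, -20, -9, 33, 26, -21],
    "M": [3, -15, 10, 14, -17, -10, -9, 25, 12, -11],
    "N": [-22, -6, -24, -27, 1, 8, -15, -24, -24, -4],
    "P": [-30, 24, -26, -28, -14, -10, -22, -24, -26, -18],
    "Q": [-32, 5, -25, -26, -9, 24, -16, -17, -23, 7],
    "R": [-18, 9, -22, -22, -10, 0, -18, -23, -22, -4],
    "S": [-22, -8, -16, -21, 11, 2, -1, -24, -19, -4],
    "T": [-10, -10, -6, -7, -5, -8, 2, -10, -7, -11],
    "V": [0, -25, 22, 25, -19, -26, 6, 19, 16, -16],
    "W": [9, -25, -18, -19, -25, -27, -34, -20, -17, -28],
    "Y": [34, -18, -1, 1, -23, -12, -19, 0, 0, -18],
}
WIDTH = 10


def score_window(window: str) -> int:
    """Sum the profile scores for a window exactly as wide as the profile."""
    if len(window) != WIDTH:
        raise ValueError("window must be %d residues long" % WIDTH)
    return sum(PROFILE[res][pos] for pos, res in enumerate(window))


def best_window(sequence: str):
    """Return (score, start) of the best-scoring ungapped window."""
    return max(
        (score_window(sequence[i:i + WIDTH]), i)
        for i in range(len(sequence) - WIDTH + 1)
    )


print(score_window("FKLLSHCLLV"))     # first sequence of the alignment
print(best_window("AAFKLLSHCLLVAA"))  # the same segment embedded in flanking residues
```

Unlike a regular expression, the profile gives every residue at every position a graded score, so a sequence with one unusual residue is penalised rather than rejected outright.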
19 Hidden Markov Models
[Figure: profile HMM running from begin to end, with main states M1-M6, insert states I0-I6 and delete states D1-D6]
- main state: conserved regions
- insert state: variable regions
- delete state: deletions
- example alignment: GGWdygLFP IGWyntGFP D-WqlnGFP GEWksvGIP
- emission probabilities at M1: G 2/4 = 0.5, I 1/4 = 0.25
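The M1 probabilities on this slide are simple column frequencies from the four aligned sequences. The sketch below recomputes them; the function names are invented for the example, and it uses raw counts only, whereas practical profile HMMs also add pseudocounts and estimate transition probabilities, which are omitted here.

```python
from collections import Counter

# The four aligned sequences from the slide (lower case marks the variable region).
ALIGNMENT = ["GGWdygLFP", "IGWyntGFP", "D-WqlnGFP", "GEWksvGIP"]


def match_emissions(alignment, column):
    """Residue frequencies in one alignment column, ignoring gapped rows."""
    residues = [row[column] for row in alignment if row[column] != "-"]
    counts = Counter(residues)
    return {res: n / len(residues) for res, n in counts.items()}


def gap_fraction(alignment, column):
    """Fraction of rows with a gap in this column (rows that would use a delete state)."""
    return sum(row[column] == "-" for row in alignment) / len(alignment)


print(match_emissions(ALIGNMENT, 0))  # first column: G, I, D, G
print(match_emissions(ALIGNMENT, 2))  # fully conserved W
print(gap_fraction(ALIGNMENT, 1))     # one of four rows has a gap
```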
20 PSI-BLAST
- BLASTs your sequence against sequence DB (pairwise)
- allows automatic creation of a PSSM from results (multiple)
- BLASTs the PSSM
- iterations further refine the PSSM, adding sequences that BLAST alone might have missed
  - if no false positives introduced, this improves sensitivity
  - else reduction in selectivity
- www.ncbi.nlm.nih.gov/BLAST/
21 Multiple motif methods
- PRINTS and BLOCKS
- bioinf.man.ac.uk/dbbrowser/PRINTS/
- www.blocks.fhcrc.org/
22 What is PRINTS? (not the best thing since sliced bread, but....)
- A DB of diagnostic fingerprints that characterise proteins
  - family analysis is hierarchical, allowing fine-grained diagnoses
- Fingerprints are groups of conserved motifs, used for iterative DB searching
  - iteration refines the fingerprint
  - potency is gained from the mutual context of motif neighbours
  - results are biologically more meaningful than from single motifs
  - results are manually annotated prior to inclusion in the DB
- PRINTS has many applications, e.g.
  - basis of BLOCKS and eMOTIF
  - EditToTrEMBL - to annotate TrEMBL
  - provides annotation and hierarchical protein classification in InterPro
23 BLOCKS
- 2 Blocks databases
  - one derived automatically from sequence families classified in InterPro, created automatically using 2 different motif-detection algorithms
  - the other derived directly from motifs stored in PRINTS
- Calibrated against SWISS-PROT
  - to obtain a measure of the chance distribution of matches, and hence provide a measure of their diagnostic potential
- With BLOCKS
  - sequence weights calculated using a scoring matrix
  - reduce the tendency for over-represented sequences to dominate stacks
- PRINTS uses unweighted residue frequencies
24 InterPro www.ebi.ac.uk/interpro/ Release 5.1, July 2002
DATABASE    VERSION  ENTRIES  DATE
SWISS-PROT  40.22    110823   24-JUN-2002
TREMBL      21.2     671586   05-JUN-2002
PROSITE     17.5     1565     21-JUN-2002
PREFILE     N/A      252      18-JUL-2001
PFAM        7.3      3865     17-MAY-2002
PRINTS      33.0     1650     24-JAN-2002
PRODOM      20001.3  1346     28-JAN-2002
SMART       3.1      509      16-NOV-2000
TIGRFAMs    1.2      814      03-AUG-2001
25 Typical InterPro Results
- PRINTS
- PROSITE Profiles
- PROSITE Patterns
- Pfam
- ProDom
- SMART
26 Why bother with pattern DBs?
- Sequence searches won't always allow outright diagnosis
  - BLAST and FASTA are not infallible
  - BLAST, in particular, often can't assign significant scores
  - results may be complicated by the presence of modules, or compositionally-biased regions
  - annotations of retrieved hits may be incorrect
- Pattern DBs contain potent descriptors
  - so, distant relationships missed by BLAST may be captured by one or more of the family or functional site distillations
27 Test Case 1: Two GPCRs (from the SWISS-PROT annotation)
Question: how are they related?
28 SWISS-PROT annotation: 5-hydroxytryptamine 1A Receptor - GPCR 7TM Topology
[Figure: 7TM topology diagram, residues 0 to 422]
29 BLAST Them!
- Region matched corresponds to first 5 TM helices
- 24% identity in matched region
  -> lower identity overall
- E-value of 4e-3
  -> match statistics not convincing
30 Searching Prosite patterns, Prosite profiles and Pfam using Motif Scan
5H1A
? pat:G_PROTEIN_RECEP_F1_1  pos. 122 - 138  n.a.
! prf:G_PROTEIN_RECEP_F1_2  pos. 53 - 400   E-value 2.1e-39
! pfam:7TM_1                pos. 53 - 400   E-value 5e-110
Orphan GPCR
? pat:ATP_GTP_A             pos. 324 - 331  n.a.
? pat:G_PROTEIN_RECEP_F1_1  pos. 107 - 123  n.a.
! prf:G_PROTEIN_RECEP_F1_2  pos. 38 - 286   E-value 2.1e-28
! pfam:7TM_1                pos. 38 - 286   E-value 2.5e-50
31 Prosite pattern and profile statistics
- pat:G_PROTEIN_RECEP_F1_1
  - [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM]
  - pattern for the majority of GPCRs (95 detected)
  - Precision (true hits / (true hits + false positives)): 94.29%
  - Recall (true hits / (true hits + false negatives)): 90.42%
  - for better sensitivity, user is directed to the Profile
- prf:G_PROTEIN_RECEP_F1_2
  - better sensitivity
  - Precision: 99.92%
  - Recall: 100.00%
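The two statistics follow directly from the counts of true hits, false positives and false negatives. The sketch below uses hypothetical counts chosen only to make the arithmetic easy to follow; they are not the PROSITE figures behind the percentages on this slide.

```python
def precision(true_hits: int, false_positives: int) -> float:
    """Fraction of reported hits that are genuine family members."""
    return true_hits / (true_hits + false_positives)


def recall(true_hits: int, false_negatives: int) -> float:
    """Fraction of genuine family members that the descriptor detects."""
    return true_hits / (true_hits + false_negatives)


# Hypothetical counts for illustration only.
true_hits, false_positives, false_negatives = 90, 10, 30
print("precision: %.2f%%" % (100 * precision(true_hits, false_positives)))
print("recall:    %.2f%%" % (100 * recall(true_hits, false_negatives)))
```

A descriptor can score well on one measure and poorly on the other, so both need to be read together when choosing between a pattern and a profile.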
32 FingerPRINTScan: 5H1A
33 FingerPRINTScan: Orphan GPCR
34 PRINTS GraphScan
- Orphan GPCR vs. GPCRRHODOPSN
35 PRINTS GraphScan
- Orphan GPCR vs. OGR1RECEPTOR
36 From the PRINTS annotation
- The orphan receptor OGR1 has been identified as a high affinity receptor for the lysophospholipid, sphingosylphosphorylcholine (SPC).
- Upon activation by SPC, OGR1 couples to both Gi proteins, causing increases in intracellular calcium, and Gq proteins, to activate MAP kinases, inhibiting proliferation. SPC also causes regulation of ion channel activity, binding of activator protein-1 to DNA, and expression of cell adhesion molecule-1 and interleukin-6.
37 InterPro www.ebi.ac.uk/interpro/
5H1A
Orphan GPCR
38 BLAST vs. pattern searches: summary
- Pairwise BLAST: 4e-3
Method                                  5H1A     Orphan GPCR
Pattern G_PROTEIN_RECEP_F1_1            n.a.     n.a.
Profile G_PROTEIN_RECEP_F1_2            2.1e-39  2.1e-28
Pfam 7TM_1                              5e-110   2.5e-50
PRINTS GPCRRHODOPSN                     1e-54    1.7e-28
BLOCKS Rhodopsin-like GPCR superfamily  2.8e-28  1.1e-16
39 Structure prediction
40 Classical secondary structure prediction
- empirical statistical methods
  - parameters from known structures
- machine learning methods
  - trained using known secondary structures
- assumptions
  - all the information for folding is contained in the sequence.
    - many proteins require chaperones to achieve their correct fold
    - disulphide bonds and/or PTMs that affect folding are also dependent on cellular conditions
  - examining short sequence windows (e.g., 10-20 residues) is sufficient to provide robust predictions
    - elements of sequence not adjacent in 2D are often close in 3D, and are hence likely to influence the final fold
- jura.ebi.ac.uk:8888/
- npsa-pbil.ibcp.fr
41 The importance of consensus
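One simple way to form a consensus is a per-residue majority vote across the outputs of several secondary structure predictors. The sketch below assumes each prediction is a string of the same length using c (coil), e (strand) and h (helix); the function name, the tie symbol and the three example predictions are invented for the illustration, and consensus servers may weight methods differently.

```python
from collections import Counter


def consensus(predictions, undecided="?"):
    """Per-position majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        top_state, top_votes = ranked[0]
        # A tie for first place means the methods disagree with no majority.
        tied = len(ranked) > 1 and ranked[1][1] == top_votes
        result.append(undecided if tied else top_state)
    return "".join(result)


# Hypothetical outputs of three methods for the same 13-residue stretch.
method_1 = "ccceeeccchhhh"
method_2 = "cceeeecchhhhh"
method_3 = "ccceeecccchhh"
print(consensus([method_1, method_2, method_3]))
```

Positions where the methods disagree are usually at the ends of helices and strands, so the vote mainly stabilises element boundaries.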
42 Homology Modelling
- geno3d-pbil.ibcp.fr
- www.expasy.ch/swissmod/
43 Homology modelling limitations: 3DCrunch
44 Fold recognition
- Low % ID
  - when neither simple sequence alignment nor homology modelling possible
- Relies on
  - limited number of folds
  - secondary structure being more conserved than sequence
- Seeks compatible folds for a sequence within fold template databases
- Results
  - alignment with fold and secondary structure prediction
- bioinf.cs.ucl.ac.uk/psipred/
- www.sbg.bio.ic.ac.uk/3dpssm/
45 Fold recognition - GenTHREADER
[Figure: neural network. Input layer: pairwise energy, solvation energy, alignment score, alignment length, length of structure, length of sequence. Hidden layer. Output layer: proteins related, proteins unrelated]
46 Fold Recognition: 3D-PSSM
- List of "Master Proteins" (SCOP-defined domains and whole PDB chains)
- 1D-PSSMs
  - searched iteratively against NRPROT with PSI-BLAST, and the hits are aligned to create a 1D-PSSM
- 3D-PSSMs
  - aligning in 3D protein structures classed in the same superfamily (provided they align well enough) using the SAP program.
  - initially, the closest fitting (lowest RMS) structure is added to the alignment
  - built in a hierarchical fashion, progressively adding alignments that are closest to an existing member of the alignment
- Crude 3D models
47 Test case 2 - uncharacterised DNA fragment
- PhD student at the ExtraCellular Matrix unit isolates an unknown clone
- It interacts with other proteins he is interested in
- BLAST identifies it as CD14 or LPS-receptor
- Student needs to find out as much as possible about its structure to develop assays
48 CD14 SWISSPROT feature table
49 When little is known
- InterProScan - few results
- From the annotation: "It is, however, now clear that all major classes of LRR have curved horseshoe structures with a parallel β-sheet on the concave side and mostly helical elements on the convex side. The concave face and the adjacent loops are the most common protein interaction surfaces on LRR proteins."
50 BLASTx against PIR-NRL3D
- No good matches
- What to do?
51 -> No homology model possible
- Welcome to the SWISS-MODEL Protein Modelling Server
- Your modeling request could not be carried out.
- Please look at the TraceLog file issued by the server.
- The degree of similarity of your sequence with proteins of known 3D structure may be to low.
- At present, SWISS-MODEL will generate models for sequences which respond to these criteria
  - BLAST search P value
  - Global degree of sequence identity (SIM) 25
  - Minimal projected model length 25 aa.
- Length of target sequence 375 residues
- Searching sequences of known 3D structures
52 Fold recognition - GenTHREADER
- Medium-confidence hits to several PDBs
  - Not picked up by BLAST
- 1yrg
  - GTPase-activating protein rna1_schpo
- 1fqv
  - SCF ubiquitin ligase Skp2
- 1dfjI
  - Ribonuclease a.
- 1a4y
  - Ribonuclease inhibitor
- 1d0b
  - Internalin B Leucine Rich Repeat Domain
- Common theme
  - Protein-protein interactions
53 Fold recognition - GenTHREADER
54 3D-PSSM manages to generate crude models
- Viewed in RasMol
- www.bernstein-plus-sons.com/software/rasmol/
55 Test Case 2: Summary
- LRR annotation
  - parallel β-sheet on the concave side and mostly helical elements on the convex side
- Similar results with 3D-PSSM
  - ccccceeecccccchhhhhhhh repeats
- Able to supply collaborator with a medium-confidence secondary-structure prediction
  - Help develop an assay
56 Structure prediction summary
- Classical secondary structure prediction
  - Consensus is important
- Homology modelling
  - Useful if template structure is at least 30-40% ID to query
- Fold Recognition
  - If all else fails
  - But more sensitive than classical secondary structure prediction
  - Can sometimes produce simple 3D models
- A model can look very convincing
57 Conclusions
- Different application areas
- sequence DB searches
  - BLAST useful for finding close homologues (pairwise alignments)
  - PSI-BLAST can improve sensitivity (multiple alignments and PSSMs)
  - But not necessarily selectivity: danger of false positives!
- pattern DB searches
  - into the twilight zone
  - group (and sometimes classify) related proteins together based on sequence similarity
  - contain more potent descriptors of conserved protein features
  - improve the signal-to-noise ratio by searching the interesting bits
- structure prediction (structure/fold DB searches)
  - from the twilight zone into the midnight zone
  - helps address the sequence-structure deficit, whenever the structure of the query is not known
  - how well it can do this depends on the method and on how similar the closest structure is
58 Online tutorials
- Bioactivity
  - online now
  - www.bioinf.man.ac.uk/dbbrowser/bioactivity
- European Multimedia Bioinformatics Educational Resource
  - online by April
  - www.bioinf.man.ac.uk/ember
  - accompanied by a book and CD-ROM
59 Thanks to
- Neil Maudling, Jane Mabey, Ioannis Selimas, Anna Gaulton, Grigoris Amoutzias, George Moulton
  - for advice on colour schemes and useful discussions!
- Crispin Miller
  - for use of his alignment editing software!
- Terri Attwood
  - for letting me have a holid-er give this talk in Crete!
- everyone_at_UMBER, the Manchester University EMBnet Node
  - for being a great bunch to work with!
- Sarah Blackford
  - for organising everything!
60 Giles Velarde velarde_at_bioinf.man.ac.uk
Thank you!