Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
Lecture 12: Iterative homology searching and Protein Structure-Function Relationships
Centre for Integrative Bioinformatics VU (IBIVU)
2PSI (Position Specific Iterated) BLAST
- basic idea
- use results from BLAST query to construct a profile matrix
- search database with profile instead of query sequence
- iterate
3A Profile Matrix (Position Specific Scoring
Matrix PSSM)
This is the same as a profile without
position-specific gap penalties
4PSI BLAST
- Searching with a Profile
- aligning profile matrix to a simple sequence
- like aligning two sequences
- except score for aligning a character with a matrix position is given by the matrix itself, not a substitution matrix
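A minimal sketch of searching with a profile, assuming an ungapped comparison and a toy three-column matrix. The scores in `PSSM`, the `DEFAULT_SCORE` value and the function names are made up for illustration; real PSI-BLAST uses gapped alignment.

```python
# Toy PSSM: one dict of residue -> score per profile column.
# The numbers are invented for illustration, not taken from a real profile.
PSSM = [
    {"A": 4, "C": -2, "D": -1},
    {"A": -1, "C": 6, "D": -3},
    {"A": -2, "C": -3, "D": 5},
]
DEFAULT_SCORE = -4  # assumed score for residues missing from a column


def window_score(pssm, segment):
    """Sum the column-specific scores for one ungapped segment."""
    return sum(col.get(res, DEFAULT_SCORE) for col, res in zip(pssm, segment))


def best_ungapped_hit(pssm, sequence):
    """Slide the profile along the sequence; return (best score, offset)."""
    width = len(pssm)
    scored = [
        (window_score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


print(best_ungapped_hit(PSSM, "DDACDA"))  # (15, 2): the segment "ACD" at offset 2
```

Each residue is scored by looking it up in the column it is aligned to, so the same residue can score differently at different positions. A single substitution matrix cannot do that.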
5PSI BLAST: Constructing the Profile Matrix
Figure from Altschul et al. Nucleic Acids
Research 25, 1997
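6PSI BLAST: Determining Profile Elements
- the value for a given element of the profile matrix is given by $M_{ij} = \log(P_{ij} / P_i)$, with $P_i$ the background probability of amino acid $a_i$
- where the probability of seeing amino acid $a_i$ in column $j$ is estimated as $P_{ij} = (\alpha f_{ij} + \beta g_{ij}) / (\alpha + \beta)$
- $f_{ij}$: observed frequency
- $g_{ij}$: pseudocount
- e.g. $\alpha$ = number of sequences in profile, $\beta = 1$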
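A small sketch of how profile elements could be computed from aligned hits. It assumes the log-odds form log(P_ij / P_i), with P_ij a weighted mix of observed frequency and pseudocount, weights alpha = number of sequences and beta = 1. The four-letter alphabet, the uniform background and the choice of pseudocount term equal to the background are simplifying assumptions, not the PSI-BLAST implementation.

```python
import math
from collections import Counter

ALPHABET = "ACDE"                         # toy alphabet, not all 20 amino acids
BACKGROUND = {a: 0.25 for a in ALPHABET}  # assumed uniform background P_i


def build_pssm(aligned, alpha=None, beta=1.0):
    """Log-odds profile from equal-length, gap-free aligned sequences."""
    if alpha is None:
        alpha = float(len(aligned))       # e.g. number of sequences in profile
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        scores = {}
        for a in ALPHABET:
            f = counts[a] / len(column)   # observed frequency f_ij
            g = BACKGROUND[a]             # pseudocount term g_ij (assumed = background)
            p = (alpha * f + beta * g) / (alpha + beta)
            scores[a] = math.log2(p / BACKGROUND[a])
        pssm.append(scores)
    return pssm


profile = build_pssm(["ACD", "ACE", "ACD", "DCD"])
for j, column_scores in enumerate(profile):
    print(j, {a: round(s, 2) for a, s in column_scores.items()})
```

The pseudocount keeps residues that were never observed in a column from getting a score of minus infinity. They get a finite negative score instead.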
7PSI-BLAST iteration
(Figure: the query sequence Q is used in a gapped BLAST search; the database hits are converted into a PSSM, which is used in the next gapped BLAST search to obtain new database hits; iterate)
8PSI-BLAST steps in words
- Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these regions are excluded from alignment.
- The program then initially operates on a single query sequence by performing a gapped BLAST search.
- Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.
- Rescan the database in a subsequent round, using the PSSM, to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.
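The iteration can be sketched on toy data as follows. This is an illustration under strong assumptions, not the PSI-BLAST algorithm: all sequences have equal length and are compared without gaps, the first round uses a PSSM built from the query alone as a stand-in for the gapped BLAST search, and the background is uniform. The sequences, names and the threshold of 12 are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BG = 1.0 / len(ALPHABET)  # assumed uniform background frequency


def build_pssm(seqs, beta=1.0):
    """Log-odds PSSM from equal-length, gap-free sequences (toy master-slave alignment)."""
    alpha = float(len(seqs))
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        col = {}
        for a in ALPHABET:
            p = (alpha * counts[a] / len(column) + beta * BG) / (alpha + beta)
            col[a] = math.log2(p / BG)
        pssm.append(col)
    return pssm


def score(pssm, seq):
    """Ungapped score of a sequence against the profile."""
    return sum(col[res] for col, res in zip(pssm, seq))


def iterated_search(query, database, threshold, max_rounds=10):
    """Return the sorted hit list of every round; stop when the list no longer changes."""
    hits, history = set(), []
    for _ in range(max_rounds):
        pssm = build_pssm([query] + [database[name] for name in sorted(hits)])
        new_hits = {n for n, s in database.items() if score(pssm, s) >= threshold}
        history.append(sorted(new_hits))
        if new_hits == hits:  # converged: same hits as in the previous round
            break
        hits = new_hits
    return history


database = {  # hypothetical sequences
    "s1": "ACDEFGHK",
    "s2": "ACDEYGHK",
    "s3": "ACNEYGRK",
    "s4": "WWWWWWWW",
}
for round_no, hit_list in enumerate(iterated_search("ACDEFGHI", database, 12.0), 1):
    print(round_no, hit_list)
```

In this toy run s3 scores below the threshold against the query alone. It is picked up in the second round, once the profile has absorbed the variation seen in s1 and s2. The third round finds no new hits, so the search has converged.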
9PSI-BLAST entry page
Paste your query sequence
Switch this off for default run
11
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117, meaning that a sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in other databases. L links to LocusLink records and S links to structure records in NCBI's Molecular Modeling DataBase.
12X residues denote low-complexity sequence
fragments that are ignored
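A rough sketch of how low-complexity fragments could be replaced by X residues. It masks every window whose residue composition has a low Shannon entropy. This is a simplified stand-in and not the filter used by BLAST; the window length of 6 and the cut-off of 1.0 bit are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=6, min_entropy=1.0):
    """Replace every residue covered by a low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        # Entropy is always computed on the original sequence, not on the masked copy.
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = "X" * window
    return "".join(masked)


print(mask_low_complexity("ACDEFGSSSSSSSSHIKLMN"))  # ACDEFXXXXXXXXXXIKLMN
```

Because whole windows are masked, one flanking residue on either side of the serine run is also turned into X.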
13Alignment Bit Score
$B = (\lambda S - \ln K) / \ln 2$
- S is the raw alignment score
- The bit score (bits) B has a standard set of units
- The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment
- $\lambda$ and K are the statistical parameters of the scoring system (BLOSUM62 in BLAST).
- See Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices.
- Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)
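As a sketch, the conversion from raw score to bit score can be written as below. The (lambda, K) pairs are values commonly quoted for BLOSUM62 and are included as assumptions; check them against the parameters reported by the search program in use.

```python
import math

# Assumed (lambda, K) pairs, as commonly quoted for BLOSUM62.
PARAMS = {
    "BLOSUM62, ungapped": (0.318, 0.134),
    "BLOSUM62, gap open 11 / extend 1": (0.267, 0.041),
}


def bit_score(raw_score, lam, k):
    """B = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


for name, (lam, k) in PARAMS.items():
    print(f"{name}: raw score 100 -> {bit_score(100, lam, k):.1f} bits")
```

The same raw score gives a different bit score under each scoring system. Raw scores are therefore not comparable between searches, whereas bit scores are.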
14Normalised sequence similarity
The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994):
$P(x, n) = 1 - e^{-n \cdot P(S \geq x)}$, where n is the number of sequences in the database. The p-value thus depends on x and on n, which is fixed for a given database.
15Normalised sequence similarity: Statistical significance
The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:
$E(x, n) = n \cdot P(S \geq x)$
For example, if the E-value is 0.01, then the expected number of random hits with score S ≥ x is 0.01, which means that a chance hit with such a score is expected only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with S ≥ x are expected within a single database search, which renders the hit not significant.
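The relation between the two quantities can be checked numerically. This is a minimal sketch; the database size and the per-sequence probabilities are hypothetical values chosen to give E-values of 0.01 and 5.

```python
import math


def e_value(n, p_single):
    """Expected number of chance hits: E(x, n) = n * P(S >= x)."""
    return n * p_single


def p_value(n, p_single):
    """Probability of at least one chance hit: P(x, n) = 1 - exp(-n * P(S >= x))."""
    return -math.expm1(-e_value(n, p_single))


n = 100_000                         # hypothetical number of database sequences
for p_single in (1e-7, 5e-5):       # hypothetical per-sequence probabilities
    print(f"E = {e_value(n, p_single):.4g}, P = {p_value(n, p_single):.4g}")
```

For small values the E-value and the p-value are nearly equal. For large E-values the p-value saturates towards 1, while the E-value keeps counting the expected chance hits.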
16A model for database searching score probabilities
- Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955).
- Using the EVD, the raw alignment scores are converted to a statistical score (E value) that keeps track of the database amino acid composition and the scoring scheme (a.a. exchange matrix)
17Extreme Value Distribution
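$y = 1 - \exp(-e^{-\lambda(x - \mu)})$
(Figure: probability density function of the extreme value distribution for parameter values $\mu = 0$ and $\lambda = 1$)
The expression above gives the probability of a value of at least x; for these parameter values it becomes $y = 1 - \exp(-e^{-x})$. Here $\mu$ is the characteristic value and $\lambda$ is the decay constant.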
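A short sketch of the two functions involved, with the characteristic value (`mu`) and the decay constant (`lam`) as parameters. The function names are illustrative. The density is obtained by differentiating the tail expression, and the loop at the end checks this numerically.

```python
import math


def evd_tail(x, mu=0.0, lam=1.0):
    """y = 1 - exp(-exp(-lam * (x - mu))): probability of a value >= x."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def evd_density(x, mu=0.0, lam=1.0):
    """Density of the same distribution: minus the derivative of evd_tail."""
    z = math.exp(-lam * (x - mu))
    return lam * z * math.exp(-z)


h = 1e-5
for x in (-2.0, 0.0, 2.0):
    # Central-difference estimate of the slope of the tail function.
    numeric = (evd_tail(x - h) - evd_tail(x + h)) / (2 * h)
    print(x, round(evd_density(x), 4), round(numeric, 4))
```

The density at x = 2 is much larger than at x = -2, which is the asymmetry of the distribution: the right tail falls off more slowly than the left tail.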
18Extreme Value Distribution (EVD)
(Figure: EVD approximation compared with real data)
You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.
19Extreme Value Distribution
The probability that a score S is greater than or equal to a given value x can be calculated following the extreme value distribution as $P(S \geq x) = 1 - \exp(-e^{-\lambda(x - \mu)})$, where $\mu = (\ln Kmn)/\lambda$. Here m and n are the lengths of the query and of the database, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices).
20Extreme Value Distribution
Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes $P(S \geq x) = 1 - \exp(-Kmn\,e^{-\lambda x})$. In practice, the probability $P(S \geq x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x. This leads to a simplification of the equation for $P(S \geq x)$: $P(S \geq x) \approx e^{-\lambda(x - \mu)} = Kmn\,e^{-\lambda x}$. The right-hand side is the E value; the lower it is for a given threshold value x, the more significant the score S.
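To see how good the large-x approximation is, the exact and the simplified expression can be compared numerically. This is a sketch under assumed values: the parameters lambda = 0.267 and K = 0.041 are commonly quoted for gapped BLOSUM62, and the lengths m and n are hypothetical.

```python
import math


def exact_tail(x, lam, k, m, n):
    """P(S >= x) = 1 - exp(-K*m*n*exp(-lam*x))"""
    return -math.expm1(-k * m * n * math.exp(-lam * x))


def approx_tail(x, lam, k, m, n):
    """Large-x approximation: K*m*n*exp(-lam*x)"""
    return k * m * n * math.exp(-lam * x)


lam, k = 0.267, 0.041       # assumed scoring-system parameters
m, n = 300, 100_000_000     # hypothetical query and database lengths
mu = math.log(k * m * n) / lam
for x in (mu, mu + 10, mu + 40):
    print(f"x = {x:6.1f}  exact = {exact_tail(x, lam, k, m, n):.3e}  "
          f"approx = {approx_tail(x, lam, k, m, n):.3e}")
```

At x equal to the characteristic value the approximation gives 1, clearly above the exact probability. Far into the right tail the two expressions agree closely.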
21Normalised sequence similarity: Statistical significance
- Database searching is commonly performed using an E-value threshold between 0.1 and 0.001.
- Low E-value thresholds decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.
22Words of Encouragement
- "There are three kinds of lies: lies, damned lies, and statistics." (Benjamin Disraeli)
- "Statistics in the hands of an engineer are like a lamppost to a drunk: they're used more for support than illumination."
- "Then there is the man who drowned crossing a stream with an average depth of six inches." (W.I.E. Gates)
23Protein structure-function relationships
24Protein function
Genome/DNA, Transcriptome/mRNA, Proteome, Metabolome, Physiome
Transcription factors
Ribosomal proteins, Chaperonins
Enzymes
25Protein function
Not all proteins are enzymes: ?-crystallin, an eye lens protein, needs to stay stable and transparent for a lifetime (very little turnover in the eye lens)
26Protein function groups
- Catalysis (enzymes)
- Binding, transport (active/passive)
- Protein-DNA/RNA binding (e.g. histones, transcription factors)
- Protein-protein interactions (e.g. antibody-lysozyme)
- Protein-fatty acid binding (e.g. apolipoproteins)
- Protein-small molecules (drug interaction, structure decoding)
- Structural component (e.g. ?-crystallin)
- Regulation
- Transcription regulation
- Signalling
- Immune system
- Motor proteins (actin/myosin)
27What can happen to protein function through
evolution
- Proteins can have multiple functions (and sometimes many, e.g. Ig).
- Enzyme function is defined by specificity and activity
- Through evolution:
- Function and specificity can stay the same
- Function stays same but specificity changes
- Change to some similar function (e.g. somewhere else in metabolic system)
- Change to completely new function
28How to arrive at a given function
- Divergent evolution: homologous proteins; proteins have same structure and same-ish function
- Convergent evolution: analogous proteins; different structure but same function
- Question: can homologous proteins change structure (and function)?
29Protein function evolution
Chymotrypsin
Active site (combination of ancestral active site
residues)
Putative ancestral barrel structure
Modern 2-barrel structure
Activity 1000-10,000 times enhanced
30How to evolve
- Important distinction:
- Orthologues: homologous proteins in different species (all deriving from same ancestor)
- Paralogues: homologous proteins in same species (internal gene duplication)
- In practice: to recognise orthology, bi-directional best hit is used in conjunction with database search program (this is called an operational definition)
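The bi-directional best hit criterion can be sketched as follows, assuming the search results of each proteome against the other are already available as score tables. The protein names and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject; scores is {query: {subject: score}}."""
    return {q: max(subjects, key=subjects.get) for q, subjects in scores.items() if subjects}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical search scores: species A proteins against species B, and the reverse.
a_vs_b = {"a1": {"b1": 250, "b2": 90}, "a2": {"b1": 120, "b2": 80}, "a3": {"b3": 60}}
b_vs_a = {"b1": {"a1": 245, "a2": 118}, "b2": {"a1": 88, "a2": 79}, "b3": {"a3": 61}}

print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a3', 'b3')]
```

Here a2 also finds b1 as its best hit, but b1 prefers a1, so the pair (a2, b1) is not reported as orthologous.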
31How to evolve
- By addition of domains (at either end of protein sequence or at loop sites, see next slides)
- Often through gene duplication followed by divergence
- Multi-domain proteins are a result of gene fusion (multiple genes ending up in a single ORF).
- Repetitions of the same domain in a single protein occur frequently (gene duplication followed by gene fusion)
32Protein structure evolution
- Insertion/deletion of secondary structural elements can easily be done at loop sites. These sites are normally at the surface of a protein.
33Example -- Flavodoxin fold
5(βα) fold
34Flavodoxin family - TOPS diagrams (Flores et
al., 1994)
These are four variations of the same basic topology (bottom). Do you see what is inserted as compared to the basic topology?
A TOPS diagram is a schematic representation of a protein fold.
(Figure: TOPS diagrams with numbered alpha-helix and beta-strand symbols)
35Protein structure evolution
Insertion/deletion of structural domains can
easily be done at loop sites
36The basic functional unit of a protein is the domain
A domain is a:
- Compact, semi-independent unit (Richardson, 1981).
- Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
- Recurring functional and evolutionary module (Bork, 1992).
- "Nature is a tinkerer and not an inventor" (Jacob, 1977).
37Delineating domains is essential for
- Obtaining high resolution structures (x-ray, NMR)
- Sequence analysis
- Multiple sequence alignment methods
- Prediction algorithms (SS, Class, secondary/tertiary structure)
- Fold recognition and threading
- Elucidating the evolution, structure and function of a protein family (e.g. Rosetta Stone method, next lecture)
- Structural/functional genomics
- Cross genome comparative analysis
38Structural domain organisation can be nasty
Pyruvate kinase, Phosphotransferase
- β barrel regulatory domain
- α/β barrel catalytic substrate binding domain
- α/β nucleotide binding domain
1 continuous and 2 discontinuous domains
39Complex protein functions are a result of
multiple domains
- An example is the so-called swivelling domain in pyruvate phosphate dikinase (Herzberg et al., 1996), which brings an intermediate enzymatic product over about 45 Å from the active site of one domain to that of another.
- This enhances the enzymatic activity: delivery of intermediate product not by a diffusion process but by active transport
40The DEATH Domain
- Present in a variety of Eukaryotic proteins involved with cell death.
- Six helices enclose a tightly packed hydrophobic core.
- Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
42Globin fold: α protein, myoglobin, PDB 1MBN
43β sandwich: β protein, immunoglobulin, PDB 7FAB
44TIM barrel: α/β protein, Triose phosphate IsoMerase, PDB 1TIM
45A fold in α+β: protein ribonuclease A, PDB 7RSA
The red balls represent waters that are bound to the protein based on polar contacts
47 434 Cro protein complex (phage) PDB 3CRO
48Zinc finger DNA recognition (Drosophila) PDB
2DRP
..YRCKVCSRVY THISNFCRHY VTSH...
49Zinc-finger DNA binding protein family
Characteristics of the family
Function
The DNA-binding motif is found as part of
transcription regulatory proteins.
Structure
One of the most abundant DNA-binding motifs. Proteins may contain more than one finger in a single chain. For example, Transcription Factor TF3A was the first zinc-finger protein discovered and contains 9 C2H2 zinc-finger motifs (tandem repeats). Each motif consists of 2 antiparallel beta-strands followed by an alpha-helix. A single zinc ion is tetrahedrally coordinated by conserved histidine and cysteine residues, stabilising the motif.
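The conserved cysteine and histidine residues make the C2H2 motif easy to look for in a sequence. The sketch below uses a simplified spacing pattern (Cys, 2 to 4 residues, Cys, 12 residues, His, 3 to 5 residues, His). The pattern is an assumption for illustration and not a curated motif definition. It is applied to the sequence fragment shown on the zinc finger DNA recognition slide (PDB 2DRP), with the spaces removed.

```python
import re

# Simplified C2H2 spacing pattern (an assumption for illustration):
# Cys, 2-4 residues, Cys, 12 residues, His, 3-5 residues, His.
C2H2 = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2(sequence):
    """Return (start, matched segment) for each non-overlapping pattern match."""
    return [(m.start(), m.group()) for m in C2H2.finditer(sequence)]


fragment = "YRCKVCSRVYTHISNFCRHYVTSH"  # fragment from the slide, spaces removed
print(find_c2h2(fragment))  # [(2, 'CKVCSRVYTHISNFCRHYVTSH')]
```

A pattern of this kind only tests the spacing of the zinc-coordinating residues, so it can report matches that are not real zinc fingers.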
50Zinc-finger DNA binding protein family
Characteristics of the family
Binding
Fingers bind to 3 base-pair subsites and specific contacts are mediated by amino acids in positions -1, 2, 3 and 6 relative to the start of the alpha-helix. Contacts mainly involve one strand of the DNA. Where proteins contain multiple fingers, the fingers bind to adjacent subsites within a larger DNA recognition site, thus allowing a relatively simple motif to specifically bind to a wide range of DNA sequences. This means that the number and the type of zinc fingers dictate the specificity of binding to DNA.
51Leucine zipper (yeast) PDB 1YSA
..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...