Introduction to Bioinformatics - PowerPoint PPT Presentation

About This Presentation

Title:

Introduction to Bioinformatics

Description:

Title: PowerPoint Presentation Author: heringa Last modified by: Jaap Heringa Created Date: 2/20/2003 6:08:47 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 52

Provided by: heringa

Transcript and Presenter's Notes

Title: Introduction to Bioinformatics

1
Introduction to Bioinformatics
Lecture 12: Iterative homology searching and Protein Structure-Function Relationships
Centre for Integrative Bioinformatics VU (IBIVU)
2
PSI (Position Specific Iterated) BLAST

basic idea
use results from BLAST query to construct a
profile matrix
search database with profile instead of query
sequence
iterate

3
A Profile Matrix (Position-Specific Scoring Matrix, PSSM)
This is the same as a profile without position-specific gap penalties
4
PSI-BLAST: Searching with a Profile

aligning a profile matrix to a simple sequence is like aligning two sequences,
except that the score for aligning a character with a matrix position is given by the matrix itself,
not by a substitution matrix
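
A minimal sketch of this idea in Python: each residue is scored by looking it up in its own profile column rather than in a shared substitution matrix. The three-column toy profile, its values and the default score are invented for illustration, and gaps are ignored for brevity.

```python
# Toy profile: one dict per column, mapping residue -> score (illustrative values only).
toy_profile = [
    {"A": 3, "C": -1, "D": -2},
    {"A": -2, "C": 5, "D": -1},
    {"A": -1, "C": -2, "D": 4},
]

def score_window(profile, window, default=-4):
    """Sum the profile score of each residue at its own column (no gaps)."""
    return sum(col.get(res, default) for col, res in zip(profile, window))

def best_ungapped_hit(profile, sequence):
    """Slide the profile along the sequence; return (best_score, start_index)."""
    width = len(profile)
    scores = [
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)

print(best_ungapped_hit(toy_profile, "DDACDA"))
```

The same residue can score differently at different positions, which is what distinguishes a profile search from a search with a single exchange matrix.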

5
PSI-BLAST: Constructing the Profile Matrix
Figure from Altschul et al., Nucleic Acids Research 25, 1997
6
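PSI-BLAST: Determining Profile Elements

the value for a given element of the profile matrix is given by
$M_{ij} = \log \left( P_{ij} / P_i \right)$, with $P_i$ the background frequency of amino acid $a_i$,
where the probability of seeing amino acid $a_i$ in column $j$ is estimated as
$P_{ij} = \dfrac{\alpha f_{ij} + \beta g_{ij}}{\alpha + \beta}$

$f_{ij}$: observed frequency
$g_{ij}$: pseudocount
e.g. $\alpha$ = number of sequences in profile, $\beta = 1$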
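
A hedged sketch of how one profile column could be computed, assuming the log-odds form log(P_ij / P_i) with P_ij = (alpha * f_ij + beta * g_ij) / (alpha + beta). The four-letter alphabet and uniform background are toy assumptions. Using the background frequency as the pseudocount g is a simplification made here; PSI-BLAST itself derives pseudocounts differently.

```python
import math
from collections import Counter

def profile_column_scores(column, background, beta=1.0):
    """Log-odds scores for one alignment column.

    column     : residues observed in this column, one per aligned sequence
    background : dict residue -> background probability (assumed, sums to 1)
    beta       : pseudocount weight; alpha is taken as the number of sequences
    """
    alpha = len(column)
    counts = Counter(column)
    scores = {}
    for res, p_bg in background.items():
        f = counts[res] / alpha      # observed frequency
        g = p_bg                     # pseudocount term (simplifying assumption)
        p = (alpha * f + beta * g) / (alpha + beta)
        scores[res] = math.log2(p / p_bg)
    return scores

background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}  # toy 4-letter alphabet
scores = profile_column_scores("AAAC", background)
print(scores)
```

The pseudocount keeps residues that were never observed in the column (D and E here) from getting a score of minus infinity.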
7
PSI-BLAST iteration
[Figure: the query sequence Q is used in a gapped BLAST search; the database hits are aligned to the query and summarised in a PSSM (rows A, C, D, ..., Y); the PSSM is used for the next gapped BLAST search, yielding new database hits; iterate.]
8
PSI-BLAST steps in words

Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996); such regions, whose biased composition is likely to lead to spurious hits, are excluded from alignment.
The program then initially operates on a single query sequence by performing a gapped BLAST search.
Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.
The database is rescanned in a subsequent round, using the PSSM, to find more homologous sequences.
Iteration continues until the user decides to stop or the search has converged, i.e. no new hits are found.
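
The steps above can be sketched as a loop. This is only an outline of the control flow: `search` and `build_pssm` are hypothetical callables standing in for the gapped BLAST search and the profile construction, and the toy "database" below is invented.

```python
def psi_blast_loop(query, search, build_pssm, max_rounds=5):
    """Iterate search -> profile -> search until no new hits appear.

    search(model)           : hypothetical callable returning a set of hit IDs
    build_pssm(query, hits) : hypothetical callable returning a profile
    """
    model = query                    # round 1 uses the plain query sequence
    hits = set()
    for round_no in range(1, max_rounds + 1):
        new_hits = set(search(model))
        if new_hits <= hits:         # nothing new: the search has converged
            return hits, round_no
        hits |= new_hits
        model = build_pssm(query, hits)
    return hits, max_rounds

# Toy stand-ins: the "profile" is just the set of known sequences, and the
# fake search returns the neighbours of everything already in the profile.
neighbours = {"Q": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}

def toy_search(model):
    members = {model} if isinstance(model, str) else model
    found = set()
    for member in members:
        found |= neighbours.get(member, set())
    return found

def toy_build_pssm(query, hits):
    return {query} | hits

final_hits, rounds = psi_blast_loop("Q", toy_search, toy_build_pssm)
print(final_hits, rounds)
```

In the toy run, sequence c is only reachable through hit a, which mirrors how a profile can pick up homologues that the query alone would miss.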

9
PSI-BLAST entry page
Paste your query sequence
Switch this off for default run
11
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117, meaning that a sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in other databases. L links to LocusLink records and S links to structure records in NCBI's Molecular Modeling DataBase.
12
X residues denote low-complexity sequence
fragments that are ignored
13
Alignment Bit Score
$B = (\lambda S - \ln K) / \ln 2$

S is the raw alignment score
The bit score (bits) B has a standard set of units
The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment
$\lambda$ and K are the statistical parameters of the scoring system (BLOSUM62 in BLAST).
See Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices.
Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)
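
A small worked example of the conversion B = (lambda * S - ln K) / ln 2. The parameter values below are assumptions for illustration, roughly those often quoted for gapped BLOSUM62 searches; the values actually used are reported in the BLAST output.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to bits: B = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative parameter values (assumed, not taken from the slides).
LAMBDA, K = 0.267, 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```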

14
Normalised sequence similarity
The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994):
$P(x, n) = 1 - e^{-n \cdot P(S \geq x)}$,
where n is the number of sequences in the database; the value depends on x and n (fixed).
15
Normalised sequence similarity: Statistical significance
The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:
$E(x, n) = n \cdot P(S \geq x)$
For example, if the E-value is 0.01, then the expected number of random hits with score $S \geq x$ is 0.01, which means that a hit with this E-value is expected by chance only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with $S \geq x$ are expected within a single database search, which renders the hit not significant.
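
The relation between the two quantities can be checked numerically: with E = n * P(S >= x), the p-value is 1 - exp(-E). The database size and per-sequence probability below are made-up values chosen to give E = 0.01.

```python
import math

def e_value(n, p_single):
    """Expected number of chance hits: E = n * P(S >= x)."""
    return n * p_single

def p_value(n, p_single):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value(n, p_single))

n = 100_000        # assumed number of sequences in the database
p_single = 1e-7    # assumed per-sequence probability P(S >= x)
E = e_value(n, p_single)
P = p_value(n, p_single)
print(E, P)
```

For small E the two values nearly coincide; for E = 5 the p-value is already close to 1, in line with such a hit being not significant.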
16
A model for database searching score probabilities

Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955).
Using the EVD, the raw alignment scores are converted to a statistical score (E value) that takes into account the database amino acid composition and the scoring scheme (a.a. exchange matrix)

17
Extreme Value Distribution
$y = 1 - \exp(-e^{-\lambda(x-\mu)})$
Probability density function for the extreme value distribution resulting from parameter values $\mu = 0$ and $\lambda = 1$ [$y = 1 - \exp(-e^{-x})$], where $\mu$ is the characteristic value and $\lambda$ is the decay constant.
18
Extreme Value Distribution (EVD)
[Figure: EVD approximation compared with real data]
You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.
19
Extreme Value Distribution
The probability of a score S to be larger than a given value x can be calculated following the EVD as
E-value: $P(S \geq x) = 1 - \exp(-e^{-\lambda(x-\mu)})$,
where $\mu = (\ln Kmn)/\lambda$, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices).
20
Extreme Value Distribution
Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes
$P(S \geq x) = 1 - \exp(-Kmn\,e^{-\lambda x})$.
In practice, the probability $P(S \geq x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x. This leads to a simplification of the equation for $P(S \geq x)$:
$P(S \geq x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$.
The lower the probability (E value) for a given threshold value x, the more significant the score S.
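
A numerical comparison of the exact tail probability with its large-x approximation. All parameter values are assumptions for illustration: m and n are taken here as the query and database lengths, and lambda and K as typical scoring-system parameters.

```python
import math

def evd_tail_exact(x, lam, k, m, n):
    """P(S >= x) = 1 - exp(-K*m*n*exp(-lambda*x))."""
    return -math.expm1(-k * m * n * math.exp(-lam * x))

def evd_tail_approx(x, lam, k, m, n):
    """Large-x approximation: P(S >= x) ~ K*m*n*exp(-lambda*x)."""
    return k * m * n * math.exp(-lam * x)

# Assumed illustrative values (not taken from the slides).
lam, k, m, n = 0.267, 0.041, 250, 1_000_000

for x in (60, 80, 100, 120):
    print(x, evd_tail_exact(x, lam, k, m, n), evd_tail_approx(x, lam, k, m, n))
```

At low scores the approximation exceeds 1 and is no longer a probability; at high scores the two expressions agree closely, which is the regime that matters for significant hits.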
21
Normalised sequence similarity: Statistical significance

Database searching is commonly performed using an E-value threshold between 0.1 and 0.001.
Low E-value thresholds decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.

22
Words of Encouragement

There are three kinds of lies: lies, damned lies, and statistics (Benjamin Disraeli)
Statistics in the hands of an engineer are like a lamppost to a drunk: they're used more for support than illumination
Then there is the man who drowned crossing a stream with an average depth of six inches (W.I.E. Gates)

23
Protein structure-function relationships
24
Protein function
Genome/DNA, Transcriptome/mRNA, Proteome, Metabolome, Physiome
Transcription factors
Ribosomal proteins Chaperonins
Enzymes
25
Protein function
Not all proteins are enzymes: ?-crystallin, an eye lens protein, needs to stay stable and transparent for a lifetime (very little turnover in the eye lens)
26
Protein function groups

Catalysis (enzymes)
Binding transport (active/passive)
Protein-DNA/RNA binding (e.g. histones,
transcription factors)
Protein-protein interactions (e.g.
antibody-lysozyme)
Protein-fatty acid binding (e.g. apolipoproteins)
Protein small molecules (drug interaction,
structure decoding)
Structural component (e.g. ?-crystallin)
Regulation
Transcription regulation
Signalling
Immune system
Motor proteins (actin/myosin)

27
What can happen to protein function through
evolution

Proteins can have multiple functions (and
sometimes many -- Ig).
Enzyme function is defined by specificity and
activity
Through evolution
Function and specificity can stay the same
Function stays same but specificity changes
Change to some similar function (e.g. somewhere
else in metabolic system)
Change to completely new function

28
How to arrive at a given function

Divergent evolution: homologous proteins; proteins have the same structure and same-ish function
Convergent evolution: analogous proteins; different structure but same function
Question: can homologous proteins change structure (and function)?

29
Protein function evolution
Chymotrypsin
Active site (combination of ancestral active site
residues)
Putative ancestral barrel structure
Modern 2-barrel structure
Activity 1000-10,000 times enhanced
30
How to evolve

Important distinction
Orthologues: homologous proteins in different species (all deriving from the same ancestor)
Paralogues: homologous proteins in the same species (internal gene duplication)
In practice: to recognise orthology, the bi-directional best hit is used in conjunction with a database search program (this is called an operational definition)
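
A minimal sketch of the bi-directional best hit criterion, assuming the two database searches (species A against B, and B against A) have already been run and summarised as score tables. The protein names and scores are invented.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict (query_id, subject_id) -> similarity score from a database search
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit in B and a is b's best hit in A."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Toy scores (invented): proteins a1, a2 from species A; b1, b2 from species B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90, ("a2", "b1"): 150, ("a2", "b2"): 60}
b_vs_a = {("b1", "a1"): 198, ("b1", "a2"): 149, ("b2", "a1"): 88, ("b2", "a2"): 61}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a2 also has b1 as its best hit, but b1 prefers a1, so only (a1, b1) is reported as a putative orthologous pair.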

31
How to evolve

By addition of domains (at either end of the protein sequence or at loop sites; see next slides)
Often through gene duplication followed by
divergence
Multi-domain proteins are a result of gene
fusion (multiple genes ending up in a single
ORF).
Repetitions of the same domain in a single
protein occur frequently (gene duplication
followed by gene fusion)

32
Protein structure evolution

Insertion/deletion of secondary structural
elements can easily be done at loop sites

These sites are normally at the surface of a
protein
33
Example -- Flavodoxin fold
5(βα) fold
34
Flavodoxin family - TOPS diagrams (Flores et al., 1994)
These are four variations of the same basic topology (bottom). Do you see what is inserted as compared to the basic topology?
A TOPS diagram is a schematic representation of a protein fold
[Figure legend: alpha-helix, beta-strand; secondary structure elements numbered 1 to 5]
35
Protein structure evolution
Insertion/deletion of structural domains can
easily be done at loop sites
N C
36
The basic functional unit of a protein is the domain
A domain is a:

Compact, semi-independent unit (Richardson,
1981).
Stable unit of a protein structure that can fold
autonomously (Wetlaufer, 1973).
Recurring functional and evolutionary module
(Bork, 1992).
Nature is a tinkerer and not an inventor
(Jacob, 1977).

37
Delineating domains is essential for

Obtaining high resolution structures (x-ray, NMR)
Sequence analysis
Multiple sequence alignment methods
Prediction algorithms (SS, Class,
secondary/tertiary structure)
Fold recognition and threading
Elucidating the evolution, structure and function
of a protein family (e.g. Rosetta Stone method
next lecture)
Structural/functional genomics
Cross genome comparative analysis

38
Structural domain organisation can be nasty
Pyruvate kinase (phosphotransferase)
β barrel regulatory domain
α/β barrel catalytic substrate binding domain
α/β nucleotide binding domain
1 continuous and 2 discontinuous domains
39
Complex protein functions are a result of
multiple domains

An example is the so-called swivelling domain in
pyruvate phosphate dikinase (Herzberg et al.,
1996), which brings an intermediate enzymatic
product over about 45 Å from the active site of
one domain to that of another.
This enhances the enzymatic activity: the intermediate product is delivered not by a diffusion process but by active transport

40
The DEATH Domain
Present in a variety of Eukaryotic proteins
involved with cell death.
Six helices enclose a tightly packed hydrophobic
core.
Some DEATH domains form homotypic and
heterotypic dimers.

http://www.mshri.on.ca/pawson
42
Globin fold: α protein, myoglobin (PDB 1MBN)
43
β sandwich: β protein, immunoglobulin (PDB 7FAB)
44
TIM barrel: α / β protein, Triose phosphate IsoMerase (PDB 1TIM)
45
A fold in α + β protein: ribonuclease A (PDB 7RSA)
The red balls represent waters that are bound to the protein based on polar contacts
47
434 Cro protein complex (phage) PDB 3CRO
48
Zinc finger DNA recognition (Drosophila) PDB
2DRP
..YRCKVCSRVY THISNFCRHY VTSH...
49
Zinc-finger DNA binding protein family
Characteristics of the family

Function
The DNA-binding motif is found as part of
transcription regulatory proteins.

Structure
One of the most abundant DNA-binding motifs. Proteins may contain more than one finger in a single chain. For example, Transcription Factor TF3A, the first zinc-finger protein discovered, contains 9 C2H2 zinc-finger motifs (tandem repeats). Each motif consists of 2 antiparallel beta-strands followed by an alpha-helix. A single zinc ion is tetrahedrally coordinated by conserved histidine and cysteine residues, stabilising the motif.
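
The conserved cysteine and histidine spacing can be searched for with a regular expression. The pattern below is a simplified C2H2 spacing assumed for illustration (Cys, 2-4 residues, Cys, 12 residues, His, 3-5 residues, His), not a curated database signature. The test fragment is the sequence shown on the zinc finger slide (PDB 2DRP), with the space removed.

```python
import re

# Simplified C2H2 spacing, assumed for illustration only.
C2H2 = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_c2h2(sequence):
    """Return (start, end, matched_text) for each non-overlapping C2H2-like match."""
    return [(m.start(), m.end(), m.group()) for m in C2H2.finditer(sequence)]

# Fragment from the zinc finger slide (PDB 2DRP).
fragment = "YRCKVCSRVYTHISNFCRHYVTSH"
hits = find_c2h2(fragment)
print(hits)
```

A match only shows that the zinc-coordinating residues are spaced appropriately; it says nothing about whether the finger folds or which DNA subsite it binds.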

50

Zinc-finger DNA binding protein family
Characteristics of the family

Binding

Fingers bind to 3 base-pair subsites, and specific contacts are mediated by amino acids in positions -1, 2, 3 and 6 relative to the start of the alpha-helix. Contacts mainly involve one strand of the DNA. Where proteins contain multiple fingers, each finger binds to adjacent subsites within a larger DNA recognition site, thus allowing a relatively simple motif to specifically bind to a wide range of DNA sequences. This means that the number and the type of zinc fingers dictate the specificity of binding to DNA.

51
Leucine zipper (yeast) PDB 1YSA
..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...
