Introduction to Bioinformatics - PowerPoint PPT Presentation

Slides: 52
Provided by: heri4
Title: Introduction to Bioinformatics


1
Introduction to Bioinformatics
Lecture 12: Iterative homology searching and
Protein Structure-Function Relationships
Centre for Integrative Bioinformatics VU (IBIVU)
2
PSI (Position Specific Iterated) BLAST
  • basic idea
  • use results from BLAST query to construct a
    profile matrix
  • search database with profile instead of query
    sequence
  • iterate

3
A Profile Matrix (Position-Specific Scoring
Matrix, PSSM)
This is the same as a profile without
position-specific gap penalties
4
PSI BLAST
  • Searching with a Profile
  • aligning profile matrix to a simple sequence
  • like aligning two sequences
  • except score for aligning a character with a
    matrix position is given by the matrix itself
  • not a substitution matrix
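
The idea above can be sketched as follows. This is a minimal, ungapped illustration only: the toy matrix values, the default score for unlisted residues, and the helper names `score_window` and `best_ungapped_hit` are assumptions made for this example and do not come from the lecture.

```python
# Toy PSSM: one dict per profile column, mapping residue -> score.
# The numbers are made up for illustration only.
TOY_PSSM = [
    {"A": 2.0, "C": -1.0, "D": -2.0},
    {"A": -1.0, "C": 3.0, "D": -1.0},
    {"A": -2.0, "C": -1.0, "D": 4.0},
]


def score_window(pssm, window, default=-4.0):
    """Sum the column-specific scores for an ungapped window of equal length."""
    if len(window) != len(pssm):
        raise ValueError("window and profile must have the same length")
    # The score comes from the matrix column itself, not a substitution matrix.
    return sum(col.get(res, default) for col, res in zip(pssm, window))


def best_ungapped_hit(pssm, sequence):
    """Slide the profile along the sequence; return (best_score, start)."""
    width = len(pssm)
    scores = [
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)


print(best_ungapped_hit(TOY_PSSM, "DDACDA"))
```

A real profile search also allows gaps, as in pairwise alignment; the sketch leaves them out to keep the scoring rule visible.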

5
PSI BLAST: Constructing the Profile Matrix
Figure from Altschul et al., Nucleic Acids
Research 25, 1997
6
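PSI BLAST: Determining Profile Elements
  • the value for a given element of the profile
    matrix is given by
    $\log \frac{\Pr(a_i \mid \text{column } j)}{\Pr(a_i)}$
  • where the probability of seeing amino acid $a_i$ in
    column j is estimated as
    $\Pr(a_i \mid \text{column } j) = \frac{\alpha f_{ij} + \beta g_{ij}}{\alpha + \beta}$

$f_{ij}$: observed frequency
$g_{ij}$: pseudocount
e.g. $\alpha$ = number of sequences in profile, $\beta = 1$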
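
A minimal sketch of how one profile column could be turned into scores, assuming the estimate is a weighted average of observed frequency and pseudocount (weights `alpha` and `beta`) and the score is the log-odds against a background probability. Using the background probability itself as the pseudocount, the four-letter alphabet, and the function name `column_log_odds` are simplifying assumptions for illustration.

```python
import math
from collections import Counter


def column_log_odds(column, background, alpha=None, beta=1.0):
    """Log-odds scores for one alignment column.

    column:     residues observed in this column, one per aligned sequence
    background: dict residue -> background probability
    alpha:      weight of the observed frequencies (default: number of sequences)
    beta:       weight of the pseudocounts
    Assumption: the pseudocount is simply the background probability.
    """
    if alpha is None:
        alpha = float(len(column))
    counts = Counter(column)
    scores = {}
    for res, p_bg in background.items():
        f = counts[res] / len(column)   # observed frequency
        g = p_bg                        # pseudocount (assumed)
        p = (alpha * f + beta * g) / (alpha + beta)
        scores[res] = math.log(p / p_bg)
    return scores


background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
scores = column_log_odds("AAAC", background)
print(scores)
```

Thanks to the pseudocount, residues never seen in the column (D and E here) get a finite negative score instead of minus infinity.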
7
PSI-BLAST iteration
[Diagram: query sequence Q → gapped BLAST search →
database hits → PSSM (columns A C D . . Y) → gapped
BLAST search with the PSSM → database hits → iterate]
8
PSI-BLAST
  • Query sequences are first scanned for the
    presence of so-called low-complexity regions
    (Wootton and Federhen, 1996), i.e. regions with a
    biased composition that are likely to lead to
    spurious hits; these regions are excluded from
    alignment.
  • The program then initially operates on a single
    query sequence by performing a gapped BLAST
    search.
  • Then, the program takes the significant local
    alignments (hits) found, constructs a multiple
    alignment (master-slave alignment) and abstracts
    a position-specific scoring matrix (PSSM) from
    this alignment.
  • The database is rescanned in a subsequent round,
    using the PSSM, to find more homologous sequences.
    Iteration continues until the user decides to stop
    or the search has converged.
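
The control flow of the steps above can be sketched as a loop that stops when a round finds no new sequences. This is only a toy: Hamming distance on equal-length strings stands in for the gapped BLAST search, the "profile" is just the set of sequences found so far rather than a real PSSM, and all names and data are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def iterative_search(query, database, max_dist=1, max_rounds=10):
    """Return (hits, rounds_used) for a toy iterate-until-converged search."""
    # Initial round: ordinary search with the single query sequence.
    hits = {s for s in database if hamming(query, s) <= max_dist}
    for round_no in range(1, max_rounds + 1):
        # Stand-in for "build a PSSM from the hits": keep query plus hits.
        profile = hits | {query}
        # Stand-in for "search with the PSSM": match any profile member.
        new_hits = {
            s for s in database
            if any(hamming(p, s) <= max_dist for p in profile)
        }
        if new_hits == hits:  # converged: this round found nothing new
            return hits, round_no
        hits = new_hits
    return hits, max_rounds


TOY_DB = ["AAAA", "AAAB", "AABB", "BBBC", "CCCC"]
print(iterative_search("AAAA", TOY_DB))
```

In the toy run, "AABB" is too far from the query to be found directly, but is picked up in the next round through the intermediate hit "AAAB", which is the point of iterating.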

11
1 - This portion of each description links to the
sequence record for a particular hit.
2 - Score or bit score is a value calculated from the
number of gaps and substitutions associated with
each aligned sequence. The higher the score, the
more significant the alignment. Each score links
to the corresponding pairwise alignment between
query sequence and hit sequence (also referred to
as subject sequence).
3 - E Value (Expect Value) describes the likelihood
that a sequence with a similar score will occur in
the database by chance. The smaller the E Value,
the more significant the alignment. For example,
the first alignment has a very low E value of
e-117, meaning that a sequence with a similar score
is very unlikely to occur simply by chance.
4 - These links provide the user with direct access
from BLAST results to related entries in other
databases. L links to LocusLink records and
S links to structure records in NCBI's
Molecular Modeling DataBase.
12
X residues denote low-complexity sequence
fragments that are ignored
13
PSI-BLAST output example
14
Alignment Bit Score
$B = (\lambda S - \ln K) / \ln 2$
  • S is the raw alignment score
  • The bit score (bits) B has a standard set of
    units
  • The bit score B is calculated from the number of
    gaps and substitutions associated with each
    aligned sequence. The higher the score, the more
    significant the alignment
  • $\lambda$ and K are the statistical parameters of
    the scoring system (BLOSUM62 in BLAST).
  • See Altschul and Gish, 1996, for a collection of
    values for $\lambda$ and K over a set of widely
    used scoring matrices.
  • Because bit scores are normalized with respect to
    the scoring system, they can be used to compare
    alignment scores from different searches based on
    different scoring schemes (a.a. exchange matrices)
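
A small sketch of the bit-score conversion, assuming the formula B = (λS − ln K) / ln 2. The parameter values below are illustrative placeholders, not values taken from the slides, and m and n are assumed here to be the query and database lengths. The second half shows why bit scores are comparable across scoring schemes: once a score is expressed in bits, the expected number of chance hits no longer needs λ or K.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw alignment score S to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Illustrative parameter values only (assumed, not taken from the slides).
LAM, K = 0.267, 0.041
S = 100
B = bit_score(S, LAM, K)

# Expected number of chance hits, written from the raw score and from bits.
m, n = 250, 1_000_000  # assumed query length and database length
e_from_raw = K * m * n * math.exp(-LAM * S)
e_from_bits = m * n * 2.0 ** (-B)
print(B, e_from_raw, e_from_bits)
```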


15
Normalised sequence similarity
The p-value is defined as the probability of
seeing at least one unrelated score S greater
than or equal to a given score x in a database
search over n sequences. This probability
follows the Poisson distribution (Waterman and
Vingron, 1994):
$P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$,
where n is the number of sequences in the
database. The p-value depends on x and n (fixed).
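
The formula can be motivated in one step, assuming (as stated above) that the number of unrelated hits scoring at least x is Poisson distributed. The symbols N for that count and E for its mean are introduced here only for illustration.

```latex
\begin{align*}
E &= n \cdot P(S \ge x) && \text{(mean number of chance hits)} \\
\Pr(N = 0) &= e^{-E} && \text{(Poisson probability of no chance hit)} \\
P(x, n) &= 1 - \Pr(N = 0) = 1 - e^{-n \cdot P(S \ge x)}
\end{align*}
```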
16
Normalised sequence similarity: Statistical
significance
The E-value is defined as the expected number of
non-homologous sequences with score greater than
or equal to a score x in a database of n
sequences:
$E(x, n) = n \cdot P(S \ge x)$
For example, if the E-value is 0.01, then the
expected number of random hits with score
$S \ge x$ is 0.01, which means that such a hit is
expected by chance only once in 100 independent
searches over the database. If the E-value of a
hit is 5, then five fortuitous hits with $S \ge x$
are expected within a single database search,
which renders the hit not significant.
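
The two examples can be checked numerically. The sketch below assumes the E-value is n · P(S ≥ x) and that the probability of at least one chance hit is 1 − exp(−E); the function names are illustrative.

```python
import math


def e_value(p_single, n):
    """Expected number of chance hits: n * P(S >= x)."""
    return n * p_single


def p_value_from_e(e):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    # -expm1(-e) equals 1 - exp(-e) and stays accurate for very small e.
    return -math.expm1(-e)


for e in (0.01, 5.0):
    print(f"E = {e}: P(at least one chance hit) = {p_value_from_e(e):.4f}")
```

For small E the two quantities nearly coincide, while for E = 5 the probability of at least one chance hit is close to 1.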
17
A model for database searching score probabilities
  • Scores resulting from searching with a query
    sequence against a database follow the Extreme
    Value Distribution (EVD) (Gumbel, 1955).
  • Using the EVD, the raw alignment scores are
    converted to a statistical score (E value) that
    takes into account the database amino acid
    composition and the scoring scheme (a.a. exchange
    matrix)

18
Extreme Value Distribution
$y = 1 - \exp(-e^{-\lambda(x-\mu)})$
Probability density function for the extreme
value distribution resulting from parameter
values $\mu = 0$ and $\lambda = 1$,
$y = 1 - \exp(-e^{-x})$,
where $\mu$ is the characteristic value and
$\lambda$ is the decay constant.
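
A short sketch that evaluates the expression above, read as the probability of a score of at least x, with the characteristic value `mu` and decay constant `lam` as parameters. The function name `evd_tail` is an assumption for illustration.

```python
import math


def evd_tail(x, mu=0.0, lam=1.0):
    """Evaluate 1 - exp(-exp(-lam * (x - mu)))."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


# With mu = 0 and lam = 1 the expression reduces to 1 - exp(-exp(-x)).
for x in (0.0, 2.0, 5.0):
    print(x, evd_tail(x))
```

At x equal to the characteristic value the result is 1 − 1/e, and it decreases as x grows.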
19
Extreme Value Distribution (EVD)
[Plot: EVD approximation compared with real data]
You know that an optimal alignment of two
sequences is selected out of many suboptimal
alignments, and that a database search is also
about selecting the best alignment(s). This fits
well with the EVD, which has a right tail that
falls off more slowly than the left tail.
Compared to using the normal distribution, when
using the EVD an alignment has to score further
away from the expected mean value to become a
significant hit.
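
The last point can be illustrated numerically. The sketch compares the upper-tail probability k standard deviations above the mean for a normal distribution and for an extreme value (Gumbel) distribution. It relies on the standard Gumbel facts that the mean is mu + γ/λ (γ being the Euler-Mascheroni constant) and the standard deviation is π/(λ√6); the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329


def gumbel_tail(k, lam=1.0, mu=0.0):
    """Upper-tail probability k standard deviations above the Gumbel mean."""
    mean = mu + EULER_GAMMA / lam
    sd = math.pi / (math.sqrt(6.0) * lam)
    x = mean + k * sd
    return -math.expm1(-math.exp(-lam * (x - mu)))


def normal_tail(k):
    """Upper-tail probability k standard deviations above a normal mean."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


for k in (2, 3, 4, 5):
    print(k, normal_tail(k), gumbel_tail(k))
```

At the same distance from the mean the extreme value tail is the larger of the two, so a score must lie further out before it reaches the same level of significance.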
20
Extreme Value Distribution
The probability of a score S to be larger than a
given value x can be calculated following the EVD
as
$P(S \ge x) = 1 - \exp(-e^{-\lambda(x-\mu)})$,
where $\mu = (\ln Kmn)/\lambda$, and K is a constant
that can be estimated from the background amino acid
distribution and scoring matrix (see Altschul and
Gish, 1996, for a collection of values for
$\lambda$ and K over a set of widely used scoring
matrices).
21
Extreme Value Distribution
Using the equation for $\mu$ (preceding slide), the
probability for the raw alignment score S becomes
$P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$.
In practice, the probability $P(S \ge x)$ is
estimated using the approximation
$1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid
for large values of x. This leads to a
simplification of the equation for $P(S \ge x)$:
$P(S \ge x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$.
The lower the probability (E value) for a given
threshold value x, the more significant the score S.
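
A quick numerical check of how good the approximation is. The parameter values are illustrative placeholders (not taken from the slides), and m and n are assumed here to be the query and database lengths.

```python
import math


def p_exact(x, lam, K, m, n):
    """1 - exp(-K*m*n*exp(-lam*x))."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))


def p_approx(x, lam, K, m, n):
    """Large-x approximation: K*m*n*exp(-lam*x)."""
    return K * m * n * math.exp(-lam * x)


# Illustrative values only (assumed).
PARAMS = dict(lam=0.267, K=0.041, m=250, n=1_000_000)

for x in (50, 100):
    print(x, p_exact(x, **PARAMS), p_approx(x, **PARAMS))
```

With these placeholder values the two expressions agree closely at x = 100, whereas at x = 50 the approximation exceeds 1 and can no longer be read as a probability.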
22
Normalised sequence similarity: Statistical
significance
  • Database searching is commonly performed using an
    E-value threshold between 0.1 and 0.001.
  • Low E-value thresholds decrease the number of
    false positives in a database search, but increase
    the number of false negatives, thereby lowering
    the sensitivity of the search.
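
The trade-off can be made concrete with a toy example. The hit list below is invented purely for illustration (E-value plus a flag saying whether the hit is a true homologue), and the function name is hypothetical.

```python
def errors_at_threshold(hits, threshold):
    """Return (false_positives, false_negatives) for an E-value cut-off.

    hits: list of (e_value, is_true_homologue) pairs.
    A hit is accepted when its E-value is at or below the threshold.
    """
    false_pos = sum(1 for e, is_true in hits if e <= threshold and not is_true)
    false_neg = sum(1 for e, is_true in hits if e > threshold and is_true)
    return false_pos, false_neg


# Made-up data for illustration only.
TOY_HITS = [
    (1e-30, True),
    (1e-5, True),
    (0.005, True),
    (0.05, False),
    (0.08, True),
    (2.0, False),
]

for threshold in (0.1, 0.001):
    print(threshold, errors_at_threshold(TOY_HITS, threshold))
```

On this toy list the loose threshold of 0.1 accepts one false positive and misses nothing, while the strict threshold of 0.001 removes the false positive but misses two true homologues.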

23
Words of Encouragement
  • "There are three kinds of lies: lies, damned
    lies, and statistics." Benjamin Disraeli
  • "Statistics in the hands of an engineer are like
    a lamppost to a drunk: they're used more for
    support than illumination."
  • "Then there is the man who drowned crossing a
    stream with an average depth of six inches."
    W.I.E. Gates

24
Protein structure-function relationships
25
Protein function
Genome/DNA, Transcriptome/mRNA, Proteome,
Metabolome, Physiome
Transcription factors
Ribosomal proteins Chaperonins
Enzymes
26
Protein function
Not all proteins are enzymes: ?-crystallin, an eye
lens protein, needs to stay stable and
transparent for a lifetime (very little turnover
in the eye lens)
27
Protein function groups
  • Catalysis (enzymes)
  • Binding transport (active/passive)
  • Protein-DNA/RNA binding (e.g. histones,
    transcription factors)
  • Protein-protein interactions (e.g.
    antibody-lysozyme)
  • Protein-fatty acid binding (e.g. apolipoproteins)
  • Protein small molecules (drug interaction,
    structure decoding)
  • Structural component (e.g. ?-crystallin)
  • Regulation
  • Transcription regulation
  • Signalling
  • Immune system
  • Motor proteins (actin/myosin)

28
What can happen to protein function through
evolution
  • Proteins can have multiple functions (and
    sometimes many -- Ig).
  • Enzyme function is defined by specificity and
    activity
  • Through evolution:
  • Function and specificity can stay the same
  • Function stays same but specificity changes
  • Change to some similar function (e.g. somewhere
    else in metabolic system)
  • Change to completely new function

29
How to arrive at a given function
  • Divergent evolution: homologous proteins;
    proteins have the same structure and same-ish
    function
  • Convergent evolution: analogous proteins;
    different structure but same function
  • Question: can homologous proteins change
    structure (and function)?

30
How to evolve
  • Important distinction:
  • Orthologues: homologous proteins in different
    species (all deriving from same ancestor)
  • Paralogues: homologous proteins in same species
    (internal gene duplication)
  • In practice: to recognise orthology, the
    bi-directional best hit is used in conjunction
    with a database search program (this is called an
    operational definition)
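
The bi-directional best hit criterion can be sketched as follows, assuming search results are already available as score tables between two species. The protein identifiers, the scores, and the function names are invented for illustration.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> alignment score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up scores: species A proteins searched against species B, and back.
A_VS_B = {("a1", "b1"): 200, ("a1", "b2"): 150,
          ("a2", "b1"): 120, ("a2", "b2"): 90}
B_VS_A = {("b1", "a1"): 200, ("b1", "a2"): 120,
          ("b2", "a1"): 150, ("b2", "a2"): 90}

print(bidirectional_best_hits(A_VS_B, B_VS_A))
```

In this toy case only a1 and b1 pick each other, so they are the only pair called orthologous under the operational definition; a2 also prefers b1, but the preference is not returned.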

31
How to evolve
  • By addition of domains (at either end of protein
    sequence); see Lesk book, page 108
  • Often through gene duplication followed by
    divergence
  • Multi-domain proteins are result of gene fusion

32
Protein structure evolution
  • Insertion/deletion of secondary structural
    elements can easily be done at loop sites

33
Flavodoxin fold
5(βα) fold
34
Flavodoxin family - TOPS diagrams (Flores et
al., 1994)
[TOPS diagrams with elements numbered 1 to 5]
35
Protein structure evolution
Insertion/deletion of structural domains can
easily be done at loop sites
N C
36
The basic functional unit of a protein is the
domain. A domain is a:
  • Compact, semi-independent unit (Richardson,
    1981).
  • Stable unit of a protein structure that can fold
    autonomously (Wetlaufer, 1973).
  • Recurring functional and evolutionary module
    (Bork, 1992).
  • Nature is a tinkerer and not an inventor
    (Jacob, 1977).

37
Delineating domains is essential for
  • Obtaining high resolution structures (x-ray, NMR)
  • Sequence analysis
  • Multiple sequence alignment methods
  • Prediction algorithms (SS, Class,
    secondary/tertiary structure)
  • Fold recognition and threading
  • Elucidating the evolution, structure and function
    of a protein family (e.g. Rosetta Stone method)
  • Structural/functional genomics
  • Cross genome comparative analysis

38
Structural domain organisation can be nasty
Pyruvate kinase (Phosphotransferase)
  • β barrel regulatory domain
  • α/β barrel catalytic substrate binding domain
  • α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
39
Complex protein functions are a result of
multiple domains
  • An example is the so-called swivelling domain in
    pyruvate phosphate dikinase (Herzberg et al.,
    1996), which brings an intermediate enzymatic
    product over about 45 Å from the active site of
    one domain to that of another.
  • This enhances the enzymatic activity

40
The DEATH Domain
  • Present in a variety of eukaryotic proteins
    involved with cell death.
  • Six helices enclose a tightly packed hydrophobic
    core.
  • Some DEATH domains form homotypic and
    heterotypic dimers.

http://www.mshri.on.ca/pawson
42
Globin fold: α protein, myoglobin (PDB 1MBN)
43
β sandwich: β protein, immunoglobulin (PDB 7FAB)
44
TIM barrel: α / β protein, Triose phosphate
IsoMerase (PDB 1TIM)
45
A fold in α + β protein: ribonuclease A (PDB 7RSA)
The red balls represent waters that are bound
to the protein based on polar contacts
47
434 Cro protein complex (phage) PDB 3CRO
48
Zinc finger DNA recognition (Drosophila), PDB 2DRP
..YRCKVCSRVY THISNFCRHY VTSH...
49
Zinc-finger DNA binding protein family
Characteristics of the family

Function
The DNA-binding motif is found as part of
transcription regulatory proteins.

Structure
One of the most abundant DNA-binding motifs.
Proteins may contain more than one finger in a
single chain. For example, Transcription Factor
TF3A was the first zinc-finger protein discovered;
it contains 9 C2H2 zinc-finger motifs (tandem
repeats). Each motif consists of 2 antiparallel
beta-strands followed by an alpha-helix. A
single zinc ion is tetrahedrally coordinated by
conserved histidine and cysteine residues,
stabilising the motif.
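
As a rough illustration of the conserved cysteine and histidine spacing, the sketch below scans a sequence with a simplified C2H2 pattern (C, 2 to 4 residues, C, 12 residues, H, 3 to 5 residues, H). The pattern is a commonly quoted simplification chosen for this example, not a definition given in the lecture, and the function name is hypothetical. The test sequence is the 2DRP fragment shown on the zinc finger slide, with spaces removed.

```python
import re

# Simplified C2H2 spacing (an assumption for illustration):
# C-x(2,4)-C-x(12)-H-x(3,5)-H
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2(sequence):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in C2H2_PATTERN.finditer(sequence)]


fragment = "YRCKVCSRVYTHISNFCRHYVTSH"
print(find_c2h2(fragment))
```

A spacing match alone does not prove that a zinc finger is present; it only flags candidates with the two cysteines and two histidines in plausible positions.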
  
50
     
Zinc-finger DNA binding protein family
Characteristics of the family

Binding
Fingers bind to 3 base-pair subsites and specific
contacts are mediated by amino acids in positions
-1, 2, 3 and 6 relative to the start of the
alpha-helix. Contacts mainly involve one strand
of the DNA. Where proteins contain multiple
fingers, each finger binds to adjacent subsites
within a larger DNA recognition site, thus
allowing a relatively simple motif to
specifically bind to a wide range of DNA
sequences. This means that the number and the
type of zinc fingers dictate the specificity of
binding to DNA.
  
51
Leucine zipper (yeast) PDB 1YSA
..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...