Bioinformatics - PowerPoint PPT Presentation

Slides: 19
Provided by: anatolyr

Transcript and Presenter's Notes

Title: Bioinformatics


1
Lecture 10
Bioinformatics
  • Finding signals and motifs in DNA and proteins
  • Expectation Maximization Algorithm
  • MEME
  • The Gibbs sampler

2
Finding signals and motifs in DNA and proteins
  • Sequence alignment is intrinsically connected
    with another essential task: finding signals
    and motifs (highly conserved ungapped blocks)
    shared by some of the sequences.
  • A motif is a sequence pattern that occurs
    repeatedly in a group of related protein or DNA
    sequences. Motifs are represented as
    position-dependent scoring matrices that describe
    the score of each possible letter at each
    position in the pattern.
  • Another related task is searching biological
    databases for sequences that contain one or more
    known motifs.
  • These objectives are critical in the analysis of
    genes and proteins, as any gene or protein
    contains a set of different motifs and signals.
    Complete knowledge of the locations and structure
    of such motifs and signals leads to a
    comprehensive description of a gene or protein
    and points to a potential function.
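
As an illustration of the position-dependent scoring matrix mentioned above, here is a minimal Python sketch. The toy sites, the uniform background, the pseudocount, and the function names `build_pssm` and `best_hit` are all assumptions made for the example, not part of the lecture.

```python
import math

ALPHABET = "ACGT"

def build_pssm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds score per letter for each motif position."""
    if background is None:
        background = {b: 0.25 for b in ALPHABET}  # assumed uniform background
    width = len(sites[0])
    pssm = []
    for col in range(width):
        # Pseudocounts keep unseen letters from getting a score of -infinity.
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2((counts[b] / total) / background[b])
                     for b in ALPHABET})
    return pssm

def best_hit(pssm, seq):
    """Slide the matrix along seq; return (start, score) of the best window."""
    width = len(pssm)
    scores = [sum(pssm[i][seq[s + i]] for i in range(width))
              for s in range(len(seq) - width + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical aligned occurrences of a motif.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pssm = build_pssm(sites)
start, score = best_hit(pssm, "GGCGTATAATGCC")
print(start, round(score, 2))
```

Scanning a database for a known motif amounts to repeating `best_hit` (or a score threshold test) over every sequence.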

3
The eMOTIF method of motif analysis
  • eMOTIF is a very useful method of identifying
    motifs in proteins.
  • An MSA of a particular set of proteins is
    submitted to eMOTIF, which essentially searches
    for consensus sequence(s) and identifies the
    conserved motifs.
  • The probability of a motif is estimated from the
    frequencies of the individual amino acids in the
    SwissProt DB, as a product of the probabilities
    of each position in the consensus.
  • The result could be as follows: "This motif
    matches 25 out of the 30 sequences supplied. It
    will match 1 in 10^19 random sequences, or less
    than 1 sequence in the current SWISS-PROT
    database."
  • The motif can then be searched for in the
    Swiss-Prot DB.
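
The product-of-positions estimate described above can be sketched as follows. This is only an illustration: the amino acid frequencies are round placeholder numbers (not actual Swiss-Prot values), the motif is invented, and `motif_probability` is a hypothetical helper, not eMOTIF's own code.

```python
import math

# Placeholder amino acid frequencies (NOT actual Swiss-Prot values).
FREQ = {"L": 0.10, "I": 0.06, "V": 0.07, "G": 0.07,
        "D": 0.05, "E": 0.07, "C": 0.01, "W": 0.01}

def motif_probability(motif, freq):
    """Probability that a random position matches the motif.

    motif is a list with one entry per position: a string of allowed
    residues, or "." for a position where any residue is accepted.
    """
    p = 1.0
    for group in motif:
        if group == ".":
            continue  # wildcard: matches with probability 1
        # A position allowing several residues matches with the sum
        # of their individual frequencies.
        p *= sum(freq[aa] for aa in group)
    return p

# Invented motif: C, any, any, C, one of I/L/V, G, one of D/E, W.
motif = ["C", ".", ".", "C", "ILV", "G", "DE", "W"]
p = motif_probability(motif, FREQ)
print(f"match probability per position: {p:.3e}")
```

Multiplying `p` by the number of positions in a database gives the expected number of chance matches, which is the kind of figure quoted in the eMOTIF result message.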

4
eMOTIF

True positives
5
eMOTIF search of the DB for sequences
containing a given motif
6
Expectation Maximization (EM) Algorithm
  • This algorithm is used to identify conserved
    areas in unaligned DNA and proteins.
  • Assume that a set of sequences is expected to
    have a common sequence pattern.
  • An initial guess is made as to location and size
    of the site of interest in each of the sequences
    and these parts are loosely aligned.
  • This alignment provides an estimate of base or
    aa composition of each column in the site.
  • The EM algorithm consists of two steps,
    which are repeated alternately.
  • In step 1, the expectation step, the
    column-by-column composition of the site is used
    to estimate the probability of finding the site
    at any position in each of the sequences. These
    probabilities are used to provide new information
    as to the expected base or aa distribution for
    each column in the site.
  • In step 2, the maximization step, the new counts
    of bases or aa for each position in the site
    found in step 1 are substituted for the
    previous set.

7
Expectation Maximization (EM) Algorithm
[Figure: unaligned sequences (O) with the candidate
motif site (XXXX) marked in each sequence.]
Columns defined by a preliminary alignment of
the sequences provide initial estimates of the
frequencies of aa in each motif column.
Columns not in the motif provide background
frequencies.
Bases   Background   Site column 1   Site column 2
G       0.27         0.4             0.1
C       0.25         0.4             0.1
A       0.25         0.2             0.1
T       0.23         0.2             0.7
Total   1.00         1.00            1.00
8
Expectation Maximization (EM) Algorithm
[Figure: seq 1 with two candidate motif positions
labeled A and B.]
The resulting score gives the likelihood that the
motif matches position A, B, or another position
in seq 1. Repeat for all other positions and find
the most likely location. Then repeat for the
remaining seqs.
9
EM Algorithm 1st expectation step calculations
  • Assume that seq1 is 100 bases long and the
    length of the site is 20 bases.
  • Suppose that the site starts in column 1 and
    the first two positions are A and T.
  • The site will end at position 20, and
    positions 21 and 22 do not belong to the site.
    Assume that these two positions are also A and T.
  • The probability of this location of the site in
    seq1 is given by
  • Psite1,seq1 = 0.2 (for A in position 1) x 0.7
    (for T in position 2) x Ps (for the next 18
    positions in site) x 0.25 (for A in first
    flanking position) x 0.23 (for T in second
    flanking position) x Ps (for the next 78 flanking
    positions).
  • The same procedure is applied to calculate the
    probabilities Psite2,seq1 to Psite78,seq1,
    thus providing a comparative set of
    probabilities for the site location.
  • The probability of the best location in seq1,
    say at site k, is the site probability at k
    divided by the sum of all the site
    probabilities.
  • Then the procedure is repeated for all other
    sequences.
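
The expectation-step calculation above can be sketched in Python. This is a minimal illustration under stated assumptions: the two-column motif table is a toy that favors "AT" (it is not the table from the slides), the sequence is invented, and `site_probabilities` is a hypothetical name. The background values are the ones given in the lecture.

```python
def site_probabilities(seq, motif, background):
    """Return the normalized probability of each possible site start.

    motif is a list of per-column dicts {base: frequency}; positions
    outside the site are scored with the background frequencies.
    """
    width = len(motif)
    raw = []
    for start in range(len(seq) - width + 1):
        p = 1.0
        for i, base in enumerate(seq):
            if start <= i < start + width:
                p *= motif[i - start][base]   # position inside the site
            else:
                p *= background[base]         # flanking position
        raw.append(p)
    total = sum(raw)
    # Divide each location's probability by the sum over all locations.
    return [p / total for p in raw]

background = {"G": 0.27, "C": 0.25, "A": 0.25, "T": 0.23}
# Toy motif columns (assumed for the example).
motif = [
    {"G": 0.1, "C": 0.1, "A": 0.7, "T": 0.1},
    {"G": 0.1, "C": 0.1, "A": 0.1, "T": 0.7},
]
probs = site_probabilities("GCATGC", motif, background)
for start, p in enumerate(probs, start=1):
    print(f"site {start}: {p:.3f}")
```

For long sequences the raw products underflow quickly, so a practical version would likely sum logarithms instead of multiplying probabilities.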

10
EM Algorithm 2nd maximization step calculations
  • The site probabilities for each seq calculated
    at the 1st step are then used to create a new
    table of expected values for base counts for each
    of the site positions, using the site
    probabilities as weights.
  • Suppose that P (site 1 in seq 1) = Psite1,seq1 /
    (Psite1,seq1 + Psite2,seq1 + ... + Psite78,seq1)
    = 0.01 and P (site 2 in seq 1) = 0.02.
  • Then these values are added to the previous table
    as shown in the table below.
  • This procedure is repeated for every other
    possible first column in seq1, and then the
    process continues for all other sequences,
    resulting in a new version of the table.
  • The expectation and maximization steps are
    repeated until the estimates of base frequencies
    do not change.

Bases              Background   Site column 1   Site column 2
G                  0.27         0.4             0.1
C                  0.25         0.4             0.1
A                  0.25         0.2 + 0.01      0.1
T                  0.23         0.2             0.7 + 0.02
Total / weighted   1.00         1.00            1.00
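
The weighted update and the repeat-until-stable loop can be sketched as below. This is a simplified illustration, not the lecture's exact procedure: the background is held fixed rather than re-estimated, a small pseudocount is added, and the sequences, the starting guess, and the function names are all assumptions.

```python
ALPHABET = "ACGT"

def e_step(seq, motif, background):
    """Normalized probability of each site start in one sequence."""
    width = len(motif)
    raw = []
    for start in range(len(seq) - width + 1):
        p = 1.0
        for i, base in enumerate(seq):
            inside = start <= i < start + width
            p *= motif[i - start][base] if inside else background[base]
        raw.append(p)
    total = sum(raw)
    return [p / total for p in raw]

def m_step(seqs, all_probs, width, pseudocount=0.01):
    """Re-estimate column frequencies using site probabilities as weights."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for seq, probs in zip(seqs, all_probs):
        for start, weight in enumerate(probs):
            for col in range(width):
                counts[col][seq[start + col]] += weight
    return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]

def run_em(seqs, motif, background, max_iter=100, tol=1e-6):
    """Alternate the two steps until the frequencies stop changing."""
    for _ in range(max_iter):
        all_probs = [e_step(s, motif, background) for s in seqs]
        new_motif = m_step(seqs, all_probs, len(motif))
        change = max(abs(new_motif[c][b] - motif[c][b])
                     for c in range(len(motif)) for b in ALPHABET)
        motif = new_motif
        if change < tol:
            break
    return motif

background = {"G": 0.27, "C": 0.25, "A": 0.25, "T": 0.23}
seqs = ["GCTATAGC", "CCGTATAG", "TATACGGC", "GGCTATAC"]  # invented data
# Assumed initial guess: each column favors one base of "TATA".
guess = [{b: (0.7 if b == best else 0.1) for b in ALPHABET} for best in "TATA"]
final = run_em(seqs, guess, background)
consensus = "".join(max(col, key=col.get) for col in final)
print(consensus)
```

EM only climbs to a nearby optimum, so the result depends on the initial guess; this is one reason MEME tries multiple starting points.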
12
Multiple EM for Motif Elicitation - MEME
13
MEME Summary Line
  • This line gives the width (width), number of
    occurrences in the training set (sites), log
    likelihood ratio (llr) and E-value of the
    motif. Each motif describes a pattern of a fixed
    width and no gaps are allowed in MEME motifs.
    MEME numbers the motifs consecutively from one as
    it finds them. MEME usually finds the most
    statistically significant (low E-value) motifs
    first.
  • The statistical significance of a motif is based
    on its log likelihood ratio, its width and number
    of occurrences, the background letter
    frequencies (given in the command line summary),
    and the size of the training set.
  • The E-value is an estimate of the expected
    number of motifs with the given log likelihood
    ratio (or higher), and with the same width and
    number of occurrences, that one would find in a
    similarly sized set of random sequences. (In
    random sequences each position is independent
    with letters chosen according to the background
    letter frequencies.)
  • The log likelihood ratio is the logarithm of
    the ratio of the probability of the occurrences
    of the motif given the motif model (likelihood
    given the motif) versus their probability given
    the background model (likelihood given the null
    model). (Normally the background model is a
    0-order Markov model using the background letter
    frequencies, but higher order Markov models may
    be specified via the -bfile option to MEME.)
  • Clicking on the buttons to the left of the motif
    summary line takes you to the previous motif (P)
    or next motif (N).
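
To make the log likelihood ratio described above concrete, here is a minimal Python sketch for a 0-order background model. It is an assumption-laden illustration rather than MEME's own computation: the motif model is simply the observed column frequencies of the sites, natural logarithms are used, and the sites and uniform background are invented.

```python
import math

def log_likelihood_ratio(sites, background):
    """Log of P(sites | motif model) / P(sites | background model).

    The motif model is taken to be the observed letter frequency in
    each column of the aligned sites (an assumption for this sketch).
    """
    width = len(sites[0])
    n = len(sites)
    llr = 0.0
    for col in range(width):
        counts = {}
        for site in sites:
            counts[site[col]] = counts.get(site[col], 0) + 1
        for base, c in counts.items():
            # Each of the c occurrences contributes log(f_motif / f_background).
            llr += c * math.log((c / n) / background[base])
    return llr

background = {b: 0.25 for b in "ACGT"}        # assumed uniform background
sites = ["TATA", "TATA", "TATA", "TACA"]      # invented motif occurrences
llr = log_likelihood_ratio(sites, background)
print(round(llr, 3))
```

A wider motif or more occurrences raises the ratio, which is why the E-value also takes width, number of sites, and training set size into account.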

14
MEME Summary Line
15
MEME
MOTIF 1   width = 26   sites = 5   llr = 244   E-value = 5.0e-006
16
MEME
17
The Gibbs Sampler
  • The Gibbs sampler algorithm is slightly
    different from the EM approach. The method also
    searches for the statistically most probable
    motifs and can find the optimal width and the
    number of motifs in each sequence.
  • The method iterates through two steps. In the
    first step, a random start position for the motif
    is chosen in all sequences but one. These
    sequences are then aligned and used to find an
    initial guess of the motif.
  • The objective of the next step is to find the
    most probable pattern common to the left-out
    sequence (and, on the next iterations, to all of
    the sequences) by sliding the motif back and
    forth until the ratio of the motif probability to
    the background probability is a maximum.
  • Then the next sequence is left out and the
    process is repeated until the residue frequencies
    in each motif do not change. The number of
    iterations may range from several hundred to
    several thousand.
  • Several additional statistical procedures are
    used to improve the performance of the algorithm.
    The Gibbs sampler has been used to align
    sequences with very little sequence similarity.

18
Steps of the Gibbs sampler algorithm
A. Estimate the aa or base frequencies in the
motif columns of all but one sequence. Also
obtain background frequencies.
[Figure: all sequences except the left-out
(outlier) sequence, each with a randomly chosen
motif start. x marks sequence positions and M marks
the random location of the motif in each seq. The
location of the motif in each sequence provides the
first estimate of motif composition.]
B. Use the estimate from A to calculate the
ratio of probability of motif to background score
at each position in the left-out
sequence. This ratio for each possible location
in the sequence is the weight of the position.
[Figure: the motif M is slid along the left-out
sequence, one position at a time.]
C. Choose a new
location for the motif in the left-out sequence
by a random selection, using the weights to bias
the choice.
[Figure: estimated location of the motif in the
left-out sequence.]
D. Repeat steps A to C many times.
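
Steps A to D can be sketched in Python as follows. This is a minimal illustration under several assumptions: a fixed motif width, one motif per sequence, background frequencies taken from all positions (rather than only non-motif positions), a pseudocount of 0.5, and invented sequences. The function names are hypothetical, and none of the additional statistical procedures mentioned in the lecture are included.

```python
import random

ALPHABET = "ACGT"

def column_freqs(seqs, starts, width, skip, pseudocount=0.5):
    """Step A: motif column frequencies from all sequences except `skip`."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for col in range(width):
            counts[col][seq[start + col]] += 1
    return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]

def background_freqs(seqs, pseudocount=0.5):
    """Background letter frequencies (simplified: uses every position)."""
    counts = {b: pseudocount for b in ALPHABET}
    for seq in seqs:
        for base in seq:
            counts[base] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in ALPHABET}

def gibbs_sampler(seqs, width, iterations=500, seed=0):
    rng = random.Random(seed)
    # Random start position for the motif in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    background = background_freqs(seqs)
    for it in range(iterations):                      # step D: repeat
        left_out = it % len(seqs)
        motif = column_freqs(seqs, starts, width, skip=left_out)
        seq = seqs[left_out]
        weights = []
        for start in range(len(seq) - width + 1):     # step B: weights
            ratio = 1.0
            for col in range(width):
                base = seq[start + col]
                ratio *= motif[col][base] / background[base]
            weights.append(ratio)
        # Step C: sample a new location, biased by the weights.
        starts[left_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["GCTATAGCGG", "CCGTATAGCA", "TATACGGCCG", "GGCGCTATAC"]  # invented
starts = gibbs_sampler(seqs, width=4)
print(starts)
```

Because step C samples a location instead of always taking the best one, the sampler can escape poor initial alignments, at the cost of needing many iterations.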