Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment

1
Detecting Subtle Sequence Signals: a Gibbs
Sampling Strategy for Multiple Alignment
  • Lawrence et al. 1993

Presented by Manish Agrawal. Slides adapted from
Prof. Sinha's notes.
2
A motif model
  • To define a motif, let's say we know where the
    motif starts in the sequence
  • The motif start positions in their sequences can
    be represented as s = (s1, s2, s3, …, st)

Genes regulated by the same transcription factor
3
Motifs: Matrices and Consensus
  • Line up the patterns by their start indexes
  • s = (s1, s2, …, st)
  • Construct position weight matrix with
    frequencies of each nucleotide in columns
  • Consensus nucleotide in each position has the
    highest frequency in column

                 a G g t a c T t
                 C c A t a c g t
    Alignment    a c g t T A g t
                 a c g t C c A t
                 C c g t a c g G
                 _______________
              A  3 0 1 0 3 1 1 0
    Matrix    C  2 4 0 0 1 4 0 0
              G  0 1 4 0 0 0 3 1
              T  0 0 0 5 1 0 1 4
                 _______________
    Consensus    A C G T A C G T
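
The matrix and consensus on this slide can be recomputed from the five aligned sites. A minimal Python sketch follows; the function names count_matrix and consensus are illustrative, and case is folded to upper because the slide's capital letters only mark mismatches.

```python
from collections import Counter

# The five aligned sites from the slide, upper-cased.
sites = ["AGGTACTT", "CCATACGT", "ACGTTAGT", "ACGTCCAT", "CCGTACGG"]

def count_matrix(sites, alphabet="ACGT"):
    """Return {symbol: [count in column 0, count in column 1, ...]}."""
    width = len(sites[0])
    columns = [Counter(site[i] for site in sites) for i in range(width)]
    return {sym: [col[sym] for col in columns] for sym in alphabet}

def consensus(counts):
    """Pick the most frequent symbol in every column."""
    width = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda sym: counts[sym][i]) for i in range(width)
    )

counts = count_matrix(sites)
for sym, row in counts.items():
    print(sym, row)
print(consensus(counts))  # ACGTACGT
```

Dividing each count by the number of sites (5 here) turns the count matrix into the column frequencies.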

4
Motif Finding Problem (Simplified)
  • Given a set of sequences, find the motif shared
    by all or most sequences, while its starting
    position in each sequence is unknown
  • Assumptions:
  • The motif appears exactly once in each sequence.
  • The motif has fixed length.

5
Generative Model
  • Suppose the sequences are aligned; the aligned
    regions are generated from a motif model.
  • The motif model is a PWM. A PWM is a
    position-specific multinomial distribution.
  • For each position i (from 1 to W), there is a
    multinomial distribution on amino acids, consisting
    of parameters qi1, qi2, …, qi20
  • The unaligned regions are generated from a
    background model p1, p2, …, p20
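
The generative story above can be sketched in a few lines of Python. This is an illustration under assumptions: a 4-letter DNA alphabet instead of the 20 amino acids (to keep it short), a made-up PWM, and a hypothetical helper name generate_sequence.

```python
import random

def generate_sequence(length, start, pwm, background, alphabet, rng):
    """Background symbols everywhere, except a PWM-generated site at `start`."""
    width = len(pwm)
    seq = rng.choices(alphabet, weights=background, k=length)
    for i in range(width):
        # Position i of the site is drawn from its own multinomial q_i.
        seq[start + i] = rng.choices(alphabet, weights=pwm[i], k=1)[0]
    return "".join(seq)

alphabet = "ACGT"
background = [0.25, 0.25, 0.25, 0.25]
# Toy PWM: every row puts all of its mass on one symbol, so the site is fixed.
pwm = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

rng = random.Random(0)
seq = generate_sequence(20, 5, pwm, background, alphabet, rng)
print(seq)
print(seq[5:9])  # ACGT, because this toy PWM is deterministic
```

With a less extreme PWM the planted sites vary from sequence to sequence, which is what makes the motif "subtle".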

6
Notations
  • Set of symbols
  • Sequences: S = {S1, S2, …, SN}
  • Starting positions of motifs: A = {a1, a2, …, aN}
  • Motif model (θ): qij = P(symbol at the i-th
    position = j)
  • Background model: pj = P(symbol = j)
  • Count of symbols in each column: cij = count of
    symbol j in the i-th column in the aligned
    region

7
Probability of data given model
8
Scoring Function
  • Maximize the log-odds ratio
  • The ratio is greater than zero if the data is a
    better match to the motif model than to the
    background model
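
Written out with the symbols from the Notations slide (qij, pj, cij, W), the log-odds ratio can be derived as follows. This is a reconstruction from those definitions, consistent with the form of F in Lawrence et al. 1993 as far as can be told, and not a copy of the slide's own equation. The unaligned regions are generated by the background in both cases, so they cancel in the ratio.

```latex
\begin{aligned}
P(\text{aligned sites} \mid \text{motif model})
  &= \prod_{i=1}^{W} \prod_{j} q_{ij}^{\,c_{ij}}, \\
P(\text{aligned sites} \mid \text{background model})
  &= \prod_{i=1}^{W} \prod_{j} p_{j}^{\,c_{ij}}, \\
F = \log \frac{P(\text{aligned sites} \mid \text{motif model})}
              {P(\text{aligned sites} \mid \text{background model})}
  &= \sum_{i=1}^{W} \sum_{j} c_{ij} \log \frac{q_{ij}}{p_{j}}.
\end{aligned}
```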

9
Scoring function
  • A particular alignment A gives us the
    counts cij.
  • In the scoring function F, use

10
Scoring function
  • Thus, given an alignment A, we can calculate the
    scoring function F
  • We need to find A that maximizes this scoring
    function, which is a log-odds score
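
A minimal Python sketch of computing F for a given alignment A. Two things here are assumptions rather than content from the slides: F is taken to be the sum over columns i and symbols j of cij · log(qij / pj), and qij is estimated from the counts with pseudocounts proportional to the background. The function name log_odds_score is illustrative.

```python
import math
from collections import Counter

def log_odds_score(seqs, starts, width, background, pseudocount=1.0):
    """F = sum_i sum_j c_ij * log(q_ij / p_j) for the alignment `starts`."""
    n = len(seqs)
    total = 0.0
    for i in range(width):
        # c_ij: how often each symbol occurs in column i of the aligned sites.
        column = Counter(seq[a + i] for seq, a in zip(seqs, starts))
        for sym, p in background.items():
            c = column[sym]
            # Assumed estimator: pseudocounts proportional to the background.
            q = (c + pseudocount * p) / (n + pseudocount)
            total += c * math.log(q / p)
    return total

background = {sym: 0.25 for sym in "ACGT"}
seqs = ["TTACGTAA", "GGACGTCC", "CAACGTTG"]
aligned = log_odds_score(seqs, [2, 2, 2], 4, background)     # sites ACGT
misaligned = log_odds_score(seqs, [0, 0, 0], 4, background)  # sites TTAC, GGAC, CAAC
print(aligned > misaligned)  # True
```

A column whose counts match the background exactly contributes zero, which is the sense in which F measures how much better the motif model fits than the background.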

11
Optimization and Sampling
  • To maximize a function f(x):
  • Brute-force method: try all possible x
  • Sampling method: sample x from a probability
    distribution p(x) ∝ f(x)
  • Idea: suppose xmax is the argmax of f(x); then it is
    also the argmax of p(x), so xmax is the single
    most likely value to be sampled
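
A small Python illustration of the sampling idea, with a made-up f over four candidate values (the numbers and the name sample_proportional are hypothetical). Drawing x with probability proportional to f(x) visits the maximizer more often than any other value.

```python
import random
from collections import Counter

def sample_proportional(candidates, f, rng):
    """Draw one candidate x with probability f(x) / sum of f over candidates."""
    weights = [f(x) for x in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def f(x):
    # Toy objective; its argmax is x = 2.
    return {0: 1.0, 1: 2.0, 2: 10.0, 3: 3.0}[x]

rng = random.Random(42)
candidates = [0, 1, 2, 3]
draws = Counter(sample_proportional(candidates, f, rng) for _ in range(10_000))
most_common_x = draws.most_common(1)[0][0]
print(most_common_x)  # 2
```

Here x = 2 has probability 10/16 per draw. When f is flatter the maximizer is still the most likely draw, but many more samples are needed before it stands out.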

12
Markov Chain Sampling
  • To sample from a probability distribution p(x),
    we set up a Markov chain such that each state
    represents a value of x and, for any two states x
    and y, the transition probabilities satisfy
  • This would then imply
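
The two formulas for this slide are not in the transcript. A standard condition that fits the wording is detailed balance, so, assuming that is what was shown, it can be written as below, with T(x → y) denoting the transition probability from state x to state y (a symbol introduced here for illustration).

```latex
\begin{aligned}
p(x)\,T(x \to y) &= p(y)\,T(y \to x) \quad \text{for all states } x, y \\
\Longrightarrow \quad \sum_{x} p(x)\,T(x \to y)
  &= p(y) \sum_{x} T(y \to x) = p(y).
\end{aligned}
```

So p is a stationary distribution of the chain. If the chain is also ergodic, the fraction of time it spends in state x converges to p(x), which is what lets a long run of the chain stand in for sampling from p.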

13
Gibbs sampling to maximize F
  • Gibbs sampling is a special type of Markov chain
    sampling algorithm
  • Our goal is to find the optimal A = (a1, …, aN)
  • The Markov chain we construct will only have
    transitions from A to alignments A' that differ
    from A in only one of the ai
  • In round-robin order, pick one of the ai to
    replace
  • Consider all A' formed by replacing ai with some
    other starting position ai' in sequence Si
  • Move to one of these A' probabilistically
  • Iterate the last three steps

14
Algorithm
  • Randomly initialize A0
  • Repeat:
  • (1) Randomly choose a sequence z from S;
    set A* = At \ {az} and compute θt from A*
  • (2) Sample az according to P(az = x), which is
    proportional to Qx/Px; update At+1 = A* ∪ {x}
  • Select the At that maximizes F

Qx: the probability of generating x according to
θt. Px: the probability of generating x
according to the background model.
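
A minimal Python sketch of this loop, under assumptions that are not on the slide: θt is estimated with pseudocounts proportional to the background, the loop runs for a fixed number of iterations, and the data is a toy DNA set. The names gibbs_motif_search, column_probs and score are illustrative, not from the paper.

```python
import math
import random
from collections import Counter

def column_probs(seqs, starts, width, skip, background, pseudo=1.0):
    """Estimate q_ij from every aligned site except the one in sequence `skip`."""
    n = len(seqs) - 1
    model = []
    for i in range(width):
        col = Counter(s[a + i] for k, (s, a) in enumerate(zip(seqs, starts)) if k != skip)
        model.append({sym: (col[sym] + pseudo * p) / (n + pseudo)
                      for sym, p in background.items()})
    return model

def score(seqs, starts, width, background, pseudo=1.0):
    """Log-odds score F of a full alignment (assumed form: sum c_ij log(q_ij/p_j))."""
    n, total = len(seqs), 0.0
    for i in range(width):
        col = Counter(s[a + i] for s, a in zip(seqs, starts))
        for sym, p in background.items():
            total += col[sym] * math.log((col[sym] + pseudo * p) / (n + pseudo) / p)
    return total

def gibbs_motif_search(seqs, width, background, iterations=500, seed=0):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]    # random A0
    best, best_f = list(starts), score(seqs, starts, width, background)
    for _ in range(iterations):
        z = rng.randrange(len(seqs))                               # step (1)
        model = column_probs(seqs, starts, width, z, background)   # theta_t from A*
        candidates = range(len(seqs[z]) - width + 1)
        ratios = []
        for x in candidates:
            site = seqs[z][x:x + width]
            # Qx / Px for the candidate site starting at x.
            ratios.append(math.prod(model[i][c] / background[c]
                                    for i, c in enumerate(site)))
        starts[z] = rng.choices(candidates, weights=ratios, k=1)[0]  # step (2)
        f = score(seqs, starts, width, background)
        if f > best_f:                                             # keep the best At
            best, best_f = list(starts), f
    return best, best_f

seqs = ["GATACGTACGTCCA", "TCCACGTACGTGAT", "CGAACGTACGTTTG", "ATGACGTACGTAGC"]
background = {sym: 0.25 for sym in "ACGT"}
best, best_f = gibbs_motif_search(seqs, 8, background)
print(best, round(best_f, 2))
```

The sampler is stochastic, so different seeds can return different alignments; keeping the best-scoring At seen so far corresponds to the last bullet of the slide.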
15
Algorithm
Current solution At
16
Algorithm
Choose one az to replace
17
Algorithm
For each candidate site x in sequence z,
calculate Qx and Px: the probabilities of sampling x
from the motif model and the background model, respectively.
18
Algorithm
Among all possible candidates, choose one (say
x) with probability proportional to Qx/Px
19
Algorithm
Set At+1 = A* ∪ {x}
20
Algorithm
Repeat
21
Local optima
  • The algorithm may not find the global or true
    maximum of the scoring function
  • Once At contains many similar substrings,
    others matching these will be chosen with higher
    probability
  • The algorithm can get locked into a local
    optimum
  • All neighbors have poorer scores, hence there is a
    low chance of moving out of this solution

22
Phase shifts
  • After every M iterations, compare the current At
    with alignments obtained by shifting every
    aligned substring ai by the same amount, either to
    the left or to the right
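
A minimal Python sketch of a phase-shift check. It is a simplification under stated assumptions: every site is shifted by the same amount d, the shifted alignment with the highest F is chosen outright (rather than sampled), and F uses the pseudocount estimator assumed here. The names score and best_phase_shift are illustrative.

```python
import math
from collections import Counter

def score(seqs, starts, width, background, pseudo=1.0):
    """Log-odds score F (assumed form: sum over columns of c_ij log(q_ij/p_j))."""
    n, total = len(seqs), 0.0
    for i in range(width):
        col = Counter(s[a + i] for s, a in zip(seqs, starts))
        for sym, p in background.items():
            total += col[sym] * math.log((col[sym] + pseudo * p) / (n + pseudo) / p)
    return total

def best_phase_shift(seqs, starts, width, background, max_shift=2):
    """Return the uniformly shifted alignment with the highest F (shift 0 included)."""
    best_starts, best_f = list(starts), score(seqs, starts, width, background)
    for d in range(-max_shift, max_shift + 1):
        shifted = [a + d for a in starts]
        # Skip shifts that push any site off the end of its sequence.
        if any(a < 0 or a + width > len(s) for s, a in zip(seqs, shifted)):
            continue
        f = score(seqs, shifted, width, background)
        if f > best_f:
            best_starts, best_f = shifted, f
    return best_starts, best_f

# Toy data: the motif ACGTACGT sits at position 3 in every sequence.
seqs = ["GATACGTACGTCCA", "TCCACGTACGTGAT", "CGAACGTACGTTTG", "ATGACGTACGTAGC"]
background = {sym: 0.25 for sym in "ACGT"}
off_by_one = [4, 4, 4, 4]  # a "locked" alignment that is one position out of phase
fixed, fixed_f = best_phase_shift(seqs, off_by_one, 8, background)
print(fixed)  # [3, 3, 3, 3]
```

The out-of-phase alignment already has seven conserved columns, which is why single-site moves rarely escape it while a joint shift does.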

23
Phase shift
24
Phase shift
25
Pattern Width
  • The algorithm described so far requires the pattern
    width (W) to be given as input.
  • We can modify the algorithm so that it executes
    for a range of plausible widths.
  • The function F is not immediately useful for this
    purpose as its optimal value always increases
    with increasing W.

26
Pattern Width
  • Another function based on the incomplete-data
    log-probability ratio G can be used.
  • Dividing G by the number of free parameters
    needed to specify the pattern (19W in the case of
    proteins) produces a statistic useful for
    choosing pattern width. This quantity can be
    called information per parameter.
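
The width criterion is a single division, sketched below in Python. The G values are made up for illustration (the slides do not show how G is computed), and information_per_parameter is a hypothetical helper name. Each column of a 20-letter motif has 19 free parameters because its probabilities sum to one.

```python
def information_per_parameter(g, width, alphabet_size=20):
    """Divide G by the number of free parameters, (alphabet_size - 1) * width."""
    free_parameters = (alphabet_size - 1) * width
    return g / free_parameters

# Hypothetical G values for three candidate widths; the numbers are invented.
g_by_width = {10: 95.0, 15: 171.0, 20: 190.0}
per_param = {w: information_per_parameter(g, w) for w, g in g_by_width.items()}
best_width = max(per_param, key=per_param.get)
print(best_width)  # 15
```

In this toy example G keeps growing with W, but the per-parameter value peaks at W = 15.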

27
Examples
  • The algorithm was applied to locate
    helix-turn-helix (HTH) motifs, which represent a
    large class of sequence-specific DNA-binding
    structures involved in numerous cases of gene
    regulation.
  • Detection and alignment of HTH motifs is a
    well-recognized problem because of the great
    sequence variation.

28
HTH Motif
Panels: complete sequences, non-site sequences, random sequences
29
Convergence behavior of Gibbs Sampling Algorithm
30
Time complexity analysis
  • For a typical protein sequence, it was found
    that, for a single pattern width, each input
    sequence needs to be sampled fewer than T = 100
    times before convergence.
  • About L·W multiplications are performed in step (2)
    of the algorithm (L candidate sites, W positions each).
  • Total multiplications to execute the algorithm:
    T·N·Lavg·W
  • Linear time complexity has been observed in
    applications
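
A worked instance of the T·N·Lavg·W estimate, as a short Python sketch. The workload numbers are hypothetical and estimated_multiplications is an illustrative name.

```python
def estimated_multiplications(t, n, l_avg, w):
    """T * N * L_avg * W: passes x sequences x candidate sites x site width."""
    return t * n * l_avg * w

# Hypothetical workload: 30 sequences of about 300 residues, width 18, T = 100.
print(estimated_multiplications(100, 30, 300, 18))  # 16200000
```

Doubling the number of sequences, or their average length, doubles the estimate.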

31
Motif finding
  • The Gibbs sampling algorithm was originally
    applied to find motifs in amino acid sequences
  • Protein motifs represent common sequence patterns
    in proteins that are related to certain
    structures and functions of the protein
  • Gibbs sampling is extensively used to find motifs
    in DNA sequences, e.g., transcription factor
    binding sites

32
Thank You