Title: Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment
1Detecting Subtle Sequence Signals a Gibbs
Sampling Strategy for Multiple Alignment
Presented By Manish Agrawal Slides adapted from
Prof Sinhas notes.
2A motif model
- To define a motif, lets say we know where the
motif starts in the sequence - The motif start positions in their sequences can
be represented as s (s1,s2,s3,,st)
Genes regulated by same transcription factor
3Motifs Matrices and Consensus
- Line up the patterns by their start indexes
- s (s1, s2, , st)
- Construct position weight matrix with
frequencies of each nucleotide in columns - Consensus nucleotide in each position has the
highest frequency in column
- a
G g t a c T t - C c A t a c g t
- Alignment a c g t T A g t
- a c g t C c A t
- C c g t a c g G
-
_________________ -
- A 3 0 1 0 3 1 1 0
- Matrix C 2 4 0 0 1 4 0 0
- G 0 1 4 0 0 0 3 1
- T 0 0 0 5 1 0 1 4
- _________________
- Consensus A C G T A C G T
4Motif Finding Problem(Simplified)
- Given a set of sequences, find the motif shared
by all or most sequences, while its starting
position in each sequence is unknown - Assumption
- Each motif appears exactly once in one sequence.
- The motif has fixed length.
5Generative Model
- Suppose the sequences are aligned, the aligned
regions are generated from a motif model. - Motif model is a PWM. A PWM is a
position-specific multinomial distribution. - For each position i (from 1 to W), a multinomial
distribution on amino acids, consisting of
variables qi1, qi2,..,qi20 - The unaligned regions are generated from a
background model p1,p2, , p20
6Notations
- Set of symbols
- Sequences S S1, S2, , SN
- Starting positions of motifs A a1, a2, , aN
- Motif model ( ) qij P(symbol at the i-th
position j) - Background model pj P(symbol j)
- Count of symbols in each column cij count of
symbol, j, in the i-th column in the aligned
region
7Probability of data given model
8Scoring Function
- Maximize the log-odds ratio
- Is greater than zero if the data is a better
match to the motif model than to the background
model
9Scoring function
- A particular alignment A gives us the
- counts cij.
- In the scoring function F, use
10Scoring function
- Thus, given an alignment A, we can calculate the
scoring function F - We need to find A that maximizes this scoring
function, which is a log-odds score
11Optimization and Sampling
- To maximize a function, f(x)
- Brute force method try all possible x
- Sample method sample x from probability
distribution p(x) f(x) - Idea suppose xmax is argmax of f(x), then it is
also argmax of p(x), thus we have a high
probability of selecting xmax
12Markov Chain Sampling
- To sample from a probability distribution p(x),
we set up a Markov chain s.t. each state
represents a value of x and for any two states, x
and y, the transitional probabilities satisfy
13Gibbs sampling to maximize F
- Gibbs sampling is a special type of Markov chain
sampling algorithm - Our goal is to find the optimal A (a1,aN)
- The Markov chain we construct will only have
transitions from A to alignments A that differ
from A in only one of the ai - In round-robin order, pick one of the ai to
replace - Consider all A formed by replacing ai with some
other starting position ai in sequence Si - Move to one of these A probabilistically
- Iterate the last three steps
14Algorithm
- Randomly initialize A0
- Repeat
- (1) randomly choose a sequence z from S
- A At \ az compute ?t from A
- (2) sample az according to P(az x), which is
proportional to Qx/Px update At1 A ? x - Select At that maximizes F
Qx the probability of generating x according to
?t Px the probability of generating x
according to the background model
15Algorithm
Current solution At
16Algorithm
Choose one az to replace
17Algorithm
x
For each candidate site x in sequence z,
calculate Qx and Px Probabilities of sampling x
from motif model and background model resp.
18Algorithm
x
Among all possible candidates, choose one (say
x) with probability proportional to Qx/Px
19Algorithm
x
Set At1 A ? x
20Algorithm
x
Repeat
21Local optima
- The algorithm may not find the global or true
maximum of the scoring function - Once At contains many similar substrings,
others matching these will be chosen with higher
probability - Algorithm will get locked into a local
optimum - all neighbors have poorer scores, hence low
chance of moving out of this solution
22Phase shifts
- After every M iterations, compare the current At
with alignments obtained by shifting every
aligned substring ai by some amount, either to
left or right
23Phase shift
24Phase shift
25Pattern Width
- The algorithm described so far requires pattern
width(W) to be input. - We can modify the algorithm so that it executes
for a range of plausible widths. - The function F is not immediately useful for this
purpose as its optimal value always increases
with increasing W.
26Pattern Width
- Another function based on the incomplete-data
log-probability ratio G can be used. - Dividing G by the number of free parameters
needed to specify the pattern (19W in the case of
proteins) produced a statistic useful for
choosing pattern width. This quantity can be
called information per parameter.
27Examples
- The algorithm was applied to locate
helix-turn-helix (HTH) motif, which represent a
large class of sequence-specific DNA binding
structures involved in numerous cases of gene
regulation. - Detection and alignment of HTH motifs is a well
recognized problem because of the great sequence
variation.
28HTH Motif
Complete Sequences
Non-site seq
Random seq
29Convergence behavior of Gibbs Sampling Algorithm
30Time complexity analysis
- For a typical protein sequence, it was found
that, for a single pattern width, each input
sequence needs to be sampled fewer than T 100
times before convergence. - LW multiplications are performed in Step2 of the
algorithm. - Total multiplications to execute the algorithm
TNLavgW - Linear Time complexity has been observed in
applications
31Motif finding
- The Gibbs sampling algorithm was originally
applied to find motifs in amino acid sequences - Protein motifs represent common sequence patterns
in proteins, that are related to certain
structure and function of the protein - Gibbs sampling is extensively used to find motifs
in DNA sequence, i.e., transcription factor
binding sites
32Thank You