Title: (Regulatory-) Motif Finding
1(Regulatory-) Motif Finding
2Finding Regulatory Motifs
. . .
- Given a collection of genes with common
expression, - Find the TF-binding motif in common
3Expectation Maximization
- Initialize parameters ? (M, B), ?
- Try different values of ? from N-1/2 up to 1/(2K)
- Repeat
- Expectation
- Maximization
- Until change in ? (M, B), ? falls below ?
- Report results for several good ?
?
1 ?
M1
M1
MK
B
A C G T
4Gibbs Sampling in Motif Finding
5Gibbs Sampling
- Given
- x1, , xN,
- motif length K,
- background B,
- Find
- Model M
- Locations a1,, aN in x1, , xN
- Maximizing log-odds likelihood ratio
6Gibbs Sampling
- AlignACE first statistical motif finder
- BioProspector improved version of AlignACE
- Algorithm (sketch)
- Initialization
- Select random locations in sequences x1, , xN
- Compute an initial model M from these locations
- Sampling Iterations
- Remove one sequence xi
- Recalculate model
- Pick a new location of motif in xi according to
probability the location is a motif occurrence
7Gibbs Sampling
- Initialization
- Select random locations a1,, aN in x1, , xN
- For these locations, compute M
- That is, Mkj is the number of occurrences of
letter j in motif position k, over the total
8Gibbs Sampling
- Predictive Update
- Select a sequence x xi
- Remove xi, recompute model
M
- where ?j are pseudocounts to avoid 0s,
- and B ?j ?j
9Gibbs Sampling
- Sampling
- For every K-long word xj,,xjk-1 in x
- Qj Prob word motif M(1,xj)??M(k,xjk-1)
- Pi Prob word background B(xj)??B(xjk-1)
- Let
- Sample a random new position ai according to the
probabilities A1,, Ax-k1.
Prob
0
x
10Gibbs Sampling
- Running Gibbs Sampling
- Initialize
- Run until convergence
- Repeat 1,2 several times, report common motifs
11Advantages / Disadvantages
- Very similar to EM
- Advantages
- Easier to implement
- Less dependent on initial parameters
- More versatile, easier to enhance with heuristics
- Disadvantages
- More dependent on all sequences to exhibit the
motif - Less systematic search of initial parameter space
12Repeats, and a Better Background Model
- Repeat DNA can be confused as motif
- Especially low-complexity CACACA AAAAA, etc.
- Solution
- more elaborate background model
- 0th order B pA, pC, pG, pT
- 1st order B P(AA), P(AC), , P(TT)
-
- Kth order B P(X b1bK) X, bi?A,C,G,T
- Has been applied to EM and Gibbs (up to 3rd
order)
13Phylogenetic Footprinting(Slides by Martin Tompa)
14Phylogenetic Footprinting(Tagle et al. 1988)
- Functional sequences evolve slower than
nonfunctional ones - Consider a set of orthologous sequences from
different species - Identify unusually well conserved regions
15Substring Parsimony Problem
- Given
- phylogenetic tree T,
- set of orthologous sequences at leaves of T,
- length k of motif
- threshold d
- Problem
- Find each set S of k-mers, one k-mer from each
leaf, such that the parsimony score of S in T
is at most d. - This problem is NP-hard.
16Small Example
Size of motif sought k 4
17Solution
Parsimony score 1 mutation
18CLUSTALW multiple sequence alignment (rbcS
gene) Cotton ACGGTT-TCCATTGGATGA---AATGAGATAAGAT-
--CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA------
-AGGCTTTACCATT Pea GTTTTT-TCAGTTAGCTTA---GTGGGCATC
TTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-
------AGG--TTAGCACA Tobacco TAGGAT-GAGATAAGATTA---
CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTA
AATGAAGA-------ATGGCTTAGCACC Ice-plant TCCCAT-ACAT
TGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA
--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC Turnip ATT
CAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCG
TCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC Wh
eat TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGT
CGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAG
CAAA Duckweed TCGGAT-GGGGGGGCATGAACACTTGCAATCATT--
---TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGC
CAACATTAATTAAA Larch TAACAT-ATGATATAACAC---CGGGCAC
ACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAAC
AAAA--TGAAAGTACAAGACC Cotton CAAGAAAAGTTTCCACCCTC
------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AG
GATCCAACGTCACCCTTTCTCCCA-----A Pea C---AAAACTTTTCA
ATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC-
---ACAATCCAACAA-ACTGGTTCT---------A Tobacco AAAAAT
AATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTA
TCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA Ice-p
lant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATA
AGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-AC
GATAA Turnip CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAAC
CATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATT
TCT---------A Wheat GCTAGAAAAAGGTTGTGTGGCAGCCACCTA
ATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGG
CAATGCTTCTTC-------- Duckweed ATATAATATTAGAAAAAAAT
C-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAG
ACTCCAATTTACCCAAATCACTAACCAATT Larch TTCTCGTATAAGG
CCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACA
CA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA Cotton ACC
AATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGA
CTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA Pe
a GGCAGTGGCC---AACTAC--------------------CACAATTT-
TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACAT
TA Tobacco GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-G
CGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTG
GGCA-ACGATG Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGA
ATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGG
GG----TGCTATGGA-GCAAGG Turnip CACCTTTCTTTAATCCTGTG
GCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCT
TCATACCTCT----TGCGCTTCTCACTATA Wheat CACTGATCCGGAG
AAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CAT
CTGTACCAAAGAAACGG----GGCTATATATACCGTG Duckweed TTA
GGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATA
TTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC La
rch CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATT
TCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TC
TATA Cotton T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGT
AGCAT--ATAGTAC Pea TATAAAGCAAGTTTTAGTA-CAAGCTTTGCA
ATTCAACCAC--A-AGAAC Tobacco CATAGACCATCTTGGAAGT-TT
AAAGGGAAAAAAGGAAAAG--GGAGAAA Ice-plant TCCTCATCAAA
AGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC Larch TCTT
CTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA Tur
nip TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGA
AAAG Wheat GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCC
TCCTCTCCTCC Duckweed CATGGGGCGACG---CAGTGTGTGGAGGA
GCAGGCTCAGTCTCCTTCTCG
19An Exact Algorithm(generalizing Sankoff and
Rousseau 1975)
Wu s best parsimony score for subtree rooted
at node u, if u is labeled with string s.
20Running Time
Wu s ? min ( Wv t d(s, t) )
v child t
of u
Average sequence length
O(k?42k ) time per node
Number of species
Total time O(n k (42k l ))
Motif length
21Limits of Motif Finders
0
???
gene
- Given upstream regions of coregulated genes
- Increasing length makes motif finding harder
random motifs clutter the true ones - Decreasing length makes motif finding harder
true motif missing in some sequences
22Limits of Motif Finders
- A (k,d)-motif is a k-long motif with d random
differences per copy - Motif Challenge problem
- Find a (15,4) motif in N sequences of length L
- CONSENSUS, MEME, AlignACE, most other programs
fail for N 20, L 1000
23Example Application Motifs in Yeast
- Group
- Tavazoie et al. 1999, G. Churchs lab, Harvard
- Data
- Microarrays on 6,220 mRNAs from yeast Affymetrix
chips (Cho et al.) - 15 time points across two cell cycles
24Processing of Data
- Selection of 3,000 genes
- Genes with most variable expression were selected
- Clustering according to common expression
- K-means clustering
- 30 clusters, 50-190 genes/cluster
- Clusters correlate well with known function
- AlignACE motif finding
- 600-long upstream regions
- 50 regions/trial
25Motifs in Periodic Clusters
26Motifs in Non-periodic Clusters
27Rapid Global Alignments
- How to align genomic sequences in (more or less)
linear time
28(No Transcript)
29(No Transcript)
30Motivation
- Genomic sequences are very long
- Human genome 3 x 109 long
- Mouse genome 2.7 x 109 long
- Aligning genomic regions is useful for revealing
common gene structure - Useful to compare regions gt 1,000,000-long
31Main Idea
- Genomic regions of interest contain ordered
islands of similarity, such as genes - Find local alignments
- Chain an optimal subset of them
- Refine/complete the alignment
- Systems that use this idea to various degrees
- MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA,
others
32Saving cells in DP
- Find local alignments
- Chain -O(NlogN) L.I.S.
- Restricted DP
33Methods to CHAIN Local Alignments
Sparse Dynamic Programming O(N log N)
34The Problem Find a Chain of Local Alignments
(x,y) ? (x,y) requires x lt x y lt y
Each local alignment has a weight FIND the chain
with highest total weight
35Quadratic Time Solution
- Build Directed Acyclic Graph (DAG)
- Nodes local alignments (xa,xb) ? (ya,yb)
score - Directed edges local alignments that can be
chained - edge ( (xa, xb, ya, yb) , (xc, xd, yc, yd) )
- xa lt xb lt xc lt xd
- ya lt yb lt yc lt yd
- Each local alignment
- is a node vi with
- alignment score si
36Quadratic Time Solution
- Dynamic programming
- Initialization
- Find each node va s.t. there is no edge (u,v0)
- Set score of V(a) to be sa
- Iteration
- For each vi, optimal path ending in vi has total
score - V(i) max ( weight(vj, vi) V(j) )
- Termination
- Optimal global chain
- j argmax ( V(j) ) trace chain from vj
- Worst case time quadratic
37Sparse Dynamic Programming
- Back to the LCS problem
- Given two sequences
- x x1, , xm
- y y1, , yn
- Find the longest common subsequence
- Quadratic solution with DP
- How about when hits xi yj are sparse?
38Sparse Dynamic Programming
15 3 24 16 20 4 24 3 11 18
4
20
24
3
11
15
11
4
18
20
- Imagine a situation where the number of hits is
much smaller than O(nm) maybe O(n) instead
39Sparse Dynamic Programming L.I.S.
- Longest Increasing Subsequence
- Given a sequence over an ordered alphabet
- x x1, , xm
- Find a subsequence
- s s1, , sk
- s1 lt s2 lt lt sk
40Sparse LCS expressed as LIS
x
- Create a sequence w
- Every matching point x-to-y, (i, j), is inserted
into a sequence as follows - For each position j of x, from smallest to
largest, insert in z the points (i, j), in
decreasing column i order - The 11 example points are inserted in the order
given - Any two points (ya, xa), (yb, xb) can be chained
iff - a is before b in w, and
- ya lt yb
15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10
9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
41Sparse LCS expressed as LIS
x
- Create a sequence w
- w (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7)
(4,8) (7,9) (5,9) (9,10) - Consider now ws elements as ordered
lexicographically, where -
- (ya, xa) lt (yb, xb) if ya lt yb
- Claim An increasing subsequence of w is a common
subsequence of x and y
15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10
9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
42Sparse Dynamic Programming for LIS
x
- Algorithm
- initialize empty array L
- / at each point, lj will contain the last
element of the longest j-long increasing
subsequence that ends with the smallest wi / - for i 1 to w
- binary search for wi in L, to find lj lt wi
lj1 - replace lj1 with wi
- keep a backptr lj ? wi
- Thats it!!!
15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10
9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
43Sparse Dynamic Programming for LIS
x
- Example
- w (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7)
(4,8) (7,9) (5,9) (9,10) - L
- (4,2)
- (3,3)
- (3,3) (10,5)
- (2,5) (10,5)
- (2,5) (8,6)
- (1,6) (8,6)
- (1,6) (3,7)
- (1,6) (3,7) (4,8)
- (1,6) (3,7) (4,8) (7,9)
- (1,6) (3,7) (4,8) (5,9)
- (1,6) (3,7) (4,8) (5,9) (9,10)
15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10
9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
Longest common subsequence s 4, 24, 3, 11, 18
44Sparse DP for rectangle chaining
- 1,, N rectangles
- (hj, lj) y-coordinates of rectangle j
- w(j) weight of rectangle j
- V(j) optimal score of chain ending in j
- L list of triplets (lj, V(j), j)
- L is sorted by lj
- L is implemented as a balanced binary tree
h
l
y
45Sparse DP for rectangle chaining
- Go through rectangle x-coordinates, from lowest
to highest - When on the leftmost end of i
- j rectangle in L, with largest lj lt hi
- V(i) w(i) V(j)
- When on the rightmost end of i
- j rectangle in L, with largest lj ? li
- If V(i) gt V(j)
- INSERT (li, V(i), i) in L
- REMOVE all (lk, V(k), k) with V(k) ? V(i) lk ?
li
46Example
x
2
1 5
5
6
2 6
9
10
3 3
11
12
4 4
14
15
5 2
16
y
47Time Analysis
- Sorting the x-coords takes O(N log N)
- Going through x-coords N steps
- Each of N steps requires O(log N) time
- Searching L takes log N
- Inserting to L takes log N
- All deletions are consecutive, so log N per
deletion - Each element is deleted at most once N log N for
all deletions - Recall that INSERT, DELETE, SUCCESSOR, take O(log
N) time in a balanced binary search tree