(Regulatory-) Motif Finding - PowerPoint PPT Presentation

1 / 47

About This Presentation

Title:

(Regulatory-) Motif Finding

Description:

(Regulatory-) Motif Finding. Finding Regulatory Motifs. Given ... Initialize parameters = (M, B), : Try different values of from ... and Rousseau 1975) ... – PowerPoint PPT presentation

Number of Views:62

Avg rating:3.0/5.0

Slides: 48

Provided by: root

Learn more at: http://ai.stanford.edu

Category:

more less

Transcript and Presenter's Notes

Title: (Regulatory-) Motif Finding

1
(Regulatory-) Motif Finding
2
Finding Regulatory Motifs
. . .

Given a collection of genes with common
expression,
Find the TF-binding motif in common

3
Expectation Maximization

Initialize parameters ? (M, B), ?
Try different values of ? from N-1/2 up to 1/(2K)
Repeat
Expectation
Maximization
Until change in ? (M, B), ? falls below ?
Report results for several good ?

?
1 ?
M1
M1
MK
B
A C G T

4
Gibbs Sampling in Motif Finding
5
Gibbs Sampling

Given
x1, , xN,
motif length K,
background B,
Find
Model M
Locations a1,, aN in x1, , xN
Maximizing log-odds likelihood ratio

6
Gibbs Sampling

AlignACE first statistical motif finder
BioProspector improved version of AlignACE
Algorithm (sketch)
Initialization
Select random locations in sequences x1, , xN
Compute an initial model M from these locations
Sampling Iterations
Remove one sequence xi
Recalculate model
Pick a new location of motif in xi according to
probability the location is a motif occurrence

7
Gibbs Sampling

Initialization
Select random locations a1,, aN in x1, , xN
For these locations, compute M

That is, Mkj is the number of occurrences of
letter j in motif position k, over the total

8
Gibbs Sampling

Predictive Update
Select a sequence x xi
Remove xi, recompute model

where ?j are pseudocounts to avoid 0s,
and B ?j ?j

9
Gibbs Sampling

Sampling
For every K-long word xj,,xjk-1 in x
Qj Prob word motif M(1,xj)??M(k,xjk-1)
Pi Prob word background B(xj)??B(xjk-1)
Let
Sample a random new position ai according to the
probabilities A1,, Ax-k1.

Prob
0
x
10
Gibbs Sampling

Running Gibbs Sampling
Initialize
Run until convergence
Repeat 1,2 several times, report common motifs

11
Advantages / Disadvantages

Very similar to EM
Advantages
Easier to implement
Less dependent on initial parameters
More versatile, easier to enhance with heuristics
Disadvantages
More dependent on all sequences to exhibit the
motif
Less systematic search of initial parameter space

12
Repeats, and a Better Background Model

Repeat DNA can be confused as motif
Especially low-complexity CACACA AAAAA, etc.
Solution
more elaborate background model
0th order B pA, pC, pG, pT
1st order B P(AA), P(AC), , P(TT)
Kth order B P(X b1bK) X, bi?A,C,G,T
Has been applied to EM and Gibbs (up to 3rd
order)

13
Phylogenetic Footprinting(Slides by Martin Tompa)
14
Phylogenetic Footprinting(Tagle et al. 1988)

Functional sequences evolve slower than
nonfunctional ones
Consider a set of orthologous sequences from
different species
Identify unusually well conserved regions

15
Substring Parsimony Problem

Given
phylogenetic tree T,
set of orthologous sequences at leaves of T,
length k of motif
threshold d
Problem
Find each set S of k-mers, one k-mer from each
leaf, such that the parsimony score of S in T
is at most d.
This problem is NP-hard.

16
Small Example
Size of motif sought k 4
17
Solution
Parsimony score 1 mutation
18
CLUSTALW multiple sequence alignment (rbcS
gene) Cotton ACGGTT-TCCATTGGATGA---AATGAGATAAGAT-
--CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA------
-AGGCTTTACCATT Pea GTTTTT-TCAGTTAGCTTA---GTGGGCATC
TTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-
------AGG--TTAGCACA Tobacco TAGGAT-GAGATAAGATTA---
CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTA
AATGAAGA-------ATGGCTTAGCACC Ice-plant TCCCAT-ACAT
TGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA
--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC Turnip ATT
CAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCG
TCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC Wh
eat TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGT
CGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAG
CAAA Duckweed TCGGAT-GGGGGGGCATGAACACTTGCAATCATT--
---TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGC
CAACATTAATTAAA Larch TAACAT-ATGATATAACAC---CGGGCAC
ACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAAC
AAAA--TGAAAGTACAAGACC Cotton CAAGAAAAGTTTCCACCCTC
------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AG
GATCCAACGTCACCCTTTCTCCCA-----A Pea C---AAAACTTTTCA
ATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC-
---ACAATCCAACAA-ACTGGTTCT---------A Tobacco AAAAAT
AATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTA
TCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA Ice-p
lant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATA
AGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-AC
GATAA Turnip CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAAC
CATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATT
TCT---------A Wheat GCTAGAAAAAGGTTGTGTGGCAGCCACCTA
ATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGG
CAATGCTTCTTC-------- Duckweed ATATAATATTAGAAAAAAAT
C-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAG
ACTCCAATTTACCCAAATCACTAACCAATT Larch TTCTCGTATAAGG
CCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACA
CA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA Cotton ACC
AATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGA
CTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA Pe
a GGCAGTGGCC---AACTAC--------------------CACAATTT-
TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACAT
TA Tobacco GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-G
CGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTG
GGCA-ACGATG Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGA
ATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGG
GG----TGCTATGGA-GCAAGG Turnip CACCTTTCTTTAATCCTGTG
GCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCT
TCATACCTCT----TGCGCTTCTCACTATA Wheat CACTGATCCGGAG
AAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CAT
CTGTACCAAAGAAACGG----GGCTATATATACCGTG Duckweed TTA
GGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATA
TTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC La
rch CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATT
TCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TC
TATA Cotton T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGT
AGCAT--ATAGTAC Pea TATAAAGCAAGTTTTAGTA-CAAGCTTTGCA
ATTCAACCAC--A-AGAAC Tobacco CATAGACCATCTTGGAAGT-TT
AAAGGGAAAAAAGGAAAAG--GGAGAAA Ice-plant TCCTCATCAAA
AGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC Larch TCTT
CTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA Tur
nip TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGA
AAAG Wheat GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCC
TCCTCTCCTCC Duckweed CATGGGGCGACG---CAGTGTGTGGAGGA
GCAGGCTCAGTCTCCTTCTCG
19
An Exact Algorithm(generalizing Sankoff and
Rousseau 1975)
Wu s best parsimony score for subtree rooted
at node u, if u is labeled with string s.
20
Running Time
Wu s ? min ( Wv t d(s, t) )
v child t

of u
Average sequence length
O(k?42k ) time per node
Number of species
Total time O(n k (42k l ))
Motif length
21
Limits of Motif Finders
0
???
gene

Given upstream regions of coregulated genes
Increasing length makes motif finding harder
random motifs clutter the true ones
Decreasing length makes motif finding harder
true motif missing in some sequences

22
Limits of Motif Finders

A (k,d)-motif is a k-long motif with d random
differences per copy
Motif Challenge problem
Find a (15,4) motif in N sequences of length L
CONSENSUS, MEME, AlignACE, most other programs
fail for N 20, L 1000

23
Example Application Motifs in Yeast

Group
Tavazoie et al. 1999, G. Churchs lab, Harvard
Data
Microarrays on 6,220 mRNAs from yeast Affymetrix
chips (Cho et al.)
15 time points across two cell cycles

24
Processing of Data

Selection of 3,000 genes
Genes with most variable expression were selected
Clustering according to common expression
K-means clustering
30 clusters, 50-190 genes/cluster
Clusters correlate well with known function
AlignACE motif finding
600-long upstream regions
50 regions/trial

25
Motifs in Periodic Clusters
26
Motifs in Non-periodic Clusters
27
Rapid Global Alignments

How to align genomic sequences in (more or less)
linear time

28
(No Transcript)
29
(No Transcript)
30
Motivation

Genomic sequences are very long
Human genome 3 x 109 long
Mouse genome 2.7 x 109 long
Aligning genomic regions is useful for revealing
common gene structure
Useful to compare regions gt 1,000,000-long

31
Main Idea

Genomic regions of interest contain ordered
islands of similarity, such as genes
Find local alignments
Chain an optimal subset of them
Refine/complete the alignment
Systems that use this idea to various degrees
MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA,
others

32
Saving cells in DP

Find local alignments
Chain -O(NlogN) L.I.S.
Restricted DP

33
Methods to CHAIN Local Alignments
Sparse Dynamic Programming O(N log N)
34
The Problem Find a Chain of Local Alignments
(x,y) ? (x,y) requires x lt x y lt y
Each local alignment has a weight FIND the chain
with highest total weight
35
Quadratic Time Solution

Build Directed Acyclic Graph (DAG)
Nodes local alignments (xa,xb) ? (ya,yb)
score
Directed edges local alignments that can be
chained
edge ( (xa, xb, ya, yb) , (xc, xd, yc, yd) )
xa lt xb lt xc lt xd
ya lt yb lt yc lt yd
Each local alignment
is a node vi with
alignment score si

36
Quadratic Time Solution

Dynamic programming
Initialization
Find each node va s.t. there is no edge (u,v0)
Set score of V(a) to be sa
Iteration
For each vi, optimal path ending in vi has total
score
V(i) max ( weight(vj, vi) V(j) )
Termination
Optimal global chain
j argmax ( V(j) ) trace chain from vj
Worst case time quadratic

37
Sparse Dynamic Programming

Back to the LCS problem
Given two sequences
x x1, , xm
y y1, , yn
Find the longest common subsequence
Quadratic solution with DP
How about when hits xi yj are sparse?

38
Sparse Dynamic Programming
15 3 24 16 20 4 24 3 11 18

4
20
24
3
11
15
11
4
18
20

Imagine a situation where the number of hits is
much smaller than O(nm) maybe O(n) instead

39
Sparse Dynamic Programming L.I.S.

Longest Increasing Subsequence
Given a sequence over an ordered alphabet
x x1, , xm
Find a subsequence
s s1, , sk
s1 lt s2 lt lt sk

40
Sparse LCS expressed as LIS
x

Create a sequence w
Every matching point x-to-y, (i, j), is inserted
into a sequence as follows
For each position j of x, from smallest to
largest, insert in z the points (i, j), in
decreasing column i order
The 11 example points are inserted in the order
given
Any two points (ya, xa), (yb, xb) can be chained
iff
a is before b in w, and
ya lt yb

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
41
Sparse LCS expressed as LIS
x

Create a sequence w
w (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7)
(4,8) (7,9) (5,9) (9,10)
Consider now ws elements as ordered
lexicographically, where
(ya, xa) lt (yb, xb) if ya lt yb
Claim An increasing subsequence of w is a common
subsequence of x and y

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
42
Sparse Dynamic Programming for LIS
x

Algorithm
initialize empty array L
/ at each point, lj will contain the last
element of the longest j-long increasing
subsequence that ends with the smallest wi /
for i 1 to w
binary search for wi in L, to find lj lt wi
lj1
replace lj1 with wi
keep a backptr lj ? wi
Thats it!!!

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
43
Sparse Dynamic Programming for LIS
x

Example
w (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7)
(4,8) (7,9) (5,9) (9,10)
L
(4,2)
(3,3)
(3,3) (10,5)
(2,5) (10,5)
(2,5) (8,6)
(1,6) (8,6)
(1,6) (3,7)
(1,6) (3,7) (4,8)
(1,6) (3,7) (4,8) (7,9)
(1,6) (3,7) (4,8) (5,9)
(1,6) (3,7) (4,8) (5,9) (9,10)

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
Longest common subsequence s 4, 24, 3, 11, 18
44
Sparse DP for rectangle chaining

1,, N rectangles
(hj, lj) y-coordinates of rectangle j
w(j) weight of rectangle j
V(j) optimal score of chain ending in j
L list of triplets (lj, V(j), j)
L is sorted by lj
L is implemented as a balanced binary tree

h
l
y
45
Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest
to highest
When on the leftmost end of i
j rectangle in L, with largest lj lt hi
V(i) w(i) V(j)
When on the rightmost end of i
j rectangle in L, with largest lj ? li
If V(i) gt V(j)
INSERT (li, V(i), i) in L
REMOVE all (lk, V(k), k) with V(k) ? V(i) lk ?
li

46
Example
x
2
1 5
5
6
2 6
9
10
3 3
11
12
4 4
14
15
5 2
16
y
47
Time Analysis

Sorting the x-coords takes O(N log N)
Going through x-coords N steps
Each of N steps requires O(log N) time
Searching L takes log N
Inserting to L takes log N
All deletions are consecutive, so log N per
deletion
Each element is deleted at most once N log N for
all deletions
Recall that INSERT, DELETE, SUCCESSOR, take O(log
N) time in a balanced binary search tree