(Regulatory-) Motif Finding - PowerPoint PPT Presentation

1 / 47
About This Presentation
Title:

(Regulatory-) Motif Finding

Description:

(Regulatory-) Motif Finding. Finding Regulatory Motifs. Given ... Initialize parameters = (M, B), : Try different values of from ... and Rousseau 1975) ... – PowerPoint PPT presentation

Number of Views:62
Avg rating:3.0/5.0
Slides: 48
Provided by: root
Learn more at: http://ai.stanford.edu
Category:

less

Transcript and Presenter's Notes

Title: (Regulatory-) Motif Finding


1
(Regulatory-) Motif Finding
2
Finding Regulatory Motifs
. . .
  • Given a collection of genes with common
    expression,
  • Find the TF-binding motif in common

3
Expectation Maximization
  • Initialize parameters ? (M, B), ?
  • Try different values of ? from N-1/2 up to 1/(2K)
  • Repeat
  • Expectation
  • Maximization
  • Until change in ? (M, B), ? falls below ?
  • Report results for several good ?

?
1 ?
M1
M1
MK
B
A C G T








4
Gibbs Sampling in Motif Finding
5
Gibbs Sampling
  • Given
  • x1, , xN,
  • motif length K,
  • background B,
  • Find
  • Model M
  • Locations a1,, aN in x1, , xN
  • Maximizing log-odds likelihood ratio

6
Gibbs Sampling
  • AlignACE first statistical motif finder
  • BioProspector improved version of AlignACE
  • Algorithm (sketch)
  • Initialization
  • Select random locations in sequences x1, , xN
  • Compute an initial model M from these locations
  • Sampling Iterations
  • Remove one sequence xi
  • Recalculate model
  • Pick a new location of motif in xi according to
    probability the location is a motif occurrence

7
Gibbs Sampling
  • Initialization
  • Select random locations a1,, aN in x1, , xN
  • For these locations, compute M
  • That is, Mkj is the number of occurrences of
    letter j in motif position k, over the total

8
Gibbs Sampling
  • Predictive Update
  • Select a sequence x xi
  • Remove xi, recompute model

M
  • where ?j are pseudocounts to avoid 0s,
  • and B ?j ?j

9
Gibbs Sampling
  • Sampling
  • For every K-long word xj,,xjk-1 in x
  • Qj Prob word motif M(1,xj)??M(k,xjk-1)
  • Pi Prob word background B(xj)??B(xjk-1)
  • Let
  • Sample a random new position ai according to the
    probabilities A1,, Ax-k1.

Prob
0
x
10
Gibbs Sampling
  • Running Gibbs Sampling
  • Initialize
  • Run until convergence
  • Repeat 1,2 several times, report common motifs

11
Advantages / Disadvantages
  • Very similar to EM
  • Advantages
  • Easier to implement
  • Less dependent on initial parameters
  • More versatile, easier to enhance with heuristics
  • Disadvantages
  • More dependent on all sequences to exhibit the
    motif
  • Less systematic search of initial parameter space

12
Repeats, and a Better Background Model
  • Repeat DNA can be confused as motif
  • Especially low-complexity CACACA AAAAA, etc.
  • Solution
  • more elaborate background model
  • 0th order B pA, pC, pG, pT
  • 1st order B P(AA), P(AC), , P(TT)
  • Kth order B P(X b1bK) X, bi?A,C,G,T
  • Has been applied to EM and Gibbs (up to 3rd
    order)

13
Phylogenetic Footprinting(Slides by Martin Tompa)
14
Phylogenetic Footprinting(Tagle et al. 1988)
  • Functional sequences evolve slower than
    nonfunctional ones
  • Consider a set of orthologous sequences from
    different species
  • Identify unusually well conserved regions

15
Substring Parsimony Problem
  • Given
  • phylogenetic tree T,
  • set of orthologous sequences at leaves of T,
  • length k of motif
  • threshold d
  • Problem
  • Find each set S of k-mers, one k-mer from each
    leaf, such that the parsimony score of S in T
    is at most d.
  • This problem is NP-hard.

16
Small Example
Size of motif sought k 4
17
Solution
Parsimony score 1 mutation
18
CLUSTALW multiple sequence alignment (rbcS
gene) Cotton ACGGTT-TCCATTGGATGA---AATGAGATAAGAT-
--CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA------
-AGGCTTTACCATT Pea GTTTTT-TCAGTTAGCTTA---GTGGGCATC
TTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-
------AGG--TTAGCACA Tobacco TAGGAT-GAGATAAGATTA---
CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTA
AATGAAGA-------ATGGCTTAGCACC Ice-plant TCCCAT-ACAT
TGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA
--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC Turnip ATT
CAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCG
TCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC Wh
eat TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGT
CGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAG
CAAA Duckweed TCGGAT-GGGGGGGCATGAACACTTGCAATCATT--
---TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGC
CAACATTAATTAAA Larch TAACAT-ATGATATAACAC---CGGGCAC
ACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAAC
AAAA--TGAAAGTACAAGACC Cotton CAAGAAAAGTTTCCACCCTC
------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AG
GATCCAACGTCACCCTTTCTCCCA-----A Pea C---AAAACTTTTCA
ATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC-
---ACAATCCAACAA-ACTGGTTCT---------A Tobacco AAAAAT
AATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTA
TCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA Ice-p
lant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATA
AGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-AC
GATAA Turnip CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAAC
CATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATT
TCT---------A Wheat GCTAGAAAAAGGTTGTGTGGCAGCCACCTA
ATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGG
CAATGCTTCTTC-------- Duckweed ATATAATATTAGAAAAAAAT
C-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAG
ACTCCAATTTACCCAAATCACTAACCAATT Larch TTCTCGTATAAGG
CCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACA
CA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA Cotton ACC
AATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGA
CTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA Pe
a GGCAGTGGCC---AACTAC--------------------CACAATTT-
TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACAT
TA Tobacco GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-G
CGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTG
GGCA-ACGATG Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGA
ATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGG
GG----TGCTATGGA-GCAAGG Turnip CACCTTTCTTTAATCCTGTG
GCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCT
TCATACCTCT----TGCGCTTCTCACTATA Wheat CACTGATCCGGAG
AAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CAT
CTGTACCAAAGAAACGG----GGCTATATATACCGTG Duckweed TTA
GGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATA
TTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC La
rch CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATT
TCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TC
TATA Cotton T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGT
AGCAT--ATAGTAC Pea TATAAAGCAAGTTTTAGTA-CAAGCTTTGCA
ATTCAACCAC--A-AGAAC Tobacco CATAGACCATCTTGGAAGT-TT
AAAGGGAAAAAAGGAAAAG--GGAGAAA Ice-plant TCCTCATCAAA
AGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC Larch TCTT
CTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA Tur
nip TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGA
AAAG Wheat GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCC
TCCTCTCCTCC Duckweed CATGGGGCGACG---CAGTGTGTGGAGGA
GCAGGCTCAGTCTCCTTCTCG
19
An Exact Algorithm(generalizing Sankoff and
Rousseau 1975)
Wu s best parsimony score for subtree rooted
at node u, if u is labeled with string s.
20
Running Time
Wu s ? min ( Wv t d(s, t) )
v child t

of u
Average sequence length
O(k?42k ) time per node
Number of species
Total time O(n k (42k l ))
Motif length
21
Limits of Motif Finders
0
???
gene
  • Given upstream regions of coregulated genes
  • Increasing length makes motif finding harder
    random motifs clutter the true ones
  • Decreasing length makes motif finding harder
    true motif missing in some sequences

22
Limits of Motif Finders
  • A (k,d)-motif is a k-long motif with d random
    differences per copy
  • Motif Challenge problem
  • Find a (15,4) motif in N sequences of length L
  • CONSENSUS, MEME, AlignACE, most other programs
    fail for N 20, L 1000

23
Example Application Motifs in Yeast
  • Group
  • Tavazoie et al. 1999, G. Churchs lab, Harvard
  • Data
  • Microarrays on 6,220 mRNAs from yeast Affymetrix
    chips (Cho et al.)
  • 15 time points across two cell cycles

24
Processing of Data
  • Selection of 3,000 genes
  • Genes with most variable expression were selected
  • Clustering according to common expression
  • K-means clustering
  • 30 clusters, 50-190 genes/cluster
  • Clusters correlate well with known function
  • AlignACE motif finding
  • 600-long upstream regions
  • 50 regions/trial

25
Motifs in Periodic Clusters
26
Motifs in Non-periodic Clusters
27
Rapid Global Alignments
  • How to align genomic sequences in (more or less)
    linear time

28
(No Transcript)
29
(No Transcript)
30
Motivation
  • Genomic sequences are very long
  • Human genome 3 x 109 long
  • Mouse genome 2.7 x 109 long
  • Aligning genomic regions is useful for revealing
    common gene structure
  • Useful to compare regions gt 1,000,000-long

31
Main Idea
  • Genomic regions of interest contain ordered
    islands of similarity, such as genes
  • Find local alignments
  • Chain an optimal subset of them
  • Refine/complete the alignment
  • Systems that use this idea to various degrees
  • MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA,
    others

32
Saving cells in DP
  1. Find local alignments
  2. Chain -O(NlogN) L.I.S.
  3. Restricted DP

33
Methods to CHAIN Local Alignments
Sparse Dynamic Programming O(N log N)
34
The Problem Find a Chain of Local Alignments
(x,y) ? (x,y) requires x lt x y lt y
Each local alignment has a weight FIND the chain
with highest total weight
35
Quadratic Time Solution
  • Build Directed Acyclic Graph (DAG)
  • Nodes local alignments (xa,xb) ? (ya,yb)
    score
  • Directed edges local alignments that can be
    chained
  • edge ( (xa, xb, ya, yb) , (xc, xd, yc, yd) )
  • xa lt xb lt xc lt xd
  • ya lt yb lt yc lt yd
  • Each local alignment
  • is a node vi with
  • alignment score si

36
Quadratic Time Solution
  • Dynamic programming
  • Initialization
  • Find each node va s.t. there is no edge (u,v0)
  • Set score of V(a) to be sa
  • Iteration
  • For each vi, optimal path ending in vi has total
    score
  • V(i) max ( weight(vj, vi) V(j) )
  • Termination
  • Optimal global chain
  • j argmax ( V(j) ) trace chain from vj
  • Worst case time quadratic

37
Sparse Dynamic Programming
  • Back to the LCS problem
  • Given two sequences
  • x x1, , xm
  • y y1, , yn
  • Find the longest common subsequence
  • Quadratic solution with DP
  • How about when hits xi yj are sparse?

38
Sparse Dynamic Programming
15 3 24 16 20 4 24 3 11 18










4
20
24
3
11
15
11
4
18
20
  • Imagine a situation where the number of hits is
    much smaller than O(nm) maybe O(n) instead

39
Sparse Dynamic Programming L.I.S.
  • Longest Increasing Subsequence
  • Given a sequence over an ordered alphabet
  • x x1, , xm
  • Find a subsequence
  • s s1, , sk
  • s1 lt s2 lt lt sk

40
Sparse LCS expressed as LIS
x
  • Create a sequence w
  • Every matching point x-to-y, (i, j), is inserted
    into a sequence as follows
  • For each position j of x, from smallest to
    largest, insert in z the points (i, j), in
    decreasing column i order
  • The 11 example points are inserted in the order
    given
  • Any two points (ya, xa), (yb, xb) can be chained
    iff
  • a is before b in w, and
  • ya lt yb

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
41
Sparse LCS expressed as LIS
x
  • Create a sequence w
  • w (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7)
    (4,8) (7,9) (5,9) (9,10)
  • Consider now ws elements as ordered
    lexicographically, where
  • (ya, xa) lt (yb, xb) if ya lt yb
  • Claim An increasing subsequence of w is a common
    subsequence of x and y

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
42
Sparse Dynamic Programming for LIS
x
  • Algorithm
  • initialize empty array L
  • / at each point, lj will contain the last
    element of the longest j-long increasing
    subsequence that ends with the smallest wi /
  • for i 1 to w
  • binary search for wi in L, to find lj lt wi
    lj1
  • replace lj1 with wi
  • keep a backptr lj ? wi
  • Thats it!!!

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
43
Sparse Dynamic Programming for LIS
x
  • Example
  • w (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7)
    (4,8) (7,9) (5,9) (9,10)
  • L
  • (4,2)
  • (3,3)
  • (3,3) (10,5)
  • (2,5) (10,5)
  • (2,5) (8,6)
  • (1,6) (8,6)
  • (1,6) (3,7)
  • (1,6) (3,7) (4,8)
  • (1,6) (3,7) (4,8) (7,9)
  • (1,6) (3,7) (4,8) (5,9)
  • (1,6) (3,7) (4,8) (5,9) (9,10)

15 3 24 16 20 4 24 3 11 18
6
4
2 7
1 8
10

9
5
11
3
4
20
24
3
11
15
11
4
18
20
y
Longest common subsequence s 4, 24, 3, 11, 18
44
Sparse DP for rectangle chaining
  • 1,, N rectangles
  • (hj, lj) y-coordinates of rectangle j
  • w(j) weight of rectangle j
  • V(j) optimal score of chain ending in j
  • L list of triplets (lj, V(j), j)
  • L is sorted by lj
  • L is implemented as a balanced binary tree

h
l
y
45
Sparse DP for rectangle chaining
  • Go through rectangle x-coordinates, from lowest
    to highest
  • When on the leftmost end of i
  • j rectangle in L, with largest lj lt hi
  • V(i) w(i) V(j)
  • When on the rightmost end of i
  • j rectangle in L, with largest lj ? li
  • If V(i) gt V(j)
  • INSERT (li, V(i), i) in L
  • REMOVE all (lk, V(k), k) with V(k) ? V(i) lk ?
    li

46
Example
x
2
1 5
5
6
2 6
9
10
3 3
11
12
4 4
14
15
5 2
16
y
47
Time Analysis
  • Sorting the x-coords takes O(N log N)
  • Going through x-coords N steps
  • Each of N steps requires O(log N) time
  • Searching L takes log N
  • Inserting to L takes log N
  • All deletions are consecutive, so log N per
    deletion
  • Each element is deleted at most once N log N for
    all deletions
  • Recall that INSERT, DELETE, SUCCESSOR, take O(log
    N) time in a balanced binary search tree
Write a Comment
User Comments (0)
About PowerShow.com