Motif Finding - PowerPoint PPT Presentation

Provided by: ryang

Transcript and Presenter's Notes

Title: Motif Finding


1
Motif Finding
2
Motif Finding
  • Can be used to identify
    • Promoters
    • Transcription factor binding sites
  • Problem: identifying a motif without any prior
    knowledge of what the motif looks like
  • If you do not know what the motif looks like or
    where it is located, you need an algorithm that,
    given a set of sequences, finds short substrings
    that occur more often than expected by chance.
  • Methods
    • Position Weight Matrices
    • Expectation Maximization
    • Gibbs Sampling
    • Phylogenetic Footprinting

3
An Example
  • Seven 32-nucleotide DNA sequences, generated randomly

CTGCGGTACCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCGTAGGCTAAGAGTT
GTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCACATT
GCCGTACCGACACACATTCTTTGGCATCCCTA
TAGGTCTCGCTCGGCTGGTCGAATGGTCCGAG
4
An Example
  • Insert a pattern, P = ATGCAACT, of length l = 8, at
    a random position in each sequence

CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT
GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT
GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA
TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG
5
Problem
  • If you don't know what the pattern is, or where
    it has been inserted, can you find the pattern?

CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT
GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT
GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA
TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG
6
Solution
  • Count the number of times each l-mer occurs in
    the sequences.
  • $(32 + 8) \times 7 = 280$ total nucleotides
  • The probability of finding any given 8-mer is less
    than $280 / 4^8 \approx 0.004$
  • After counting all 8-mers, one appears many more
    times than the rest. This overrepresented 8-mer is
    the pattern P we are trying to find. Use EMBOSS
    wordcount.
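
The counting step can be sketched in a few lines of Python. This is only an illustration of the idea, not the EMBOSS implementation; the names `SEQUENCES` and `count_kmers` are made up for this sketch, and the sequences are the seven 40-nucleotide strings shown on the previous slides.

```python
from collections import Counter

# The seven sequences from the slides, with the pattern already inserted.
SEQUENCES = [
    "CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC",
    "TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT",
    "GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA",
    "AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG",
    "TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT",
    "GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA",
    "TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG",
]


def count_kmers(sequences, k):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


counts = count_kmers(SEQUENCES, 8)
top_kmer, top_count = counts.most_common(1)[0]
print(top_kmer, top_count)  # the most frequent 8-mer and how often it occurs
```

Because the inserted copies are exact, the planted pattern is the only 8-mer that shows up in every sequence, so it tops the count.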

7
Problem 2
  • Suppose we allow for mutations at some positions
    (try using EMBOSS wordcount on these sequences)

CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG
8
Solution
  • Use a profile matrix to allow for variations in
    the pattern
  • The length (number of columns) of the matrix is
    the length of the profile, l.
  • The rows of the matrix correspond to the possible
    bases/residues (four rows, A, C, G and T, for DNA).

9
Solution
  • Given a set of t sequences, construct a 4 x l
    profile matrix for each possible alignment of the
    sequences.
  • For each possible combination of starting
    positions, one per sequence:
    • Count the number of times each nucleotide appears
      in each position (column).
    • The score of the resulting alignment is the sum,
      over all positions, of the largest nucleotide
      count in each position.
  • The alignment with the greatest score corresponds
    to the motif.
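
A minimal sketch of this exhaustive search follows. It is only practical for very small inputs; the toy sequences and the function names are hypothetical, chosen so that ACGT is the only 4-mer shared by all three sequences. Starting positions are 0-based here.

```python
from itertools import product


def alignment_score(sequences, starts, l):
    """Sum, over the l columns, of the largest nucleotide count in each column."""
    score = 0
    for j in range(l):
        column = [seq[s + j] for seq, s in zip(sequences, starts)]
        score += max(column.count(base) for base in "ACGT")
    return score


def brute_force_motif(sequences, l):
    """Try every combination of starting positions and keep the best-scoring one."""
    ranges = [range(len(seq) - l + 1) for seq in sequences]
    return max(
        (alignment_score(sequences, starts, l), starts)
        for starts in product(*ranges)
    )


# Hypothetical toy input: ACGT is planted at offsets 2, 0 and 4.
toy = ["CCACGTTT", "ACGTGGGA", "TTTGACGT"]
best_score, best_starts = brute_force_motif(toy, 4)
print(best_score, best_starts)
```

With three sequences and l = 4, a perfect alignment scores 3 x 4 = 12, and that is reached only when all three windows sit on the planted ACGT.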

10
Example
CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG
  • Construct a PWM for the alignment in bold

12
Example
  • Score the alignment by taking the maximum score
    in each column
  • Score: 25
  • Consensus sequence: TTGGTCCA

13
Example
CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG
  • Construct a PWM for the alignment in bold

15
Example
  • Score the alignment by taking the maximum score
    in each column
  • Score: 42
  • Consensus sequence: ATGCAACT
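
The score and consensus can be checked with a short sketch. It assumes the aligned windows are the places where the pattern was inserted on the earlier slide; the 0-based `STARTS` values below were read off those sequences and are not given in the slides. Lower-case letters mark the mutated positions and are upper-cased before counting.

```python
from collections import Counter

SEQUENCES = [
    "CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC",
    "TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT",
    "GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA",
    "AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG",
    "TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT",
    "GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA",
    "TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG",
]
# Assumed 0-based start of the (mutated) pattern in each sequence.
STARTS = [9, 18, 2, 6, 27, 24, 11]
L = 8


def profile(sequences, starts, l):
    """One Counter per column: how often each nucleotide occurs there."""
    return [
        Counter(seq[s + j].upper() for seq, s in zip(sequences, starts))
        for j in range(l)
    ]


def score_and_consensus(columns):
    """Score = sum of column maxima; consensus = most frequent base per column."""
    best = [col.most_common(1)[0] for col in columns]
    score = sum(count for _, count in best)
    consensus = "".join(base for base, _ in best)
    return score, consensus


columns = profile(SEQUENCES, STARTS, L)
score, consensus = score_and_consensus(columns)
print(score, consensus)  # 42 ATGCAACT
```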

16
Formally Stated
  • Given
  • Set of t DNA sequences, each with n nucleotides
  • Select one position in each of the t sequences,
    forming an array of starting positions
    $s = (s_1, s_2, \ldots, s_t)$, where $1 \le s_i \le n - l + 1$
  • How many starting positions are there in each
    sequence?
  • How many combinations of starting positions are
    there in all?

17
Scoring
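  • Let $P(s)$ be the profile matrix corresponding to
    starting positions $s$.
  • $M_{P(s)}(j)$ = largest count in column $j$ of $P(s)$.
  • $M_{P(\text{last example})}(1) = 5$
  • $M_{P(\text{last example})}(3) = 6$
  • Score: $\mathrm{Score}(s) = \sum_{j=1}^{l} M_{P(s)}(j)$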

This score is based on absolute frequencies. A
statistically based score would be
18
Representing patterns: Matrices
CTTGGTGACGTG  GTGAGTGACGTC  CGGGTTGACGCA  CCTACTTACGTA
TATGGTGACGTC  TCGGATGACGAT  TAGGATGACGTC  CCTGGTGACGCC
CGCGGTGACGTA  GCCGTTGACGCC  CGCGATGACGCA  CCTGTTGACGTG
TTGCATGACGTC  GTTGGTGACGTG  GAGGATGACGTT  GGTCGTGACGTA
Given N sequence fragments of fixed length, one
can assemble a position frequency matrix (number
of times a particular nucleotide appears at a
given position)
     1   2   3   4   5   6   7   8   9  10  11  12
A    0   3   0   2   5   0   0  16   0   0   1   5
C    7   5   3   3   1   0   0   0  16   0   5   6
G    5   4   6  11   7   0  15   0   0  16   0   3
T    4   4   7   0   3  16   1   0   0   0  10   2
Position frequency matrix (PFM) (aka raw count
matrix or conservation and substitution matrix)
19
More on scores
Using pseudocounts (N = 16)
Converting a PFM into a PWM
     1   2   3   4   5   6   7   8   9  10  11  12
A    0   3   0   2   5   0   0  16   0   0   1   5
C    7   5   3   3   1   0   0   0  16   0   5   6
G    5   4   6  11   7   0  15   0   0  16   0   3
T    4   4   7   0   3  16   1   0   0   0  10   2
For each matrix element, compute the corresponding weight:
        1       2       3       4       5       6       7       8       9      10      11      12
A      -2       0      -2  -0.415   0.585      -2      -2   2.088      -2      -2      -1   0.585
C       1   0.585       0       0      -1      -2      -2      -2   2.088      -2   0.585   0.807
G   0.585   0.322   0.807   1.585       1      -2       2      -2      -2   2.088      -2       0
T   0.322   0.322       1      -2       0   2.088      -1      -2      -2      -2   1.459  -0.415
n(b,i): raw count (PFM matrix element) of nucleotide b in column i
N: number of sequences used to create the PFM (= column sum)
pseudocounts: correction for small sample size
p(b): background frequency of nucleotide b
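
The conversion can be sketched as follows. The weights in the table are consistent with $\log_2\big((n(b,i) + 1) / (N \cdot p(b))\big)$ for a uniform background p(b) = 0.25; that form is inferred here from the numbers and is an assumption, as are the names `PFM` and `pfm_to_pwm`.

```python
import math

# Position frequency matrix from the slide (rows A, C, G, T; 12 columns).
PFM = {
    "A": [0, 3, 0, 2, 5, 0, 0, 16, 0, 0, 1, 5],
    "C": [7, 5, 3, 3, 1, 0, 0, 0, 16, 0, 5, 6],
    "G": [5, 4, 6, 11, 7, 0, 15, 0, 0, 16, 0, 3],
    "T": [4, 4, 7, 0, 3, 16, 1, 0, 0, 0, 10, 2],
}


def pfm_to_pwm(pfm, background=0.25, pseudocount=1.0):
    """Log-odds weights; the pseudocount keeps zero counts finite (assumed form)."""
    n_seqs = sum(row[0] for row in pfm.values())  # column sum, N = 16 here
    return {
        base: [math.log2((n + pseudocount) / (n_seqs * background)) for n in row]
        for base, row in pfm.items()
    }


PWM = pfm_to_pwm(PFM)
for base in "ACGT":
    print(base, " ".join(f"{w:7.3f}" for w in PWM[base]))
```

Under this assumed form a count of 0 maps to -2, a count of 3 to 0, and a count of 16 to about 2.09, which agrees with the table to within rounding.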
20
Detecting binding sites using a PWM
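
One common way to use a PWM for detection is to slide a window along a sequence and sum the weights of the observed bases. The sketch below assumes that approach, reuses the PFM from the previous slides with an assumed log-odds form (pseudocount 1, uniform background), and uses a made-up test sequence and threshold.

```python
import math

PFM = {
    "A": [0, 3, 0, 2, 5, 0, 0, 16, 0, 0, 1, 5],
    "C": [7, 5, 3, 3, 1, 0, 0, 0, 16, 0, 5, 6],
    "G": [5, 4, 6, 11, 7, 0, 15, 0, 0, 16, 0, 3],
    "T": [4, 4, 7, 0, 3, 16, 1, 0, 0, 0, 10, 2],
}


def pfm_to_pwm(pfm, background=0.25, pseudocount=1.0):
    """Assumed log-odds conversion: log2((n + 1) / (N * p(b)))."""
    n_seqs = sum(row[0] for row in pfm.values())
    return {
        base: [math.log2((n + pseudocount) / (n_seqs * background)) for n in row]
        for base, row in pfm.items()
    }


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring at least threshold."""
    width = len(next(iter(pwm.values())))
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(pwm[base][j] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


PWM = pfm_to_pwm(PFM)
# Hypothetical sequence: one of the 16 training sites flanked by As.
SEQ = "AAAACTTGGTGACGTGAAAA"
hits = scan(SEQ, PWM, threshold=10.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

The threshold trades sensitivity against false positives; 10.0 is an arbitrary value for this illustration.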
21
Problems with Matrices
  • A large search space gives a larger proportion of
    false positives
  • There are $(n - l + 1)^t$ sets of starting points for t
    sequences of length n, when looking for a motif
    of length l.
  • This grows exponentially with the number of
    sequences.
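
To get a feel for the size of the search space, here is a small calculation for the running example (seven sequences of length 40, motif length 8). The function name is just for illustration.

```python
def num_alignments(n, l, t):
    """Number of ways to pick one length-l window in each of t sequences of length n."""
    return (n - l + 1) ** t


# Running example: n = 40, l = 8, t = 7 gives 33 windows per sequence.
total = num_alignments(40, 8, 7)
print(total)  # 42618442977, roughly 4.3e10

# Each additional sequence multiplies the count by another factor of 33.
growth = [num_alignments(40, 8, t) for t in range(1, 8)]
print(growth)
```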

22
  • Phylogenetic footprinting

Using cross-species conservation of regulatory
elements to improve binding site prediction
23
Phylogenetic-Footprinting
  • A method of identifying conserved motifs in a set
    of orthologous sequences from multiple species
    using the notion that
  • Selective pressure causes functional elements to
    evolve at a slower rate than nonfunctional
    sequences
  • Identifies the best conserved motifs in those
    regions
  • Guaranteed to report all sets of motifs with the
    lowest parsimony scores.

24
Parsimony Score
  • Hamming distance (Parsimony Score)
  • Number of changes applied to a sequence to obtain
    another sequence
  • If v = ATTGTC and w = ACTCTC, then d(v,w) = 2

v  ATTGTC
    x x
w  ACTCTC
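
The distance in this example can be computed directly. This is a minimal sketch for equal-length strings; the function name `hamming` is our own.

```python
def hamming(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))


v, w = "ATTGTC", "ACTCTC"
mismatches = [i + 1 for i, (a, b) in enumerate(zip(v, w)) if a != b]
print(hamming(v, w), mismatches)  # 2 [2, 4]
```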
25
Substring Parsimony Problem
  • Given
    • n orthologous sequences $S_1, S_2, \ldots, S_n$
    • T: phylogenetic tree relating the sequences
    • k: length of motif
    • d: maximum parsimony score
  • Problem
    • Find a set of k-mers, one from each input
      sequence, with parsimony score $\le d$ with respect
      to T
  • Parameters
    • k and d can be specified by the user

26
Terminology
[Figure: phylogenetic tree with leaf nodes Chicken, Human and Mouse, and internal nodes including v]
C(v) = ?
C(v) = {Human, Mouse}
27
DP Solution
  • Proceeds from the leaves of T up to its root
  • At each node u of T, compute a table $W_u$
    containing $4^k$ entries, one for each possible
    k-mer
  • For a string s of length k,
    • $W_u[s]$ = the best parsimony score that can be
      achieved for the subtree rooted at u when u is
      labeled with s

28
DP Solution
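  • C(u): set of children of u
  • d(s,t): Hamming distance between sequences s and
    t
  • $\Sigma = \{A, C, G, T\}$
  • Tables W can now be computed
    • $W_u[s] = 0$ if u is a leaf and s is a substring of $S_u$
    • $W_u[s] = \infty$ if u is a leaf and s is not a substring of $S_u$
    • $W_u[s] = \sum_{v \in C(u)} \min_{t \in \Sigma^k} \left( W_v[t] + d(s,t) \right)$ if u is not a leaf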

29
An Exact Algorithm
$W_u[s]$ = best parsimony score for the subtree rooted
at node u, if u is labeled with string s.
30
Small Example
ATGC... (Chicken) ACGT... (Human) ACGG... (Mouse)
Size of motif sought: k = 2
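
The dynamic program can be sketched on this small example. Two assumptions are made: only the four nucleotides shown for each species are used, and the tree is taken to be (Chicken, (Human, Mouse)), with Human and Mouse as the children of one internal node. Leaves are given as species names and internal nodes as tuples of children; all names are illustrative.

```python
import math
from itertools import product


def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))


def substrings(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parsimony_table(node, seqs, k):
    """W[s] for every k-mer s: best parsimony score of the subtree at node."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if isinstance(node, str):  # leaf: 0 if s occurs in the sequence, else infinity
        present = substrings(seqs[node], k)
        return {s: (0 if s in present else math.inf) for s in all_kmers}
    child_tables = [parsimony_table(child, seqs, k) for child in node]
    # Internal node: each child independently picks its cheapest label t.
    return {
        s: sum(min(W[t] + hamming(s, t) for t in all_kmers) for W in child_tables)
        for s in all_kmers
    }


SEQS = {"Chicken": "ATGC", "Human": "ACGT", "Mouse": "ACGG"}
TREE = ("Chicken", ("Human", "Mouse"))  # assumed topology

W_root = parsimony_table(TREE, SEQS, 2)
best = min(W_root.values())
best_labels = sorted(s for s, score in W_root.items() if score == best)
print(best, best_labels)
```

Human and Mouse share the 2-mers AC and CG, but Chicken contains neither, so at least one substitution is needed somewhere in the tree and the best score is 1.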
31
Problems with Phylogenetic-Footprinting
  • Divergence of organisms being compared needs to
    be sufficient to allow for sequence divergence
  • Comparing non-mammalian organisms to mammalian
    organisms can limit types of functional elements
    being found (primate specific, mammalian
    specific, etc)
  • Requires several different organisms