Finding Motifs in Promoter Regions

Slides: 42
Provided by: tticUchi2
Learn more at: https://home.ttic.edu

1
Finding Motifs in Promoter Regions
  • Libi Hertzberg, Or Zuk

2
Overview
  • Introduction and Definitions
  • P-value Algorithm
  • Experimental Results
  • Generalizations and future work

3
Transcriptional Regulation in the Cell
4
Motif Representation
  • We need to represent the motif, i.e. the TF binding
    site.
  • There are three known representations:
  • Consensus: the most frequent letter in every
    position
  • IUPAC code: allows all letters with frequency
    above a threshold in every position
  • Position Specific Weight Matrix: counts the number
    of occurrences of every letter in every position
    (more informative!)

Known binding sites
5
An Example
  • An alignment of 5 known binding sites of a TF
  • Position Specific Weight Matrix F (rows: letters; columns: positions 1 to 6)

A 0 1 0 0 0 3
C 1 0 0 3 4 1
T 1 4 5 2 0 1
G 3 0 0 0 1 0
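
As a minimal sketch of how the representations relate (the dictionary layout and the function name `consensus` are illustrative assumptions, not part of the slides), the count matrix F can be stored per letter and the consensus read off as the most frequent letter in each column:

```python
# Count matrix F from the slide: F[letter][i] = occurrences of letter at position i.
F = {
    "A": [0, 1, 0, 0, 0, 3],
    "C": [1, 0, 0, 3, 4, 1],
    "T": [1, 4, 5, 2, 0, 1],
    "G": [3, 0, 0, 0, 1, 0],
}

def consensus(counts):
    """Return the most frequent letter at every position."""
    length = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda letter: counts[letter][i]) for i in range(length)
    )

print(consensus(F))  # GTTCCA
```

Ties, if any, would be broken by dictionary order here; this matrix has none.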
6
Giving a Score to a Potential Binding Site
  • We are given a site R = (r1, ..., rL). We want to
    know how likely it is to be bound by the TF, so we
    compute how well it fits the weight matrix of
    the TF.
  • We do this by calculating the likelihood function
    of the site, namely the probability that it
    would have been generated given that it is indeed
    a binding site of this TF.

7
F =
A 0 1 0 0 0 3
C 1 0 0 3 4 1
T 1 4 5 2 0 1
G 3 0 0 0 1 0
  • Likelihood(GATTCC) = (3/5)(1/5)(5/5)(2/5)(4/5)(1/5)
    = 0.00768

8
The Score
  • By taking the log of the likelihood of R we get the
    score of R, which is the log-likelihood of R.
  • Likelihood(GATTCC)
    = (3/5)(1/5)(5/5)(2/5)(4/5)(1/5) = 0.00768
  • score(GATTCC) = Loglikelihood(GATTCC)
    = log(3/5) + log(1/5) + log(5/5) + log(2/5) + log(4/5) + log(1/5)
    = -4.869 (natural log)
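
The arithmetic above can be checked with a short sketch. The function names `likelihood` and `score` are illustrative assumptions; the natural log is used because it reproduces the value -4.869 on the slide.

```python
import math

# Count matrix F from the slides (5 aligned binding sites, 6 positions).
F = {
    "A": [0, 1, 0, 0, 0, 3],
    "C": [1, 0, 0, 3, 4, 1],
    "T": [1, 4, 5, 2, 0, 1],
    "G": [3, 0, 0, 0, 1, 0],
}

def likelihood(site, counts):
    """Product over positions of (count of the observed letter) / (number of sites)."""
    n = sum(row[0] for row in counts.values())  # column sum = number of aligned sites
    p = 1.0
    for i, letter in enumerate(site):
        p *= counts[letter][i] / n
    return p

def score(site, counts):
    """Log-likelihood of the site; -inf if some letter was never observed."""
    p = likelihood(site, counts)
    return math.log(p) if p > 0 else float("-inf")

print(round(likelihood("GATTCC", F), 5))  # 0.00768
print(round(score("GATTCC", F), 3))       # -4.869
```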

9
The PSSM
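  • From the weight matrix F, we compute a Position
    Specific Score Matrix (PSSM) M by taking the log of
    each count divided by the number of aligned sites:
    M(a,i) = log(F(a,i)/n), here with n = 5.

F =
A 0 1 0 0 0 3
C 1 0 0 3 4 1
T 1 4 5 2 0 1
G 3 0 0 0 1 0
  • For example, M(G,1) = log(3/5)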
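
A minimal sketch of this conversion, assuming each entry is the natural log of the count divided by the number of aligned sites (which matches the example entry for G at position 1). Mapping zero counts to minus infinity is an assumption of this sketch; the slides do not say whether pseudocounts were used.

```python
import math

F = {
    "A": [0, 1, 0, 0, 0, 3],
    "C": [1, 0, 0, 3, 4, 1],
    "T": [1, 4, 5, 2, 0, 1],
    "G": [3, 0, 0, 0, 1, 0],
}

def pssm(counts):
    """M[a][i] = log(F[a][i] / n); a zero count becomes -inf (no pseudocounts)."""
    n = sum(row[0] for row in counts.values())  # number of aligned sites
    return {
        a: [math.log(c / n) if c > 0 else float("-inf") for c in row]
        for a, row in counts.items()
    }

M = pssm(F)
print(M["G"][0])  # log(3/5), about -0.5108
```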

10
Finding The Motif
  • We are given a TF and a gene. We want to know if
    this gene is regulated by the TF.
  • Our input:
  • The sequence of the promoter region of the gene
  • The PSSM of the TF
  • A simple algorithm: scan the promoter region,
    and at each position calculate the score
    according to the PSSM. Take the best position
    (i.e. the one with the highest score) to be the
    suspected binding site.
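
The scan can be sketched as follows. The function name `scan` and the short test promoter are hypothetical; the PSSM is built from the slides' count matrix F without pseudocounts, so windows containing an unobserved letter score minus infinity.

```python
import math

F = {
    "A": [0, 1, 0, 0, 0, 3],
    "C": [1, 0, 0, 3, 4, 1],
    "T": [1, 4, 5, 2, 0, 1],
    "G": [3, 0, 0, 0, 1, 0],
}
N_SITES = 5
M = {a: [math.log(c / N_SITES) if c else float("-inf") for c in row]
     for a, row in F.items()}

def scan(promoter, M):
    """Return (best_start, best_score) over all windows of the motif length."""
    L = len(next(iter(M.values())))
    best_start, best_score = None, float("-inf")
    for start in range(len(promoter) - L + 1):
        s = sum(M[promoter[start + i]][i] for i in range(L))
        if s > best_score:
            best_start, best_score = start, s
    return best_start, best_score

pos, s = scan("AAGTTCCAGG", M)  # hypothetical promoter fragment
print(pos, round(s, 3))
```

Positions are 0-based start indices; if every window scores minus infinity the function returns `None` as the position.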

[Figure: scores -2.3, -4.5, -1.2, -5.2, -0.5 at positions along the promoter AAGTTGCCGAGATCGTAGCTATCGATCGATCGACAGCTAAC, with the maximal score marked.]
11
The Problem
Problem: in any (e.g. random) sequence we will
find some best position (and best score). How do
we assign statistical significance (a p-value) to
the position and score we have found?
[Figure: the same scan as on the previous slide. The goal of our work: statistical evaluation of the max score, i.e. a p-value.]
12
Overview
  • Introduction and Definitions
  • P-value Algorithm
  • Experimental Results
  • Generalizations and future work

13
What Do We Want To Calculate?
  • Let N be the promoter length, and L the length of
    the TF binding site. Suppose we scanned the
    promoter and have found that the maximal score
    had the value t.
  • The p-value is the probability that the maximal
    score in a random sequence of length N is
    above the threshold t.

14
The Algorithm has two Steps
  • 1. Find the set of all the sequences of
    length L with a score above the threshold t.
  • 2. Calculate the probability of finding at least
    one of those sequences in a random sequence of
    length N.

15
Step One Finding the sequences
  • Let K = K(t) be the number of sequences of length
    L (out of the 4^L) with a score above t.
  • We have a branch and bound algorithm for
    enumerating them in time linear in K.
  • Problem: in some cases K might be too large. For
    example, suppose L = 20, and only one in a thousand
    sequences of length L has a score higher than t.
    Then K = 4^20/1000 ≈ 4^15 ≈ 1 billion.
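
The slides do not spell out the branch and bound procedure. One common way to do it, shown here as a sketch under that assumption, is a depth-first search that abandons a prefix as soon as even the best possible completion cannot exceed t. The name `enumerate_above` and the threshold value are illustrative.

```python
import math

F = {
    "A": [0, 1, 0, 0, 0, 3],
    "C": [1, 0, 0, 3, 4, 1],
    "T": [1, 4, 5, 2, 0, 1],
    "G": [3, 0, 0, 0, 1, 0],
}
M = {a: [math.log(c / 5) if c else float("-inf") for c in row]
     for a, row in F.items()}

def enumerate_above(M, t, alphabet="ACGT"):
    """List all sequences whose PSSM score is strictly above t."""
    L = len(M[alphabet[0]])
    # best_suffix[i] = maximal score obtainable from positions i..L-1.
    best_suffix = [0.0] * (L + 1)
    for i in range(L - 1, -1, -1):
        best_suffix[i] = best_suffix[i + 1] + max(M[a][i] for a in alphabet)
    found = []

    def extend(prefix, score):
        i = len(prefix)
        if i == L:
            found.append(prefix)  # reached only if score > t
            return
        for a in alphabet:
            s = score + M[a][i]
            # Bound: descend only if the best completion can still exceed t.
            if s + best_suffix[i + 1] > t:
                extend(prefix + a, s)

    extend("", 0.0)
    return found

hits = enumerate_above(M, math.log(0.05))  # likelihood above 0.05
print(hits)
```

Because a branch is entered only when some completion can beat t, the work done stays roughly proportional to the number of hits rather than to 4^L.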

16
Approximating K
  • If K is too large we cannot enumerate all the K
    sequences, but only try to estimate their number
    (i.e. K).
  • There are various methods to do so (Gaussian
    approximation, statistical mechanics, large
    deviations techniques). We used the generating
    functions method, which proved to be the best.
  • The method can give both lower and upper bounds
    on the correct number K.
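
The slides give no details of the generating functions method, so the following is only a sketch of the general idea, with the discretisation step and the name `count_above` as assumptions. Each position contributes a polynomial with one term per letter, whose exponent is the letter's score in integer units of `step`. In the product of these polynomials, the coefficient of x^s is the number of sequences with total discretised score s. Rounding every entry down gives a lower bound on K, and rounding up gives an upper bound.

```python
import math
from collections import Counter

F = {
    "A": [0, 1, 0, 0, 0, 3],
    "C": [1, 0, 0, 3, 4, 1],
    "T": [1, 4, 5, 2, 0, 1],
    "G": [3, 0, 0, 0, 1, 0],
}
M = {a: [math.log(c / 5) if c else float("-inf") for c in row]
     for a, row in F.items()}

def count_above(M, t, step=0.01, rounding=math.floor, alphabet="ACGT"):
    """Count sequences whose discretised score exceeds t, without listing them."""
    L = len(M[alphabet[0]])
    poly = Counter({0: 1})  # exponent (score in units of step) -> number of prefixes
    for i in range(L):
        nxt = Counter()
        for a in alphabet:
            if M[a][i] == float("-inf"):
                continue  # an impossible letter contributes no term
            d = rounding(M[a][i] / step)
            for s, c in poly.items():
                nxt[s + d] += c
        poly = nxt
    return sum(c for s, c in poly.items() if s * step > t)

t = math.log(0.05)
lower = count_above(M, t, rounding=math.floor)  # scores rounded down
upper = count_above(M, t, rounding=math.ceil)   # scores rounded up
print(lower, upper)
```

The cost depends on the number of distinct discretised scores, not on K, which is what makes this usable when K is too large to enumerate.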

17
(No Transcript)
18
Two Steps
  • 1. Find the set of all the sequences of
    length L with a score above the threshold t.
  • 2. Calculate the probability of finding at least
    one of those sequences in a random sequence of
    length N.

19
Step Two Calculating Probabilities
  • We are given a set of K sequences. We need to
    calculate the probability of finding at least one
    of them in a random sequence of length N.
  • First, let's consider a simpler problem, where we
    have one target R (K = 1).

20
  • Define H = the number of occurrences of R in a
    promoter region of length N.
  • Our p-value is P(H > 0) = 1 - P(H = 0)
  • Example: R = AACG occurs three times in
    AAACGGTTGTTACAACGGTTCCTCCAACG, so H = 3.
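
A short sketch of counting H for the example on this slide. The function name `count_occurrences` is illustrative; overlapping occurrences are counted separately.

```python
def count_occurrences(sequence, target):
    """Number of start positions (overlaps allowed) where target occurs."""
    L = len(target)
    return sum(sequence[i:i + L] == target
               for i in range(len(sequence) - L + 1))

H = count_occurrences("AAACGGTTGTTACAACGGTTCCTCCAACG", "AACG")
print(H)  # 3
```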
21
A Naive Approximation
  • At a specific position, the probability of R
    appearing is 1/4^L, and the probability of R not
    appearing is (1 - 1/4^L).
  • A naive approximation:
  • We have N-L+1 possible start positions, so
  • P(H = 0) ≈ (1 - 1/4^L)^(N-L+1)
  • Problem: we have neglected correlations!
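
The naive formula is easy to evaluate. The sketch below (function name and example sizes are illustrative) computes it for a target of length 3 in a sequence of length 4; the result depends only on N and L, never on the target itself.

```python
def naive_p_value(N, L, alphabet_size=4):
    """Naive P(H > 0) = 1 - (1 - 1/4^L)^(N-L+1), treating start positions as independent."""
    return 1.0 - (1.0 - alphabet_size ** -L) ** (N - L + 1)

print(naive_p_value(4, 3))  # 127/4096, about 0.0310
```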

22
Why Do Correlations Matter?
  • TAA appears in 8 sequences of length 4, so P(H > 0) = 8/4^4:
    ATAA CTAA GTAA TTAA TAAA TAAC TAAG TAAT
  • AAA appears in 7 sequences of length 4, so P(H > 0) = 7/4^4:
    AAAA AAAC AAAG AAAT CAAA GAAA TAAA
  • The difference is in their self-overlapping patterns:
    TAA has no self overlaps, while AAA has the maximal
    number of self overlaps.
  • Fewer self overlaps means a higher P(H > 0).
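
These counts can be confirmed by brute force over all 4^4 = 256 sequences of length 4. The helper name `sequences_containing` is illustrative.

```python
from itertools import product

def sequences_containing(target, N, alphabet="ACGT"):
    """All length-N sequences that contain target at least once."""
    return ["".join(s) for s in product(alphabet, repeat=N)
            if target in "".join(s)]

for target in ("TAA", "AAA"):
    hits = sequences_containing(target, 4)
    print(target, len(hits), len(hits) / 4 ** 4)
```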
23
The effect of self overlaps
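  • The mean of H is E(H) = (N-L+1)(1/4^L),
  • independent of the specific sequence R.
  • Proof: write H = X1 + ... + X(N-L+1), where Xi = 1 if R
    starts at position i and 0 otherwise. Each E(Xi) = 1/4^L,
    and expectations add even when the Xi are correlated.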
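
A brute-force check of this claim for small sizes (the function name `mean_occurrences` is illustrative): averaging H over every sequence of length N gives the same mean for TAA and AAA, even though their P(H > 0) values differ.

```python
from itertools import product

def mean_occurrences(target, N, alphabet="ACGT"):
    """Average number of (possibly overlapping) occurrences over all length-N sequences."""
    L = len(target)
    total = 0
    for s in product(alphabet, repeat=N):
        seq = "".join(s)
        total += sum(seq[i:i + L] == target for i in range(N - L + 1))
    return total / len(alphabet) ** N

print(mean_occurrences("TAA", 4), mean_occurrences("AAA", 4))  # both (4-3+1)/4^3
```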

24
The effect of self overlaps
  • The correlation between nearby Xi's (Xi = 1 if R
    starts at position i) depends on the specific sequence R.

[Figure: example targets R1 and R2.]
  • Fewer self overlaps means a higher P(H > 0)

25
Algorithm
  • We have developed a recursive algorithm which
    takes the correlations into account. It
    calculates the exact value of P(H > 0).
  • In the more interesting case, where we have a set
    of K target sequences, our method still applies.
  • If we assume that the promoter's DNA is not
    uniformly random, but has different probabilities for
    A, T, C and G, the same algorithm still works.
  • Time complexity: O(N K log K)
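
The recursive algorithm itself is not described in the slides. As a stand-in that also gives exact values, the sketch below uses a standard prefix-automaton dynamic program, which is a different technique and makes no claim to the stated complexity. It tracks the longest suffix of the text read so far that is a prefix of some target, and discards probability mass once a target is completed. The name `prob_at_least_one` is illustrative, and all targets are assumed to have the same length.

```python
def prob_at_least_one(targets, N, probs=None, alphabet="ACGT"):
    """Exact P(H > 0) for equal-length targets in an i.i.d. sequence of length N."""
    probs = probs or {c: 1.0 / len(alphabet) for c in alphabet}
    targets = set(targets)
    prefixes = {w[:i] for w in targets for i in range(len(w) + 1)}

    def step(state, c):
        # Longest suffix of state + c that is a prefix of some target.
        s = state + c
        while s not in prefixes:
            s = s[1:]
        return s

    # dist[state] = probability of being in state with no occurrence so far.
    dist = {"": 1.0}
    for _ in range(N):
        nxt = {}
        for state, p in dist.items():
            for c in alphabet:
                s = step(state, c)
                if s in targets:
                    continue  # an occurrence was completed: leave the "no hit" mass
                nxt[s] = nxt.get(s, 0.0) + p * probs[c]
        dist = nxt
    return 1.0 - sum(dist.values())

print(prob_at_least_one(["TAA"], 4))  # 8/256
print(prob_at_least_one(["AAA"], 4))  # 7/256
```

Passing a `probs` dictionary with unequal letter probabilities covers the non-uniform background mentioned above.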

26
Algorithm (Cont.)
  • When K is too large, we don't know the exact
    sequences, but only (an approximation of) their
    number.
  • What we do: take the worst-case scenario (i.e.
    the highest P(H > 0) possible).
  • The highest p-value corresponds to the fewest
    overlaps, so we assume no overlaps at all.

27
Algorithm (Cont.)
  • The case of no overlaps is possible for K < 4^L/L.
  • We are usually interested in much smaller values
    of K. Thus, for our case of interest, the bound
    we get is quite tight.
  • An upper bound on K together with a lower bound on
    overlaps gives an upper bound on P(H > 0), the p-value!
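
The slides do not give the closed form of their no-overlap bound. As a simpler stand-in (not the authors' bound), the union bound is always a valid upper bound for a uniform background: P(H > 0) is at most E(H) = K(N-L+1)/4^L. The function name is illustrative.

```python
def union_bound_p_value(K, N, L, alphabet_size=4):
    """Upper bound P(H > 0) <= min(1, K * (N - L + 1) / 4^L), uniform background."""
    return min(1.0, K * (N - L + 1) / alphabet_size ** L)

# One target of length 3 in a sequence of length 4.
print(union_bound_p_value(1, 4, 3))  # 0.03125
```

For a single target with no self overlaps, such as TAA in a sequence of length 4, this bound equals the exact value 8/4^4.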

28
Sketch Of The System
  • Input: genes' promoter regions and transcription
    factors' weight matrices
  • Scan to get max scores
  • Estimate K (for each pair!)
  • K small: enumerate the K sequences and calculate the p-value
  • K large: bound the p-value
  • Statistical evaluation: apply FDR to the p-values
  • Output: statistically significant motifs
29
Overview
  • Introduction and Definitions
  • P-value Algorithm
  • Experimental Results
  • Generalizations and future work

30
A Comparison with Matinspector
  • We used the Promoter Database of Saccharomyces
    cerevisiae. It contains genes and for every gene
    the TFs that are known to bind its promoter. We
    took 24 Transcription Factors whose PSWM is
    known, and 135 promoters of genes which are known
    to be bound by at least one of them.
  • Our Results
  • We calculated the p-value (using the algorithm)
    for each of the TFs on each of the genes.

31
Our Results
  • Each gene has 25 p-values, one for each TF. We used
    the FDR method to find the statistically
    significant TFs for every gene.
  • Here the threshold for the FDR is 0.1, but we
    will check the results for a range of threshold
    values.
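
The slides only say "the FDR method". Assuming this means the Benjamini-Hochberg step-up procedure, a minimal sketch looks like this; the p-values in the example are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Indices of hypotheses rejected at FDR level q (Benjamini-Hochberg step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its threshold rank * q / m.
        if p_values[i] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical p-values of five TFs for one gene.
significant = benjamini_hochberg([0.001, 0.04, 0.03, 0.5, 0.9], q=0.1)
print(significant)  # [0, 1, 2]
```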

32
Our Results
  • After we found the statistically significant TFs for
    every gene, we compared the results with the data
    from the database. There are 2 parameters:
  • False positives rate: TFs that we found to be
    statistically significant, but that are not known to
    bind the gene.
  • False negatives rate: TFs that are known to
    bind the gene, but that we didn't find.
  • Lower parameter values mean better results.

33
Our Results Graph
  • We calculated the average of these 2 parameters
    (false positives rate, false negatives rate) over
    all the genes, for a range of FDR threshold
    values, Q = 0.01, ..., 0.45.
  • Notice that the false positives rate is very
    close to the FDR threshold value.

34
Matinspector Results
  • For every gene Matinspector gives the number of
    occurrences of every TF in its promoter, and an
    estimation of the expected number of occurrences
    (re value).
  • To compare to our results, we decided to declare
    a TF as significant if it was found more times
    than it is expected to be found, or, in other
    words, if the ratio
  • (expected number)/(number of occurrences) is
    lower than a certain threshold.

35
Results Graphs
Matinspector Results
In our results the average of the 2 parameters
(green) is always lower, and the false positives
rate (red) is always much lower.
36
Comparison with Synthetic Data
Synthetic - all positives are annotated
The left graph shows lower error rates than in
our true data. The right graph shows error rates
similar to those in our data, thus suggesting an
estimate for the number of real binding
sites missing from the database.
37
Overview
  • Introduction and Definitions
  • P-value Algorithm
  • Experimental Results
  • Generalizations and future work

38
Markov Models
  • In true DNA sequences, the nucleotides are not
    independent but rather possess statistical
    dependencies at close distances.
  • To model this, we used a Markov model, in which
    the distribution of each letter depends on the m
    previous ones. m is called the size of the
    model.
  • Example (m = 1): each letter depends on the previous one.
1st\2nd A C T G
A 0.37 0.17 0.18 0.28
C 0.32 0.20 0.17 0.31
T 0.30 0.23 0.19 0.27
G 0.25 0.20 0.17 0.37
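
A sketch of how such a table assigns a probability to a sequence. The uniform distribution for the first letter and the function name `log_prob` are assumptions; the slides give only the transition probabilities.

```python
import math

# Transition table from the slide: P[first][second].
# As printed, the T and G rows sum to 0.99 because of rounding.
P = {
    "A": {"A": 0.37, "C": 0.17, "T": 0.18, "G": 0.28},
    "C": {"A": 0.32, "C": 0.20, "T": 0.17, "G": 0.31},
    "T": {"A": 0.30, "C": 0.23, "T": 0.19, "G": 0.27},
    "G": {"A": 0.25, "C": 0.20, "T": 0.17, "G": 0.37},
}

def log_prob(seq, P, start=None):
    """Log-probability of seq under a first-order Markov model."""
    start = start or {a: 0.25 for a in P}  # assumed uniform first letter
    lp = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(P[prev][cur])
    return lp

print(math.exp(log_prob("AA", P)))  # 0.25 * 0.37
```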
39
Markov Models (Cont.)
  • We can use this for both the binding site model
    and the background (random) model. The more
    realistic the model, the more realistic we hope
    the p-values will be.
  • We obtain a tradeoff as we increase m:
  • Advantages
  • More reliable p-values?
  • Reduced false positive/negative errors?
  • Drawbacks
  • Need more data to represent the model?
  • Computational complexity increases?

40
Other Possible Directions
  • Account for multiple occurrences, P(H = n).
  • Account for combinations of motifs. Find pairs
    (or larger groups) which are statistically
    significant.
  • Use close genomes (e.g. human and mouse) in order
    to reduce false positives rate.
  • Combine with expression data (how ?)

41
The END
Thanks to
  • Ido Kanter
  • Gaddy Getz
  • Eytan Domany