Title: Regulatory Motif Finding
1Regulatory Motif Finding
2Talk Outline
- Biology Background
- Algorithmic Problem
- Papers
- New Motif Finding Algorithm (MotifCut)
- Analysis of Motif Finders Performance
4Cell = Factory, Proteins = Machines
Biovisions, Harvard
5DNA
- Coding Regions
- Instructions for making the machines
- Regulatory Regions (Regulons)
- Instructions for when and where to make them
6Transcriptional Regulation
- Regulatory regions are composed of binding sites
- Binding sites attract a special class of proteins, known as transcription factors
- Bound transcription factors can inhibit DNA transcription
7DNA Regulation
Source: Richardson, University College London
8Cell Regulation
- Transcriptional regulation is one of many
regulatory mechanisms in the cell
Focus of Talk
Source: Mallery, University of Miami
9Structural Basis of Interaction
10Structural Basis of Interaction
- Key Feature
- Transcription factors are not 100% specific when binding DNA
- Not one sequence, but a family of sequences, with varying affinities
(Figure: example binding sequences with affinities 0.54, 0.48, 0.32, 0.25, 0.11, 0.08)
11Talk Outline
- Biology Background
- Algorithmic Problem
- Papers
- New Motif Finding Algorithm (MotifCut)
- Analysis of Motif Finders Performance
13Motif Finding
- Basic Objective
- Find regions in the genome that bind transcription factors
- Many classes of algorithms, which differ in
- Types of input data
- Motif representation
14Input Data
- Evolutionarily related set of sequences
- Sequence + other data
- Microarray expression profile
- ChIP-chip
- Others
15Motif Representation
Focus of Talk
16Motif Representation
- Structural discussion immediately raises
difficulties
17Structural Basis of Interaction
- Key Feature
- Transcription factors are not 100% specific when binding DNA
- Not one sequence, but a family of sequences, with varying affinities
(Figure: example binding sequences with affinities 0.54, 0.48, 0.32, 0.25, 0.11, 0.08)
18Motif Representation
- Structural discussion immediately raises difficulties
- Least Expressive
- Single sequence
- Most Expressive
- 4^k-dimensional probability distribution
- Independently assign probability for each possible kmer
19Motif Representation
- Standard Solution
- Position-Specific Scoring Matrix (PSSM)
- Assuming independence of positions, assign a probability to each base at each position
- Fraught with problems (will revisit this)
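To make the independence assumption concrete, here is a minimal Python sketch of a PSSM built from a few aligned sites. The function names, the pseudocount, and the use of plain probabilities rather than log-odds scores are assumptions for illustration; the three example sites are the kmers shown later in the motif graph figure.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0):
    """Column-wise base probabilities from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pssm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pssm

def kmer_probability(pssm, kmer):
    """Probability of a kmer as a product of independent per-position terms."""
    prob = 1.0
    for column, base in zip(pssm, kmer):
        prob *= column[base]
    return prob

sites = ["AGTGCGAC", "AGTGGGAC", "AGTGCTAC"]  # toy aligned binding sites
pssm = build_pssm(sites)
print(kmer_probability(pssm, "AGTGCGAC"))
```

Because each position contributes its own factor, the model cannot express that a base at one position makes a base at another position more or less likely.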
20Talk Outline
- Biology Background
- Algorithmic Problem
- Papers
- New Motif Finding Algorithm (MotifCut)
- Analysis of Motif Finders Performance
22Reference
- Authors
- Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou
- Title
- MotifCut: regulatory motif finding with maximum density subgraphs
- Publication
- Bioinformatics Vol. 22 no. 14 2006, pages e150-e157
23Overview
- Motif Finding Algorithm (MotifCut)
- Motivation
- Oversimplicity of PSSMs
- Intractability of more complex models
24Oversimplicity of PSSMs
- Assumes independence between positions
- 25% of TRANSFAC motifs have been shown to violate this assumption
- Two Examples: ADR1 and YAP6
25Oversimplicity of PSSMs
- Assumes independence between positions
- Generates potentially unseen motifs
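A small Python sketch of how an independent-position model generates unseen motifs. The two observed sites are made up for illustration: their first and last bases always change together, which a per-column model cannot represent.

```python
from itertools import product

def column_frequencies(sites):
    """Per-position base frequencies, with no pseudocounts."""
    columns = []
    for pos in range(len(sites[0])):
        column = {}
        for site in sites:
            column[site[pos]] = column.get(site[pos], 0.0) + 1.0 / len(sites)
        columns.append(column)
    return columns

def generated_kmers(columns):
    """Every kmer with nonzero probability under the independent model."""
    result = {}
    for combo in product(*[sorted(column) for column in columns]):
        prob = 1.0
        for base, column in zip(combo, columns):
            prob *= column[base]
        result["".join(combo)] = prob
    return result

observed = ["ACGT", "TCGA"]  # hypothetical sites: first and last base are correlated
model = generated_kmers(column_frequencies(observed))
unseen = sorted(kmer for kmer in model if kmer not in observed)
print(unseen)  # ['ACGA', 'TCGT']
```

Half of the probability mass in this toy model lands on kmers that were never observed.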
26Basic Features of MotifCut
- Does not assume an underlying PSSM
- Represents a motif with a graph structure
- In principle maximally expressive
- In practice not quite
- Motif finding cast as maximum density subgraph
- Subquadratic complexity
27Motif Graph Representation
- Nodes are kmers
- Edge weights are distances between kmers
(Figure: example graph with kmer nodes AGTGCGAC, AGTGGGAC, AGTGGGAC, AGTGCTAC and edge labels 0, 1, 2)
- Generative model: frequency of a kmer node equals the frequency of generating that kmer
- Distance definition is complicated (will come back to)
- Same kmer node can appear multiple times
28Motif Finding
- Find highest density subgraph
- Density is defined as sum of edge weights per node
- Somewhat limits representational power
29Motif Finding
- Read new sequence
- Generate graph as previously described
- Kmers are generated by shifting one base pair
- Each kmer in the sequence gets a node, including identical kmers
- Graph contains as many nodes as there are base pairs
- Connect edges with weights based on distances between nodes
- Find densest subgraph
30Edge Weights
- Heart of the algorithm, will focus on this
- Semantics
- Edge weight is the likelihood of two kmers to be in the same motif
- Use Hamming distance as a way to quantify distance between kmers
(Figure: kmers at Hamming distances 0, 1, 2, 3)
31Edge Weights
- Heart of the algorithm, will focus on this
- Semantics
- Edge weight is the likelihood of two kmers to be in the same motif
- Use Hamming distance as a way to quantify distance between kmers
- Interpret Hamming distance as a measure of the likelihood of two kmers to be in the same motif
- F(Hamming distance) = likelihood of two kmers to be in the same motif
32Edge Weights
- Let's make this a bit more precise
- But how to compute it?
- Simulate it!
- Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.
33Genome Simulation
- Background + Motifs
- No genes, promoters, signaling sequences, etc.
- Background Model
- 3rd order Markov model
- Probability of next base depends on previous 3 bases
- Modeled on the yeast genome
- Incorporates GC bias
- Motif Model
- PSSM
- Based on empirically observed information content
of yeast motifs
34Genome Simulation
- Use Markov model to generate 10k 20k length
sequences of background
- Seed with 20 motifs generated by the PSSM
- Result is a simulated genome of yeast
- We know which parts are the real motifs, and
which are not
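A minimal Python sketch of this kind of simulation: train a 3rd-order Markov background, generate a sequence, and implant sites sampled from a PSSM at known positions. The training string, the PSSM values, the sequence length, and all function names are illustrative assumptions, not the data or code used in the paper.

```python
import random

BASES = "ACGT"

def train_markov(sequence, order=3, pseudocount=1.0):
    """Count next-base occurrences for every context of `order` bases."""
    counts = {}
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        counts.setdefault(context, dict.fromkeys(BASES, pseudocount))
        counts[context][sequence[i]] += 1
    return counts

def generate_background(counts, length, order, rng):
    """Sample a sequence where each base depends on the previous `order` bases."""
    seq = [rng.choice(BASES) for _ in range(order)]
    uniform = dict.fromkeys(BASES, 1.0)  # fallback for unseen contexts
    while len(seq) < length:
        table = counts.get("".join(seq[-order:]), uniform)
        seq.append(rng.choices(BASES, weights=[table[b] for b in BASES])[0])
    return "".join(seq)

def sample_site(pssm, rng):
    """Draw one site from a PSSM, one independent base per column."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pssm
    )

def implant(background, pssm, n_sites, rng):
    """Overwrite n_sites non-overlapping windows; return sequence and starts."""
    k = len(pssm)
    seq = list(background)
    starts = sorted(rng.sample(range(0, len(seq) - k + 1, k), n_sites))
    for start in starts:
        seq[start:start + k] = sample_site(pssm, rng)
    return "".join(seq), starts

rng = random.Random(0)
training = "ACGTGCATGCCGATATTAGCGC" * 20  # stand-in for real yeast sequence
model = train_markov(training)
pssm = [{"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}] * 6  # hypothetical motif
genome, motif_starts = implant(generate_background(model, 600, 3, rng), pssm, 20, rng)
```

Keeping `motif_starts` is what makes it possible to label every kmer as true motif or background afterwards.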
35Edge Weights
- Back to the edge weight likelihood
- α(k, l) is the number of true motifs of length k that are at distance l
- β(k, l) is the number of non-motifs of length k that are at distance l
36Edge Weights
True Motifs
False Motifs (Part of Background)
37Edge Weights
Let's perform the calculation from the perspective of this motif
- All 1 distance away (Hamming distance)
- α(k = 6, l = 1) = 1
- β(k = 6, l = 1) = 1
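The two counts can be tallied directly once every kmer in the simulated genome is labeled. The sketch below reproduces the slide's toy case (one true motif and one background kmer at Hamming distance 1 for k = 6); the kmers themselves and the function names are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    return sum(x != y for x, y in zip(a, b))

def count_alpha_beta(reference, labeled_kmers, l):
    """Count true-motif (alpha) and background (beta) kmers at Hamming distance l."""
    alpha = beta = 0
    for kmer, is_motif in labeled_kmers:
        if hamming(reference, kmer) == l:
            if is_motif:
                alpha += 1
            else:
                beta += 1
    return alpha, beta

reference = "AGTGCG"  # hypothetical true motif instance, k = 6
labeled = [("AGTGGG", True), ("AGTACG", False), ("TTTTTT", False)]
print(count_alpha_beta(reference, labeled, l=1))  # (1, 1)
```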
38Edge Weights
- Computation provides an empirical estimate for the edge weight likelihood
- Parameterized by two quantities
- k, the kmer length
- l, the Hamming distance between two kmers
- Fit to a sigmoidal function
39Edge Weights
- Normalization step
- Won't go into details
- This covers problem formulation
- How is motif finding actually done?
40Maximum Density Subgraph
- Standard graph theory method
- Max-flow / min-cut
- O(nm log(n^2/m))
- Need faster method
- Developed heuristic approach that utilizes
max-flow / min-cut method with modifications
41Maximum Density Subgraph
- Remove all edges below a certain threshold
42Maximum Density Subgraph
- Pick one vertex (do this for every vertex)
43Maximum Density Subgraph
- Put back all neighboring edges for that vertex
44Maximum Density Subgraph
- Use standard algorithm to calculate densest
subgraph
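One possible reading of the four steps above, sketched in Python. Two assumptions are worth flagging: the exact max-flow / min-cut routine is replaced here by greedy peeling (a simpler approximation, swapped in to keep the sketch short), and the way edges are restored around each vertex is an interpretation of the slides, not the paper's implementation.

```python
def density(nodes, edges):
    """Sum of edge weights inside `nodes`, divided by the number of nodes."""
    if not nodes:
        return 0.0
    total = sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)
    return total / len(nodes)

def greedy_peel(nodes, edges):
    """Repeatedly drop the node of lowest weighted degree and keep the densest
    intermediate subgraph. An approximation, not the max-flow method."""
    current = set(nodes)
    best, best_density = set(current), density(current, edges)
    while len(current) > 1:
        degree = {n: 0.0 for n in current}
        for (u, v), w in edges.items():
            if u in current and v in current:
                degree[u] += w
                degree[v] += w
        current.remove(min(degree, key=degree.get))
        d = density(current, edges)
        if d > best_density:
            best, best_density = set(current), d
    return best, best_density

def dense_subgraph_heuristic(edges, threshold):
    """Threshold edges, then for every vertex restore its edges and search."""
    strong = {e: w for e, w in edges.items() if w >= threshold}
    vertices = sorted({n for e in edges for n in e})
    best, best_density = set(), 0.0
    for v in vertices:
        local = dict(strong)
        # Put back every edge touching v, including those below the threshold.
        local.update({e: w for e, w in edges.items() if v in e})
        local_nodes = {n for e in local for n in e}
        candidate, d = greedy_peel(local_nodes, local)
        if d > best_density:
            best, best_density = candidate, d
    return best, best_density

# Toy graph: a tight triangle 0-1-2 plus a weakly attached node 3.
toy_edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (0, 3): 0.1}
print(dense_subgraph_heuristic(toy_edges, threshold=0.5))
```

Thresholding keeps each per-vertex graph sparse, which is where the speedup over running the exact method on the full graph would come from.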
45Results
- Synthetic Tests
- Plenty of test cases
- Measure performance as data set size grows
- Avoid over biasing on empirical data
- Know real answer, can unambiguously test performance
- Yeast Test
- Gold standard data (Harbison et al., 2004)
46Synthetic Tests
- Varied
- Motif length
- Information content
- Simulated genome (as before)
- Correlated predicted PSSMs to real ones, counted as true positive if correlation > 0.7
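A hedged Python sketch of the true-positive test. It assumes the two PSSMs have the same length and are already aligned, and it uses Pearson correlation over the flattened matrices; the slides do not say which correlation measure or alignment procedure was used, and the matrices below are invented.

```python
import math

BASES = "ACGT"

def flatten(pssm):
    """Concatenate all columns into one vector in a fixed base order."""
    return [column[b] for column in pssm for b in BASES]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def is_true_positive(predicted, real, cutoff=0.7):
    """Apply the slide's rule: correlation above the cutoff counts as a hit."""
    return pearson(flatten(predicted), flatten(real)) > cutoff

real = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
close = [
    {"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
]
wrong = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
print(is_true_positive(close, real), is_true_positive(wrong, real))
```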
47Synthetic Tests Results
48Yeast Test Results
49Performance
51Talk Outline
- Biology Background
- Algorithmic Problem
- Papers
- New Motif Finding Algorithm (MotifCut)
- Analysis of Motif Finders Performance
- Shorter but drier (no pretty pictures)
52Reference
- Authors
- Patrick Ng, Niranjan Nagarajan, Neil Jones, and Uri Keich
- Title
- Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone
- Publication
- Bioinformatics Vol. 22 no. 14 2006, pages e393-e401
53Overview
- Twilight Zone
- Non-negligible probability that a maximally scoring random motif would have a higher score than motifs that overlap the real motif
- Motivation
- Behavior of Motif Finders in Twilight Zone is poorly understood
- Understanding would aid in development of Motif Finders
- Sheds light on whether finding motifs in the Twilight Zone is theoretically possible
54Objectives
- Analyze existing standard (E-value)
- Statistical significance of motifs in Twilight Zone
- Examine and suggest new metrics
- Employ new metric for motif finding
56E-value
- E-value is defined in terms of information content
- Information Content
- E-value
- Expected number of random alignments exhibiting an information content at least as high as that of the given alignment
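The slide's own formula for information content was not captured in this transcript. As a stand-in, the Python sketch below uses one common definition, the relative entropy of each alignment column against a background distribution, summed over columns and measured in bits. The paper may scale or define the score differently, and the uniform background is an assumption.

```python
import math
from collections import Counter

def information_content(alignment, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b), in bits."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(alignment)
    total = 0.0
    for pos in range(len(alignment[0])):
        counts = Counter(row[pos] for row in alignment)
        for base, count in counts.items():
            freq = count / n
            total += freq * math.log2(freq / background[base])
    return total

print(information_content(["ACG", "ACG"]))          # fully conserved: 2 bits per column
print(information_content(["A", "C", "G", "T"]))    # uniform column: 0 bits
```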
57Analysis
- Generate 400 random datasets
- Dataset: 40 sequences totaling 1485 bases
- Implant a single motif of length 13 per dataset
- High likelihood that motif finders would miss it
58Results
- Reported E-value
- 8 x 10^15
- Very high, i.e. statistically insignificant
- In principle, theoretically impossible to find
- Search results
- Alignment covering 30% of motif found in 288/400 cases!
- Data generated exactly in accordance with E-value model
59What's going on?
- They don't know; they hand-wave it
- Many satellite alignments boost up effective score
- Difficult to characterize analytically
60Objectives
- Analyze existing standard (E-value)
- Statistical significance of motifs in Twilight Zone
- Examine and suggest new metrics
- Employ new metric for motif finding
62New Metric OPV
- Also defined in terms of Information Content
- OPV(s) (Overall p-value)
- Probability that a random sample of the same size as the input set will contain an alignment with at least as much information content as s
- Contrast
- E-value: Expected number of alignments (in general)
- OPV: Probability of finding an alignment in a dataset
63Estimation
- Caveat
- Random sample (no biasing)
- Difficult to calculate analytically
- Estimate empirically
- General OPV
- Finder-specific OPV
64General OPV Estimation
- Generate 1600 random datasets
- No implants
- Run a collection of motif finders on each dataset
- Pick highest scoring motif in each dataset
- Out of all finders
- Sort scores, then pick the score at the 95% quantile
65General OPV Estimation
Score such that 95% of scores are below it, 5% above it
66General OPV Estimation
- Meaning
- 95% of the time, highest scoring random motif scored less than s0
- Obtaining a score s0 means 5% chance for the motif to be random
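The empirical estimate can be sketched in Python as follows. The null scores here are random numbers standing in for "best score over all finders on a dataset with no implant"; running real motif finders is outside the scope of the sketch, and the function names are illustrative.

```python
import random

def empirical_threshold(max_scores, quantile=0.95):
    """Score s0 such that roughly `quantile` of the null scores fall below it."""
    ordered = sorted(max_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def overall_p_value(score, max_scores):
    """Fraction of null datasets whose best score is at least `score`."""
    return sum(s >= score for s in max_scores) / len(max_scores)

rng = random.Random(0)
# Stand-in: 1600 random datasets, best score out of 5 hypothetical finders each.
null_scores = [max(rng.gauss(0, 1) for _ in range(5)) for _ in range(1600)]
s0 = empirical_threshold(null_scores)
print(s0, overall_p_value(s0, null_scores))
```

A candidate motif scoring above `s0` would then be called significant at roughly the 5% level under this null.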
67General OPV Results
- Run on previous 400 datasets
- 90% of correct runs (288/400) were classified as noise
- Not good
68Finder-Specific OPV Estimation
- Same as before, but use only one finder
- Better biased toward the parameter space of the
specific finder
69Finder-Specific OPV Results
- Tested it on Gibbs
- Same 400 datasets
- 228 TPs
- 13 FPs
- Much better
70Using OPV
- Impractical
- A priori generation is prohibitive given parameter space of motif finders
- Per problem estimation is prohibitive
- Requires 100x more runs
- Not theirs
72Another Metric ILR (Incomplete Likelihood Ratio)
- Not defined in terms of Information Content
- Likelihood ratio of the null hypothesis to the OOPS hypothesis
- OOPS: One occurrence per sequence
- Intuition behind it
73Another Metric ILR (Incomplete Likelihood Ratio)
74Objectives
- Analyze existing standard (E-value)
- Statistical significance of motifs in Twilight Zone
- Examine and suggest new metrics
- Employ new metric for motif finding
76Motif Finding using ILR
- Used existing algorithms, ranked final output by ILR
- Developed simple new algorithm that uses ILR as objective function
77ILR Motif Finding Results
78ILR Motif Finding Results
79ILR Motif Finding Results
81Objectives
- Analyze existing standard (E-value)
- Statistical significance of motifs in Twilight Zone
- Examine and suggest new metrics
- Employ new metric for motif finding
- One More Thing!
83Thank You