Regulatory Motif Finding - PowerPoint PPT Presentation

Slides: 84

Provided by: mohammeda7

Transcript and Presenter's Notes

Title: Regulatory Motif Finding

1
Regulatory Motif Finding

Mohammed AlQuraishi

2
Talk Outline

Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders Performance

4
Cell = Factory, Proteins = Machines
Source: BioVisions, Harvard
5
DNA

Coding Regions
Instructions for making the machines

Regulatory Regions (Regulons)
Instructions for when and where to make them

6
Transcriptional Regulation

Regulatory regions are composed of binding sites
Binding sites attract a special class of proteins, known as transcription factors
Bound transcription factors can inhibit DNA transcription

7
DNA Regulation
Source: Richardson, University College London
8
Cell Regulation

Transcriptional regulation is one of many
regulatory mechanisms in the cell

Focus of Talk
Source: Mallery, University of Miami
9
Structural Basis of Interaction
10
Structural Basis of Interaction

Key Feature
Transcription factors are not 100% specific when binding DNA
Not one sequence, but a family of sequences, with varying affinities

[Figure: binding sequences with values 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]
11
Talk Outline

Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders Performance

13
Motif Finding

Basic Objective
Find regions in the genome that bind transcription factors
Many classes of algorithms, which differ in
Types of input data
Motif representation

14
Input Data

Single sequence

Evolutionarily related set of sequences

Sequence + other data
Microarray expression profile
ChIP-chip
Others

15
Motif Representation

Probabilistic
Word-Based

Focus of Talk
16
Motif Representation

Structural discussion immediately raises
difficulties

17
Structural Basis of Interaction

Key Feature
Transcription factors are not 100% specific when binding DNA
Not one sequence, but a family of sequences, with varying affinities

[Figure: binding sequences with values 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]
18
Motif Representation

Structural discussion immediately raises
difficulties
Least Expressive
Single sequence

Most Expressive
4^k-dimensional probability distribution
Independently assign a probability to each possible kmer

19
Motif Representation

Standard Solution
Position-Specific Scoring Matrix (PSSM)
Assuming independence of positions, assign a
probability for each position

Fraught with problems (Will revisit this)
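
As a minimal sketch of the PSSM idea: each position gets its own base distribution, and the probability of a kmer is the product of its per-position probabilities. The aligned sites and the pseudocount below are invented for illustration and do not come from the talk.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0):
    """Estimate one independent base distribution per motif position."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pssm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pssm

def kmer_probability(pssm, kmer):
    """Probability of a kmer under the PSSM: a product over positions."""
    prob = 1.0
    for column, base in zip(pssm, kmer):
        prob *= column[base]
    return prob

# Hypothetical aligned binding sites, invented for illustration.
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
pssm = build_pssm(sites)
consensus = "".join(max(col, key=col.get) for col in pssm)
```

A PSSM for a motif of length k has only 3k free parameters, compared with 4^k - 1 for the fully general distribution over kmers, which is where both its convenience and its limits come from.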

20
Talk Outline

Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders Performance

22
Reference

Authors
Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou
Title
MotifCut: regulatory motifs finding with maximum density subgraphs
Publication
Bioinformatics Vol. 22 no. 14 2006, pages e150-e157

23
Overview

Motif Finding Algorithm (MotifCut)
Motivation
Oversimplicity of PSSMs
Intractability of more complex models

24
Oversimplicity of PSSMs

Assumes independence between positions
25% of TRANSFAC motifs have been shown to violate this assumption
Two examples: ADR1 and YAP6

25
Oversimplicity of PSSMs

Assumes independence between positions
Generates potentially unseen motifs
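
A small sketch of how the independence assumption produces unseen motifs. The four sites are hypothetical: their two middle positions are perfectly correlated ("AT" or "GC", never "AC" or "GT"), yet the per-position frequencies still assign non-zero probability to the mixed combinations.

```python
from itertools import product

BASES = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies (a PSSM without pseudocounts)."""
    n = len(sites)
    return [{b: sum(s[i] == b for s in sites) / n for b in BASES}
            for i in range(len(sites[0]))]

def pssm_support(pssm):
    """All kmers that the PSSM can generate with non-zero probability."""
    kmers = []
    for combo in product(BASES, repeat=len(pssm)):
        p = 1.0
        for col, b in zip(pssm, combo):
            p *= col[b]
        if p > 0:
            kmers.append("".join(combo))
    return kmers

# Hypothetical sites, invented for illustration.
sites = ["CATG", "CGCG", "CATG", "CGCG"]
pssm = column_frequencies(sites)
unseen = sorted(set(pssm_support(pssm)) - set(sites))
# unseen == ["CACG", "CGTG"]: each gets probability 0.25 under the PSSM
```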

26
Basic Features of MotifCut

Does not assume an underlying PSSM
Represents a motif with a graph structure
In principle maximally expressive
In practice not quite
Motif finding cast as maximum density subgraph
Subquadratic complexity

27
Motif Graph Representation

Nodes are kmers
Edge weights are distances between kmers

[Figure: example graph with four kmer nodes, AGTGCGAC, AGTGGGAC (twice) and AGTGCTAC, and edges labeled with distances 0, 1, 1, 1, 2, 2]

Generative model: frequency of a kmer node equals the frequency of generating that kmer
Distance definition is complicated (will come back to it)
Same kmer node can appear multiple times

28
Motif Finding

Find highest density subgraph

Density is defined as sum of edge weights per
node
Somewhat limits representational power
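
The density definition can be made concrete on a toy graph. This is only a sketch: the node names and weights are invented (larger weight meaning the two nodes are more likely to belong together), and the exhaustive search is feasible only for a handful of nodes.

```python
from itertools import combinations

def density(nodes, weights):
    """Sum of edge weights inside `nodes`, divided by the number of nodes."""
    nodes = list(nodes)
    total = sum(weights.get(frozenset(pair), 0.0)
                for pair in combinations(nodes, 2))
    return total / len(nodes)

def densest_subgraph_bruteforce(all_nodes, weights):
    """Try every non-empty subset; only feasible for toy graphs."""
    best, best_density = None, float("-inf")
    for size in range(1, len(all_nodes) + 1):
        for subset in combinations(all_nodes, size):
            d = density(subset, weights)
            if d > best_density:
                best, best_density = set(subset), d
    return best, best_density

# Toy graph with invented weights: a, b, c are tightly linked, d is not.
weights = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.9,
    frozenset(("c", "d")): 0.1,
}
best, best_density = densest_subgraph_bruteforce(["a", "b", "c", "d"], weights)
```

Adding the weakly connected node d lowers the density from 2.6/3 to 2.7/4, so the densest subgraph stops at the tight triangle.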

29
Motif Finding

Read new sequence
Generate graph as previously described
Kmers are generated by shifting one base pair
Each kmer in the sequence gets a node, including
identical kmers
Graph contains as many nodes as there are base
pairs
Connect edges with weights based on distances
between nodes
Find densest subgraph
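
The graph construction steps can be sketched as follows. This is a simplified illustration, not the MotifCut implementation: it stores the raw Hamming distance on every pair of nodes, which is quadratic in the sequence length, and the example sequence is invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    return sum(x != y for x, y in zip(a, b))

def kmer_nodes(sequence, k):
    """One node per start position, shifting by one base pair each time."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

def build_graph(sequence, k):
    """Complete graph on kmer nodes; each edge stores a Hamming distance."""
    nodes = kmer_nodes(sequence, k)
    edges = {}
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            edges[(a, b)] = hamming(nodes[a][1], nodes[b][1])
    return nodes, edges

# Invented sequence that contains the kmer AGTG twice.
nodes, edges = build_graph("AGTGCAGTG", k=4)
```

The two occurrences of AGTG become separate nodes (positions 0 and 5) joined by an edge of distance 0. A sequence of n bases yields n - k + 1 nodes, which is roughly one node per base pair when n is much larger than k.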

30
Edge Weights

Heart of the algorithm, will focus on this
Semantics
Edge weight is the likelihood of two kmers to be
in the same motif
Use Hamming distance as a way to quantify
distance between kmers

[Figure: illustration of Hamming distances 0 to 3 between kmers]
31
Edge Weights

Heart of the algorithm, will focus on this
Semantics
Edge weight is the likelihood of two kmers to be
in the same motif
Use Hamming distance as a way to quantify
distance between kmers
Interpret Hamming distance as a measure of the likelihood of two kmers being in the same motif
F(Hamming distance) = likelihood of two kmers being in the same motif

32
Edge Weights

Let's make this a bit more precise
But how to compute it?
Simulate it!
Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.

33
Genome Simulation

Background + Motifs
No genes, promoters, signaling sequences, etc.
Background Model
3rd order Markov model
Probability of next base depends on previous 3
bases
Modeled on the yeast genome
Incorporates GC bias
Motif Model
PSSM
Based on empirically observed information content
of yeast motifs

34
Genome Simulation

Use Markov model to generate 10k 20k length
sequences of background

Seed with 20 motifs generated by the PSSM
Result is a simulated yeast genome
We know which parts are the real motifs, and which are not
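
A rough sketch of this kind of simulation, under several assumptions: the training string and the 4-column PSSM are stand-ins invented here (a real run would train on the yeast genome and use PSSMs matched to observed information content), and the sequence length and the non-overlapping placement of sites are illustrative choices.

```python
import random
from collections import defaultdict

BASES = "ACGT"
ORDER = 3  # 3rd order: the next base depends on the previous 3 bases

def train_markov(training_seq, pseudocount=1.0):
    """Count which base follows each 3-base context in a training sequence."""
    counts = defaultdict(lambda: {b: pseudocount for b in BASES})
    for i in range(len(training_seq) - ORDER):
        counts[training_seq[i:i + ORDER]][training_seq[i + ORDER]] += 1
    return counts

def generate_background(counts, length, rng):
    """Sample a background sequence one base at a time."""
    seq = [rng.choice(BASES) for _ in range(ORDER)]
    while len(seq) < length:
        dist = counts["".join(seq[-ORDER:])]
        seq.append(rng.choices(BASES, weights=[dist[b] for b in BASES])[0])
    return "".join(seq)

def sample_site(pssm, rng):
    """Draw one motif instance, sampling each position independently."""
    return "".join(rng.choices(BASES, weights=[col[b] for b in BASES])[0]
                   for col in pssm)

def implant(background, pssm, n_sites, rng):
    """Overwrite non-overlapping windows with motif instances; keep the truth."""
    k = len(pssm)
    seq = list(background)
    starts = rng.sample(range(0, len(seq) - k + 1, k), n_sites)
    for start in starts:
        seq[start:start + k] = sample_site(pssm, rng)
    return "".join(seq), sorted(starts)

rng = random.Random(0)
# Stand-in training data and PSSM, invented for illustration.
training = "ATGCGATATTTAAGCGCTATATAAATTTGCGCATATTAAAGC" * 20
pssm = [{"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
        {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
        {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
        {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05}]
background = generate_background(train_markov(training), 2000, rng)
genome, motif_starts = implant(background, pssm, 20, rng)
```

Because `motif_starts` records where every instance was placed, any kmer in `genome` can later be labeled as a true motif or as background.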

35
Edge Weights

Back to the edge weight function F
α(k, l) is the number of true motifs of length k that are at distance l
β(k, l) is the number of non-motifs of length k that are at distance l

36
Edge Weights
True Motifs
False Motifs (Part of Background)
37
Edge Weights
Let's perform the calculation from the perspective of this motif

All at distance 1 (Hamming distance)
α(k = 6, l = 1) = 1
β(k = 6, l = 1) = 1
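
The counting step can be sketched as below: from the perspective of one motif instance, tally the true motifs (alpha) and the non-motifs (beta) at each Hamming distance. The kmers are invented. The formula combining the two counts did not survive in this transcript, so `motif_fraction` is an assumption, one plausible estimate rather than the paper's definition.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    return sum(x != y for x, y in zip(a, b))

def count_alpha_beta(reference, true_kmers, background_kmers):
    """For one reference motif instance, tally true motifs (alpha) and
    non-motifs (beta) at each Hamming distance l."""
    alpha = Counter(hamming(reference, m) for m in true_kmers)
    beta = Counter(hamming(reference, b) for b in background_kmers)
    return alpha, beta

def motif_fraction(alpha, beta, l):
    """ASSUMPTION: estimate the weight as alpha / (alpha + beta)."""
    total = alpha[l] + beta[l]
    return alpha[l] / total if total else 0.0

# Invented k = 6 example mirroring the slide's counts at l = 1.
reference = "TGACTC"
true_kmers = ["TGACTA"]                  # 1 mismatch
background_kmers = ["TGACGC", "AAGCTT"]  # 1 and 4 mismatches
alpha, beta = count_alpha_beta(reference, true_kmers, background_kmers)
```

With one true motif and one non-motif at distance 1, the assumed ratio comes out to 0.5 for k = 6, l = 1.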

38
Edge Weights

Computation provides an empirical estimate for the edge weight function
Parameterized by two quantities:
k, the kmer length
l, the Hamming distance between two kmers
Fit to a sigmoidal function
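
A minimal sketch of fitting a sigmoid to per-distance estimates for one kmer length. The parameter names `steepness` and `midpoint` and the fitting method (a straight-line fit in logit space) are choices made here, not taken from the paper, and the "empirical" values are generated from a known curve purely to check the fit.

```python
import math

def sigmoid(l, steepness, midpoint):
    """Decreasing logistic curve: near 1 for small l, near 0 for large l."""
    return 1.0 / (1.0 + math.exp(steepness * (l - midpoint)))

def fit_sigmoid(distances, estimates):
    """Least-squares line through logit(estimate) versus distance.

    logit(sigmoid(l)) = -steepness * (l - midpoint), so the slope and
    intercept of that line give both parameters directly.
    """
    logits = [math.log(p / (1.0 - p)) for p in estimates]
    n = len(distances)
    mean_l = sum(distances) / n
    mean_y = sum(logits) / n
    slope = (sum((l - mean_l) * (y - mean_y) for l, y in zip(distances, logits))
             / sum((l - mean_l) ** 2 for l in distances))
    intercept = mean_y - slope * mean_l
    steepness = -slope
    midpoint = intercept / steepness
    return steepness, midpoint

# Synthetic estimates, generated from a known curve for illustration only.
distances = [0, 1, 2, 3, 4, 5]
estimates = [sigmoid(l, 1.5, 2.0) for l in distances]
steepness, midpoint = fit_sigmoid(distances, estimates)
```

The logit trick requires every estimate to lie strictly between 0 and 1; noisy real estimates at the extremes would need clipping or a nonlinear fit instead.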

39
Edge Weights

Normalization step
Won't go into details
This covers problem formulation
How is motif finding actually done?

40
Maximum Density Subgraph

Standard graph theory method
Max-flow / min-cut
O(nm log(n^2/m))
Need faster method
Developed heuristic approach that utilizes
max-flow / min-cut method with modifications

41
Maximum Density Subgraph

Remove all edges below a certain threshold

42
Maximum Density Subgraph

Pick one vertex (do this for every vertex)

43
Maximum Density Subgraph

Put back all neighboring edges for that vertex

44
Maximum Density Subgraph

Use standard algorithm to calculate densest
subgraph
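
The four steps above can be sketched as follows. Two caveats: greedy peeling (repeatedly dropping the node with the smallest weighted degree) is swapped in here as a simple stand-in for the max-flow / min-cut routine, so the result is approximate, and the toy edges and the threshold are invented.

```python
def to_adjacency(edges):
    """Turn {(u, v): weight} into a symmetric adjacency dictionary."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def density(nodes, adj):
    """Sum of edge weights inside `nodes`, per node."""
    inside = sum(w for u in nodes for v, w in adj[u].items() if v in nodes)
    return inside / 2 / len(nodes)  # each edge was counted from both ends

def greedy_peel(adj):
    """Stand-in for max-flow / min-cut: peel the weakest node each round
    and remember the densest node set seen along the way."""
    nodes = set(adj)
    best, best_density = set(nodes), density(nodes, adj)
    while len(nodes) > 1:
        weakest = min(nodes, key=lambda u: (
            sum(w for v, w in adj[u].items() if v in nodes), u))
        nodes.remove(weakest)
        d = density(nodes, adj)
        if d > best_density:
            best, best_density = set(nodes), d
    return best, best_density

def heuristic_densest(edges, threshold):
    """Drop weak edges, then for every vertex restore its own edges and
    search for the densest subgraph of the resulting sparse graph."""
    best, best_density = set(), 0.0
    vertices = {u for pair in edges for u in pair}
    for seed in vertices:
        kept = {pair: w for pair, w in edges.items()
                if w >= threshold or seed in pair}
        found, d = greedy_peel(to_adjacency(kept))
        if d > best_density:
            best, best_density = found, d
    return best, best_density

# Toy graph with invented weights.
edges = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.9,
         ("c", "d"): 0.1, ("d", "e"): 0.2}
best, best_density = heuristic_densest(edges, threshold=0.5)
```

Thresholding keeps each per-vertex graph sparse, which is what makes running a densest-subgraph routine once per vertex affordable.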

45
Results

Synthetic Tests
Plenty of test cases
Measure performance as data set size grows
Avoid over-biasing on empirical data
Know the real answer, so performance can be tested unambiguously
Yeast Test
Gold standard data (Harbison et al., 2004)

46
Synthetic Tests

Varied
Motif length
Information content
Simulated genome (as before)
Correlated predicted PSSMs to real ones; counted as a true positive if correlation > 0.7
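
A sketch of the true-positive criterion, assuming Pearson correlation over the flattened matrix entries and assuming the two PSSMs have equal length and are already aligned. The talk does not say which correlation measure or alignment step was used, and the three matrices are invented.

```python
import math

BASES = "ACGT"

def flatten(pssm):
    """Lay a PSSM out as one vector: position by position, base by base."""
    return [col[b] for col in pssm for b in BASES]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def is_true_positive(predicted, real, cutoff=0.7):
    """Count a prediction as correct if its correlation exceeds the cutoff."""
    return pearson(flatten(predicted), flatten(real)) > cutoff

# Invented matrices: `predicted` is a noisy copy of `real`, `unrelated` is not.
real = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
predicted = [{"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
             {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
             {"A": 0.2, "C": 0.1, "G": 0.1, "T": 0.6}]
unrelated = [{"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
             {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
             {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
```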

47
Synthetic Tests Results
48
Yeast Test Results
49
Performance
51
Talk Outline

Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders Performance
Shorter but drier (no pretty pictures)

52
Reference

Authors
Patrick Ng, Niranjan Nagarajan, Neil Jones, and Uri Keich
Title
Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone
Publication
Bioinformatics Vol. 22 no. 14 2006, pages e393-e401

53
Overview

Twilight Zone
Non-negligible probability that a maximally scoring random motif would have a higher score than motifs that overlap the real motif
Motivation
Behavior of motif finders in the Twilight Zone is poorly understood
Understanding would aid in the development of motif finders
Sheds light on whether finding motifs there is theoretically possible

54
Objectives

Analyze existing standard (E-value)
Statistical significance of motifs in Twilight
Zone
Examine and suggest new metrics
Employ new metric for motif finding

56
E-value

E-value is defined in terms of information
content
Information Content
E-value
Expected number of random alignments exhibiting
an information content at least as high as that
of the given alignment
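
The information content formula itself is missing from this transcript. The sketch below uses a common definition, the relative entropy in bits between observed column frequencies and a background distribution, summed over columns. Treat that choice, the uniform background, and the example alignments as assumptions; the paper's exact scoring may differ (for example by a factor of the number of sequences).

```python
import math
from collections import Counter

def information_content(alignment, background=None):
    """Sum over columns of the relative entropy (in bits) between the
    observed base frequencies and the background frequencies."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(alignment)
    total = 0.0
    for column in zip(*alignment):
        for base, count in Counter(column).items():
            freq = count / n
            total += freq * math.log2(freq / background[base])
    return total

# Invented alignments: one perfectly conserved, one noisy.
conserved = ["TGACTC", "TGACTC", "TGACTC", "TGACTC"]
mixed = ["TGACTC", "AGACTA", "TCAGTC", "TGTCGC"]
```

A perfectly conserved column contributes log2(4) = 2 bits under a uniform background, so `conserved` scores 12 bits over its 6 columns, while a column with all four bases equally represented contributes nothing.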

57
Analysis

Generate 400 random datasets
Dataset: 40 sequences totaling 1485 bases
Implant a single motif of length 13 per dataset
High likelihood that motif finders would miss it

58
Results

Reported E-value
8 x 10^15
Very high, i.e. statistically insignificant
In principle, theoretically impossible to find
Search results
Alignment covering 30% of motif found in 288/400 cases!
Data generated exactly in accordance with E-value
model

59
What's going on?

They don't know, and hand-wave it
Many satellite alignments boost the effective score
Difficult to characterize analytically

60
Objectives

Analyze existing standard (E-value)
Statistical significance of motifs in Twilight
Zone
Examine and suggest new metrics
Employ new metric for motif finding

62
New Metric: OPV

Also defined in terms of Information Content
OPV(s) (Overall p-value)
Probability that a random sample of the same size
as the input set will contain an alignment with
at least as much information content as s
Contrast
E-value: expected number of alignments (in general)
OPV: probability of finding an alignment in a dataset
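
The contrast can be written out as a short sketch. The symbol N(s) is introduced here, not taken from the paper: it stands for the number of alignments in a random sample of the same size as the input whose information content is at least that of s.

```latex
\begin{align*}
\text{E-value}(s) &= \mathbb{E}\big[N(s)\big], \\
\mathrm{OPV}(s)   &= \Pr\big[N(s) \ge 1\big] \;\le\; \mathbb{E}\big[N(s)\big].
\end{align*}
```

The inequality is Markov's inequality, and it holds only when both quantities are taken over the same random sample. It shows why the two metrics can diverge: an expectation can be far above 1 while the probability it bounds never exceeds 1.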

63
Estimation