Regulatory Motif Finding - PowerPoint PPT Presentation

Title: Regulatory Motif Finding


1
Regulatory Motif Finding
  • Mohammed AlQuraishi

2
Talk Outline
  • Biology Background
  • Algorithmic Problem
  • Papers
  • New Motif Finding Algorithm (MotifCut)
  • Analysis of Motif Finders Performance

4
Cell = Factory, Proteins = Machines
Source: Biovisions, Harvard
5
DNA
  • Instructions for making the machines

Coding Regions
Regulatory Regions (Regulons)
  • Instructions for when and where to make them

6
Transcriptional Regulation
  • Regulatory regions are comprised of binding
    sites
  • Binding sites attract a special class of
    proteins, known as transcription factors
  • Bound transcription factors can promote or inhibit
    DNA transcription

7
DNA Regulation
Source: Richardson, University College London
8
Cell Regulation
  • Transcriptional regulation is one of many
    regulatory mechanisms in the cell

Focus of Talk
Source: Mallery, University of Miami
9
Structural Basis of Interaction
10
Structural Basis of Interaction
  • Key Feature
  • Transcription factors are not 100% specific when
    binding DNA
  • Not one sequence, but family of sequences, with
    varying affinities

[Figure: family of binding sequences; values shown: 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]
11
Talk Outline
  • Biology Background
  • Algorithmic Problem
  • Papers
  • New Motif Finding Algorithm (MotifCut)
  • Analysis of Motif Finders Performance

13
Motif Finding
  • Basic Objective
  • Find regions in the genome that bind
    transcription factors
  • Many classes of algorithms, which differ in
  • Types of input data
  • Motif representation

14
Input Data
  • Single sequence
  • Evolutionarily related set of sequences
  • Sequence + other data
  • Microarray expression profile
  • ChIP-chip
  • Others

15
Motif Representation
  • Probabilistic
  • Word-Based

Focus of Talk
16
Motif Representation
  • Structural discussion immediately raises
    difficulties

17
Structural Basis of Interaction
  • Key Feature
  • Transcription factors are not 100% specific when
    binding DNA
  • Not one sequence, but family of sequences, with
    varying affinities

[Figure repeated from slide 10: family of binding sequences; values shown: 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]
18
Motif Representation
  • Structural discussion immediately raises
    difficulties
  • Least Expressive
  • Single sequence
  • Most Expressive
  • 4^k-dimensional probability distribution
  • Independently assign probability for each
    possible kmer
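
To make the gap between the two extremes concrete, here is a small sketch that counts the free parameters of each representation for a few kmer lengths. The function names are illustrative and not from the talk.

```python
def single_sequence_params(k: int) -> int:
    """Least expressive: one fixed base per position, so k choices."""
    return k


def full_distribution_params(k: int) -> int:
    """Most expressive: one probability per possible kmer.

    There are 4**k kmers; the probabilities sum to 1, so one of them
    is determined by the others.
    """
    return 4 ** k - 1


for k in (6, 8, 10):
    print(k, single_sequence_params(k), full_distribution_params(k))
```

The count for the full distribution grows exponentially in k, which is why it cannot be estimated from a handful of observed binding sites.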

19
Motif Representation
  • Standard Solution
  • Position-Specific Scoring Matrix (PSSM)
  • Assuming independence of positions, assign a
    probability for each position
  • Fraught with problems (Will revisit this)
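
As a rough illustration of the independence assumption, the sketch below builds per-position base probabilities from a few aligned sites and scores a kmer as the product of its per-position probabilities. The pseudocount, the function names, and the example sites are assumptions made for this sketch.

```python
from collections import Counter

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0):
    """Column-wise base probabilities from equal-length aligned sites."""
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(k):
        counts = Counter(site[pos] for site in sites)
        pssm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pssm


def kmer_probability(pssm, kmer):
    """Probability of a kmer when positions are treated as independent:
    the product of the per-position probabilities."""
    p = 1.0
    for column, base in zip(pssm, kmer):
        p *= column[base]
    return p


sites = ["AGTGCGAC", "AGTGGGAC", "AGTGCTAC"]  # illustrative sites
pssm = build_pssm(sites)
print(kmer_probability(pssm, "AGTGCGAC"))
print(kmer_probability(pssm, "TTTTTTTT"))
```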

20
Talk Outline
  • Biology Background
  • Algorithmic Problem
  • Papers
  • New Motif Finding Algorithm (MotifCut)
  • Analysis of Motif Finders Performance

22
Reference
  • Authors
  • Eugene Fratkin, Brian T. Naughton, Douglas L.
    Brutlag, and Serafim Batzoglou
  • Title
  • MotifCut: regulatory motif finding with maximum
    density subgraphs
  • Publication
  • Bioinformatics Vol. 22 no. 14 2006, pages
    e150-e157

23
Overview
  • Motif Finding Algorithm (MotifCut)
  • Motivation
  • Oversimplicity of PSSMs
  • Intractability of more complex models

24
Oversimplicity of PSSMs
  • Assumes independence between positions
  • 25% of TRANSFAC motifs have been shown to
    violate this assumption
  • Two examples: ADR1 and YAP6

25
Oversimplicity of PSSMs
  • Assumes independence between positions
  • Generates potentially unseen motifs
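
A tiny example of the "unseen motifs" problem, under the assumption that the PSSM is just the column-wise base frequencies of the observed sites: any kmer that mixes observed bases across positions gets nonzero probability, even if it was never observed. The two sites below are made up for illustration.

```python
from itertools import product

BASES = "ACGT"


def column_frequencies(sites):
    """Per-position base frequencies of equal-length sites."""
    k = len(sites[0])
    n = len(sites)
    return [{b: sum(s[i] == b for s in sites) / n for b in BASES}
            for i in range(k)]


def kmers_with_nonzero_probability(freqs):
    """All kmers a position-independent model can generate."""
    choices = [[b for b in BASES if col[b] > 0] for col in freqs]
    return {"".join(combo) for combo in product(*choices)}


observed = {"ACGT", "TCGA"}  # hypothetical binding sites
generated = kmers_with_nonzero_probability(column_frequencies(sorted(observed)))
unseen = generated - observed
print(sorted(unseen))
```

If the first and last positions are actually correlated (A pairs with T, T pairs with A), the two extra kmers are exactly the ones the real factor would not bind.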

26
Basic Features of MotifCut
  • Does not assume an underlying PSSM
  • Represents a motif with a graph structure
  • In principle maximally expressive
  • In practice not quite
  • Motif finding cast as maximum density subgraph
  • Subquadratic complexity

27
Motif Graph Representation
  • Nodes are kmers
  • Edge weights are distances between kmers

[Figure: example graph on kmer nodes AGTGCGAC, AGTGGGAC, AGTGGGAC, AGTGCTAC, with edges labeled by distance (1, 1, 1, 0, 2, 2)]
  • Generative model: frequency of kmer node equal to
    frequency of generating kmer
  • Distance definition is complicated (Will come
    back to)
  • Same kmer node can appear multiple times

28
Motif Finding
  • Find highest density subgraph
  • Density is defined as sum of edge weights per
    node
  • Somewhat limits representational power
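
The density definition can be sketched directly. The brute-force search below is only meant for toy graphs and is not the method used in the paper; the graph and its weights are invented for illustration.

```python
from itertools import combinations


def density(nodes, weights):
    """Sum of edge weights inside `nodes`, divided by the number of nodes."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    total = sum(weights.get(frozenset(pair), 0.0)
                for pair in combinations(nodes, 2))
    return total / len(nodes)


def densest_subgraph_bruteforce(all_nodes, weights):
    """Try every node subset; exponential, so toy inputs only."""
    best, best_density = (), 0.0
    for size in range(1, len(all_nodes) + 1):
        for subset in combinations(all_nodes, size):
            d = density(subset, weights)
            if d > best_density:
                best, best_density = subset, d
    return set(best), best_density


weights = {
    frozenset({"a", "b"}): 1.0,
    frozenset({"a", "c"}): 1.0,
    frozenset({"b", "c"}): 1.0,
    frozenset({"c", "d"}): 0.1,
}
subset, d = densest_subgraph_bruteforce(["a", "b", "c", "d"], weights)
print(subset, d)
```

Adding the weakly connected node d lowers the per-node average, so the tight triangle wins.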

29
Motif Finding
  • Read new sequence
  • Generate graph as previously described
  • Kmers are generated by shifting one base pair
  • Each kmer in the sequence gets a node, including
    identical kmers
  • Graph contains as many nodes as there are base
    pairs
  • Connect edges with weights based on distances
    between nodes
  • Find densest subgraph
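
The graph construction steps above can be sketched as follows. This sketch records raw Hamming distances on the edges; turning distances into weights is a separate step, and whether overlapping windows should be connected is not stated in the slides, so all pairs are kept here. The sequence is a made-up example.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def kmer_nodes(sequence, k):
    """One (start, kmer) node per window, shifting one base at a time.

    Identical kmers at different positions stay separate nodes.
    """
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def distance_edges(nodes):
    """Hamming distance for every pair of nodes, keyed by start positions."""
    return {(i, j): hamming(a, b)
            for (i, a), (j, b) in combinations(nodes, 2)}


sequence = "AGTGCGACTTAGTGGGAC"  # illustrative input
nodes = kmer_nodes(sequence, 8)
edges = distance_edges(nodes)
print(len(nodes), len(edges), edges[(0, 10)])
```

A sequence of length n yields n - k + 1 nodes, which is close to one node per base pair when k is small relative to n.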

30
Edge Weights
  • Heart of the algorithm, will focus on this
  • Semantics
  • Edge weight is the likelihood of two kmers to be
    in the same motif
  • Use Hamming distance as a way to quantify
    distance between kmers

[Figure: example kmers at Hamming distances 0, 1, 2, 3]
31
Edge Weights
  • Heart of the algorithm, will focus on this
  • Semantics
  • Edge weight is the likelihood of two kmers to be
    in the same motif
  • Use Hamming distance as a way to quantify
    distance between kmers
  • Interpret Hamming distance as a measure of the
    likelihood of two kmers to be in same motif
  • F(Hamming distance) = likelihood of two kmers to
    be in same motif

32
Edge Weights
  • Let's make this a bit more precise
  • But how to compute it?
  • Simulate it!
  • Way too many variables to account for
    analytically: background model, kmer length,
    Hamming distance, etc.

33
Genome Simulation
  • Background + Motifs
  • No genes, promoters, signaling sequences, etc.
  • Background Model
  • 3rd order Markov model
  • Probability of next base depends on previous 3
    bases
  • Modeled on the yeast genome
  • Incorporates GC bias
  • Motif Model
  • PSSM
  • Based on empirically observed information content
    of yeast motifs
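
A minimal sketch of a 3rd order Markov background model: estimate the probability of each base given the previous three bases from a training sequence, then sample from it. The training string here is a short made-up stand-in, not the yeast genome, and the pseudocount is an assumption of this sketch.

```python
import random
from collections import Counter, defaultdict
from itertools import product

BASES = "ACGT"


def train_markov(training, order=3, pseudocount=1.0):
    """Map each length-`order` context to next-base probabilities."""
    counts = defaultdict(Counter)
    for i in range(order, len(training)):
        counts[training[i - order:i]][training[i]] += 1
    model = {}
    for combo in product(BASES, repeat=order):
        context = "".join(combo)
        c = counts[context]
        total = sum(c.values()) + pseudocount * len(BASES)
        model[context] = [(c[b] + pseudocount) / total for b in BASES]
    return model


def generate(model, length, rng, order=3):
    """Sample a sequence; the first `order` bases are drawn uniformly."""
    seq = [rng.choice(BASES) for _ in range(order)]
    while len(seq) < length:
        context = "".join(seq[-order:])
        seq.append(rng.choices(BASES, weights=model[context])[0])
    return "".join(seq[:length])


TRAINING = "ACGTTGCAATGCCGATAGCTAGGCTTACGATCGATCGGATCCA" * 3  # toy data
model = train_markov(TRAINING)
background = generate(model, 200, random.Random(0))
print(background[:60])
```

Because the model is trained on observed context counts, any base-composition bias in the training sequence (such as GC bias) carries over to the generated background.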

34
Genome Simulation
  • Use Markov model to generate 10k 20k length
    sequences of background
  • Seed with 20 motifs generated by the PSSM
  • Result is a simulated genome of yeast
  • We know which parts are the real motifs, and
    which are not
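
The seeding step can be sketched as below: sample sites from a PSSM and overwrite non-overlapping windows of a background sequence, keeping the start positions as ground truth. To stay self-contained, the background here is uniform random rather than a Markov model, and the PSSM values are invented.

```python
import random

BASES = "ACGT"


def peaked(base, strength=0.85):
    """Hypothetical PSSM column that strongly prefers one base."""
    rest = (1.0 - strength) / 3
    return {b: strength if b == base else rest for b in BASES}


def sample_site(pssm, rng):
    """Draw one site, choosing each position independently."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pssm
    )


def plant_motifs(background, pssm, n_sites, rng):
    """Overwrite n_sites non-overlapping windows with sampled sites."""
    k = len(pssm)
    seq = list(background)
    # Candidate starts lie on a grid of stride k, so windows never overlap.
    candidates = list(range(0, len(seq) - k + 1, k))
    starts = sorted(rng.sample(candidates, n_sites))
    for start in starts:
        seq[start:start + k] = sample_site(pssm, rng)
    return "".join(seq), starts


rng = random.Random(1)
PSSM = [peaked(b) for b in "TGACTC"]
background = "".join(rng.choice(BASES) for _ in range(2000))
genome, starts = plant_motifs(background, PSSM, 20, rng)
print(starts[:5], genome[starts[0]:starts[0] + len(PSSM)])
```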

35
Edge Weights
  • Back to the edge-weight likelihood
  • α(k, l) is the number of true motifs of length k
    that are at distance l
  • β(k, l) is the number of non-motifs of length k
    that are at distance l

36
Edge Weights
True Motifs
False Motifs (Part of Background)
37
Edge Weights
Let's perform the calculation from the perspective of
this motif
  • All at Hamming distance 1
  • α(k = 6, l = 1) = 1
  • β(k = 6, l = 1) = 1
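
Counting from the perspective of one motif can be sketched as follows: for a reference motif occurrence, tally every other kmer window by Hamming distance, split into true motif occurrences (α) and background windows (β). The slide's exact formula did not survive in this transcript, so the ratio α / (α + β) at the end is an assumption of this sketch, not the paper's stated definition. The toy genome is invented.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_alpha_beta(genome, motif_starts, reference_start, k):
    """Distance histograms relative to one reference motif occurrence.

    alpha[l]: other true motif windows at distance l
    beta[l]:  non-motif (background) windows at distance l
    """
    motif_set = set(motif_starts)
    reference = genome[reference_start:reference_start + k]
    alpha, beta = Counter(), Counter()
    for t in range(len(genome) - k + 1):
        if t == reference_start:
            continue
        d = hamming(reference, genome[t:t + k])
        if t in motif_set:
            alpha[d] += 1
        else:
            beta[d] += 1
    return alpha, beta


def same_motif_fraction(alpha, beta, l):
    """ASSUMED estimate: share of distance-l windows that are true motifs."""
    total = alpha[l] + beta[l]
    return alpha[l] / total if total else 0.0


genome = "ACGTCCCCACGA"   # toy genome with motifs planted at 0 and 8
alpha, beta = count_alpha_beta(genome, [0, 8], reference_start=0, k=4)
print(dict(alpha), dict(beta), same_motif_fraction(alpha, beta, 1))
```

In a full simulation these counts would be accumulated over every planted motif and every simulated genome, for each combination of k and l.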

38
Edge Weights
  • Computation provides an empirical estimate for the
    edge weight
  • Parameterized by two quantities
  • k, the kmer length
  • l, the Hamming distance between two kmers
  • Fit to a sigmoidal function
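
One way to picture the fitting step, for a fixed k: choose a decreasing sigmoid in the Hamming distance l and pick the parameters that minimize squared error against the empirical estimates. The functional form, the grid search, and the data points below are all assumptions of this sketch; the points are synthetic values generated from a known sigmoid, not results from the paper.

```python
import math


def sigmoid(l, steepness, midpoint):
    """Decreasing sigmoid: close to 1 for small l, close to 0 for large l."""
    return 1.0 / (1.0 + math.exp(steepness * (l - midpoint)))


def fit_sigmoid(points):
    """Least-squares fit by exhaustive search over a coarse parameter grid."""
    best = None
    for a10 in range(1, 51):          # steepness 0.1 .. 5.0
        for b10 in range(0, 81):      # midpoint  0.0 .. 8.0
            a, b = a10 / 10, b10 / 10
            sse = sum((sigmoid(l, a, b) - y) ** 2 for l, y in points)
            if best is None or sse < best[0]:
                best = (sse, a, b)
    return best[1], best[2]


# Synthetic "empirical" estimates for l = 0..6 (illustration only).
points = [(l, sigmoid(l, 2.0, 1.5)) for l in range(7)]
steepness, midpoint = fit_sigmoid(points)
print(steepness, midpoint)
```

A smooth fitted curve lets the edge weight be evaluated for any distance without storing a table of noisy simulation counts.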

39
Edge Weights
  • Normalization step
  • Won't go into details
  • This covers problem formulation
  • How is motif finding actually done?

40
Maximum Density Subgraph
  • Standard graph theory method
  • Max-flow / min-cut
  • O(nm log(n^2/m))
  • Need faster method
  • Developed heuristic approach that utilizes
    max-flow / min-cut method with modifications

41
Maximum Density Subgraph
  • Remove all edges below a certain threshold

42
Maximum Density Subgraph
  • Pick one vertex (do this for every vertex)

43
Maximum Density Subgraph
  • Put back all neighboring edges for that vertex

44
Maximum Density Subgraph
  • Use standard algorithm to calculate densest
    subgraph
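
The four steps above can be sketched roughly as follows. This is one reading of the slides, not the paper's implementation: the per-vertex subproblem is taken to be the seed vertex plus its neighbors, and greedy peeling (repeatedly dropping the lowest-degree node) is swapped in for the max-flow / min-cut routine to keep the example short. The graph and threshold are invented.

```python
def local_adjacency(weights, seed, threshold):
    """Keep edges at or above the threshold, plus every edge of the seed."""
    adj = {}
    for (u, v), w in weights.items():
        if w >= threshold or seed in (u, v):
            adj.setdefault(u, {})[v] = w
            adj.setdefault(v, {})[u] = w
    return adj


def greedy_peel(nodes, adj):
    """Greedy peeling stand-in for an exact densest-subgraph solver."""
    current = set(nodes)
    degree = {u: sum(w for v, w in adj.get(u, {}).items() if v in current)
              for u in current}
    total = sum(degree.values()) / 2.0
    best_nodes, best_density = set(current), total / len(current)
    while len(current) > 1:
        u = min(current, key=lambda x: (degree[x], x))
        current.remove(u)
        total -= degree[u]
        for v, w in adj.get(u, {}).items():
            if v in current:
                degree[v] -= w
        d = total / len(current)
        if d > best_density:
            best_nodes, best_density = set(current), d
    return best_nodes, best_density


def heuristic_densest(weights, threshold):
    """Run the local search from every vertex and keep the best result."""
    best = (set(), 0.0)
    for seed in sorted({u for edge in weights for u in edge}):
        adj = local_adjacency(weights, seed, threshold)
        candidates = {seed} | set(adj.get(seed, {}))
        found = greedy_peel(candidates, adj)
        if found[1] > best[1]:
            best = found
    return best


weights = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9,
           ("c", "d"): 0.2, ("d", "e"): 0.1}
motif_nodes, motif_density = heuristic_densest(weights, threshold=0.5)
print(motif_nodes, motif_density)
```

Restricting each run to a small neighborhood is what keeps the subproblems cheap compared with running an exact solver on the whole graph.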

45
Results
  • Synthetic Tests
  • Plenty of test cases
  • Measure performance as data set size grows
  • Avoid over biasing on empirical data
  • Know real answer, can unambiguously test
    performance
  • Yeast Test
  • Gold standard data (Harbison et al., 2004)

46
Synthetic Tests
  • Varied
  • Motif length
  • Information content
  • Simulated genome (as before)
  • Correlated predicted PSSMs to real ones, counted
    as true positive if correlation > 0.7
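
A minimal sketch of the true-positive criterion, assuming the comparison is a Pearson correlation over the flattened entries of two PSSMs of the same length. How the paper aligns matrices of different lengths or offsets is not covered in the slides, so that case is simply rejected here. The matrices are made up.

```python
import math

BASES = "ACGT"


def pearson(xs, ys):
    """Pearson correlation of two equal-length number lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def pssm_correlation(p, q):
    """Correlation between two PSSMs given as lists of base->prob dicts."""
    if len(p) != len(q):
        raise ValueError("this sketch only compares PSSMs of equal length")
    flat_p = [col[b] for col in p for b in BASES]
    flat_q = [col[b] for col in q for b in BASES]
    return pearson(flat_p, flat_q)


def is_true_positive(predicted, real, cutoff=0.7):
    return pssm_correlation(predicted, real) > cutoff


real = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
close = [{"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1}]
wrong = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
         {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
print(is_true_positive(close, real), is_true_positive(wrong, real))
```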

47
Synthetic Tests Results
48
Yeast Test Results
49
Performance
51
Talk Outline
  • Biology Background
  • Algorithmic Problem
  • Papers
  • New Motif Finding Algorithm (MotifCut)
  • Analysis of Motif Finders Performance
  • Shorter but drier (no pretty pictures)

52
Reference
  • Authors
  • Patrick Ng, Niranjan Nagarajan, Neil Jones, and
    Uri Keich
  • Title
  • Apples to apples: improving the performance of
    motif finders and their significance analysis in
    the Twilight Zone
  • Publication
  • Bioinformatics Vol. 22 no. 14 2006, pages
    e393-e401

53
Overview
  • Twilight Zone
  • Non-negligible probability that a maximally
    scoring random motif would have a higher score
    than motifs that overlap the real motif
  • Motivation
  • Behavior of Motif Finders in Twilight Zone is
    poorly understood
  • Understanding would aid in development of Motif
    Finders
  • Sheds light on whether it is theoretically
    possible

54
Objectives
  • Analyze existing standard (E-value)
  • Statistical significance of motifs in Twilight
    Zone
  • Examine and suggest new metrics
  • Employ new metric for motif finding

56
E-value
  • E-value is defined in terms of information
    content
  • Information Content
  • E-value
  • Expected number of random alignments exhibiting
    an information content at least as high as that
    of the given alignment
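
The information content formula itself did not survive in this transcript. As a stand-in, the sketch below uses a common definition: the sum over alignment columns of the relative entropy between the column's base frequencies and a background distribution, in bits. Treat it as an assumption; the paper's exact scoring (for example, any scaling by the number of sequences) may differ.

```python
import math

BASES = "ACGT"


def information_content(sites, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / q_b), in bits.

    `sites` are equal-length aligned sequences; `background` maps each
    base to its background probability q_b (uniform if omitted).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    n = len(sites)
    k = len(sites[0])
    total = 0.0
    for i in range(k):
        for b in BASES:
            f = sum(s[i] == b for s in sites) / n
            if f > 0:  # a zero frequency contributes nothing
                total += f * math.log2(f / background[b])
    return total


print(information_content(["ACGT"] * 5))
print(information_content(["A", "C", "G", "T"]))
```

A perfectly conserved column contributes 2 bits under a uniform background, while a column that matches the background contributes 0.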

57
Analysis
  • Generate 400 random datasets
  • Dataset: 40 sequences totaling 1485 bases
  • Implant a single motif of length 13 per dataset
  • High likelihood that motif finders would miss it

58
Results
  • Reported E-value
  • 8 × 10^15
  • Very high, very statistically insignificant
  • In principle, theoretically impossible to find
  • Search results
  • Alignment covering 30% of motif found in 288/400
    cases!
  • Data generated exactly in accordance with E-value
    model

59
What's going on?
  • They don't know; they hand-wave it
  • Many satellite alignments boost up effective
    score
  • Difficult to characterize analytically

60
Objectives
  • Analyze existing standard (E-value)
  • Statistical significance of motifs in Twilight
    Zone
  • Examine and suggest new metrics
  • Employ new metric for motif finding

62
New Metric: OPV
  • Also defined in terms of Information Content
  • OPV(s) (Overall p-value)
  • Probability that a random sample of the same size
    as the input set will contain an alignment with
    at least as much information content as s
  • Contrast
  • E-value: expected number of alignments (in
    general)
  • OPV: probability of finding an alignment in a
    dataset

63
Estimation
  • Caveat
  • Random sample (no biasing)
  • Difficult to calculate analytically
  • Estimate empirically
  • General OPV
  • Finder-specific OPV

64
General OPV Estimation
  • Generate 1600 random datasets
  • No implants
  • Run a collection of motif finders on each dataset
  • Pick highest scoring motif in each dataset
  • Out of all finders
  • Sort scores, then pick the score at the 95% quantile

65
General OPV Estimation
Score such that 95% of scores are below it, 5%
above it
66
General OPV Estimation
  • Meaning
  • 95% of the time, the highest scoring random motif
    scored less than s0
  • A score of at least s0 arises from random data
    only 5% of the time
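
The empirical estimate can be sketched as below: collect the best score from each implant-free random dataset, take the 95% quantile as the threshold s0, and read the OPV of any score as the fraction of null datasets that scored at least as high. The null scores here are random stand-in numbers, since running real motif finders is outside the scope of a short example; the function names are illustrative.

```python
import random


def quantile_threshold(null_scores, percent=95):
    """Smallest score with at least `percent`% of null scores at or below it."""
    ordered = sorted(null_scores)
    # Integer arithmetic for ceil(percent * n / 100) avoids float rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def empirical_opv(score, null_scores):
    """Fraction of random datasets whose best score is at least `score`."""
    return sum(s >= score for s in null_scores) / len(null_scores)


# Stand-in for "highest scoring motif per random dataset" (1600 datasets).
rng = random.Random(0)
null_scores = [rng.gauss(10.0, 2.0) for _ in range(1600)]

s0 = quantile_threshold(null_scores)
print(s0, empirical_opv(s0, null_scores))

# Small hand-checkable case: scores 1..100.
simple = list(range(1, 101))
print(quantile_threshold(simple), empirical_opv(96, simple))
```

A finder-specific OPV would use the same computation, with the null scores collected from a single motif finder rather than from the best result across a collection of finders.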

67
General OPV Results
  • Run on previous 400 datasets
  • 90% of correct runs (288/400) were classified as
    noise
  • Not good

68
Finder-Specific OPV Estimation
  • Same as before, but use only one finder
  • Better biased toward the parameter space of the
    specific finder

69
Finder-Specific OPV Results
  • Tested it on Gibbs
  • Same 400 datasets
  • 228 TPs
  • 13 FPs
  • Much better

70
Using OPV
  • Impractical
  • A priori generation is prohibitive given
    parameter space of motif finders
  • Per problem estimation is prohibitive
  • Requires 100x more runs
  • Not theirs

72
Another Metric: ILR (Incomplete Likelihood Ratio)
  • Not defined in terms of Information Content
  • Ratio of null hypothesis to OOPS hypothesis
  • OOPS: one occurrence per sequence
  • Intuition behind it

73
Another Metric: ILR (Incomplete Likelihood Ratio)
74
Objectives
  • Analyze existing standard (E-value)
  • Statistical significance of motifs in Twilight
    Zone
  • Examine and suggest new metrics
  • Employ new metric for motif finding

76
Motif Finding using ILR
  • Used existing algorithms, ranked final output by
    ILR
  • Developed simple new algorithm that uses ILR as
    objective function

77
ILR Motif Finding Results
78
ILR Motif Finding Results
79
ILR Motif Finding Results
  • Promising

81
Objectives
  • Analyze existing standard (E-value)
  • Statistical significance of motifs in Twilight
    Zone
  • Examine and suggest new metrics
  • Employ new metric for motif finding
  • One More Thing!

83
Thank You