Title: Do Funnels Exist
1Do Funnels Exist ?
First Rotation Group Meeting
Noa Rappaport, 23/1/04
2Lecture Outline
- What is a funnel? Challenges and ideas
- Analysis Methods
- Analysis Performed
- Simulation
- Signal Smoothing
- Cebp Promoter Funnel
- Plans for the future
3DNA Binding Transcriptional Regulators
- Bind DNA in order to influence mRNA transcription.
- They control cell growth, cell development and differentiation.
- Function by binding to specific DNA sequences located upstream of the gene to induce or repress gene expression.
- These sequences are usually short (5-15 bp) and frequently degenerate, which confers different levels of activity upon different promoters (Bulyk et al.)
4DNA Binding Transcriptional Regulators
- Grouped into families according to sequence and structural homologies, such as:
- Helix-Turn-Helix
- Zinc finger
- Helix-Loop-Helix
- Beta ribbon
5DNA Binding Transcriptional Regulators
- Protein DNA interactions are governed by
- Amino acid base pair interaction
- Van der Waals interactions
- Water mediated protein-DNA hydrogen bonds
- Binding of small ligands
- Homo and hetero protein dimerization
- Binding of an associated transcription factor
- Translational modifications.
- (Marmorstein et al.)
6DNA Binding Transcriptional Regulators
(Jacobson, 1997)
7What is a Funnel ?
[Figure labels: funnel, funnel, DNA strand, ΔG]
AGGTTGCAATTTCTTTTTCTATTAGTAGCTAAAAATGGGTCACGTGATCT
ATATTCGAAAGGGGCGGTTGCCTCAGGAA
8Challenges and ideas
- Creating the energy layout of the sequence
- Understanding the hopping/sliding mechanism
- Smoothing of energy signal
- Funnel Qualification
- Multiplicity of Sites
- Same motif different promoters
- Conservation of funnel
- Adi Shamir's problem
- Simulations
9Funnel - Expected Features
Finding time ?t(finding)
Escape time ?t(escape)
DNA strand
Orthologs/Paralogs
Touch down
Noise Problems
10First Evidence for a Funnel
- In the literature, one piece of negative evidence (Gerland et al., 2002).
- A recently suspected funnel for the TF cebp in the IL18BP promoter.
- Some qualitative observations in yeast.
11Some Problems
- Very few TFs have experimentally verified binding sites.
- Binding sites are usually predicted as the points with the highest score, although this does not necessarily have to be so.
- Very few (100) structures of protein-DNA complexes have been solved by crystallography and NMR.
- Much information exists regarding PSSMs, but it contains errors, and their correction is a theory of its own (pseudocounts).
Analysis has to be performed on predicted sites and a predicted affinity signal.
12Methods of Producing the Affinity Signal
- Exact calculation: possible, but limited due to the small DB of solved protein-DNA structures (ProNIT).
13Methods of Producing the Affinity Signal
PSSM Scoring
AGGTTGCAATTTCTTTTTCTATTAGTAGCTAAAAATGGGTCACGTGATCT
ATATTCGA
Score = -log(0.5 × 0.2 × 0.25 × 0.7 × 0.1 × 0.25 × 0.1)
Score = -log(0.05 × 0.2 × 0.25 × 0.7 × 0.2 × 0.25 × 0.5)
Score = -log(0.05 × 0.3 × 0.25 × 0.1 × 0.6 × 0.25 × 0.5)
Scores vector: 4.36, 4.98, 4.56, ...
And we can keep going
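The sliding-window scoring above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used for the talk: the 3-position matrix `EXAMPLE_PSSM` and the function names are hypothetical, and base-10 logarithms are an assumption (with base 10, the seven probabilities of the first Score line, read as 0.5, 0.2, 0.25, 0.7, 0.1, 0.25, 0.1, give about 4.36).

```python
import math

# Hypothetical 3-position PSSM: one dict of base probabilities per motif position.
EXAMPLE_PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

def score_from_probabilities(probs):
    """Return -log10 of the product of the given probabilities."""
    product = 1.0
    for p in probs:
        product *= p
    return -math.log10(product)

def window_score(window, pssm):
    """Score one window: look up each base in its column, then take -log10."""
    return score_from_probabilities(
        [column[base] for base, column in zip(window, pssm)]
    )

def scores_vector(sequence, pssm):
    """Slide the PSSM along the sequence and score every full-width window."""
    width = len(pssm)
    return [window_score(sequence[i:i + width], pssm)
            for i in range(len(sequence) - width + 1)]
```

Note that a zero probability in any column makes `math.log10` fail, which is one practical face of the zero-value problem that pseudocounts are meant to address.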
14Methods of Producing the Affinity Signal
PSSM Scoring
- Assumes positions are independent; does not allow for logic.
- Might be problematic for values which are zero.
- Some PSSMs are defective.
- Multiplication of the probabilities correlates with the thermodynamic constant K.
- Taking the log correlates with the calculation of ΔG.
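The last two points can be written out as a short relation. This is a sketch under the assumption that K is simply proportional to the product of the per-position probabilities; the symbols p_i(b_i) (probability of the observed base b_i at motif position i), L (motif length), R (gas constant) and T (temperature) are introduced here for illustration and do not appear on the slides.

```latex
K \propto \prod_{i=1}^{L} p_i(b_i)
\quad\Longrightarrow\quad
\Delta G = -RT \ln K = -RT \sum_{i=1}^{L} \ln p_i(b_i) + \text{const}
```

Under this assumption the PSSM score (minus the log of the product) is, up to a scale factor and an additive constant, an estimate of the binding free energy.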
15Methods of Producing the Affinity Signal
Bayesian Network Models
(Barash et al., 2003)
16Possible Approaches
Given a relatively reliable affinity signal, there are a few possible approaches to attack the problem:
- Simulations: giving the kinetic aspect
  - Working on real/smoothed signals.
  - Scan the space of possible theoretical funnels for our 4 demands.
- Analytical calculations
  - Working on real/smoothed signals.
  - Scan the space of possible theoretical funnels for our 4 demands.
17Monte Carlo Modeling
- Any method which solves a problem by generating suitable random numbers and observing the fraction of the numbers that obeys some property or properties.
- The method is useful for obtaining numerical
solutions to problems which are too complicated
to solve analytically.
- The name "Monte Carlo" was given by Metropolis during the Manhattan Project of World War II, because Monte Carlo, in Monaco, was a center for gambling.
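A classic textbook instance of "observing the fraction of random numbers obeying some property" is estimating π. The example is not from the talk; the function name and the seed parameter are illustrative.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the "property" being counted
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / n_samples
```

The statistical error of the estimate shrinks roughly as one over the square root of `n_samples`, which is why many repetitions are needed.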
18Monte Carlo Modeling
- The only requirement is that the physical (or
mathematical) system can be described by
probability density functions (pdf's).
- Once the pdf's are known, the Monte Carlo
simulation can proceed by random sampling from
the pdf's.
- Many simulations are then performed, and the
desired result is taken as an average over the
number of observations.
19Analysis Methods - Simulation
- Represent the space by a discrete lattice.
- Represent the DNA energy layout as a topographic terrain.
- Represent the TF as a moving particle.
- Take discrete points of time.
- The lattice represents the cell, and the possibility of attaching to other exposed sites on the DNA.
(Halford et al., 2002)
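A stripped-down sketch of such a simulation is given below. It is not the simulation used for the results in this talk: it keeps only one-dimensional sliding along the DNA (no lattice, so no dissociation and reassociation), and the Metropolis acceptance rule, the reflecting ends and all names (`finding_time`, `kT`, `max_steps`) are assumptions made for illustration.

```python
import math
import random

def finding_time(energy, start, target, kT=1.0, max_steps=1_000_000, rng=None):
    """Count discrete time steps until a particle started at `start` first reaches `target`.

    `energy` is the terrain: one energy value per DNA coordinate.
    Returns None if the target is not reached within `max_steps`.
    """
    rng = rng or random.Random()
    pos = start
    for step in range(max_steps + 1):
        if pos == target:
            return step
        new = pos + rng.choice((-1, 1))      # propose a move to a neighbouring site
        if new < 0 or new >= len(energy):
            continue                         # reflecting end: stay in place this step
        delta = energy[new] - energy[pos]
        # Metropolis rule (assumed): downhill always, uphill with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            pos = new
    return None
```

For example, a flat terrain is `[0.0] * 101` and a linear funnel centred on coordinate 50 is `[abs(i - 50) for i in range(101)]`; averaging `finding_time` over many runs and initial points gives the kind of finding-time statistic discussed in the following slides.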
20Analysis Methods - Simulation
[Figure: lattice snapshots at T = 0 to T = 5 showing the TF and the binding site, with a "Eureka!" label]
21Probability distribution functions
Intermediate State
- Treat the DNA-TF interactions as a set of second
order elementary reactions.
A + B ⇌ A-B complex (association rate ka, dissociation rate kd; free energy difference ΔG)
22Probability distribution functions
- We have three possible transitions:
- Sliding on the DNA: linear diffusion
[Figure labels: -10 -5; -5 -10]
- Dissociation-reassociation
23Probability distribution functions
- In the first stage we can assume that the energy
barrier when moving from the higher energy state
to the intermediate state is uniform.
- Sliding on the DNA: linear diffusion
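One way to read the uniform-barrier assumption is sketched here: the intermediate state between two neighbouring sites is placed a fixed amount above the higher of the two site energies, so the hop out of the higher state always sees the same barrier, while the hop out of the lower state sees that barrier plus the energy difference. The function name, the `barrier` and `kT` parameters and the Boltzmann form of the hop probability are assumptions made for illustration.

```python
import math

def hop_probability(e_from, e_to, barrier=1.0, kT=1.0):
    """Per-step probability of sliding from a site with energy e_from to a neighbour.

    The intermediate state sits `barrier` above the higher-energy site, so the
    barrier seen from the higher state is uniform (equal to `barrier`).
    """
    intermediate = max(e_from, e_to) + barrier
    return math.exp(-(intermediate - e_from) / kT)
```

With illustrative site energies -5 and -10, the downhill hop (-5 to -10) has probability exp(-barrier/kT), and the uphill hop is smaller by the factor exp(-5/kT), so the ratio of the two depends only on the energy difference.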
24Parameters checked with the simulation
Touch down
Real Promoter vs. Shuffled
25Problems with the simulation
- Activation Energy for
- Sliding
- Attachment-Detachment from the DNA
- Lattice Movement
- Lattice Size
- Lattice Energy
- Assumption of second-order elementary reactions.
(Ferreiro et al., 2003)
26Simulation - Results
[Figure panels: "The energy terrain" (PSSM score vs. coordinate on the DNA); number of time steps vs. coordinate on the DNA]
27Artificial Funnel - Results
Step function terrain
Step function in comparison to funnel
Funnel terrain
Number of time steps
200 initial points
28Multi-Width-Depth Anal.
200 Initial Positions Stop at endpoint
Funnel width
Distracter depth
Compare
29Two Parabolas Analysis - Results
Funnel width: 0.1; distracter width: 0.1; funnel depth: 7; distracter depth: 5
Number of time steps
kinetic verification
30Effect of Funnels Width on Finding Time
1000 initial points
Average of time steps
Funnel Width
31Effect of Distracter Depth on Finding Time
Funnel Width 0.1 100 initial points.
Average of time steps
Distracters Depth
32Step Function Terrain
100 initial points
Finding time mean: 2.9736e004
Finding time std: 3.3282e004
Error in mean finding time: 3.3282e003
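The reported error in the mean is consistent with the usual standard error, std divided by the square root of the number of runs: 3.3282e004 / sqrt(100) = 3.3282e003. A minimal sketch of that calculation follows; the function names are illustrative.

```python
import math
import statistics

def standard_error(std, n):
    """Standard error of the mean for n independent runs with sample std `std`."""
    return std / math.sqrt(n)

def mean_std_error(samples):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    return mean, std, standard_error(std, len(samples))
```

Because the standard deviation here is about as large as the mean, the finding-time distribution is broad, and the error only falls as one over the square root of the number of initial points.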
33Verifying exponential search time in a flat surface
[Figure panels, step function terrain: average of time steps vs. distance from end point; average of time steps vs. log(distance) from end point]
34Verifying linear search time in an all-funnel surface
[Figure, funnel terrain: average of time steps vs. distance from end point]
35Distribution of Finding Times starting from the same IP
The simulation was run over the following terrain. Three initial points were tested, each repeated 200 times.
[Figure panels: finding time distribution for each initial point (funnel edge, right, left)]
36Does a Funnel Improve Capturing of the TF in its Vicinity?
- This question can be addressed by the simulation. The TF was placed at time zero at the funnel's minimum and two factors were checked:
1. The first time the TF escaped the funnel's vicinity.
2. The maximal distance from the funnel's minimum that the TF reached.
- Those factors were checked for a set of combinations of different widths and depths of the funnel. Each simulation was run for 100000 time steps, and each such run was repeated 100 times.
37Results: First Escape Time
Escape time was found to increase with the funnel's depth, as expected. It can also be seen that, for the same distracter depth, the first escape time increases with funnel width.
[Figure axes: average of time steps vs. distracter's depth]
38Results: First Escape Time
- The number of time steps until the first escape was compared to the number on a stair-step terrain. The following graph was obtained.
- The number drops on deeper funnels.
- The vicinity was taken to be all the area contained between the right and left borders, including the lattice.
[Figure axes: mean F.E.T. funnel / mean F.E.T. flat vs. funnel's depth]
39Results: Max Distance
- The maximal distance the TF reached on the DNA was measured for several funnel depths and widths.
[Figure axes: max distance vs. funnel's depth]
40Yeast Analysis
- The non-coding region of the yeast genome is relatively compact, the complete genome is available, and the phylogeny is well characterized.
- Working on a dataset of 4483 promoter sequences in yeast. For each TF it is possible to get a prediction from ScanACE of the promoters it is likely to be found on. Then we can generate a score for each point of the promoter.
41Gal4 Analysis
- One of the binding sites of Gal4 was taken. It is found in the promoter of the galactose permease. The area around it was expanded.
42Gal4 Transcription Factor
- Zinc finger transcription factor
- Binds to the DNA as a homodimer
- Recognizes inverted CGG half-site repeats with an 11 base pair spacer
43Gal4 Binding Site
44Finding Time Analysis
[Figure panels: kT = 4, kT = 4, kT = 6, kT = 5]
45Finding Time Analysis
[Figure panels: kT = 7, kT = 8, kT = 15, kT = 20]
46Vicinity Analysis
Mean FT promoter/Mean FT shuffled
kT
47Same IP analysis, kT = 4
[Figure panels: original vs. random promoter, for a window on the edge of the funnel (680-700) and a window at the far end (100-120)]
48Same IP analysis: Distribution
Original promoter - close
Shuffled promoter - close
49Same IP analysis: Distribution
Original promoter - far
Shuffled promoter - far
50Using Fourier transform for signal smoothing
- The fast discrete Fourier transform was used to
transform between the spatial domain and the
frequency domain.
- In the frequency domain, high frequencies were
zeroed.
- Transforming back to the spatial domain resulted
in a smoother signal.
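The three steps above can be sketched as follows. To stay dependency-free, this sketch uses a naive O(n²) discrete Fourier transform rather than the fast transform mentioned above (in practice something like `numpy.fft.rfft` and `numpy.fft.irfft` would be the usual choice); the cutoff parameter `keep` and the function names are illustrative assumptions.

```python
import cmath
import math

def dft(values):
    """Naive discrete Fourier transform (spatial domain -> frequency domain)."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(coeffs):
    """Naive inverse transform; returns the real part of each sample."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def lowpass_smooth(signal, keep):
    """Zero every frequency above `keep` cycles per signal length, then transform back."""
    n = len(signal)
    coeffs = dft(signal)
    for k in range(n):
        frequency = min(k, n - k)   # bins k and n - k carry the same frequency
        if frequency > keep:
            coeffs[k] = 0
    return inverse_dft(coeffs)
```

A smaller `keep` gives a smoother signal but also flattens narrow features, so the cutoff decides how wide a funnel must be to survive the smoothing.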
51Using Fourier transform for signal smoothing - Example
52IL18BP - Promoter
Scanned with cebpβ PSSM
[Figure panels: smoothed signal; real signal vs. smoothed signal]
53Conservation map for cebp signal
54An Unexpected Result
Alignment of Mouse and Human Promoters
Calculate Similarity Percent
Produce random sequences for which the similarity
is in the same percent
Generate scores vector for each random sequence
- Procedure:
1. Take the mouse and human promoters and check their similarity in sequence.
2. Produce random sequences with the same percent similarity (by randomly changing the same X percent of their positions).
3. Generate a scores vector for each random sequence.
4. Check the correlation of each random sequence with the human funnel.
5. Generate the distribution of the correlation coefficient (number of sequences per coefficient value).
6. Check where the mouse-human coefficient is situated on the distribution. If it lies on this side of the distribution, the conservation in the funnel is due to more than sequence similarity: it is due to funnel similarity.
- Results
- Comparison between the conservation pattern of the human and mouse promoters and that of the human promoter and my random promoter.
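The randomization test described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: `score_fn` stands for any function that turns a sequence into a scores vector (for example a PSSM scan), the mouse scores are assumed to be already aligned to the human ones and of equal length, the p-value is taken as the fraction of random sequences correlating at least as well as the mouse promoter, and the Fourier smoothing step is omitted. All names are hypothetical.

```python
import math
import random

def mutate(sequence, fraction, rng):
    """Return a copy of `sequence` with `fraction` of its positions changed to another base."""
    seq = list(sequence)
    n_changes = round(fraction * len(seq))
    for i in rng.sample(range(len(seq)), n_changes):
        seq[i] = rng.choice([b for b in "ACGT" if b != seq[i]])
    return "".join(seq)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def empirical_p_value(observed, null_values):
    """Fraction of null correlations that are at least as large as the observed one."""
    return sum(v >= observed for v in null_values) / len(null_values)

def conservation_test(human, mouse_scores, score_fn, fraction, n_random=1000, seed=0):
    """Compare the human-mouse correlation with human-vs-random-mutant correlations."""
    rng = random.Random(seed)
    human_scores = score_fn(human)
    observed = pearson(human_scores, mouse_scores)
    null = [pearson(human_scores, score_fn(mutate(human, fraction, rng)))
            for _ in range(n_random)]
    return observed, empirical_p_value(observed, null)
```

A p-value of zero from such a test only means that none of the `n_random` sequences reached the observed correlation, so its resolution is limited to 1/`n_random`.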
55An Unexpected Result
Generate scores vector for each random sequence
Smooth it by Fourier
Check the correlation for each of the random
sequences with the human funnel
56An Unexpected Result
Check where the mouse-human correlation coefficient is situated on the distribution.
Expectation: it will be localized on the right-hand side of the distribution, with a high p-value, meaning the reason for the funnels' similarity is sequence similarity.
The p-value for the mouse-human signals is zero.
57Scanning human promoters with cebp PSSM
58Scanning IL18BP promoter with human PSSMs
Original Promoter
Shuffled Promoter
59Scanning IL18BP promoter with Statx and IRF PSSMs
60Methods of Producing the Affinity Signal
MatInspector Algorithm
On the Motif
Employ an alignment algorithm
Calculate the nucleotide distribution matrix
Calculate ci value for each position
Define a core region
61Methods of Producing the Affinity Signal
MatInspector Algorithm
Scanning
Calculate the core similarity for each position
of the sequence
Calculate the matrix similarity if core
similarity reaches threshold
Binding Sites - the sequences that reach the
minimum core and matrix similarity thresholds.
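The scanning steps above can be sketched as follows. The formulas are assumptions and are not taken from the slides: the Ci value is taken here as (100 / ln 5) × (Σ_b P(i,b) ln P(i,b) + ln 5), the matrix similarity as a Ci-weighted score divided by its maximum, and the core similarity as the unweighted score over the core divided by its maximum. The matrix layout, the core length of 4 and both threshold defaults are hypothetical.

```python
import math

def ci_value(column):
    """Conservation index of one column (dict base -> relative frequency), 0..100."""
    entropy_term = sum(p * math.log(p) for p in column.values() if p > 0)
    return 100.0 / math.log(5) * (entropy_term + math.log(5))

def matrix_similarity(window, matrix):
    """Ci-weighted score of the window divided by the best achievable score."""
    ci = [ci_value(col) for col in matrix]
    got = sum(c * col.get(base, 0.0) for c, col, base in zip(ci, matrix, window))
    best = sum(c * max(col.values()) for c, col in zip(ci, matrix))
    return got / best

def core_similarity(window, matrix, core_start, core_len=4):
    """Unweighted score over the core positions divided by the best achievable score."""
    cols = matrix[core_start:core_start + core_len]
    bases = window[core_start:core_start + core_len]
    got = sum(col.get(base, 0.0) for col, base in zip(cols, bases))
    best = sum(max(col.values()) for col in cols)
    return got / best

def scan(sequence, matrix, core_start, core_threshold=0.75, matrix_threshold=0.8):
    """Return (position, matrix similarity) for windows passing both thresholds."""
    hits = []
    width = len(matrix)
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        # The matrix similarity is only computed when the core passes its threshold.
        if core_similarity(window, matrix, core_start) >= core_threshold:
            similarity = matrix_similarity(window, matrix)
            if similarity >= matrix_threshold:
                hits.append((i, similarity))
    return hits
```

Checking the core first keeps the scan cheap, since most windows fail the core test and never reach the full matrix comparison.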
62For the Future
- Large Scale Analysis for all yeast promoters
- Analytical Calculation for hopefully reducing
run-time
- Optimization of the scoring method
- Use of DB of known binding sites verified
experimentally with high credibility.
- Analysis over the parameters space
63For the Future
- Check all the criteria with other methods of shuffling, e.g. k-mer preserving.
- Analytical calculation for a specific protein-DNA interaction.
- If funnels exist, check for common features of promoters containing them.
64Acknowledgments
Thanks to Tzachi, Ran, , Reut, Arren and
everyone else !
Igor
65PSSM
PSSM: Position-Specific Scoring Matrix.
A
T
C
G
66ScanAce
- ScanACE (Scans for Nucleic Acid Conserved
Elements) is a program which scans DNA sequence
for elements which match a DNA motif.
67Simulations Methods
Gillespie Algorithm
- An algorithm for simulating a system with multiple reaction channels and multiple chemical species.
- Consider a system of r chemical reactions.
- At each time step, the system is in exactly one state.
- A transition occurs when a reaction is executed, and then the state changes.
- Gillespie's direct method calculates which reaction occurs next and when it occurs.
68Simulations Methods
Gillespie Algorithm
- The probability for reaction channel µ to be the
next reaction
- Derive from the probability distribution
for times
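A minimal sketch of the direct method in its standard textbook form follows; the slide's own equations are not reproduced here. With propensities a_µ and a_0 = Σ a_µ, the next reaction is channel µ with probability a_µ / a_0, and the waiting time is exponentially distributed with rate a_0. The binding example (A + B ⇌ A-B with rates `ka` and `kd`) and all function names are illustrative assumptions.

```python
import math
import random

def gillespie_step(propensities, rng):
    """One direct-method step: return (waiting time, index of the next reaction)."""
    a0 = sum(propensities)
    if a0 <= 0:
        return math.inf, None                   # nothing can react any more
    tau = -math.log(1.0 - rng.random()) / a0    # exponential waiting time, rate a0
    threshold = rng.random() * a0               # pick channel mu with probability a_mu / a0
    cumulative = 0.0
    for mu, a in enumerate(propensities):
        cumulative += a
        if threshold < cumulative:
            return tau, mu
    return tau, len(propensities) - 1           # guard against floating-point rounding

def simulate_binding(n_a, n_b, n_ab, ka, kd, t_end, seed=0):
    """Simulate A + B <-> A-B until time t_end; return the final molecule counts."""
    rng = random.Random(seed)
    t = 0.0
    while True:
        propensities = [ka * n_a * n_b, kd * n_ab]
        tau, mu = gillespie_step(propensities, rng)
        if mu is None or t + tau > t_end:
            return n_a, n_b, n_ab
        t += tau
        if mu == 0:     # association
            n_a, n_b, n_ab = n_a - 1, n_b - 1, n_ab + 1
        else:           # dissociation
            n_a, n_b, n_ab = n_a + 1, n_b + 1, n_ab - 1
```

Unlike the fixed-time-step lattice simulation, this scheme jumps directly from one reaction event to the next, so no time steps are spent on moves that change nothing.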
69Monte Carlo Modeling
- Probability distribution functions (pdf's) - the
physical (or mathematical) system must be
described by a set of pdf's.
- Random number generator - a source of random
numbers uniformly distributed on the unit
interval must be available.
- Sampling rule - a prescription for sampling from
the specified pdf's, assuming the availability of
random numbers on the unit interval, must be
given.
70Monte Carlo Modeling
- Scoring (or tallying) - the outcomes must be
accumulated into overall tallies or scores for
the quantities of interest.
- Error estimation - an estimate of the statistical
error (variance) as a function of the number of
trials and other quantities must be determined.
- Variance reduction techniques - methods for
reducing the variance in the estimated solution
to reduce the computational time for Monte Carlo
simulation
- Parallelization and vectorization - algorithms to
allow Monte Carlo methods to be implemented
efficiently on advanced computer architectures.
71Probability distribution functions
Intermediate State
A + B ⇌ A-B complex → C (association rate ka, dissociation rate kd; free energy difference ΔG)