Slide 1: Developing computational methodologies for genome transcript mapping

Daniela Bianchi, Raffaele Calogero, Brunello Tirozzi
Department of Physics, University La Sapienza
Bioinformatics and Genomics Unit, Turin
Slide 2: Structure of the thesis

Aim of the thesis: understanding the mechanisms underlying the modulation of transcript expression.
- Part 1: Transcript expression clustering. I have investigated the efficacy of the Kohonen algorithm.
- Part 2: Promoter structure analysis. I have addressed the problem of motif detection in putative promoter regions.
Slide 3: WHAT DOES GENE CLUSTERING MEAN?

- It means dividing the interval to which the expression levels of the genes belong into an optimal partition.
- The partition is optimal if the associated global classification error E is minimal.
Slide 4: CLASSIFICATION

- Let I be a partition of the interval (0, A) into N disjoint intervals I_i.
- Let w_i, i = 1, ..., N, be the centers of those intervals.
- A gene with expression level x is classified as w_i if x ∈ I_i, and its classification error is |x − w_i|.
- The global classification error is

\[ E = \sum_{i=1}^{N} \sum_{x \in I_i} |x - w_i| . \]
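For uniformly distributed data the optimal partition has a closed form (a standard quantization fact, added here because slide 13 compares against these centers): with N atoms on (0,1), the atoms are equal intervals and the centers are their midpoints.

```latex
% Optimal N-atom partition of (0,1) for uniform data (standard result):
I_i = \left( \tfrac{i-1}{N}, \tfrac{i}{N} \right),
\qquad
w_i = \frac{2i-1}{2N}, \qquad i = 1, \ldots, N .
```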
Slide 5: KOHONEN NETWORK

- It is an artificial neural network with a single layer of neurons.
- Each neuron i is associated with a weight w_i, the center of the atom I_i.
- The number of neurons equals the number of atoms.
- When an input pattern x is presented to the network, it is mapped to the neuron with the closest weight. The weights change during the learning process and tend to values determined by the distribution of the input data.
Slide 6: LEARNING OF THE KOHONEN NETWORK

- A data point x is presented to the network and all the differences |w_i(0) − x| are computed.
- The winner neuron v is the neuron with the minimal difference |w_v(0) − x|.
- The weight of this neuron is changed; in some cases the weights of the neighboring neurons are changed as well.
- This procedure is repeated with another input.
- At the end of the process the set of input data is partitioned into disjoint sets (the clusters), and the weights associated with the neurons are the centers of the partition's groups (the fixed values to which the weights converge).
Slide 7: UPDATE RULE

- Each weight is updated by

\[ w_i(n+1) = w_i(n) + \varepsilon(n)\, G(i,v)\, \big( x(n+1) - w_i(n) \big), \]

where 0 < ε(n) < 1 and ε(n) ≥ ε(n+1).
- ε(n) is called the learning parameter and is basic for the convergence of the algorithm.
- G(i,v) is called the neighborhood function of the winner and determines the width of the activation area around the winner.
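A minimal sketch of the one-dimensional learning loop of slides 5-7, assuming a Gaussian neighborhood and a 1/n-type decay for the learning parameter (the deck's exact schedules were lost in conversion):

```python
import numpy as np

def kohonen_1d(data, n_neurons, n_steps, eps0=0.5, sigma0=1.0, seed=None):
    """One-dimensional Kohonen learning (slides 5-7).

    Assumed schedules (the deck's exact choices were lost in conversion):
    eps(n) = eps0 / (1 + n), decreasing as required on slide 7, and a
    Gaussian neighborhood whose width shrinks from sigma0 towards 0.
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(data.min(), data.max(), n_neurons)  # initial weights
    idx = np.arange(n_neurons)
    for n in range(n_steps):
        x = data[rng.integers(len(data))]       # present one input pattern
        v = np.argmin(np.abs(w - x))            # winner: closest weight
        eps = eps0 / (1.0 + n)                  # learning parameter eps(n)
        sigma = max(sigma0 * (1.0 - n / n_steps), 1e-2)
        G = np.exp(-((idx - v) ** 2) / (2.0 * sigma ** 2))
        w += eps * G * (x - w)                  # update rule of slide 7
    return np.sort(w)

# Example: uniform data on (0, 1) clustered with 4 neurons.
data = np.random.default_rng(0).uniform(0.0, 1.0, 4000)
print(kohonen_1d(data, n_neurons=4, n_steps=20000, seed=1))
# The sorted weights should approximate the optimal centers 1/8, 3/8, 5/8, 7/8.
```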
Slide 8: NEIGHBORHOOD FUNCTION (1/2)

[Formula lost in conversion.]

Slide 9: NEIGHBORHOOD FUNCTION (2/2)

Another choice is: [formula lost in conversion.]
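The two formulas did not survive the conversion. Given that the later slides vary a width σ and a radius h, plausible reconstructions (assumptions, not the deck's own formulas) are the two standard choices:

```latex
% Two standard neighborhood functions (assumed reconstructions; the
% slide's own formulas were lost). \sigma and h are the parameters
% referred to on slides 12-13.
G(i,v) = \exp\!\left(-\frac{(i-v)^2}{2\sigma^2}\right)
  \quad \text{(Gaussian of width } \sigma\text{)}
\qquad
G(i,v) = \mathbf{1}_{\{|i-v| \le h\}}
  \quad \text{(hard threshold of radius } h\text{)}
```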
Slide 10: THE ORDER

- Recall that, for a one-dimensional configuration of weights, order means that the weights are monotone: w_1 < w_2 < ... < w_N (or the reverse).
- The Kohonen network has the ordering property: if the Kohonen learning algorithm applied to a one-dimensional configuration of weights converges, the configuration orders itself at a certain step of the process, and the same order is preserved at each subsequent step of the algorithm.
Slide 11: THE PARAMETER ε(n)

The convergence of the Kohonen algorithm strongly depends on the rate of decay of the learning parameter ε(n). I have seen numerically that, depending on the choice of ε(n):
- there is no convergence (neither in mean nor almost everywhere);
- there is convergence in mean (but not almost everywhere);
- there is convergence almost everywhere.
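The concrete schedules shown on this slide were lost in conversion. For context, a standard sufficient condition on ε(n) from stochastic approximation theory, consistent with the three regimes above (an assumption, not necessarily the deck's exact statement), is:

```latex
% Robbins-Monro-type conditions for almost-everywhere convergence
% (standard stochastic-approximation result, stated here as context):
\sum_{n=1}^{\infty} \varepsilon(n) = \infty ,
\qquad
\sum_{n=1}^{\infty} \varepsilon(n)^2 < \infty ,
\qquad \text{e.g. } \varepsilon(n) = \frac{c}{n}, \; 0 < c < 1 .
```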
Slide 12: Parameter of the neighborhood function

- The convergence of the Kohonen algorithm also depends on the values of the parameters of the neighborhood function.
Slide 13: Numerical analysis (1/4)

- I ran the algorithm 1000 times.
- I used sets of uniformly distributed data in (0,1) containing 4000, 10000, 20000, 30000, 60000, 120000, 150000, and 250000 elements.
- The procedure was carried out for all the mentioned choices of ε(n) and for both neighborhood parameters σ and h.
- This gave 1000 samples of the limit values of the weights at the end of each run.
- Result: the mean of these samples converged to the centers of the optimal partition of the interval (0,1), for the previously mentioned ε(n) and for both σ and h.
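A minimal sketch of this experiment, reusing the `kohonen_1d` helper from the earlier sketch (its schedules are assumptions); for uniform data on (0,1) with N atoms the optimal centers are (2i − 1)/(2N):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                       # number of neurons / atoms
limits = []
for run in range(1000):                     # 1000 independent runs
    data = rng.uniform(0.0, 1.0, 4000)      # smallest data-set size in the list
    limits.append(kohonen_1d(data, n_neurons=N, n_steps=4000, seed=run))
limits = np.array(limits)

optimal = (2 * np.arange(1, N + 1) - 1) / (2 * N)   # 1/8, 3/8, 5/8, 7/8
print("mean limit weights:", limits.mean(axis=0).round(4))
print("optimal centers:   ", optimal)
print("average error:     ", np.abs(limits - optimal).mean())
```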
Slide 14: Numerical analysis (2/4)
Slide 15: Numerical analysis (3/4)

- The average error of the limit weights with respect to the exact values of the centers decreases as the number of iterations increases; using the Gaussian neighborhood (σ) the error decreases more quickly.
- Almost-everywhere convergence of the algorithm is obtained.
Slide 16: Numerical analysis (4/4)
Slide 17: APPLICATIONS

- I applied the Kohonen network to microarray data generated using a breast cancer mouse model.
- The data were derived from the paper Quaglino et al., JCI 2004.
- The authors studied the effect of HER2 DNA vaccination in halting breast cancer progression in BALB-neuT mice. A small set of genes (34) specifically associated with the transcriptional profile of the vaccinated mice was identified by hierarchical agglomerative clustering.
Slide 18: RESULTS

Using this approach I identified the 34 genes described in the paper, and I also identified a further subset of vaccination-specific genes (25) that could not be discovered using the clustering approach described in the paper by Quaglino et al.
Slide 19: Conclusion (FIRST PART)

- The Kohonen network in one dimension converges almost everywhere for appropriate learning parameters, which makes it powerful and more adaptable.
- It has the drawback that the number of clusters must be chosen in advance.
- The Kohonen algorithm in more than one dimension works well, but only a sufficient condition for its convergence has been proved.
Slide 20: Promoter structure analysis

It is fundamental to determine the most likely transcription factor binding locations on the promoter.
Slide 21: Binding site models

- Matrix representation
  - Position weight matrix
  - Energy matrix
- String-based representation
  - Consensus sequence
Slide 22: Position weight matrix (1/2)

Each element of the matrix is the number of times each nucleotide is found at each position of an alignment.

Slide 23: Position weight matrix (2/2)

- From this count matrix one derives:
  - the position-specific frequency matrix (PSFM),
  - the log-odds matrix,
  - the information content.
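A minimal sketch of these three derivations, using a hypothetical 4×4 count matrix and an assumed uniform background (toy data, not the thesis's matrices):

```python
import numpy as np

# Hypothetical count matrix: rows A, C, G, T; columns are motif positions.
counts = np.array([[ 8,  0,  1,  9],
                   [ 1,  9,  0,  0],
                   [ 0,  1,  9,  1],
                   [ 1,  0,  0,  0]], dtype=float)
n_seqs = counts.sum(axis=0)                  # aligned sequences per column

pseudo = 0.5                                 # pseudocount to avoid log(0)
psfm = (counts + pseudo) / (n_seqs + 4 * pseudo)   # position-specific frequencies

background = np.full(4, 0.25)                # assumed uniform background
log_odds = np.log2(psfm / background[:, None])     # log-odds matrix

# Information content per column (in bits), relative to the background.
ic = (psfm * log_odds).sum(axis=0)
print("PSFM:\n", psfm.round(3))
print("log-odds:\n", log_odds.round(3))
print("information content per position:", ic.round(3))
```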
Slide 24: Searching for binding sites

- De novo methods: novel motifs which are enriched are found.
- Scanning methods: using a given motif model, a genome sequence is scanned to find further motif matches.
Slide 25: Motif detection

To find instances of a given motif I used a PSFM together with a higher-order background model. A higher-order background model means that the background DNA sequence is generated by a Markov model of order m.

I computed the score of a candidate site s = s_1 ... s_L as

\[ S(s) = \sum_{j=1}^{L} \log \frac{P_{\mathrm{motif}}(s_j \mid j)}{P_{\mathrm{bg}}(s_j \mid s_{j-m}, \ldots, s_{j-1})}, \]

where P_motif(s_j | j) is the PSFM frequency of nucleotide s_j at motif position j and P_bg is the order-m Markov background probability.

Observation: this score is the log-likelihood ratio of observing the data under the motif model versus the model of the DNA background. Therefore a high score at a specific position suggests a high likelihood of the presence of the motif at that particular location of the DNA sequence.
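A minimal sketch of this score, using a toy PSFM and, for simplicity, an order-1 (m = 1) Markov background with hypothetical probabilities:

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

# Toy PSFM (columns = motif positions) and an assumed order-1 background.
psfm = np.array([[0.70, 0.05, 0.10, 0.80],
                 [0.10, 0.80, 0.05, 0.05],
                 [0.05, 0.10, 0.80, 0.10],
                 [0.15, 0.05, 0.05, 0.05]])
bg_init = np.full(4, 0.25)                  # P(first base)
bg_trans = np.full((4, 4), 0.25)            # P(base | previous base); hypothetical

def score(window):
    """Log-likelihood ratio: motif model vs order-1 Markov background."""
    s = 0.0
    for j, base in enumerate(window):
        b = BASES[base]
        p_motif = psfm[b, j]
        if j == 0:
            p_bg = bg_init[b]
        else:
            p_bg = bg_trans[BASES[window[j - 1]], b]
        s += np.log(p_motif / p_bg)
    return s

def scan(sequence, width):
    """Score every window; high scores suggest motif instances."""
    return [(i, score(sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

print(scan("TTACGATTT", 4))   # the window "ACGA" at position 2 scores highest
```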
Slide 26: Statistical significance of scores (1/2)

- Is the score unlikely to have arisen by chance? To answer this it is necessary to know the p-value of the score x (the score of the match of the motif with a given DNA sequence):

\[ p(x) = P(S \ge x \mid \text{background model}). \]

- If the p-value is very low, the motif is significantly represented in the DNA sequence.
Slide 27: Statistical significance of scores (2/2)

- P-value of the score of the motif match with a random sequence (iid model: all the positions in the sequence have the same distribution and are independent of each other).
- It is a good approach to reduce the number of false positive matches of the motif.
- It does not, by itself, give the right information on the over-representation of the motif in a given DNA sequence.
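A minimal sketch of the corresponding empirical p-value under the iid background model, reusing the `score` and `scan` helpers from the previous sketch (all settings hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_score_random(length, width, n_samples=1000):
    """Maximum scan score over random iid sequences (uniform letters)."""
    maxima = []
    for _ in range(n_samples):
        seq = "".join(rng.choice(list("ACGT"), size=length))
        maxima.append(max(s for _, s in scan(seq, width)))
    return np.array(maxima)

observed = max(s for _, s in scan("TTACGATTT", 4))
null = max_score_random(length=9, width=4)
p_value = (null >= observed).mean()          # empirical P(S >= x | background)
print(f"observed max score {observed:.2f}, empirical p-value {p_value:.3f}")
```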
Slide 28: Extreme value theory

- Extreme value theory (EVT) provides a framework to formalize the study of the behaviour in the tails of a distribution. EVT allows us to use extreme observations to measure the density in the tail.
- Peaks over threshold (POT): analysis of large observations which exceed a high threshold u.
- The problem is to estimate the distribution function F_u:

\[ F_u(y) = P(X - u \le y \mid X > u) = P(Y \le y \mid X > u), \qquad 0 \le y \le x_F - u, \]

where Y = X − u and x_F is the right endpoint of the distribution.
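By the Pickands–Balkema–de Haan theorem, F_u is well approximated by a generalized Pareto distribution for high thresholds u. A minimal POT sketch using scipy on synthetic data (a stand-in for the scan scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.gumbel(loc=0.0, scale=1.0, size=10000)   # synthetic scores

u = np.quantile(scores, 0.95)            # high threshold
excesses = scores[scores > u] - u        # peaks over threshold, Y = X - u

# Fit a generalized Pareto distribution to the excesses (location fixed at 0).
shape, loc, scale = stats.genpareto.fit(excesses, floc=0)
print(f"threshold u = {u:.3f}, GPD shape = {shape:.3f}, scale = {scale:.3f}")

# Tail probability estimate: P(X > x) ~ P(X > u) * (1 - GPD(x - u)).
x = u + 2.0
p_tail = (scores > u).mean() * stats.genpareto.sf(x - u, shape, loc=0, scale=scale)
print(f"P(X > {x:.2f}) ~ {p_tail:.5f}")
```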
Slide 29: Applications

- I applied the methods described above to a set of promoter regions of human DNA (2836 sequences), each containing at least one regulatory estrogen responsive element (ERE).
Slide 30: Preliminary analysis (1/2)

- Histograms, box-plots, and tests for normality were produced.
- The data set is not normally distributed, but its distribution tends towards the Gaussian distribution.
Slide 31: Preliminary analysis (2/2)

- Analysis on random sequences (all positions have the same letter distribution and are independent of each other).
- The distribution of the sum of n iid variables supported on (0, a) is the n-fold convolution of the single-position distribution; by the central limit theorem it approaches a Gaussian. [The explicit formula on the slide was lost in conversion.]
- I checked (see the sketch below):
  - normality,
  - the type of the extreme value distribution,
  - the Von Mises condition.
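A minimal sketch of the extreme-value-type check using scipy's genextreme (in its parametrization c = 0 is Gumbel, c < 0 is Fréchet-type, c > 0 is Weibull-type); the block maxima here are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic block maxima: max score over each of 500 random sequences.
block_maxima = rng.gumbel(loc=3.0, scale=0.5, size=500)

# A normality check on the underlying scores would use e.g. stats.normaltest.
c, loc, scale = stats.genextreme.fit(block_maxima)

# scipy uses c = -xi: c = 0 Gumbel, c < 0 Frechet-type (heavy tail),
# c > 0 Weibull-type (bounded tail).
if abs(c) < 0.05:
    kind = "Gumbel"
elif c < 0:
    kind = "Frechet-type"
else:
    kind = "Weibull-type"
print(f"fitted shape c = {c:.3f} -> {kind}")
```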
Slide 32: Results

- Focusing on detecting one ERE at a time:
  - the distribution of the maximum of the scores over random sequences is almost always Weibull-like (only 2/100 are not);
  - the distribution of the maximum of the scores over DNA sequences is mainly Weibull-like; 413/2836 score sequences are Fréchet-like and 13 are Gumbel-like.
- I show the results for a few genes on the following slides. These genes have been proved by ChIP to contain at least one ERE.
Slide 33: Results
Slide 34: Results
Slide 35: Results
Slide 36: Results

I applied the same procedure mentioned previously, also considering the distance between the two EREs.
- Only for some particular distances did I detect a couple of EREs.
- The position of one of the two EREs of the couple is the same position found biologically.
- A relatively high homology is conserved for the mouse ortholog in the 3 genes.
- Only for the DVL1 gene is the high homology conserved across the three orthologs (mouse, rat, dog).
Slide 37: Conclusion (1/2)

- I have improved the detection of motifs by:
  - using a higher-order background model;
  - computing scores for two or three motifs;
  - using POT on both DNA and random sequences;
  - testing the p-value.
- The model fits well: it is able to detect those genes validated by ChIP and considered in the literature to contain at least one ERE.
Slide 38: Conclusion (2/2)

- Problems
  - When analyzing more than three motifs the model produces files of critical (prohibitively large) size.
- Future work
  - Detecting different motifs.
  - Implementing new biological results in the model.
  - Implementing a single routine in R.
Slide 39: The End

Thanks to Prof. B. Tirozzi, Prof. R. Calogero, and the bioinformatics laboratory in Orbassano.