Slide 1: Developing computational methodologies for genome transcript mapping

Daniela Bianchi, Raffaele Calogero, Brunello Tirozzi
Department of Physics, University La Sapienza
Bioinformatics and Genomics Unit, Turin
Slide 2: Structure of the thesis

Aim of the thesis: understanding the mechanisms underlying the modulation of transcript expression.
- Part 1: Transcript expression clustering. I have investigated the efficacy of the Kohonen algorithm.
- Part 2: Promoter structure analysis. I have addressed the problem of motif detection in putative promoter regions.
Slide 3: WHAT DOES GENE CLUSTERING MEAN?

- It means dividing the interval to which the expression levels of the genes belong into an optimal partition.
- The partition is optimal if the associated global classification error E is minimal.
Slide 4: CLASSIFICATION

- Let I be a partition of the interval (0, A) into N disjoint intervals I_i.
- Let w_i, i = 1, ..., N, be the centers of those intervals.
- A gene with expression level x is classified as w_i if x ∈ I_i, and its classification error is |x − w_i|.
- The global classification error is

\[ E = \sum_{i=1}^{N} \sum_{x \in I_i} |x - w_i| . \]
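For uniformly distributed data the optimal partition has a closed form (a standard quantization fact, added here because slide 13 compares against these centers): with N atoms on (0,1), the atoms are equal intervals and the centers are their midpoints.

```latex
% Optimal N-atom partition of (0,1) for uniform data (standard result):
I_i = \left( \tfrac{i-1}{N}, \tfrac{i}{N} \right),
\qquad
w_i = \frac{2i-1}{2N}, \qquad i = 1, \ldots, N .
```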
Slide 5: KOHONEN NETWORK

- It is an artificial neural network with a single layer of neurons.
- Each neuron i is associated with a weight w_i, the center of the atom I_i.
- The number of neurons equals the number of atoms.
- When an input pattern x is presented to the network, it is mapped to the neuron with the closest weight. The weights change during the learning process and tend to values determined by the distribution of the input data.
Slide 6: LEARNING OF THE KOHONEN NETWORK

- A data point x is presented to the network and all the differences |w_i(0) − x| are computed.
- The winner neuron v is the neuron with the minimal difference |w_v(0) − x|.
- The weight of this neuron is changed; in some cases the weights of the neighboring neurons are changed as well.
- This procedure is repeated with another input.
- At the end of the process the set of input data is partitioned into disjoint sets (the clusters), and the weights associated with the neurons are the centers of the partition's groups (the fixed values to which the weights converge).
Slide 7: UPDATE RULE

- Each weight is updated by

\[ w_i(n+1) = w_i(n) + \varepsilon(n)\, G(i,v)\, \big( x(n+1) - w_i(n) \big), \]

where 0 < ε(n) < 1 and ε(n) ≥ ε(n+1).
- ε(n) is called the learning parameter and is basic for the convergence of the algorithm.
- G(i,v) is called the neighborhood function of the winner and determines the width of the activation area around the winner.
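A minimal sketch of the one-dimensional learning loop of slides 5-7, assuming a Gaussian neighborhood and a 1/n-type decay for the learning parameter (the deck's exact schedules were lost in conversion):

```python
import numpy as np

def kohonen_1d(data, n_neurons, n_steps, eps0=0.5, sigma0=1.0, seed=None):
    """One-dimensional Kohonen learning (slides 5-7).

    Assumed schedules (the deck's exact choices were lost in conversion):
    eps(n) = eps0 / (1 + n), decreasing as required on slide 7, and a
    Gaussian neighborhood whose width shrinks from sigma0 towards 0.
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(data.min(), data.max(), n_neurons)  # initial weights
    idx = np.arange(n_neurons)
    for n in range(n_steps):
        x = data[rng.integers(len(data))]       # present one input pattern
        v = np.argmin(np.abs(w - x))            # winner: closest weight
        eps = eps0 / (1.0 + n)                  # learning parameter eps(n)
        sigma = max(sigma0 * (1.0 - n / n_steps), 1e-2)
        G = np.exp(-((idx - v) ** 2) / (2.0 * sigma ** 2))
        w += eps * G * (x - w)                  # update rule of slide 7
    return np.sort(w)

# Example: uniform data on (0, 1) clustered with 4 neurons.
data = np.random.default_rng(0).uniform(0.0, 1.0, 4000)
print(kohonen_1d(data, n_neurons=4, n_steps=20000, seed=1))
# The sorted weights should approximate the optimal centers 1/8, 3/8, 5/8, 7/8.
```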
Slide 8: NEIGHBORHOOD FUNCTION (1/2)

[Formula lost in conversion.]

Slide 9: NEIGHBORHOOD FUNCTION (2/2)

Another choice is: [formula lost in conversion.]
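The two formulas did not survive the conversion. Given that the later slides vary a width σ and a radius h, plausible reconstructions (assumptions, not the deck's own formulas) are the two standard choices:

```latex
% Two standard neighborhood functions (assumed reconstructions; the
% slide's own formulas were lost). \sigma and h are the parameters
% referred to on slides 12-13.
G(i,v) = \exp\!\left(-\frac{(i-v)^2}{2\sigma^2}\right)
  \quad \text{(Gaussian of width } \sigma\text{)}
\qquad
G(i,v) = \mathbf{1}_{\{|i-v| \le h\}}
  \quad \text{(hard threshold of radius } h\text{)}
```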
Slide 10: THE ORDER

- Recall that, for a one-dimensional configuration of weights, order means that the weights are monotone: w_1 < w_2 < ... < w_N (or the reverse).
- The Kohonen network has the ordering property: if the Kohonen learning algorithm applied to a one-dimensional configuration of weights converges, the configuration orders itself at a certain step of the process, and the same order is preserved at each subsequent step of the algorithm.
Slide 11: THE PARAMETER ε(n)

The convergence of the Kohonen algorithm strongly depends on the rate of decay of the learning parameter ε(n). I have seen numerically that, depending on the choice of ε(n):
- there is no convergence (neither in mean nor almost everywhere);
- there is convergence in mean (but not almost everywhere);
- there is convergence almost everywhere.
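The concrete schedules shown on this slide were lost in conversion. For context, a standard sufficient condition on ε(n) from stochastic approximation theory, consistent with the three regimes above (an assumption, not necessarily the deck's exact statement), is:

```latex
% Robbins-Monro-type conditions for almost-everywhere convergence
% (standard stochastic-approximation result, stated here as context):
\sum_{n=1}^{\infty} \varepsilon(n) = \infty ,
\qquad
\sum_{n=1}^{\infty} \varepsilon(n)^2 < \infty ,
\qquad \text{e.g. } \varepsilon(n) = \frac{c}{n}, \; 0 < c < 1 .
```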
Slide 12: Parameter of the neighborhood function

- The convergence of the Kohonen algorithm also depends on the values of the parameters of the neighborhood function.
Slide 13: Numerical analysis (1/4)

- I ran the algorithm 1000 times.
- I used sets of uniformly distributed data in (0,1) containing 4000, 10000, 20000, 30000, 60000, 120000, 150000, and 250000 elements.
- The procedure was carried out for all the mentioned choices of ε(n) and for both neighborhood parameters σ and h.
- This gave 1000 samples of the limit values of the weights at the end of each run.
- Result: the mean of these samples converged to the centers of the optimal partition of the interval (0,1), for the previously mentioned ε(n) and for both σ and h.
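A minimal sketch of this experiment, reusing the `kohonen_1d` helper from the earlier sketch (its schedules are assumptions); for uniform data on (0,1) with N atoms the optimal centers are (2i − 1)/(2N):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                       # number of neurons / atoms
limits = []
for run in range(1000):                     # 1000 independent runs
    data = rng.uniform(0.0, 1.0, 4000)      # smallest data-set size in the list
    limits.append(kohonen_1d(data, n_neurons=N, n_steps=4000, seed=run))
limits = np.array(limits)

optimal = (2 * np.arange(1, N + 1) - 1) / (2 * N)   # 1/8, 3/8, 5/8, 7/8
print("mean limit weights:", limits.mean(axis=0).round(4))
print("optimal centers:   ", optimal)
print("average error:     ", np.abs(limits - optimal).mean())
```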
Slide 14: Numerical analysis (2/4)
Slide 15: Numerical analysis (3/4)

- The average error of the limit weights with respect to the exact values of the centers decreases as the number of iterations increases; using the Gaussian neighborhood (σ) the error decreases more quickly.
- Almost-everywhere convergence of the algorithm is obtained.
Slide 16: Numerical analysis (4/4)
Slide 17: APPLICATIONS

- I applied the Kohonen network to microarray data generated using a breast cancer mouse model.
- The data were derived from the paper Quaglino et al., JCI 2004.
- The authors studied the effect of HER2 DNA vaccination in halting breast cancer progression in BALB-neuT mice. A small set of genes (34) specifically associated with the transcriptional profile of the vaccinated mice was identified by hierarchical agglomerative clustering.
Slide 18: RESULTS

Using this approach I identified the 34 genes described in the paper, and I also identified a further subset of vaccination-specific genes (25) that could not be discovered using the clustering approach described in the paper by Quaglino et al.
Slide 19: Conclusion (FIRST PART)

- The Kohonen network in one dimension converges almost everywhere for appropriate learning parameters, which makes it powerful and more adaptable.
- It has the drawback that the number of clusters must be chosen in advance.
- The Kohonen algorithm in more than one dimension works well, but only a sufficient condition for its convergence has been proved.
Slide 20: Promoter structure analysis

It is fundamental to determine the most likely transcription factor binding locations on the promoter.
Slide 21: Binding site models

- Matrix representation
  - Position weight matrix
  - Energy matrix
- String-based representation
  - Consensus sequence
Slide 22: Position weight matrix (1/2)

Each element of the matrix is the number of times each nucleotide is found at each position of an alignment.

Slide 23: Position weight matrix (2/2)

- From this count matrix one derives:
  - the position-specific frequency matrix (PSFM),
  - the log-odds matrix,
  - the information content.
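A minimal sketch of these three derivations, using a hypothetical 4×4 count matrix and an assumed uniform background (toy data, not the thesis's matrices):

```python
import numpy as np

# Hypothetical count matrix: rows A, C, G, T; columns are motif positions.
counts = np.array([[ 8,  0,  1,  9],
                   [ 1,  9,  0,  0],
                   [ 0,  1,  9,  1],
                   [ 1,  0,  0,  0]], dtype=float)
n_seqs = counts.sum(axis=0)                  # aligned sequences per column

pseudo = 0.5                                 # pseudocount to avoid log(0)
psfm = (counts + pseudo) / (n_seqs + 4 * pseudo)   # position-specific frequencies

background = np.full(4, 0.25)                # assumed uniform background
log_odds = np.log2(psfm / background[:, None])     # log-odds matrix

# Information content per column (in bits), relative to the background.
ic = (psfm * log_odds).sum(axis=0)
print("PSFM:\n", psfm.round(3))
print("log-odds:\n", log_odds.round(3))
print("information content per position:", ic.round(3))
```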
Slide 24: Searching for binding sites

- De novo methods: novel motifs which are enriched are found.
- Scanning methods: using a given motif model, a genome sequence is scanned to find further motif matches.
Slide 25: Motif detection

To find instances of a given motif I used a PSFM together with a higher-order background model. A higher-order background model means that the background DNA sequence is generated by a Markov model of order m.

I computed the score of a candidate site s = s_1 ... s_L as

\[ S(s) = \sum_{j=1}^{L} \log \frac{P_{\mathrm{motif}}(s_j \mid j)}{P_{\mathrm{bg}}(s_j \mid s_{j-m}, \ldots, s_{j-1})}, \]

where P_motif(s_j | j) is the PSFM frequency of nucleotide s_j at motif position j and P_bg is the order-m Markov background probability.

Observation: this score is the log-likelihood ratio of observing the data under the motif model versus the model of the DNA background. Therefore a high score at a specific position suggests a high likelihood of the presence of the motif at that particular location of the DNA sequence.
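A minimal sketch of this score, using a toy PSFM and, for simplicity, an order-1 (m = 1) Markov background with hypothetical probabilities:

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

# Toy PSFM (columns = motif positions) and an assumed order-1 background.
psfm = np.array([[0.70, 0.05, 0.10, 0.80],
                 [0.10, 0.80, 0.05, 0.05],
                 [0.05, 0.10, 0.80, 0.10],
                 [0.15, 0.05, 0.05, 0.05]])
bg_init = np.full(4, 0.25)                  # P(first base)
bg_trans = np.full((4, 4), 0.25)            # P(base | previous base); hypothetical

def score(window):
    """Log-likelihood ratio: motif model vs order-1 Markov background."""
    s = 0.0
    for j, base in enumerate(window):
        b = BASES[base]
        p_motif = psfm[b, j]
        if j == 0:
            p_bg = bg_init[b]
        else:
            p_bg = bg_trans[BASES[window[j - 1]], b]
        s += np.log(p_motif / p_bg)
    return s

def scan(sequence, width):
    """Score every window; high scores suggest motif instances."""
    return [(i, score(sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

print(scan("TTACGATTT", 4))   # the window "ACGA" at position 2 scores highest
```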
Slide 26: Statistical significance of scores (1/2)

- Is the score unlikely to have arisen by chance? To answer this it is necessary to know the p-value of the score x (the score of the match of the motif with a given DNA sequence):

\[ p(x) = P(S \ge x \mid \text{background model}). \]

- If the p-value is very low, the motif is significantly represented in the DNA sequence.
Slide 27: Statistical significance of scores (2/2)

- P-value of the score of the motif match with a random sequence (iid model: all the positions in the sequence have the same distribution and are independent of each other).
- It is a good approach to reduce the number of false positive matches of the motif.
- It does not, by itself, give the right information on the over-representation of the motif in a given DNA sequence.
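A minimal sketch of the corresponding empirical p-value under the iid background model, reusing the `score` and `scan` helpers from the previous sketch (all settings hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_score_random(length, width, n_samples=1000):
    """Maximum scan score over random iid sequences (uniform letters)."""
    maxima = []
    for _ in range(n_samples):
        seq = "".join(rng.choice(list("ACGT"), size=length))
        maxima.append(max(s for _, s in scan(seq, width)))
    return np.array(maxima)

observed = max(s for _, s in scan("TTACGATTT", 4))
null = max_score_random(length=9, width=4)
p_value = (null >= observed).mean()          # empirical P(S >= x | background)
print(f"observed max score {observed:.2f}, empirical p-value {p_value:.3f}")
```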
Slide 28: Extreme value theory

- Extreme value theory (EVT) provides a framework to formalize the study of the behaviour in the tails of a distribution. EVT allows us to use extreme observations to measure the density in the tail.
- Peaks over threshold (POT): analysis of large observations which exceed a high threshold u.
- The problem is to estimate the distribution function F_u:

\[ F_u(y) = P(X - u \le y \mid X > u) = P(Y \le y \mid X > u), \qquad 0 \le y \le x_F - u, \]

where Y = X − u and x_F is the right endpoint of the distribution.
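By the Pickands–Balkema–de Haan theorem, F_u is well approximated by a generalized Pareto distribution for high thresholds u. A minimal POT sketch using scipy on synthetic data (a stand-in for the scan scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.gumbel(loc=0.0, scale=1.0, size=10000)   # synthetic scores

u = np.quantile(scores, 0.95)            # high threshold
excesses = scores[scores > u] - u        # peaks over threshold, Y = X - u

# Fit a generalized Pareto distribution to the excesses (location fixed at 0).
shape, loc, scale = stats.genpareto.fit(excesses, floc=0)
print(f"threshold u = {u:.3f}, GPD shape = {shape:.3f}, scale = {scale:.3f}")

# Tail probability estimate: P(X > x) ~ P(X > u) * (1 - GPD(x - u)).
x = u + 2.0
p_tail = (scores > u).mean() * stats.genpareto.sf(x - u, shape, loc=0, scale=scale)
print(f"P(X > {x:.2f}) ~ {p_tail:.5f}")
```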
Slide 29: Applications

- I applied the methods described above to a set of promoter regions of human DNA (2836 sequences), each containing at least one regulatory estrogen responsive element (ERE).
Slide 30: Preliminary analysis (1/2)

- Histograms, box-plots, and tests for normality were produced.
- The data set is not normally distributed, but its distribution tends towards the Gaussian distribution.
Slide 31: Preliminary analysis (2/2)

- Analysis on random sequences (all positions have the same letter distribution and are independent of each other).
- The distribution of the sum of n iid variables supported on (0, a) is the n-fold convolution of the single-position distribution; by the central limit theorem it approaches a Gaussian. [The explicit formula on the slide was lost in conversion.]
- I checked (see the sketch below):
  - normality,
  - the type of the extreme value distribution,
  - the Von Mises condition.
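A minimal sketch of the extreme-value-type check using scipy's genextreme (in its parametrization c = 0 is Gumbel, c < 0 is Fréchet-type, c > 0 is Weibull-type); the block maxima here are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic block maxima: max score over each of 500 random sequences.
block_maxima = rng.gumbel(loc=3.0, scale=0.5, size=500)

# A normality check on the underlying scores would use e.g. stats.normaltest.
c, loc, scale = stats.genextreme.fit(block_maxima)

# scipy uses c = -xi: c = 0 Gumbel, c < 0 Frechet-type (heavy tail),
# c > 0 Weibull-type (bounded tail).
if abs(c) < 0.05:
    kind = "Gumbel"
elif c < 0:
    kind = "Frechet-type"
else:
    kind = "Weibull-type"
print(f"fitted shape c = {c:.3f} -> {kind}")
```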
Slide 32: Results

- Focusing on detecting one ERE at a time:
  - the distribution of the maximum of the scores over random sequences is almost always Weibull-like (only 2/100 are not);
  - the distribution of the maximum of the scores over DNA sequences is mainly Weibull-like; 413/2836 score sequences are Fréchet-like and 13 are Gumbel-like.
- I show the results for a few genes on the following slides. These genes have been proved by ChIP to contain at least one ERE.
Slide 33: Results
Slide 34: Results
Slide 35: Results
Slide 36: Results

I applied the same procedure mentioned previously, also considering the distance between the two EREs.
- Only for some particular distances did I detect a couple of EREs.
- The position of one of the two EREs of the couple is the same position found biologically.
- A relatively high homology is conserved for the mouse ortholog in the 3 genes.
- Only for the DVL1 gene is the high homology conserved across the three orthologs (mouse, rat, dog).
Slide 37: Conclusion (1/2)

- I have improved the detection of motifs by:
  - using a higher-order background model;
  - computing scores for two or three motifs;
  - using POT on both DNA and random sequences;
  - testing the p-value.
- The model fits well: it is able to detect those genes validated by ChIP and considered in the literature to contain at least one ERE.
Slide 38: Conclusion (2/2)

- Problems
  - When analyzing more than three motifs the model produces files of critical (prohibitively large) size.
- Future work
  - Detecting different motifs.
  - Implementing new biological results in the model.
  - Implementing a single routine in R.
Slide 39: The End

Thanks to Prof. B. Tirozzi, Prof. R. Calogero, and the bioinformatics laboratory in Orbassano.