1
  • Developing computational methodologies
  • for genome transcript mapping

Daniela Bianchi, Raffaele Calogero, Brunello
Tirozzi
Department of Physics, University La Sapienza
Bioinformatics and Genomics Unit, Turin
2
Structure of the thesis
Aim of the thesis
Understanding the mechanisms underlying
transcription expression modulation
I have investigated the efficacy of the Kohonen
algorithm
  • Part 1.
  • Transcription expression clustering
  • Part 2.
  • Promoter structure analysis

I have addressed the problem of motif detection
in putative promoter regions
3
WHAT DOES GENE CLUSTERING MEAN?
  • It means dividing the interval, to which the
    expression levels of the genes belong, into an
    optimal partition.

The partition is optimal if the associated global
classification error E is minimal
4
CLASSIFICATION
  • Let I be a partition of the interval (0,A) into N
    disjoint intervals I_i.
  • Let ω_i, i = 1,…,N, be the centers of these
    intervals.
  • Then a given gene with expression level x is
    classified as ω_i if x ∈ I_i.
  • The classification error is

|x − ω_i|
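As a minimal numerical sketch of this classification rule (the toy data, the two centers and the convention of taking the global error E as the sum of the errors |x − ω_i| are illustrative assumptions, not values from the thesis):

```python
def classify_and_error(data, centers):
    """Assign each expression level x to the interval whose center
    w_i is nearest, and accumulate the global error E = sum |x - w_i|."""
    error = 0.0
    labels = []
    for x in data:
        # nearest center = interval I_i that contains x
        i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        labels.append(i)
        error += abs(x - centers[i])
    return labels, error

# Toy example on (0, 1) with N = 2 intervals centered at 0.25 and 0.75
labels, E = classify_and_error([0.1, 0.2, 0.8, 0.9], [0.25, 0.75])
```

The partition is optimal when no other choice of centers gives a smaller E.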
5
KOHONEN NETWORK
  • It is an artificial neural network with a single
    layer of neurons.
  • Each neuron is associated with a weight ω_i, the
    center of the atom I_i.
  • The number of neurons is equal to the number of
    atoms.
  • When an input pattern x is presented to the
    network, the input is mapped to the neuron with
    the closest weight. The weights change during
    the learning process and tend to the values
    determined by the distribution of the input data.

6
LEARNING OF KOHONEN NETWORK
  • A data point x is presented to the network and
    all the differences
  • |ω_i(0) − x|
  • are computed.
  • The winner neuron is chosen: it is the neuron v
    with minimal difference
  • |ω_v(0) − x|
  • The weight of this neuron is changed, or in some
    cases the weights of the neighboring neurons are
    changed as well.
  • This procedure is repeated with another input.
  • At the end of the process the set of input data is
    partitioned into disjoint sets (the clusters) and
    the weights associated with the neurons are the
    centers of the partition's groups (the fixed
    values to which the weights converge).

7
UPDATE RULE
  • Each weight is updated by

ω_i(n+1) = ω_i(n) + ε(n) G(i,v) ( x(n+1) − ω_i(n) )
where 0 < ε(n) < 1 and ε(n) ≥ ε(n+1).
ε(n) is called the learning parameter and it is
basic for the convergence of the algorithm. G(i,v) is
called the neighborhood function of the winner
and determines the width of the activation area
near the winner.
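The update rule can be sketched as follows. The Gaussian neighborhood G and the 1/n decay of ε(n) are assumed, common choices made for illustration; the exact functions used in the thesis are not shown in this transcript:

```python
import math
import random

def kohonen_1d(data, n_neurons, eps0=0.5, sigma=1.0, seed=0):
    """One-dimensional Kohonen learning: present inputs one at a time,
    pick the winner v, and move every weight toward the input by
    eps(n) * G(i, v).  Gaussian G and eps(n) = eps0 / n are assumed
    choices, not the thesis' exact functions."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n_neurons)]
    for n, x in enumerate(data, start=1):
        # winner v: the neuron whose weight is closest to the input x
        v = min(range(n_neurons), key=lambda i: abs(weights[i] - x))
        eps = eps0 / n                      # decreasing, 0 < eps(n) < 1
        for i in range(n_neurons):
            g = math.exp(-((i - v) ** 2) / (2.0 * sigma ** 2))
            weights[i] += eps * g * (x - weights[i])
    return weights

rng = random.Random(1)
data = [rng.random() for _ in range(20000)]   # uniform inputs in (0, 1)
w = kohonen_1d(data, n_neurons=4)
```

With uniformly distributed inputs the limit weights approach the centers of the optimal partition of (0,1), as discussed in the numerical analysis below.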
8
NEIGHBORHOOD FUNCTION (1/2)
  • A convenient choice is

9
NEIGHBORHOOD FUNCTION (2/2)
Another choice is
10
THE ORDER
  • Recall that an ordered configuration in one
    dimension means ω_1 < ω_2 < … < ω_N (or the
    reverse ordering).

The Kohonen network has the ordering property:
if the Kohonen learning algorithm applied to a
one-dimensional configuration of weights converges,
the configuration orders itself at a certain step
of the process, and the same order is preserved at
each subsequent step of the algorithm.
11
THE PARAMETER ε(n)
The convergence of the Kohonen algorithm strongly
depends on the rate of decay of the learning
parameter ε(n).
I have seen numerically that there is no
convergence (neither in mean nor almost everywhere)
I have seen numerically that there is
convergence in mean (but not almost everywhere)
I have seen numerically that there is
convergence almost everywhere
12
Parameter of the neighborhood function
  • The convergence of the Kohonen algorithm also
    depends on the values of the parameters of the
    neighborhood function.

13
Numerical analysis (1/4)
  • I ran the algorithm 1000 times.
  • I used sets of uniformly distributed data in
    (0,1) containing 4000, 10000, 20000, 30000,
    60000, 120000, 150000 and 250000 elements.
  • The procedure was carried out for all the
    mentioned choices of ε(n) and for both
    neighborhood parameters σ and h.

This gave 1000 cases of limit values of the
weights at the end of the algorithm runs.
Result: the mean value of these cases converged
to the centers of the optimal partition of the
interval (0,1) for all the previously mentioned
ε(n) and both σ and h.
14
Numerical analysis (2/4)
15
Numerical analysis (3/4)
The average error of the limit weights with respect
to the exact values of the centers decreases as
the number of iterations increases; using σ the
error decreases more quickly.
Almost-everywhere convergence of the algorithm is
obtained.
16
Numerical analysis (4/4)
17
APPLICATIONS
  • I applied the Kohonen network to microarray data
    generated using a breast cancer mouse model.
  • Data were derived from the paper Quaglino et al.,
    JCI 2004.
  • The authors studied the effect of HER2 DNA
    vaccination in halting breast cancer progression
    in BALB-neuT mice. A small set of genes (34),
    associated only with the transcriptional profile
    of the vaccinated mice, was identified by
    hierarchical agglomerative clustering.

18
RESULTS
Using this approach I identified the 34 genes
described in the paper, and I also managed to
identify a subset of other vaccination-specific
genes (25) that could not be discovered using the
clustering approach described in the paper by
Quaglino et al.

19
Conclusion
(FIRST PART)
  • The Kohonen network in one dimension converges
    almost everywhere for appropriate learning
    parameters, and that makes it powerful and more
    adaptable.
  • It has the drawback that the number of clusters
    must be chosen in advance.
  • The Kohonen algorithm in more than one dimension
    works well, but only a proof of sufficient
    conditions for convergence exists.

20
Promoter structure analysis
It is essential to determine the most likely
transcription factor binding locations on the
promoter.
21
Binding site models
  • Matrix representation
  • Position weight matrix
  • Energy matrix
  • String-based representation
  • Consensus sequence

22
Position weight matrix (1/2)
Every element of the matrix is the number of times
each nucleotide is found at each position of an
alignment.
23
Position weight matrix (2/2)
  • From this matrix
  • Position specific frequency matrix (PSFM)
  • Log-odds matrix
  • Information content
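The three derived objects listed above can be computed from a count matrix as follows (the alignment is a made-up toy, and the uniform 0.25 background frequency is an assumption):

```python
import math

# Toy alignment of binding sites (illustrative, not from the thesis)
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
alphabet = "ACGT"
L = len(sites[0])

# Count matrix: occurrences of each nucleotide at each position
counts = [{a: 0 for a in alphabet} for _ in range(L)]
for s in sites:
    for pos, a in enumerate(s):
        counts[pos][a] += 1

n = len(sites)
# PSFM: relative frequency of each nucleotide at each position
psfm = [{a: counts[pos][a] / n for a in alphabet} for pos in range(L)]

bg = 0.25  # uniform background frequency (an assumption)
# Log-odds matrix: log2(f / bg); -inf where the frequency is zero
# (in practice a pseudocount would be added to avoid this)
log_odds = [{a: math.log2(psfm[pos][a] / bg) if psfm[pos][a] > 0
             else float("-inf") for a in alphabet} for pos in range(L)]

# Information content per position: 2 + sum_a f_a * log2(f_a)
ic = [2 + sum(f * math.log2(f) for f in psfm[pos].values() if f > 0)
      for pos in range(L)]
```

A fully conserved position (all sites agree) reaches the maximum information content of 2 bits.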

24
Searching binding sites
  • De novo method
  • Novel motifs which are enriched are discovered.
  • Scanning method
  • Using a given motif model, a genome sequence is
    scanned to find further motif matches.

25
Motif detection
To find instances of a given motif I have used a
PSFM and a higher-order background model.
A higher-order background model means that the DNA
sequence is assumed to be generated by a Markov
model of order m.
Observation: this type of score is the
log-likelihood ratio of observing the data given a
motif model versus a model of the DNA background.
Therefore a high score at a specific position
suggests a high likelihood of the presence of the
motif at that particular location of the DNA
sequence.
I have computed the score (in its standard
log-likelihood-ratio form) as

score(x) = Σ_j log [ P(x_j | motif, position j) /
                     P(x_j | background, x_{j−m} … x_{j−1}) ]

where x_j is the nucleotide at position j of the
scanned window.
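A sketch of such a log-likelihood-ratio score. The PSFM values and the uniform order-1 background below are illustrative assumptions, not the thesis' matrices:

```python
import math

# P(base | motif, position): a toy 3-column PSFM
psfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

def bg_prob(prev, base):
    """Order-1 Markov background P(base | previous base); kept uniform
    here for brevity (an assumption)."""
    return 0.25

def llr_score(seq, start):
    """Log-likelihood ratio of a motif match starting at `start`."""
    score = 0.0
    for j, col in enumerate(psfm):
        base = seq[start + j]
        prev = seq[start + j - 1] if start + j > 0 else None
        score += math.log(col[base] / bg_prob(prev, base))
    return score

s = llr_score("TACGT", 1)  # scores the window "ACG"
```

Positive scores mark windows the motif model explains better than the background; negative scores mark windows better explained by the background.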
26
Statistical significance of scores (1/2)
  • Is the score unlikely to have arisen by chance?

To answer, it is necessary to know the p-value of
the score x.
P-value of the score x
(the score of the match of the motif with a given
DNA sequence)
If the p-value is very low, the motif is
significantly represented in the DNA sequence.
27
Statistical significance of scores (2/2)
P-value of the score of the motif match with a
random sequence (independent identically
distributed model, i.i.d.: all the positions in the
sequence have the same distribution and are
independent of each other).
It is a good approach to reduce the number of
false positive matches of the motif.
It does not give the right information on the
over-representation of the motif in a given DNA
sequence.
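The i.i.d. p-value can be estimated empirically: score many random sequences drawn with independent, identically distributed letters, and count how often their best score reaches the observed one. The two-column motif and sequence length below are toy assumptions:

```python
import math
import random

# Toy 2-column PSFM (illustrative values)
psfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

def best_score(seq):
    """Best log-likelihood-ratio motif score over all windows,
    against a uniform 0.25 background."""
    best = float("-inf")
    for start in range(len(seq) - len(psfm) + 1):
        s = sum(math.log(col[seq[start + j]] / 0.25)
                for j, col in enumerate(psfm))
        best = max(best, s)
    return best

rng = random.Random(0)
observed = best_score("GGACGG")   # the "AC" core matches the motif

# Null distribution: best scores of i.i.d. random sequences
null = [best_score("".join(rng.choice("ACGT") for _ in range(6)))
        for _ in range(2000)]
p_value = sum(s >= observed for s in null) / len(null)
```

A short, weakly informative motif like this one matches random sequences often, so its empirical p-value is not small; longer, more informative motifs drive the p-value down.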
28
Extreme value theory
  • Extreme value theory (EVT) provides a framework
    to formalize the study of behaviour in the tails
    of a distribution. EVT allows us to use extreme
    observations to measure the density in the tail.

Peaks over threshold (POT): analysis of large
observations which exceed a high threshold u.
The problem is to estimate the distribution
function F_u:
F_u(y) = P(X − u ≤ y | X > u) = P(Y ≤ y | X > u),
0 ≤ y ≤ x_F − u,

where y = x − u and x_F is the right endpoint.
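A minimal POT sketch: estimate F_u(y) empirically from the exceedances of a high threshold. Exponential toy data are used because F_u is then known exactly (by memorylessness, the excesses of Exp(1) are again Exp(1)), so the estimate can be checked:

```python
import math
import random

rng = random.Random(42)
sample = [rng.expovariate(1.0) for _ in range(100000)]  # Exp(1) data

u = 2.0  # a high threshold
excesses = [x - u for x in sample if x > u]  # Y = X - u given X > u

def F_u(y):
    """Empirical conditional excess distribution P(X - u <= y | X > u)."""
    return sum(e <= y for e in excesses) / len(excesses)

est = F_u(1.0)                # empirical estimate
theory = 1 - math.exp(-1.0)   # exact value for Exp(1) excesses
```

In the general case the excess distribution is approximated by a generalized Pareto distribution fitted to the exceedances.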
29
Applications
  • I have applied the mentioned methods to a set of
    promoter regions of human DNA (2836 sequences)
    containing at least one regulatory estrogen
    responsive element (ERE).

30
Preliminary analysis (1/2)
  • Making histograms, box-plots, tests for normality

The data set is not normally distributed, but its
distribution tends towards the Gaussian
distribution.
31
Preliminary analysis (2/2)
  • Analysis on random sequences (all positions
    have the same letter distribution and are
    independent of each other)

The distribution of the sum of i.i.d. variables
over [0,a] is
  • I have checked
  • normality
  • the type of extreme value distribution
  • the Von Mises condition
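The Weibull behaviour expected for bounded scores can be illustrated with the simplest bounded case, Uniform(0,1): the rescaled maximum n(1 − M_n) is approximately standard exponential, which is exactly the (reversed) Weibull domain of attraction for a distribution with a finite right endpoint. A sketch:

```python
import random

rng = random.Random(7)
n = 1000
# 5000 replicates of the maximum M_n of n Uniform(0,1) variables
maxima = [max(rng.random() for _ in range(n)) for _ in range(5000)]

# Rescale: n * (1 - M_n) should be approximately Exp(1)
rescaled = [n * (1 - m) for m in maxima]
mean_rescaled = sum(rescaled) / len(rescaled)  # close to 1 for Exp(1)
```

Maxima of scores with a finite right endpoint, like the bounded sums on this slide, behave the same way, which is consistent with the Weibull fits reported in the results.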

32
Results
  • Focusing on the detection of one ERE at a time:
  • the distribution of the maximum of the scores
    over random sequences is almost always like a
    Weibull distribution (only 2/100 are not like a
    Weibull);
  • the distribution of the maximum of the scores
    over DNA sequences is mainly like a Weibull.
    There are 413/2836 score sequences which are like
    a Fréchet and 13 which are like a Gumbel.

I show the results for genes that have been proved
to contain at least one ERE by ChIP.
33-35
Results (figures)
36
Results
  • Detection of two EREs

I have applied the same procedure mentioned
previously, considering also the distance between
the two EREs.
  • Only for some particular distances have I
    detected a couple of EREs.
  • The position of one of the two EREs of the
    couple is the same position found biologically.
  • A relatively high homology is conserved for the
    mouse ortholog in the 3 genes.
  • Only for the DVL1 gene is the high homology
    conserved for all three orthologs (mouse, rat,
    dog).

37
Conclusion (1/2)
  • I have improved the detection of motifs by
  • using a higher-order background model,
  • computing scores for two or three motifs,
  • using POT on both DNA and random sequences,
  • testing the p-value.

The model fits well: it is able to detect those
genes validated by ChIP which are considered in
the literature to contain at least one ERE.
38
Conclusion (2/2)
  • Problems
  • When analyzing more than three motifs the model
    creates files of critical size.
  • Future work
  • Detecting different motifs
  • Incorporating new biological results into the
    model
  • Implementing a single routine in R

39
The End
Thanks to Prof. B. Tirozzi, Prof. R. Calogero and
the bioinformatics laboratory in Orbassano.