Title: Simple cluster structure of triplet distributions in genetic texts
1 Simple cluster structure of triplet distributions in genetic texts
- Andrei Zinovyev
- Institut des Hautes Études Scientifiques, Bures-sur-Yvette
2 Markov chain models
- Transition probabilities
- Frequencies of N-grams
- Example sequence: AGGTCGATC
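The two quantities named on this slide can be estimated directly from a text. The following is a minimal sketch, assuming overlapping N-grams and a first-order chain; the function names `ngram_frequencies` and `transition_probabilities` are illustrative and not taken from the talk.

```python
from collections import Counter

def ngram_frequencies(seq, n):
    """Relative frequencies of all overlapping n-grams in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def transition_probabilities(seq):
    """First-order estimates P(b | a) = count(ab) / count(a as a left neighbour)."""
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    context_counts = Counter(seq[:-1])  # every base that has a successor
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}

seq = "AGGTCGATC"  # example sequence from the slide
freqs = ngram_frequencies(seq, 3)
trans = transition_probabilities(seq)
print(freqs)
print(trans)
```

In this short example every base T is followed by C, so the estimate of P(C | T) is 1; real genomes need far longer texts for stable estimates.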
3 Sliding window of width W
- AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC
4 Protein-coding sequences
- AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT
- Figure label: bacterial gene
5 Shadow genes
- Complementary strand: TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACTGTACTGTTAGGTTGTACTGTTA
- Forward strand: AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT
- Figure label: shadow gene
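The two strands shown on this slide are base-by-base complements of each other, which can be checked mechanically. The sketch below is illustrative only: the helper names `complement` and `reverse_complement` are assumptions, and the strings are the slide's sequences with spaces and line breaks removed.

```python
# Watson-Crick pairing: A<->T, C<->G
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, in the same left-to-right order."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """The complementary strand read in its own 5'->3' direction."""
    return complement(seq)[::-1]

# Forward strand from the slide (two wrapped lines joined)
forward = ("AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCC"
           "AACATGACAAT")
# Complementary strand from the slide (space-separated pieces joined)
lower = ("TCCAGCTTA" "TGAGGCATAACTGTTTACTGAGGCCAT" "ACT"
         "GTACTGTTAGGTTGTACTGTTA")

print(complement(forward) == lower)
print(reverse_complement(forward))
```

A gene located on the complementary strand is seen on the forward strand only as its reverse complement, which is why counting triplets on the forward strand alone still registers such "shadow" genes, but with a different triplet distribution.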
6 When can we detect genes (by their content)?
- When non-coding regions are very different in base composition (e.g., different GC-content)
- When distances between the phases are large
7 Simple experiment
- Only the forward strands of genomes are used for triplet counting
- Every p positions in the sequence, open a window (x - W/2, x + W/2) of size W centered at position x
- Every window, starting from the first base pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets f_ijk are calculated
- The dataset consists of N ≈ L/p points, where L is the entire length of the sequence
- Every data point X_i = {x_is} corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets, s = 1, ..., 64
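The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the talk: the names `window_triplet_frequencies` and `build_dataset` are hypothetical, triplets are ordered alphabetically, windows that would run past either end of the sequence are skipped, and triplets containing symbols other than A, C, G, T are ignored.

```python
from itertools import product

# All 64 triplets in a fixed (alphabetical) order; the ordering is an assumption.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {t: s for s, t in enumerate(TRIPLETS)}

def window_triplet_frequencies(seq, x, W):
    """64 triplet frequencies for the window of size W centered at position x."""
    start = x - W // 2
    window = seq[start:start + W]
    counts = [0] * 64
    n = 0
    # Cut the window into non-overlapping triplets, starting at its first base.
    for i in range(0, len(window) - 2, 3):
        triplet = window[i:i + 3]
        if triplet in INDEX:  # skip triplets with ambiguous symbols such as N
            counts[INDEX[triplet]] += 1
            n += 1
    return [c / n for c in counts] if n else [0.0] * 64

def build_dataset(seq, W, p):
    """One 64-dimensional point per window, windows opened every p positions."""
    half = W // 2
    last_center = len(seq) - W + half  # last x whose window fits in the sequence
    return [window_triplet_frequencies(seq, x, W)
            for x in range(half, last_center + 1, p)]

# Toy usage on a periodic text
toy = "ATGGCC" * 10
data = build_dataset(toy, W=12, p=6)
print(len(data), data[0][INDEX["ATG"]])
```

Because the toy text has period 6 and p = 6, every window falls in the same phase and yields the same point; in a real genome the phase of the window relative to a gene changes with x, which is what separates the points into clusters.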
8 Principal Component Analysis
9 ViDaExpert tool
10 Caulobacter crescentus (GenBank NC_002696)
11 Path of sliding window
12 Helicobacter pylori (GenBank NC_000921)
13 Saccharomyces cerevisiae chromosome IV
14 Model sequences (random codon usage)
15 Model sequences (random codon usage, 50% of frequencies set to 0)
16 Graph of coding phase
17 Assessment

| Sequence | L | W | % of coding bases | Sn1 | Sp1 | Sn2 | Sp2 |
|---|---|---|---|---|---|---|---|
| Helicobacter pylori, complete genome (NC_000921) | 1643831 | 300 | 90 | 0.93 | 0.97 | 0.93 | 0.98 |
| Caulobacter crescentus, complete genome (NC_002696) | 4016947 | 300 | 91 | 0.93 | 0.97 | 0.94 | 0.98 |
| Prototheca wickerhamii mitochondrion (NC_001613) | 55328 | 120 | 49 | 0.82 | 0.93 | 0.84 | 0.95 |
| Saccharomyces cerevisiae chromosome III (NC_001135) | 316613 | 399 | 69 | 0.90 | 0.88 | 0.90 | 0.90 |
| Saccharomyces cerevisiae chromosome IV (NC_001136) | 1531929 | 399 | 73 | 0.89 | 0.91 | 0.92 | 0.92 |
| Model text RANDOM | 100000 | 500 | 49 | 0.90 | 0.61 | 0.82 | 0.77 |
| Model text RANDOM_BIAS | 100000 | 500 | 45 | 0.99 | 0.83 | 0.94 | 0.90 |
Completely blind prediction
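The slide does not define the Sn and Sp columns, nor how the two pairs (Sn1, Sp1) and (Sn2, Sp2) differ. In gene prediction, base-level sensitivity and specificity are commonly computed as Sn = TP / (TP + FN) and Sp = TP / (TP + FP). The sketch below assumes that convention; the function name `base_level_sn_sp` and the toy labels are hypothetical and are not data from the table.

```python
def base_level_sn_sp(true_coding, predicted_coding):
    """Base-level Sn and Sp from two equal-length 0/1 label sequences.

    Sn = TP / (TP + FN): fraction of truly coding bases that were predicted coding.
    Sp = TP / (TP + FP): fraction of predicted coding bases that are truly coding
    (the gene-finding convention, assumed here).
    """
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if p and not t)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Toy labels: 1 = coding base, 0 = non-coding base
truth      = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
prediction = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
sn, sp = base_level_sn_sp(truth, prediction)
print(sn, sp)
```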
18 Dependence on window size
19 Dependence on window size (continued)
20 State of the art: GLIMMER strategy
- Use a Markov model (MM) of 5th order (hexamers)
- Use interpolation for transition probabilities
- Use long ORFs (>500 bp) as the learning dataset
- Problems:
  - The number of hexamers to be evaluated is still big
  - Applicable only to assembled genomes of good quality (<1 frameshift per 1000 bp)
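To put "still big" in numbers: a 5th-order model conditions each base on the 5 preceding ones, so it has one probability per hexamer, while the triplet approach needs only 64 frequencies. The short calculation below is plain arithmetic on the four-letter alphabet; the helper name `markov_word_count` is illustrative and says nothing about how GLIMMER stores its model.

```python
ALPHABET_SIZE = 4  # A, C, G, T

def markov_word_count(order):
    """Number of (context, next base) words for a Markov model of the given order."""
    return ALPHABET_SIZE ** order * ALPHABET_SIZE

hexamers = markov_word_count(5)   # 4**6 words of length 6
triplets = ALPHABET_SIZE ** 3     # words of length 3
print(hexamers, triplets, hexamers // triplets)
```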
21 What can we learn from this game?
- Learning can be replaced with self-learning
- Bacterial gene-finders work relatively well when the concentration of coding sequences is high
- Correlations in the order of codons are small
- Codon usage is approximately the same along the genome
- The method presented allows self-learning on pieces of even unassembled DNA (>150 bp)
- The method gives an alternative to the HMM view of the problem of gene recognition
22 Acknowledgements
- Professor Alexander Gorban
- Professor Misha Gromov
Contact: http://www.ihes.fr/zinovyev