Simple cluster structure of triplet distributions in genetic texts

Transcript and Presenter's Notes

1

Simple cluster structure of
triplet distributions in
genetic texts
Andrei Zinovyev
Institut des Hautes Études Scientifiques,
Bures-sur-Yvette

2
Markov chain models

Transition probabilities
Frequencies of N-grams
AGGTCGATC
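The link between the two quantities on this slide can be sketched as follows. This is a minimal illustration, assuming the transition probabilities are estimated by plain counting of N-grams in the text; the function name `transition_probabilities` is hypothetical and does not come from the talk.

```python
from collections import Counter

def transition_probabilities(seq, order=1):
    """Estimate P(next base | previous `order` bases) from N-gram counts."""
    n = order + 1
    # Count every overlapping N-gram of length order + 1.
    ngrams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Count how often each context (the N-gram without its last base) occurs.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    return {gram: count / contexts[gram[:-1]] for gram, count in ngrams.items()}

# The example text from the slide, first-order chain (2-grams).
probs = transition_probabilities("AGGTCGATC")
print(probs["AG"], probs["TC"])
```

Raising `order` to 5 gives the hexamer-based model mentioned later in the talk, at the cost of many more counts to estimate.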

3
Sliding window
width W

AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCC
AACATGACAATC

4
Protein-coding sequences

AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCC
AACATGACAAT

bacterial gene

5
Shadow genes

TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACT
GTACTGTTAGGTTGTACTGTTA
AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCC
AACATGACAAT

shadow gene

6
When can we detect genes (by their content)?

When non-coding regions are very different in base composition (e.g., different GC-content)
When distances between the phases are large
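One way to make "distance between the phases" concrete is sketched below. It assumes the distance is the Euclidean distance between the 64-dimensional triplet frequency vectors obtained by reading the same text from offsets 0, 1 and 2; the talk does not state which metric was used, and the names `phase_frequencies` and `phase_distance` are illustrative.

```python
from collections import Counter
from itertools import product
from math import sqrt

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]

def phase_frequencies(seq, phase):
    """Frequencies of non-overlapping triplets read from offset `phase`."""
    counts = Counter(seq[i:i + 3] for i in range(phase, len(seq) - 2, 3))
    total = sum(counts[t] for t in TRIPLETS) or 1  # ignore non-ACGT triplets
    return [counts[t] / total for t in TRIPLETS]

def phase_distance(seq, p, q):
    """Euclidean distance between the triplet distributions of two phases."""
    fp, fq = phase_frequencies(seq, p), phase_frequencies(seq, q)
    return sqrt(sum((a - b) ** 2 for a, b in zip(fp, fq)))

# Toy "gene" made of two alternating codons: the shifted phases look very different.
toy = "ATGGCC" * 50
print(phase_distance(toy, 0, 1))
```

A real coding region gives smaller but still non-zero distances; when they shrink toward zero, the three phases cannot be told apart by triplet content alone.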

7
Simple experiment

Only the forward strands of genomes are used for triplet counting
Every p positions in the sequence, open a window (x - W/2, x + W/2) of size W centered at position x
Every window, starting from the first base-pair, is divided into W/3 non-overlapping triplets, and the frequencies f_ijk of all triplets are calculated
The dataset consists of N = L/p points, where L is the entire length of the sequence
Every data point X_i = {x_is} corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets, s = 1, ..., 64
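The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: W is even, windows that would run past either end of the sequence are skipped, and triplets containing letters other than A, C, G, T are ignored. The function name `window_points` is hypothetical.

```python
from collections import Counter
from itertools import product

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]  # the 64 coordinates

def window_points(seq, W, p):
    """Return (x, 64 triplet frequencies) for windows centred every p bases."""
    half = W // 2
    points = []
    for x in range(half, len(seq) - half + 1, p):
        window = seq[x - half:x + half]
        # Split the window into non-overlapping triplets from its first base.
        counts = Counter(window[i:i + 3] for i in range(0, len(window) - 2, 3))
        total = sum(counts[t] for t in TRIPLETS) or 1
        points.append((x, [counts[t] / total for t in TRIPLETS]))
    return points

points = window_points("ATGGCC" * 20, W=12, p=3)
print(len(points), len(points[0][1]))
```

Because p = 3 keeps the window in step with the codon grid while other step sizes do not, the choice of p decides which phases each window samples.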

8
Principal Component Analysis
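For readers who want to reproduce the projection without a dedicated tool, here is a rough sketch of extracting the first principal axis by power iteration. It is an assumption-laden stand-in, not the procedure used in the talk: it finds only the leading component, uses a fixed number of iterations, and the names `first_principal_component` and `project` are made up for this example.

```python
import random

def first_principal_component(points, iterations=200, seed=0):
    """Leading PCA axis of `points` (equal-length lists) by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centred = [[p[j] - mean[j] for j in range(d)] for p in points]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        # Apply the covariance matrix without forming it: C v = X^T (X v) / n.
        scores = [sum(row[j] * v[j] for j in range(d)) for row in centred]
        w = [sum(scores[i] * centred[i][j] for i in range(n)) / n for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:
            break
        v = [c / norm for c in w]
    return mean, v

def project(points, mean, axis):
    """Coordinate of every point along the given axis."""
    return [sum((p[j] - mean[j]) * axis[j] for j in range(len(axis))) for p in points]

# Toy data lying on the line y = 2x; the axis should point along (1, 2).
data = [[t, 2 * t] for t in range(-5, 6)]
mean, axis = first_principal_component(data)
coords = project(data, mean, axis)
```

For the 64-dimensional window vectors the same functions apply unchanged; further components would need deflation or a linear algebra library.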

9
ViDaExpert tool

10
Caulobacter crescentus (GenBank NC_002696)

11
Path of sliding window

12
Helicobacter pylori (GenBank NC_000921)

13
Saccharomyces cerevisiae chromosome IV

14
Model sequences (random codon usage)

15
Model sequences (random codon usage, 50% of frequencies are set to 0)
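A model text of this kind could be generated roughly as follows. The transcript does not say how the talk's model sequences were built (gene lengths, non-coding spacers, treatment of stop codons), so everything here is an assumption: codon weights are drawn uniformly at random, half of them are zeroed, and codons are then sampled independently. The names `random_codon_usage` and `model_gene` are hypothetical.

```python
import random
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def random_codon_usage(rng, zero_fraction=0.0):
    """Random codon frequencies; a given fraction of them is forced to zero."""
    weights = [rng.random() for _ in CODONS]
    n_zero = int(zero_fraction * len(CODONS))
    for i in rng.sample(range(len(CODONS)), n_zero):
        weights[i] = 0.0
    total = sum(weights)
    return [w / total for w in weights]

def model_gene(rng, usage, n_codons):
    """Concatenate codons drawn independently from the given usage table."""
    return "".join(rng.choices(CODONS, weights=usage, k=n_codons))

rng = random.Random(42)  # fixed seed so the sketch is reproducible
usage = random_codon_usage(rng, zero_fraction=0.5)
gene = model_gene(rng, usage, 200)
```

Drawing codons independently matches the later remark that correlations in the order of codons are small, but note that this sketch may place stop codons inside the "gene".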

16
Graph of coding phase

17
Assessment

Sequence                                             L        W    % of coding bases  Sn1   Sp1   Sn2   Sp2
Helicobacter pylori, complete genome (NC_000921)     1643831  300  90                 0.93  0.97  0.93  0.98
Caulobacter crescentus, complete genome (NC_002696)  4016947  300  91                 0.93  0.97  0.94  0.98
Prototheca wickerhamii mitochondrion (NC_001613)     55328    120  49                 0.82  0.93  0.84  0.95
Saccharomyces cerevisiae chromosome III (NC_001135)  316613   399  69                 0.90  0.88  0.90  0.90
Saccharomyces cerevisiae chromosome IV (NC_001136)   1531929  399  73                 0.89  0.91  0.92  0.92
Model text RANDOM                                    100000   500  49                 0.90  0.61  0.82  0.77
Model text RANDOM_BIAS                               100000   500  45                 0.99  0.83  0.94  0.90

Completely blind prediction
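The transcript does not define Sn and Sp or explain the indices 1 and 2. The sketch below assumes the convention common in gene-finding assessments, computed per base: sensitivity Sn = TP/(TP+FN) and specificity Sp = TP/(TP+FP). Both the definitions and the function name `sensitivity_specificity` are assumptions, not taken from the talk.

```python
def sensitivity_specificity(true_coding, predicted_coding):
    """Base-level Sn = TP/(TP+FN) and Sp = TP/(TP+FP) from two 0/1 label lists."""
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)      # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)  # coding, missed
    fp = sum(1 for t, p in pairs if p and not t)  # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Hypothetical per-base labels: 1 = coding, 0 = non-coding.
truth     = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
predicted = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
sn, sp = sensitivity_specificity(truth, predicted)
print(round(sn, 2), round(sp, 2))
```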
18
Dependence on window size

19
Dependence on window size

20
State of the art: GLIMMER strategy

Use a Markov model (MM) of 5th order (hexamers)
Use interpolation for transition probabilities
Use long ORFs (>500 bp) as the learning dataset
Problems
The number of hexamers to be evaluated is still large
Applicable only to collected genomes of good quality (<1 frameshift/1000 bp)
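To show what "long ORFs as the learning dataset" means in practice, here is a simplified sketch of collecting forward-strand open reading frames above a length threshold. It is not GLIMMER's own procedure: it assumes ATG as the only start codon, the three standard stop codons, and ignores the reverse strand. The name `long_orfs` is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def long_orfs(seq, min_length=500):
    """Forward-strand ORFs (ATG ... stop) of at least `min_length` bases."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3))  # half-open interval, stop included
                start = None
    return orfs

# Toy sequence with one 36-base ORF in frame 2; threshold lowered for the demo.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
print(long_orfs(toy, min_length=30))
```

With the default threshold of 500 bases the toy sequence yields nothing, which is the point of the slide's second problem: short or fragmented sequences leave such a strategy without training data.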

21
What can we learn from this game?

Learning can be replaced with self-learning
Bacterial gene-finders work relatively well when the concentration of coding sequences is high
Correlations in the order of codons are small
Codon usage is approximately the same along the genome
The method presented allows self-learning on pieces of even uncollected DNA (>150 bp)
The method gives an alternative to the HMM view on the problem of gene recognition

22
Acknowledgements
Professor Alexander Gorban
Professor Misha Gromov

My coordinates: http://www.ihes.fr/zinovyev
