Simple cluster structure of triplet distributions in genetic texts

Slides: 23
Provided by: andre548

Transcript and Presenter's Notes

Title: Simple cluster structure of triplet distributions in genetic texts


1
  • Simple cluster structure of
  • triplet distributions in
  • genetic texts
  • Andrei Zinovyev
  • Institut des Hautes Études Scientifiques,
  • Bures-sur-Yvette

2
Markov chain models
  • Transition probabilities
  • Frequencies of N-grams
  • AGGTCGATC
  • AGGTCGATC
  • AGGTCGATC
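
A minimal sketch of the two quantities named on this slide, assuming a first-order chain and overlapping N-grams (the slide does not state the order or the counting scheme). The function names are illustrative only.

```python
from collections import Counter

def ngram_frequencies(seq, n):
    """Relative frequencies of all overlapping n-grams in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def transition_probabilities(seq):
    """Estimate P(next base | current base) for a first-order Markov chain."""
    pair_counts = Counter(zip(seq, seq[1:]))   # counts of adjacent base pairs
    from_counts = Counter(seq[:-1])            # how often each base has a successor
    return {(a, b): c / from_counts[a] for (a, b), c in pair_counts.items()}

seq = "AGGTCGATC"  # the example string from the slide
triplet_freqs = ngram_frequencies(seq, 3)
probs = transition_probabilities(seq)
```

For the short example string every overlapping triplet occurs once, so all seven triplet frequencies are equal.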

3
Sliding window
width W
  • AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCC
    AACATGACAATC
  • AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCC
    AACATGACAATC
  • AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCC
    AACATGACAATC
  • AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCC
    AACATGACAATC
  • AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCC
    AACATGACAATC

4
Protein-coding sequences
  • AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCC
    AACATGACAAT

bacterial gene
5
Shadow genes
  • TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACT
    GTACTGTTAGGTTGTACTGTTA
  • AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCC
    AACATGACAAT

shadow gene
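
A small sketch of how a gene on the complementary strand shows up on the forward strand, assuming that a "shadow gene" here means the forward-strand image of a gene encoded on the opposite strand. The helper name is illustrative.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

# A start codon ATG on the complementary strand appears as CAT
# when the forward strand is read left to right.
shadow_of_start = reverse_complement("ATG")
```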
6
When can we detect genes (by their content)?
  1. When non-coding regions are very different in
    base composition (e.g., different GC-content)
  2. When distances between the phases are large

7
Simple experiment
  1. Only the forward strands of genomes are used
    for triplet counting
  2. Every p positions in the sequence, open a window
    (x-W/2, x+W/2) of size W centered at
    position x
  3. Every window, starting from the first base pair,
    is divided into W/3 non-overlapping triplets,
    and the frequencies of all triplets f_ijk are
    calculated
  4. The dataset consists of N ≈ L/p points, where
    L is the entire length of the sequence
  5. Every data point X_i = {x_is} corresponds to one
    window and has 64 coordinates, corresponding to
    the frequencies of all possible triplets,
    s = 1, ..., 64
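
The steps above can be sketched as follows. This is a minimal illustration, not the author's code: windows are indexed by their start position (the center is x = start + W // 2), triplets containing characters other than A, C, G, T are skipped, and the function names are hypothetical.

```python
from itertools import product

# The 64 possible triplets in a fixed order; INDEX maps a triplet to its coordinate.
CODONS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}

def window_triplet_frequencies(window):
    """Split a window into non-overlapping triplets and return 64 frequencies."""
    counts = [0] * 64
    n = 0
    for i in range(0, len(window) - 2, 3):
        idx = INDEX.get(window[i:i + 3])
        if idx is not None:          # assumption: ignore ambiguous bases such as N
            counts[idx] += 1
            n += 1
    return [c / n for c in counts] if n else [0.0] * 64

def triplet_dataset(seq, W, p):
    """One 64-dimensional point per window of width W, opened every p positions."""
    return [window_triplet_frequencies(seq[start:start + W])
            for start in range(0, len(seq) - W + 1, p)]

# Toy forward strand: a perfectly periodic "gene" read in phase.
data = triplet_dataset("ATG" * 40, W=30, p=3)
```

With p a multiple of 3 every window of the toy sequence is read in the same phase, so all points coincide; with p = 1 the points would cycle through three phase-shifted positions.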

8
Principal Component Analysis
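
The slide gives only the name of the technique, so the following is a hedged sketch of what projecting the 64-dimensional points onto a principal axis involves. It finds the first principal component by power iteration in plain Python; this is an illustrative choice, not a description of how ViDaExpert or the author computed it, and a numerical library would be the practical route for real genomes.

```python
import math

def center(data):
    """Subtract the per-coordinate mean from every point."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in data]

def first_principal_component(data, iterations=200):
    """Unit vector along the direction of largest variance (power iteration)."""
    X = center(data)
    d = len(X[0])
    # Assumption: the start vector is not orthogonal to the leading component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(d)) for r in X]               # X v
        w = [sum(scores[i] * X[i][j] for i in range(len(X))) for j in range(d)]  # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

def project(data, v):
    """Coordinates of the centered points along direction v."""
    return [sum(r[j] * v[j] for j in range(len(v))) for r in center(data)]

# Toy data lying on the line y = 2x: the first component points along (1, 2).
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
pc1 = first_principal_component(points)
coords = project(points, pc1)
```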
9
ViDaExpert tool
10
Caulobacter crescentus (GenBank NC_002696)
11
Path of sliding window
12
Helicobacter pylori (GenBank NC_000921)
13
Saccharomyces cerevisiae chromosome IV
14
Model sequences (random codon usage)
15
Model sequences (random codon usage, 50% of
frequencies are set to 0)
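
One way such a model text could be generated, as a sketch under stated assumptions: codon frequencies are drawn uniformly at random and normalized, a chosen fraction of them is set to zero, and codons are then sampled independently. The slides do not say how stop codons or non-coding stretches were handled, so neither is modeled here; all names are illustrative.

```python
import random
from itertools import product

CODONS = ["".join(t) for t in product("ACGT", repeat=3)]

def random_codon_usage(rng, zero_fraction=0.0):
    """Random codon frequencies summing to 1, with a fraction forced to 0."""
    weights = [rng.random() for _ in CODONS]
    for i in rng.sample(range(len(CODONS)), int(zero_fraction * len(CODONS))):
        weights[i] = 0.0
    total = sum(weights)
    return [w / total for w in weights]

def model_sequence(usage, n_codons, rng):
    """Concatenate n_codons codons drawn independently from the usage table."""
    return "".join(rng.choices(CODONS, weights=usage, k=n_codons))

rng = random.Random(0)                              # fixed seed for repeatability
usage = random_codon_usage(rng, zero_fraction=0.5)  # 32 of 64 codons never occur
text = model_sequence(usage, n_codons=100, rng=rng)
```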
16
Graph of coding phase
17
Assessment
Sequence                                                  L        W    % of coding bases  Sn1   Sp1   Sn2   Sp2
Helicobacter pylori, complete genome (NC_000921)          1643831  300  90                 0.93  0.97  0.93  0.98
Caulobacter crescentus, complete genome (NC_002696)       4016947  300  91                 0.93  0.97  0.94  0.98
Prototheca wickerhamii mitochondrion (NC_001613)          55328    120  49                 0.82  0.93  0.84  0.95
Saccharomyces cerevisiae chromosome III (NC_001135)       316613   399  69                 0.90  0.88  0.90  0.90
Saccharomyces cerevisiae chromosome IV (NC_001136)        1531929  399  73                 0.89  0.91  0.92  0.92
Model text RANDOM                                         100000   500  49                 0.90  0.61  0.82  0.77
Model text RANDOM_BIAS                                    100000   500  45                 0.99  0.83  0.94  0.90
Completely blind prediction
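
The slide does not define Sn and Sp, nor the difference between the two pairs of columns. The sketch below assumes the convention common in gene finding, computed per base: Sn = TP / (TP + FN) and Sp = TP / (TP + FP). The function name and the toy labels are illustrative, not data from the table.

```python
def sensitivity_specificity(predicted, actual):
    """Per-base Sn and Sp from two equal-length sequences of 0/1 coding labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # coding, called coding
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # non-coding, called coding
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)  # coding, missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Toy example: five positions, labeled 1 where coding.
sn, sp = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```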
18
Dependence on window size
19
Dependence on window size
20
State of the art: GLIMMER strategy
  • Use a Markov model of 5th order (hexamers)
  • Use interpolation for transition probabilities
  • Use long ORFs (>500 bp) as the learning dataset
  • Problems:
  • The number of hexamers to be evaluated is still
    large
  • Applicable only to collected genomes of good
    quality (<1 frameshift per 1000 bp)

21
What can we learn from this game?
  • Learning can be replaced with self-learning
  • Bacterial gene-finders work relatively well
    when the concentration of coding sequences is high
  • Correlations in the order of codons are small
  • Codon usage is approximately the same along the
    genome
  • The method presented allows self-learning on
    pieces of even uncollected DNA (>150 bp)
  • The method offers an alternative to the HMM view of
    the problem of gene recognition

22
Acknowledgements
Professor Alexander Gorban
Professor Misha Gromov
My coordinates: http://www.ihes.fr/zinovyev