Tools for core promoter prediction - PowerPoint PPT Presentation

1
Tools for core promoter prediction
  • Thomas Abeel
  • Department of Plant Systems Biology, VIB, Ghent
    University

2
Overview
  • Introduction
  • Core promoter prediction
  • Physical properties of DNA
  • Tools for promoter prediction
  • EP3
  • ProSOM
  • Validation on the human genome

3
Introduction
  • Core promoter
  • Physical properties of DNA

4
Core promoter
  • Promoter: controlling element of a gene (when,
    where, and how much it is expressed)
  • Core promoter: binds the transcription machinery,
    small (50 bp), contains the transcription start site
    (TSS)
  • Prediction: identify these regions using
    computational resources

5
  • Core promoter is the region 50 nt upstream of
    the TSS

6
Promoter prediction challenges
  • A lot of unannotated sequence data is generated
  • Experimental identification of promoters is
    expensive: money, labor, time
  • Computational techniques to identify promoters
    based only on sequence
  • Help annotation projects to identify genes
  • Help lab people to do targeted experiments

7
Physical properties of DNA
  • Physical and chemical properties
  • Examples: bendability, denaturation energy
  • Conversion tables: one experimentally determined
    value per dinucleotide
  • Replace each dinucleotide with its value
  • Average the values with a sliding window (in case
    of genome DNA) or across multiple sequences

8
Example conversion table
9
Sample conversion
  • AGTTA → 85.12 108.8 66.51 50.11
  • GGAAT → 99.31 80.03 66.51 72.29
  • CTGCA → 85.12 64.92 135.83 64.92
  • Profile (average per position) → 89.85 84.58 89.62 62.44
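
The sample conversion can be reproduced with a short Python sketch. The table below holds only the dinucleotide values that appear on this slide; a real conversion table would list all 16 dinucleotides. The names `TABLE`, `convert` and `average_profile` are made up for illustration.

```python
# Dinucleotide values copied from the sample conversion on this slide.
# Assumption: a full table would contain all 16 dinucleotides.
TABLE = {
    "AG": 85.12, "GT": 108.8, "TT": 66.51, "TA": 50.11,
    "GG": 99.31, "GA": 80.03, "AA": 66.51, "AT": 72.29,
    "CT": 85.12, "TG": 64.92, "GC": 135.83, "CA": 64.92,
}


def convert(seq, table=TABLE):
    """Replace each overlapping dinucleotide with its table value."""
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def average_profile(seqs, table=TABLE):
    """Average the converted values position by position across sequences."""
    rows = [convert(s, table) for s in seqs]
    return [round(sum(col) / len(col), 2) for col in zip(*rows)]


profile = average_profile(["AGTTA", "GGAAT", "CTGCA"])
print(profile)  # [89.85, 84.58, 89.62, 62.44]
```

Each sequence of n nucleotides yields n - 1 values, so the three 5 nt sequences give a profile of four averaged values.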

10
DNA denaturation energy
  • Thermal stability of DNA
  • The amount of energy that is required to separate
    two strands of DNA
  • High values indicate stable regions
  • Low values indicate unstable regions
  • Dinucleotide conversion table
  • Correlated with GC-content

11
Different properties (human)
  • X-axis: position relative to TSS
  • Y-axis: normalized average of the physical
    property
  • 30,000 sequences

12
Small versus large scale (human)
13
Promoter elements
14
Different organisms (1)
15
Different organisms (2)
16
Introduction summary
  • Core promoter has unique features
  • Physical properties
  • Large and small scale
  • Valid for various organisms
  • Valid irrespective of promoter elements
  • Two unstable locations next to TSS
  • Stable in large region

17
Promoter prediction tools
  • EP3
  • ProSOM

18
EP3
19
Sliding window conversion
  • AGTTACTGCAGGAAT
  • 4 nt sliding window
  • → 85.12 108.8 66.51 50.11 ...
  • First window (AGTT): average of 85.12, 108.8 and
    66.51 = 86.81
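
A minimal Python sketch of the sliding window conversion, assuming a window of w nucleotides covers w - 1 overlapping dinucleotides (which reproduces the 86.81 on this slide). Only the dinucleotide values needed for the first five nucleotides are included; `sliding_window_profile` is a hypothetical helper name.

```python
# Values for the dinucleotides in AGTTA, taken from the slides.
TABLE = {"AG": 85.12, "GT": 108.8, "TT": 66.51, "TA": 50.11}


def sliding_window_profile(seq, window, table):
    """Average dinucleotide values inside each window of `window` nt."""
    values = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    per_window = window - 1  # a window of w nt holds w - 1 dinucleotides
    return [
        round(sum(values[i:i + per_window]) / per_window, 2)
        for i in range(len(values) - per_window + 1)
    ]


print(sliding_window_profile("AGTTA", 4, TABLE))  # [86.81, 75.14]
```

On a genome the same loop runs with a much larger window, one averaged value per window position.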

20
On a genomic scale
  • 400 nt sliding window
  • 1 Mbp of chromosome 21
  • Yellow: forward strand genes
  • Blue: reverse strand genes
  • Threshold on the profile for predictions

21
EP3 summary
  • Large scale properties
  • DNA denaturation
  • 400 bp sliding window
  • Threshold on the average value
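
Thresholding a windowed profile can be sketched as follows. This is an illustration only: it assumes that predictions are the runs of window positions whose average is at or above the threshold, and the toy profile values are invented. `predict_regions` is a hypothetical name, not part of EP3.

```python
def predict_regions(profile, threshold):
    """Return (start, end) index pairs of runs where profile >= threshold."""
    regions, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i  # a run above the threshold begins here
        elif value < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:  # run extends to the end of the profile
        regions.append((start, len(profile) - 1))
    return regions


# Invented values standing in for a windowed denaturation profile.
toy_profile = [80.1, 84.9, 91.2, 93.0, 90.4, 83.7, 79.5, 92.1]
print(predict_regions(toy_profile, threshold=90.0))  # [(2, 4), (7, 7)]
```

Raising the threshold yields fewer, more confident regions; lowering it yields more regions at the cost of more false predictions.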

22
ProSOM
  • Small scale features
  • DNA denaturation
  • Inverse peak at -35
  • Related to TBP
  • Instability of 10 bp
  • Inverse peak at 0
  • Related to TSS

23
ProSOM
  • Do not use profile directly
  • Cluster known promoters and other sequences using
    an unsupervised method
  • Clustering method separates promoters from other
    sequences
  • New sequences that map to clusters with promoter
    content > 75% are putative promoters

24
Training data
  • DBTSS
  • 30,000 promoter sequences [-200, +50]
  • Ensembl
  • 30,000 transcribed sequences [start+5k, end-5k]
  • 30,000 intergenic sequences [start+5k, end-5k]

25
Self-organizing maps
  • Type of neural network
  • Uses unsupervised learning: competitive learning
  • Maps high-dimensional input space to a lower
    dimension (2), while preserving topological
    properties
  • Clustering based on the full physical profile of
    the sequences

26
Training procedure
  • Randomly initialize 36 clusters (6x6 grid)
  • Each sequence is added to the algorithm
  • Features: full physical profile of the sequence
  • Compare this profile to the existing clusters
  • Update the best matching cluster and its
    neighbors to look more like the added profile
  • Identify clusters with high promoter content
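
The training steps above can be sketched as a small self-organizing map in plain Python. This is not the ProSOM implementation: the Gaussian neighborhood, the linear decay schedule, the default parameters and all function names are assumptions chosen for illustration, and the toy profiles are invented. Only the 6x6 grid and the 75% cut-off come from the slides.

```python
import math
import random


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(weights, profile):
    """Grid position of the cluster whose weight vector is closest."""
    return min(weights, key=lambda node: sqdist(weights[node], profile))


def update(weights, profile, rate, sigma):
    """Pull the best matching cluster and its grid neighbors toward profile."""
    bmu = best_match(weights, profile)
    for node, w in weights.items():
        # Assumed Gaussian neighborhood on the grid; equals 1.0 at the BMU.
        h = math.exp(-sqdist(node, bmu) / (2 * sigma ** 2))
        for k in range(len(w)):
            w[k] += rate * h * (profile[k] - w[k])
    return bmu


def train_som(profiles, rows=6, cols=6, epochs=20, rate=0.5, sigma=2.0, seed=0):
    """Randomly initialize a rows x cols grid, then present every profile."""
    rng = random.Random(seed)
    lo = min(min(p) for p in profiles)
    hi = max(max(p) for p in profiles)
    weights = {(r, c): [rng.uniform(lo, hi) for _ in profiles[0]]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # shrink step size and neighborhood
        for p in rng.sample(profiles, len(profiles)):
            update(weights, p, rate * decay, max(sigma * decay, 0.5))
    return weights


def promoter_clusters(weights, profiles, is_promoter, min_fraction=0.75):
    """Clusters whose mapped training profiles are > min_fraction promoters."""
    counts = {}
    for p, flag in zip(profiles, is_promoter):
        node = best_match(weights, p)
        total, hits = counts.get(node, (0, 0))
        counts[node] = (total + 1, hits + int(flag))
    return {n for n, (total, hits) in counts.items()
            if hits / total > min_fraction}


# Invented toy profiles: two promoter-like and two other sequences.
toy = [[1.0, 0.2, 1.0], [0.9, 0.1, 1.1], [0.1, 1.0, 0.2], [0.2, 0.9, 0.1]]
labels = [True, True, False, False]
som = train_som(toy)
flagged = promoter_clusters(som, toy, labels)
print(best_match(som, [1.0, 0.2, 1.0]) in flagged)
```

A new sequence would be called a putative promoter when `best_match` places its profile in one of the flagged clusters.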

27
Clustering result
28
Tools summary
  • EP3
  • Prediction based on large-scale physical
    properties
  • Very simple → fast, no training
  • 400 bp sliding window
  • ProSOM
  • Predictions based on small-scale physical
    properties (indirectly)
  • Slower, requires some training (unsupervised)
  • 250 bp sliding window

29
Validation of EP3 and ProSOM
  • Human genome
  • Proof-of-concept on other genomes

30
Validation
  • How many of our predictions correspond to known
    core promoters?
  • For predictions that do not have a known
    associated core promoter, is there other
    evidence?
  • How many of the known core promoters can we
    predict?
  • Why are some known core promoters missed?

31
Human validation data
  • Genome sequences for the complete human genome
  • Coordinates of 123,400 known core promoters,
    transcription start sites (TSS)
  • Determined in the lab
  • Transcribed fragments and other evidence for
    ENCODE regions
  • Predictions made by ProSOM and EP3

32
Quantifying right and wrong
  • Counting scheme
  • True positive (TP): correctly predicted TSS
  • False positive (FP): false predictions, i.e.
    predictions not associated with a known TSS
  • False negative (FN): unpredicted TSS, i.e.
    transcription start sites without prediction
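
A minimal sketch of this counting scheme, assuming each prediction and each TSS is a single coordinate on one sequence and that a prediction is associated with a TSS when it lies within 500 bp of it. The function name `count_matches` and the example coordinates are invented.

```python
def count_matches(tss_positions, predictions, max_distance=500):
    """Return (TP, FP, FN) for known TSS positions versus predictions."""
    # TP: known TSS with at least one prediction close enough.
    tp = sum(
        1 for t in tss_positions
        if any(abs(t - p) <= max_distance for p in predictions)
    )
    # FN: known TSS without any nearby prediction.
    fn = len(tss_positions) - tp
    # FP: predictions that are not near any known TSS.
    fp = sum(
        1 for p in predictions
        if not any(abs(t - p) <= max_distance for t in tss_positions)
    )
    return tp, fp, fn


print(count_matches([1000, 5000, 9000], [1200, 7000, 12000]))  # (1, 2, 2)
```

Here only the TSS at 1000 has a prediction within 500 bp; the predictions at 7000 and 12000 are far from every known TSS.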

33
Validation - graphical
[Figure: predictions compared with transcription start sites. Legend: prediction, transcription start site, region near TSS (-500 bp to 500 bp), region far from TSS. Labels: TSS (TP), TSS (FN), FP.]
34
Performance measure
  • Recall: TP / (total TSS), the % of sites recovered
  • Precision: TP / (total predictions), the % of
    predictions that are correct
  • F-measure: harmonic mean of recall and precision
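
These measures can be written out in a few lines of Python. The function names are made up; the example values 35% recall and 66% precision are the ones quoted in the validation summary of this talk.

```python
def recall(tp, total_tss):
    """Fraction of known TSS that were recovered."""
    return tp / total_tss


def precision(tp, total_predictions):
    """Fraction of predictions that are correct."""
    return tp / total_predictions


def f_measure(r, p):
    """Harmonic mean of recall and precision (0 if both are 0)."""
    return 2 * r * p / (r + p) if (r + p) else 0.0


print(round(f_measure(0.35, 0.66), 3))  # 0.457
```

The harmonic mean stays close to the lower of the two values, so a tool cannot score well by maximizing only recall or only precision.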

35
Tool performance
  • Among the best available
  • Compared with >10 state-of-the-art tools
  • Preference for CpG-island promoters, i.e.
    house-keeping genes

36
Performance ENCODE
  • 85% of predictions have evidence for
    transcription within 50 bp
  • Transcribed fragments
  • CAGE tags
  • → Some false predictions in the genome are likely
    to be associated with transcription

37
Answers to the questions (1)
  • How many of our predictions correspond to known
    core promoters?
  • 66%
  • For predictions that do not have a known
    associated core promoter, is there other
    evidence?
  • 85% within 50 bp, 98% within 500 bp

38
Answers to the questions (2)
  • How many of the known core promoters can we
    predict?
  • 34% (EP3) - 38% (ProSOM)
  • Why are some known core promoters missed?
  • Preference for GC-rich promoters
  • Threshold is trade-off between recall and
    precision
  • The profile is an average, not every promoter
    follows this pattern

39
Proof-of-concept on other organisms
  • Whole genome sequences
  • Gene annotation → 5' UTR end of the gene used as TSS
  • EP3: 12 genomes
  • P. falciparum, O. pacifica, O. tauri, A.
    thaliana, O. sativa, P. trichocarpa, S.
    cerevisiae, S. pombe, C. elegans, D.
    melanogaster, T. nigroviridis, M. musculus
  • ProSOM: mouse (without retraining)

40
Why does it work?
41
ProSOM - EP3
  • Human 48
  • Mouse 45

42
Caveats
  • 500 bp distance is large for plant and fungal
    genomes
  • Prediction coverage is larger due to smaller
    genomes and higher gene density
  • Protist genomes do not work because the profile
    is nearly flat
  • Only gene annotation is available; we have no real
    core promoter data sets

43
Summary validation
  • Two promoter predictors: EP3 and ProSOM
  • Works well on the human genome
  • Recall 35%, precision 66%
  • 98% of predictions have evidence
  • Works on other eukaryotic genomes
  • Some better than others
  • Some caveats apply, only proof-of-concept

44
Conclusions
  • Unique physical properties of core promoter
  • EP3 (large scale)
  • Additional validation on 12 eukaryotic genomes
  • Requires no training, only threshold tuning
  • ProSOM (small scale)
  • Requires some training, but not species specific
  • Extensively validated on the human genome
  • Proof-of-concept on some other genomes

45
References and contact
  • Abeel, T., Saeys, Y., Bonnet, E., Rouzé, P., Van
    de Peer, Y. (2008) Generic eukaryotic core
    promoter prediction using structural features of
    DNA. Genome Research 18, 310-23.
  • Abeel, T., Saeys, Y., Rouzé, P., Van de Peer, Y.
    (2008) ProSOM: core promoter prediction based on
    unsupervised clustering of DNA physical profiles.
    Bioinformatics 24, i24-i31.
  • E-mail: thomas.abeel_at_psb.ugent.be
  • Web: http://bioinformatics.psb.ugent.be/