Gene Prediction - PowerPoint PPT Presentation

Provided by: patt86

Transcript and Presenter's Notes

Title: Gene Prediction

1
Chapter 8 Gene Prediction
2

Automated sequencing of genomes requires automated gene assignment
Includes detection of open reading frames (ORFs)
Identification of introns and exons
Gene prediction is a very difficult problem in pattern recognition
Coding regions generally do not have conserved sequences
Much progress has been made with prokaryotic gene prediction
Eukaryotic genes are more difficult to predict correctly

Ab initio methods
Predict genes from the given sequence alone
Use gene signals
Start/stop codons
Intron splice sites
Transcription factor binding sites, ribosomal binding sites
Poly-A sites
Codon structure demands a length that is a multiple of three nucleotides
Gene content
Nucleotide composition, modeled with HMMs
Homology-based methods
Matches to known genes
Matches to cDNA
Consensus-based methods
Use output from more than one program

Prokaryotic gene structure
ATG is the start codon (GTG or TTG less frequent)
Ribosome binding site (Shine-Dalgarno sequence), complementary to 16S rRNA of ribosome
AGGAGGT
Stop codon (TAA, TAG or TGA)
Transcription termination site (ρ-independent termination)
Stem-loop secondary structure followed by a string of Ts

Translate sequence into 6 reading frames
In random sequence a stop codon occurs about every 20 codons (3 of the 64 codons are stops)
Look for a frame longer than 30 codons (normally 50-60 codons)
Presence of start codon and Shine-Dalgarno sequence
Translate putative ORF into protein, and search databases
Non-randomness of 3rd base of codon, more frequently G/C
Plotting wobble base GC can identify ORFs
3rd base also repeats, thus repetition gives a clue to gene location
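
The ORF search described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function names `find_orfs` and `gc3` are hypothetical, the 30-codon cutoff is taken from the slide, and Shine-Dalgarno detection and database searches are left out.

```python
# Hypothetical helper names; only start/stop codons and length are checked.
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # ATG is the usual start; GTG/TTG are rarer
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc3(orf):
    """Fraction of G/C at the third (wobble) position of each codon."""
    thirds = orf[2::3]
    return sum(base in "GC" for base in thirds) / len(thirds) if thirds else 0.0


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for start..stop stretches.

    Returns tuples (strand, frame, start, end, gc3). Offsets are 0-based on
    the scanned strand (for "-" that is the reverse complement); end is
    exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3  # stop codon not counted
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, gc3(s[start:i])))
                    start = None
    return orfs
```

Calling `find_orfs(genome_string)` returns every candidate ORF with its wobble-position GC fraction, so the length and GC3 criteria from the slide can be inspected side by side.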

Markov chains and HMMs
In a k-th order model, each position depends on the k previous positions
The higher the order of a Markov model used to describe a gene, the more non-randomness the model includes
Genes described in codons or hexamers
HMMs trained with known genes
Codon pairs are often found, thus 6-nucleotide patterns often occur in ORFs (5th-order Markov chain)
5th-order HMM gives very accurate gene predictions
Problem may be that in short genes there are not enough hexamers
Interpolated Markov Model (IMM) samples Markov chains of different lengths
Weighting scheme places less weight on rare k-mers
Final probability is the combined probability of all weighted k-mers
Typical and atypical genes
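
To make the idea of a 5th-order Markov chain concrete, the sketch below estimates P(base | previous 5 bases) from training sequences and scores a query by its log-odds between a coding and a non-coding model. It is a plain fixed-order chain, not an HMM or an IMM, and it ignores codon phase; the function names and the add-one pseudocount are assumptions made for illustration.

```python
import math
from collections import Counter


def train_markov(seqs, k=5, pseudocount=1.0):
    """Return prob(context, base) estimated from hexamer counts (for k=5)."""
    context_counts = Counter()  # k-mers that are followed by a base
    kmer_counts = Counter()     # (k+1)-mers: context plus the next base
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k):
            context_counts[s[i:i + k]] += 1
            kmer_counts[s[i:i + k + 1]] += 1

    def prob(context, base):
        # Smoothed conditional probability; unseen contexts give 0.25.
        return (kmer_counts[context + base] + pseudocount) / (
            context_counts[context] + 4 * pseudocount
        )

    return prob


def log_odds(seq, coding, noncoding, k=5):
    """Sum of log(P_coding / P_noncoding) over every position after the first k."""
    seq = seq.upper()
    score = 0.0
    for i in range(k, len(seq)):
        context, base = seq[i - k:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score
```

A positive score means the query looks more like the coding training set. The slide's point about short genes shows up here too: with few hexamers observed, most probabilities fall back to the pseudocount, which is the situation an IMM addresses by mixing in lower-order chains.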

7
GeneMark (http://exon.gatech.edu/genemark/)
Trained on complete microbial genomes
Most closely related organism used for predictions
Glimmer (Gene Locator and Interpolated Markov ModelER) (http://www.cbcb.umd.edu/software/glimmer/)
FGENESB (http://linux1.softberry.com/)
5th-order HMM
Trained with bacterial sequences
Linear discriminant analysis (LDA)
RBSFinder (ftp://ftp.tigr.org)
Takes output from Glimmer and searches for S-D sequences close to start sites
9

Performance evaluation
Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
Correlation coefficient: CC = (TP·TN - FP·FN) / [(TP + FP)(TN + FN)(TP + FN)(TN + FP)]^(1/2)
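
The three measures on this slide are simple functions of the true/false positive and negative counts. The helper below is a small sketch (the name `prediction_metrics` is made up here); it uses the gene-prediction convention from the slide, where specificity is TP/(TP+FP), and does not guard against zero denominators.

```python
import math


def prediction_metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) from nucleotide- or exon-level counts."""
    sn = tp / (tp + fn)  # fraction of true coding positions that were predicted
    sp = tp / (tp + fp)  # fraction of predicted coding positions that are real
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom  # correlation coefficient, from -1 to 1
    return sn, sp, cc
```

For example, `prediction_metrics(80, 90, 10, 20)` gives Sn = 0.8 and Sp = 80/90; CC summarizes both kinds of error in a single number.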

10
Gene prediction in eukaryotes
Low gene density (3% in humans)
Space between genes very large, with multiply repeated sequences and transposable elements
Eukaryotic genes are split (introns/exons)
Transcript is capped (methylation of 5' residue)
Splicing in spliceosome
Alternative splicing
Polyadenylation (250 As added) downstream of CAATAAA(T/C) consensus box
Major issue: identification of splice sites
GT-AG rule (GTAAGT / Y12NCAG at the 5'/3' intron splice junctions; Y = pyrimidine)
Codon use frequencies
ATG start codon
Kozak sequence (CCGCCATGG)
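
As a rough illustration of why the GT-AG rule alone is a weak signal, the sketch below lists every span that starts with GT and ends with AG within a length range. The function name and the length limits are arbitrary assumptions; real predictors score the wider donor and acceptor consensus rather than the bare dinucleotides.

```python
import re


def candidate_introns(seq, min_len=20, max_len=200):
    """Return (start, end) spans that begin with GT and end with AG.

    Offsets are 0-based and end is exclusive, so seq[start:end] is the
    candidate intron itself.
    """
    seq = seq.upper()
    # Lookahead patterns find every occurrence, including overlapping ones.
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if min_len <= a - d <= max_len
    ]
```

Even on short sequences this returns many overlapping candidates, which is the reason splice-site identification is called the major issue on this slide.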
11

Ab initio programs
Gene signals
Start/stop
Putative splice signals
Consensus sequences
Poly-A sites
Gene content
Coding statistics
Non-random nucleotide distributions
Hexamer frequencies
HMMs

Discriminant analysis
Plot 2D graph of coding length versus 3' splice site score
Place a diagonal line (LDA) that separates true coding from non-coding sequences based on learnt knowledge
QDA fits a quadratic curve
FGENES uses LDA
MZEF (Michael Zhang's Exon Finder) uses QDA
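
The separating line in LDA can be computed directly for two features. The sketch below fits a Fisher linear discriminant to toy (length, splice-site score) pairs; the function names are hypothetical, and it is not a description of how FGENES is implemented.

```python
def fit_lda(pos, neg):
    """Fisher linear discriminant for 2-feature points.

    Returns (w, threshold); a point p is called coding when
    w[0]*p[0] + w[1]*p[1] > threshold.
    """
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        return sxx, sxy, syy

    mp, mn = mean(pos), mean(neg)
    a1, b1, c1 = scatter(pos, mp)
    a2, b2, c2 = scatter(neg, mn)
    a, b, c = a1 + a2, b1 + b2, c1 + c2  # pooled scatter matrix [[a, b], [b, c]]
    det = a * c - b * b
    dx, dy = mp[0] - mn[0], mp[1] - mn[1]
    # w = inverse(pooled scatter) * (mean_pos - mean_neg)
    w = ((c * dx - b * dy) / det, (a * dy - b * dx) / det)
    mid = ((mp[0] + mn[0]) / 2, (mp[1] + mn[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]  # line passes through the midpoint
    return w, threshold


def is_coding(point, w, threshold):
    """Classify one (length, splice-site score) point with the fitted line."""
    return w[0] * point[0] + w[1] * point[1] > threshold
```

With made-up training points such as `[(120, 0.8), (150, 0.9)]` for true exons and `[(40, 0.2), (60, 0.3)]` for false ones, the fitted line plays the role of the diagonal on the slide; QDA would replace the line with a quadratic boundary.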

Neural nets
A series of input, hidden and output layers
Gene structure information is fed to the input layer, and is separated into several classes
Hexamer frequencies
Splice sites
GC composition
Weights are calculated in the hidden layer to generate the exon output
When the input layer is challenged with a new sequence, the rules that were generated to output exons are applied to the new sequence

HMMs
GenScan (http://genes.mit.edu/GENSCAN.html)
5th-order HMM
Combines hexamer frequencies with coding signals
Initiation codons
TATA boxes
CAP site
Poly-A
Trained on Arabidopsis and maize data
Extensively used in the human genome project
HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
Identifies subregions of exons from cDNA or protein matches
Locks such regions and uses HMM extension into neighboring regions

17

Homology-based programs
Use translations to search for ESTs, cDNAs and proteins in databases
GenomeScan (http://genes.mit.edu/genomescan.html)
Combines GENSCAN with BLASTX
EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
Compares EST and cDNA to user sequence
TwinScan
Similar to GenomeScan

19

Consensus-based programs
Use several different programs to generate lists of predicted exons
Only common predicted exons are retained
GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php)
Combines HMMgene with GenScan
DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
Combines FGENESH, GENSCAN and HMMgene
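
The retention rule, keeping only exons that every program predicts, amounts to a set intersection. The sketch below assumes each program's output has already been parsed into (start, end) tuples and that two exons count as the same only when both coordinates match exactly; GeneComber and DIGIT use their own, more elaborate combination schemes.

```python
def consensus_exons(*predictions):
    """Return the sorted exons (start, end) present in every prediction list."""
    common = set(predictions[0])
    for exons in predictions[1:]:
        common &= set(exons)  # drop anything a later program did not report
    return sorted(common)
```

For instance, `consensus_exons([(10, 90), (200, 310)], [(10, 90), (205, 310)])` keeps only `(10, 90)`. Requiring agreement lowers false positives at the cost of missing exons that any one program fails to call.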

20
Accuracy
21
Chapter 9 Promoter and regulatory element
prediction
22

Promoters are short regions upstream of the transcription start site
Contain short (6-8 nt) transcription factor recognition sites
Extremely laborious to define by experiment
Sequence is not translated into protein, so no homology matching is possible
Each promoter is unique, with a unique combination of factor binding sites, thus no consensus promoter

23
Prokaryotic gene (diagram labels: TF site, TF, polymerase, -35 box, -10 box, ORF)

σ70 factor binds to the -35 and -10 boxes and recruits the full polymerase enzyme
-35 box consensus sequence TTGACA
-10 box consensus sequence TATAAT
Transcription factors activate or repress transcription
Bind to regulatory elements
DNA loops to allow long-distance interactions

24
Eukaryotic gene structure (diagram labels: TF site, Pol II, TF site, TATA, Inr)
Polymerase I, II and III
Basal transcription factors (TFIID, TFIIA, TFIIB, etc.)
TATA box: TATA(A/T)A(A/T)
Housekeeping genes often do not contain TATA boxes
Initiator site (Inr): (C/T)(C/T)CA(C/T)(C/T), coincides with transcription start
Many TF sites
Activation/repression
25

Ab initio methods
Promoter signals
TATA boxes
Hexamer frequencies
Consensus sequence matching
PSSM
Numerous FPs
HMMs incorporate neighboring information
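
A PSSM can be built from a handful of aligned sites and slid along a sequence. The sketch below uses log2-odds against a uniform background with a pseudocount; the function names, the background of 0.25, and the example sites in the note that follows are illustrative assumptions, not data from a real promoter database.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a list of per-position {base: log2-odds} dicts from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def scan(seq, pssm, threshold):
    """Return (offset, score) for every window scoring at least threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pssm[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits
```

Training on variants of a TATAAT-like site and scanning with a threshold of 4.0 reports only close matches. Lowering the threshold quickly adds hits in unrelated sequence, which is the "numerous FPs" problem noted on the slide.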

Promoter prediction in prokaryotes
Find operon
Upstream of first gene is promoter
Wang rules (distance between genes, no ρ-independent termination, number of genomes that display linkage)
BPROM (http://www.softberry.com)
Based on arbitrary setting of operon gene distances
200 bp upstream of first gene
Many FPs
FindTerm (http://sun1.softberry.com)
Searches for ρ-independent termination signals

27
Prediction in eukaryotes

Searching for consensus sequences in databases (TransFac)
Increase specificity by searching for CpG islands
High density of transcription factor binding sites
CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
CG content in moving window
Eponine (http://servlet.sanger.ac.uk:8080/eponine/)
Matches TATA box, CCAAT box, CpG island to PSSM
Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
Detects high concentrations of TF sites
FirstEF (http://rulai.cshl.org/tools/FirstEF/)
QDA of first exon boundary
McPromoter (http://genes.mit.edu/McPromoter.html)
Neural net of DNA bendability, TATA box, initiator box
Trained for Drosophila and human sequences
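
The moving-window CpG search mentioned in this list can be sketched as follows. The thresholds (200 bp window, GC fraction of at least 0.5, observed/expected CpG of at least 0.6) are commonly quoted CpG island criteria and are used here as assumptions; they are not claimed to be the settings of CpGProD, and the function name is hypothetical.

```python
def cpg_windows(seq, window=200, step=50, min_gc=0.5, min_oe=0.6):
    """Return (start, end) of windows that look like part of a CpG island."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")  # observed CpG dinucleotides
        gc_fraction = (c + g) / window
        # Expected CpG count is c * g / window, so obs/exp is cpg * window / (c * g).
        obs_exp = cpg * window / (c * g) if c and g else 0.0
        if gc_fraction >= min_gc and obs_exp >= min_oe:
            hits.append((start, start + window))
    return hits
```

Overlapping hits would normally be merged into a single island; that step is omitted to keep the sketch short.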

28
Phylogenetic footprinting technique

Identify conserved regulatory sites
Human-chimpanzee too close
Human-fish too distant
Human-mouse appropriate
ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
Aligns two sequences by global alignment algorithm
Identifies conserved regions and compares them to TRANSFAC database
High-scoring hits returned as positives
rVISTA (http://rvista.dcode.org)
Identifies TRANSFAC sites in two orthologous sequences
Aligns sequences with local alignment algorithm
Highest-identity regions returned as hits
Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
Aligns two sequences with Bayesian algorithm
Even weakly conserved regions identified
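
Once two orthologous sequences are aligned, the footprinting step reduces to finding windows of high identity. The sketch below assumes the alignment has already been produced by some other tool and is passed in as two equal-length gapped strings; the window size and identity cutoff are arbitrary, and the function name is hypothetical.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Return start columns of alignment windows with identity >= min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for i in range(len(aln_a) - window + 1):
        # A column counts as a match only if both bases agree and are not gaps.
        matches = sum(
            a == b and a != "-"
            for a, b in zip(aln_a[i:i + window], aln_b[i:i + window])
        )
        if matches / window >= min_identity:
            hits.append(i)
    return hits
```

The conserved windows would then be compared with known binding-site matrices, as the tools listed above do with TRANSFAC. The choice of species pair matters because near-identical sequences make every window conserved, while very distant ones leave none.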

29
Expression-profiling based method
Microarray analysis allows identification of co-regulated genes
Assume that promoters contain similar regulatory sites
Find such sites by EM and Gibbs sampling using iteration of PSSM
Co-expressed genes may be regulated at higher levels
MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
Gibbs sampling algorithm
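
The iterative PSSM idea behind Gibbs sampling can be shown with a very small sampler. This is a teaching sketch under simplifying assumptions (one site per sequence, fixed motif width, uniform background, uppercase ACGT input); it is not the AlignACE implementation, and the function name and parameters are made up.

```python
import random


def gibbs_motif(seqs, width, iterations=200, seed=0):
    """Return one motif start position per sequence found by Gibbs sampling."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, s in enumerate(seqs):
            # Build a probability profile from the current sites of all other sequences.
            others = [
                t[p:p + width]
                for j, (t, p) in enumerate(zip(seqs, positions))
                if j != i
            ]
            profile = [
                {b: (sum(site[k] == b for site in others) + 1) / (len(others) + 4)
                 for b in "ACGT"}
                for k in range(width)
            ]
            # Weight every window of the held-out sequence by its profile probability.
            weights = []
            for start in range(len(s) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= profile[k][s[start + k]]
                weights.append(p)
            # Sample a new site in proportion to those weights.
            positions[i] = rng.choices(range(len(s) - width + 1), weights=weights)[0]
    return positions
```

Because sites are sampled rather than chosen greedily, different seeds can give different answers; in practice the sampler is run several times and the best-scoring alignment is kept.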
30
Web humour
