Title: Gene Prediction
1Chapter 8 Gene Prediction
2- Automated sequencing of genomes requires automated gene assignment
- Includes detection of open reading frames (ORFs)
- Identification of the introns and exons
- Gene prediction is a very difficult problem in pattern recognition
- Coding regions generally do not have conserved sequences
- Much progress has been made with prokaryotic gene prediction
- Eukaryotic genes are more difficult to predict correctly
3- Ab initio methods
- Predict genes on given sequence alone
- Uses gene signals
- Start/stop codon
- Intron splice sites
- Transcription factor binding sites, ribosomal binding sites
- Poly-A sites
- Codon demand: multiple of three nucleotides
- Gene content
- Nucleotide composition; uses HMMs
- Homology based methods
- Matches to known genes
- Matches to cDNA
- Consensus based
- Uses output from more than one program
4- Prokaryotic gene structure
- ATG (GTG or TTG less frequently) is the start codon
- Ribosome binding site (Shine-Dalgarno sequence) complementary to 16S rRNA of ribosome
- AGGAGGT
- TAG stop codon (TAA and TGA also occur)
- Transcription termination site (ρ-independent termination)
- Stem-loop secondary structure followed by string of Ts
5- Translate sequence into 6 reading frames
- A stop codon occurs by chance about once every 20 codons
- Look for a frame longer than 30 codons (normally 50-60 codons)
- Presence of start codon and Shine-Dalgarno sequence
- Translate putative ORF into protein, and search databases
- Non-randomness of 3rd base of codon, more frequently G/C
- Plotting wobble base GC can identify ORFs
- 3rd base also repeats, thus repetition gives a clue to gene location
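The ORF search described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function names are hypothetical, the 30-codon cutoff comes from the slide, and reverse-strand coordinates are reported on the reverse-complemented sequence for simplicity.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a production gene finder).
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}   # ATG plus the rarer prokaryotic starts
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, n_codons) for each ORF found.

    start/end are 0-based, end-exclusive offsets on the scanned strand;
    n_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon in STARTS:
                    orf_start = i              # first start codon opens the ORF
                elif orf_start is not None and codon in STOPS:
                    n_codons = (i - orf_start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3, n_codons))
                    orf_start = None           # look for the next ORF in this frame
    return orfs


def gc3(orf_seq):
    """Fraction of G/C at third (wobble) codon positions of an in-frame sequence."""
    thirds = orf_seq.upper()[2::3]
    return sum(b in "GC" for b in thirds) / len(thirds) if thirds else 0.0
```

Candidate ORFs returned by `find_orfs` could then be ranked by `gc3`, or translated and searched against a protein database, as the slide suggests.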
6- Markov chains and HMMs
- Order depends on k previous positions
- The higher the order of a Markov model to describe a gene, the more non-randomness the model includes
- Genes described in codons or hexamers
- HMMs trained with known genes
- Codon pairs are often found, thus 6-nucleotide patterns often occur in ORFs: 5th-order Markov chain
- 5th-order HMM gives very accurate gene predictions
- Problem may be that in short genes there are not enough hexamers
- Interpolated Markov Model (IMM) samples different length Markov chains
- Weighting scheme places less weight on rare k-mers
- Final probability is the probability of all weighted k-mers
- Typical and atypical genes
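The idea of blending Markov chains of different order can be illustrated with a small Python sketch. The weighting used here (trust a context in proportion to how often it was seen, up to a hypothetical `full_trust` count) is an assumption chosen for simplicity; it is not the scheme used by Glimmer or any other published program, and all function names are illustrative.

```python
import math
from collections import defaultdict


def train_counts(seqs, max_order):
    """counts[k][context][base] for every order k = 0..max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for seq in seqs:
        for i, base in enumerate(seq):
            for k in range(max_order + 1):
                if i >= k:
                    counts[k][seq[i - k:i]][base] += 1
    return counts


def interpolated_prob(counts, context, base, pseudo=1.0, full_trust=20):
    """Blend orders from low to high; rarely seen contexts get little weight."""
    prob = 0.25                                  # uniform fallback
    for k in range(len(counts)):
        if k > len(context):
            break
        ctx = context[len(context) - k:]         # last k bases of the context
        seen = counts[k].get(ctx)
        total = sum(seen.values()) if seen else 0
        p_k = ((seen.get(base, 0) if seen else 0) + pseudo) / (total + 4 * pseudo)
        weight = min(1.0, total / full_trust)    # assumed weighting scheme
        prob = weight * p_k + (1 - weight) * prob
    return prob


def log_odds(counts, seq):
    """Log-likelihood ratio of seq under the model versus a uniform background."""
    order = len(counts) - 1
    score = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        score += math.log(interpolated_prob(counts, context, base) / 0.25)
    return score
```

With `max_order=5` the highest-order term corresponds to the hexamer statistics mentioned above; in short training sets most 5-base contexts are rare, so the lower orders dominate the estimate.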
7- GeneMark (http://exon.gatech.edu/genemark/)
- Trained on complete microbial genomes
- Most closely related organism used for predictions
- Glimmer (Gene Locator and Interpolated Markov Modeler) (http://www.cbcb.umd.edu/software/glimmer/)
- FGENESB (http://linux1.softberry.com/)
- 5th-order HMM
- Trained with bacterial sequences
- Linear discriminant analysis (LDA)
- RBSFinder (ftp://ftp.tigr.org)
- Takes output from Glimmer and searches for S-D sequences close to start sites
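Searching for a Shine-Dalgarno sequence close to a predicted start site can be sketched as below. This is only an illustration of the idea, not RBSFinder's algorithm: the function name, the 20-nt upstream window and the minimum number of matching positions are all assumptions, and the motif is the AGGAGGT consensus given earlier.

```python
def best_sd_match(seq, start, motif="AGGAGGT", window=20, min_matches=5):
    """Best ungapped match to `motif` in the `window` bases upstream of `start`.

    Returns (position, matching_bases), or None if no window reaches min_matches.
    """
    seq = seq.upper()
    upstream_begin = max(0, start - window)
    best = None
    for i in range(upstream_begin, start - len(motif) + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + len(motif)], motif))
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (i, matches)
    return best
```

A start codon with no acceptable match upstream might be shifted to an alternative in-frame start that does have one, which is the kind of correction such a post-processing step is meant to support.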
9- Performance evaluation
- Sensitivity: Sn = TP/(TP + FN)
- Specificity: Sp = TP/(TP + FP)
- Correlation coefficient: CC = (TP·TN - FP·FN)/[(TP + FP)(TN + FN)(TP + FN)(TN + FP)]^(1/2)
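As a worked example, the three measures can be computed from counts of true/false positives and negatives. The function name and the example counts are hypothetical; specificity is computed as TP/(TP + FP), following the definition on this slide, and the correlation coefficient uses the standard four-factor denominator.

```python
import math


def prediction_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from true/false positive/negative counts."""
    sn = tp / (tp + fn) if tp + fn else 0.0      # sensitivity
    sp = tp / (tp + fp) if tp + fp else 0.0      # specificity as defined on the slide
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Hypothetical counts: 80 coding bases found, 20 missed, 20 false calls.
sn, sp, cc = prediction_accuracy(tp=80, fp=20, tn=880, fn=20)
```

For these counts Sn and Sp are both 0.8, and CC is 70000/90000, or about 0.78.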
10Gene prediction in Eukaryotes
- Low gene density (3% in humans)
- Space between genes very large with multiply repeated sequences and transposable elements
- Eukaryotic genes are split (introns/exons)
- Transcript is capped (methylation of 5' residue)
- Splicing in spliceosome
- Alternative splicing
- Polyadenylation (250 As added) downstream of CAATAAA(T/C) consensus box
- Major issue: identification of splicing sites
- GT-AG rule (GTAAGT / Y12NCAG 5'/3' intron splice junctions)
- Codon use frequencies
- ATG start codon
- Kozak sequence (CCGCCATGG)
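The GT-AG rule alone yields a very large number of candidate introns, which is one reason splice-site identification is hard. The sketch below simply enumerates every GT...AG pair within a length range; the function name and the length limits are assumptions for illustration, and real programs score the surrounding consensus rather than accepting every pair.

```python
def candidate_introns(seq, min_len=60, max_len=10000):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive, so seq[start:end] is the
    candidate intron including both dinucleotides.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if min_len <= length <= max_len:
                introns.append((d, a + 2))
    return introns
```

On real genomic sequence the number of pairs grows roughly with the square of the sequence length, so additional evidence such as coding statistics is needed to choose among them.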
11- Ab initio programs
- Gene signals
- Start/stop
- Putative splice signals
- Consensus sequences
- Poly-A sites
- Gene content
- Coding statistics
- Non-random nucleotide distributions
- Hexamer frequencies
- HMMs
12- Discriminant analysis
- Plot 2D graph of coding length versus 3' splice site
- Place diagonal line (LDA) that separates true coding from non-coding sequences based on learnt knowledge
- QDA fits quadratic curve
- FGENES uses LDA
- MZEF (Michael Zhang's Exon Finder) uses QDA
13- Neural Nets
- A series of input, hidden and output layers
- Gene structure information is fed to input layer, and is separated into several classes
- Hexamer frequencies
- Splice sites
- GC composition
- Weights are calculated in the hidden layer to generate output of exon
- When input layer is challenged with new sequence, the rules that were generated to output exon are applied to new sequence
14- HMMs
- GenScan (http://genes.mit.edu/GENSCAN.html)
- 5th-order HMM
- Combines hexamer frequencies with coding signals
- Initiation codons
- TATA boxes
- CAP site
- Poly-A
- Trained on Arabidopsis and maize data
- Extensively used in human genome project
- HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
- Identifies sub-regions of exons from cDNA or proteins
- Locks such regions and uses HMM extension into neighboring regions
17- Homology based programs
- Use translations to search for EST, cDNA and proteins in databases
- GenomeScan (http://genes.mit.edu/genomescan.html)
- Combines GENSCAN with BLASTX
- EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
- Compares EST and cDNA to user sequence
- TwinScan
- Similar to GenomeScan
19- Consensus-based programs
- Use several different programs to generate lists of predicted exons
- Only common predicted exons are retained
- GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php)
- Combines HMMgene with GenScan
- DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
- Combines FGENESH, GENSCAN and HMMgene
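Retaining only the exons that several programs agree on can be sketched as a simple vote. This is an illustration only, not how GeneComber or DIGIT combine their inputs: the function name is hypothetical, exons are assumed to be (start, end) tuples, and agreement is assumed to mean identical coordinates.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=None):
    """Keep exons reported by at least `min_votes` programs.

    `predictions` is a list with one exon list per program. By default an
    exon must be reported by every program.
    """
    if min_votes is None:
        min_votes = len(predictions)
    votes = Counter(exon for program in predictions for exon in set(program))
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


# Hypothetical output of three programs on the same sequence.
program_a = [(10, 50), (100, 200)]
program_b = [(10, 50), (120, 200)]
program_c = [(10, 50), (100, 200)]
strict = consensus_exons([program_a, program_b, program_c])
majority = consensus_exons([program_a, program_b, program_c], min_votes=2)
```

Requiring unanimous agreement reduces false positives at the cost of sensitivity; lowering `min_votes` trades in the other direction.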
20Accuracy
21Chapter 9 Promoter and regulatory element
prediction
22- Promoters are short regions upstream of transcription start site
- Contain short (6-8 nt) transcription factor recognition sites
- Extremely laborious to define by experiment
- Sequence is not translated into protein, so no homology matching is possible
- Each promoter is unique, with a unique combination of factor binding sites, thus no consensus promoter
23Prokaryotic gene
(Figure labels: TF site, TF, polymerase, -35 box, -10 box, ORF)
- σ70 factor binds to -35 and -10 boxes and recruits full polymerase enzyme
- -35 box consensus sequence TTGACA
- -10 box consensus sequence TATAAT
- Transcription factors that activate or repress transcription
- Bind to regulatory elements
- DNA loops to allow long-distance interactions
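A naive scan for the two consensus boxes can be sketched as follows. The consensus sequences are those given above; the allowed spacer of 15-19 nt between the boxes and the tolerance of two mismatches per box are assumptions not stated on the slide, and the function names are illustrative.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_sigma70(seq, max_mismatches=2, spacers=range(15, 20)):
    """Find -35/-10 box pairs; returns (pos35, pos10, spacer, total_mismatches)."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], "TTGACA")
        if m35 > max_mismatches:
            continue
        for sp in spacers:
            j = i + 6 + sp                       # start of the candidate -10 box
            if j + 6 > len(seq):
                break
            m10 = mismatches(seq[j:j + 6], "TATAAT")
            if m10 <= max_mismatches:
                hits.append((i, j, sp, m35 + m10))
    return hits
```

Because the boxes are short and degenerate, a scan like this reports many chance matches on real genomes, which is why consensus matching alone gives numerous false positives.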
24Eukaryotic gene structure
(Figure labels: TF site, Pol II, TF site, TATA, Inr)
- Polymerase I, II and III
- Basal transcription factors (TFIID, TFIIA, TFIIB, etc.)
- TATA box (TATA(A/T)A(A/T))
- Housekeeping genes often do not contain TATA boxes
- Initiator site (Inr) (C/T)(C/T)CA(C/T)(C/T) coincides with transcription start
- Many TF sites
- Activation/repression
25- Ab initio methods
- Promoter signals
- TATA boxes
- Hexamer frequencies
- Consensus sequence matching
- PSSM
- Numerous FPs
- HMMs incorporate neighboring information
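A position-specific scoring matrix (PSSM) can be built from a set of aligned binding sites and slid along a sequence. The sketch below is a minimal illustration: the function names, the pseudocount, the uniform background and the example sites are all assumptions, not taken from any of the programs discussed.

```python
import math


def build_pssm(sites, pseudo=1.0, background=0.25):
    """Log2-odds matrix (one dict per position) from equal-length aligned sites."""
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(
                (column.count(base) + pseudo) / (len(sites) + 4 * pseudo) / background
            )
            for base in "ACGT"
        })
    return pssm


def scan(seq, pssm, threshold):
    """Return (position, score) for every window scoring at least `threshold`."""
    seq = seq.upper()
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        # Unknown characters such as N contribute a neutral score of 0.
        score = sum(pssm[k].get(seq[i + k], 0.0) for k in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits


# Hypothetical aligned sites resembling a TATA-like motif.
example_pssm = build_pssm(["TATAAT", "TATAAT", "TATGAT", "TACAAT"])
example_hits = scan("GGGGTATAATGGGG", example_pssm, threshold=5.0)
```

The threshold controls the balance between missed sites and false positives; lowering it quickly increases the number of chance hits.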
26- Promoter prediction in prokaryotes
- Find operon
- Upstream of first gene is promoter
- Wang rules (distance between genes, no ρ-independent termination, number of genomes that display linkage)
- BPROM (http://www.softberry.com)
- Based on arbitrary setting of operon gene distances
- 200 bp upstream of first gene
- Many FPs
- FindTerm (http://sun1.softberry.com)
- Searches for ρ-independent termination signals
27Prediction in eukaryotes
- Searching for consensus sequences in databases (TRANSFAC)
- Increase specificity by searching for CpG islands
- High density of transcription factor binding sites
- CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
- CG in moving window
- Eponine (http://servlet.sanger.ac.uk:8080/eponine/)
- Matches TATA box, CCAAT box, CpG island to PSSM
- Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
- Detects high concentrations of TF sites
- FirstEF (http://rulai.cshl.org/tools/FirstEF/)
- QDA of first exon boundary
- McPromoter (http://genes.mit.edu/McPromoter.html)
- Neural net of DNA bendability, TATA box, initiator box
- Trained for Drosophila and human sequences
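Measuring CG content in a moving window can be sketched as below. This is not CpGProD's implementation: the function name is hypothetical, and the defaults (200-nt window, G+C fraction above 0.5, observed/expected CpG ratio above 0.6) are commonly quoted CpG-island criteria that are assumed here rather than taken from the slide.

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return (start, gc_fraction, obs_exp_cpg) for windows that look CpG-rich."""
    seq = seq.upper()
    hits = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")                      # observed CpG dinucleotides
        gc = (c + g) / window
        # Expected CpG count if C and G were placed independently: c * g / window.
        oe = cpg * window / (c * g) if c and g else 0.0
        if gc > min_gc and oe > min_oe:
            hits.append((i, round(gc, 3), round(oe, 3)))
    return hits
```

Overlapping windows that pass the thresholds would normally be merged into a single island before reporting.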
28Phylogenetic footprinting technique
- Identify conserved regulatory sites
- Human-chimpanzee too close
- Human-fish too distant
- Human-mouse appropriate
- ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
- Aligns two sequences by global alignment algorithm
- Identifies conserved regions and compares to TRANSFAC database
- High scoring hits returned as positives
- rVISTA (http://rvista.dcode.org)
- Identifies TRANSFAC sites in two orthologous sequences
- Aligns sequences with local alignment algorithm
- Highest identity regions returned as hits
- Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
- Aligns two sequences with Bayesian algorithm
- Even weakly conserved regions identified
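Once two orthologous sequences have been aligned, conserved regions can be located with a sliding identity window. The sketch assumes the alignment has already been produced by some other tool and is supplied as two equal-length strings with "-" for gaps; the function name, window size and identity cutoff are illustrative assumptions.

```python
def conserved_windows(aln_a, aln_b, window=20, min_identity=0.8):
    """Return (start_column, identity) for alignment windows above the cutoff."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as conserved only if both residues match and neither is a gap.
    match = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    hits = []
    for i in range(len(match) - window + 1):
        identity = sum(match[i:i + window]) / window
        if identity >= min_identity:
            hits.append((i, identity))
    return hits
```

Conserved windows found this way would then be compared against known binding-site matrices, since conservation alone does not show that a region is regulatory.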
29Expression-profiling based method
- Microarray analysis allows identification of co-regulated genes
- Assume that promoters contain similar regulatory sites
- Find such sites by EM and Gibbs sampling using iteration of PSSM
- Co-expressed genes may be regulated at higher levels
- MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
- AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
- Gibbs sampling algorithm
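The iterative use of a PSSM in Gibbs sampling can be sketched as follows. This is a bare-bones illustration of the general technique, not AlignACE's or MEME's implementation: the function name, the fixed motif width, the uniform background, the pseudocount and the fixed random seed are all assumptions, and the sequences are assumed to contain only A, C, G and T.

```python
import random


def gibbs_motif(seqs, width, iterations=200, seed=0, pseudo=1.0):
    """Sample one motif start position per sequence; returns the final positions."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held in range(len(seqs)):
            # Build a frequency profile from the current sites of all other sequences.
            others = [seqs[j][pos[j]:pos[j] + width]
                      for j in range(len(seqs)) if j != held]
            profile = []
            for k in range(width):
                column = [site[k] for site in others]
                profile.append({
                    base: (column.count(base) + pseudo) / (len(column) + 4 * pseudo)
                    for base in "ACGT"
                })
            # Score every window of the held-out sequence against the profile
            # (likelihood ratio versus a uniform background) and sample a new start.
            s = seqs[held]
            weights = []
            for i in range(len(s) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= profile[k].get(s[i + k], 0.25) / 0.25
                weights.append(w)
            pos[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos
```

Because the procedure is stochastic, it is usually run several times from different starting positions and the best-scoring alignment is kept; a single run may settle on a weaker motif.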
30Web humour