Title: Genomics:
1Genomics: Gene Prediction and Annotation
Kishor K. Shende, Information Officer, Bioinformatics Center, Barkatullah University, Bhopal
2Gene Prediction Strategies
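[Figure: Prokaryote gene architecture. Promoter (-35 and -10 regions), initiation codon (ATG), one transcript encoding Protein 1, Protein 2 and Protein 3, termination codon (TAA, TAG or TGA).]
[Figure: Eukaryote gene architecture. Regulatory sequence, promoter, initiation codon (ATG), Exon-1, Intron-1, Exon-2 with splicing sites, termination codon (TAA, TAG or TGA).]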
3Codon Usage Tables
- Most amino acids can be encoded by several codons
- Each organism has a characteristic pattern of codon usage
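A codon usage table can be built by counting codons in known coding sequences. The following is a minimal sketch, assuming the input is a single coding sequence already in the correct reading frame; the function name `codon_usage` and the example sequence are illustrative only.

```python
from collections import Counter

def codon_usage(cds):
    """Return the relative frequency of each codon in a coding sequence."""
    cds = cds.upper()
    # Read complete, non-overlapping codons from position 0 (frame assumed known).
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Toy sequence: ATG GCT GCT GCC TAA (two synonymous alanine codons, GCT and GCC).
usage = codon_usage("ATGGCTGCTGCCTAA")
```

In practice the counts would be pooled over many genes of one organism, and frequencies are often reported per amino acid so that synonymous codons can be compared directly.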
4Problems in Gene Prediction
- Distinguishing pseudogenes from genes
- Exon-intron structure in eukaryotes; exon flanking regions are not very well conserved
- Alternative splicing; shuffling of exons
- Genes can overlap each other and occur on different strands of DNA
5Gene Identification
- 1. Homology-based gene prediction
- - Sequence similarity search against gene databases using BLAST and FASTA searching tools
- - EST (Expressed Sequence Tag) similarity search
- 2. Ab initio gene prediction
- Prokaryotes
- - ORF finding
- Eukaryotes
- - Promoter prediction
- - Start-stop codon prediction
- - Splice site prediction (exon-intron and intron-exon)
- - PolyA signal prediction
6ORF Finding in Prokaryotes
- Easier due to:
- - Small genomes with high gene density (Haemophilus influenzae: 85% genic)
- - No introns or few introns
- Operons
- - One transcript, many genes
- Open Reading Frames (ORFs)
- - Contiguous set of codons that starts with the Met codon and ends with a stop codon
7- 1. ORF Finding
- Simplest method
- An ORF is a length of DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
- Six possible reading frames
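[Figure: The six possible reading frames. Three frames start at positions 1, 2 and 3 of the sense strand (5' A T G C C A T C A G 3') and three start at positions 1, 2 and 3 of the antisense strand (T G C C A T T G T A); the start codon is marked on each strand.]
Central dogma: DNA -> mRNA -> Protein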
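The simplest ORF search scans each of the six reading frames for a start codon followed, in the same frame, by a stop codon. The sketch below assumes ATG as the only start codon and the standard stop codons TAA, TAG and TGA; the names `find_orfs` and `min_codons` are hypothetical, and real ORF finders apply much larger length cut-offs.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) for ATG...stop ORFs in all six frames.

    start and end are 0-based offsets on the strand that was scanned;
    the length test counts the stop codon as one codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATGACC")` reports a single ORF, `ATGAAATGA`, in frame 3 of the forward strand. An ATG that is never followed by an in-frame stop is not reported.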
8ORF Prediction Based on Position of Start Codon
[Figure: An ORF runs from a start codon (AUG) to a stop codon (UGA, UAA or UAG). A frame with a long protein-coding region between the two codes for protein; a frame with many in-frame stop codons yields no protein.]
9Example of ORF
There are six possible reading frames in each sequence: three for each direction of transcription.
10- Difficulty in ORF Prediction
- Prokaryotes and viruses: presence of multiple genes on one mRNA, and overlapping genes in which two different proteins may be encoded in different reading frames of the same mRNA
- Eukaryotes: a protein-coding region (exon) is followed by a non-coding region (intron)
- Differential mRNA splicing creates different mRNAs, hence different proteins
- Variation in the genetic code from the universal code
- Reliability of ORF prediction: characteristics of ORF regions
- - Ordered list of specific codons that reflects the evolutionary origin of the gene and the constraints associated with gene expression
- - Characteristic pattern of use of synonymous codons, i.e. codons that stand for the same amino acid
- - In eukaryotes, strong preferences for codon pairs at intron-exon or exon-intron junctions
- - Genomes with high GC content have a strong bias toward G and C in the third codon position
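The third-position GC bias can be measured directly and compared with the overall GC content. This is a small sketch, assuming the input is a coding sequence already in frame; the function names are illustrative.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def gc3_content(cds):
    """GC fraction at the third position of each complete codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    third_positions = cds[2:usable:3]  # bases at indices 2, 5, 8, ...
    return gc_content(third_positions)

# Toy sequence: ATG GCC GCG TAA, third positions are G, C, G, A.
gc3 = gc3_content("ATGGCCGCGTAA")
```

A GC3 value well above the overall GC content of a candidate ORF is one hint, not proof, that the frame is coding in a GC-rich genome.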
11Three Tests of ORFs
Three tests have been devised to verify that a predicted ORF is in fact likely to encode a protein.
- First test: based on an unusual type of sequence variation that is found in ORFs.
- Second test: the ORF is analyzed to determine whether its codons correspond to those used in other genes of the same organism.
- Third test: the ORF may be translated into an amino acid sequence, and the resulting sequence is then compared to the databases of existing sequences.
12Repeated Sequence Elements and Nucleosome Structure
1. Eukaryotic DNA is wrapped around histone-protein complexes.
2. Some base pairs in the major or minor grooves of the DNA molecule face the nucleosome surface.
3. Other base pairs face the outside of the structure.
4. Nucleosomes located in promoter regions are remodeled in a manner that can influence the availability of binding sites for regulatory proteins, making them more or less available.
Hidden Markov Model (HMM) of Eukaryotic Internal Exon
Computational background: repeated patterns of sequence have been found in the introns and exons and near the start site of transcription of eukaryotic genes.
- Bending pattern: bending is influenced by
- - Repeated pattern, i.e. (not T, A or G, G)
- - AA/TT dinucleotide
13Ab initio gene prediction
- Predictions are based on the observation that gene DNA sequence is not random:
- - Gene-coding sequence has start and stop codons.
- - Each species has a characteristic pattern of synonymous codon usage.
- - Non-coding ORFs are very short.
- - Gene would correspond to the longest ORF.
- These methods look for the characteristic features of genes and score them high.
14Ab initio gene prediction methods
- GeneScan: Fourier transform of DNA sequence to find characteristic patterns.
- GeneParser: predicts the most likely combination of exons/introns. Dynamic programming.
- GeneMark: mostly for prokaryotes, Hidden Markov Models. Also for eukaryotes.
- Grail II: predicts exons, promoters, Poly(A) sites. Neural network plus dynamic programming.
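To illustrate the Fourier idea in general terms: coding DNA tends to show a periodicity of three bases, which appears as a peak at frequency 1/3 in the spectrum of the base-indicator sequences. The sketch below is not the GeneScan implementation; it is a hedged, pure-Python illustration, and the function name `period3_power` is hypothetical.

```python
import cmath

def period3_power(seq):
    """Summed spectral power at frequency 1/3 over the four base-indicator sequences."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # DFT of the 0/1 indicator sequence of this base, evaluated at frequency 1/3.
        coeff = sum(cmath.exp(-2j * cmath.pi * n / 3)
                    for n, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total

periodic = period3_power("ATG" * 30)   # perfectly 3-periodic toy sequence
uniform = period3_power("A" * 90)      # no 3-periodicity at all
```

Here `periodic` is far larger than `uniform`. On real sequence the power is usually computed in a sliding window and compared with the average power at other frequencies.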
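15Gene Preference Score
Important indicator of coding region.
Observation: frequencies of codons and codon pairs in coding and non-coding regions are different.
Given a sequence of codons $C = C_1 C_2 \ldots C_n$ and assuming independence, the probability of finding C in a coding region is the product of the coding-region codon frequencies $f$:
$$P(C) = \prod_{i=1}^{n} f(C_i)$$
The probability of finding sequence C in non-coding regions is the same product over the non-coding frequencies $f_0$:
$$P_0(C) = \prod_{i=1}^{n} f_0(C_i)$$
The gene preference score compares the two, usually as a log-ratio:
$$LP(C) = \log \frac{P(C)}{P_0(C)}$$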
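A minimal sketch of such a score, assuming it is taken as the log of the ratio between the coding and non-coding probabilities of the codon sequence, with each probability a product of per-codon frequencies (the independence assumption). The frequency values below are hypothetical and serve only to make the example runnable.

```python
import math

def preference_score(codons, coding_freq, noncoding_freq):
    """Log-ratio of the coding vs non-coding probability of a codon sequence."""
    # Under independence the log of a product ratio is a sum of per-codon log-ratios.
    return sum(math.log(coding_freq[c] / noncoding_freq[c]) for c in codons)

# Hypothetical frequencies for illustration only; real tables come from annotated genes.
coding = {"GCT": 0.04, "TTT": 0.01}
noncoding = {"GCT": 0.02, "TTT": 0.02}
score = preference_score(["GCT", "GCT", "TTT"], coding, noncoding)
```

A positive score means the codon sequence is more probable under the coding model; a negative score favors the non-coding model. Summing logs rather than multiplying frequencies avoids numerical underflow on long sequences.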
16Confirming gene location using EST libraries
- Expressed Sequence Tags (ESTs): sequenced short segments of cDNA. They are organized in the database UniGene.
- If a region matches ESTs with high statistical significance, then it is a gene or pseudogene.
17Gene prediction accuracy
True positives (TP): nucleotides that are correctly predicted to be within the gene.
Actual positives (AP): nucleotides that are located within the actual gene.
Predicted positives (PP): nucleotides that are predicted to be in the gene.
Sensitivity = TP / AP
Specificity = TP / PP
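The two ratios can be computed at the nucleotide level once the actual and predicted gene positions are known. The sketch below represents each as a set of positions; the coordinates are hypothetical.

```python
def accuracy(actual, predicted):
    """Nucleotide-level sensitivity (TP / AP) and specificity (TP / PP).

    actual, predicted: sets of nucleotide positions inside the real / predicted gene.
    """
    tp = len(actual & predicted)  # positions that are both real and predicted
    sensitivity = tp / len(actual) if actual else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity

actual = set(range(100, 200))      # hypothetical true gene: positions 100..199
predicted = set(range(150, 230))   # hypothetical prediction: positions 150..229
sn, sp = accuracy(actual, predicted)
```

In this example 50 nucleotides overlap, so the sensitivity is 50/100 and the specificity is 50/80.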
18Gene prediction accuracy
19Common Difficulties of Gene Prediction
- First and last exons are difficult to annotate because they contain UTRs.
- Smaller genes are not statistically significant, so they are thrown out.
- Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
20Genome Analysis for Gene Prediction
Genome analysis
Genome: the sum of genes and intergenic sequences of a haploid cell.
The value of genome sequences lies in their annotation.
- Annotation: characterizing genomic features using computational and experimental methods
- Genes: levels of annotation
- - Gene prediction: where are genes?
- - What do they encode?
- - What proteins/pathways are they involved in?
21Flowchart Gene Prediction Process
Genomic DNA sequence
- 1. Translate in all six reading frames and compare to protein sequence database
- 2. Perform database similarity search of EST database of same organism
Use gene prediction program to locate genes
Analyze the regulatory sequences in the gene
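The first step of the flowchart, translation in all six reading frames, can be sketched as follows. This assumes the standard genetic code and uppercase A/C/G/T input; any codon containing another character is translated as "X". The function names are illustrative, and the resulting protein strings would then be passed to a database search tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for strand, s in (("+", dna), ("-", rc)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")
```

For the toy input, frame +1 reads ATG GCC TAA and gives `MA*`, while frame -1 reads the reverse complement TTA GGC CAT and gives `LGH`.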
22ORF Finding
- Try this first using BLAST, FASTA
- PSI-BLAST, PHI-BLAST, other BLAST/FASTA programs
- EST, cDNA database search
- Promoter, splicing site, poly-A tail, 5' UTR, 3' UTR
- Compare with genome of other organisms
23Let's have some practice on gene finding using some gene finding programs
- GeneMark (http://exon.gatech.edu/GeneMark/)
- Genscan (http://genes.mit.edu/GENSCAN.html)
- Grail II (http://compbio.ornl.gov/Grail-1.3/)
- Gene Finder in GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html)
24Gene Discovery Page
- HMMgene - Prediction of genes in vertebrate and C. elegans
- FramePlot - protein-coding region prediction tool for high GC-content bacteria
- tRNAscan-SE - Search for transfer RNA genes in genomic sequence
- NETGENE - Predict splice sites in human genes
- ORF Finder
- BCM Gene Finder
- Grail
- Genemark
- Genie - A Gene Finder Based on Generalized Hidden Markov Models
- GENSCAN - predict complete gene structures
- Splice Site Prediction by Neural Network
- Procrustes
- GenePrimer
- GenLang
- MZEF Gene Finder
- Webgene - Tools for prediction and analysis of protein-coding gene structure
- MAR-Finder - Nuclear matrix attachment region prediction
- Glimmer - bacterial/archaeal gene finder
25- Promoter Region, Transcription Factor and Signals
- TRANSFAC - Transcription Factor database
- TFD - Transcription Factor Database
- TransTerm - A Translational Signal Database
- PLACE - a database of plant cis-acting regulatory DNA elements
- NNPP - Promoter Prediction by Neural Network
- FastM/ModelInspector
- TFSEARCH
- MatInd and MatInspector
- Transcription Element Search Software (TESS)
- CorePromoter (Core-Promoter Prediction Program)
- Gene Express - analysis of genomic regulatory sequences
- Signal Scan
- PromoterInspector
- Promoter Scan II
- Pol3scan
- TargetFinder - finds DNA-binding proteins.
26Overview GENE PREDICTION TOOLS
27GeneMark(TM) (http://exon.gatech.edu/GeneMark/)
Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia
28GeneMark.hmm for Prokaryotes (Version 2.4)
Reference: Lukashin A. and Borodovsky M., GeneMark.hmm: new solutions for gene finding, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115
- Bacterial and archaeal gene prediction: you can use the parallel combination of the GeneMark and GeneMark.hmm programs.
- Heuristic approach for gene prediction in prokaryotes: if the DNA sequence of interest belongs to a species whose name is not in the list of available models, use the Heuristic models option.
- Self-training program of GeneMark: if the sequence is longer than 1 Mb, generate models with the self-training program GeneMarkS.
30Gene Prediction in Eukaryotes
Eukaryotic gene prediction: use the parallel combination of the GeneMark and GeneMark.hmm programs.
31Select the Related Organisms from this list
32Gene Prediction in EST and cDNA
To analyze ESTs and cDNAs
34Gene Prediction in Viruses
Viral gene prediction through virus database
VIOLIN
36GeneMark Output
37GeneMark Output
38New GENSCAN Web Server at MIT
40GENSCAN Output
43GrailEXP
- Locate protein-coding genes within DNA sequence,
- Locate EST/mRNA alignments,
- Locate certain types of promoters, polyadenylation sites, CpG islands, and repetitive elements.
- GrailEXP is:
- - a gene finder,
- - an EST alignment utility,
- - an exon prediction program,
- - a promoter/polyA recognizer,
- - a CpG island finder,
- - a repeat masker.
44GrailEXP
Predicts exons, genes, promoters, polyA sites, CpG islands, EST similarities, and repetitive elements within DNA sequence.
46GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html)
A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA.
GlimmerHMM: for eukaryotic organisms.
GeneSplicer: fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes.
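To give a feel for how a Markov model can separate coding from noncoding DNA, the sketch below trains two simple fixed-order Markov chains and scores a sequence by their log-odds. This is a deliberately simplified stand-in: it is not Glimmer's interpolated model, which combines several context lengths, and the training sequences, function names and pseudocount are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform 1/4
        return (ctx.get(base, 0.0) + pseudocount) / (sum(ctx.values()) + 4 * pseudocount)

    return prob

def log_odds(seq, coding_prob, noncoding_prob, order=2):
    """Sum of per-base log-ratios; positive values favor the coding model."""
    return sum(math.log(coding_prob(seq[i - order:i], seq[i])
                        / noncoding_prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

# Toy training data, invented for the example.
coding_model = train_markov(["ATGGCCGCCGCCGCC"])
noncoding_model = train_markov(["ATATATATATATATAT"])
gc_score = log_odds("GCCGCC", coding_model, noncoding_model)
at_score = log_odds("ATATAT", coding_model, noncoding_model)
```

With these toy models `gc_score` is positive and `at_score` is negative. Real gene finders train on large sets of annotated genes and typically use separate models for each codon position.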
47GlimmerM Gene Finder
49THANK YOU