Genomics:

1
Genomics Gene prediction and Annotations
Kishor K. Shende, Information Officer, Bioinformatics Center, Barkatullah University, Bhopal
2
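Gene Prediction Strategies
[Figure: Prokaryotes Gene Architecture. A promoter (-35 and -10 elements) precedes the coding regions for Protein 1, Protein 2 and Protein 3. Labels: Initiation (ATG), Termination (TAA, TAG, TGA).]
[Figure: Eukaryotes Gene Architecture. Regulatory Seq. and Promoter precede the Gene (Exon-1, Intron-1, Exon-2, with Splicing Sites), followed by Termination. Labels: Initiation (ATG), Termination (TAA, TAG, TGA).]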
3
Codon Usage Tables

Most amino acids can be encoded by several codons
Each organism has a characteristic pattern of codon usage
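
As a rough illustration of what a codon usage table holds, the sketch below counts the codons of one in-frame coding sequence and reports each codon's relative frequency. The function name `codon_usage` and the toy sequence are made up for this example; a real table would be built from many known genes of the organism.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: relative frequency} for an in-frame coding sequence."""
    cds = cds.upper()
    # Split into consecutive triplets, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy sequence: ATG, CTG, CTG, TTA, TAA.
# CTG and TTA both encode leucine, so their frequencies show the codon preference.
usage = codon_usage("ATGCTGCTGTTATAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```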

4
Problems in Gene Prediction

Distinguishing pseudogenes from genes
Exon-intron structure in eukaryotes; exon flanking regions are not very well conserved
Alternative splicing: shuffling of exons
Genes can overlap each other and occur on different strands of DNA

5
Gene Identification

1. Homology-Based Gene Prediction
Sequence similarity search against gene databases using the BLAST and FASTA searching tools
EST (Expressed Sequence Tag) similarity search

2. Ab initio Gene Prediction
Prokaryotes
- ORF finding
Eukaryotes
- Promoter prediction
- Start-Stop codon prediction
- Splice site prediction (exon-intron and intron-exon boundaries)
- PolyA signal prediction

6
ORF Finding in Prokaryotes

Easier due to:
Small genomes with high gene density (Haemophilus influenzae: 85% genic)
No introns or few introns
Operons
- One transcript, many genes
Open Reading Frames (ORFs)
- Contiguous set of codons that starts with a Met codon and ends with a stop codon

1. ORF Findings
Simplest method
Length of DNA sequence that contains a
contiguous set of codons, each of which
specifies an Amino Acid
Six possible reading frames
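
The simple ORF scan described above can be sketched as follows. This is a minimal illustration, not any particular program's method: it reads three frames on the given strand and three on its reverse complement, and reports each stretch that opens with ATG and closes at the first in-frame stop codon. The names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Yield (strand, frame, start, end, orf) for every ATG...stop ORF.

    start and end are 0-based offsets on the strand being read, and
    min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame + 1, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF in this frame


# Toy sequence made up for this example.
for orf in find_orfs("CCATGAAATGGTAAGG"):
    print(orf)  # ('+', 3, 2, 14, 'ATGAAATGGTAA')
```

An ORF that opens but never reaches a stop codon before the end of the sequence is not reported in this sketch.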

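[Figure: the six possible reading frames of double-stranded DNA. Sense strand, 5' to 3': A T G C C A T C A G, read in frames 1, 2 and 3 from the start codon. Antisense strand as shown on the slide: T G C C A T T G T A, read from positions 1, 2 and 3 from its own start codon.]
Central Dogma: DNA -> mRNA -> Protein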
8
ORF Prediction Based on Position of Start Codon and Stop Codon
[Figure: an ORF runs from a start codon (AUG) to a stop codon (UGA, UAA or UAG). A frame with an uninterrupted protein-coding region codes for protein; a frame with many in-frame stop codons gives no protein.]
9
Example of ORF
There are six possible reading frames in each sequence, three for each direction of transcription.
10

Difficulty in ORF Prediction
Prokaryotes and viruses: presence of multiple genes on one mRNA, and overlapping genes in which two different proteins may be encoded in different reading frames of the same mRNA
Eukaryotes: a protein-coding region (exon) is followed by a non-coding region (intron)
Differential mRNA splicing creates different mRNAs, hence different proteins
Variation in the genetic code from the universal code

Reliability of ORF Prediction: Characteristics of ORF Regions
An ordered list of specific codons that reflects the evolutionary origin of the gene and the constraints associated with gene expression
A characteristic pattern of use of synonymous codons, i.e. codons that stand for the same amino acid
In eukaryotes, strong preferences for codon pairs at intron-exon or exon-intron junctions
Genomes with high GC content have a strong bias toward G and C in the third codon position
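
The third-position GC bias can be measured directly. The sketch below is a minimal example with a made-up function name (`gc3`) and a toy sequence; it assumes the input is already in frame.

```python
def gc3(cds):
    """Fraction of codons in an in-frame sequence whose third base is G or C."""
    cds = cds.upper()
    third_bases = cds[2::3]  # every third base, starting at codon position 3
    if not third_bases:
        return 0.0
    return sum(base in "GC" for base in third_bases) / len(third_bases)


# Toy sequence: ATG GCC GCG TAA has third bases G, C, G, A.
print(gc3("ATGGCCGCGTAA"))  # 0.75
```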

11
Three Tests of an ORF
Three tests have been devised to verify that a predicted ORF is in fact likely to encode a protein.
First test: based on an unusual type of sequence variation that is found in ORFs.
Second test: the ORF is analyzed to determine whether its codons correspond to those used in other genes of the same organism.
Third test: the ORF may be translated into an amino acid sequence and the resulting sequence then compared to the databases of existing sequences.
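
The translation step of the third test can be sketched as below, assuming the standard (universal) genetic code. The table is built from the usual TCAG ordering; the function name `translate` is chosen for this example. The database comparison itself (for instance a BLAST search with the resulting protein) is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        # Codons with ambiguous bases are reported as X.
        aa = CODON_TABLE.get(orf[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGAAATGGTAA"))  # MKW
```

Organisms or organelles that deviate from the universal code would need a different `AMINO_ACIDS` string.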
12
Repeated Sequence Elements and Nucleosome Structure
1. Eukaryotic DNA is wrapped around histone-protein complexes.
2. Some base pairs in the major or minor grooves of the DNA molecule face the nucleosome surface.
3. Other base pairs face the outside of the structure.
4. Nucleosomes located in promoter regions are remodeled in a manner that can influence the availability of binding sites for regulatory proteins, making them more or less available.

Hidden Markov Model (HMM) of Eukaryotic Internal Exon
Computational background: repeated patterns of sequence have been found in the introns and exons and near the start site of transcription of eukaryotic genes.

Bending pattern: bending is influenced by a repeated pattern, i.e. (not T, A or G, G), and by the AA/TT dinucleotide.

13
Ab initio gene prediction

Predictions are based on the observation that gene DNA sequence is not random:
- A gene-coding sequence has start and stop codons.
- Each species has a characteristic pattern of synonymous codon usage.
- Non-coding ORFs are typically very short.
- A gene would correspond to the longest ORF.
These methods look for the characteristic features of genes and score them high.

14
Ab initio gene prediction methods

GeneScan: Fourier transform of DNA sequence to find characteristic patterns.
GeneParser: predicts the most likely combination of exons/introns. Dynamic programming.
GeneMark: mostly for prokaryotes, also for eukaryotes. Hidden Markov Models.
Grail II: predicts exons, promoters, poly(A) sites. Neural network plus dynamic programming.

15
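Gene Preference Score: an important indicator of a coding region
Observation: frequencies of codons and codon pairs in coding and non-coding regions are different.
Given a sequence of codons C, and assuming independence, the probability of finding C in a coding region is the product of the coding-region frequencies of its codons.
The probability of finding sequence C in non-coding regions is the corresponding product of non-coding frequencies.
The gene preference score compares these two probabilities.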
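
One common way to compare the two probabilities is a log-likelihood ratio, log(P(C) / P0(C)), which under the independence assumption becomes a sum over codons. The sketch below assumes that form; the function name and the toy frequency tables are invented for illustration and are not taken from any real organism.

```python
import math


def gene_preference_score(codons, coding_freq, noncoding_freq):
    """Sum of log(f_coding(c) / f_noncoding(c)) over codons (assumed form).

    A positive score means the codons look more like coding sequence
    than non-coding sequence under the two frequency tables.
    """
    return sum(math.log(coding_freq[c] / noncoding_freq[c]) for c in codons)


# Hypothetical frequencies, for illustration only.
coding_freq = {"ATG": 0.03, "CTG": 0.05, "TTA": 0.01}
noncoding_freq = {"ATG": 0.02, "CTG": 0.02, "TTA": 0.02}

score = gene_preference_score(["ATG", "CTG", "CTG"], coding_freq, noncoding_freq)
print(score > 0)  # True: CTG is enriched in the toy coding table
```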

16
Confirming gene location using EST libraries

Expressed Sequence Tags (ESTs) are short sequenced segments of cDNA. They are organized in the UniGene database.
If a region matches ESTs with high statistical significance, then it is a gene or pseudogene.

17
Gene prediction accuracy
True positives (TP): nucleotides which are correctly predicted to be within the gene.
Actual positives (AP): nucleotides which are located within the actual gene.
Predicted positives (PP): nucleotides which are predicted to be in the gene.
Sensitivity = TP / AP
Specificity = TP / PP
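
A small worked example of these two ratios, using sets of nucleotide positions. The function name and the coordinates are made up for this sketch.

```python
def prediction_accuracy(actual, predicted):
    """Return (sensitivity, specificity) as defined above.

    actual and predicted are sets of nucleotide positions that lie
    inside the real gene and inside the predicted gene respectively.
    """
    tp = len(actual & predicted)  # positions correctly predicted as genic
    sensitivity = tp / len(actual) if actual else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity


actual = set(range(100, 200))     # AP = 100 nucleotides
predicted = set(range(150, 230))  # PP = 80 nucleotides, 50 of them correct
print(prediction_accuracy(actual, predicted))  # (0.5, 0.625)
```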
18
Gene prediction accuracy
19
Common Difficulties of Gene Prediction

First and last exons difficult to annotate
because they contain UTRs.
Smaller genes are not statistically significant
so they are thrown out.
Algorithms are trained with sequences from known
genes which biases them against genes about which
nothing is known.

20
Genome Analysis for Gene Prediction
Genome analysis
Genome: the sum of the genes and intergenic sequences of a haploid cell.
The value of genome sequences lies in their annotation.

Annotation: characterizing genomic features using computational and experimental methods.
Genes: levels of annotation
Gene prediction: where are the genes?
What do they encode?
What proteins/pathways are they involved in?

21
Flowchart: Gene Prediction Process

Genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence database
2. Perform database similarity search of EST database of same organism
Use gene prediction program to locate genes
Analyze the regulatory sequences in the gene
22
ORF Finding
Try this first, using BLAST or FASTA
PSI-BLAST, PHI-BLAST and other BLAST/FASTA programs; EST and cDNA database search
Promoter, splicing site, poly-A tail, 5' UTR, 3' UTR
Compare with genome of other organism
23
Let's have some practice on gene finding using some gene-finding programs

GeneMark (http://exon.gatech.edu/GeneMark/)
Genscan (http://genes.mit.edu/GENSCAN.html)
Grail II (http://compbio.ornl.gov/Grail-1.3/)
Gene Finder in GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html)

24
HMMgene - Prediction of genes in vertebrate and C. elegans
Gene Discovery Page
FramePlot - protein-coding region prediction tool for high GC-content bacteria
tRNAscan-SE - Search for transfer RNA genes in genomic sequence
NETGENE - Predict splice sites in human genes
ORF Finder
BCM Gene Finder
Grail
Genemark
Genie - A Gene Finder Based on Generalized Hidden Markov Models
GENSCAN - predict complete gene structures
Splice Site Prediction by Neural Network
Procrustes
GenePrimer
GenLang
MZEF Gene Finder
Webgene - Tools for prediction and analysis of protein-coding gene structure
MAR-Finder - Nuclear matrix attachment region prediction
Glimmer - bacterial/archaeal gene finder
25

Promoter Region, Transcription Factor and Signals
TRANSFAC - Transcription Factor database
TFD - Transcription Factor Database
TransTerm - A Translational Signal Database
PLACE - a database of plant cis-acting regulatory DNA elements
NNPP - Promoter Prediction by Neural Network
FastM/ModelInspector
TFSEARCH
MatInd and MatInspector
Transcription Element Search Software (TESS)
CorePromoter (Core-Promoter Prediction Program)
Gene Express - analysis of genomic regulatory sequences
Signal Scan
PromoterInspector
Promoter Scan II
Pol3scan
TargetFinder - finds DNA-binding proteins.

26
Overview: Gene Prediction Tools
27
GeneMark(TM) (http://exon.gatech.edu/GeneMark/)
Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia

28
GeneMark.hmm for Prokaryotes (Version 2.4)
Reference: Lukashin A. and Borodovsky M., GeneMark.hmm: new solutions for gene finding, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115
Bacterial and archaeal gene prediction: you can use the parallel combination of the GeneMark and GeneMark.hmm programs.
Heuristic approach for gene prediction in prokaryotes: if the DNA sequence of interest belongs to a species whose name is not in the list of available models, use the Heuristic models option.
Self-training program of GeneMark: if the sequence is longer than 1 Mb, generate models with the self-training program GeneMarkS.
29
(No Transcript)
30
Gene Prediction in Eukaryotes
Eukaryotic gene prediction: use the parallel combination of the GeneMark and GeneMark.hmm programs
31
Select the Related Organisms from this list
32
Gene Prediction in EST and cDNA
To analyze ESTs and cDNAs
33
(No Transcript)
34
Gene Prediction in Viruses
Viral gene prediction through virus database
VIOLIN
35
(No Transcript)
36
GeneMark Output
37
GeneMark Output
38
New GENSCAN Web Server at MIT
39
(No Transcript)
40
GENSCAN Output
41
(No Transcript)
42
(No Transcript)
43
GrailEXP

Locate protein coding genes within DNA sequence,
Locate EST/mRNA alignments,
Locate certain types of promoters,
polyadenylation sites, CpG islands, and
repetitive elements.

GrailEXP is a gene finder,
an EST alignment utility,
an exon prediction program,
a promoter/polyA recognizer,
a CpG island finder,
and a repeat masker.

44
GrailEXP
Predicts exons, genes, promoters, polyA sites, CpG islands, EST similarities, and repetitive elements within DNA sequence
45
(No Transcript)
46
GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html)
A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA.
GlimmerHMM: for eukaryotic organisms
GeneSplicer: fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes.
47
GlimmerM Gene Finder
48
(No Transcript)
49
THANK YOU
Kishor K. Shende, Information Officer, Bioinformatics Center, Barkatullah University, Bhopal
