Title: Genome Bioinformatics
1Genome Bioinformatics
- Tyler Alioto
- Center for Genomic Regulation Barcelona, Spain
2Node 1 of the INB
- GN1 Bioinformática y Genómica
- Genome Bioinformatic Lab, CRG
- Roderic Guigó (PI)
3Themes
- Gene prediction
- ab initio -> GeneID
- dual-genome -> SGP2
- U12 introns -> GeneID v1.3 and U12DB
- combiner -> GenePC
- Genome feature visualization
- gff2ps
- Alternative splicing
- ASTALAVISTA
- Gene expression regulatory elements
- meta and mmeta alignment
4Eukaryotic gene structure
5Eukaryotic gene structure
[Figure: eukaryotic gene structure, with labels for the upstream regulator, promoter, exons, introns (donor and acceptor splice sites), and downstream regulator]
6The Splicing Code
7Gene Prediction Strategies
- Expressed sequence (cDNA) or protein sequence available?
- Yes -> Spliced alignment
- BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
- No -> Integrated gene prediction
- Informant genome(s) available?
- Yes -> Dual or n-genome de novo predictors
- SGP2, Twinscan, NSCAN
- (Genomescan: same- or cross-genome protein blastx)
- No -> ab initio predictors
- geneid, genscan, augustus, fgenesh, genemark, etc.
- Many newer gene predictors can run in multiple modes depending on the evidence available.
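The decision tree above can be written down as a small function. This is only a sketch of the slide's logic: the function name and the two boolean inputs are illustrative, and the tool lists are copied from the bullets.

```python
from typing import List, Tuple


def choose_strategy(has_expressed_or_protein: bool,
                    has_informant_genome: bool) -> Tuple[str, List[str]]:
    """Return (strategy, example tools) following the decision tree on the slide."""
    if has_expressed_or_protein:
        # cDNA/EST or protein evidence exists: align it to the genome.
        return ("spliced alignment",
                ["BLAT", "Exonerate", "est_genome", "spidey", "GMAP", "Genewise"])
    if has_informant_genome:
        # No expressed evidence, but a second genome can act as informant.
        return ("dual or n-genome de novo prediction",
                ["SGP2", "Twinscan", "NSCAN"])
    # Nothing but the target sequence itself.
    return ("ab initio prediction",
            ["geneid", "genscan", "augustus", "fgenesh", "genemark"])


if __name__ == "__main__":
    strategy, tools = choose_strategy(False, True)
    print(strategy, tools)
```

In practice the branches are not exclusive, since many newer predictors accept several kinds of evidence at once.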
8Frameworks for gene prediction
- Hierarchical exon-building and chaining
- Hidden Markov Models (many flavors)
- HMM, GHMM, GPHMM, Phylo-HMM
- Conditional Random Fields (new!)
- Conrad, Contrast... and, no doubt, more to come
- All of them involve parsing the optimal path of exons using dynamic programming
- (e.g. GenAmic, Viterbi algorithms)
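As an illustration of the dynamic programming step, here is a minimal Viterbi sketch on a toy two-state HMM. The states ("N" for non-coding, "C" for coding) and all probabilities are invented for the example; real gene-finding HMMs have many more states, and geneid itself uses GenAmic rather than an HMM.

```python
import math
from typing import Dict, List, Tuple


def viterbi(obs: str,
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> Tuple[float, List[str]]:
    """Return (log-probability, state path) of the most likely path for obs.

    All probabilities must be non-zero, because logs are taken directly.
    """
    # scores[t][s]: best log-probability of any path that ends in state s at t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        scores.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, scores[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[t][s] = best + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return scores[-1][last], path


# Toy parameters (assumptions): coding regions are GC-rich, states are sticky.
STATES = ["N", "C"]
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

if __name__ == "__main__":
    print(viterbi("AAAAGGGG", STATES, START, TRANS, EMIT)[1])
```

Working in log space avoids numerical underflow on long sequences.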
9How does GeneID approach gene prediction?
10The gene prediction problem
[Figure: candidate splice sites (acceptors a1-a4, donors d1-d5) define candidate exons (e1-e8), which are assembled into genes]
11GeneID
- Geneid follows a hierarchical structure
- signal -> exon -> gene
- Exon score
- Score of exon-defining signals + protein-coding potential (log-likelihood ratios)
- Dynamic programming algorithm
- maximize score of assembled exons -> assembled gene
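The exon score and the assembly step can be sketched as follows. This is a simplified stand-in, not geneid's own algorithm: it only requires that chained exons do not overlap, and ignores the reading-frame and exon-type compatibility rules a real gene assembler enforces. All names and the example exons are hypothetical.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int, float]  # (start, end, score); 1-based, inclusive


def exon_score(signal_scores: List[float], coding_potential: float) -> float:
    """Exon score = scores of the defining signals + coding potential."""
    return sum(signal_scores) + coding_potential


def chain_exons(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the best total score and the non-overlapping chain achieving it."""
    if not exons:
        return 0.0, []
    ordered = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    best: List[float] = []          # best[i]: top score of a chain ending at ordered[i]
    back: List[Optional[int]] = []  # back[i]: index of the previous exon, if any
    for i, (start, _end, score) in enumerate(ordered):
        prev_score, prev_idx = 0.0, None
        for j in range(i):
            # Exon j may precede exon i only if it ends before exon i starts.
            if ordered[j][1] < start and best[j] > prev_score:
                prev_score, prev_idx = best[j], j
        best.append(score + prev_score)
        back.append(prev_idx)
    idx: Optional[int] = max(range(len(ordered)), key=lambda k: best[k])
    total = best[idx]
    chain: List[Exon] = []
    while idx is not None:
        chain.append(ordered[idx])
        idx = back[idx]
    chain.reverse()
    return total, chain


if __name__ == "__main__":
    candidates = [(1, 100, 2.0), (50, 150, 3.0), (160, 200, 1.5), (120, 180, 1.0)]
    print(chain_exons(candidates))
```

The double loop is quadratic in the number of candidate exons, which is fine for a sketch; production gene finders use more efficient chaining.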
12Training GeneID
Pos  1    2    3    4    5    6    7    8    9
A    0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C    0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G    0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T    0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6
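A position frequency table like this one can be turned into a log-likelihood ratio score for a candidate site. The sketch below makes two assumptions that are not on the slide: a uniform background of 0.25 per base, and that a zero frequency disqualifies a site outright (a real parameter file would more likely use pseudocounts). The invariant G and T at positions 4 and 5 suggest a donor-like site, but the slide does not label it.

```python
import math
from typing import List, Tuple

# Frequencies copied from the table; list index 0 is position 1.
PWM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}
BACKGROUND = 0.25  # assumption: uniform base composition
WIDTH = 9


def site_score(site: str) -> float:
    """Log2-likelihood ratio of a 9-base window under the table vs. background."""
    if len(site) != WIDTH:
        raise ValueError("site must be exactly 9 bases long")
    total = 0.0
    for pos, base in enumerate(site.upper()):
        p = PWM[base][pos]
        if p == 0.0:
            return float("-inf")  # base never observed at this position
        total += math.log2(p / BACKGROUND)
    return total


def scan(seq: str) -> List[Tuple[int, float]]:
    """Return (0-based offset, score) for every window with a finite score."""
    hits = []
    for i in range(len(seq) - WIDTH + 1):
        s = site_score(seq[i:i + WIDTH])
        if s != float("-inf"):
            hits.append((i, s))
    return hits


if __name__ == "__main__":
    print(scan("CCTAGGTAAGTCC"))
```

Bases other than A, C, G, T (for example N) raise a `KeyError` in this sketch.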
13Running GeneID: command line or on geneid server
NAME
    geneid - a program to annotate genomic sequences
SYNOPSIS
    geneid [-bdaefitnxszr] [-DA] [-Z] [-p gene_prefix] [-G] [-3] [-X] [-M] [-m]
           [-WCF] [-o] [-j lower_bound_coord] [-k upper_bound_coord]
           [-O <gff_exons_file>] [-R <gff_annotation-file>] [-S <gff_homology_file>]
           [-P <parameter_file>] [-E exonweight] [-V evidence_exonweight]
           [-Bv] [-h] <locus_seq_in_fasta_format>
RELEASE
    geneid v 1.3
OPTIONS
    -b  Output Start codons
    -d  Output Donor splice sites
    -a  Output Acceptor splice sites
    -e  Output Stop codons
    -f  Output Initial exons
    -i  Output Internal exons
    -t  Output Terminal exons
    -n  Output introns
    -s  Output Single genes
    -x  Output all predicted exons
    -z  Output Open Reading Frames
    -D  Output genomic sequence of exons in predicted genes
    -A  Output amino acid sequence derived from predicted CDS
    -p  Prefix this value to the names of predicted genes, peptides and CDS
    -G  Use GFF format to print predictions
    -3  Use GFF3 format to print predictions
    -X  Use extended-format to print gene predictions
    -M  Use XML format to print gene predictions
    -m  Show DTD for XML-format output
    -j  Begin prediction at this coordinate
    -k  End prediction at this coordinate
    -W  Only Forward sense prediction (Watson)
    -C  Only Reverse sense prediction (Crick)
    -U  Allow U12 introns (Requires appropriate U12 parameters to be set in the parameter file)
    -r  Use recursive splicing
    -F  Force the prediction of one gene structure
    -o  Only running exon prediction (disable gene prediction)
    -O <exons_filename>  Only running gene prediction (not exon prediction)
    -Z  Activate Open Reading Frames searching
    -R <exons_filename>  Provide annotations to improve predictions
    -S <HSP_filename>  Using information from protein sequence alignments to improve predictions
    -E  Add this value to the exon weight parameter (see parameter file)
    -V  Add this value to the score of evidence exons
    -P <parameter_file>  Use other than default parameter file (human)
    -B  Display memory required to execute geneid given a sequence
    -v  Verbose. Display info messages
    -h  Show this help
AUTHORS
    geneid_v1.3 has been developed by Enrique Blanco, Tyler Alioto and Roderic Guigo.
    Parameter files have been created by Genis Parra and Tyler Alioto.
    Any bug or suggestion can be reported to geneid_at_imim.es
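A typical invocation can be assembled from the options listed in the help text. The sketch below only builds the argument list; the helper name and the file names `human.param` and `HS307871.fa` are placeholders, and running the command requires a local geneid installation.

```python
import shlex
from typing import List, Optional


def build_geneid_command(fasta_path: str,
                         parameter_file: Optional[str] = None,
                         gff: bool = True,
                         forward_only: bool = False,
                         begin: Optional[int] = None,
                         end: Optional[int] = None) -> List[str]:
    """Assemble a geneid argument list using flags from the help text."""
    cmd = ["geneid"]
    if parameter_file is not None:
        cmd += ["-P", parameter_file]  # use other than the default parameter file
    if gff:
        cmd.append("-G")               # print predictions in GFF format
    if forward_only:
        cmd.append("-W")               # forward (Watson) strand only
    if begin is not None:
        cmd += ["-j", str(begin)]      # begin prediction at this coordinate
    if end is not None:
        cmd += ["-k", str(end)]        # end prediction at this coordinate
    cmd.append(fasta_path)             # sequence in FASTA format goes last
    return cmd


if __name__ == "__main__":
    cmd = build_geneid_command("HS307871.fa", parameter_file="human.param")
    print(shlex.join(cmd))
    # To actually run it (needs geneid on PATH):
    # import subprocess
    # result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # print(result.stdout)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.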
14GeneID output
## gff-version 2
## date Mon Nov 26 14:37:15 2007
## source-version geneid v 1.2 -- geneid_at_imim.es
# Sequence HS307871 - Length 4514 bps
# Optimal Gene Structure. 1 genes. Score 16.20
# Gene 1 (Forward). 9 exons. 391 aa. Score 16.20
HS307871	geneid_v1.2	Internal	1710	1860	-0.11	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	1976	2055	0.24	+	2	HS307871_1
HS307871	geneid_v1.2	Internal	2132	2194	0.44	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	2434	2682	4.66	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	2749	2910	3.19	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3279	3416	0.97	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3576	3676	3.23	+	0	HS307871_1
HS307871	geneid_v1.2	Internal	3780	3846	-0.96	+	1	HS307871_1
HS307871	geneid_v1.2	Terminal	4179	4340	4.55	+	0	HS307871_1
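The exon coordinates in this output can be checked against the header line. The sketch below sums the exon lengths and verifies that each exon's frame follows from the previous one. It assumes a forward-strand gene and that the frame column counts the bases before the first complete codon, as in GFF; the function names are illustrative.

```python
from typing import List, Tuple

# (start, end, frame) for the nine exons of gene HS307871_1, copied from the output.
EXONS: List[Tuple[int, int, int]] = [
    (1710, 1860, 0), (1976, 2055, 2), (2132, 2194, 0),
    (2434, 2682, 0), (2749, 2910, 0), (3279, 3416, 0),
    (3576, 3676, 0), (3780, 3846, 1), (4179, 4340, 0),
]


def cds_length(exons: List[Tuple[int, int, int]]) -> int:
    """Total coding length; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end, _frame in exons)


def frames_consistent(exons: List[Tuple[int, int, int]]) -> bool:
    """Check that each exon's frame completes the codon left open by the previous exon."""
    for (start, end, frame), (_s, _e, next_frame) in zip(exons, exons[1:]):
        length = end - start + 1
        leftover = (length - frame) % 3  # bases of a partial codon at the exon's end
        expected = (3 - leftover) % 3    # bases the next exon must supply first
        if next_frame != expected:
            return False
    return True


if __name__ == "__main__":
    total = cds_length(EXONS)
    print(total, total // 3, frames_consistent(EXONS))
```

The nine exons add up to 1173 bases, or 391 codons, which agrees with the "391 aa" reported in the header.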
15GFF: a standard annotation format
- Stands for
- Gene Finding Format -or- General Feature Format
- Designed as a single-line record for describing features on DNA sequence, originally used for gene prediction output
- 9 tab-delimited fields common to all versions
- seq source feature begin end score strand frame group
- The group field differs between versions, but in every case no tabs are allowed
- GFF2: group is a unique description, usually the gene name.
- NCOA1
- GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced; start_codon and stop_codon are required features for CDS
- transcript_id "NM_056789"; gene_id "NCOA1"
- GFF3: capitalized tags follow Sequence Ontology (SO) relationships; FASTA seqs can be embedded
- ID=NM_056789_exon1; Parent=NM_056789; note=5' UTR exon
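The difference between the three group-field styles is easiest to see in a parser. The sketch below assumes the usual conventions of each format (quoted values separated by semicolons in GTF, `tag=value` pairs separated by semicolons in GFF3); the function names and the `version` labels are illustrative.

```python
from typing import Dict, List
from urllib.parse import unquote


def split_record(line: str) -> List[str]:
    """Split one GFF record into its nine tab-delimited fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-delimited fields, got %d" % len(fields))
    return fields


def parse_group(group: str, version: str) -> Dict[str, str]:
    """Parse the ninth field according to the GFF flavour."""
    if version == "gff2":
        # A single free-text description, usually the gene name.
        return {"group": group.strip()}
    attrs: Dict[str, str] = {}
    for part in group.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        if version == "gtf":
            tag, _sep, value = part.partition(" ")
            attrs[tag] = value.strip().strip('"')
        elif version == "gff3":
            tag, _sep, value = part.partition("=")
            attrs[tag] = unquote(value)  # GFF3 values may be URL-escaped
        else:
            raise ValueError("unknown version: " + version)
    return attrs


if __name__ == "__main__":
    print(parse_group("NCOA1", "gff2"))
    print(parse_group('transcript_id "NM_056789"; gene_id "NCOA1";', "gtf"))
    print(parse_group("ID=NM_056789_exon1; Parent=NM_056789; note=5' UTR exon", "gff3"))
```

Only the ninth field needs version-specific handling; the first eight can be read the same way in every version.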
17Visualizing features with gff2ps
generated by Josep Abril
18Visualizing features on UCSC genome browser (custom tracks)
- If your genome is served by UCSC, this is a good option because
- browsing is dynamic
- access to other annotations
- can view DNA sequence
- can do complex intersections and filtering
- gff2ps is good when
- your genome is not on UCSC
- you want more flexible layout options
- you want to run it offline
19Extensions to GeneID
- Syntenic Gene Prediction (dual-genome)
- Evidence-based (constrained) gene prediction
- U12 intron detection
- Combining gene predictions
- Selenoprotein gene prediction
20Syntenic Gene Prediction: SGP2
21Minor splicing and U12 introns
- U12 introns make up a minor proportion of all introns (0.33% in human, less in insects)
- But they can be found in 2-3% of genes
- Normally ignored, but this causes annotation problems
- Easy to predict due to highly conserved donor and branch sites
22Splice Signal Profiles: major and minor
23Gathering U12 Introns
[Figure: pipeline for gathering U12 introns in human and fruit fly. Recoverable labels: alignment to EST/mRNA; prediction on the genome; scoring; merging; all annotated introns (ENSEMBL?); published introns; ortholog search (17 species) by spliced alignment; U12 DB. Counts shown in the figure, without recoverable labels: 2084, 563, 568, 385, 658, 597.]
25Coming Soon: GenePC, a Gene Prediction Combiner
26Tutorial Homepage
- http://genome.imim.es/courses/Pamplona07/
GBL Homepage