Genome Bioinformatics - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Genome Bioinformatics


1
Genome Bioinformatics
  • Tyler Alioto
  • Center for Genomic Regulation Barcelona, Spain

2
Node 1 of the INB
  • GN1 Bioinformática y Genómica
  • Genome Bioinformatic Lab, CRG
  • Roderic Guigó (PI)

3
Themes
  • Gene prediction
  • ab initio -> GeneID
  • dual-genome -> SGP2
  • u12 introns -> GeneID v1.3 and U12DB
  • combiner -> GenePC
  • Genome feature visualization
  • gff2ps
  • Alternative splicing
  • ASTALAVISTA
  • Gene expression regulatory elements
  • meta and mmeta alignment

4
Eukaryotic gene structure
5
Eukaryotic gene structure
[Figure: eukaryotic gene structure. Labels: upstream regulator, promoter, exons, introns (donor and acceptor sites), downstream regulator]
6
The Splicing Code
7
Gene Prediction Strategies
  • Expressed Sequence (cDNA) or protein sequence
    available?
  • Yes -> Spliced alignment
  • BLAT, Exonerate, est_genome, spidey, GMAP,
    Genewise
  • No -> Integrated gene prediction
  • Informant genome(s) available?
  • Yes -> Dual or n-genome de novo predictors
  • SGP2, Twinscan, NSCAN
  • (Genomescan: same or cross genome protein
    blastx)
  • No -> ab initio predictors
  • geneid, genscan, augustus, fgenesh, genemark,
    etc.
  • Many newer gene predictors can run in multiple
    modes depending on the evidence available.
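
The decision tree on this slide can be written down as a small function. This is only a sketch of the slide's logic; the function name, argument names and the returned labels are illustrative and not part of any of the tools listed.

```python
def choose_strategy(has_expressed_or_protein_seq: bool,
                    has_informant_genome: bool) -> str:
    """Return the family of methods suggested by the slide's decision tree."""
    if has_expressed_or_protein_seq:
        # cDNA/EST or protein evidence exists: align it to the genome.
        return "spliced alignment"
    if has_informant_genome:
        # No transcript evidence, but a related genome is available.
        return "dual- or n-genome de novo prediction"
    # Nothing but the genomic sequence itself.
    return "ab initio prediction"


# Example tools named on the slide for each branch.
TOOLS = {
    "spliced alignment": ["BLAT", "Exonerate", "est_genome", "spidey", "GMAP", "Genewise"],
    "dual- or n-genome de novo prediction": ["SGP2", "Twinscan", "NSCAN"],
    "ab initio prediction": ["geneid", "genscan", "augustus", "fgenesh", "genemark"],
}

print(choose_strategy(False, True), "->", TOOLS[choose_strategy(False, True)])
```

As the last bullet notes, many newer predictors blur these branches by accepting whatever evidence is available.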

8
Frameworks for gene prediction
  • Hierarchical exon-building and chaining
  • Hidden Markov Models (many flavors)
  • HMM, GHMM, GPHMM, Phylo-HMM
  • Conditional Random Fields (new!)
  • Conrad, Contrast... and, no doubt, more to come
  • All of them involve parsing the optimal path of
    exons using dynamic programming
  • (e.g. GenAmic, Viterbi algorithms)
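
To make "parsing the optimal path using dynamic programming" concrete, here is a minimal log-space Viterbi sketch for a toy two-state HMM. The states, probabilities and test sequences are invented for illustration; real gene finders use far richer state spaces (exon phases, splice sites, explicit length distributions in GHMMs).

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state path for a non-empty observation sequence."""
    # best[i][s]: log-probability of the best path ending in state s at position i
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score + log_emit[s][obs[i]]
            back[i][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


def logs(d):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in d.items()}


# Toy model (all numbers are assumptions): "N" = noncoding, "C" = coding.
STATES = ["N", "C"]
LOG_START = logs({"N": 0.5, "C": 0.5})
LOG_TRANS = {"N": logs({"N": 0.9, "C": 0.1}), "C": logs({"N": 0.1, "C": 0.9})}
LOG_EMIT = {
    "N": logs({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
    "C": logs({"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}),  # GC-rich toy "coding" state
}

print("".join(viterbi("AATTAAGCGCGCGGCCTTAA", STATES, LOG_START, LOG_TRANS, LOG_EMIT)))
```

The same recurrence (best score so far plus a transition term plus an emission term, with back-pointers) underlies the HMM-family predictors listed on the slide.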

9
How does GeneID approach gene prediction?
10
The gene prediction problem
[Figure: the gene prediction problem. Candidate sites (acceptors a1-a4, donors d1-d5) define candidate exons (e1-e8), which are assembled into genes]
11
GeneID
  • Geneid follows a hierarchical structure
  • signal -> exon -> gene
  • Exon score
  • Score of exon-defining signals + protein-coding
    potential (log-likelihood ratios)
  • Dynamic programming algorithm
  • maximize score of assembled exons -> assembled gene
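
The "maximize score of assembled exons" step can be illustrated with a simplified chaining routine. This is not geneid's actual algorithm: it only picks the highest-scoring set of non-overlapping candidate exons and ignores frame compatibility, exon type (initial, internal, terminal) and strand, all of which a real gene model must respect. The candidate exons below are made up.

```python
from bisect import bisect_left


def best_exon_chain(exons):
    """exons: list of (start, end, score), 1-based inclusive coordinates.

    Returns (best_total_score, chain), where chain is a list of
    non-overlapping exons in genomic order.
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    # best[i]: (score, chain) using only the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_left(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


# Hypothetical candidate exons: (start, end, score)
candidates = [(100, 200, 3.0), (150, 300, 4.0), (320, 400, 2.0), (250, 350, -1.0)]
print(best_exon_chain(candidates))
```

Because each exon is considered once and the best compatible predecessor is found by binary search, the chaining step stays cheap even with many candidate exons.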

12
Training GeneID
Position   1    2    3    4    5    6    7    8    9
A         0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C         0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G         0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T         0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6
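
A frequency table like this one (it resembles a donor-site profile, with an invariant GT at positions 4-5) can be turned into a log-likelihood ratio score for any 9-base candidate site. The sketch below assumes a uniform background of 0.25 per base, which is an assumption and not something stated on the slide; geneid's parameter files use their own background model.

```python
import math

# Frequencies from the slide: P(base | position) for positions 1-9.
PWM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}
BACKGROUND = 0.25  # assumption: uniform background base frequency


def site_score(site):
    """Sum over positions of log2(P_site / P_background).

    Returns -inf if any base has probability 0 at its position.
    """
    if len(site) != 9:
        raise ValueError("site must be exactly 9 bases long")
    total = 0.0
    for pos, base in enumerate(site.upper()):
        p = PWM[base][pos]
        if p == 0.0:
            return float("-inf")
        total += math.log2(p / BACKGROUND)
    return total


print(site_score("TAGGTAAGT"))  # most frequent base at every position
print(site_score("TAGCTAAGT"))  # violates the invariant G at position 4
```

Training amounts to estimating such tables from known sites; prediction then scores every candidate site in the sequence against them.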
13
Running GeneID: command line or on geneid server
NAME
    geneid - a program to annotate genomic sequences
SYNOPSIS
    geneid [-bdaefitnxszr] [-DA] [-Z] [-p gene_prefix] [-G] [-3] [-X] [-M]
           [-m] [-WCF] [-o] [-j lower_bound_coord] [-k upper_bound_coord]
           [-O <gff_exons_file>] [-R <gff_annotation-file>]
           [-S <gff_homology_file>] [-P <parameter_file>] [-E exonweight]
           [-V evidence_exonweight] [-Bv] [-h] <locus_seq_in_fasta_format>
RELEASE
    geneid v 1.3
OPTIONS
    -b  Output Start codons
    -d  Output Donor splice sites
    -a  Output Acceptor splice sites
    -e  Output Stop codons
    -f  Output Initial exons
    -i  Output Internal exons
    -t  Output Terminal exons
    -n  Output introns
    -s  Output Single genes
    -x  Output all predicted exons
    -z  Output Open Reading Frames
    -D  Output genomic sequence of exons in predicted genes
    -A  Output amino acid sequence derived from predicted CDS
    -p  Prefix this value to the names of predicted genes, peptides and CDS
    -G  Use GFF format to print predictions
    -3  Use GFF3 format to print predictions
    -X  Use extended-format to print gene predictions
    -M  Use XML format to print gene predictions
    -m  Show DTD for XML-format output
    -j  Begin prediction at this coordinate
    -k  End prediction at this coordinate
    -W  Only Forward sense prediction (Watson)
    -C  Only Reverse sense prediction (Crick)
    -U  Allow U12 introns (Requires appropriate U12 parameters to be set in
        the parameter file)
    -r  Use recursive splicing
    -F  Force the prediction of one gene structure
    -o  Only running exon prediction (disable gene prediction)
    -O <exons_filename>  Only running gene prediction (not exon prediction)
    -Z  Activate Open Reading Frames searching
    -R <exons_filename>  Provide annotations to improve predictions
    -S <HSP_filename>  Using information from protein sequence alignments to
        improve predictions
    -E  Add this value to the exon weight parameter (see parameter file)
    -V  Add this value to the score of evidence exons
    -P <parameter_file>  Use other than default parameter file (human)
    -B  Display memory required to execute geneid given a sequence
    -v  Verbose. Display info messages
    -h  Show this help
AUTHORS
    geneid_v1.3 has been developed by Enrique Blanco, Tyler Alioto and
    Roderic Guigo. Parameter files have been created by Genis Parra and
    Tyler Alioto. Any bug or suggestion can be reported to geneid_at_imim.es
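
A typical invocation combines a parameter file (-P), GFF output (-G) and a FASTA file. The helper below only assembles such a command line from options documented in the help text above; the function name and the file names human.param and HS307871.fa are hypothetical placeholders.

```python
def geneid_command(fasta_path, parameter_file=None, gff=True):
    """Build an argument list for geneid using only -P and -G from the help text."""
    cmd = ["geneid"]
    if parameter_file:
        cmd += ["-P", parameter_file]  # use other than the default parameter file
    if gff:
        cmd.append("-G")               # print predictions in GFF format
    cmd.append(fasta_path)             # locus sequence in FASTA format
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(geneid_command("HS307871.fa", parameter_file="human.param")))
```

With geneid installed and on the PATH, the returned list could be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) to capture the predictions.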
14
GeneID output
## gff-version 2
## date Mon Nov 26 14:37:15 2007
## source-version: geneid v 1.2 -- geneid_at_imim.es
# Sequence HS307871 - Length = 4514 bps
# Optimal Gene Structure. 1 genes. Score = 16.20
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871  geneid_v1.2  Internal  1710  1860  -0.11  +  0  HS307871_1
HS307871  geneid_v1.2  Internal  1976  2055   0.24  +  2  HS307871_1
HS307871  geneid_v1.2  Internal  2132  2194   0.44  +  0  HS307871_1
HS307871  geneid_v1.2  Internal  2434  2682   4.66  +  0  HS307871_1
HS307871  geneid_v1.2  Internal  2749  2910   3.19  +  0  HS307871_1
HS307871  geneid_v1.2  Internal  3279  3416   0.97  +  0  HS307871_1
HS307871  geneid_v1.2  Internal  3576  3676   3.23  +  0  HS307871_1
HS307871  geneid_v1.2  Internal  3780  3846  -0.96  +  1  HS307871_1
HS307871  geneid_v1.2  Terminal  4179  4340   4.55  +  0  HS307871_1
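
The summary line of this output can be checked against the exon rows. The sketch below copies the exon coordinates and scores from the output and verifies that the exon lengths add up to the reported 391 codons, and that the exon scores add up to the reported gene score within rounding. The tuple layout is an illustrative choice.

```python
# (feature, begin, end, score, frame) copied from the geneid output above.
EXONS = [
    ("Internal", 1710, 1860, -0.11, 0),
    ("Internal", 1976, 2055, 0.24, 2),
    ("Internal", 2132, 2194, 0.44, 0),
    ("Internal", 2434, 2682, 4.66, 0),
    ("Internal", 2749, 2910, 3.19, 0),
    ("Internal", 3279, 3416, 0.97, 0),
    ("Internal", 3576, 3676, 3.23, 0),
    ("Internal", 3780, 3846, -0.96, 1),
    ("Terminal", 4179, 4340, 4.55, 0),
]


def coding_length(exons):
    """Total exon length in bp; GFF coordinates are 1-based and inclusive."""
    return sum(end - begin + 1 for _, begin, end, _, _ in exons)


def gene_score(exons):
    """Sum of the per-exon scores."""
    return sum(score for _, _, _, score, _ in exons)


length = coding_length(EXONS)
print(length, "bp =", length // 3, "codons")  # 1173 bp = 391 codons
print(round(gene_score(EXONS), 2))            # 16.21, vs. 16.20 reported (per-exon rounding)
```

Note the inclusive coordinates: an exon from 1710 to 1860 is 151 bp long, not 150.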
15
GFF: a standard annotation format
  • Stands for
  • Gene Finding Format -or- General Feature Format
  • Designed as a single line record for describing
    features on DNA sequence -- originally used for
    gene prediction output
  • 9 tab-delimited fields common to all versions
  • seq source feature begin end score strand
    frame group
  • The group field differs between versions, but in
    every case no tabs are allowed
  • GFF2: group is a unique description, usually the
    gene name.
  • NCOA1
  • GFF2.5 / GTF (Gene Transfer Format): tag-value
    pairs introduced; start_codon and stop_codon are
    required features for CDS
  • transcript_id "NM_056789"; gene_id "NCOA1";
  • GFF3: capitalized tags follow Sequence Ontology
    (SO) relationships; FASTA seqs can be embedded
  • ID=NM_056789_exon1;Parent=NM_056789;note=5'
    UTR exon
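
Since every GFF version shares the same nine tab-delimited fields, a record can be split generically and only the group field needs version-specific handling. The sketch below handles the GFF3 tag=value form; the example record mixes coordinates from the geneid output with the NM_056789 identifiers from this slide and is purely illustrative. It does not decode URL-escaped characters, which the full GFF3 format allows.

```python
FIELDS = ("seq", "source", "feature", "begin", "end",
          "score", "strand", "frame", "group")


def parse_gff_line(line):
    """Split one GFF record into its nine tab-delimited fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected 9 fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def parse_gff3_attributes(group):
    """Parse a GFF3-style group field such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for pair in group.split(";"):
        pair = pair.strip()
        if pair:
            tag, _, value = pair.partition("=")
            attrs[tag] = value
    return attrs


# Illustrative record only; not taken from a real annotation file.
EXAMPLE = "\t".join([
    "HS307871", "geneid_v1.2", "exon", "1710", "1860", "-0.11", "+", "0",
    "ID=NM_056789_exon1;Parent=NM_056789",
])

record = parse_gff_line(EXAMPLE)
print(record["feature"], record["begin"], record["end"])
print(parse_gff3_attributes(record["group"]))
```

A GFF2 group would be used as-is, and a GTF group would need a different splitter for its tag "value"; pairs.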

16
GeneID output
17
Visualizing features with gff2ps
generated by Josep Abril
18
Visualizing features on UCSC genome browser
(custom tracks)
  • If your genome is served by UCSC, this is a
    good option because
  • browsing is dynamic
  • access to other annotations
  • can view DNA sequence
  • can do complex intersections and filtering
  • gff2ps is good when
  • your genome is not on UCSC
  • you want more flexible layout options
  • you want to run it offline

19
Extensions to GeneID
  • Syntenic Gene Prediction (dual-genome)
  • Evidence-based (constrained) gene prediction
  • U12 intron detection
  • Combining gene predictions
  • Selenoprotein gene prediction

20
Syntenic Gene Prediction SGP2
21
Minor splicing and U12 introns
  • U12 introns make up a minor proportion of all
    introns (0.33% in human, less in insects)
  • But they can be found in 2-3% of genes
  • Normally ignored, but this causes annotation
    problems
  • Easy to predict due to highly conserved donor and
    branch sites
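
The claim that conserved donor sites make U12 introns easy to find can be illustrated with a plain consensus scan. The motif used here, [AG]TATCCTT at the 5' end of the intron, is the commonly cited U12-type donor consensus and is an assumption brought in from outside these slides. GeneID itself scores sites with weight profiles, not exact matches, and also considers the branch site.

```python
import re

# Assumed U12-type donor consensus (not given on the slide): [AG]TATCCTT.
# The lookahead lets overlapping matches be reported as well.
U12_DONOR = re.compile(r"(?=([AG]TATCCTT))")


def find_u12_like_donors(seq):
    """Return (0-based position, matched motif) for every exact consensus match."""
    return [(m.start(1), m.group(1)) for m in U12_DONOR.finditer(seq.upper())]


# Made-up sequence with one consensus match starting at index 4.
print(find_u12_like_donors("CCAGGTATCCTTTGA"))
```

An exact-match scan like this misses diverged sites and says nothing about the branch site, so it is a first filter at best.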

22
Splice Signal Profiles: major and minor
23
Gathering U12 Introns
[Figure: flowchart of U12 intron collection for Human and Fruit Fly. Box labels: genome, aln to EST/mRNA, all annotated introns (ENSEMBL?), predict, score, merge, ortholog search (17 species) spliced alignment, published, U12 DB. Counts shown in the figure: 2084, 563, 568, 385, 658, 597]
25
Coming Soon: GenePC, a Gene Prediction Combiner
26
Tutorial Homepage
  • http://genome.imim.es/courses/Pamplona07/

GBL Homepage
  • http://genome.imim.es/