Title: Comparative Genomics: Comparative Gene Prediction in the Human Genome
1Comparative Genomics: Comparative Gene Prediction
in the Human Genome
- Maribel Hernandez Rosales
2What is Comparative Genomics?
- Comparative genomics is the analysis and comparison of genomes from different species.
- The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome.
- Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.
- Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans.
- Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.
3What are the comparative genome sizes of humans
and other organisms being studied?
4Eukaryotic Gene Finding
6Comparative Gene Prediction
- GenScan: ab initio gene prediction.
- GeneWise, Procrustes: homology guided.
- Rosetta, SGP1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated.
- GenomeScan: ab initio modified by BLAST homologies.
- SGP-2, TwinScan, SLAM, DoubleScan: modification of the GenScan scoring scheme to incorporate similarity to known proteins.
7GenScan
- A general probabilistic model for the gene structure of human genomic sequences.
- Gene identification by identifying complete exon/intron structures of genes in genomic DNA.
- Includes the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands.
- Markov model of coding regions: predictions do not depend on the presence of a similar gene in the protein sequence databases, and they complement the information provided by homology-based gene identification methods (BLASTX).
- Maximal Dependence Decomposition (MDD): a new statistical model of donor and acceptor splice sites that captures important dependencies between signal positions.
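The idea of scoring coding potential with a Markov model can be sketched as a log-odds comparison between a model trained on coding sequence and one trained on noncoding sequence. This is a minimal first-order illustration, not GenScan's actual model: the model order, the pseudocount, the toy training sequences in the usage note, and the names `train_markov` and `log_odds` are all assumptions made for the example.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=1, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with pseudocount smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount  # 4 possible bases
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob

def log_odds(seq, coding, noncoding, order=1):
    """Sum of log(P_coding / P_noncoding) over all transitions in seq."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score

# Toy training data (hypothetical, far too small for real use).
coding_model = train_markov(["ATGGCCGCCGCCGCC"])
noncoding_model = train_markov(["ATATATATATAT"])
```

A positive `log_odds(window, coding_model, noncoding_model)` suggests the window looks more like the coding training set; a negative value suggests the opposite. Such a score uses only the sequence itself, which is why it complements homology-based evidence.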
8Pre-mRNA Splicing
[Figure: exon definition and intron definition, followed by assembly of the spliceosome and catalysis]
10Hidden semi-Markov Model (HMM)
11GenScan HMM
- N - intergenic region
- P - promoter
- F - 5' untranslated region
- Esngl - single exon (intronless) (translation start -> stop codon)
- Einit - initial exon (translation start -> donor splice site)
- Ek - phase k internal exon (acceptor splice site -> donor splice site)
- Eterm - terminal exon (acceptor splice site -> stop codon)
- Ik - phase k intron: 0 - between codons; 1 - after the first base of a codon; 2 - after the second base of a codon
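The intron phase convention listed above follows directly from how many coding bases precede the intron. A small sketch, assuming the input is a list of coding exon lengths in transcript order (the function name `intron_phases` is illustrative):

```python
def intron_phases(cds_exon_lengths):
    """Return the phase of the intron that follows each coding exon.

    Phase 0: the intron lies between two codons.
    Phase 1: the intron falls after the first base of a codon.
    Phase 2: the intron falls after the second base of a codon.
    The last exon is followed by no intron, so it contributes no phase.
    """
    phases = []
    coding_bases = 0
    for length in cds_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases

# Three coding exons of 100, 50 and 31 bases.
example_phases = intron_phases([100, 50, 31])
```

Here the first intron comes after 100 coding bases (100 mod 3 = 1, so phase 1) and the second after 150 coding bases (phase 0).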
12GenScan Features
- Model both strands at once
- Each state may output a string of symbols (according to some probability distribution).
- Explicit intron/exon length modeling
- Advanced splice site modeling
- Parameters learned from annotated genes
- Prediction of multiple genes in a sequence (partial or complete).
13GenomeScan
- We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.
- Combine probabilistic extrinsic information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan).
- Focus on the typical case when homologous but not identical proteins are available.
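One way to picture how homology evidence can shift ab initio predictions is to raise the score of candidate exons that overlap a BLAST hit. The sketch below is only an illustration of that idea, not GenomeScan's actual probabilistic formulation: the additive bonus, the half-open coordinates, and the names `overlap` and `rescore_exons` are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def rescore_exons(exons, hits, bonus=2.0):
    """Add a bonus to each exon score, scaled by the covered fraction.

    exons: list of (start, end, score) from an ab initio predictor.
    hits:  list of (start, end) regions with protein homology.
    The bonus is scaled by the largest fraction of the exon that a
    single hit covers (a hypothetical scoring rule).
    """
    rescored = []
    for start, end, score in exons:
        best = max((overlap(start, end, hs, he) for hs, he in hits), default=0)
        fraction = best / (end - start)
        rescored.append((start, end, score + bonus * fraction))
    return rescored

candidates = [(0, 100, 1.0), (200, 300, 1.5)]
blast_hits = [(50, 150)]
rescored = rescore_exons(candidates, blast_hits)
```

In this toy case the first exon is half covered by the hit and gains half the bonus, while the second exon has no supporting hit and keeps its ab initio score.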
14Ab Initio modified by BLAST homologies
15Ab Initio modified by BLAST homologies
16GeneWise
- Motivation: use a good database of the protein world (PFAM) to help us annotate genomic DNA.
- The GeneWise algorithm aligns a profile HMM directly to the DNA.
17GeneWise
- Start with a PFAM domain HMM
- Replace AA emissions with codon emissions
- Allow for sequencing errors (deletions/insertions)
- Add a 3-state intron model
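The step of replacing amino-acid emissions with codon emissions can be sketched for a single match state. This sketch assumes the standard genetic code and, as a simplification, spreads each amino acid's probability uniformly over its synonymous codons; a real implementation would more likely weight codons by usage. The names `CODON_TABLE` and `codon_emissions` are illustrative.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_emissions(aa_emissions):
    """Turn amino-acid emission probabilities into codon emission probabilities.

    Assumption: probability is split evenly among synonymous codons.
    """
    synonyms = {}
    for codon, aa in CODON_TABLE.items():
        synonyms.setdefault(aa, []).append(codon)
    emissions = {}
    for aa, p in aa_emissions.items():
        for codon in synonyms[aa]:
            emissions[codon] = p / len(synonyms[aa])
    return emissions

# Hypothetical match state that emits only M, L and W.
state = codon_emissions({"M": 0.5, "L": 0.3, "W": 0.2})
```

Methionine and tryptophan each have a single codon, so their probability is unchanged, while leucine's 0.3 is divided among its six codons.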
18GeneWise Model
19GeneWise Intron Model
5' site
3' site
20GeneWise: Features and Problems
- Best alignment of DNA to protein domain
- Alignment gives exact exon-intron boundaries
- Parameters learned from species-specific statistics
- Only provides partial prediction, and only where the homology lies
- Does not find more genes
- Pseudogenes and retrotransposons are picked up
- CPU intensive
- Solution: pre-filter with BLAST
21Rosetta
- Gene prediction is separated from sequence alignment.
- First, an alignment between two homologous genomic sequences is obtained using the global alignment program GLASS. Then, gene structures (splice sites, exon number and length, etc.) are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
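The compatibility requirement, that predicted exons fall in the aligned regions, can be illustrated with a simple interval filter. This covers only that one constraint and is not Rosetta's full prediction procedure; the half-open coordinates and the name `exons_in_aligned_regions` are assumptions.

```python
def exons_in_aligned_regions(candidate_exons, aligned_regions):
    """Keep candidate exons that lie entirely inside one aligned region.

    Both arguments are lists of (start, end) half-open intervals on the
    same genomic sequence.
    """
    kept = []
    for start, end in candidate_exons:
        if any(r_start <= start and end <= r_end
               for r_start, r_end in aligned_regions):
            kept.append((start, end))
    return kept

candidate_exons = [(10, 50), (40, 120), (200, 260)]
aligned_regions = [(0, 100), (190, 300)]
compatible = exons_in_aligned_regions(candidate_exons, aligned_regions)
```

The second candidate extends past the end of the first aligned region, so it is dropped; the other two are kept.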
22Syntenic Gene Prediction
- This approach does not require the comparison of two homologous genomic sequences.
- A query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by the underlying ab initio gene prediction algorithm.
- Gene prediction and sequence alignment are separated.
23SGP-2
24Gene prediction programs predict a large number of genes
- Almost every mouse gene has a human orthologue counterpart.
[Figure: filtering of TwinScan and SGP gene predictions]
- total: 48462 (TwinScan), 47055 (SGP)
- novel: 17562 (TwinScan), 21942 (SGP)
- multiexonic, long, no low complexity: 10987 3171 (TwinScan), 12158 4543 (SGP)
- 954 human ts; 2983 human sgp
- intron aligned: 317 637 1931 1052
25Orthologous human and mouse genes have conserved exonic structure.
- 85% of the orthologous pairs have an identical number of exons.
- 91% of the orthologous exons have identical length.
- 99.5% of the orthologous exons have identical phase.
- There are a few cases of intron insertion/deletion (22).
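The three properties compared above (exon number, exon length, exon phase) can be computed for one orthologous pair as follows. This is a sketch under simplifying assumptions: exons are given as coding lengths in transcript order and are paired by position rather than by alignment, and the function names are illustrative.

```python
def exon_start_phases(cds_exon_lengths):
    """Phase at which each coding exon starts (coding bases before it, mod 3)."""
    phases, coding_bases = [], 0
    for length in cds_exon_lengths:
        phases.append(coding_bases % 3)
        coding_bases += length
    return phases

def compare_exon_structures(human_exons, mouse_exons):
    """Compare two genes exon by exon (exons paired by position)."""
    pairs = list(zip(human_exons, mouse_exons))
    phase_pairs = zip(exon_start_phases(human_exons),
                      exon_start_phases(mouse_exons))
    return {
        "same_exon_count": len(human_exons) == len(mouse_exons),
        "identical_length": sum(h == m for h, m in pairs),
        "identical_phase": sum(h == m for h, m in phase_pairs),
        "exons_compared": len(pairs),
    }

# Hypothetical pair: the second exon differs by one codon (3 bases).
result = compare_exon_structures([120, 87, 200], [120, 90, 200])
```

In this toy pair the length of one exon differs, yet all three exons keep the same phase, which mirrors why phase identity is higher than length identity in the figures above.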
26Summary
- Genes are complex structures which are difficult to predict with the required level of accuracy/confidence.
- Different approaches to gene finding improve accuracy/confidence of the predictions.
- Ab initio: GenScan
- Ab initio modified by BLAST homologies: GenomeScan
- Homology guided: GeneWise
- Gene prediction and sequence alignment separately: Rosetta
- Ab initio with similarity to known proteins: SGP-2
27Thank you for your attention!