Comparative Genomics: Comparative Gene Prediction in the Human Genome

Slides: 28
Provided by: maribelher


1
Comparative Genomics: Comparative Gene Prediction
in the Human Genome
  • Maribel Hernandez Rosales

2
What is Comparative Genomics?
  • Comparative genomics is the analysis and
    comparison of genomes from different species.
  • The purpose is to gain a better understanding of
    how species have evolved and to determine the
    function of genes and noncoding regions of the
    genome.
  • Researchers have learned a great deal about the
    function of human genes by examining their
    counterparts in simpler model organisms such as
    the mouse.
  • Genome researchers look at many different
    features when comparing genomes: sequence
    similarity, gene location, the length and number
    of coding regions (called exons) within genes,
    the amount of noncoding DNA in each genome, and
    highly conserved regions maintained in organisms
    as simple as bacteria and as complex as humans.
  • Comparative genomics involves the use of computer
    programs that can line up multiple genomes and
    look for regions of similarity among them.

3
What are the comparative genome sizes of humans
and other organisms being studied?
4
Eukaryotic Gene Finding
5
(No Transcript)
6
Comparative Gene Prediction
  • GenScan: ab initio gene prediction.
  • GeneWise, Procrustes: homology guided.
  • Rosetta, SGP1 (Syntenic Gene Prediction), CEM
    (Conserved Exon Method): gene prediction and
    sequence alignment are clearly separated.
  • GenomeScan: ab initio modified by BLAST
    homologies.
  • SGP-2, TwinScan, SLAM, DoubleScan: modification
    of GenScan scoring schema to incorporate
    similarity to known proteins.

7
GenScan
  • A general probabilistic model for the gene
    structure of human genomic sequences.
  • Gene identification by identifying complete
    exon/intron structures of genes in genomic DNA.
  • Includes the capacity to predict multiple genes in
    a sequence, to deal with partial as well as
    complete genes, and to predict consistent sets of
    genes occurring on either or both DNA strands.
  • Markov model of coding regions: predictions do
    not depend on the presence of a similar gene in
    the protein sequence databases, and they
    complement the information provided by
    homology-based gene identification methods
    (BLASTX).
  • Maximal Dependence Decomposition (MDD): a new
    statistical model of donor and acceptor splice
    sites which captures important dependencies
    between signal positions.
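
The idea of scoring a region with a Markov model of coding sequence can be sketched as follows. This is a simplified illustration only: it uses a plain order-2 chain with add-one smoothing, and the names `train_markov` and `log_odds` and the toy training sequences are assumptions made for this example, not GenScan's actual model or parameters.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, alphabet="ACGT", pseudo=1.0):
    """Estimate P(base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def prob(context, base):
        # Unseen contexts fall back to the uniform smoothed estimate.
        row = counts.get(context, {})
        total = sum(row.values()) + pseudo * len(alphabet)
        return (row.get(base, 0.0) + pseudo) / total

    return prob


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log(P_coding / P_noncoding) over all positions of `seq`."""
    return sum(
        math.log(coding(seq[i - order:i], seq[i])
                 / noncoding(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


# Toy (made-up) training data, for illustration only.
coding_model = train_markov(["ATGGCCGCCGCCGCC"])
noncoding_model = train_markov(["ATATATATATATAT"])
```

A positive `log_odds` value means the sequence looks more like the coding training set than the noncoding one. No protein database is consulted, which is the point made in the bullet above.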

8
Pre-mRNA Splicing
(Figure: exon definition and intron definition,
followed by assembly of the spliceosome and
catalysis.)
9
(No Transcript)
10
Hidden semi-Markov Model (HSMM)
11
GenScan HMM
  • N - intergenic region
  • P - promoter
  • F - 5' untranslated region
  • Esngl - single exon (intronless) (translation
    start -> stop codon)
  • Einit - initial exon (translation start -> donor
    splice site)
  • Ek - phase k internal exon (acceptor splice site
    -> donor splice site)
  • Eterm - terminal exon (acceptor splice site ->
    stop codon)
  • Ik - phase k intron: 0 = between codons, 1 =
    after the first base of a codon, 2 = after the
    second base of a codon
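
The intron phase definition above can be illustrated with a short sketch. It assumes that the exon lengths count coding bases only and that the first exon begins at the translation start; the function name `intron_phases` is made up for this example.

```python
def intron_phases(exon_lengths):
    """Return the phase of the intron following each exon except the last.

    Assumes lengths are coding bases only, starting at the start codon.
    Phase 0: intron lies between codons; 1: after the first base of a
    codon; 2: after the second base of a codon.
    """
    phases = []
    coding_bases = 0
    for length in exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases
```

For example, a gene with coding exon lengths 100, 50 and 30 has a phase 1 intron after the first exon (100 mod 3 = 1) and a phase 0 intron after the second (150 mod 3 = 0).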

12
GenScan Features
  • Model both strands at once
  • Each state may output a string of symbols
    (according to some probability distribution).
  • Explicit intron/exon length modeling
  • Advanced splice site modeling
  • Parameters learned from annotated genes
  • Prediction of multiple genes in a sequence
    (partial or complete).

13
GenomeScan
  • We can enhance our gene prediction by using
    external information: DNA regions with homology
    to known proteins are more likely to be coding
    exons.
  • Combines probabilistic extrinsic information
    (BLAST hits) with a probabilistic model of gene
    structure/composition (GenScan).
  • Focuses on the typical case, in which homologous
    but not identical proteins are available.
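
The idea of letting BLAST hits modify ab initio exon scores can be sketched as below. This is not GenomeScan's actual scoring formula: the additive `bonus` and `penalty` values, the coordinate convention (half-open intervals) and the function name `adjust_exon_scores` are all assumptions made for illustration.

```python
def adjust_exon_scores(exons, hits, bonus=2.0, penalty=0.5):
    """Raise the score of exons overlapping a BLAST hit, lower the others.

    exons: list of (start, end, score); hits: list of (start, end).
    Intervals are half-open. bonus and penalty are hypothetical values.
    """
    adjusted = []
    for start, end, score in exons:
        supported = any(start < h_end and h_start < end
                        for h_start, h_end in hits)
        new_score = score + bonus if supported else score - penalty
        adjusted.append((start, end, new_score))
    return adjusted
```

With candidate exons `[(0, 100, 1.0), (200, 300, 1.0)]` and a single hit `(50, 120)`, only the first exon overlaps the hit and is rewarded.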

14
Ab Initio modified by BLAST homologies
15
Ab Initio modified by BLAST homologies
16
GeneWise
  • Motivation: use a good database of the protein
    world (PFAM) to help us annotate genomic DNA
  • The GeneWise algorithm aligns a profile HMM
    directly to the DNA

17
GeneWise
  • Start with a PFAM domain HMM
  • Replace AA emissions with codon emissions
  • Allow for sequencing errors
    (deletions/insertions)
  • Add a 3-state intron model
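
The step "replace AA emissions with codon emissions" can be sketched as follows. The sketch assumes the standard genetic code and splits each amino acid's emission probability evenly over its synonymous codons; the even split and the name `codon_emissions` are assumptions for illustration, not necessarily how GeneWise parameterizes codon usage.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Map each amino acid to the list of codons that encode it.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def codon_emissions(aa_emissions):
    """Turn an amino-acid emission distribution into a codon distribution.

    Each amino acid's probability is divided evenly among its codons
    (a simplifying assumption; real codon usage is not uniform).
    """
    out = {}
    for aa, p in aa_emissions.items():
        for codon in SYNONYMS[aa]:
            out[codon] = p / len(SYNONYMS[aa])
    return out
```

A match state emitting M with probability 0.5, L with 0.3 and W with 0.2 would then emit ATG with probability 0.5, each of the six leucine codons with 0.05, and TGG with 0.2.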

18
GeneWise Model
19
GeneWise Intron Model
5' site
3' site
20
GeneWise: Features and Problems
  • Best alignment of DNA to protein domain
  • Alignment gives exact exon-intron boundaries
  • Parameters learned from species-specific
    statistics
  • Only provides partial prediction, and only where
    the homology lies
  • Does not find additional genes
  • Pseudogenes and retrotransposons are picked up
  • CPU intensive
  • Solution: pre-filter with BLAST

21
Rosetta
  • Gene prediction is separated from sequence
    alignment.
  • First, an alignment between two homologous
    genomic sequences is obtained using the global
    alignment program GLASS. Then, gene structures
    (splice sites, exon number and length, etc.) are
    predicted that are compatible with this
    alignment, meaning that predicted exons fall in
    the aligned regions.
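
The compatibility requirement, that predicted exons fall in the aligned regions, can be sketched as a simple interval filter. This only illustrates the constraint; it is not Rosetta's algorithm, and the function name `exons_in_alignment` and the coordinate tuples are hypothetical.

```python
def exons_in_alignment(candidate_exons, aligned_regions):
    """Keep candidate exons that lie entirely inside an aligned region.

    Both arguments are lists of (start, end) coordinates on the same
    sequence, with start <= end.
    """
    return [
        (start, end)
        for start, end in candidate_exons
        if any(a_start <= start and end <= a_end
               for a_start, a_end in aligned_regions)
    ]
```

Given candidates `[(10, 50), (60, 120), (200, 250)]` and aligned regions `[(0, 100), (190, 300)]`, the second candidate is dropped because it extends past the end of the first aligned region.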

22
Syntenic Gene Prediction
  • This approach does not require the comparison of
    two homologous genomic sequences.
  • A query sequence from a target genome is
    compared against a collection of sequences from a
    second (informant, reference) genome, and the
    results of the comparison are used to modify the
    scores of the exons produced by underlying "ab
    initio" gene prediction algorithms.
  • Gene prediction and sequence alignment are
    separated.

23
SGP-2
24
Gene prediction programs predict a large number
of genes
  • almost every mouse gene has a human orthologue
    counterpart

Predictions by TwinScan and SGP (figure)
  • Total: TwinScan 48462, SGP 47055
  • Novel: TwinScan 17562, SGP 21942
  • Multiexonic, long, no low complexity: TwinScan
    10987 and 3171, SGP 12158 and 4543
  • Remaining figure labels: 954 human ts, 2983 human
    sgp; 317, 637, 1931, 1052; intron aligned



25
Orthologous human and mouse genes have conserved
exonic structure.
  • 85% of the orthologous pairs have an identical
    number of exons
  • 91% of the orthologous exons have identical
    length
  • 99.5% of the orthologous exons have identical
    phase
  • there are a few cases of intron
    insertion/deletion (22)
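
Statistics of this kind can be computed from paired exon annotations along the following lines. The input format (one list of exon lengths per species for each orthologous pair) and the function name `exon_structure_stats` are assumptions for illustration; the sketch does not reproduce the analysis behind the figures above.

```python
def exon_structure_stats(pairs):
    """Summarize exon-structure conservation for orthologous gene pairs.

    pairs: list of (human_exon_lengths, mouse_exon_lengths).
    Returns (fraction of pairs with the same exon count,
             fraction of paired exons with identical length).
    Exon lengths are only compared within pairs of equal exon count.
    """
    same_count = [p for p in pairs if len(p[0]) == len(p[1])]
    frac_same_count = len(same_count) / len(pairs)
    exon_pairs = [(h, m) for hs, ms in same_count for h, m in zip(hs, ms)]
    if exon_pairs:
        frac_same_length = sum(h == m for h, m in exon_pairs) / len(exon_pairs)
    else:
        frac_same_length = 0.0
    return frac_same_count, frac_same_length
```

On toy input such as `[([100, 50], [100, 53]), ([30], [30]), ([10, 20], [10, 20, 30])]`, two of the three pairs have the same number of exons, and two of the three compared exons have identical length.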

26
Summary
  • Genes are complex structures which are difficult
    to predict with the required level of accuracy/
    confidence
  • Different approaches to gene finding improve
    accuracy/confidence of the predictions
  • Ab initio: GenScan
  • Ab initio modified by BLAST homologies:
    GenomeScan
  • Homology guided: GeneWise
  • Gene prediction and sequence alignment
    performed separately: Rosetta
  • Ab initio with similarity to known proteins: SGP-2

27
Thank you for your attention!