IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

1
IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF
PSEUDOGENES
  • L. Coin and R. Durbin
  • Wellcome Trust Sanger Institute

BIOINFORMATICS 2004. Presented by Oscar Sanchez Plazas
2
Outline
  • Problem definition
  • Previous works on pseudogene identification
  • Proposed method
  • Protein domain profile (Pfam)
  • Algorithm
  • Results and Discussion

3
Pseudogene Identification
  • Pseudogene: remnant of a genomic gene sequence
    that is no longer translated into a
    functional protein.
  • Non-processed (duplicated)
  • Product of genome duplication (paralogous)
  • Loss of function at the transcription or
    translation level
  • Processed (70%)
  • Product of retrotransposition
  • No introns, no promoter

(*) Plagiarized Errors and Molecular Genetics.
Edward E. Max, M.D.
4
  • (*) http://www.pseudogene.org/definition.html

5
Pseudogenes
  • Significance
  • Comparative genomics
  • Evolution of DNA, new gene expression patterns
  • Study of mechanisms for regulation of gene
    expression
  • Verification of gene sequences in databases

6
Pseudogenes
  • Are they functional? (Why high conservation
    compared to prokaryotes?)
  • "Pseudogenes exhibit evolutionary conservation of
    gene sequence, reduced nucleotide variability,
    excess synonymous over nonsynonymous nucleotide
    polymorphism, and other features that are
    expected in genes or DNA sequences that have
    functional roles" (1)
  • (1) Pseudogenes: Are They Junk or Functional
    DNA? Evgeniy S. Balakirev, Francisco Ayala. 2003
  • - An expressed pseudogene regulates the
    messenger-RNA stability of its homologous coding
    gene. Hirotsune, S. et al. Nature. 2003
  • - The putatively functional Mkrn1-p1 pseudogene
    is neither expressed nor imprinted, nor does it
    regulate its source gene in trans. Gray TA,
    Wilson A, Fortin PJ, Nicholls RD. PNAS. 2006

(*) www.answersingenesis.org/tj/v17/i2/pseudogene.asp
7
Problem
  • Pseudogenes are sometimes mis-annotated as
    functional genes in gene sequence databases.
  • Key insight
  • Employ an evolutionary constraint model derived
    from a functional characterization of the gene
    product.
  • Constrained vs. neutral model

8
Previous approaches
  • Presence of stop codons and frameshifts.
  • Not very sensitive (50% are detectable *)

(*) Large-scale analysis of pseudogenes in the
human genome. Zhaolei Zhang, Mark Gerstein
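
The stop-codon and frameshift test on this slide can be illustrated with a minimal sketch. The function name and the two toy sequences are made up for illustration, and the sequence is assumed to start at the first base of its reading frame. It is not the procedure used in the cited survey.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disablements(cds: str) -> dict:
    """Report simple disabling features in a candidate coding sequence.

    Assumes cds starts at the first base of the reading frame.
    """
    seq = cds.upper().replace("-", "")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # A stop codon anywhere before the final codon is premature.
    premature = [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    return {
        "premature_stops": premature,
        # A length that is not a multiple of 3 hints at a frameshifting indel.
        "length_not_multiple_of_3": len(seq) % 3 != 0,
    }

intact = "ATGGCTGAATTTTAA"    # ATG GCT GAA TTT TAA
broken = "ATGGCTTGAATTTTAA"   # one inserted base creates an in-frame TGA

print(disablements(intact))
print(disablements(broken))
```

A sequence that passes both checks can still be a pseudogene, which is the sensitivity problem the slide points out.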
9
Previous approaches
  • Ratio of non-synonymous to synonymous
    substitutions (dN/dS)
  • Not very accurate, e.g. for genes under positive
    selection pressure.

(*) Genome-wide survey of human pseudogenes.
Torrents, D., Suyama, M., Zdobnov, E. and Bork, P.
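
To make the synonymous/non-synonymous distinction concrete, here is a rough sketch that classifies codon differences between two aligned, gap-free coding sequences. It returns raw counts only. A real dN/dS estimate also normalises by the number of synonymous and non-synonymous sites and corrects for multiple hits, which this sketch does not attempt. The function name and the example sequences are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def count_codon_differences(seq1: str, seq2: str) -> tuple:
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, gap-free and in frame")
    syn = nonsyn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        # Same amino acid: synonymous change; otherwise non-synonymous.
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn

gene = "ATGGCTGAAAAA"   # M A E K
copy = "ATGGCCGATAAA"   # M A D K
print(count_codon_differences(gene, copy))
```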
10
Model Proposed
  • PSILC: Pseudogene inference from loss of
    constraint (log-odds score)
  • Protein domain evolution (functional constraint) -
    Null probability model (Pfam)
  • Neutral nucleotide model
  • Protein coding model

11
Domain Profile - HMM
  • Protein domains: structural, functional and
    evolutionary units of proteins
  • HMM profiles: the most sensitive models for
    domains
  • Every state has a particular emission
    distribution over A,C,T,G
  • (*) genome.nasa.gov/MediaLib/hmm_project_fig2.jpg

[Figure: profile HMM with match, insertion and deletion states]
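
As a rough illustration of per-state emission distributions, the sketch below scores a gapless sequence against a toy list of match states. The three states and their probabilities are invented for the example, and insertion and deletion states are left out entirely, so this is far simpler than a real Pfam profile HMM.

```python
import math

# Uniform background used as the reference distribution (an assumption).
BACKGROUND = {b: 0.25 for b in "ACGT"}

# Hypothetical match states: one emission distribution per column.
MATCH_STATES = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

def match_log_odds(seq: str) -> float:
    """Sum of log2(emission / background) over the match columns."""
    if len(seq) != len(MATCH_STATES):
        raise ValueError("sequence must cover every match state exactly once")
    return sum(
        math.log2(state[base] / BACKGROUND[base])
        for state, base in zip(MATCH_STATES, seq)
    )

print(match_log_odds("AGT"))  # agrees with the preferred base in columns 1 and 2
print(match_log_odds("CAT"))  # disagrees in columns 1 and 2
```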
12
(*) http://pfam.sanger.ac.uk//family/TAF
13
Model Proposed
  • Objective
  • Look at pattern of substitution in conserved
    protein domains
  • Algorithm
  • Input
  • Alignment A
  • Unrooted tree T
  • Profile HMM D (aligned with A)
  • Output
  • Score for a leaf of the tree which represents the
    belief that the node corresponds to a pseudogene.

14
Algorithm
  • Notation
  • xn. : row of A corresponding to leaf node n.
  • x.i : i-th column of A.
  • A\xn. : alignment A excluding row xn.
  • mj : j-th match column of the profile HMM.
  • pn : parent node of n.
  • bn : branch from pn to n.
  • T\bn : tree T excluding bn.

15
Algorithm
  • Input: unrooted tree T, alignment A, profile HMM
    D
  • Output: log-odds scores
  • A neutral nucleotide model compared to a Pfam
    domain encoding model (PSILC-nuc/dom)
  • A protein coding model compared to a Pfam domain
    encoding model (PSILC-prot/dom).

Evolutionary model
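
A log-odds score of this kind can be sketched as a sum of per-column log ratios. The sketch assumes the per-column probabilities of the leaf's residues under each model have already been computed, and it puts the alternative (neutral or protein coding) model in the numerator so that higher scores point toward a pseudogene. The function name and the numbers are illustrative, not taken from the paper.

```python
import math

def log_odds_score(p_alternative, p_domain):
    """Sum over columns of log(P(x_ni | alternative) / P(x_ni | domain))."""
    if len(p_alternative) != len(p_domain):
        raise ValueError("need one probability per column under each model")
    return sum(math.log(pa / pd) for pa, pd in zip(p_alternative, p_domain))

# Hypothetical per-column probabilities for one leaf sequence.
p_neutral = [0.25, 0.25, 0.25]
p_domain = [0.60, 0.50, 0.05]   # the third column fits the domain model badly

score = log_odds_score(p_neutral, p_domain)
print(score)  # positive: the neutral model explains this row slightly better
```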
16
Algorithm
  • Independence assumptions
  • xni is independent of the other columns in its
    row, given A\xn.
  • xni is independent of the other columns in
    A\xn., given x.i\xni
  • Tree assumption: xni is independent of x.i\xni,
    given xpni
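
One way to read these assumptions together is as the factorization below. It is reconstructed from the bullets above rather than copied from the paper, with a ranging over the possible residues or bases at the parent node:

```latex
P(x_{n\cdot} \mid A \setminus x_{n\cdot})
  = \prod_i P(x_{ni} \mid x_{\cdot i} \setminus x_{ni}),
\qquad
P(x_{ni} \mid x_{\cdot i} \setminus x_{ni})
  = \sum_a P(x_{p_n i} = a \mid x_{\cdot i} \setminus x_{ni}) \,
           P(x_{ni} \mid x_{p_n i} = a)
```

The product comes from the two column-independence assumptions, and the sum over a comes from the tree assumption.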

17
Algorithm
  • Steps
  • Calculate the distribution at xpni given the
    evolutionary constraints on the other branches.
  • For each residue/base at xpni, calculate the
    transition probability to xni given the
    evolutionary constraints.
  • pn is set as the root of T
  • Prior distribution: stationary distribution of Q
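
The two steps can be sketched for a single column on the smallest possible tree: a parent node with the leaf of interest and one sibling leaf. The sketch swaps in the Jukes-Cantor (JC69) model because its transition probabilities have a simple closed form. The slides name HKY and WAG instead, and the real calculation combines all the other branches of T. Function names and branch lengths are illustrative.

```python
import math

BASES = "ACGT"

def jc69(a: str, b: str, t: float) -> float:
    """JC69 probability of base a changing to base b over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def parent_distribution(sibling_base: str, t_sibling: float) -> dict:
    """Step 1: distribution at the parent given the rest of the column.

    The parent is treated as the root, with the stationary (uniform)
    distribution as its prior.
    """
    prior = 0.25
    weights = {a: prior * jc69(a, sibling_base, t_sibling) for a in BASES}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def leaf_probability(leaf_base: str, t_leaf: float, parent_dist: dict) -> float:
    """Step 2: sum over parent states of P(parent) * P(parent -> leaf)."""
    return sum(p * jc69(a, leaf_base, t_leaf) for a, p in parent_dist.items())

parent = parent_distribution("A", 0.1)
print(leaf_probability("A", 0.1, parent))
print(leaf_probability("C", 0.1, parent))
```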

18
Evolutionary Model
  • Instantaneous rate matrix (Q)
  • DNA models: HKY model (π uniform)
  • Amino acid model: database estimates (WAG, π)
  • π: steady-state distribution (vs. equilibrium)
  • Alternative models: π observed in A
  • Null model: distribution of the state in the HMM
  • Parameters (ML)
  • f: trade-off in mutation pressure (from-to)
  • r: evolutionary rate
  • κ: transition/transversion ratio

(*) A Novel Use of Equilibrium Frequencies in
Models of Sequence Evolution. Nick Goldman and
Simon Whelan
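
For reference, a standard HKY rate matrix can be built from a stationary distribution (pi in the code) and a transition/transversion ratio (kappa). This is the textbook HKY form only. The extra trade-off parameter f from the Goldman and Whelan paper is not modelled, and the function name is an illustrative choice.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def hky_rate_matrix(pi: dict, kappa: float) -> dict:
    """Return Q as nested dicts, scaled to one expected substitution per unit time.

    Off-diagonal rate a -> b is kappa * pi[b] for transitions
    (purine <-> purine, pyrimidine <-> pyrimidine) and pi[b] for transversions.
    """
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                is_transition = (a in PURINES) == (b in PURINES)
                q[a][b] = (kappa if is_transition else 1.0) * pi[b]
        # Diagonal entry makes each row sum to zero.
        q[a][a] = -sum(q[a].values())
    # Expected substitution rate at stationarity, used for normalisation.
    rate = -sum(pi[a] * q[a][a] for a in BASES)
    return {a: {b: q[a][b] / rate for b in BASES} for a in BASES}

uniform = {b: 0.25 for b in BASES}
Q = hky_rate_matrix(uniform, kappa=2.0)
print(Q["A"])
```

With a uniform stationary distribution, as on the slide, HKY reduces to the Kimura two-parameter model.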
19
Algorithm
  • Directionality of the calculation
  • The score on an alignment of two transcripts
    x1, x2 is not symmetric (detailed balance).
  • If base x1i is more likely than x2i at a
    particular match state but both are equally
    likely under the protein model, the score for
    x2. being a pseudogene is higher than the score
    for x1.
  • dN/dS does not have this property (a third
    sequence must be used).
  • Requires a Pfam model (independent)
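
The asymmetry can be seen in a toy calculation for one column and two sequences. The sketch uses the F81 model, whose transition probabilities depend on the target base frequencies, with a made-up match-state distribution as the domain model and a uniform distribution as the alternative. F81, the numbers and the function names are all stand-ins chosen for brevity, not the models used by PSILC.

```python
import math

def f81(a: str, b: str, t: float, pi: dict) -> float:
    """F81 probability of a -> b over time t with target frequencies pi."""
    stay = math.exp(-t)
    return stay * (1.0 if a == b else 0.0) + (1.0 - stay) * pi[b]

UNIFORM = {b: 0.25 for b in "ACGT"}
# Hypothetical match state that strongly prefers A.
MATCH = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}

def pseudogene_score(target: str, other: str, t: float) -> float:
    """log of P_alternative(target | other) over P_domain(target | other)."""
    return math.log(
        f81(other, target, t, UNIFORM) / f81(other, target, t, MATCH)
    )

# x1 carries the base favoured by the match state, x2 a disfavoured one.
x1_base, x2_base = "A", "G"
print(pseudogene_score(x2_base, x1_base, 0.2))  # score for x2
print(pseudogene_score(x1_base, x2_base, 0.2))  # score for x1
```

Swapping the roles of the two sequences changes the score, so the sequence carrying the disfavoured base is the one flagged.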

20
Results
  • Data: chromosome 6 of the human genome
  • Manually annotated (pseudo)genes
  • BLAST search against ENSEMBL: e < 10^-7 (> 80) (< 99)
  • Multiple alignment: ClustalW
  • Maximum likelihood distance.
  • Nearest neighbor tree.
  • 598 (875) coding transcripts, 97 (158)
    pseudogenes

21
Results
  • ROC

Why is PSILC-prot/dom better than PSILC-nuc/dom?
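
For readers unfamiliar with ROC curves, the sketch below shows how the points of such a curve are obtained from scores and known labels. The scores and labels are made up, ties between scores are not handled specially, and none of the numbers relate to the results on the slide.

```python
def roc_points(scores, is_pseudogene):
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per threshold, sweeping from the highest
    score downward, starting at (0, 0).
    """
    positives = sum(1 for label in is_pseudogene if label)
    negatives = len(is_pseudogene) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one pseudogene and one coding transcript")
    ranked = sorted(zip(scores, is_pseudogene), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical scores: higher means more pseudogene-like.
scores = [2.1, 0.4, 1.3, -0.5]
labels = [True, False, True, False]
print(roc_points(scores, labels))
```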
22
Results
  • Better discrimination

23
Question
  • What is the main difference between the HMMs
    previously studied (e.g. pairwise alignment) and
    profile HMMs? Why are the latter important for
    the identification of pseudogenes?