Title: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES
1IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF
PSEUDOGENES
- L. Coin and R. Durbin
- Wellcome Trust Sanger Institute
- Bioinformatics, 2004
- Presented by Oscar Sanchez Plazas
2Outline
- Problem definition
- Previous works on pseudogene identification
- Proposed method
- Protein domain profile (Pfam)
- Algorithm
- Results and Discussion
3Pseudogene Identification
- Pseudogene: remnant of a genomic gene sequence that is no longer translated into a functional protein.
- Non-processed (duplicated): product of genome duplication (paralogous); loss of function at the transcription or translation level
- Processed (70%): product of retrotransposition; no introns, no promoter
Source: Plagiarized Errors and Molecular Genetics. Edward E. Max, M.D.
4- Source: http://www.pseudogene.org/definition.html
5Pseudogenes
- Significance
- Comparative genomics
- Evolution of DNA, new gene expression, patterns
- Study of mechanisms for regulation of gene expression
- Verification of gene sequences in databases
6Pseudogenes
- Are they functional? (Why the high conservation compared to prokaryotes?)
- Pseudogenes exhibit evolutionary conservation of gene sequence, reduced nucleotide variability, excess synonymous over nonsynonymous nucleotide polymorphism, and other features that are expected in genes or DNA sequences that have functional roles (1)
- (1) Pseudogenes: Are They Junk or Functional DNA? Evgeniy S. Balakirev, Francisco Ayala. 2003
- An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene. Hirotsune, S. et al. Nature, 2003
- The putatively functional Mkrn1-p1 pseudogene is neither expressed nor imprinted, nor does it regulate its source gene in trans. Gray TA, Wilson A, Fortin PJ, Nicholls RD. PNAS, 2006
Source: www.answersingenesis.org/tj/v17/i2/pseudogene.asp
7Problem
- Sometimes pseudogenes are mis-annotated in gene sequence databases as functional genes.
- Key insight
- Employ an evolutionary constraint model derived from a functional characterization of the gene product.
- Constrained vs. neutral model
8Previous approaches
- Presence of stop codons and frameshifts.
- Not very sensitive (50% are detectable)
Source: Large-scale analysis of pseudogenes in the human genome. Zhaolei Zhang, Mark Gerstein
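The disablement check above can be sketched in a few lines. This is a minimal illustration, not the method of the cited survey: the function name `disablements` and the example sequences are hypothetical, the sequence is assumed to start in frame, and a length that is not a multiple of three is used as a crude stand-in for a frameshift (a real check would align against the parent gene).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disablements(cds):
    """Report simple disabling features in a candidate coding sequence.

    Assumes `cds` starts in frame at its first codon; a stop as the
    final codon is treated as normal.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Codon indices of stop codons that occur before the last codon.
    premature = [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    return {
        "premature_stops": premature,
        "length_not_multiple_of_3": len(seq) % 3 != 0,
    }

intact = "ATGGCTGAAAAATAA"   # ATG GCT GAA AAA TAA
broken = "ATGGCTTGAAAATAA"   # ATG GCT TGA AAA TAA (premature stop)
shifted = "ATGGCGAAAAATAA"   # one base deleted, length 14

for name, seq in [("intact", intact), ("broken", broken), ("shifted", shifted)]:
    print(name, disablements(seq))
```

A pseudogene that has not yet picked up a stop codon or an indel passes this check untouched, which is the sensitivity problem noted on the slide.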
9Previous approaches
- Ratio of non-synonymous to synonymous substitutions (dN/dS)
- Not very accurate, e.g. for a gene under positive selection pressure.
Source: Genome-wide survey of human pseudogenes. Torrents, D., Suyama, M., Zdobnov, E. and Bork, P.
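To make the dN/dS idea concrete, the sketch below classifies codon differences between two aligned coding sequences as synonymous or non-synonymous using the standard genetic code. It is a simplification under stated assumptions: the sequences are ungapped, in frame, and uppercase ACGT; only codons differing at exactly one base are counted; and the raw counts are not normalized per site, so this is not a full dN/dS estimator. The function name `count_differences` and the example sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def count_differences(cds1, cds2):
    """Return (synonymous, nonsynonymous) counts of single-base codon differences."""
    syn = nonsyn = 0
    for i in range(0, min(len(cds1), len(cds2)) - 2, 3):
        c1, c2 = cds1[i:i + 3], cds2[i:i + 3]
        if sum(a != b for a, b in zip(c1, c2)) != 1:
            continue  # identical codons, or multiple changes (ignored in this sketch)
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn

gene = "ATGGCTGAAAAA"   # M A E K
copy = "ATGGCCGAAAGA"   # GCT->GCC keeps Ala, AAA->AGA changes Lys to Arg
print(count_differences(gene, copy))
```

With only two sequences the counts say nothing about which lineage accumulated the changes, which is part of why the ratio alone can mislead.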
10Model Proposed
- PSILC: Pseudogene inference from loss of constraint (log-odds score)
- Protein domain evolution (functional constraint)
- Null probability model (Pfam)
- Neutral nucleotide model
- Protein coding model
11Domain Profile - HMM
- Protein domains: structural, functional and evolutionary units of proteins
- Profile HMMs: the most sensitive models for domains
- Every state has a particular emission distribution over A, C, T, G
- Figure: profile HMM with match, insertion and deletion states (source: genome.nasa.gov/MediaLib/hmm_project_fig2.jpg)
12Source: http://pfam.sanger.ac.uk//family/TAF
13Model Proposed
- Objective
- Look at the pattern of substitution in conserved protein domains
- Algorithm
- Input
- Alignment A
- Unrooted tree T
- Profile HMM D (aligned with A)
- Output
- Score for a leaf of the tree which represents the
belief that the node corresponds to a pseudogene.
14Algorithm
- Notation
- x_n. : row of A corresponding to leaf node n.
- x_.i : i-th column of A.
- A \ x_n. : alignment A excluding row x_n.
- m_j : j-th match column of the profile HMM.
- p_n : parent node of n.
- b_n : branch from p_n to n.
- T \ b_n : tree T excluding b_n.
15Algorithm
- Input: unrooted tree T, alignment A, profile HMM D
- Output: log-odds scores
- A neutral nucleotide model compared to a Pfam domain encoding model (PSILC-nuc/dom)
- A protein coding model compared to a Pfam domain encoding model (PSILC-prot/dom).
Evolutionary model
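A log-odds score of the kind listed above compares how well two models explain the same row of the alignment. The sketch below assumes per-column likelihoods have already been computed and that columns are treated as independent; the function name `log_odds`, the direction of the ratio (alternative model over domain model, so that positive values point toward a pseudogene), and the numbers are all illustrative assumptions, not values from the paper.

```python
import math

def log_odds(lik_alternative, lik_domain):
    """Sum over columns of log(P(column | alternative) / P(column | domain))."""
    return sum(math.log(a / d) for a, d in zip(lik_alternative, lik_domain))

# Hypothetical per-column likelihoods for one leaf sequence.
lik_neutral = [0.25, 0.25, 0.25]     # neutral nucleotide model
lik_domain = [0.5, 0.125, 0.0625]    # domain-constrained model

score = log_odds(lik_neutral, lik_domain)
print(round(score, 3))
```

In this toy case the first column fits the domain model better, but the last two fit the neutral model better, so the total comes out positive.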
16Algorithm
- Independence assumptions
- x_ni is independent of the other columns in its row, given A \ x_n.
- x_ni is independent of the other columns in A \ x_n., given x_.i \ x_ni
- Tree assumption: x_ni is independent of x_.i \ x_ni, given x_(p_n)i
17Algorithm
- Steps
- Calculate the distribution at x_(p_n)i given the evolutionary constraints on the other branches.
- For each residue/base at x_(p_n)i, calculate the transition probability to x_ni given the evolutionary constraints.
- p_n is set as the root of T
- Prior distribution: stationary distribution of Q
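The two steps above can be sketched for a single alignment column. To keep the example short it swaps in the Jukes-Cantor model (closed-form transition probabilities) instead of the models used in the paper, and it assumes the other sequences hang directly off the parent node; a general tree would need Felsenstein's pruning recursion. All function names and branch lengths here are hypothetical.

```python
import math

BASES = "ACGT"

def jc_transition(a, b, t):
    """Jukes-Cantor probability that base a becomes base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def parent_distribution(other_leaves, prior):
    """Step 1: distribution at the parent given the bases on its other branches.

    other_leaves is a list of (base, branch_length) pairs attached to the parent.
    """
    weights = {
        a: prior[a] * math.prod(jc_transition(a, b, t) for b, t in other_leaves)
        for a in BASES
    }
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def leaf_probability(leaf_base, t_leaf, other_leaves, prior):
    """Step 2: sum over parent bases of P(parent base) * P(parent base -> leaf base)."""
    dist = parent_distribution(other_leaves, prior)
    return sum(dist[a] * jc_transition(a, leaf_base, t_leaf) for a in BASES)

prior = {b: 0.25 for b in BASES}        # stationary distribution of the JC model
others = [("A", 0.1), ("A", 0.2)]       # the rest of the column
probs = {b: leaf_probability(b, 0.1, others, prior) for b in BASES}
print(probs)
```

Running the same calculation under two different rate matrices gives the two likelihoods that enter the log-odds score.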
18Evolutionary Model
- Instantaneous rate matrix (Q)
- DNA models: HKY model (π uniform)
- Amino acid model: database estimates (WAG, π)
- π: steady-state distribution (vs. equilibrium)
- Alternative models: π observed in A
- Null model: distribution of the state in the HMM
- Parameters (ML)
- f: trade-off in mutation pressure (from-to)
- r: evolutionary rate
- κ: transition/transversion ratio
Source: A Novel Use of Equilibrium Frequencies in Models of Sequence Evolution. Nick Goldman and Simon Whelan
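As an illustration of the rate matrix on this slide, the sketch below builds an HKY matrix from a stationary distribution and a transition/transversion ratio. It uses the standard HKY form (rate from a to b proportional to the frequency of b, multiplied by the ratio for transitions). The matrix is left unnormalized, and the f and r parameters listed above are not modelled; the function names and example frequencies are hypothetical.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)

def hky_rate_matrix(pi, kappa):
    """Unnormalized HKY rate matrix as a dict of dicts, q[a][b]."""
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                q[a][b] = (kappa if is_transition(a, b) else 1.0) * pi[b]
        # Diagonal makes each row sum to zero.
        q[a][a] = -sum(q[a].values())
    return q

pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}   # hypothetical frequencies
Q = hky_rate_matrix(pi, kappa=2.0)

# pi is stationary when the flow into every base balances the flow out.
for b in BASES:
    print(b, round(sum(pi[a] * Q[a][b] for a in BASES), 12))
```

Swapping the frequencies for the emission distribution of a match state is what ties the substitution process to the domain profile.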
19Algorithm
- Directionality of the calculation
- The score on an alignment of two transcripts x_1, x_2 is not symmetric (detailed balance).
- If base x_1i is more likely than x_2i at a particular match state but equally likely under the protein model, the score for x_2. being a pseudogene is higher than the score for x_1.
- dN/dS does not have this property (a third sequence should be used).
- Requires a Pfam model (independent)
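The asymmetry can be seen in a one-site toy example. The sketch below is a nucleotide-level simplification, not the prot/dom calculation described on the slide: it uses an F81-style model whose stationary distribution is either uniform (neutral) or a hypothetical match-state emission distribution (domain), and it treats the other sequence as the parent. All names and numbers are assumptions for illustration.

```python
import math

def f81_transition(a, b, t, pi):
    """F81-style probability of a -> b after time t, with stationary distribution pi."""
    e = math.exp(-t)
    return (e if a == b else 0.0) + (1.0 - e) * pi[b]

def pseudogene_score(parent, leaf, t, pi_domain, pi_neutral):
    """Log-odds of the neutral model over the domain-constrained model at one site."""
    return math.log(
        f81_transition(parent, leaf, t, pi_neutral)
        / f81_transition(parent, leaf, t, pi_domain)
    )

pi_neutral = {b: 0.25 for b in "ACGT"}
pi_domain = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}  # hypothetical match-state emissions

# x1 carries A (favoured by the match state), x2 carries C (disfavoured).
score_x2 = pseudogene_score("A", "C", 0.5, pi_domain, pi_neutral)
score_x1 = pseudogene_score("C", "A", 0.5, pi_domain, pi_neutral)
print(round(score_x2, 3), round(score_x1, 3))
```

Moving away from the favoured base is improbable under the domain model, so the sequence holding the disfavoured base receives the higher score, while the same pair of bases read in the other direction scores lower.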
20Results
- Data: chromosome 6 of the human genome
- Manually annotated (pseudo)genes
- BLAST search against ENSEMBL: E-value < 10^-7 (> 80) (< 99)
- Multiple alignment: ClustalW
- Maximum likelihood distance.
- Nearest neighbor tree.
- 598 (875) coding transcripts, 97 (158) pseudogenes
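The slide does not say which substitution model the maximum likelihood distance was computed under. As one common example, the sketch below gives the closed-form estimate under the Jukes-Cantor model for a pair of aligned sequences; the choice of model and the function name `jc_distance` are assumptions, and columns with a gap in either sequence are simply skipped.

```python
import math

def jc_distance(seq1, seq2):
    """Jukes-Cantor maximum likelihood distance between two aligned sequences."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)   # proportion of differing sites
    if p >= 0.75:
        return math.inf  # saturated: the estimate is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

d = jc_distance("ACGTACGT", "ACGTACGA")
print(round(d, 4))
```

A matrix of such pairwise distances is the usual input to a distance-based tree-building step.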
21Results
Why is PSILC-prot/dom better than PSILC-nuc/dom?
22Results
23Question
- What is the main difference between the HMMs previously studied (e.g. pairwise alignment) and profile HMMs? Why are the latter important for the identification of pseudogenes?