IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

L. Coin and R. Durbin, Wellcome Trust Sanger Institute. Bioinformatics, 2004. Presented by Oscar Sanchez Plazas.

1
IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF
PSEUDOGENES

L. Coin and R. Durbin
Wellcome Trust Sanger Institute

Bioinformatics, 2004. Presented by Oscar Sanchez
Plazas
2
Outline

Problem definition
Previous works on pseudogene identification
Proposed method
Protein domain profile (Pfam)
Algorithm
Results and Discussion

3
Pseudogene Identification

Pseudogene: remnant of a genomic gene sequence
that is no longer translated into a
functional protein.
Non-processed (duplicated)
Product of genome duplication (paralogous)
Loss of function at the transcription or
translation level
Processed (70%)
Product of retrotransposition
No introns, no promoter

(*) Plagiarized Errors and Molecular Genetics.
Edward E. Max, M.D.
4

(*) http://www.pseudogene.org/definition.html

5
Pseudogenes

Significance
Comparative genomics
Evolution of DNA, new gene expression patterns
Study of mechanisms for regulation of gene
expression
Verification of gene sequences in databases

6
Pseudogenes

Are they functional? (Why such high conservation
compared to prokaryotes?)
Pseudogenes exhibit evolutionary conservation of
gene sequence, reduced nucleotide variability,
excess synonymous over nonsynonymous nucleotide
polymorphism, and other features that are
expected in genes or DNA sequences that have
functional roles (1).
(1) Pseudogenes: Are They "Junk" or Functional
DNA? Evgeniy S. Balakirev, Francisco Ayala. 2003
- An expressed pseudogene regulates the
messenger-RNA stability of its homologous coding
gene. Hirotsune, S. et al. Nature. 2003
- The putatively functional Mkrn1-p1 pseudogene
is neither expressed nor imprinted, nor does it
regulate its source gene in trans. Gray TA,
Wilson A, Fortin PJ, Nicholls RD. PNAS. 2006

(*) www.answersingenesis.org/tj/v17/i2/pseudogene.asp
7
Problem

Sometimes pseudogenes are mis-annotated in gene
sequence databases as functional genes.
Key insight
Employ an evolutionary constraint model derived
from a functional characterization of the gene
product.
Constrained vs. neutral model

8
Previous approaches

Presence of stop codons and frameshifts.
Not very sensitive (50% are detectable) (*)

(*) Large-scale analysis of pseudogenes in the
human genome. Zhaolei Zhang, Mark Gerstein
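
As a rough illustration of the disablement-based approach on this slide, the sketch below scans a candidate coding sequence for in-frame stop codons before the final codon and flags a length that is not a multiple of three. The function name `find_disablements` is made up for this example, and the length check is only a crude stand-in for frameshift detection, which in practice relies on an alignment to the parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_disablements(cds):
    """Return (offsets of premature in-frame stop codons, length-based frameshift flag)."""
    cds = cds.upper()
    # Crude frameshift proxy: the sequence length is not a multiple of three.
    frameshift = len(cds) % 3 != 0
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # A stop codon is only expected as the last codon of an intact coding sequence.
    premature = [i * 3 for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    return premature, frameshift
```

A sequence with no premature stop and an intact reading frame passes this check, which is one way to see why the slide calls the approach not very sensitive.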
9
Previous approaches

Ratio of non-synonymous to synonymous
substitutions (dN/dS)
Not very accurate, e.g. for genes under positive
selection pressure.

(*) Genome-wide survey of human pseudogenes.
Torrents,D., Suyama,M., Zdobnov,E. and Bork,P.
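
To make the synonymous/non-synonymous distinction concrete, here is a minimal sketch that classifies codon differences between two aligned, gap-free coding sequences using the standard genetic code. It only counts differing codons; a real dN/dS estimate also normalises by the number of synonymous and non-synonymous sites (for example with the Nei-Gojobori method), which is omitted here. The names `CODON_TABLE` and `count_substitutions` are chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def count_substitutions(seq1, seq2):
    """Count codon differences that keep (syn) or change (nonsyn) the amino acid."""
    syn = nonsyn = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        # Skip identical codons and codons with ambiguous or gap characters.
        if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

Under neutral evolution the two classes are expected to accumulate at similar per-site rates, while a gene under purifying selection shows relatively few non-synonymous changes.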
10
Model Proposed

PSILC: Pseudogene inference from loss of
constraint (log-odds score)
Protein domain evolution (functional constraint):
null probability model (Pfam)
Neutral nucleotide model
Protein coding model

11
Domain Profile - HMM

Protein domains: structural, functional and
evolutionary units of proteins
Profile HMMs: the most sensitive models for
domains
Every state has a particular emission
distribution over A,C,T,G
(*) genome.nasa.gov/MediaLib/hmm_project_fig2.jpg

Figure labels: deletion, insertion, match
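
A minimal sketch of the idea that each state carries its own emission distribution: the code below scores an ungapped sequence against a chain of match states only, as a log-odds against a uniform background. The emission values are invented for illustration, and insert and delete states are left out, so this is far simpler than a real Pfam profile HMM.

```python
import math

# Hypothetical emission distributions for three match states (each sums to 1).
MATCH_EMISSIONS = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
BACKGROUND = {base: 0.25 for base in "ACGT"}

def match_log_odds(seq, emissions=MATCH_EMISSIONS, background=BACKGROUND):
    """Sum log(emission / background) over match states for an ungapped sequence."""
    if len(seq) != len(emissions):
        raise ValueError("sequence length must equal the number of match states")
    return sum(math.log(e[b] / background[b]) for b, e in zip(seq, emissions))
```

A sequence that follows the preferred bases of the match states scores above zero, and one that avoids them scores below zero.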
12
(*) http://pfam.sanger.ac.uk//family/TAF
13
Model Proposed

Objective:
Look at the pattern of substitutions in conserved
protein domains
Algorithm
Input:
Alignment A
Unrooted tree T
Profile HMM D (aligned with A)
Output:
Score for a leaf of the tree, representing the
belief that the node corresponds to a pseudogene.

14
Algorithm

Notation
X_{n.}: row of A corresponding to leaf node n.
X_{.i}: i-th column of A.
A \ X_{n.}: alignment A excluding X_{n.}.
m_j: j-th match column of the profile HMM.
p_n: parent node of n.
b_n: branch from p_n to n.
T \ b_n: tree T excluding b_n.

15
Algorithm

Input: unrooted tree T, alignment A, profile HMM
D
Output: log-odds scores
A neutral nucleotide model compared to a Pfam
domain encoding model (PSILC-nuc/dom)
A protein coding model compared to a Pfam domain
encoding model (PSILC-prot/dom).

Evolutionary model
16
Algorithm

Independence assumptions
x_{ni} is independent of the other columns in its
row, given A \ x_{n.}
x_{ni} is independent of the other columns in
A \ x_{n.}, given x_{.i} \ x_{ni}
Tree assumption: x_{ni} is independent of
x_{.i} \ x_{ni}, given x_{p_n i}
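
Read together, the three assumptions suggest the following factorisation of the probability of row x_{n.} under a model M. This is a sketch reconstructed from the slide's wording, not a formula quoted from the paper; the symbol M and the summation variable a are introduced here.

```latex
\begin{align}
P(x_{n\cdot} \mid A \setminus x_{n\cdot}, T, M)
  &= \prod_i P(x_{ni} \mid A \setminus x_{n\cdot}, T, M) \\
  &= \prod_i P(x_{ni} \mid x_{\cdot i} \setminus x_{ni}, T, M) \\
  &= \prod_i \sum_{a} P(x_{p_n i} = a \mid x_{\cdot i} \setminus x_{ni}, T, M)\,
     P(x_{ni} \mid x_{p_n i} = a, M)
\end{align}
```

The first line uses independence across columns within the row, the second restricts the conditioning to the same column, and the third applies the tree assumption by summing over the state a at the parent node p_n. A log-odds score between two models then becomes a sum of per-column log ratios.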

17
Algorithm

Steps
1. Calculate the distribution at x_{p_n i} given
the evolutionary constraints on the other branches.
2. For each residue/base at x_{p_n i}, calculate
the transition probability to x_{ni} given the
evolutionary constraints.

p_n is set as the root of T
Prior distribution: stationary distribution of Q
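
The second step can be sketched as a marginalisation over the parent state. In the code below the parent distribution from step 1 is simply passed in as `parent_dist`, and the Jukes-Cantor (JC69) closed form is swapped in for the transition probabilities to keep the example self-contained; the slides themselves name HKY and WAG, not JC69. All function names are chosen for this example.

```python
import math

BASES = "ACGT"

def jc69_transition(a, b, t):
    """P(child base b | parent base a) after branch length t under JC69."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def child_distribution(parent_dist, t):
    """Marginalise over the parent state to get the distribution at the leaf."""
    return {
        b: sum(parent_dist[a] * jc69_transition(a, b, t) for a in BASES)
        for b in BASES
    }
```

The probability assigned to the base actually observed at the leaf is the per-column quantity that would enter a log-odds score.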

18
Evolutionary Model

Instantaneous rate matrix (Q)
DNA models: HKY model (π - uniform)
Amino acid model: database estimates (WAG, π)
π - steady-state distribution (vs. equilibrium)
Alternative models: π observed in A
Null model: distribution of the state in the HMM
Parameters (ML)
f: trade-off in mutation pressure (from-to)
r: evolutionary rate
κ: transition/transversion ratio

(*) A Novel Use of Equilibrium Frequencies in
Models of Sequence Evolution. Nick Goldman and
Simon Whelan
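
For the HKY model named on this slide, a minimal sketch of the instantaneous rate matrix is shown below. Off-diagonal rates are proportional to the stationary frequency of the target base, multiplied by the transition/transversion ratio for transitions (A<->G, C<->T). The parameter names `pi` and `kappa` are chosen here, the matrix is left unscaled, and the trade-off parameter f from the slide is not modelled.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)

def hky_rate_matrix(pi, kappa):
    """Build an unscaled HKY rate matrix as a nested dict q[from_base][to_base]."""
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                q[a][b] = (kappa if is_transition(a, b) else 1.0) * pi[b]
        # Diagonal entry makes each row sum to zero.
        q[a][a] = -sum(q[a].values())
    return q
```

By construction pi[a] * q[a][b] equals pi[b] * q[b][a], so the chain is reversible with stationary distribution pi.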
19
Algorithm

Directionality of the calculation
Score on an alignment of two transcripts x_1, x_2
is not symmetric (detailed balance).
If base x_{1i} is more likely than x_{2i} at a
particular match state but equally likely under
the protein model, the score for x_{2.} being a
pseudogene is higher than the score for x_{1.}.
dN/dS does not have this property (a third
sequence would be needed).
Requires an (independent) Pfam model

20
Results

Data: chromosome 6 of the human genome
Manually annotated (pseudo)genes
BLAST search against Ensembl: e < 10^-7 (> 80) (< 99)
Multiple alignment: ClustalW
Max. likelihood distance.
Nearest neighbor tree.
598 (875) coding transcripts, 97 (158)
pseudogenes

21
Results

Why is PSILC-prot/dom better than PSILC-nuc/dom?
22
Results

Better discrimination

23
Question

What is the main difference between the HMMs
previously studied (e.g. pairwise alignment) and
profile HMMs? Why are the latter
important for the identification of pseudogenes?
