Combining RNA and Protein selection models - PowerPoint PPT Presentation

About This Presentation
Title:

Combining RNA and Protein selection models

Description:

Combining RNA and Protein selection models. The Central Idea in Comparative ... (July1999) 'Structural Constraints on RNA Virus Evolution' J.of Virology 5787-94 ... – PowerPoint PPT presentation

Number of Views:519
Avg rating:3.0/5.0
Slides: 30
Provided by: hein
Category:

less

Transcript and Presenter's Notes

Title: Combining RNA and Protein selection models


1
Combining RNA and Protein selection models
The Central Idea in Comparative Molecular Biology
Genomics Three basic applications
Protein secondary structure RNA secondary
structure Gene structure Combining
Evolution Constraints Protein-Protein
RNA-Protein Combining Structure Descriptions
2
Modelling Sequence Evolution
a - unknown
Biological setup
Pi,j(t) continuous time markov chain on the state
space A,C,G,T.
t1
e
A
t2
C
C
3
Jukes-Cantor 69 Total Symmetry
Rate-matrix, R
T O
A C
G T F
A -3a a
a a R C
a -3a
a a O G a
a -3 a
a M T a
a a
-3 a Transition prob. after time t, a
at P(equal) ¼(1 3e-4a ) 1 - 3a
P(diff.) ¼(1 - 3e-4a )
3a Stationary Distribution (1,1,1,1)/4.
4
Comparison of Evolutionary Objects.
Observable
Unobservable
Goldman, Thorne Jones, 96
Knudsen Hein, 99 Eddy others
Pedersen Hein, 03 Haussler others
Multiple levels of selection Protein-protein RNA-p
rotein
Pedersen, Meyer, Forsberg, Hein,
Observable
Unobservable
5
Structure Description Grammars Finite Set of
Rules Generating Strings
finished no variables
Protein secondary structure Gene Structure
RNA secondary structure
6
Simple String Generators Terminals (capital)
--- Non-Terminals (small) i. Start with S
S --gt aT bS T
--gt aS bT ? One sentence odd of as S-gt
aT -gt aaS gt aabS -gt aabaT -gt aaba ii. ?S--gt
aSa bSb aa bb One sentence (even length
palindromes) S--gt aSa --gt abSba --gt abaaba
7
Stochastic Grammars
The grammars above classify all string as
belonging to the language or not.
All variables has a finite set of substitution
rules. Assigning probabilities to the use of
each rule will assign probabilities to the
strings in the language.
If there is a 1-1 derivation (creation) of a
string, the probability of a string can be
obtained as the product probability of the
applied rules.
i. Start with S. S --gt (0.3)aT (0.7)bS
T --gt (0.2)aS (0.4)bT (0.2)?
0.2
0.7
0.3
0.3
S -gt aT -gt aaS gt aabS -gt aabaT -gt aaba
0.2
ii. ?S--gt (0.3)aSa (0.5)bSb (0.1)aa (0.1)bb
0.1
0.3
0.5
S -gt aSa -gt abSba -gt abaaba
8
Gene Describers
Simple Prokaryotic Genes
Simple Eukaryotic Genes
9
Secondary Structure Generators
S --gt LS L .869 .131 F
--gt dFd LS .788 .212 L --gt
s dFd .895 .105
Knudsen Hein, 99
10
Structure Dependent Evolution Models
  • Protein Secondary Structure Dependent (Goldman,
    Thorne Jones)
  • a, b Loop each has their own mutation
    rate matrix (20,20) , Ra,Rb Rloop

2. RNA Secondary Structure Dependent
i. R singlet, singlet (4,4)
ii. R doublet,doublet (16,16) (base pair
conserving relative to R singlet, singlet X R
singlet, singlet )
3. Gene Structure Dependent
i. Rnon-codingATG--gtGTG
ii. RcodingATG--gtGTG
iii-. Other structural categories, regulatory
signals ..
11
The Genetic Code
3 classes of sites 4 2-2 1-1-1-1
4 (3rd)
1-1-1-1 (3rd)
ii. T?A (2nd)
Problems i. Not all fit into those
categories. ii. Change in on site can change the
status of another.
12
Kimuras 2 parameter model Lis Model.
Probabilities
Rates
start
Selection on the 3 kinds of sites
(a,b)?(?,?) 1-1-1-1
(fa,fb) 2-2
(a,fb) 4 (a, b)
13
alpha-globin from rabbit and mouse.
Ser Thr Glu Met Cys Leu Met Gly Gly TCA ACT GAG
ATG TGT TTA ATG GGG GGA
TCG ACA GGG ATA TAT CTA ATG GGT ATA Ser
Thr Gly Ile Tyr Leu Met Gly Ile
  • Sites Total Conserved
    Transitions Transversions
  • 1-1-1-1 274 246 (.8978)
    12(.0438) 16(.0584)
  • 2-2 77 51 (.6623)
    21(.2727) 5(.0649)
  • 4 78 47 (.6026)
    16(.2051) 15(.1923)
  • Z(at,bt) .501exp(-2at) - 2exp(-t(ab)
    transition Y(at,bt) .251-exp(-2bt )
    (transversion)
  • X(at,bt) .251exp(-2at) 2exp(-t(ab)
    identity
  • L(observations,a,b,f)
  • C(429,274,77,78) X(af,bf)246Y(af,bf)12Z(a
    f,bf)16 X(a,bf)51Y(a,bf)21Z(a,bf)5X(a,
    b)47Y(a,b)16Z(a,b)15
  • where a at and b bt.
  • Estimated Parameters a 0.3003 b
    0.1871 2b 0.3742 (a 2b) 0.6745 f
    0.1663
  • Transitions
    Transversions
  • 1-1-1-1 af 0.0500 2bf 0.0622
  • 2-2 a 0.3004 2bf 0.0622
  • 4 a 0.3004 2b 0.3741

14
Three Questions
What is the probability of the data? What is the
most probable hidden configuration? What is the
probability of specific hidden state?
HMM/Stochastic Regular Grammar
W
Stochastic Context Free Grammars
WL
WR
j
L
i
1
i
j
15
Comparative Gene Finding Jakob Skou Pedersen
Hein, 2004
16
Knudsen Hein, 99
17
From Knudsen Hein (1999)
18
Knudsen and Hein, 2003
19
Why combine RNA Protein Models?
Short Term/Long Term Evolution Discrepancies Sepa
rating Selective Effects Analyzing one level
without interference from the other level
Predicting gene structure and RNA structure
better. Annotation of Viral Genomes
20
Combining Levels of Selection.
Assume multiplicativity fA,B fAfB
Protein-Protein Hein Støvlbæk, 1995
Codon Nucleotide Independence Heuristic Jensen
Pedersen, 2001 Contagious Dependence
Protein-RNA
Doublets
Singlet
Contagious Dependence
21
Overlapping Coding Regions Hein Stoevlbaek, 95
1st
1-1-1-1 2-2
4
2nd
(f1f2a, f1f2b) (f2a, f1f2b)
(f2a, f2b)
1-1-1-1 sites 2-2 4
(f1a, f1f2b) (f2a, f1f2b)
(a, f2b)
(f1a, f1b) (a, f1b)
(a, b)
pol
gag
Example Gag Pol from HIV
Gag
1-1-1-1 2-2
4
Pol
64 31
34
1-1-1-1 sites 2-2 4
40 7
0
27 2
0
MLE a.084 b .024 a2b.133
fgag.403 fpol.229
Ziheng Yang has an alternative model to this,
were sites are lumped into the same category if
they have the same configuration of positions and
reading frames.
22
HIV2 Analysis

Hasegawa, Kisino Yano Subsitution Model
Parameters at ßt pA
pC pG pT 0.350 0.105
0.361 0.181 0.236 0.222 0.015
0.005 0.004 0.003 0.003 Selection
Factors GAG 0.385 (s.d. 0.030) POL 0.220 (s.
d. 0.017) VIF 0.407 (s.d. 0.035) VPR 0.494 (
s.d. 0.044) TAT 1.229 (s.d.
0.104) REV 0.596 (s.d. 0.052) VPU 0.902 (s.d.
0.079) ENV 0.889 (s.d. 0.051) NEF 0.928 (s.
d. 0.073) Estimated Distance per Site 0.194
23
Evolution under double constraints
Codon Nucleotide Independence Heuristic
Singlet Ri,j f qi,j
Doublet R(i1,i2),(j1,j2) f1 f2 q
(i1,i2),(j1,j2)
24
Structure Prediction Hepatitis C Analysis
U
25
Evolution Models A hierarchy of hypotheses
Doublets
Singlet
26
Combined RNA Protein Structure
Gene Structure Fixed, RNA Structure Stochastic
Presently being implemented with viral analysis
in mind Both RNA Gene Structure Stochastic
Would imply Gene Finding as well.
Grammar for overlapping genes a new
phenomena Gene Structure Stochastic, RNA
Structure Fixed An untypical situation
A challenge for the future structure evolution.
27
Open Problems
Stacking Substitution Models
In principle a 44 times 44 matrix (65.536
entries!!) is need, but proper parametrisation
and symmetries is could reduce this substantially.
Other Sets of Constraints Regulatory Signals
Combining with Alignment
A
G
C
T
T
A
G
C
T
T
C
T
G
28
References.
Hein,J J.Stoevlbaek (1995) A
maximum-likelihood approach to analyzing
nonoverlapping and overlapping reading frames
J.Mol.Evol. 40.181-189. Jensen,JL Pedersen
(2001) Probabilistic models of DNA sequence
evolution with context dependent rates of
subsitution Adv. Appl.Prob. 32.499-517. Katz
and Burge (2003) Widespread Selection for Local
RNA Secondary Structure in Coding Regions of
Bacterial Genes. Genome Research.
13.2042-51 Kirby, AK, SV Muse W.Stephan (1995)
Maintenance of pre-mRNA secondary structure by
epistatic selection PNAS. 92.9047-51. Knudsen,
Hein 99 Predicting RNA Structure using
Stochastic Context Free Grammars and Molecular
Evolution Bioinformatics 15.6.446-454. Knudsen
and Hein (2003) Pfold RNA secondary structure
prediction using stochastic context-free
grammars. Nucleic Acid Research
31.13.3423-28. New Influenza gene article???
Meyer and Durbin (2002) Comparative Ab Initio
prediction of Gene Structure using pair HMMs
Bioinformatics 18.10.1309-18. Moulton, V., Zuker,
M. Steel, M., Penny, D. and Pointon, R. Metrics
on RNA Structures. J. Computational Biology, 7
(1) 277-292, (2000). Pedersen, AMK JL Jensen
(2001) A Dependent Rates Model and an
MCMC-Based Methodology for the Maximum-Likelihood
Analysis of Sequences with Overlapping Reading
Frames Mol.Biol.Evol. 18.5.763-76. Pedersen JS
J. Hein 2003 Gene finding with a Hidden
Markov Model of genome structure and evolution
Bioinformatics Pedersen, Forsberg, Meyer,
Simmonds and Hein (2003) An evolutionary model
for protein coding regions with RNA secondary
structure Manuscript in Preparation Pedersen,
Forsberg, Meyer, Simmonds and Hein (2003)
Structure Models Manuscript in
Preparation Schadt, E. K.Lange (2002) Codon
and Rate Variation Models in Molecular Phylogeny
Mol.Biol.Evol. 19.9.1534-49 Savill, NJ et al
(2001) RNA Sequence Evolution With Secondary
Structure Constraints Comparison of Substituin
Ratye Models Using Maximum-Likehood Methods
Genetics. 2001 Jan 157.399-4111 Simmonds, P. and
DB Smith (July1999) Structural Constraints on
RNA Virus Evolution J.of Virology
5787-94 Tillier ERM RA Collins (1998) High
Apparent Rate of Simultaneous Compensatory
Base-Pair Substitutions in Ribosomal RNA
Genetics 149.1993-2001. Yang, Z. et al. (1995)
Molecular Evolution of the Hepatitis B Virus
Genome J.Mol.Evol. 41.587-96
29
Acknowledgements
1. Comparative RNA Structure - Bjarne Knudsen 2.
Comparative Gene Structure - Jakob Skou
Pedersen 3. Integrating Levels of Selection
Structure Jakob Skou Pedersen,
Irmtraud Meyer, Roald Forsberg
Bjarne Knudsen
Roald Forsberg
Irmtraud Meyer
Jakob Skou Pedersen
Write a Comment
User Comments (0)
About PowerShow.com