Title: RNA Secondary Structure Prediction
1RNA Secondary Structure Prediction
- Lecture 11 June 13, 2006
- Algorithms of Molecular Biology
2Introduction to RNA Sequence/Structure Analysis
- RNAs have many structural and functional uses
- Translation
- Transcription
- RNA splicing
- RNA processing and editing
- cellular localization
- catalysis
3RNA functions
- RNA functions as
- mRNA
- rRNA
- tRNA
- In nuclear export
- Part of spliceosome (snRNA)
- Regulatory molecules (RNAi)
- Enzymes
- Viral genomes
- Retrotransposons
- Medicine
4Biological Functions of Nucleic Acids
- tRNA (transfer RNA, adaptor in translation)
- rRNA (ribosomal RNA, component of ribosome)
- snRNA (small nuclear RNA, component of the spliceosome)
- snoRNA (small nucleolar RNA, takes part in processing of rRNA)
- RNase P (ribozyme, processes tRNA)
- SRP RNA (RNA component of the signal recognition particle)
- ...
5RNA Sequence Analysis
- RNA sequence analysis differs from DNA sequence analysis
- RNA molecules fold and base pair to form secondary structures
- For RNA, conservation of structure is often more important than conservation of sequence
6More Secondary Structures
Secondary Structures of Nucleic Acids
Pseudoknots
- DNA is primarily in duplex form.
- RNA is normally single stranded and can adopt diverse secondary structures other than a duplex.
Source Cornelis W. A. Pleij in Gesteland, R. F.
and Atkins, J. F. (1993) THE RNA WORLD. Cold
Spring Harbor Laboratory Press.
rRNA Secondary Structure Based on Phylogenetic
Data
7 3D Structures of RNA: Catalytic RNA
- Some structural rules
- Base pairing is stabilizing
- Unpaired sections (loops) destabilize
- 3D conformation with interactions makes up for
this
Tertiary Structure Of Self-splicing RNA
Secondary Structure Of Self-splicing RNA
8RNA secondary structure
- E. coli RNase P RNA secondary structure
Image source www.mbio.ncsu.edu/JWB/MB409/lecture/lecture05/lecture05.htm
9tRNA structure
10RNA Variations
- Variations in RNA sequence maintain base-pairing patterns for secondary structures
- When one nucleotide of a base pair changes, the base it pairs with must also change to maintain the same structure
- Such variation is referred to as covariation.
11Covariance
- Secondary structure prediction in RNA takes into account conserved patterns of base-pairing
- Positions of covariance are treated as conserved matches, since they maintain the secondary structure
- Computationally challenging
12Features of RNA
- RNA is a polymer composed of a combination of four nucleotides
- adenine (A)
- cytosine (C)
- guanine (G)
- uracil (U)
13Features of RNA
- G-C and A-U form complementary hydrogen-bonded base pairs (canonical Watson-Crick)
- G-C base pairs are more stable (3 hydrogen bonds); A-U base pairs are less stable (2 hydrogen bonds)
- Non-canonical pairs can occur in RNA; the most common is G-U
14Features of RNA
- RNA is typically produced as a single-stranded molecule (unlike DNA)
- The strand folds upon itself to form base pairs
- This folding is the secondary structure of the RNA
15Features of RNA
- Secondary structure is an intermediary between a linear molecule and a three-dimensional structure
- Secondary structure is mainly composed of double-stranded RNA regions formed by folding the single-stranded RNA molecule back on itself
16Stem Loops (Hairpins)
- Loops generally at least 4 bases long
17Bulge Loops
- occur when bases on one side of the structure
cannot form base pairs
18Interior Loops
- occur when bases on both sides of the structure
cannot form base pairs
19Junctions (Multiloops)
- two or more double-stranded regions converge to
form a closed structure
20Tertiary Interactions
- tertiary interactions can be present as well
- located using covariance analysis
21Kissing Hairpins
- unpaired bases of two separate hairpin loops base
pair with one another
22Pseudoknots
23Hairpin-Bulge Interactions
24RNA structure prediction methods
- Dot Plot Analysis
- Base-Pair Maximization
- Free Energy Methods
- Covariance Models
25How RNA Prediction Methods Were Developed
- Mount p. 334
- Since Tinoco et al. measured the energy associated with regions of secondary structure, several energy-based algorithms have been developed
- Nussinov and Jacobson (1980), Zuker and Stiegler (1981), Trifonov and Bolshoi (1983).
26Main approaches to RNA secondary structure
prediction
- Energy minimization
- dynamic programming approach
- does not require prior sequence alignment
- requires estimation of energy terms contributing to secondary structure
- Comparative sequence analysis
- uses sequence alignment to find conserved residues and covariant base pairs
- most trusted
27Circular Representation
- Base pairs of a secondary structure are represented on a circle
- An arc is drawn for each base pairing in the structure
- If any arcs cross, a pseudoknot is present
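The crossing-arc rule can be checked directly from a list of base pairs. The sketch below is only an illustration: the function name `has_pseudoknot` and the 0-based position pairs are assumptions, not part of the lecture.

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """Return True if any two base pairs (arcs on the circle) cross."""
    # Normalise each pair so that the smaller position comes first.
    arcs = [tuple(sorted(p)) for p in pairs]
    for (i, j), (k, l) in combinations(arcs, 2):
        # Two arcs cross when exactly one end of one arc lies strictly
        # between the two ends of the other arc.
        if i < k < j < l or k < i < l < j:
            return True
    return False

nested = [(0, 8), (1, 7), (2, 6)]   # a plain stem: arcs are nested
knotted = [(0, 5), (3, 8)]          # arcs interleave: pseudoknot
print(has_pseudoknot(nested), has_pseudoknot(knotted))
```

The pairwise comparison is quadratic in the number of base pairs, which is fine for a drawing-sized structure.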
28Circular Representation
- Image source http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html
30Circular Representation
31Base-Pair Maximization
- Find the structure with the most base pairs
- An efficient dynamic programming approach to this problem was introduced by Ruth Nussinov (Tel-Aviv, 1970s).
- Tutorial in the classroom: let us try to reconstruct Nussinov's algorithm
32Nussinov Algorithm
- Four ways to get the best structure between positions i and j from the best structures of the smaller subsequences
- 1) Add the i,j pair onto the best structure found for subsequence i+1, j-1
- 2) Add unpaired position i onto the best structure for subsequence i+1, j
- 3) Add unpaired position j onto the best structure for subsequence i, j-1
- 4) Combine two optimal structures i,k and k+1, j
33Nussinov Algorithm
34Nussinov Algorithm
- Compares a sequence against itself in a dynamic programming matrix
- Four rules for scoring the structure at a particular point
- Since the structure folds upon itself, it is only necessary to calculate half the matrix
35Nussinov Algorithm
- Initialization: scores along the main diagonal and the diagonal just below it are set to zero
- Formally, the scoring matrix, M, is initialized as
- M_{i,i} = 0 for i = 1 to L (L is the sequence length)
- M_{i,i-1} = 0 for i = 2 to L
36Nussinov Algorithm
- Using the sequence GGGAAAUCC, the matrix now
looks like the following, such that sequences of
length 1 will score 0
37Nussinov Algorithm
- Matrix Fill
- M_{i,j} = max of the following
- M_{i+1,j} (ith residue is hanging off by itself)
- M_{i,j-1} (jth residue is hanging off by itself)
- M_{i+1,j-1} + S(x_i, x_j) (ith and jth residues are paired; if x_i is the complement of x_j, then S(x_i, x_j) = 1, otherwise it is 0)
- max over i < k < j of (M_{i,k} + M_{k+1,j}) (merging two substructures)
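The initialization and fill rules can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names `can_pair` and `nussinov_fill` are hypothetical, positions are 0-based, and only Watson-Crick pairs count as "complement" (an assumption, since the slide does not say whether G-U is allowed).

```python
def can_pair(a, b):
    # Assumption: only Watson-Crick pairs (A-U, G-C) score 1.
    return {a, b} in ({"A", "U"}, {"G", "C"})

def nussinov_fill(seq):
    """Return matrix M where M[i][j] is the maximum number of base pairs in seq[i..j]."""
    L = len(seq)
    # All zeros: this covers the main diagonal and the diagonal just below it.
    M = [[0] * L for _ in range(L)]
    # Fill the upper triangle by increasing subsequence length.
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            best = max(
                M[i + 1][j],                                   # i unpaired
                M[i][j - 1],                                   # j unpaired
                M[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0),  # i,j paired
            )
            for k in range(i + 1, j):                          # bifurcation, i < k < j
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    return M

M = nussinov_fill("GGGAAAUCC")
print(M[0][-1])  # best score for the whole sequence
```

The three nested loops make the fill cubic in sequence length, with quadratic memory for the matrix.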
38Nussinov Algorithm
- The final filled matrix is as follows
39Nussinov Algorithm
- Traceback (P 271, Durbin et al) leads to the
following structure
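A traceback can be sketched as follows. It is a hedged illustration under the same assumptions as the fill step (hypothetical function names, 0-based positions, Watson-Crick pairs only). Several structures can share the optimal score, so the structure returned here need not be the one drawn on the slide.

```python
def can_pair(a, b):
    # Assumption: only Watson-Crick pairs (A-U, G-C) are allowed.
    return {a, b} in ({"A", "U"}, {"G", "C"})

def nussinov_fill(seq):
    L = len(seq)
    M = [[0] * L for _ in range(L)]
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            best = max(M[i + 1][j], M[i][j - 1],
                       M[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0))
            for k in range(i + 1, j):
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    return M

def traceback(seq, M):
    """Recover one optimal set of base pairs from the filled matrix."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if M[i + 1][j] == M[i][j]:                 # i was left unpaired
            stack.append((i + 1, j))
        elif M[i][j - 1] == M[i][j]:               # j was left unpaired
            stack.append((i, j - 1))
        elif can_pair(seq[i], seq[j]) and M[i + 1][j - 1] + 1 == M[i][j]:
            pairs.append((i, j))                   # i and j are paired
            stack.append((i + 1, j - 1))
        else:                                      # two substructures were merged
            for k in range(i + 1, j):
                if M[i][k] + M[k + 1][j] == M[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(pairs)

def dot_bracket(length, pairs):
    # Render the pairs as a dot-bracket string, e.g. "((...))".
    out = ["."] * length
    for i, j in pairs:
        out[i], out[j] = "(", ")"
    return "".join(out)

seq = "GGGAAAUCC"
pairs = traceback(seq, nussinov_fill(seq))
print(pairs, dot_bracket(len(seq), pairs))
```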
40SCFG Version
- The Nussinov algorithm can be converted to a stochastic context-free grammar
- S → aS | cS | gS | uS
- S → Sa | Sc | Sg | Su
- S → aSu | cSg | uSa | gSc
- S → SS
41Nussinov Algorithm
- Web Interface
- http://ludwig-sun2.unil.ch/bsondere/nussinov/
42Nussinov Results
43Evaluation of Maximizing Basepairs
- Simplistic approach
- Does not give accurate structure predictions, since it ignores
- nearest neighbor interactions
- stacking interactions
- loop length preferences
44Free Energy Minimization RNA Structure
Prediction
- All possible choices of complementary sequences are considered
- The set(s) providing the most energetically stable molecules are chosen
- When RNA is folded, some bases are paired with others while the rest remain free, forming loops in the molecule.
- Speaking qualitatively, bases that are bonded tend to stabilize the RNA (i.e., have negative free energy), whereas unpaired bases form destabilizing loops (positive free energy).
- Through thermodynamics experiments, it has been possible to estimate the free energy of some of the common types of loops that arise.
- Because the secondary structure is related to the function of the RNA, we would like to be able to predict the secondary structure.
- Given an RNA sequence, the RNA Folding Problem is to predict the secondary structure that minimizes the total free energy of the folded RNA molecule.
45Prediction of Minimum-Energy RNA Structure is
Limited
- In predicting minimum-energy RNA secondary structure, several simplifying assumptions are made.
- The most likely structure is identical to the energetically preferable structure
- Nearest-neighbor energy calculations give reliable estimates of experimentally measured energies
- Usually we can neglect pseudoknots
46Assumptions in secondary Structure Prediction
- The most likely structure is similar to the energetically most stable structure
- The energy associated with any position is only influenced by local sequence and structure
- The structure formed does not produce pseudoknots
47Inferring Structure By Comparative Sequence
Analysis
- Most reliable computational method for determining RNA secondary structure
- Consider the example from Durbin et al., p. 266
- See an additional lecture of David Mathews
48Predicting Structure From a Single Sequence
- An RNA molecule only 200 bases long has 10^50 possible secondary structures
- Find self-complementary regions in an RNA sequence using a dot-plot of the sequence against its complement
- Repeat regions can potentially base pair to form secondary structures
- Advanced dot-plot techniques incorporate free energy measures
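A dot plot of a sequence against its complement can be sketched as a boolean matrix, with candidate stems showing up as runs where one position moves forward while the other moves backward. The function names, the Watson-Crick-only complement table, and the `min_len` threshold below are assumptions made for illustration; no free energy measure is included.

```python
def complement(seq):
    # Assumption: Watson-Crick complement only (no G-U pairs).
    table = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(table[b] for b in seq)

def dot_plot(seq):
    """plot[i][j] is True when base i equals the complement of base j."""
    comp = complement(seq)
    n = len(seq)
    return [[seq[i] == comp[j] for j in range(n)] for i in range(n)]

def stems(seq, min_len=3):
    """List (i, j, length) for runs of dots where i increases and j decreases."""
    plot = dot_plot(seq)
    n = len(seq)
    found = []
    for i in range(n):
        for j in range(n - 1, i, -1):
            if not plot[i][j]:
                continue
            # Skip cells that continue a run started further out.
            if i > 0 and j < n - 1 and plot[i - 1][j + 1]:
                continue
            length = 0
            while i + length < j - length and plot[i + length][j - length]:
                length += 1
            if length >= min_len:
                found.append((i, j, length))
    return found

print(stems("GGGAAAUCC", min_len=3))  # → [(1, 8, 3)]
```

For GGGAAAUCC the single run of length 3 starts at positions 1 and 8 (0-based): G-C, G-C, A-U.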
49Dot Plot
- Image source http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html
50Energy Minimization Methods
- RNA folding is determined by biophysical properties
- The energy minimization algorithm aims to predict the correct secondary structure by minimizing the free energy (ΔG)
- ΔG is calculated as the sum of individual contributions of
- loops
- base pairs
- secondary structure elements
- Energies of stems are calculated as stacking contributions between neighboring base pairs
51Energy Minimization Methods
- Free-energy values (kcal/mole at 37 °C) are as follows
52Energy Minimization Methods
- Free-energy values (kcal/mole at 37 °C) are as follows
53Energy Minimization Methods
- Given the energy tables, and a folding, the free
energy can be calculated for a structure
54Calculating Best Structure
- The sequence is compared against itself using a dynamic programming approach
- Similar to finding the maximum base-paired structure
- Instead of counting base pairs, the score is based upon the free energy values
- Gaps represent some form of a loop
- The most widely used software that incorporates this minimum free energy algorithm is MFOLD.
55Free Energy Minimization RNA Structure
Prediction
- http://www.bioinfo.rpi.edu/zukerm/Bio-5495/RNAfold-html/
56Calculating Best Structure
- The most widely used software incorporating the minimum free energy algorithm is MFOLD
- http://www.bioinfo.rpi.edu/applications/mfold/
- http://www.bioinfo.rpi.edu/applications/mfold/old/rna/
57Example Sequence
- GCTTACGACCATATCACGTTGAATGCACGC
- CATCCCGTCCGATCTGGCAAGTTAAGCAAC
- GTTGAGTCCAGTTAGTACTTGGATCGGAGA
- CGGCCTGGGAATCCTGGATGTTGTAAGCT
58MFOLD Energy Dot Plot
59Optimal Structure
60Suboptimal Folds
- The correct structure is not necessarily the structure with optimal free energy
- It is often within a certain threshold of the calculated minimum energy
- MFOLD was updated to report suboptimal folds
61Comparison of Methods
62Inferring Structure By Comparative Sequence
Analysis
- The first step is to calculate a multiple sequence alignment
- Requires sequences to be similar enough that they can be initially aligned
- Sequences should be dissimilar enough for covarying substitutions to be detected
63Mutual Information
- f_{x_i}: frequency of a base in column i
- f_{x_i x_j}: joint (pairwise) frequency of a base pair between columns i and j
- Mutual information ranges between 0 and 2 bits
- If i and j are uncorrelated, mutual information is 0
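With these frequencies, the mutual information between columns i and j is usually written as M_ij = sum over (x_i, x_j) of f_{x_i x_j} * log2( f_{x_i x_j} / (f_{x_i} * f_{x_j}) ), as in Durbin et al. The sketch below computes it for two alignment columns given as strings. The function name and the toy columns are assumptions made for illustration, and gaps are not handled.

```python
from collections import Counter
from math import log2

def mutual_information(column_i, column_j):
    """Mutual information (in bits) between two alignment columns of equal length."""
    n = len(column_i)
    f_i = Counter(column_i)                    # single-column base counts
    f_j = Counter(column_j)
    f_ij = Counter(zip(column_i, column_j))    # joint counts of observed base pairs
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi

# Perfect covariation with four equally frequent pairs: 2 bits.
print(mutual_information("GCAU", "CGUA"))
# Independent columns: 0 bits.
print(mutual_information("GGCC", "AUAU"))
# A fully conserved pair carries no covariation signal: 0 bits.
print(mutual_information("GGGG", "CCCC"))
```

The last case shows why covariation analysis needs sequences that differ: a perfectly conserved base pair scores 0.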
64Mutual Information Plot
65Mutual Information Plot
66Frameshifting
- Virology. 2005 Feb 20;332(2):498-510
- Programmed ribosomal frameshifting in decoding the SARS-CoV genome.
- Baranov PV, Henderson CM, Anderson CB, Gesteland RF, Atkins JF, Howard MT. Department of Human Genetics, University of Utah, 15 N 2030 E, Room 7410, Salt Lake City, UT 84112-5330, USA.
- Programmed ribosomal frameshifting is an essential mechanism used for the expression of orf1b in coronaviruses. Comparative analysis of the frameshift region reveals a universal shift site U_UUA_AAC, followed by a predicted downstream RNA structure in the form of either a pseudoknot or kissing stem loops. Frameshifting in SARS-CoV has been characterized in cultured mammalian cells using a dual luciferase reporter system and mass spectrometry. Mutagenic analysis of the SARS-CoV shift site and mass spectrometry of an affinity tagged frameshift product confirmed tandem tRNA slippage on the sequence U_UUA_AAC. Analysis of the downstream pseudoknot stimulator of frameshifting in SARS-CoV shows that a proposed RNA secondary structure in loop II and two unpaired nucleotides at the stem I-stem II junction in SARS-CoV are important for frameshift stimulation. These results demonstrate key sequences required for efficient frameshifting, and the utility of mass spectrometry to study ribosomal frameshifting.
67Frameshifting
- RNA-struct-frameshift.pdf
- frameshifts.pdf
- hepatitisC-frameshift.pdf
68Covariance Models
- 7 approaches to locate covarying sites are offered in Mount, p. 225
- The key to covariance is mutual information content
- Mutual information content can be plotted on a motif logo
69Mutual Information
- Image source http://www.cbs.dtu.dk/gorodkin/appl/slogo.html
70Covariance Models
- A formal covariance model, COVE, was devised by Eddy and Durbin
- Provides very accurate results
- Extremely slow and unsuitable for searching large genomes
71SCFGs
- Stochastic Context Free Grammars (SCFGs) have also been used to model RNA secondary structure
- Examples
- tRNAscan-SE
- program created to find snoRNAs
- Grammars are created by using a training set of data, and then the grammars are applied to potential sequences to see if they fit into the language
72SCFGs
- SCFGs allow the detection of sequences belonging to a family
- tRNAs
- group I introns
- snoRNAs
- snRNAs
73SCFGs
- Base-paired columns are modeled by pairwise emitting nonterminals
- aWu | aWa | aWc | aWg | ...
- Single-stranded columns are modeled by leftwise emitting nonterminals (when possible)
- aW | cW | gW | uW | ...
74SCFGs
- Any RNA structure can be reduced to a SCFG (see
Durbin, et al., p 278-279)
75Transformational Grammars
- First described by linguist Noam Chomsky in the 1950s.
- (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)
76Transformational Grammars
- Very important in computer science, most notably in compiler design
- Covered in detail in compiler and automata classes
77Transformational Grammars
- Idea: take a set of outputs (sentence, RNA structure) and determine if it can be produced using a set of rules
- Grammars consist of a set of symbols and production rules
- The symbols can be terminal (emitting) symbols or non-terminal symbols
78Grammar for Palindromes
- Consider palindromic DNA sequences
- Five possible terminal symbols: a, c, g, t, ε (ε represents the blank terminal symbol)
79Grammar for Palindromes
- Production rules, where S and W are non-terminal symbols
- S → W
- W → aWa | cWc | gWg | tWt
- W → a | c | g | t | ε
80Derivation of Sequences
- Using these production rules, a derivation of the palindromic sequence acttgttca follows
- S → W → aWa → acWca → actWtca → acttWttca → acttgttca
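The derivation can be reproduced mechanically by applying W → xWx from the outside in and finishing with W → x or W → ε. The sketch below is a hypothetical helper (`derive_palindrome` is not from the lecture) that returns the list of sentential forms, or None when the input is not a palindrome and so is not in the language.

```python
def derive_palindrome(seq):
    """Return the derivation steps for seq under the palindrome grammar, or None."""
    if seq != seq[::-1]:
        return None                      # not generated by this grammar
    steps = ["S", "W"]                   # S -> W
    half = len(seq) // 2
    for k in range(1, half + 1):
        # W -> xWx: emit one matching symbol on each side of W.
        steps.append(seq[:k] + "W" + seq[len(seq) - k:])
    # W -> x for odd length (centre symbol), W -> epsilon for even length.
    steps.append(seq)
    return steps

print(" → ".join(derive_palindrome("acttgttca")))
# S → W → aWa → acWca → actWtca → acttWttca → acttgttca
```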
81Parse Trees
- A context-free grammar can be aligned to a sequence using a parse tree
- The root of the tree is the non-terminal start symbol, S
- Leaves are terminal symbols
- Internal nodes are the nonterminals
- Leaves can be read from left to right to view the results of production
82Parse Tree
83RNA Structure SCFG
- S → W
- W → WW (bifurcation)
- W → aWu | cWg | gWc | uWa (stems)
- W → gWu | uWg
- W → aW | cW | gW | uW (bulges)
- W → Wa | Wc | Wg | Wu (bulges)
- W → a | c | g | u | ε
84Example of SCFG
- A derivation for the structure that MFOLD produces for this sequence can be constructed (5' to 3')
- GCUUACGACCAUAUCACGUUGAAUGCACGCCAUCCCGUCCGAUCUGGCAAGUUAAGCAACGUUGAGUCCAGUUAGUACUUGGAUCGGAGACGGCCUGGGAAUCCUGGAUGUUGUAAGCU
85Example Construction
- S →
- W →
- Wu →
- gWcu →
- gcWgcu →
- gcuWagcu →
- gcuuWaagcu →
- gcuuaWuaagcu →
- gcuuacWguaagcu →
- gcuuacgWuguaagcu →
- gcuuacgaWuuguaagcu →
- gcuuacgacWguuguaagcu →
- gcuuacgaccWguuguaagcu →
- gcuuacgaccaWguuguaagcu → ...
86Other Programs
- RNA Movies
- http://bibiserv.techfak.uni-bielefeld.de/rnamovies/
- (Visualization of RNA secondary structure)
- RNA LOGOS
- http://www.cbs.dtu.dk/gorodkin/appl/slogo.html