Title: Prediction of RNA
1Prediction of RNA secondary structure
- Features of RNA secondary structure, presentations
- Dot matrix analysis, Global Energy Minimization Methods, Mfold, VIENNA RNA package
- Dynamic Programming methods and Sequence Covariation methods
Computational analysis of microRNAs
- Small RNA world
- Computational identification of miRNAs
- Prediction of miRNA targets
2Prediction of RNA secondary structure
3RNA follows the same basic rules of base-pairing as DNA, but short single-stranded RNA molecules can take a variety of 3D shapes (tRNA, ribozymes, splicing etc.)
- Information for self-assembly
- the genetic code specifying the order of amino acids
- control of the beginning and ends of coding sequences
- splicing signals
- determination of the stability and its relative transcriptional level
- regulation of gene expression
http://www.rnabase.org
4- What is RNA secondary structure?
- RNA secondary structure is similar to an alignment of protein and nucleic acid sequences, except that the sequence folds back on itself and complementary bases pair rather than identical or similar bases.
- Also, an alignment of 2 or more biosequences is a statement about an inferred evolutionary history. In contrast, with RNA it is not necessarily the sequence but the conservation of structure that is most important.
5- Main Points
- RNA structure is dynamic in solution, i.e. constantly fluctuating between different folded states
- There are many alternative structures that are nearly identical in energy (both predicted and actual)
- Highly sensitive to solution conditions, e.g. salt and temperature
- Highly sensitive to protein binding
- Tertiary structure (e.g. pseudoknots) is important
- Biologically important structure may not have lowest predicted free energy, but it should be one of the lower ones, so sub-optimal structures must be examined
- Three dimensional structure difficult to determine due to flexibility of molecule
- Most analysis of correctness must therefore rely on phylogenetically determined models
- Phylogenetic models look for invariant base pairs, but may not identify all unique structures
- Structural information also comes from nuclease digestion studies and sometimes crosslinking
6The complementary bases, C-G and A-U form stable
base pairs with each other through the creation
of hydrogen bonds between donor and acceptor
sites on the bases. These are called
Watson-Crick (W-C) base pairs. In addition, we
consider the weaker G-U wobble pair, where the
bases bond in a skewed fashion. All of these are
called canonical base pairs. Other base pairs
occur, some of which are stable. They are called
non-canonical base pairs.
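The pairing rules above can be written down as a small lookup. This is only an illustrative sketch: the function names `classify_pair` and `is_canonical` are made up for this example and do not come from any package mentioned in these slides.

```python
# Canonical pairs as described on this slide: Watson-Crick (C-G, A-U)
# plus the weaker G-U wobble pair. Everything else is non-canonical.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_pair(x, y):
    """Return 'watson-crick', 'wobble' or 'non-canonical' for bases x and y."""
    pair = (x.upper(), y.upper())
    if pair in WATSON_CRICK:
        return "watson-crick"
    if pair in WOBBLE:
        return "wobble"
    return "non-canonical"


def is_canonical(x, y):
    """True for Watson-Crick and wobble pairs (hypothetical helper)."""
    return classify_pair(x, y) != "non-canonical"


if __name__ == "__main__":
    for x, y in [("G", "C"), ("A", "U"), ("G", "U"), ("A", "G")]:
        print(x, y, classify_pair(x, y))
```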
Most common. Biologically informative. Difficult to compare.
7A computer predicted folding of Bacillus
subtilis RNase P RNA
A circular representation of the B. subtilis
folding.
The nucleotides are stretched out uniformly
along the circumference of a circle and the
base pairs are represented by circular arcs
that link paired bases and meet the circle at
right angles.
The triangular image in the figure is referred to as an RNA structure dot plot.
- Plot sequence vs. reverse complement
- Possible stems run perpendicular to axis of symmetry
8MOUNTAINS
- Less common
- Is used in RNA literature
- Much easier to see similarity than squiggles
- Good for revealing pattern of nested stems
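A mountain representation can be computed from a list of base pairs with a running count. The sketch below assumes 0-based positions and one common convention (the height goes up at the 5' partner of a pair and down at the 3' partner). The function name `mountain` is hypothetical.

```python
def mountain(pairs, n):
    """Return the mountain height at each of the n positions.

    pairs: iterable of (i, j) base pairs with 0-based indices.
    The height rises by one at the opening base of each pair and
    falls by one at the closing base, so nested stems show up as slopes
    and unpaired stretches as plateaus.
    """
    step = [0] * n
    for i, j in pairs:
        lo, hi = min(i, j), max(i, j)
        step[lo] += 1
        step[hi] -= 1
    heights = []
    level = 0
    for s in step:
        level += s
        heights.append(level)
    return heights


if __name__ == "__main__":
    # A 3-pair stem closing a 4-base hairpin loop in a 10-base sequence.
    for k, h in enumerate(mountain([(0, 9), (1, 8), (2, 7)], 10)):
        print(k, "#" * h)
```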
10Single-stranded
Double-stranded
Stem and loop / hairpin loop
Bulge loop
Interior loop
Junctions or multi-loops
Interactions Among Secondary Structures
- Pseudoknots
- Kissing Hairpins
- Hairpin-bulge contacts
11- RNA structure prediction methods
- self-complementary regions (Dot Plot Analysis)
- most energy stable molecules: Base-Pair Maximization, Free Energy Methods
- conserved patterns of base-pairing during evolution: Covariance Models
Energy minimization
- dynamic programming approach
- does not require prior sequence alignment
- requires estimation of energy terms contributing to secondary structure
Comparative sequence analysis
- uses sequence alignment to find conserved residues and covariant base pairs
- most trusted
12Development of RNA prediction
- Tinoco et al., 1971: extrapolation from studies on small molecules
- Pipas and McMahon, 1975: computer programs estimating all possible structures in tRNAs
- Nussinov and Jacobson, 1980: precise and efficient algorithm for structure predictions (two scoring matrices approach)
- Zuker and Stiegler, 1981: dynamic programming algorithm
- Jaeger et al., 1989, 1990; Zuker, 1994: MFOLD
- Set of possible structures within a given energy range
- Indication of reliability
- Uses covariance information
- Wuchty et al., 1999: complete suboptimal folding (the partition function method is due to McCaskill, 1990)
13Self-complementary regions in RNA sequences
Dot matrix method: search for self-complementary regions (long window, many matches)
Possible stems run perpendicular to the axis of symmetry
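A minimal sketch of the dot matrix idea: mark every position pair that could base-pair, then read candidate stems off as runs along the anti-diagonal. This is not the GCG StemLoop or DotPlot code. The function names, the decision to allow G-U pairs, and the `min_len` and `min_loop` parameters are assumptions made for illustration.

```python
# Pairs allowed in the matrix (Watson-Crick plus G-U wobble; an assumption).
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
              ("G", "U"), ("U", "G")}


def dot_matrix(seq):
    """dots[i][j] is True when base i could pair with base j."""
    n = len(seq)
    return [[(seq[i], seq[j]) in COMPLEMENT for j in range(n)]
            for i in range(n)]


def find_stems(seq, min_len=3, min_loop=3):
    """Return (i, j, length) for runs where i pairs j, i+1 pairs j-1, ...

    Runs are reported only from their outermost pair, and a run stops
    before the enclosed loop would drop below min_loop unpaired bases.
    """
    n = len(seq)
    dots = dot_matrix(seq)
    stems = []
    for i in range(n):
        for j in range(i + 1, n):
            if not dots[i][j]:
                continue
            # Skip pairs that merely continue a run started further out.
            if i > 0 and j < n - 1 and dots[i - 1][j + 1]:
                continue
            length = 0
            while (i + length < j - length - min_loop
                   and dots[i + length][j - length]):
                length += 1
            if length >= min_len:
                stems.append((i, j, length))
    return stems


if __name__ == "__main__":
    print(find_stems("GGGAAAACCC"))
```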
14- 1. StemLoop
- StemLoop finds stems (inverted repeats) within a sequence. You specify the minimum stem length, minimum and maximum loop sizes, and the minimum number of bonds per stem. All stems or only the best stems can be displayed on your screen or written into a file.
- 2. DotPlot
- DotPlot makes a dot-plot with the output file from Compare or StemLoop.
- Calculates score over a window
- Finds stems over a threshold score
- Minimum/maximum loop size
- Sort by position or score
15Inverted repeats only
16RNA Folding by Energy Minimization
The quickest and easiest route to RNA structure prediction is through the use of simple energy rules. One way is to assign an energy to each base pair in a secondary structure. That is, there is a function e such that e(r_i, r_j) is the energy of a base pair between bases r_i and r_j. The energy, E(S), of the entire structure S is then given by
$$E(S) = \sum_{(i,j) \in S} e(r_i, r_j)$$
Reasonable values of e at 37 °C are -3, -2 and -1 kcal/mole for GC, AU and GU base pairs, respectively. Unfortunately, such simple-minded rules are insufficient to capture the destabilizing effects of various loops, or the nearest-neighbor interactions in helices and loops. More sophistication is required.
- The energy associated with any position in the structure is influenced only by local sequence and structure
- The structure is assumed to be formed by folding that does not produce knots
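The simple rule on this slide (one energy per base pair, summed over the structure) takes only a few lines. The sketch uses the -3, -2 and -1 kcal/mole values quoted above. The function name `structure_energy` and the 0-based pair list are assumptions of this example.

```python
# Per-pair energies in kcal/mole, as quoted on the slide.
PAIR_ENERGY = {
    frozenset("GC"): -3,
    frozenset("AU"): -2,
    frozenset("GU"): -1,
}


def structure_energy(seq, pairs):
    """Sum e(r_i, r_j) over all base pairs (i, j) of a structure.

    seq:   RNA sequence as a string
    pairs: iterable of 0-based (i, j) index pairs
    Raises ValueError if a listed pair is not GC, AU or GU.
    """
    total = 0
    for i, j in pairs:
        key = frozenset((seq[i], seq[j]))
        if key not in PAIR_ENERGY:
            raise ValueError(f"{seq[i]}-{seq[j]} at ({i}, {j}) cannot pair")
        total += PAIR_ENERGY[key]
    return total


if __name__ == "__main__":
    # Two G-C pairs and one U-G wobble pair.
    print(structure_energy("GGUAAAAGCC", [(0, 9), (1, 8), (2, 7)]))
```

Because loops contribute nothing here, this score always favors more pairs, which is exactly the shortcoming the slide points out.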
17GLOBAL Energy-Minimization Methods (Minimum Free Energy, Maximum Enthalpy)
- Stabilization energy (kcal/mole)
- Total free energy optimality criterion
- Boltzmann function-based optimality criteria
- Loops/bulges introduce positive free energy and are destabilizing.
How is Stability Measured?
(A) Base pairs: stability introduced by double-stranded regions (dsRNA). Energies of paired bases are stored in a look-up table; these vary with temperature. The energies of all base pairs in a structure are summed; this sum is the cost of the structure.
(B) Stacking energies (DH Turner, Rochester): stacking energies are energies added by surrounding bases.
(C) Loops (loop destabilizing energies): instability introduced by single-stranded (unpaired) loops, all enthalpic.
(D) Branches: multibranches.
18- each base is compared to every other base (similar to dot matrix)
- energy is estimated by nearest-neighbor rule
- complementary regions are evaluated by dynamic programming algorithm
Energies are determined empirically
Energy scoring, base pairing (kcal/mol)
G-C  -3
A-U  -2
G-U  -1
Energy scoring, loop penalties (kcal/mol)
Size  Internal  Bulge  Hairpin
1     n/a       3.9    n/a
2     4.1       3.1    n/a
3     4.5       3.5    4.5
4     4.9       4.2    5.5
5     5.3       4.8    4.9
Stacking energies for base pairs
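To see how the loop penalties offset the base-pair energies, here is a rough sketch that scores a single stem-loop. It assumes the flattened first table row reads as size 1 (bulge 3.9 only) followed by size 2 (internal 4.1, bulge 3.1). It also ignores stacking, so it is not the nearest-neighbor model used by Mfold. The function name `hairpin_energy` is hypothetical.

```python
# Base-pair energies (kcal/mol) from the slide.
PAIR_ENERGY = {frozenset("GC"): -3, frozenset("AU"): -2, frozenset("GU"): -1}

# Loop penalties (kcal/mol) by loop type and size, as read from the table.
LOOP_PENALTY = {
    "internal": {2: 4.1, 3: 4.5, 4: 4.9, 5: 5.3},
    "bulge": {1: 3.9, 2: 3.1, 3: 3.5, 4: 4.2, 5: 4.8},
    "hairpin": {3: 4.5, 4: 5.5, 5: 4.9},
}


def hairpin_energy(seq, stem_pairs):
    """Pair energies of one uninterrupted stem plus its hairpin penalty.

    stem_pairs: 0-based (i, j) pairs with i < j; the pair with the
    largest i is taken as the one closing the hairpin loop.
    Loop sizes outside the table raise KeyError.
    """
    pair_sum = sum(PAIR_ENERGY[frozenset((seq[i], seq[j]))]
                   for i, j in stem_pairs)
    inner_i, inner_j = max(stem_pairs, key=lambda p: p[0])
    loop_size = inner_j - inner_i - 1
    return pair_sum + LOOP_PENALTY["hairpin"][loop_size]


if __name__ == "__main__":
    # Three G-C pairs (-9) closing a 4-base hairpin loop (+5.5).
    print(hairpin_energy("GGGAAAACCC", [(0, 9), (1, 8), (2, 7)]))
```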
19Base-pair stacking
- Favorable energies come from base-pair stacking, NOT from formation of base pairs
- Unpaired bases make hydrogen bonds with water, therefore there is no net change when they pair
- Favorable interactions come from electronic interactions between stacked bases
- Base-pair stacking is the ONLY favorable energy term in RNA folding
20Base comparisons
Free energy calculations
Stacking energies for base pairs (kcal/mole, 37 °C)
21Matches/Mismatches
Some favorable energy is gained even without hydrogen bonding, due to stacking, for instance for a mismatch next to an AU pair:
5' AX 3'
3' UY 5'
22Algorithm
- RNA folding is implicitly an N^4 algorithm
- N^2 dynamic programming to find the stems
- N^2 dynamic programming to find the best combination
- Zuker algorithm is N^3 due to approximations in searching for lopsided internal loops
- Note that very asymmetric internal loops will not be found with the default settings
24- Dynamic Programming Methods
- Use trace-back methods
- Applied by Zuker & Stiegler using energetics as the criterion
- StemLoop program calculates the optimal energies of local stems and loops independently based on inverted repeats (it ignores internal loops and bulges, but Mfold does not).
[Figure: energy matrix W, indexed by i (rows) and j (columns) from 1 to n; cell (i, j) is shown together with cell (i+1, j-1).]
25Zuker Algorithm
- Calculation proceeds from center towards edges
- Includes stacking, bulge, internal, and hairpin loop terms
- Start from center because the center line is location of hairpin loops
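The matrix fill and trace-back can be illustrated with a deliberately simplified recursion. The sketch below is a Nussinov-style dynamic program that uses only the -3/-2/-1 pair energies. It has none of the stacking, bulge, internal or hairpin loop terms of the real Zuker algorithm, apart from an assumed minimum hairpin loop of 3 bases. The names `fold`, `W` and `MIN_LOOP` are choices made for this example.

```python
PAIR_ENERGY = {frozenset("GC"): -3, frozenset("AU"): -2, frozenset("GU"): -1}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in a hairpin


def fold(seq):
    """Return (minimum energy, sorted base pairs) under pair-only energies.

    W[i][j] holds the best energy of subsequence i..j. Cells are filled by
    increasing span, i.e. from the main diagonal outwards.
    """
    n = len(seq)
    if n == 0:
        return 0, []
    W = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = min(W[i + 1][j], W[i][j - 1])      # i or j unpaired
            e = PAIR_ENERGY.get(frozenset((seq[i], seq[j])))
            if e is not None:                          # i pairs with j
                best = min(best, W[i + 1][j - 1] + e)
            for k in range(i + 1, j):                  # two sub-structures
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    # Trace back one optimal structure.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue
        e = PAIR_ENERGY.get(frozenset((seq[i], seq[j])))
        if W[i][j] == W[i + 1][j]:
            stack.append((i + 1, j))
        elif W[i][j] == W[i][j - 1]:
            stack.append((i, j - 1))
        elif e is not None and W[i][j] == W[i + 1][j - 1] + e:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if W[i][j] == W[i][k] + W[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return W[0][n - 1], sorted(pairs)


if __name__ == "__main__":
    print(fold("GGGAAAACCC"))
```

The inner loop over k makes this sketch cubic in sequence length. Only one optimal structure is traced back, even when several have the same energy.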
Limited number of alternative structures!
Vienna RNA Package http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
26ENERGY DOT PLOT
-alternative choices
Which regions are more/less predictive?
27Reliability of secondary structure prediction
- Pnum(i): total number of energy dots in the i-th row and i-th column of the energy dot plot. It represents the number of base pairs that the i-th base can form with all other bases within the defined energy range. The lower this value, the more well defined the local structure.
- Hnum(i, j): the sum of Pnum(i) and Pnum(j) less 1, i.e. the total number of dots in the i-th row and j-th column. The lower this number, the more well determined the double-stranded region.
- Ssum(i): the number of foldings in which base i is single-stranded divided by m, the number of foldings. It represents the probability that base i is single-stranded: 1 = probably single-stranded, 0 = probably not.
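The three reliability measures can be computed directly from a set of predicted foldings. The sketch below assumes each folding is given as a list of 0-based (i, j) pairs and treats the union of all pairs as the dots of the energy dot plot. Mfold's own bookkeeping may differ in detail. The function names are hypothetical.

```python
def pnum_and_ssum(foldings, n):
    """Return (pnum, ssum) lists of length n.

    foldings: non-empty list of foldings, each a list of (i, j) pairs.
    pnum[i]: number of distinct partners base i has across all foldings.
    ssum[i]: fraction of foldings in which base i is unpaired.
    """
    dots = set()
    for pairs in foldings:
        for i, j in pairs:
            dots.add((min(i, j), max(i, j)))
    pnum = [0] * n
    for i, j in dots:
        pnum[i] += 1
        pnum[j] += 1
    m = len(foldings)
    ssum = []
    for k in range(n):
        unpaired = sum(1 for pairs in foldings
                       if all(k not in p for p in pairs))
        ssum.append(unpaired / m)
    return pnum, ssum


def hnum(pnum, i, j):
    """Hnum for the pair (i, j): Pnum(i) + Pnum(j) - 1."""
    return pnum[i] + pnum[j] - 1


if __name__ == "__main__":
    foldings = [[(0, 9), (1, 8), (2, 7)], [(0, 8), (1, 7)]]
    pnum, ssum = pnum_and_ssum(foldings, 10)
    print(pnum, ssum, hnum(pnum, 0, 9))
```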
28MFold
MFold and PlotFold predict optimal and suboptimal secondary structures for an RNA or DNA molecule using the most recent energy minimization method of Zuker.
MFold calculates energy matrices that
determine all optimal and suboptimal
secondary structures for an RNA or DNA
molecule. The program writes these energy
matrices to an output file. A companion
program, PlotFold, reads this output file and
displays a representative set of optimal and
suboptimal secondary structures for the
molecule within any increment of the computed
minimum free energy you choose. You can
choose any of several different graphic
representations for displaying the secondary
structures in PlotFold.
P-Num Plot
This plot shows the amount of variability in
pairing at each position in the sequence in all
predicted foldings within the increment of the
optimal folding energy you specify.
Squiggles Plot
The squiggles plot is a representation
similar to what you might draw by hand
that is, bonds formed between bases are
drawn as chords. Bases are shown participating in
stems, as well as in hairpin, bulge,
interior, and multibranched loops.
29Lower left to upper right diagonals: free energy encoded by colors (dark is most optimal). Note that some short-cut algorithms will not explore all possible structures but instead will ignore the 'blank' areas in the biplot.
Once structures are predicted they can be compared using a Structure Dot Plot.
Structure plots summarize the commonalities between two predicted structures (in this case the top two structures).
http://www.bioinfo.rpi.edu/zukerm/rna/
30- LIMITATION
- does not compute all the structures within a given energy range of the minimum free-energy structure
31Vienna RNA Package 1.4
http://www.tbi.univie.ac.at/ivo/RNA/
- three kinds of dynamic programming algorithms for structure prediction:
- 1. the minimum free energy algorithm of Zuker & Stiegler (1981), which yields a single optimal structure
- 2. the partition function algorithm of McCaskill (1990), which calculates base pair probabilities in the thermodynamic ensemble
- 3. the suboptimal folding algorithm of Wuchty et al. (1999), which generates all suboptimal structures within a given energy range of the optimal energy
- For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990).
- Finally, an algorithm is provided to design sequences with a predefined structure (inverse folding).
RNAfold -- predict minimum energy secondary structures and pair probabilities
RNAeval -- evaluate energy of RNA secondary structures
RNAheat -- calculate the specific heat (melting curve) of an RNA sequence
RNAinverse -- inverse fold (design) sequences with predefined structure
RNAdistance -- compare secondary structures
RNApdist -- compare base pair probabilities
RNAsubopt -- complete suboptimal folding
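To make "base pair probabilities in the thermodynamic ensemble" concrete, the sketch below Boltzmann-weights a small, explicitly listed set of structures. It is not McCaskill's algorithm, which sums over all structures by dynamic programming, and it does not call the Vienna RNA Package. The example structures and energies are invented for illustration, and the function names are hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def boltzmann_probabilities(energies, temperature_c=37.0):
    """Probability of each structure, p_s = exp(-E_s / kT) / Z.

    energies: free energies in kcal/mol, one per listed structure.
    Energies are shifted by their minimum before exponentiating,
    which leaves the probabilities unchanged but avoids overflow.
    """
    kt = GAS_CONSTANT * (temperature_c + 273.15)
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def pair_probability(structures, energies, pair, temperature_c=37.0):
    """Summed probability of the listed structures that contain `pair`."""
    probs = boltzmann_probabilities(energies, temperature_c)
    return sum(p for s, p in zip(structures, probs) if pair in s)


if __name__ == "__main__":
    # Invented example: two alternative foldings and the open chain.
    structures = [{(0, 9), (1, 8), (2, 7)}, {(0, 8), (1, 7)}, set()]
    energies = [-3.5, -2.0, 0.0]
    print(boltzmann_probabilities(energies))
    print(pair_probability(structures, energies, (0, 9)))
```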
32Minimum free energy structure and base pair
probabilities for the Sarcin loop of 23S
ribosomal RNA, as predicted by the RNAfold
program.
33Evaluation
- Biological RNAs (with important structure) are difficult to distinguish from random RNAs
- Same number and length of stems and loops
- Same GC content of stems
- Same predicted free energy
- Biologically important structures are exceptional in lacking competing structures; this ensures that the structure will be present regardless of the net ΔG
- PNUM plot shows number of alternative structures within energy increment
- Agrees well with phylogenetic predictions, but most effective for large molecules
34- Sequence Covariation Methods (non-independent changes)
- determined by comparing sequences among species. Joint substitutions that are seen may reflect sites paired in the structure. Improves structure prediction by Dynamic Programming Methods
- for double-stranded regions in RNA molecules, sequence changes that take place in evolution should maintain the base pairing
- sequence changes in loops and single-stranded regions should not have such a constraint
- You are looking for sequence positions at which covariation maintains the base-pairing properties
Seq 1 ----------------G-------------C---------
Seq 2 ----------------C-------------G---------
Seq 3 ----------------A-------------C---------
Seq 4 ----------------A-------------T---------
Candidate pairs between the two columns: AT, CG, AC, GC (?)
http://www.rna.icmb.utexas.edu/
35Covariance: secondary structure prediction in RNA takes into account conserved patterns of base pairing
- Positions of covariance are conserved matches, since they maintain the secondary structure
- computationally challenging
36Eddy & Durbin (1994): formal covariance model
- LIMITATION
- slow
- unsuitable for searching through large genomes
- usually use information from already existing RNA secondary structure
- How to discover this information?
- Construct a more general model
- Train the model
- Discover the most likely base-paired regions
Similarity with HMMs
Mutual information content M superimposed on the information content of each sequence position in an RNA alignment
http://www.cbs.dtu.dk/~gorodkin/appl/slogo.html
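Mutual information between two alignment columns can be sketched with the standard definition, M(i, j) = sum over (x, y) of f_ij(x, y) * log2( f_ij(x, y) / (f_i(x) * f_j(y)) ). The toy alignment is invented, the function name is hypothetical, and no gap handling or small-sample correction is attempted.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j of an alignment.

    alignment: list of equal-length aligned sequences (strings).
    Frequencies are plain observed counts divided by the number of rows.
    """
    n = len(alignment)
    col_i = Counter(s[i] for s in alignment)
    col_j = Counter(s[j] for s in alignment)
    joint = Counter((s[i], s[j]) for s in alignment)
    mi = 0.0
    for (x, y), count in joint.items():
        f_xy = count / n
        mi += f_xy * math.log2(f_xy / ((col_i[x] / n) * (col_j[y] / n)))
    return mi


if __name__ == "__main__":
    # Columns 0 and 2 covary perfectly; column 1 is fully conserved.
    toy = ["GAC", "CAG", "AAU", "UAA"]
    print(mutual_information(toy, 0, 2))  # covarying columns
    print(mutual_information(toy, 0, 1))  # conserved column carries none
```

High mutual information only says that two columns change together. Whether the changes preserve base pairing still has to be checked separately.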
38Phylogeny based prediction
- Inference of structure from covariance or mutual information depends on having the correct alignment
- Correct alignment depends on knowing the correct structures
- Can only find common structures, not structures unique to a molecule
- Can, in principle, detect pseudoknots
42Interaction among base pairs versus
Context-free grammar
SCFG
Stochastic context-free grammars
Language
- Terminal symbols: A, C, G, U
- Nonterminal symbols: S0, S1, S2, S3, ...
COVE is an implementation of stochastic context-free grammar methods for RNA sequence/structure analysis
43CAUCAGGGAAGAUCUCUUG
44RNA world
http://www.imb-jena.de/RNA.html
Tutorials
http://www.ambion.com/techlib/resources/linkspage.html
RNA Secondary Structure Prediction at Belozersky Institute, Moscow
http://www.genebee.msu.su/services/rna2_reduced.html
45RNA-specifying genes
- tRNAscan-SE
- identifies transfer RNA genes in genomic DNA or RNA sequences. It combines:
- the specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin, 1994)
- the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991)
- an implementation of an algorithm described by Pavesi and colleagues (1994) which searches for eukaryotic pol III tRNA promoters (our implementation referred to as EufindtRNA).
- tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction. In this way, tRNAscan-SE attains the best of both worlds:
- a false positive rate of less than one per 15 billion nucleotides of random sequence
- the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs)
- search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3 which gives a 650-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).
- published in Lowe & Eddy, Nucleic Acids Research 25: 955-964 (1997).
http://lowelab.ucsc.edu/tRNAscan-SE/
NCBI CP000030
46Automatic detection of conserved RNA structure elements in complete RNA virus genomes
Nucleic Acids Research, 1998, Vol. 26, No. 16
A new method for detecting conserved RNA secondary structures in a family of related RNA sequences. The method is based on a combination of thermodynamic structure prediction and phylogenetic comparison. In contrast to purely phylogenetic methods, our algorithm can be used for small data sets of 10 sequences, efficiently exploiting the information contained in the sequence variability.
- (i) Distant groups of RNA viruses have very little or no detectable sequence homology and often very different genomic organization.
- (ii) RNA viruses show an extremely high mutation rate, of the order of 10^-5 to 10^-3 mutations per nucleotide and replication.
- Due to the high sequence variation, the application of classical methods of sequence analysis is, therefore, difficult or outright impossible.
- The high mutation rate of RNA viruses also explains their short genomes, of less than 20 000 nt. A large number of complete genomic sequences is available in databases. The non-coding regions are most likely functionally important, since the high selection pressure acting on viral replication rates makes 'junk RNA' very unlikely.
RNA secondary structures are predicted as minimum energy structures by means of dynamic programming techniques. An efficient implementation of this algorithm is part of the Vienna RNA Package.
47Sequences are aligned using a standard multiple
alignment procedure. Secondary structures for
each sequence are predicted and gaps are
inserted based on the sequence alignment. The
resulting aligned structures can be represented
as aligned mountain plots. From the aligned
structures consistently predicted base pairs are
identified. The alignment is used to identify
compensatory mutations that support base pairs
and inconsistent mutants that contradict pairs.
This information is used to rank proposed base
pairs by their credibility and to filter the
original list of predicted pairs.
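The filtering step can be sketched as follows: for a proposed pair of alignment columns, count how many different pair types support it and how many sequences contradict it. This is a simplified reading of the procedure, not the published implementation. The ranking key, the treatment of gaps as contradictions, and the function names are assumptions of this example.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_support(alignment, i, j):
    """Return (number of distinct pair types, number of inconsistent rows).

    More than one pair type means compensatory or consistent mutations
    support the pair. A row is inconsistent when its two bases cannot
    form a canonical pair (gap characters count as inconsistent here).
    """
    types = set()
    inconsistent = 0
    for seq in alignment:
        pair = seq[i] + seq[j]
        if pair in CANONICAL:
            types.add(pair)
        else:
            inconsistent += 1
    return len(types), inconsistent


def rank_pairs(alignment, candidates):
    """Order candidate (i, j) pairs: fewest contradictions first,
    then most supporting pair types."""
    def key(pair):
        n_types, n_bad = pair_support(alignment, *pair)
        return (n_bad, -n_types)
    return sorted(candidates, key=key)


if __name__ == "__main__":
    toy = ["GAAAC", "CAAAG", "AAAAU", "GAAAA"]
    print(pair_support(toy, 0, 4))
    print(rank_pairs(toy, [(1, 3), (0, 4)]))
```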
48Aligned mountain representations m(k) of the RNA
secondary structure of 13 complete HCV genomes.
Peaks and plateaux in the mountain
representation correspond to hairpins and
unpaired regions in the secondary structure.
Colors indicate the number of consistent
mutations: red 1, yellow 2 and green 3 different
types of base pairs. These saturated colors
indicate that there are only compatible
sequences. Decreasing saturation of the colors
indicates an increasing number of non-compatible
sequences
Comparison of predicted minimum energy structures
in region A (around position 8000) of the HCV
genome. The lower left part of the plot shows a
conventional picture of the predicted structure.
Base pairs marked in green have non-consistent
mutations, circles indicate compensatory
mutations. The extended outer stem contains a
number of compensatory mutations supporting its
existence.
49The TAR structure of HIV-1. Almost all predicted
base pairs are consistent with all 13 sequences,
most of them are predicted in at least 11
sequences. A large number of compensatory
mutations supports the thermodynamic
predictions. Our computed consensus structure
(lower left) matches the structure determined by
probing and phylogenetic reconstruction (4). We
display here the consensus dot plot, the
classical secondary structure and a mountain
representation. The latter is a convenient
alternative to dot plots for larger structural
motifs. Base pairs are represented by slabs
connecting the two sequence positions. The width
and color of a slab corresponds to size and color
of the corresponding dot plot entry.
50Consensus structures of the HIV-1 RRE region from
a set of 13 sequences and from the 21 sequences
51Primary Structure of RNA, e.g., Human tRNA gene for Methionine
>gi|1181147|emb|Z69292.1|HSC6TRNAM H.sapiens tRNA-Met gene
GGCCUUUUUUUUCCUUUUUUUUA
AUUUUAUUGAGACAGGGUCUCGCUAUGUUGCCUGCCUGGGUCUUCCA
AAGUGCAGUGACUACAGGGAGCUGAGCCCGGCGCCUAGCCCACCAGUGU
AUUGAUAUUUAUUUUUCUAUC CCUUGUUUUGUUUUCUGUUUGAUUCUG
GUGAUUCCUUUUUCCAAAGUGAGUUGGCAACCUGUGGUAGCCA
GCAAGUAGGCAACUGCUCGUAGGUUUUUUCUUAAAUUACGAGGUAGUCU
GUUCGGCAUCUCCUGUAAGUA GUUAAGAGUACUGUGAGACCGUGUGCU
UGGCAGAACAGCAGAGUGGCGCAGCGGAAGCGUGCUGGGCCCA
UAACCCAGAGGUCGAUGGAUCGAAACCAUCCUCUGCUAGGUCCUUUUUU
UUUUUCCCCCCCCGUCUAUUU UCCUGAGGAUCCCUUUUUUUAAGUUAC
AGUUUUUUAGGUUAAACAAUGACAAAGAAAACAAAAUGAACCC
GAGUAUUUCUUUAAUUCCAGAAUUACAAGCAUUUCCGGGAAAUAAUGUG
AAACUACAAUCUCUGCAUGUA CAAUUUUGAUUUUCAUGGACACCCAAG
UGUCAUUAAUCAAUAUUUCAUCUGUAAACAAAGCAAAUUUCUC
UUGUUUAGAGGCUAUACCACUGUUGCAGCCAGUUAUGACAGUUGUAAGU
UAACCUGCCAAGAAGGAGAAU CGUUACAUAAACUGAGUGCCAAGGGUG
GGGUGGGGUGGGAGCCCAGGAAUGGAGUUUUAUAUCUUUUGAU
ACAUAAUUCAGAAAGCACUAUUUGCCAAGUAGUUAACGCCAUCGAUUAG
GAAUUC
http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
http://lowelab.ucsc.edu/tRNAscan-SE/
6137599 n
L35894
http://www.bioinfo.rpi.edu/zukerm/rna/
http://www.bioinfo.rpi.edu/applications/mfold/old/rna/