Title: RNA secondary structure prediction
1 RNA secondary structure prediction
- Introduction
- Examples of RNA molecules
- Secondary structure elements
- Pseudo-knots
- RNA folding
- Nussinov algorithm
- Energy minimization
- Covariance analysis
- RNA secondary structure motifs
- Examples, biological function
- RNA secondary structure patterns
2 Basics about RNA (for computer scientists)
- RNA is initially synthesized as a co-linear copy of DNA
- U replaces T (however, U is represented as T in nucleotide database entries)
- RNA may undergo splicing and other post-transcriptional modifications
- Two major RNA classes in cellular organisms:
  - messenger RNA (mRNA): templates for protein synthesis
  - structural and catalytic RNAs
- The genome of many viruses (e.g. HIV) consists of RNA
- RNA is usually single-stranded (exception: a few viral genomes)
- RNA folds back onto itself to form short base-paired regions
- As in DNA, base-paired regions form anti-parallel helices
- Same base-pairing rules as for DNA, but U-G pairs are also permitted
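The pairing rules above can be written as a small lookup. This is a minimal Python sketch; the names `PAIRS` and `can_pair` are illustrative and not taken from any library.

```python
# Allowed RNA base pairs: Watson-Crick (A-U, G-C) plus the U-G wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x: str, y: str) -> bool:
    """Return True if bases x and y may form a base pair.

    T is treated as U, because database entries often store RNA with T.
    """
    x = x.upper().replace("T", "U")
    y = y.upper().replace("T", "U")
    return (x, y) in PAIRS


print(can_pair("G", "U"))  # True
print(can_pair("A", "G"))  # False
```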
3 Examples of structural and/or catalytic RNAs
- ribosomal RNA (rRNA)
- transfer RNA (tRNA)
- small nuclear RNA (snRNA, e.g. U1)
- small nucleolar RNA (snoRNA)
- small cytoplasmic RNA (scRNA, e.g. 7SL-RNA)
- microRNA (miRNA)
4 RNA secondary structure elements: Terminology
5 Purpose of RNA folding algorithms
- Prediction of the native secondary structure of an RNA molecule
  - Formally, the secondary structure of an RNA consists of all pairs of bases that interact with each other, usually through standard Watson-Crick base pairs.
- Recognition of RNA functional motifs
  - RNA molecules may contain regulatory motifs that interact with RNA-binding proteins
  - Such motifs may have a conserved secondary structure in addition to conserved primary structure elements.
6 Pseudo-knots
Pseudo-knots cause problems for ordinary RNA folding algorithms.
Pseudo-knots imply an arrangement of pairs of interacting base pairs of the type a ... b ... a' ... b', where a pairs with a' and b pairs with b'.
Such structures require intersecting lines in the following type of representation:
U U C C G A A G C U C A A C G G G A A A A U G A G C U
[Figure: lines connecting the paired bases of this sequence]
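The crossing condition can be checked mechanically. Two base pairs (i, j) and (k, l) cross when i < k < j < l. The sketch below assumes base pairs are given as tuples of positions; the function name `has_pseudoknot` is made up for this example.

```python
from itertools import combinations


def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalise each pair so that the smaller position comes first.
    ordered = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        # After sorting, i <= k always holds, so one test is enough.
        if i < k < j < l:
            return True
    return False


print(has_pseudoknot([(0, 9), (1, 8), (2, 7)]))  # nested helix: False
print(has_pseudoknot([(0, 5), (3, 8)]))          # crossing pairs: True
```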
7 RNA secondary structure notation
- RNA secondary structures can be specified by a sequence of the three symbols -, >, <.
- Base pairs can be reconstructed as follows:
  - process the sequence from left to right
  - if a base is marked -, leave it unpaired
  - if a base is marked >, wait
  - if a base is marked <, connect it to the closest unpaired base marked > on the left side
AAGACUUCGGAUCUGGCGACACCC
-->>>----<-<<->>->---<<<
Note: works only if no pseudoknots occur.
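The reconstruction rule is a classic stack procedure: each > is pushed, and each < pops the most recent unpaired >. A minimal sketch follows; `parse_structure` is a name chosen for this example and positions are 0-based.

```python
def parse_structure(structure: str):
    """Convert a string of '-', '>' and '<' into a sorted list of (i, j) pairs."""
    waiting = []  # positions of '>' that have no partner yet
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == ">":
            waiting.append(pos)
        elif symbol == "<":
            if not waiting:
                raise ValueError(f"unmatched '<' at position {pos}")
            # Closest unpaired '>' on the left is the top of the stack.
            pairs.append((waiting.pop(), pos))
        elif symbol != "-":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if waiting:
        raise ValueError(f"unmatched '>' at position {waiting[-1]}")
    return sorted(pairs)


seq = "AAGACUUCGGAUCUGGCGACACCC"
struct = "-->>>----<-<<->>->---<<<"
for i, j in parse_structure(struct):
    # Print 1-based positions together with the paired bases.
    print(i + 1, seq[i], j + 1, seq[j])
```

On the example from the slide this gives two helices of three base pairs each, all of them G-C or A-U pairs.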
8 Nussinov algorithm: Principle
Objective: To find the secondary structure with the maximal number of base pairs under the pseudo-knot exclusion constraint.
Principle: Recursive procedure (dynamic programming algorithm).
Scoring function: sum of base-pair scores, no penalties for loops.
Optimal score: computed from the optimal scores of subsequences.
Filling stage: Scores for subsequences are recursively computed from those of shorter subsequences and recorded in a quadratic table.
Trace-back: Reconstruction of the filling steps indicates the optimal structure.
Time complexity: O(N^3)
Limitations:
- No pseudo-knots
- No constraints on loop lengths
- No penalties for bulge loops
- No scoring terms for base-pair stacking interactions (see later)
9 Nussinov algorithm: extension operations
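The extension operations (adding an unpaired base on either side, adding a base pair, joining two substructures) correspond to the four cases of the standard Nussinov recursion. It is written out here as a reminder; the symbol \gamma(i,j) for the best score on subsequence i..j is introduced for this note, and d(i,j) is the base-pair score (1 for an allowed pair, 0 otherwise).

```latex
\gamma(i,j) = \max
\begin{cases}
\gamma(i+1,\,j) & \text{base } i \text{ unpaired} \\
\gamma(i,\,j-1) & \text{base } j \text{ unpaired} \\
\gamma(i+1,\,j-1) + d(i,j) & \text{bases } i \text{ and } j \text{ paired} \\
\max_{i<k<j} \bigl[ \gamma(i,k) + \gamma(k+1,j) \bigr] & \text{joining of two substructures}
\end{cases}
\qquad \gamma(i,i) = \gamma(i,i-1) = 0
```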
10 Nussinov algorithm: fill-stage
Scoring system: d(i,j) = 1 for all RNA Watson-Crick base pairs, including G-U; else d(i,j) = 0.
Blue: addition of unpaired base 3 or 7
Green: addition of paired bases 1,7
Pink: joining of substructures 1..4 and 5..8
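The fill stage can be sketched in a few lines of Python. This is an illustration under the scoring system stated above (score 1 per allowed pair, no loop constraints), not code from any particular package; the names `nussinov_fill` and `nussinov_score` are assumptions of this example, and positions are 0-based.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def d(seq, i, j):
    """Base-pair score: 1 if seq[i] and seq[j] can pair, else 0."""
    return 1 if (seq[i], seq[j]) in PAIRS else 0


def nussinov_fill(seq):
    """Return the N x N table; table[i][j] is the best score on seq[i..j]."""
    n = len(seq)
    table = [[0] * n for _ in range(n)]
    # Fill by increasing subsequence length, so shorter results are ready.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(table[i + 1][j],   # addition of unpaired base i
                       table[i][j - 1])   # addition of unpaired base j
            # Addition of paired bases i, j (inner score is 0 when j == i + 1).
            inner = table[i + 1][j - 1] if span > 1 else 0
            best = max(best, inner + d(seq, i, j))
            # Joining of substructures i..k and k+1..j.
            for k in range(i + 1, j):
                best = max(best, table[i][k] + table[k + 1][j])
            table[i][j] = best
    return table


def nussinov_score(seq):
    """Maximal number of base pairs for the whole sequence."""
    return nussinov_fill(seq)[0][len(seq) - 1] if seq else 0


print(nussinov_score("GGGAAAUCC"))
```

The three nested loops (span, i, k) are where the O(N^3) running time comes from.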
11 Nussinov algorithm: trace-back
current   record   stack
                   (1,9)
1,9                (1,8)
1,8                (1,4) (5,8)
1,4       1,4      (2,3) (5,8)
2,3       2,3      (3,2) (5,8)
3,2                (5,8)
5,8       5,8      (6,7)
6,7       6,7      (7,6)
7,6                (empty)
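A stack-driven trace-back of this kind (pop the current interval, possibly record a pair, push the remaining intervals) can be sketched as follows. The block includes a compact fill step so that it runs on its own. Function names are illustrative and positions are 0-based. The order in which the cases are tested is a free choice, so the result may be a different, equally optimal structure than the one shown on a slide.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill(seq):
    """Nussinov score table g, where g[i][j] is the best score on seq[i..j]."""
    n = len(seq)
    g = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            paired = g[i + 1][j - 1] + (1 if (seq[i], seq[j]) in PAIRS else 0)
            split = max((g[i][k] + g[k + 1][j] for k in range(i + 1, j)), default=0)
            g[i][j] = max(g[i + 1][j], g[i][j - 1], paired, split)
    return g


def traceback(seq, g):
    """Recover one optimal set of base pairs from the filled table."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()            # the "current" interval
        if i >= j:
            continue                  # empty or single-base interval
        if g[i + 1][j] == g[i][j]:    # base i was left unpaired
            stack.append((i + 1, j))
        elif g[i][j - 1] == g[i][j]:  # base j was left unpaired
            stack.append((i, j - 1))
        elif (seq[i], seq[j]) in PAIRS and g[i + 1][j - 1] + 1 == g[i][j]:
            pairs.append((i, j))      # "record" the pair i, j
            stack.append((i + 1, j - 1))
        else:                         # two substructures were joined
            for k in range(i + 1, j):
                if g[i][k] + g[k + 1][j] == g[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(pairs)


seq = "GGGAAAUCC"
print(traceback(seq, fill(seq)))
```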
12 RNA folding by energy minimization
Note: a bulge loop does not alter the stacking energy!
13 Principle of the Zuker algorithm (RNAFOLD)
- Energy minimization using a richer scoring system
  - Stacking energies: scores for overlapping dinucleotide pairs
  - Bulge loop scores dependent on length
  - Hairpin loop scores dependent on length and closing pair
  - Internal loop scores dependent on length and closing pair
- Same principle as the Nussinov algorithm, but
  - Two minimal energy values are stored for each subsequence:
    - W(i,j): best structure on i..j
    - V(i,j): best structure on i..j closed by the pair i,j
- Computational complexity essentially O(N^3)
  - (if constraints on maximal loop sizes are applied)
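To make the two-table idea concrete, here is a strongly simplified sketch. It is not the RNAFOLD implementation: all energy values are made-up toy numbers, stacking and loop energies ignore the identity of the closing pair, and multi-branch loops closed by a pair are left out. Only the roles of V, W and the loop-size cap `MAX_LOOP` follow the description above; every name in the block is an assumption of this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# Hypothetical toy energies in arbitrary units (NOT real RNAFOLD parameters).
STACK = -3.0          # one base pair stacked directly on the next
MIN_HAIRPIN = 3       # minimal number of unpaired bases in a hairpin loop
MAX_LOOP = 10         # cap on bulge/internal loop size


def hairpin(size):
    """Toy hairpin loop energy, dependent on length only."""
    return 3.0 + 0.5 * size


def loop(size):
    """Toy bulge/internal loop energy, dependent on length only."""
    return 2.0 + 0.5 * size


def min_energy(seq):
    """Minimal toy energy of seq using the tables V (closed by i,j) and W."""
    n = len(seq)
    if n == 0:
        return 0.0
    inf = float("inf")
    V = [[inf] * n for _ in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            if (seq[i], seq[j]) in PAIRS and span > MIN_HAIRPIN:
                best = hairpin(span - 1)          # i,j closes a hairpin loop
                for k in range(i + 1, j - 1):     # inner pair k,l with i<k<l<j
                    if k - i - 1 > MAX_LOOP:
                        break
                    for l in range(j - 1, k, -1):
                        unpaired = (k - i - 1) + (j - l - 1)
                        if unpaired > MAX_LOOP:
                            break
                        if V[k][l] < inf:
                            e = STACK if unpaired == 0 else loop(unpaired)
                            best = min(best, e + V[k][l])
                V[i][j] = best
            # W: either i,j closes the structure, or it splits at some k
            # (k = i and k = j - 1 cover an unpaired base at either end).
            best = V[i][j]
            for k in range(i, j):
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]


print(min_energy("GGGAAAACCC"))
```

With these toy numbers, the three stacked G-C pairs around a four-base hairpin loop are the only arrangement whose stacking gain outweighs the loop penalty. A sequence that cannot form a helix stays unfolded at energy 0.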
14 Energy parameters used by RNAFOLD
Note: Some energy terms (e.g. for the terminal mismatch of a hairpin) are missing.
15 Prediction of RNA structure by covariance models
Motivation: Energy minimization-based approaches often predict large numbers of alternative RNA secondary structures with very similar free energy.
- A multiple alignment of related RNAs potentially reveals base-pair interactions
- Interacting positions in a multiple alignment are expected to show co-variation compatible with standard RNA base-pairing rules
Limitation: requires within-column variation. No information is obtained for completely conserved positions.
16 Prediction of RNA structure by covariance models
Covariance measure used: mutual information
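For two alignment columns i and j, mutual information is commonly defined as M(i,j) = sum over base combinations (x,y) of f_ij(x,y) * log2( f_ij(x,y) / (f_i(x) * f_j(y)) ), where the f values are observed frequencies. The sketch below assumes this standard definition and takes each column as a string with one character per aligned sequence. Gap handling is left out, and the function name is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(col_i: str, col_j: str) -> float:
    """Mutual information (in bits) between two alignment columns."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    count_i = Counter(col_i)             # single-column base counts
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))  # joint counts of base combinations
    total = 0.0
    for (x, y), c in count_ij.items():
        # f_ij / (f_i * f_j) simplifies to c * n / (count_i * count_j).
        total += (c / n) * log2(c * n / (count_i[x] * count_j[y]))
    return total


# Perfect co-variation between two columns of four sequences.
print(mutual_information("GCAU", "CGUA"))
# Completely conserved columns carry no covariance information.
print(mutual_information("GGGG", "CCCC"))
```

The second call illustrates the limitation noted earlier: a completely conserved pair of columns scores zero even if the two positions do interact.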
17 Covariance analysis: tRNA-Phe
18 RNA motifs, signatures, domains, and families
- Terminology
  - Motif: short RNA region with partly conserved primary and secondary structure, usually with a defined function.
  - Signature: short RNA region with partly conserved primary and secondary structure, useful for identifying members of an RNA family.
  - Domain: a larger RNA region with conserved secondary structure, usually considered an independent folding unit.
  - Family: a family of homologous and/or structurally related RNA molecules, e.g. tRNAs.
- RNA sequence-structural motifs play a role in various biological processes
  - Translational control, e.g. iron-response element (IRE)
  - RNA degradation
  - RNA localization (zip-code motifs)
19 RNABOB: an example of an RNA pattern recognition program
Characteristics:
- Supports qualitative patterns (true/false, no scores or probabilities)
- Based on a simple but powerful pattern syntax
- Fast search engine
- Supports non-Watson-Crick type base interactions
- Supports pseudo-knots!
- Allows for errors (mismatches) in the pattern
20 RNABOB pattern syntax
Example:
s1 h1 s2 h2 s3 h2' h1'
h1 0:0 NNNNN:NNNNN
h2 0:0 NN:NN
s1 0 NN
s2 0 R
s3 0 ANYA
- The first line indicates the ordering of pattern elements
- s1, s2, s3 consist of contiguous unpaired sequences
- h1, h1' represent complementary sequence segments forming a double helix.
- Lines 2 to 6 contain the descriptions of each element
- NNNNN:NNNNN means that any base is permitted in this structure; the only constraint is that the bases have to respect base-pairing rules
- Numbers indicate how many mismatches are allowed per element.
- IUPAC codes are used to specify ambiguous positions, e.g. Y = C or T
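How a single-stranded element such as ANYA with a mismatch allowance could be matched is sketched below. This is an illustration of IUPAC matching only, not RNABOB's own code or search strategy. The function names are hypothetical, and T is treated as U.

```python
# IUPAC nucleotide codes mapped to the bases they stand for (T treated as U).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def count_mismatches(pattern: str, segment: str) -> int:
    """Number of positions where segment violates the IUPAC pattern."""
    segment = segment.upper().replace("T", "U")
    return sum(base not in IUPAC[code]
               for code, base in zip(pattern.upper(), segment))


def find_element(pattern: str, seq: str, max_mismatches: int = 0):
    """Return 0-based start positions where pattern matches seq."""
    m = len(pattern)
    return [start for start in range(len(seq) - m + 1)
            if count_mismatches(pattern, seq[start:start + m]) <= max_mismatches]


print(find_element("ANYA", "GGAGCAUU"))              # exact matches only
print(find_element("ANYA", "AGGA", max_mismatches=1))  # one mismatch allowed
```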