Title: On The RNA Structure Prediction Problems: Structural Inference Technique and Other Recent Algorithms
1. On The RNA Structure Prediction Problems: Structural Inference Technique and Other Recent Algorithms
2. Content of Presentation
- Introduction
  - RNA and Its Functions
  - RNA Structures
- Brief Review on Computational Methods for RNA Structure Prediction
  - Ab Initio Predictive Methods
  - Comparative Methods
  - Inference Methods
- Current Work
  - Introduction
  - Preliminaries and Problem Definition
  - Previous Approach
  - Sparsification
  - Recursive Dynamic Programming
  - Hirschberg-like Recursive Traceback Technique
- Conclusion and Future Direction
- References
3. RNA
- Stands for Ribonucleic Acid
- A biological polymer consisting of monomers called nucleotides
- Each nucleotide consists of a (ribose) sugar, a phosphate group and a base.
- There are mainly 4 types of base in RNA sequences: adenine (A), cytosine (C), guanine (G) and uracil (U).
4. RNA
[Figure: Watson-Crick base pairing and wobble base pairing]
5. RNA Functions
- DNA transcription and translation
  - Transcription: Messenger RNA (mRNA)
  - Translation: Transfer RNA (tRNA), Ribosomal RNA (rRNA)
- Catalyst and regulator in nucleic acid processing and gene expression
  - Messenger RNA splicing: Small Nuclear RNA (snRNA)
  - rRNA processing in the nucleus: Small Nucleolar RNA (snoRNA)
  - Regulators: Micro RNA (miRNA), which has two types:
    - 1) Small Interfering RNA (siRNA) and
    - 2) Small Temporally Regulated RNA (stRNA)
6. RNA Primary Structure
- The view of an RNA as its sequence of nucleotide bases
- Commonly represented by a string S over the alphabet $\Sigma = \{A, C, G, U\}$
- Can be determined using techniques similar to those for DNA sequencing, such as gel electrophoresis, etc.
7. RNA Secondary Structures
[Figure: RNA secondary structure elements: helices, bulge loop, hairpin loop, internal loop, multi loop]
8. RNA Tertiary Structures
[Figure: RNA tertiary structure elements: pseudoknot and base triple]
9. RNA Structure
- To a preserved function there corresponds a preserved molecular conformation.
- Secondary and tertiary structures are being solved much more slowly than new RNA sequences are being discovered. Existing experimental methods are relatively expensive and slow.
- Denatured RNAs deterministically fold back to their original folding in vitro. Thus, RNA structure depends solely on its nucleotide content. A computational method should exist!
10. Existing Computational Methods for RNA Structure Prediction
- Ab Initio Predictive Methods
  - Try to compute the RNA structure based solely on its nucleotide content by minimizing the free energy of the predicted structure.
- Comparative Methods using sequence homology
  - By examining a set of homologous sequences along with their covarying positions, we can predict interactions between non-adjacent positions in the sequence, such as base pairs, triples, etc.
- Structural Inference Methods
  - Given a sequence with a known structure, we infer the structure of another sequence known to be similar to the first one by maximizing some similarity function.
11. Ab Initio Predictive Methods
- Minimizing the total free energy of the predicted structure
  - Uses experimentally determined free energies of local structures.
  - Best result without pseudoknots: $O(n^3)$ time [1]
  - Best results with restricted pseudoknots:
    - Simple pseudoknot [2]: $O(n^4)$ time
    - Recursive pseudoknot [2]: $O(n^5)$ time
    - General pseudoknot [2]: NP-hard
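As a rough illustration of the cubic-time dynamic programming behind pseudoknot-free prediction, here is a minimal Python sketch. It is not the thermodynamic model used by real predictors: it assumes a toy energy in which every Watson-Crick or wobble pair contributes -1 (Nussinov-style pair maximization). The names `PAIRS` and `min_energy` and the `min_loop` parameter are hypothetical.

```python
# Toy energy model (an assumption for illustration): each allowed base pair
# contributes -1, so minimizing energy maximizes the number of nested pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def min_energy(seq, min_loop=3):
    """Minimum toy energy of a pseudoknot-free structure of seq.

    min_loop is the assumed minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    # E[i][j] = minimum energy of the subsequence seq[i..j] (inclusive)
    E = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):           # span = j - i
        for i in range(n - span):
            j = i + span
            best = E[i][j - 1]                    # case 1: base j is unpaired
            for k in range(i, j - min_loop):      # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = E[i][k - 1] if k > i else 0
                    best = min(best, left + E[k + 1][j - 1] - 1)
            E[i][j] = best
    return E[0][n - 1] if n else 0


if __name__ == "__main__":
    print(min_energy("GGGAAAUCC"))
```

The three nested loops over `span`, `i` and `k` give the cubic running time; a realistic energy model replaces the constant -1 with loop-dependent terms.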
12. Ab Initio Predictive Methods
- Equilibrium Partition Function
  - A weighted sum over all possible structures, where the weight of each structure is computed from its free energy; the probability of each structure follows from it.
  - Best result without pseudoknots: $O(n^3)$ time [1]
  - Best result with restricted pseudoknots: $O(n^5)$ time [3] (restricted to the class of pseudoknots that are physically most likely to occur)
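For reference, the partition function is commonly written as a Boltzmann-weighted sum. The symbols are standard notation assumed here, not taken from the slides: $\Omega$ is the set of admissible structures of the sequence, $E(s)$ the free energy of structure $s$, $R$ the gas constant and $T$ the temperature.

```latex
Z = \sum_{s \in \Omega} e^{-E(s)/RT},
\qquad
P(s) = \frac{e^{-E(s)/RT}}{Z}
```

Lower-energy structures receive larger weights, and dividing by $Z$ turns the weights into probabilities.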
13. Comparative Methods
- Simultaneous Sequence Structure Alignment
  - First proposed by D. Sankoff [4], with $O(n^6)$ time complexity
  - Best result without pseudoknots: $O(M^3 n^3)$ [5], where M is the maximum distance between the 2 sequences
- Stochastic Context Free Grammars (SCFG)
  - Use a context free grammar to produce the base pairing with some probability distribution (hence the term stochastic)
  - Can only handle non-pseudoknotted structures
  - Most recent result in [6]
  - A new model called Parallel Communicating Grammar System (PCGS) is designed in [7] to handle pseudoknots
14. Comparative Methods
- Maximum Weighted Matching
  - Based on Gabow's Maximal Weighted Matching algorithm. Tries to find the base pairs in the structure given the likelihood score of all possible pairs.
  - Computing the likelihood score might require multiple sequence alignment (slow)
  - The most recent publications are [8] and [9]. [8] only considers bi-secondary RNA structures, while [9] tries to find helices of some minimum length in the sequences and then align them.
15. Comparative Methods
- Iterative Loop Matching [10]
  - Applies the Loop Matching algorithm by Nussinov et al. The algorithm finds a non-pseudoknotted structure in each iteration and runs the same algorithm on the remaining unpaired bases.
- Genetic algorithm [11]
  - Finds a set of possible structures. The algorithm then passes these structures through several stages of evolution.
- Bayesian Network and other approaches
16. Structural Inferring Methods
- Given two RNA sequences S1 and S2, where the secondary structure of S1 is known, methods of this class infer the secondary structure of S2 by aligning S1 and S2. Let the length of S1 be n and the length of S2 be m.
- Previously, Bafna et al. used dynamic programming to solve the problem in $O(n^2m^2 + nm^3)$ time and $O(n^2m^2)$ space [12]. K. Zhang improved the result to $O(nm^3)$ time and $O(nm^2)$ space [13].
- We further improve it to $O(nm^2 \log n)$ time and $O(m^2 \log n)$ space. The algorithm will be described later.
- The survey of works related to this class of algorithms can be found here
17. Current Work
- We submitted a paper to WABI 2004 under the title "A Faster and More Space-Efficient Algorithm for Inferring Arc-Annotations of RNA Sequences through Alignment".
- Our contributions
  - Improvement in running time by sparsification and recursive dynamic programming
  - Improvement in space requirement using score-only dynamic programming with a Hirschberg-like traceback algorithm and compression.
18. Preliminaries
- Consider two RNA sequences S1 and S2 with lengths n and m respectively. Only the secondary structure of S1 is provided.
- To represent a base pair between the bases S1[i] and S1[j], we use the pair (i, j), called an arc, where $1 \le i < j \le n$. The structure of S1 can then be defined by a set P1 of arcs. The pair (S1, P1) is called an arc-annotated sequence.
- For RNA, S1[i] and S1[j] must be complementary to each other.
19. Preliminaries
- Considering secondary structures, the arc annotation that corresponds to such structures is the Nested Arc Annotation
- Any two arcs (i, j), (k, l) in a nested arc annotation P1 satisfy $i < k < j \iff i < l < j$
- For any arc u in P1 let u_l be its left endpoint and u_r be its right endpoint. The size of an arc is equal to $u_r - u_l + 1$.
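The nestedness condition and the arc size can be checked mechanically. The following is a minimal Python sketch, assuming arcs are given as (left, right) tuples of 1-based positions with distinct endpoints; the function names `is_nested` and `arc_size` are hypothetical.

```python
from itertools import combinations


def is_nested(arcs):
    """True if every two arcs (i, j), (k, l) satisfy  i < k < j  <=>  i < l < j."""
    for (i, j), (k, l) in combinations(arcs, 2):
        if (i < k < j) != (i < l < j):
            return False  # the two arcs cross
    return True


def arc_size(arc):
    """Size of an arc u = (u_l, u_r), defined as u_r - u_l + 1."""
    left, right = arc
    return right - left + 1


if __name__ == "__main__":
    print(is_nested([(1, 10), (2, 5), (6, 9)]))  # nested and disjoint arcs only
    print(is_nested([(1, 4), (2, 6)]))           # crossing arcs (a pseudoknot)
    print(arc_size((2, 5)))
```

Crossing arcs are exactly what a pseudoknot looks like in this representation, which is why the nested annotation corresponds to secondary structure only.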
20. Alignment Score Function
- Unpaired base alignment score function
  - The score of aligning S1[i] with S2[j] is β if S1[i] and S2[j] are complementary, and 0 otherwise
- Arc Alignment Score Function
- α1, α2, and β are positive integers.
21. Problem Formulation
- The Weighted Largest Common Substructure (WLCS) of two arc-annotated sequences (S1, P1) and (S2, P2) is the maximum weighted alignment between S1 and S2 where unpaired bases are aligned to unpaired bases and arcs are aligned to arcs.
- The problem we address is: given (S1, P1) and S2, infer the arc annotation P2 of S2 such that their WLCS is maximized.
22. Previous Algorithm
EXTEND(DP(i,i))
MERGE(DP(i, u_l - 1), DP(u_l, u_r))
ARC-MATCH(DP(i,i))
23. Previous Algorithm (2)
- All EXTEND operations take $O(nm^2)$ time and space
- All ARC-MATCH operations also take $O(nm^2)$ time and space
- The bottleneck of the computation is the procedure MERGE, each requiring $O(m^3)$ time, yielding a total of $O(nm^3)$ time
24. Sparsification Technique
- Based on the observation that the entries in each row of the table DP are monotonically non-decreasing, we do not need to check all possible j in the MERGE equation. Instead, we check only the positions j where the corresponding DP entries are distinct.
- This way, we can reduce the cost of each MERGE operation to $O(\min\{|u_l - u_r|, |u_l - i|\}\, m^2)$
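A minimal Python sketch of the idea for a single MERGE cell, under assumptions not spelled out on the slide: the cell is a maximum over split positions j of `left[j] + right[j]`, where `left` is non-decreasing in j and `right` is non-increasing in j. The names `step_positions`, `merge_dense` and `merge_sparse` are hypothetical.

```python
def step_positions(left):
    """Indices where a non-decreasing row takes a new (distinct) value."""
    return [j for j in range(len(left)) if j == 0 or left[j] != left[j - 1]]


def merge_dense(left, right):
    """Reference version: try every split position j."""
    return max(l + r for l, r in zip(left, right))


def merge_sparse(left, right, steps=None):
    """Same maximum, but only evaluated where `left` steps up.

    Within a run of equal `left` values, `right` is largest at the first
    position of the run (it is non-increasing), so later positions in the
    run can never win and are skipped.
    """
    if steps is None:
        steps = step_positions(left)
    return max(left[j] + right[j] for j in steps)


if __name__ == "__main__":
    left = [0, 0, 1, 1, 1, 2, 2]
    right = [3, 3, 2, 2, 1, 1, 0]
    print(step_positions(left), merge_dense(left, right), merge_sparse(left, right))
```

In a real implementation the step positions of a row would be computed once and reused for every cell that reads that row, so the work per cell depends on the number of distinct values rather than on m.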
25. Recursive Dynamic Programming
- An arc u is the parent of an arc v iff $u_l < v_l < v_r < u_r$ and there is no arc w such that $u_l < w_l < v_l < v_r < w_r < u_r$
- Conversely, v is a child of u
- Let core-arc(u) be the child of arc u with the largest size.
- Let side-arc(u) be the set of children of u excluding its core-arc.
- Let core-path(u) be the transitive closure of core-arc(u), i.e. core-arc(u), core-arc(core-arc(u)), and so on.
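These definitions translate directly into code. Below is a minimal Python sketch, assuming a nested arc annotation given as a list of (left, right) tuples; the function names are hypothetical, and ties between equally large children are broken by list order, which the slides do not specify.

```python
def size(arc):
    return arc[1] - arc[0] + 1


def children(arcs, u):
    """Arcs strictly inside u with no other arc sitting between them and u."""
    inside = [v for v in arcs if u[0] < v[0] and v[1] < u[1]]
    return [v for v in inside
            if not any(w[0] < v[0] and v[1] < w[1] for w in inside)]


def core_arc(arcs, u):
    """The largest child of u, or None if u has no children."""
    kids = children(arcs, u)
    return max(kids, key=size) if kids else None


def side_arcs(arcs, u):
    """All children of u except its core arc."""
    core = core_arc(arcs, u)
    return [v for v in children(arcs, u) if v != core]


def core_path(arcs, u):
    """core-arc(u), core-arc(core-arc(u)), ... until an arc has no children."""
    path = []
    c = core_arc(arcs, u)
    while c is not None:
        path.append(c)
        c = core_arc(arcs, c)
    return path


if __name__ == "__main__":
    P1 = [(1, 20), (2, 12), (3, 6), (7, 11), (13, 15), (16, 19)]
    print(children(P1, (1, 20)))
    print(core_arc(P1, (1, 20)), side_arcs(P1, (1, 20)))
    print(core_path(P1, (1, 20)))
```

This sketch scans the whole arc list for each query, which is quadratic; it is meant to mirror the definitions, not to be efficient.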
26. Recursive Dynamic Programming
[Figure: recursive dynamic programming, with regions labeled Left Part, Right Part and Computed]
27. Running Time Analysis
- Since we compute the DP table only for side arcs, and since the size of any side arc does not exceed half the size of its parent, the recursion reaches at most log n levels.
- In each level, the total time spent by MERGE is at most $O(nm^2)$
- Thus the total running time is still bounded by the MERGE operation, which takes $O(nm^2 \log n)$
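The halving claim can be verified in one line. The sketch below assumes that $v$ is a side arc of $u$, that $c$ is the core arc of $u$, and that $|u| = u_r - u_l + 1$ denotes arc size; it uses only the facts that the children of $u$ are disjoint, lie strictly inside $u$, and that $c$ is the largest child.

```latex
|v| \le |c| \quad\text{and}\quad |v| + |c| \le |u| - 2
\;\Longrightarrow\;
|v| \le \frac{|u|}{2} - 1,
\qquad
T(n, m) = \sum_{d=0}^{\lfloor \log_2 n \rfloor} O(nm^2) = O(nm^2 \log n)
```

Each recursion level at least halves the arc size, so there are at most about $\log_2 n$ levels, each costing $O(nm^2)$ for MERGE.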
28. Space Improvement
- In order to use standard traceback, we need to store all the tables corresponding to the arcs in P1.
- This requires $O(nm^2)$ storage. For sequences of length 3-5K, which are commonly used in lab experiments, the storage requirement can reach tens of gigabytes.
- Solution: use the score-only version of the dynamic programming and recursion to trace back.
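A back-of-the-envelope check of the storage figure, as a small Python sketch. It assumes exactly n tables of m x m entries and 1 byte per entry; both are illustrative assumptions, since the slides give only the asymptotic bound. The function name `table_bytes` is hypothetical.

```python
def table_bytes(n, m, bytes_per_entry=1):
    """Bytes needed for n tables of m x m entries (constant factors ignored)."""
    return n * m * m * bytes_per_entry


if __name__ == "__main__":
    for length in (3000, 5000):
        gb = table_bytes(length, length) / 1e9
        print(f"n = m = {length}: about {gb:.0f} GB")
```

Under these assumptions, two sequences of length 3000 already need 27 GB, and wider entries scale that figure linearly.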
29. Hirschberg-like Trace Back Algorithm
1. Find two points p1 and p2 in S1 such that p2 - p1 is at least n/3
2. Find the positions in S2 to which p1 and p2 are aligned
3. Divide the problem into two subproblems, each a fraction of the size of the original
4. Summing up the decreasing geometric series, we still have the same running time as before
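For readers unfamiliar with the underlying idea, here is the classic Hirschberg technique on plain strings, as a minimal Python sketch. It is not the arc-annotated variant described in the steps above: it computes a longest common subsequence, splits the first string at its midpoint instead of at two points p1 and p2, and uses hypothetical names `lcs_lengths` and `hirschberg`.

```python
def lcs_lengths(a, b):
    """Score-only DP: last row of the LCS table of a vs b, in O(len(b)) space."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg(a, b):
    """Return one longest common subsequence of a and b using linear space."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Scores of the left half against prefixes of b, and of the right half
    # against suffixes of b (computed on the reversed strings).
    left = lcs_lengths(a[:mid], b)
    right = lcs_lengths(a[mid:][::-1], b[::-1])
    # Position in b to which the midpoint of a is aligned.
    k = max(range(len(b) + 1), key=lambda t: left[t] + right[len(b) - t])
    return hirschberg(a[:mid], b[:k]) + hirschberg(a[mid:], b[k:])


if __name__ == "__main__":
    print(hirschberg("GCAUGCU", "GAUUACA"))
```

The two recursive calls together work on about half of the original table area, so the total work forms a decreasing geometric series while only score rows are ever stored.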
30. Conclusion
- In general, much remains to be done in RNA structure prediction.
- We considered a problem of RNA structure inference where we infer the structure of an RNA sequence given a similar sequence with known structure.
- Our technique is general enough that it can also directly solve the LAPCS problem mentioned elsewhere [14]
- We wish to handle pseudoknots in the future by applying the algorithm iteratively, following the idea of Iterative Loop Matching
31. References
- [1] R. B. Lyngsø, M. Zuker, and C. N. S. Pedersen. Internal loops in RNA secondary structure prediction. In ICMB, pages 260-267, 1999.
- [2] T. Akutsu. Dynamic programming algorithms for RNA secondary structure with pseudoknots. In Disc. Appl. Math, volume 104, pages 45-62, 2000.
- [3] R. M. Dirks and N. A. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. In J. Comput. Chem., volume 24, pages 1664-1677, 2003.
- [4] D. Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. In SIAM J. Appl. Math, volume 45, pages 810-825, 1985.
- [5] D. Mathews and D. Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. In J. Mol. Biol, volume 317, pages 191-203, 2002.
- [6] B. Knudsen and J. Hein. RNA secondary structure prediction using stochastic context free grammars and evolutionary history. In Bioinformatics (6), volume 15, pages 446-454, 1999.
- [7] L. M. Cai, R. L. Malmberg, and Y. Z. Wu. Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. In Bioinformatics (suppl. 3), volume 15, pages 166-173, 2003.
32. References (2)
- [8] C. Witwer, I. L. Hofacker, and P. F. Stadler. Prediction of consensus RNA secondary structures including pseudoknots. To appear in European Conference on Computational Biology, 2004.
- [9] Yongmei Ji, Xing Xu, and G. D. Stormo. A graph theoretical approach to predict common RNA secondary structure motifs including pseudoknots in unaligned sequences. To appear in Bioinformatics, 2004.
- [10] Jianhua Ruan, G. D. Stormo, and Weixiong Zhang. An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. In Bioinformatics (1), volume 20, pages 58-66, 2004.
- [11] J. H. Chen, S. Y. Le, and J. V. Maizel. Prediction of common secondary structures of RNA: a genetic algorithm approach. In Nuc. Acids Res. (4), volume 28, pages 991-999, 2000.
- [12] V. Bafna, S. Muthukrishnan, and R. Ravi. Computing similarity between RNA strings. In CPM, volume 937, pages 1-16, 1995. Springer-Verlag.
- [13] K. Zhang. Computing similarity between RNA secondary structures. In IEEE International Joint Symposia on Intelligence and Systems, pages 126-132, 1998.
33. References (3)
- [14] T. Jiang, G. H. Lin, B. Ma, and K. Zhang. The longest common subsequence problem for arc-annotated sequences. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848, pages 154-165. Springer-Verlag, 2000.