On The RNA Structure Prediction Problems: Structural Inference Technique and Other Recent Algorithms

1
On The RNA Structure Prediction Problems
Structural Inference Technique and Other Recent
Algorithms
  • Hugo Willy
  • HT030031E

2
Content of Presentation
  • Introduction
      • RNA and Its Functions
      • RNA Structures
  • Brief Review on Computational Methods on RNA
    Structure Prediction
      • Ab Initio Predictive Methods
      • Comparative Methods
      • Inference Methods
  • Current Work
      • Introduction
      • Preliminaries and Problem Definition
      • Previous Approach
      • Sparsification
      • Recursive Dynamic Programming
      • Hirschberg-like Recursive Traceback Technique
  • Conclusion and Future Direction
  • References

3
RNA
  • Stands for Ribonucleic Acid
  • A biological polymer consisting of monomers called
    nucleotides
  • Each nucleotide consists of a (ribose) sugar, a
    phosphate group and a base.
  • There are mainly four types of bases in RNA
    sequences: adenine (A), cytosine (C), guanine (G)
    and uracil (U).

4
RNA
Watson-Crick Base Pairing
Wobble Base Pairing
5
RNA Functions
  • DNA transcription and translation
  • Transcription: messenger RNA (mRNA)
  • Translation: transfer RNA (tRNA), ribosomal RNA
    (rRNA)
  • Catalyst and regulator in nucleic acid processing
    and gene expression
  • Messenger RNA splicing: small nuclear RNA
    (snRNA)
  • rRNA processing in the nucleus: small nucleolar
    RNA (snoRNA)
  • Regulators: micro RNA (miRNA), which has two
    types:
  • 1) Small Interfering RNA (siRNA) and
  • 2) Small Temporally Regulated RNA (stRNA)

6
RNA Primary Structure
  • The view of an RNA as its sequence of nucleotide
    bases
  • Commonly represented by a string S over the
    alphabet Σ = {A, C, G, U}
  • Can be determined using techniques similar to
    those for DNA sequencing, such as gel
    electrophoresis.

7
RNA Secondary Structures
Helices
Bulge Loop
Hairpin Loop
Internal Loop
Multi Loop
8
RNA Tertiary Structures
Pseudoknot
Base Triple
9
RNA Structure
  • A conserved function corresponds to a conserved
    molecular conformation.
  • Secondary and tertiary structures are being
    solved much more slowly than new RNA sequences
    are being discovered. Existing experimental
    methods are relatively expensive and slow.
  • Denatured RNAs deterministically fold back to
    their original folding in vitro. Thus, RNA
    structure depends solely on its nucleotide
    content, so a computational method should exist!

10
Existing Computational Methods for RNA Structure
Prediction
  • Ab Initio Predictive Methods
  • Try to compute the RNA structure based solely on
    its nucleotide content by minimizing the free
    energy of the predicted structure.
  • Comparative Methods using sequence homology
  • By examining a set of homologous sequences along
    with their covarying positions, we can predict
    interactions between non-adjacent positions in
    the sequence, such as base pairs, triples, etc.
  • Structural Inference Methods
  • Given a sequence with a known structure, we infer
    the structure of another sequence known to be
    similar to the first one by maximizing some
    similarity function.

11
Ab Initio Predictive Methods
  • Minimizing the total free energy of the
    predicted structure
  • Uses experimentally determined free energies of
    local structures
  • Best result without pseudoknots: O(n³) time [1]
  • Best results with restricted pseudoknots:
  • Simple pseudoknots [2]: O(n⁴) time
  • Recursive pseudoknots [2]: O(n⁵) time
  • General pseudoknots [2]: NP-hard

12
Ab Initio Predictive Methods
  • Equilibrium Partition Function
  • A weighted sum over all possible structures,
    where the weight of each structure is computed
    from its free energy.
  • Best result without pseudoknots: O(n³) time [1]
  • Best result with restricted pseudoknots: O(n⁵)
    time [3] (restricted to the class of pseudoknots
    that are physically most likely to occur)
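
For reference, the partition function described above is usually written in the following standard Boltzmann form. The symbols are assumptions not given on the slide: Ω is the set of admissible structures, ΔG(s) the free energy of structure s, R the gas constant and T the temperature.

```latex
\[
Z \;=\; \sum_{s \in \Omega} e^{-\Delta G(s)/RT},
\qquad
P(s) \;=\; \frac{e^{-\Delta G(s)/RT}}{Z}
\]
```

Here P(s) is the equilibrium probability of structure s, which is why computing Z is the key step.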

13
Comparative Methods
  • Simultaneous Sequence Structure Alignment
  • First by D. Sankoff [4], with O(n⁶) time
    complexity
  • Best result without pseudoknots: O(M³n³) [5],
    where M is the maximum distance between the two
    sequences
  • Stochastic Context Free Grammars (SCFG)
  • Use a context-free grammar to produce the base
    pairing with some distribution (hence the term
    stochastic)
  • Can only handle non-pseudoknotted structures
  • Most recent result in [6]
  • A new model called Parallel Communicating Grammar
    System (PCGS) is designed in [7] to handle
    pseudoknots

14
Comparative Methods
  • Maximum Weighted Matching
  • Based on Gabow's Maximum Weighted Matching
    algorithm. Tries to find the base pairs in the
    structure given the likelihood score of all
    possible pairs.
  • Computing the likelihood score might require
    multiple sequence alignment (slow)
  • The most recent publications are [8] and [9].
    [8] only considers bi-secondary RNA structures,
    while [9] tries to find helices of some minimum
    length in the sequences and then align them.

15
Comparative Methods
  • Iterative Loop Matching [10]
  • Applies the Loop Matching algorithm by Nussinov
    et al. The algorithm finds a non-pseudoknotted
    structure in each iteration and runs the same
    algorithm on the remaining unpaired bases.
  • Genetic algorithm [11]
  • Finds a set of possible structures. The algorithm
    then passes these structures through several
    stages of evolution.
  • Bayesian Network and other approaches
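
As an illustration of the loop matching step named above, here is a minimal sketch of Nussinov-style base-pair maximization in O(n³) time. It assumes unit score per pair, Watson-Crick plus G-U wobble pairs, and a minimum hairpin loop of 3 bases; the function name and parameters are hypothetical, and the published methods use richer scores.

```python
# Allowed pairs: Watson-Crick (A-U, G-C) plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (no pseudoknots)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the substring seq[i..j], inclusive.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

An iterative variant in the spirit of [10] would record the chosen pairs, remove the paired bases, and rerun the same routine on what remains.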

16
Structural Inference Methods
  • Given two RNA sequences S1 and S2, where only the
    secondary structure of S1 is known, methods of
    this class infer the secondary structure of S2
    by aligning S1 and S2. Let the length of S1 be n
    and the length of S2 be m.
  • Previously, Bafna et al. used dynamic programming
    to solve the problem in O(n²m² + nm³) time and
    O(n²m²) space [12]. K. Zhang improved the result
    to O(nm³) time and O(nm²) space [13].
  • We further improve it to O(nm² log n) time and
    O(m² log n) space. The algorithm is described
    later.
  • A survey of work related to this class of
    algorithms can be found here

17
Current Work
  • We submitted a paper to WABI 2004 under the title
    "A Faster and More Space-Efficient Algorithm for
    Inferring Arc-Annotations of RNA Sequences
    through Alignment".
  • Our contributions:
  • Improvement in running time by sparsification and
    recursive dynamic programming
  • Improvement in space requirement using score-only
    dynamic programming with a Hirschberg-like
    traceback algorithm and compression.

18
Preliminaries
  • Consider two RNA sequences S1 and S2 of lengths
    n and m, respectively. Only S1's secondary
    structure is provided.
  • To represent a base pair between the bases S1[i]
    and S1[j], we use the pair (i, j), called an
    arc, where 1 ≤ i < j ≤ n. The structure of S1 can
    then be defined by a set P1 of arcs. The pair
    (S1, P1) is called an arc-annotated sequence.
  • For RNA, S1[i] and S1[j] must be complementary
    to each other.

19
Preliminaries
  • Considering secondary structures, the arc
    annotation that corresponds to such structures is
    the Nested Arc Annotation
  • Any two arcs (i, j), (k, l) in a nested arc
    annotation P1 satisfy i < k < j ⇔ i < l < j
  • For any arc u in P1 let u_l be its left endpoint
    and u_r be its right endpoint. The size of an arc
    is equal to u_r - u_l + 1.
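
The nestedness condition and the arc size can be checked mechanically. The sketch below assumes arcs are given as (left, right) tuples of 1-based positions with each base in at most one arc; the function names are hypothetical.

```python
def is_nested(arcs):
    """True if no two arcs cross or share an endpoint (nested annotation)."""
    open_rights = []  # right endpoints of the arcs currently enclosing us
    for left, right in sorted(arcs):
        # Close every arc that ends before this one starts.
        while open_rights and open_rights[-1] < left:
            open_rights.pop()
        # The new arc starts inside the innermost open arc, so it must
        # also end inside it; otherwise the two arcs cross.
        if open_rights and right > open_rights[-1]:
            return False
        open_rights.append(right)
    return True


def arc_size(arc):
    """Size of arc u = (u_l, u_r), defined as u_r - u_l + 1."""
    left, right = arc
    return right - left + 1
```

For example, the arcs (1, 5) and (3, 8) cross, so an annotation containing both is not nested.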

20
Alignment Score Function
  • Unpaired base alignment score function: the score
    for aligning S1[i] with S2[j] is β if S1[i] and
    S2[j] are complementary, and 0 otherwise
  • Arc Alignment Score Function
  • α1, α2, and β are positive integers.

21
Problem Formulation
  • The Weighted Largest Common Substructure (WLCS)
    of two arc-annotated sequences (S1, P1) and
    (S2, P2) is the maximum weighted alignment between
    S1 and S2 where unpaired bases are aligned to
    unpaired bases and arcs are aligned to arcs.
  • The problem we address is: given (S1, P1) and S2,
    infer the arc annotation P2 of S2 such that their
    WLCS is maximized.

22
Previous Algorithm
EXTEND(DP(i,i))
MERGE(DP(i, u_l - 1), DP(u_l, u_r))
ARC-MATCH (DP(i,i))
23
Previous Algorithm (2)
  • All EXTEND operations take O(nm²) time and space
  • All ARC-MATCH operations also take O(nm²) time
    and space
  • The bottleneck of the computation is the
    procedure MERGE, each call requiring O(m³) time,
    yielding a total of O(nm³) time

24
Sparsification Technique
  • Based on the observation that the entries in the
    rows of table DP are monotonically
    non-decreasing, we do not need to check all
    possible j in the MERGE equation. Instead, we
    check only the positions j where the
    corresponding DP entries are distinct.
  • This way, we can reduce the cost of each MERGE
    operation to
  • O(min{|u_l - u_r|, |u_l - i|} · m²)
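
The observation can be illustrated with a simplified one-dimensional stand-in for MERGE (an assumption, not the actual recurrence from the talk): maximize a[j] + b[j] over split points j, where a is non-decreasing in j and b is non-increasing in j. Within a run of equal a values the first position is always at least as good, so only the positions where a changes need checking. All names below are hypothetical.

```python
def merge_full(a, b):
    """Try every split point j: cost proportional to len(a)."""
    return max(x + y for x, y in zip(a, b))


def breakpoints(a):
    """Positions where the non-decreasing row a takes a new value."""
    return [j for j in range(len(a)) if j == 0 or a[j] != a[j - 1]]


def merge_sparse(a, b):
    """Try only the breakpoints of a: cost proportional to the number
    of distinct values in a. Assumes b is non-increasing."""
    return max(a[j] + b[j] for j in breakpoints(a))
```

When the number of distinct values in a row is bounded by the size of the corresponding arc or prefix, the saving has the min{...} shape stated on the slide.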

25
Recursive Dynamic Programming
  • An arc u is the parent of arc v iff
    u_l < v_l < v_r < u_r and there is no arc w s.t.
    u_l < w_l < v_l < v_r < w_r < u_r
  • Conversely, v is a child of u
  • Let core-arc(u) be the largest child arc of u.
  • Let side-arc(u) be the set of children of u
    excluding its core-arc.
  • Let core-path(u) be the transitive closure of
    core-arc(u), i.e. the chain of arcs obtained by
    repeatedly taking the core-arc, starting from u.
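
These definitions can be sketched directly in code. The example assumes a nested arc set given as (left, right) tuples with no shared endpoints; the function names mirror the slide's terms but are otherwise hypothetical, and ties between equally large children are broken arbitrarily.

```python
def build_children(arcs):
    """Map every arc to the list of its child arcs."""
    children = {arc: [] for arc in arcs}
    enclosing = []  # stack of arcs that contain the current position
    for arc in sorted(arcs):  # increasing left endpoint
        while enclosing and enclosing[-1][1] < arc[0]:
            enclosing.pop()
        if enclosing:  # innermost enclosing arc is the parent
            children[enclosing[-1]].append(arc)
        enclosing.append(arc)
    return children


def arc_size(arc):
    return arc[1] - arc[0] + 1


def core_arc(arc, children):
    """Largest child of arc, or None if arc has no children."""
    kids = children[arc]
    return max(kids, key=arc_size) if kids else None


def side_arcs(arc, children):
    """All children of arc except its core-arc."""
    core = core_arc(arc, children)
    return [kid for kid in children[arc] if kid != core]


def core_path(arc, children):
    """Chain of core-arcs reached from arc, outermost first."""
    path = []
    nxt = core_arc(arc, children)
    while nxt is not None:
        path.append(nxt)
        nxt = core_arc(nxt, children)
    return path
```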

26
Recursive Dynamic Programming
Left Part
Right Part
Computed
27
Running Time Analysis
  • Since we compute the DP table only for side arcs,
    and since the size of any side arc does not
    exceed ½ the size of its parent, the recursion
    reaches at most log n levels.
  • In each level, the total time spent by MERGE is
    at most O(nm²)
  • Thus the total running time is still bounded by
    the MERGE operation, which is O(nm² log n)
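
The halving argument can be spelled out as follows. This is a sketch under the slide's definitions, where s is any side arc of u, c is the core-arc of u, and size(u) = u_r - u_l + 1; s and c are disjoint and lie strictly inside u.

```latex
\begin{align*}
\mathrm{size}(s) + \mathrm{size}(c) &\le \mathrm{size}(u) - 2,
  \qquad \mathrm{size}(s) \le \mathrm{size}(c) \\
\Longrightarrow \quad \mathrm{size}(s) &< \tfrac{1}{2}\,\mathrm{size}(u) \\
T(n, m) &= \sum_{k=1}^{\lceil \log_2 n \rceil} O(nm^2) \;=\; O(nm^2 \log n)
\end{align*}
```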

28
Space Improvement
  • In order to use standard traceback, we need to
    store all the tables corresponding to the arcs
    in P1.
  • This requires O(nm²) storage. For sequences of
    length 3-5K, which are commonly used in lab
    experiments, the storage requirement can reach
    tens of gigabytes.
  • Solution: use the score-only version of the
    dynamic programming and recursion to trace back.
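
A rough back-of-the-envelope check of the storage figures, ignoring constant factors. The one byte per table entry is an assumption made only for illustration, and the function names are hypothetical.

```python
import math


def full_table_entries(n, m):
    """Entries kept by standard traceback: on the order of n * m^2."""
    return n * m * m


def score_only_entries(n, m):
    """Entries for the score-only variant: on the order of m^2 * log n."""
    return m * m * math.ceil(math.log2(n))


def to_gib(entries, bytes_per_entry=1):
    """Convert an entry count to GiB (bytes_per_entry is an assumption)."""
    return entries * bytes_per_entry / 2**30
```

With n = m = 4000, the full tables hold 6.4 × 10^10 entries, which is tens of gigabytes even at one byte per entry, while the score-only bound is below 2 × 10^8 entries.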

29
Hirschberg-like Trace Back Algorithm
1. Find two points p1 and p2 in S1 such that
   p2 - p1 is at least n/3
2. Find the positions in S2 to which p1 and p2 are
   aligned
3. Divide the problem into two subproblems, each
   a constant fraction of the size of the original
4. Summing up the decreasing geometric series, we
   still have the same running time as before
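
For comparison, here is a minimal sketch of the classic Hirschberg technique that the traceback is modeled on, applied to the plain longest common subsequence of two strings. It is not the arc-annotated algorithm from the talk: it splits the first string at its midpoint instead of choosing p1 and p2, and the function names are hypothetical.

```python
def lcs_last_row(a, b):
    """Score-only DP: LCS length of a against every prefix of b.
    Keeps a single row, so space is O(len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg_lcs(a, b):
    """Recover an actual LCS without storing the full DP table."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Scores of the left half against prefixes of b ...
    left = lcs_last_row(a[:mid], b)
    # ... and of the right half against suffixes of b (via reversal).
    right = lcs_last_row(a[mid:][::-1], b[::-1])
    # Position in b where the midpoint of a is aligned.
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg_lcs(a[:mid], b[:split]) + hirschberg_lcs(a[mid:], b[split:])
```

Each recursion level redoes score-only work on subproblems whose total size shrinks by a constant factor, which is the geometric series mentioned in step 4.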
30
Conclusion
  • In general, much remains to be done in RNA
    structure prediction.
  • We considered a problem of RNA structure
    inference where we infer the structure of an RNA
    sequence given a similar sequence with known
    structure.
  • Our technique is general enough that it can also
    directly solve the LAPCS problem mentioned
    in [14]
  • We wish to handle pseudoknots in the future by
    applying the algorithm iteratively, following the
    idea of Iterative Loop Matching

31
References
  • [1] R. B. Lyngsø, M. Zuker, and C. N. S. Pedersen.
    Internal loops in RNA secondary structure
    prediction. In ICMB, pages 260-267, 1999.
  • [2] T. Akutsu. Dynamic programming algorithms for
    RNA secondary structure with pseudoknots. In
    Disc. Appl. Math., volume 104, pages 45-62, 2000.
  • [3] R. M. Dirks and N. A. Pierce. A partition
    function algorithm for nucleic acid secondary
    structure including pseudoknots. In J. Comput.
    Chem., volume 24, pages 1664-1677, 2003.
  • [4] D. Sankoff. Simultaneous solution of the RNA
    folding, alignment and protosequence problems. In
    SIAM J. Appl. Math., volume 45, pages 810-825,
    1985.
  • [5] D. Mathews and D. Turner. Dynalign: an
    algorithm for finding the secondary structure
    common to two RNA sequences. In J. Mol. Biol.,
    volume 317, pages 191-203, 2002.
  • [6] B. Knudsen and J. Hein. RNA secondary
    structure prediction using stochastic context
    free grammars and evolutionary history. In
    Bioinformatics (6), volume 15, pages 446-454,
    1999.
  • [7] L. M. Cai, R. L. Malmberg, and Y. Z. Wu.
    Stochastic modeling of RNA pseudoknotted
    structures: a grammatical approach. In
    Bioinformatics (suppl. 3), volume 15, pages
    166-173, 2003.

32
References (2)
  • [8] C. Witwer, I. L. Hofacker, and P. F. Stadler.
    Prediction of consensus RNA secondary structures
    including pseudoknots. To appear in European
    Conference on Computational Biology, 2004.
  • [9] Yongmei Ji, Xing Xu, and G. D. Stormo. A
    graph theoretical approach to predict common RNA
    secondary structure motifs including pseudoknots
    in unaligned sequences. To appear in
    Bioinformatics, 2004.
  • [10] Jianhua Ruan, G. D. Stormo, and Weixiong
    Zhang. An iterated loop matching approach to the
    prediction of RNA secondary structures with
    pseudoknots. In Bioinformatics (1), volume 20,
    pages 58-66, 2004.
  • [11] J. H. Chen, S. Y. Le, and J. V. Maizel.
    Prediction of common secondary structures of
    RNAs: a genetic algorithm approach. In Nuc. Acids
    Res. (4), volume 28, pages 991-999, 2000.
  • [12] V. Bafna, S. Muthukrishnan, and R. Ravi.
    Computing similarity between RNA strings. CPM,
    volume 937, pages 1-16, 1995. Springer-Verlag.
  • [13] K. Zhang. Computing similarity between RNA
    secondary structures. In IEEE International Joint
    Symposia on Intelligence and Systems, pages
    126-132, 1998.

33
References (3)
  • [14] T. Jiang, G. H. Lin, B. Ma, and K. Zhang.
    The longest common subsequence problem for
    arc-annotated sequences. In Proceedings of
    the 11th Annual Symposium on Combinatorial Pattern
    Matching, volume 1848, pages 154-165.
    Springer-Verlag, 2000.