Title: An efficient multiple alignment method for RNA secondary structures including pseudoknots
1An efficient multiple alignment method for RNA
secondary structures including pseudoknots
2nd International Workshop on Natural Computing,
Dec. 10-12, 2007 Noyori Conference Hall, Nagoya
University, Japan
- Shinnosuke Seki (1), Satoshi Kobayashi (2)
- (1) Department of Computer Science, University of
Western Ontario, London, Ontario, Canada, N6A
5B7, sseki_at_csd.uwo.ca
- (2) Department of Computer Science, The University
of Electro-Communications, 1-5-1 Chofugaoka, Chofu,
Tokyo, Japan, 182-8585, satoshi_at_cs.uec.ac.jp
2Problem setting
- INPUT
- RNA secondary structures (2 or more)
- Sequential info.
- Structural info. (which can be obtained from a
database or from a prediction algorithm based on
the sequential info.)
- OUTPUT
- The alignment of the input RNA secondary
structures as a grammatical model
3Secondary structure alignment
- DNA and RNA sequences fold onto themselves to
form 2D (secondary) or 3D (tertiary) structures.
- These higher-dimensional structures play an
important role in determining biological
functions.
- Similar structures may have similar functions.
- Structure alignment aims at finding similarity
between structures as well as between sequences.
4Cloverleaf structure (tRNA)
- Secondary structure
- 1 multiple loop with 3 hairpin loops
- Tertiary structure
- L-shaped 3D-structure
5Pseudoknotted structure (tmRNA)
- E. coli transfer-messenger RNA
- Hairpin loops
- Bulge loops
- Internal loops
- Multiple loops
- Pseudoknots
6NP-hardness of pseudoknotted structure alignment
- Alignment based on the edit distance between
pseudoknotted structures has been proven NP-hard.
- We focus on a subset of pseudoknotted structures
that can be modeled by a grammar class called SLTAGs.
- Most pseudoknots found in nature can be modeled by
SLTAGs.
7Chomsky-Schützenberger hierarchy
- Context-free grammars are strong enough to model
pseudoknot-free secondary structures.
- Modeling pseudoknotted structures requires
stronger grammars like context-sensitive grammars.
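The difference between the two cases comes down to whether base pairs cross: nested pairs can be generated by a context-free grammar, crossing pairs cannot. A minimal Python sketch of that test follows. It assumes a structure is given as a list of (i, j) index pairs with i < j, a representation chosen here for illustration and not taken from the slides; the function name is likewise hypothetical.

```python
from itertools import combinations

def has_crossing_pairs(pairs):
    """Return True if two base pairs (i, j) and (k, l) cross: i < k < j < l."""
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        # After sorting, i <= k. The pairs cross when k lies inside (i, j)
        # while l lies outside it.
        if i < k < j < l:
            return True
    return False

# Properly nested pairs (hairpin-like stem): no pseudoknot.
nested = [(0, 9), (1, 8), (2, 7)]
# The last pair starts inside the first stem and ends outside it.
crossing = [(0, 5), (1, 4), (3, 9)]

print(has_crossing_pairs(nested), has_crossing_pairs(crossing))
```

The check is quadratic in the number of base pairs, which is adequate for illustration.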
8Simple Linear Tree Adjoining Grammars (SLTAGs)
- A mildly context-sensitive grammar (between CF
and CS)
- Growing a tree by replacing the -node with a tree
called the adjoining tree (bolded in left fig.)
- Terminal symbols derived at the same time are
considered to form a base-pair.
- Descriptive power for pseudoknots (left fig.)
[Figure: SLTAG derivation tree with S nodes and terminals A, C, G, U; sequence label 5' A C U G 3']
9Simple Linear TAG (SLTAG)
- SLTAG
- A TAG with the property that any tree derived
from it has exactly one -node.
- Hence, a derivation by SLTAGs can be regarded as
a sequence of symbols for adjoining trees.
- For example: D A1 A2 A3 A1 A4
- Known to be descriptive enough to model most of
the pseudoknots that occur in nature.
10Challenges in modeling by SLTAGs
- Ambiguity
- Under a grammar, there may exist multiple
derivations of the same word.
- When modeling something by a grammar, its
ambiguity must be taken into account!
- How to overcome the ambiguity?
- Alignment of derivations by SLTAGs (Seki and
Kobayashi, 2005)
- Multiple pseudoknots modeling
- SLTAGs can model an RNA secondary structure with
1 pseudoknot, but not one with multiple pseudoknots.
11Abstract RNA Structure (ARNAS) model
- A tree structure that models an RNA secondary
structure by representing the relationships among
its components.
- Vertices of ARNAS models are
- String (single base chain)
- Tandem (also called a stem; a cascade of base-pairs)
- Pseudoknot
12Example 1: ARNAS model for tRNA cloverleaf
[Figure: secondary structure of the tRNA cloverleaf (D-arm, T-arm, A-arm) beside its ARNAS model, a tree with a root, tandem nodes, and SC leaves. SC = single-base chain]
13ARNAS components
- String (can be modeled by regular grammar)
- A single base chain of maximal length
- Sequential information only
- Tandem (can be modeled by context-free grammar)
- A cascade of base-pairs
- Information of sequence, of nested base-pairing,
and of its child components.
- Pseudoknot (requires context-sensitive grammar)
- A pseudoknot in a biological sense
- A pseudoknot structure which can be modeled by
SLTAGs
- Information of sequence, of crossing
base-pairing, and of its child components.
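The three component kinds above suggest a small tree data structure. The following Python sketch is one possible encoding, not the authors' implementation. The class name `ArnasNode`, its fields, and the rule that strings are always leaves are assumptions made for illustration, and the example sequences are made up.

```python
from dataclasses import dataclass, field
from typing import List

# Component kinds from the slide, with the grammar class each one needs.
GRAMMAR_FOR_KIND = {
    "string": "regular",
    "tandem": "context-free",
    "pseudoknot": "SLTAG",
}

@dataclass
class ArnasNode:
    kind: str                      # "string", "tandem", or "pseudoknot"
    sequence: str = ""             # bases belonging to this component
    children: List["ArnasNode"] = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in GRAMMAR_FOR_KIND:
            raise ValueError(f"unknown component kind: {self.kind}")
        if self.kind == "string" and self.children:
            # Assumption: strings carry sequential information only.
            raise ValueError("a string component cannot have children")

    def count(self, kind):
        """Count components of the given kind in this subtree."""
        own = 1 if self.kind == kind else 0
        return own + sum(child.count(kind) for child in self.children)

# A toy hairpin: a stem (tandem) enclosing a loop (string).
hairpin = ArnasNode("tandem", "GGCC", [ArnasNode("string", "AAAA")])
print(hairpin.count("string"), GRAMMAR_FOR_KIND[hairpin.kind])
```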
14Example 2: ARNAS model for tmRNA
[Figure: secondary structure of tmRNA beside its ARNAS model, a tree with a root and SC, tandem, and pseudoknot nodes, each labeled with its base sequence]
15Alignment of ARNAS components
- ARNAS components can be modeled by SLTAGs.
- The SLTAG parser (Uemura et al., 1999) provides
the set of all derivations of each component to
be aligned.
- Using dynamic programming, the alignment
algorithm for SLTAG models (Seki and Kobayashi,
2005) calculates alignments for all combinations
of 2 derivations and finds the optimal alignment
among them.
- The components to be aligned may have sub-ARNAS
models as their children. The alignments of these
sub-ARNAS models have been calculated previously
and are incorporated into the alignment of these
components.
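The "all combinations of 2 derivations" step can be sketched as follows. This is a simplified illustration, not the Seki and Kobayashi algorithm: it treats each derivation as a plain list of adjoining-tree symbols, scores a pair with a standard global sequence alignment, and ignores child components. The unit scores (match 1, mismatch -1, gap -1) and the function names are assumptions.

```python
from itertools import product

def align_score(d1, d2, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two derivations (lists of tree symbols)."""
    prev = [j * gap for j in range(len(d2) + 1)]
    for i in range(1, len(d1) + 1):
        curr = [i * gap]
        for j in range(1, len(d2) + 1):
            sub = match if d1[i - 1] == d2[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align the two symbols
                            prev[j] + gap,       # symbol of d1 against a gap
                            curr[j - 1] + gap))  # symbol of d2 against a gap
        prev = curr
    return prev[-1]

def best_derivation_pair(derivs1, derivs2):
    """Try every combination of one derivation per component; keep the best."""
    return max(product(derivs1, derivs2),
               key=lambda pair: align_score(pair[0], pair[1]))

# Two hypothetical derivations of one component, one of the other.
a = ["D", "A1", "A2"]
b = ["D", "A3", "A2"]
print(best_derivation_pair([a, b], [b]))
```

Because an ambiguous grammar can yield several derivations per component, the outer `max` is what resolves the ambiguity in this sketch: the pair of derivations that aligns best is the one kept.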
16Time-complexity of component alignment algorithm
(Table 1)
- The algorithm can employ context-free or regular
grammars as its base grammar, depending on the
components to be aligned.
- Its time complexity varies as shown in Table 1,
where s1 and s2 are the numbers of bases in the
components to be aligned.
17ARNAS Alignment algorithm
- Based on the tree alignment algorithm (Jiang et
al., 1995), whose time complexity is expressed in
n1 and n2, the numbers of nodes of the trees to be
aligned.
- Scores to edit nodes of ARNAS models are the
alignment scores of the corresponding ARNAS
components.
- Bottom-up approach
- Given two ARNAS models, the algorithm
- calculates alignments between leaf components
(strings),
- calculates alignments between their parent
components based on those alignments,
- repeats this process until it reaches the
alignment of the root components, which is the
alignment between the ARNAS models.
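The bottom-up idea can be illustrated with a deliberately simplified Python sketch. It is not the Jiang et al. algorithm: it always aligns root with root and aligns the two ordered child lists with a sequence-style dynamic program, using recursively computed subtree scores. Trees are assumed to be `(label, children)` tuples, the node score (+1 for equal labels, -1 otherwise) stands in for the component alignment score, and a skipped subtree costs `gap` per node. All of these are assumptions.

```python
from functools import lru_cache

def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(c) for c in tree[1])

@lru_cache(maxsize=None)
def tree_score(t1, t2, gap=-1):
    """Score of aligning t1's root with t2's root and their children in order."""
    (label1, kids1), (label2, kids2) = t1, t2
    # Stand-in for the alignment score of the two components.
    node = 1 if label1 == label2 else -1
    # Sequence DP over the ordered child lists; child scores come from the
    # memoized alignments of the subtrees (the bottom-up part).
    prev = [0]
    for c in kids2:
        prev.append(prev[-1] + gap * size(c))
    for a in kids1:
        curr = [prev[0] + gap * size(a)]
        for j, b in enumerate(kids2, start=1):
            curr.append(max(prev[j - 1] + tree_score(a, b, gap),
                            prev[j] + gap * size(a),
                            curr[j - 1] + gap * size(b)))
        prev = curr
    return node + prev[-1]

leaf = ("string", ())
stem = ("tandem", (leaf, ("tandem", (leaf,))))
print(tree_score(stem, stem), tree_score(stem, leaf))
```

Memoization plays the role of "alignments calculated previously": each pair of subtrees is scored once and reused when its parents are aligned.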
18The time-complexity of ARNAS alignment algorithm
- Given RNA secondary structures of lengths n1 and
n2, the theoretical time complexity is high.
- In practice, it is not so intractable because of
- The scarcity of pseudoknots
- Almost all component alignments avoid the costly
pseudoknot case.
- Short-bp property
- A pseudoknot is much shorter than the secondary
structure itself.
19Multiple alignment algorithm
- Progressive alignment approach
- Given multiple ARNAS models, find the two ARNAS
models with the highest similarity and align them.
- The alignment result is also an ARNAS model, so
we can repeat this process until all the given
ARNAS models are aligned.
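[Figure: progressive alignment tree. ARNAS2 and ARNAS3 are aligned into ARNAS(2, 3); ARNAS1 is then added to give ARNAS(1, (2, 3)); ARNAS4 is added last to give ARNAS((1, (2, 3)), 4)]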
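The progressive scheme can be sketched generically in Python. The `similarity` and `merge` callables are placeholders for the ARNAS alignment score and the aligned ARNAS model; the toy models below (a label plus a number) and their values are invented purely so the example runs.

```python
from itertools import combinations

def progressive_alignment(models, similarity, merge):
    """Repeatedly merge the most similar pair until one model remains."""
    pool = list(models)
    while len(pool) > 1:
        # Pick the pair of indices whose models are most similar.
        i, j = max(combinations(range(len(pool)), 2),
                   key=lambda ij: similarity(pool[ij[0]], pool[ij[1]]))
        merged = merge(pool[i], pool[j])
        # Remove j first (j > i) so that index i stays valid.
        pool.pop(j)
        pool.pop(i)
        pool.append(merged)
    return pool[0]

# Toy stand-ins: a model is (label, value); closer values are more similar.
def toy_similarity(m1, m2):
    return -abs(m1[1] - m2[1])

def toy_merge(m1, m2):
    return ((m1[0], m2[0]), (m1[1] + m2[1]) / 2)

models = [("1", 4.0), ("2", 1.0), ("3", 1.5), ("4", 10.0)]
result = progressive_alignment(models, toy_similarity, toy_merge)
print(result[0])
```

With these toy values the merge order is (2, 3) first, then 1, then 4, the same grouping as in the figure. Recomputing all pairwise similarities in every round is the simplest choice; a real implementation would likely cache them.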
20Experimental results (1)
- How many pseudoknotted secondary structures can
be converted into ARNAS models?
- INPUT: 675 RNA pseudoknotted structures from the
Comparative RNA Web (CRW) Site,
http://www.rna.icmb.utexas.edu.
- 561 of 675 (83.1%) can be converted into ARNAS
models.
- All but one of the RNAs of length up to about
2400 can be converted.
- This suggests that RNA structures rarely contain a
pseudoknot that cannot be modeled by SLTAGs.
21Experimental results (2)
- Short-bp property
- INPUT: The 561 RNA secondary structures whose
pseudoknots can be modeled by SLTAGs.
- Compare the length of each RNA secondary structure
with the length of the longest pseudoknot in it.
- The least-squares method provides the following
theoretical curve, where x is the length of the
secondary structure and P(x) is the length of its
longest pseudoknot.
22Short-bp Property
23Experimental results (3)
- An experimental time complexity
- SETTING
- Intel(R) Xeon processors (2.8 GHz x 2) with 2 GB
memory
- Cf. in this environment, our original algorithm
without the ARNAS modification takes about 600 sec.
to align pseudoknots of length around 80
nucleotides.
- INPUT: 150 of the 561 RNAs with structural info.
- RESULT: A theoretical curve between x (the length
of RNAs) and T(x) (the alignment time in sec.) is
as follows.
- It can align RNAs of 2400 nucleotides in about 15
sec.
24Running Time
25Future work
- Experiments on the accuracy of the ARNAS alignment
algorithm
- Comparison with other algorithms for
pseudoknotted RNA alignment