ncRNA detection w multiple alignments - PowerPoint PPT Presentation

Slides: 52
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu
Transcript and Presenter's Notes

Title: ncRNA detection w multiple alignments


1
ncRNA detection w/ multiple alignments
2
Comparative detection of ncRNA
  • Given a pairwise alignment, QRNA decides if it is
    RNA, coding or Other
  • The key to detecting RNA is covarying mutations.
  • Multiple alignment should provide more
    information on covarying mutations.

3
RNAz
  • Computes the probability of ncRNA in a multiple
    alignment.
  • RNAz computes two novel statistics
  • Min. Free Energy of sequences (MFE)
  • Conserved secondary structure (SCI)
  • Train an SVM using the following features
  • MFE
  • SCI
  • Mean pairwise identity
  • Number of sequences in the input
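Two of the listed features can be read straight off the alignment. The following is a minimal sketch, assuming the alignment is a list of equal-length gapped strings. The function names and the treatment of gap columns are illustrative assumptions, not necessarily RNAz's own definitions.

```python
from itertools import combinations

GAPS = "-."

def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical columns, skipping columns that are gaps in both rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    cols = [(x, y) for x, y in zip(a, b) if not (x in GAPS and y in GAPS)]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)

def mean_pairwise_identity(alignment: list[str]) -> float:
    """Average identity over all unordered pairs of rows."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)

def alignment_features(alignment: list[str]) -> dict:
    """The two features that need no folding; MFE z-score and SCI come from a folder."""
    return {
        "mean_pairwise_identity": mean_pairwise_identity(alignment),
        "n_sequences": len(alignment),
    }

features = alignment_features(["ACGU", "ACGU", "ACGA"])
```

The MFE and SCI features would be appended to the same dictionary before it is handed to the classifier.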

4
SCI
  • Apply min. energy folding to a multiple
    alignment.
  • The score of a pair of columns depends on
    base-pairing as well as compensatory mutations.
  • Let $E_A$ denote the consensus fold energy.
  • Let $E$ denote the average MFE of all sequences.
  • $\mathrm{SCI} = E_A / E$
  • Claim: low SCI is bad, high is good.
  • Q: What is the SCI for diverged (random)
    sequences?
  • What is the SCI for identical sequences?
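The ratio itself is simple once the energies are available. This sketch assumes the consensus energy and the per-sequence MFEs (in kcal/mol, negative for stable folds) were already computed by some folding program. The function name and the numbers are illustrative only.

```python
def structure_conservation_index(consensus_energy: float,
                                 single_energies: list[float]) -> float:
    """SCI = E_A / E, with E the mean of the single-sequence MFEs."""
    if not single_energies:
        raise ValueError("need at least one single-sequence MFE")
    mean_mfe = sum(single_energies) / len(single_energies)
    if mean_mfe == 0:
        raise ValueError("mean MFE is zero; SCI is undefined")
    return consensus_energy / mean_mfe

# Illustrative numbers: consensus fold as stable as the average single fold.
sci_conserved = structure_conservation_index(-20.0, [-25.0, -15.0, -20.0])
# Illustrative numbers: consensus fold much weaker than the single folds.
sci_weak = structure_conservation_index(-5.0, [-20.0, -20.0])
```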

5
MFE
  • Compute a z-score for a sequence with MFE $m$:
  • $z = (m - \mu) / \sigma$
  • Instead of computing $\mu, \sigma$ by shuffling and
    recomputing the MFE (slow),
  • use regression to predict $\mu, \sigma$ from sequence
    length and base composition.
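For comparison, the slow shuffling route that the regression replaces can be sketched as follows. `energy_fn` stands for any function returning an MFE for a sequence. `toy_energy` is a hypothetical stand-in invented here so the example runs without a folding library; it is not a real energy model.

```python
import random
import statistics

def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Mononucleotide shuffle: same base composition, random order."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def shuffled_background(seq, energy_fn, n_shuffles=100, seed=0):
    """Energies of shuffled copies; this is the expensive step."""
    rng = random.Random(seed)
    return [energy_fn(shuffle_sequence(seq, rng)) for _ in range(n_shuffles)]

def z_score(m: float, background: list[float]) -> float:
    """z = (m - mu) / sigma, with mu and sigma estimated from the background."""
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        raise ValueError("background has zero variance")
    return (m - mu) / sigma

def toy_energy(seq: str) -> float:
    """Hypothetical stand-in: -1 per complementary pair between mirrored positions."""
    comp = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in comp for i in range(n // 2)))

native = "GGGGAAAACCCC"
background = shuffled_background(native, toy_energy, n_shuffles=200)
z = z_score(toy_energy(native), background)
```

A strongly negative `z` means the native sequence folds more stably than shuffled sequences of the same composition.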

6
Non-linear classification
  • The z-statistic and SCI capture different
    properties.
  • Green is good (native), red is bad (shuffled).
  • Is SCI a good statistic, given different levels
    of sequence identity?

7
Using RNAz to predict ncRNA
  • Applying RNAz to conserved regions results in the
    discovery of 30k putative RNAs.
  • Is this list complete? Is it valid?

8
Structural Alignment
    X07545    ..ACCCGGC.CAUA...GUGGCCG.GGCAA.CAC.CCGG.U.C..UCGUU
    M21086    ..ACCCGGC.CAUA...GCGGCCG.GGCAA.CAC.CCGG.A.C..UCAUG
    X05870    ..ACCCGGC.CACA...GUGAGCG.GGCAA.CAC.CCGG.A.C..UCAUU
    U05019    ..ACCCGGU.CAUA...GUGAGCG.GGUAA.CAC.CCGG.A.C..UCGUU
    M16530    ..ACCCGGC.AAUA...GGCGCCGGUGCUA.CGC.CCGG.U.C..UCUUC
    X01588    ..ACCCGGU.CACA...GUGAGCG.GGCAA.CAC.CCGG.A.C..UCAUU
    AF034619  ...GGCGGC.CACA...GCGGUGG.GGUUGCCUC.CCGU.A.C..CCAUC
    L27170    AGUGGUGGC.CAUA...UCGGCGG.GGUUC.CUCCCCGU.A.C..CCAUC
    X05532    AGGAACGGC.CAUA...CCACGUC.GAUCG.CAC.CACA.U.C..CCGUC
    GC        <<<<<<<<<........<<.<<<<.<...<.<...<<<<.<.<.......

Conserved sequences, and conserved structure are
more apparent in multiple alignments.
9
RNA multiple alignments
  • Detection of RNA depends upon reliable prediction
    of covarying mutations, as well as regions of
    conserved sequence
  • Precomputing multiple alignments based on
    sequence considerations is probably not
    sufficient (should be tested).
  • How can structural alignments be computed?

10
Computing Structural Alignments
[Figure: aligned stem sequences (G U G G C C G, G C G G C C G, G U G A G C G, ...) with numbered columns and a position-specific probability, e.g. Pr(G1) = 0.8.]
  • Analogy: in sequence alignment, the score for
    aligning a column is position independent.
  • In profiles, or HMMs, position specific scoring
    is used to distinguish conserved positions from
    non-conserved positions
  • Similar ideas can be used for RNA.
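Position-specific scoring starts from per-column frequencies. The sketch below estimates them for the seven-column stem segment found at columns 18 to 24 of the structural alignment shown earlier. Taking those columns as the input, and the optional pseudocount parameter, are assumptions made for illustration.

```python
from collections import Counter

# Columns 18-24 of the nine aligned sequences on the "Structural Alignment" slide.
STEMS = [
    "GUGGCCG", "GCGGCCG", "GUGAGCG", "GUGAGCG", "GGCGCCG",
    "GUGAGCG", "GCGGUGG", "UCGGCGG", "CCACGUC",
]

def column_probabilities(rows, pseudocount=0.0, alphabet="ACGU"):
    """Return one {base: probability} dict per alignment column."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("rows must have equal length")
    total = len(rows) + pseudocount * len(alphabet)
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in rows)
        profile.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return profile

profile = column_probabilities(STEMS)
pr_g1 = profile[0]["G"]  # probability of G in the first stem column
```

With these nine rows, `pr_g1` is 7/9, about 0.78, which is in line with the Pr(G1) value of 0.8 annotated on the slide.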

11
Covariance models: RNA profiles
S W1 a W2 W3 b a W4 b
Terminal symbols correspond to columns
A A A A U
U U U - A
A A A U -
- - - A U
12
Aligning a sequence to a covariance model
  • We align the sequence to each node of the covariance
    model (the model is tree-like, but may be a graph).
  • The alignment score follows the same recurrence
    as in Lecture 7, but with position-specific
    probabilities.
  • Example:
  • $A[W_i,(i,j)] = -\log \Pr[W_i \to s_i\, W_j\, s_j] + A[W_j,(i+1,j-1)]$
  • If we wish to compute the probability that a
    sequence belongs to a family, we compute the
    total likelihood (sum over all probabilities)
  • If we wish to compute the structure of an unknown
    sequence by comparison to a covariance model, we
    compute the max likelihood parse in this graph.
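The pair-emitting case of the recurrence can be illustrated on a single stem. This is a deliberately reduced sketch: the "model" is just a list of pair-emission tables, one per nested pair node, and the interior left over after the last node is not scored. A real covariance model also has single-base, insert, delete and bifurcation states, none of which appear here.

```python
import math

def stem_cost(seq, pair_probs, i=0, j=None, node=0):
    """A[node,(i,j)] = -log Pr[node emits (s_i, s_j)] + A[node+1,(i+1,j-1)]."""
    if j is None:
        j = len(seq) - 1
    if node == len(pair_probs):
        return 0.0          # all pair nodes consumed; remaining interior is unscored
    if i >= j:
        return math.inf     # ran out of sequence before the model ended
    p = pair_probs[node].get((seq[i], seq[j]), 0.0)
    if p == 0.0:
        return math.inf     # this node cannot emit the pair
    return -math.log(p) + stem_cost(seq, pair_probs, i + 1, j - 1, node + 1)

# Hypothetical position-specific pair probabilities for a two-pair stem.
MODEL = [
    {("G", "C"): 0.5, ("A", "U"): 0.5},
    {("G", "C"): 0.25, ("C", "G"): 0.75},
]

cost_good = stem_cost("GCAAAGC", MODEL)   # outer G-C, inner C-G
cost_bad = stem_cost("ACAAAGC", MODEL)    # outer A-C is not emitted by node 0
```

Taking the minimum cost over alternative parses gives the maximum-likelihood parse. Summing probabilities over parses instead gives the total likelihood mentioned above.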

13
Covariance models and ncRNA discovery
  • Given a family of ncRNA sequences, scan a genomic
    sequence with a covariance model and retrieve all
    high scoring sub-sequences.
  • This is the most common method, but it is
    expensive.
  • Assume covariance model has m states, and the
    substring has at most n symbols, and the database
    has L symbols.
  • Alignment cost: $O(n^2 m_1 + n^3 m_2)$
  • Total time: ?

14
Computing covariance models
  • If we are given a CM, a multiple structural
    alignment is easy.
  • In turn, align each sequence to the CM.
  • If we are given a multiple alignment, computing
    the covariance model is easy
  • For simultaneous prediction, a Bayesian iterative
    approach is used
  • Compute a seed alignment
  • Use the alignment to compute a CM
  • Use the CM to compute a new alignment
  • Iterate

15
Open
  • Compute a structural multiple alignment.
  • Existing methods do not work well without good
    seed alignment, and require excessive hand
    curation.
  • Here, we solve a simpler problem:
  • Predict conserved structure in unaligned
    sequences.

16
Motivation for a new approach
  • Base pairs appear in clusters, which we call
    stacks; stacking is energetically favorable.
  • Most of the stability of the RNA secondary
    structure is determined by stacks.

17
Statistics of the stacks in Rfam database
  • Most base-pairs are stacked up

18
Using stacks as anchors for predictions
  • The idea of anchors as constraints has been used
    in multiple genomic sequence alignment.
  • MAVID (Bray and Pachter, 2004)
  • TBA (Blanchette et al., 2004)
  • Several heuristic methods have been developed by
    finding anchored stacks
  • Waterman (1989) used a statistical approach to
    choose conserved stacks within fixed-size
    windows.
  • Ji and Stormo (2004) and Perriquet et al. (2003)
    use primary sequence conservation of the stacks
    and the length of loop regions to reduce the
    searching space.
  • Stack anchors have low sequence similarity.
  • It's hard to find correct anchors.

19
Problem
  • Selecting one stack at a time may produce
    incorrectly matched stacks.

20
A global approach: configuration of stacks
  • RNA secondary structure can be viewed as stacks
    plus unpaired loops. (no individual base-pairs)
  • The energy of the structure is the sum of the
    energies of stacks and loops.
  • Stack configuration
  • Nested stacks
  • Parallel stacks
  • Crossing stacks (pseudo knots)
  • More generalized stacks can include mismatches in
    the stacks.
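The three configurations can be told apart from coordinates alone. In this sketch a stack is a hypothetical tuple `(i, j, length)` holding the base pairs `(i+t, j-t)` for `t = 0 .. length-1`. The tuple layout and the extra "overlapping" label for stacks that share positions are assumptions of the example.

```python
def arms(stack):
    """Return the (start, end) intervals of the 5' arm and the 3' arm."""
    i, j, length = stack
    return (i, i + length - 1), (j - length + 1, j)

def relation(s, t):
    """Classify two stacks as 'parallel', 'nested', 'crossing' or 'overlapping'."""
    if t[0] < s[0]:
        s, t = t, s                       # make s the stack that starts first
    (s_l0, s_l1), (s_r0, s_r1) = arms(s)
    (t_l0, t_l1), (t_r0, t_r1) = arms(t)
    if t_l0 > s_r1:
        return "parallel"                 # t lies entirely after s
    if t_l0 > s_l1 and t_r1 < s_r0:
        return "nested"                   # t lies entirely inside the loop of s
    if s_l1 < t_l0 and t_l1 < s_r0 and s_r1 < t_r0:
        return "crossing"                 # arms interleave: pseudoknot
    return "overlapping"                  # the stacks compete for positions

outer = (0, 20, 4)                        # arms at 0-3 and 17-20
kinds = [relation(outer, other)
         for other in [(5, 15, 3), (25, 40, 4), (8, 30, 4), (2, 10, 3)]]
```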

21
RNA Stack-based Consensus Folding (RNAscf) problem
  • Find conserved stack configurations for a set of
    unaligned RNA sequences.
  • Optimize both stability (free energy) of the
    structure and sequence similarity computed based
    on these common stacks as anchors.

22
RNA stack-based consensus folding for pairwise
sequences
23
Matching stack configurations on two sequences
24
RNA Stack-based Consensus Folding for multiple
sequences
25
Cost function for multiple sequences
26
Compute an optimal stack configuration for two
sequences
  • Dynamic programming algorithm is used to
    align RNA sequences and find an optimal
    configuration at the same time.
  • The algorithm is similar to prior work (Sankoff
    1985, Bafna et al. 1995)
  • Differences:
  • We use stacks as the basic structural elements.
  • Prior work used individual base pairs.
  • The running time is $O(n^4)$ ($n$ is the number
    of stacks).
  • Sankoff's algorithm is $O(m^6)$ ($m$ is the length of
    the sequences).
  • The number of possible stacks (size > 4) is
    much smaller than the length of the sequence.
  • So the stack-based algorithm is much faster.
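The candidate stacks that serve as the basic elements can be enumerated directly from a sequence. This is a minimal sketch, assuming canonical and G-U wobble pairs, no mismatches inside a stack, and a minimum hairpin loop of three bases. The parameter names `min_len` and `min_loop` are hypothetical, and the slides do not specify how RNAscf enumerates its stacks.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def find_stacks(seq, min_len=5, min_loop=3):
    """Return maximal stacks as (i, j, length): pairs (i+t, j-t), t < length."""
    n = len(seq)
    stacks = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip starts that could be extended outward: they are not maximal.
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            # Extend inward while bases pair and at least min_loop bases stay enclosed.
            while ((j - length) - (i + length) > min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                stacks.append((i, j, length))
    return stacks

stacks = find_stacks("GGGGGAAAACCCCC")
```

The dynamic program then works over the returned list, so its $O(n^4)$ cost depends on the number of stacks rather than on the sequence length.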

27
For any pair of stacks, there are three choices
28
The score of matching stacks
[Figure: two matched stacks, labeled PA and PB.]
29
The score of matching hairpin loops
30
The score of matching interior loops or bulges
[Figure labels: stacks PA, PB, PX, PY; loops Loop(PX,PA) and Loop(PY,PA).]
31
The score of matching two multi-loops
[Figure labels: stacks PA, PB, PiA, P1A, PjB, P1B; loops Loop(Pi,PA) and Loop(Pi,PB).]
32
Consensus folding for multiple sequences
  • We use a heuristic method based on the notion
    of star-alignment.
  • Compute an optimal configuration from a random
    seed pair.
  • Align all individual sequences to this
    configuration.
  • Choose the conserved stack configuration in all
    sequences.
  • Allow some stacks to be partially conserved (at
    least appear in a certain fraction of the
    sequences).
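The last step, keeping stacks that are only partially conserved, amounts to a frequency filter. In this sketch each sequence contributes the set of configuration stacks it could be aligned to. The labels `P1`, `P2`, `P3` and the name `min_fraction` are hypothetical.

```python
from collections import Counter

def conserved_stacks(matches, min_fraction):
    """Keep stack labels matched in at least min_fraction of the sequences."""
    n = len(matches)
    if n == 0:
        return set()
    counts = Counter(label for matched in matches for label in set(matched))
    return {label for label, c in counts.items() if c / n >= min_fraction}

# One set per sequence: which stacks of the seed configuration it matched.
matches = [{"P1", "P2"}, {"P1"}, {"P1", "P2", "P3"}, {"P1", "P2"}]
fully = conserved_stacks(matches, 1.0)
mostly = conserved_stacks(matches, 0.75)
```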

33
Compute the stack configuration for multiple
sequences RNAscf(k,h,f)
34
Iterative procedure for RNAscf
  • P = RNAscf(k, h, f).
  • In each sequence, extract the unpaired regions
    according to the loop regions in P.
  • Predict additional putative stacks that do not
    cross P, using smaller k and h.
  • Recompute the alignment with the additional
    putative stacks using RNAscf(k, h, f).

35
Test dataset
  • We choose a set of 12 RNA families from the Rfam
    database.
  • 20 sequences with annotated structures are chosen
    from the families (except for CRE and glms, where
    we choose 10 sequences).
  • There are 953 stacks.
  • We compare RNAscf with 3 other programs that are
    available online for RNA folding
  • RNAfold (energy based minimization) (Hofacker
    2003)
  • COVE (covariance model) (Eddy and Durbin 1994)
  • COVE needs a starting seed alignment, which is
    produced by ClustalW.
  • comRNA (computing anchors in multiple sequences)
    (Ji, Xu and Stormo 2004).
  • Sensitivity: the fraction of true stacks that
    overlap with predicted stacks.
  • Accuracy: the fraction of predicted stacks that
    overlap with true stacks.
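The two measures can be computed as below. The slides do not say what counts as an overlap, so this sketch assumes two stacks overlap when they share at least one base pair. Stacks are hypothetical `(i, j, length)` tuples, and the coordinates are made up.

```python
def base_pairs(stack):
    """Expand (i, j, length) into its set of base pairs (i+t, j-t)."""
    i, j, length = stack
    return {(i + t, j - t) for t in range(length)}

def overlaps(a, b):
    """Assumed overlap criterion: at least one shared base pair."""
    return bool(base_pairs(a) & base_pairs(b))

def sensitivity(true_stacks, predicted):
    """Fraction of true stacks that overlap some predicted stack."""
    if not true_stacks:
        return 0.0
    hit = sum(any(overlaps(t, p) for p in predicted) for t in true_stacks)
    return hit / len(true_stacks)

def accuracy(true_stacks, predicted):
    """Fraction of predicted stacks that overlap some true stack."""
    if not predicted:
        return 0.0
    hit = sum(any(overlaps(p, t) for t in true_stacks) for p in predicted)
    return hit / len(predicted)

true_stacks = [(0, 20, 4), (5, 15, 3)]
predicted = [(1, 19, 3), (30, 50, 4), (40, 60, 3)]
sens = sensitivity(true_stacks, predicted)
acc = accuracy(true_stacks, predicted)
```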

36
Test results
37
Test results
38
Test results
39
Performance improves when the number of sequences
increases
(Using Thiamine riboswitch subfamily (RF00059))
40
RNAscf always finds the right consensus stack
configuration.
(Sam riboswitch (RF00162))
41
Conclusion and future work
  • RNAscf is a valid approach to RNA consensus
    structure prediction.
  • Use stack configuration to represent RNA
    secondary structure.
  • Propose a dynamic programming algorithm to find
    optimal stack configuration for pairwise
    sequences.
  • Use both primary sequence information and energy
    information.
  • Use a star-alignment-like heuristic method to get
    the consensus structure for multiple sequences.

42
Conclusion
  • There is a signal due to covarying mutations
    that is a good predictor of RNA structure.
  • Can RNAscf scores be used as a statistic to
    discover ncRNA in unaligned sequences?
  • How good are sequence based alignments? Do they
    preserve structure?
  • Not for diverged families
  • Possibly for orthologous regions

43
ncRNA discovery for specific families
44
Case study: miRNA
  • dsRNA, and siRNA can be used to silence genes in
    mammalian tissue culture.
  • miRNA is a new member of this class of endogenous
    interfering RNA
  • RNA interference (RNAi) is a powerful new
    technique to study gene function.

45
Case study: miRNA
  • ncRNA, ~22 nt in length
  • Pairs to sites within the 3' UTR, specifying
    translational repression.
  • Similar to siRNA (involved in RNAi)
  • Unlike siRNA, miRNA do not need perfect base
    complementarity
  • No computational techniques to predict miRNA
  • Most predictions based on cloning small RNAs from
    size fractionated samples

46
miRNA (vs. siRNA)
  • Derived from transcripts that form local hairpin
    structures.
  • The sequences of the precursor and the processed
    miRNA are evolutionarily conserved.
  • Usually distinct, and distant, from other genes
  • siRNA (by contrast)
  • Not evolutionarily conserved
  • Correspond to sequences of known or predicted
    mRNAs, transposons, or regions of heterochromatic
    DNA.

47
MiRscan
  • Predicts miRNA
  • Start with evolutionarily conserved regions, e.g.
    between C. elegans and C. briggsae.
  • 36000 hairpins were found (including 50/53 known
    miRNA).
  • 50 known miRNA were used to train and score the
    36000 hairpins

48
Computational identification of miRNA
  • 7 features are scored
  • miRNA base-pairing
  • Base-pairing of the rest of the fold-back
  • Stringent sequence conservation in the 5' end of
    the fold-back
  • Sequence conservation in the 3' end of the fold-back
  • Sequence bias in the first 5 bases of miRNA
  • Tendency to form symmetric internal loops
  • Presence of 2-9 consensus base-pairs between
    miRNA and terminal loop region
  • Red: conserved with C. briggsae
  • Blue: varying residues that maintain their
    predicted paired or unpaired states

49
MiRscan scoring
  • 35 previously unannotated hairpins exceeded the
    median score

50
Molecular identification of miRNA
  • Initial cloning and sequencing identified 300
    clones representing 54 unique miRNA
  • 10 fold scale up of the procedure identified 3423
    clones as miRNA. These contain 77 distinct miRNA
    genes
  • 77 - 54 = 23 novel miRNAs found
  • 20 were scored by MiRscan (yellow). 10 were among
    the top 35

51
MiRscan results
  • 35 Predictions
  • 10 identified with a high throughput screen
    (sequencing of 3423 clones)
  • 6 identified using a PCR assay.
  • 4 identified as false positives: PCR hybridized to
    larger ncRNAs
  • 15 unknown
  • Evolutionary conservation is important for ncRNA
    detection
  • >97% of all miRNAs had significant conservation
    between C. briggsae and C. elegans