Handling Updates of Pairwise Sequence Alignments

1
Handling Updates of Pairwise Sequence Alignments
University of Minnesota
Twin Cities Campus
C. Hong and A. H. Tewfik
Dept. of Electrical and Computer Engineering, University of Minnesota
2
Outline
  • Introduction
  • Reusable Dynamic Programming
  • Toy example
  • From arc to node sensitivity analysis
  • Preliminary results
  • Storage and Computational Resource Minimization
  • Limitations of reusable Dynamic programming
  • Highly conserved segments
  • Constraining the search space
  • Dealing with single or burst type perturbations

3
Repeated Tasks in Bioinformatics
  • Retrieval → Processing → Storing
  • Typical task
  • Gene finding / protein folding structure prediction /
    gene regulatory networks / homology /
    microarray gene expression data
  • Data provenance for managing updates
  • Workflow
  • Input/output/process/parameters/version
  • Intermediate computational output

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
4
Pairwise Sequence Alignment
  • Definition
  • Given two sequences s1 and s2, a sequence
    alignment transforms one sequence into the other
    so as to maximize the total alignment value under
    the given scoring information x
  • Dynamic programming: O(N^2) time and space

Fx
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Alignment algorithm
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
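
The O(N^2) dynamic program mentioned on this slide can be sketched as a plain Needleman-Wunsch global alignment. This is a minimal illustration, not the authors' implementation: the function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions, since the slide does not state the scoring information used.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming: O(N^2) time and space.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(s1), len(s2)
    # F[i][j] = best score for aligning s1[:i] with s2[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # substitution / match
                          F[i - 1][j] + gap,       # gap in s2
                          F[i][j - 1] + gap)       # gap in s1

    # Trace back from the bottom-right cell to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = needleman_wunsch(
    "AGGCTATCACCTGACCTCCAGGCCGATGCCC",
    "TAGCTATCACGACCGCGGTCGATTTGCCCGAC",
)
print(score)
print(top)
print(bottom)
```

With a different scoring scheme the printed alignment may differ from the one shown on the slide. The full matrix F is the intermediate output that a reusable approach would keep between updates.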
5
Daily Updates
  • Newly observed sequence
  • lengthy
  • uncertainty
  • keep updating
  • Erroneous sequences due to
  • Misincorporation of bases in polymerase chain
    reaction (PCR) amplification of templates (10^-3,
    Kwiatowski et al. 91)
  • errors of compression in sequencing ladders
  • misreading autoradiograph
  • mistyping results
  • Mutation of the sequences

6
Problem Definition
  • Q: If there are changes in a sequence, do we need
    to repeat all previous matching procedures?
  • A: Reusable Dynamic Programming

7
Main Idea [hong-tewfik05]
  • Alice and Bob live in Minnesota
  • Trip cost = gas rate x distance
  • Last week, Alice arrived in Boston
  • 203 = min{c(R), c(P), c(B)}
  • = min{93+70+58, 93+25+85, 93+50+70}

[Figure: road network from Minneapolis to Boston via Chicago and Columbus (arc lengths 50 and 43), then via (R)ochester (arcs 70, 58), (P)ittsburgh (arcs 25, 85), or (B)altimore (arcs 50, 70)]
8
Bob's Trip
  • Same increment?
  • No change in his plan
  • Alice's trip total cost: 203 + 3
  • Some gas stations change their rates?
  • 208 = min{93+68+58, 93+32+85, 93+45+70}
  • Bob changes his trip plan (using Alice's trip
    information)

[Figure: the same road network shown twice with updated arc lengths into (R)ochester/(P)ittsburgh/(B)altimore: 73/28/53 (uniform increment of 3) and 68/32/45 (changed gas rates); all other arc lengths unchanged]
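
The arithmetic on the two trip slides can be checked with a few lines of Python. This is only a sketch of the toy example: the names `PREFIX`, `best_route`, `original`, `uniform`, and `changed` are hypothetical, and the shared cost of 93 is taken directly from the min{...} expressions rather than recomputed.

```python
PREFIX = 93  # cost shared by all three routes, as in the slides' arithmetic


def best_route(mid, last):
    """Return (cost, via) for the cheapest route to Boston.

    mid[v]  : arc length into intermediate city v (R, P, or B)
    last[v] : arc length from v to Boston
    """
    costs = {v: PREFIX + mid[v] + last[v] for v in mid}
    via = min(costs, key=costs.get)
    return costs[via], via


last = {"R": 58, "P": 85, "B": 70}
original = {"R": 70, "P": 25, "B": 50}
uniform = {v: c + 3 for v, c in original.items()}  # same increment on every arc
changed = {"R": 68, "P": 32, "B": 45}              # rates change unevenly

print(best_route(original, last))  # (203, 'P')
print(best_route(uniform, last))   # (206, 'P')  same plan, cost 203 + 3
print(best_route(changed, last))   # (208, 'B')  plan changes
```

In both updates the stored prefix cost is reused; only the perturbed arcs are re-evaluated.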
9
Single Arc Sensitivity [SHIER 82]
[Figure: example network starting at node 1, with node sets N(e13) and N-(e13) marked]
  • Single arc tolerance bound
  • Cost?(a) ? T
  • more to pay to reach a destination
  • Q: What are the lower and upper bounds of arc a13?
  • tol. bound(a13) ? T
  • max{ -Δ(a) : a ∈ C-(a13) } ≤ l(a13) ≤ min{ Δ(a) : a ∈ C(a13), a ≠ a13 }
  • Relative tolerance bound [hong-tewfik05]

[Figure, continued: nodes 2 through 12; arc annotations as extracted: ?(a56), -?(a710), d(a13)]
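
To make the idea of a single-arc upper bound concrete, the sketch below computes how much one arc may increase before the current optimal route stops being optimal, using the trip example rather than the 12-node network. It is a brute-force illustration over enumerated routes, not the algorithm of [SHIER 82]; the function name and the arc labels ("prefix", "to_P", and so on) are hypothetical.

```python
INF = float("inf")


def upper_tolerance(route_costs, arcs_on_route, arc):
    """Largest increase of `arc` that keeps the current best route optimal.

    route_costs   : dict route -> total cost
    arcs_on_route : dict route -> set of arc labels used by that route
    """
    best = min(route_costs, key=route_costs.get)
    if arc not in arcs_on_route[best]:
        # Raising an arc the best route does not use cannot hurt it.
        return INF
    # Routes that also use the arc grow by the same amount, so only
    # routes avoiding the arc can overtake the current optimum.
    alternatives = [c for r, c in route_costs.items()
                    if arc not in arcs_on_route[r]]
    if not alternatives:
        return INF
    return min(alternatives) - route_costs[best]


route_costs = {"R": 221, "P": 203, "B": 213}
arcs_on_route = {
    "R": {"prefix", "to_R", "R_to_Boston"},
    "P": {"prefix", "to_P", "P_to_Boston"},
    "B": {"prefix", "to_B", "B_to_Boston"},
}
print(upper_tolerance(route_costs, arcs_on_route, "to_P"))    # 10
print(upper_tolerance(route_costs, arcs_on_route, "prefix"))  # inf
```

The value 10 is the gap between the best route (203, via Pittsburgh) and the cheapest route that avoids the perturbed arc (213, via Baltimore).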
10
From Arc Tolerance To Node Tolerance
  • Simultaneous perturbations
  • Node distance sensitivity analysis
  • d(city) = sum{ length(arc) | arc on the minimum
    spanning tree path root → city }
  • Store each node's distance from the origin on the
    spanning tree (DP matrix)

[Figure: the road network annotated with each node's distance from Minneapolis: Minneapolis 0, intermediate nodes Chicago and Columbus 50 and 93, (P)ittsburgh 118, (B)altimore 143, (R)ochester 163, Boston 203]
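
The stored node distances can be reproduced with a short dynamic program over the road network. This is a sketch under an assumption: the transcript does not preserve which of Chicago and Columbus comes first, so the arc order Minneapolis → Chicago (50) → Columbus (43) used below is a guess that happens to match the distances 50 and 93. The function name `node_distances` is hypothetical.

```python
def node_distances(arcs, source, order):
    """Distance from `source` to every node of a DAG.

    arcs  : dict (tail, head) -> length
    order : nodes listed in topological order
    Returns (dist, pred); pred records the tree arc into each node.
    """
    dist = {v: float("inf") for v in order}
    pred = {}
    dist[source] = 0
    for u in order:
        for (a, b), length in arcs.items():
            if a == u and dist[u] + length < dist[b]:
                dist[b] = dist[u] + length
                pred[b] = u
    return dist, pred


arcs = {
    ("Minneapolis", "Chicago"): 50,   # assumed orientation, see lead-in
    ("Chicago", "Columbus"): 43,
    ("Columbus", "Rochester"): 70,
    ("Columbus", "Pittsburgh"): 25,
    ("Columbus", "Baltimore"): 50,
    ("Rochester", "Boston"): 58,
    ("Pittsburgh", "Boston"): 85,
    ("Baltimore", "Boston"): 70,
}
order = ["Minneapolis", "Chicago", "Columbus",
         "Rochester", "Pittsburgh", "Baltimore", "Boston"]

dist, pred = node_distances(arcs, "Minneapolis", order)
print(dist["Boston"], pred["Boston"])  # 203 Pittsburgh
```

The `dist` table plays the role of the DP matrix: once stored, a perturbation only needs to be compared against these values instead of recomputing every node.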
11
Illustration Off-line Analysis
  • Viterbi(HMM ? Sx), substitution only allowed
  • Non-tree edges e are sorted in ascending order of
    their corresponding cost
  • Node tolerance bound assignment
  • Propagate the tolerance of e back to the source

12
Illustration On-line Analysis
  • Starting from the first perturbation column
  • Evaluate perturbations with a set of tolerance
    bounds
  • Continue normal Viterbi decoding, or
  • Skip all unperturbed segments

13
Experimental Results
14
Limitations
  • Exponential number of paths
  • Large bounds set in each column
  • Perturbation not only substitution
  • Storage and computational issues

15
T1. Highly Conserved Segments
  • More resilient to perturbation
  • Tolerance bound set: KC
  • Perturbation: a random variable X
  • Set of KC increasing P(X1)KC
  • Constrain evaluation points: forgo highly
    conserved segments
  • Fewer tolerance evaluations

16
T2. Static Searching Space
  • Small perturbation: where will the new optimal
    path be?
  • In a small area around the prior optimal path
  • Constrain the search space
  • Bidirectional search
  • Compare Smax = FD + BD vs. (w1 FD + w2 BD)
  • Consider potential nodes only
  • Reduce DP iterations and tolerance evaluation cost
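
One simple way to picture a constrained search space is a band around the prior optimal path. The sketch below is a stand-in technique, not the bidirectional FD/BD criterion named on the slide: it fills only DP cells within `w` columns of the prior path in each row. The function name, the band rule, and the scoring values are assumptions.

```python
NEG = float("-inf")


def banded_score(s1, s2, prior_path, w, match=1, mismatch=-1, gap=-1):
    """Global alignment score restricted to a band around `prior_path`.

    prior_path : list of DP cells (i, j) from (0, 0) to (len(s1), len(s2));
                 it is assumed to visit every row.
    w          : number of extra columns allowed on each side of the path.
    """
    n, m = len(s1), len(s2)
    # Column range visited by the prior path in each row.
    lo = [m] * (n + 1)
    hi = [0] * (n + 1)
    for i, j in prior_path:
        lo[i] = min(lo[i], j)
        hi[i] = max(hi[i], j)

    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # cells outside the band stay -inf
    for i in range(n + 1):
        for j in range(max(0, lo[i] - w), min(m, hi[i] + w) + 1):
            if i == 0 and j == 0:
                F[i][j] = 0
                continue
            best = NEG
            if i > 0 and j > 0:
                sub = match if s1[i - 1] == s2[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + sub)
            if i > 0:
                best = max(best, F[i - 1][j] + gap)
            if j > 0:
                best = max(best, F[i][j - 1] + gap)
            F[i][j] = best
    return F[n][m]


prior = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]  # path for ACGT vs. A-GT
print(banded_score("ACGT", "AGT", prior, w=0))    # 2
print(banded_score("ACGT", "AGT", prior, w=10))   # 2 (band covers the full matrix)
```

If the new optimal path leaves the band, the banded score is only a lower bound on the true optimum, which is why the following slide states an assumption about the search space.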

17
Delta Propagation
  • Assumption: the new optimal path lies within our
    search space
  • The real-time delta propagation procedure works
    correctly even if the non-tree edge with minimum
    cost is outside the search space

18
On The Last Perturbation
  • No more updates in the remaining sequence → use
    Hirschberg's algorithm
  • The same procedure also gives excellent performance
    with single or single-burst perturbations

19
Experiment With 2 Techniques
  • www.ncbi.nlm.nih.gov
  • amino acid sequences
  • definition/organism/locus vs. definition/organism/locus
  • maximum 4 perturbations (substitution/insertion/deletion)
  • length: 120-544
  • streptomyces griseus/serine proteinase/1SGC vs.
    pig/native elastase/3EST
  • winged bean/leghemoglobin/P27199 vs. homo
    sapiens/beta globin/NP_000509
  • actinidia chinensis/actinidin precursor/CAA31435
    vs. homo sapiens/cathepsin B preproprotein/NP_680093
  • bacteria/serine protease/YP_273777 vs.
    bacteria/serine protease/YP_244613
  • gallus gallus/alpha 2 globin/NP_001004376 vs.
    danio rerio/hemoglobin beta embryonic-3/NP_001015058

20
Experiment With 2 Techniques
  • 5% off-line resource
  • Manage all perturbations
  • Higher than 20% identity, where identity = no. of
    matches / alignment length
  • Uses about 4.5-18% of the flops used by the typical
    approach, e.g., the popular Needleman-Wunsch method
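
The identity measure defined on this slide (number of matches divided by alignment length) can be computed directly from two gapped strings. A minimal sketch; the function name `identity` is hypothetical, and gap columns are assumed to count toward the alignment length but never as matches.

```python
def identity(aligned1, aligned2):
    """Identity = number of matching columns / alignment length."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    if not aligned1:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned1, aligned2)
                  if x == y and x != "-")
    return matches / len(aligned1)


print(identity("ACGT", "A-GT"))  # 0.75

# The alignment shown earlier in the deck:
print(identity("-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---",
               "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"))
```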

21
Conclusion
  • Main contributions
  • Made Dynamic Programming (directed acyclic
    graphs) reusable
  • Best Performance when dealing with updates
  • Greatest performance improvement occurs with
    long sequences characterized by high identity
    values and relatively few perturbations
  • Extensions
  • Affine gap model
  • Dealing with different scoring matrices