Title: Handling Updates of Pairwise Sequence Alignments
1Handling Updates of Pairwise Sequence Alignments
University of Minnesota
Twin Cities Campus
C. Hong and A. H. Tewfik
Dept. of Electrical and Computer Engineering, University of Minnesota
2Outline
- Introduction
- Reusable Dynamic Programming
- Toy example
- From arc to node sensitivity analysis
- Preliminary results
- Storage and Computational Resource Minimization
- Limitations of reusable Dynamic programming
- Highly conserved segments
- Constraining the search space
- Dealing with single or burst type perturbations
3Repeated Tasks in Bioinformatics
- Retrieval -> Processing -> Storing
- Typical task
- Gene finding / predicting protein folding structure / gene regulatory network / homology / microarray gene expression data
- Data provenance for managing updates
- Workflow
- Input/output/process/parameters/version
- Intermediate computational output
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
4Pairwise Sequence Alignment
- Definition
- Given two sequences s1 and s2, a sequence alignment transforms one sequence into the other so as to maximize the total alignment value under given scoring information (?x)
- Dynamic programming, O(N^2) time and space
Fx
Input sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Alignment algorithm, output:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
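The O(N^2) dynamic program named on this slide can be sketched as a Needleman-Wunsch style global alignment. This is a minimal illustration, not the authors' implementation: the function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions, since the deck does not give its scoring information.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming (scores are assumed values)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # F[i][j] = best score for aligning s1[:i] with s2[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align two symbols
                          F[i - 1][j] + gap,      # gap in s2
                          F[i][j - 1] + gap)      # gap in s1

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row1, row2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                row1.append(s1[i - 1])
                row2.append(s2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            row1.append(s1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(s2[j - 1])
            j -= 1
    return F[len(s1)][len(s2)], "".join(reversed(row1)), "".join(reversed(row2))


score, row1, row2 = needleman_wunsch("ACGT", "AGT")
print(score)
print(row1)
print(row2)
```

The full matrix F is what makes the method O(N^2) in both time and space; it is also the intermediate output that a reusable scheme would want to keep.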
5Daily Updates
- Newly observed sequence
- lengthy
- uncertainty
- keep updating
- Erroneous sequences due to
- Mis-incorporation of bases in polymerase-chain-reaction amplification of templates (10^-3, Kwiatowski et al. 91)
- errors of compression in sequencing ladders
- misreading autoradiographs
- mistyping results
- Mutation of the sequences
6Problem Definition
- Q: If there are changes in a sequence, do we need to repeat all previous matching procedures?
- A: Reusable Dynamic Programming
7Main Idea (hong-tewfik05)
- Alice and Bob live in Minnesota
- Trip cost = gas rate x distance
- Last week, Alice arrived in Boston
- 203 = min{c(R), c(P), c(B)}
- = min{93+70+58, 93+25+85, 93+50+70}
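[Figure: route graph with arc costs]
minneapolis -> chicago: 50
chicago -> columbus: 43
columbus -> (R)ochester: 70
columbus -> (P)ittsburgh: 25
columbus -> (B)altimore: 50
(R)ochester -> boston: 58
(P)ittsburgh -> boston: 85
(B)altimore -> boston: 70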
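The minimum on this slide can be checked with a few lines of code. The sketch below assumes the arc costs implied by the min expression (93 = 50 + 43 up to Columbus, then one of three last stops before Boston); the names ARCS, LAST_STOPS and trip_costs are illustrative only.

```python
# Arc costs as read from the slide's min expression (assumed layout).
ARCS = {
    ("minneapolis", "chicago"): 50,
    ("chicago", "columbus"): 43,
    ("columbus", "rochester"): 70,
    ("columbus", "pittsburgh"): 25,
    ("columbus", "baltimore"): 50,
    ("rochester", "boston"): 58,
    ("pittsburgh", "boston"): 85,
    ("baltimore", "boston"): 70,
}
LAST_STOPS = ("rochester", "pittsburgh", "baltimore")


def trip_costs(arcs):
    """Total cost of each Minneapolis-to-Boston route, keyed by last stop."""
    to_columbus = arcs[("minneapolis", "chicago")] + arcs[("chicago", "columbus")]
    return {
        stop: to_columbus + arcs[("columbus", stop)] + arcs[(stop, "boston")]
        for stop in LAST_STOPS
    }


costs = trip_costs(ARCS)
best_stop = min(costs, key=costs.get)
print(costs)                        # c(R), c(P), c(B)
print(best_stop, costs[best_stop])  # pittsburgh 203
```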
8Bob's Trip
- Same increment?
- No change in his plan
- Alice's trip total cost = (203) + 3
- Some gas stations change their rates?
- 208 = min{93+68+58, 93+32+85, 93+45+70}
- Bob changes his trip plan (Alice's trip information)
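[Figure: the route graph shown twice, with updated costs on the three arcs out of columbus]
Same increment: columbus -> (R)ochester 70 to 73, columbus -> (P)ittsburgh 25 to 28, columbus -> (B)altimore 50 to 53
Changed rates: columbus -> (R)ochester 70 to 68, columbus -> (P)ittsburgh 25 to 32, columbus -> (B)altimore 50 to 45
All other arc costs are unchanged (50, 43, 58, 85, 70)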
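Both cases on this slide can be replayed numerically. The sketch below is an illustration under the assumption that only the three arcs leaving Columbus change; BASE, best_trip and the variable names are hypothetical.

```python
BASE = {
    ("minneapolis", "chicago"): 50,
    ("chicago", "columbus"): 43,
    ("columbus", "rochester"): 70,
    ("columbus", "pittsburgh"): 25,
    ("columbus", "baltimore"): 50,
    ("rochester", "boston"): 58,
    ("pittsburgh", "boston"): 85,
    ("baltimore", "boston"): 70,
}
STOPS = ("rochester", "pittsburgh", "baltimore")


def best_trip(arcs):
    """Return (last stop, total cost) of the cheapest route to Boston."""
    head = arcs[("minneapolis", "chicago")] + arcs[("chicago", "columbus")]
    totals = {s: head + arcs[("columbus", s)] + arcs[(s, "boston")] for s in STOPS}
    stop = min(totals, key=totals.get)
    return stop, totals[stop]


# Case 1: the same increment (+3) on every arc out of Columbus.
same_increment = dict(BASE)
for s in STOPS:
    same_increment[("columbus", s)] += 3

# Case 2: the rates change unevenly (70 -> 68, 25 -> 32, 50 -> 45).
changed_rates = dict(BASE)
changed_rates.update({
    ("columbus", "rochester"): 68,
    ("columbus", "pittsburgh"): 32,
    ("columbus", "baltimore"): 45,
})

print(best_trip(BASE))            # ('pittsburgh', 203)
print(best_trip(same_increment))  # ('pittsburgh', 206)
print(best_trip(changed_rates))   # ('baltimore', 208)
```

A uniform shift moves every candidate total by the same amount, so the old plan can be reused as is; only an uneven change can move the minimum to a different route.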
9Single Arc Sensitivity (SHIER 82)
- Single arc tolerance bound
- cost(a), a not in T
- more to pay to reach a destination
- Q: What is the lower and upper bound of arc a13?
- tol. bound(a13), a13 in T
- max{-cost(a) : a in C-(a13)} <= l(a13) <= min{cost(a) : a in C+(a13), a != a13}
- Relative tolerance bound (hong-tewfik05)
[Figure: tree on nodes 1 to 12 showing the node sets N+(e13) and N-(e13), with labels cost(a56), -cost(a710) and d(a13)]
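To make the idea of a single-arc tolerance concrete, the sketch below computes it by brute force on the Minneapolis-to-Boston toy graph from the earlier slides. This is not the algorithm of the cited work: it simply compares the best route with every rival route, and all names (ARCS, routes, arc_tolerance) are illustrative assumptions.

```python
import math

ARCS = {
    ("minneapolis", "chicago"): 50,
    ("chicago", "columbus"): 43,
    ("columbus", "rochester"): 70,
    ("columbus", "pittsburgh"): 25,
    ("columbus", "baltimore"): 50,
    ("rochester", "boston"): 58,
    ("pittsburgh", "boston"): 85,
    ("baltimore", "boston"): 70,
}
STOPS = ("rochester", "pittsburgh", "baltimore")


def routes(arcs):
    """Map each last stop to (total cost, list of arcs on that route)."""
    out = {}
    for s in STOPS:
        path = [("minneapolis", "chicago"), ("chicago", "columbus"),
                ("columbus", s), (s, "boston")]
        out[s] = (sum(arcs[a] for a in path), path)
    return out


def arc_tolerance(arcs, arc):
    """Range of change in one arc's cost that keeps the best route optimal."""
    table = routes(arcs)
    best = min(table, key=lambda s: table[s][0])
    best_cost, best_path = table[best]
    lower, upper = -math.inf, math.inf
    for stop, (cost, path) in table.items():
        if stop == best:
            continue
        slack = cost - best_cost  # how much worse the rival route is
        if arc in best_path and arc not in path:
            upper = min(upper, slack)   # raising the arc hurts only the best route
        elif arc not in best_path and arc in path:
            lower = max(lower, -slack)  # lowering the arc helps only the rival
    return lower, upper


print(arc_tolerance(ARCS, ("columbus", "pittsburgh")))  # (-inf, 10)
print(arc_tolerance(ARCS, ("columbus", "baltimore")))   # (-10, inf)
```

An arc on the optimal route gets a finite upper bound, an arc off the route gets a finite lower bound, and an arc shared by every route can change freely without altering the plan.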
10From Arc Tolerance To Node Tolerance
- Simultaneous perturbations
- Node distance sensitivity analysis
- d(city) = sum(length(arc)) over the minimum spanning tree path root -> city
- Store nodes' distance from origin on spanning tree (DP matrix)
[Figure: route graph with each node's distance from minneapolis; arc costs unchanged]
minneapolis: 0
chicago: 50
columbus: 93
(P)ittsburgh: 118
(B)altimore: 143
(R)ochester: 163
boston: 203
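The node distances on this slide are what a dynamic program over the route graph would store. A minimal sketch, assuming the arc costs from the trip example and a topological visiting order (ARCS, ORDER and node_distances are illustrative names):

```python
ARCS = {
    ("minneapolis", "chicago"): 50,
    ("chicago", "columbus"): 43,
    ("columbus", "rochester"): 70,
    ("columbus", "pittsburgh"): 25,
    ("columbus", "baltimore"): 50,
    ("rochester", "boston"): 58,
    ("pittsburgh", "boston"): 85,
    ("baltimore", "boston"): 70,
}
# Any order in which every arc goes from an earlier to a later node works.
ORDER = ["minneapolis", "chicago", "columbus",
         "rochester", "pittsburgh", "baltimore", "boston"]


def node_distances(arcs, order):
    """Distance of each node from order[0], plus its parent on the tree."""
    dist = {order[0]: 0}
    parent = {order[0]: None}
    for node in order[1:]:
        # Pick the best incoming arc; all predecessors are already settled.
        cost, pred = min((dist[u] + w, u)
                         for (u, v), w in arcs.items() if v == node)
        dist[node], parent[node] = cost, pred
    return dist, parent


dist, parent = node_distances(ARCS, ORDER)
for city in ORDER:
    print(city, dist[city], parent[city])
```

The parent pointers define the tree; the arcs that are not a parent link (here rochester -> boston and baltimore -> boston) are the non-tree arcs whose costs bound how much the stored distances can tolerate.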
11Illustration: Off-line Analysis
- Viterbi(HMM ? Sx), substitution only allowed
- Non-tree edges e sorted in ascending order of their corresponding cost
- Node tolerance bound assignment
- propagate tolerance of e back to source
12Illustration: On-line Analysis
- Starting from the first perturbation column
- Evaluate perturbations with a set of tolerance bounds
- Continue normal Viterbi decoding, or
- Skip all unperturbed segments
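The first bullet, starting from the first perturbation column, can be illustrated without any tolerance bounds. The sketch below is a much simpler scheme than the one the slides describe: it only reuses the stored DP columns that precede the first changed position and recomputes everything after it, so it does not skip later unperturbed segments. The scoring values and all function names are assumptions.

```python
def fill_columns(s1, s2, cols=None, start=1, match=1, mismatch=-1, gap=-1):
    """Fill alignment DP columns; column j scores s1 against s2[:j].

    If `cols` is given, columns before `start` are reused unchanged.
    """
    n = len(s1)
    if cols is None:
        cols, start = [[i * gap for i in range(n + 1)]], 1
    cols = cols[:start]  # keep the reusable prefix of columns
    for j in range(start, len(s2) + 1):
        prev = cols[j - 1]
        col = [j * gap]
        for i in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            col.append(max(prev[i - 1] + sub, prev[i] + gap, col[i - 1] + gap))
        cols.append(col)
    return cols


def first_difference(old, new):
    """Index of the first position where the two strings differ."""
    limit = min(len(old), len(new))
    for k in range(limit):
        if old[k] != new[k]:
            return k
    return limit


def update_columns(s1, old_s2, new_s2, old_cols):
    """Recompute only the columns at or after the first perturbation."""
    p = first_difference(old_s2, new_s2)
    # Columns 0..p depend only on s2[:p], which is unchanged.
    return fill_columns(s1, new_s2, old_cols, start=p + 1)


old_cols = fill_columns("ACGT", "AGT")
new_cols = update_columns("ACGT", "AGT", "AGA", old_cols)
print(new_cols[-1][-1])  # alignment score after the update
```

The saving grows with how late in the sequence the first perturbation occurs; the tolerance-bound evaluation on this slide aims to go further by also skipping unperturbed segments after it.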
13Experimental Results
14Limitations
- Exponential number of paths
- Large bounds set in each column
- Perturbations are not only substitutions
- Storage and computational issues
15T1. Highly Conserved Segments
- More resilient to perturbation
- Tolerance bound set KC
- Perturbation: a random variable X
- Set of KC increasing P(X1)KC
- Constrain evaluation points: forgo highly conserved segments
- fewer tol. evaluations
16T2. Static Searching Space
- Small perturbation: where will the new optimal path be?
- In a small area around the prior optimal path
- Constrain the search space
- bidirectional search
- compare Smax = FD + BD vs. (w1 FD + w2 BD)
- consider potential nodes only
- reduce DP iteration and tol. evaluation cost
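One way to read the bidirectional search is that a node's forward distance (FD, from the source) plus its backward distance (BD, to the destination) gives the cost of the best route through that node, so nodes whose FD + BD stays close to the optimum are the potential nodes. The sketch below illustrates this on the trip graph. The `slack` threshold is an assumption standing in for the slide's comparison, and the weights w1 and w2 are not modeled because the deck does not define them.

```python
ARCS = {
    ("minneapolis", "chicago"): 50,
    ("chicago", "columbus"): 43,
    ("columbus", "rochester"): 70,
    ("columbus", "pittsburgh"): 25,
    ("columbus", "baltimore"): 50,
    ("rochester", "boston"): 58,
    ("pittsburgh", "boston"): 85,
    ("baltimore", "boston"): 70,
}
ORDER = ["minneapolis", "chicago", "columbus",
         "rochester", "pittsburgh", "baltimore", "boston"]


def forward_distances(arcs, order):
    """FD: shortest distance from the first node to every node."""
    fd = {order[0]: 0}
    for node in order[1:]:
        fd[node] = min(fd[u] + w for (u, v), w in arcs.items() if v == node)
    return fd


def backward_distances(arcs, order):
    """BD: shortest distance from every node to the last node."""
    bd = {order[-1]: 0}
    for node in reversed(order[:-1]):
        bd[node] = min(w + bd[v] for (u, v), w in arcs.items() if u == node)
    return bd


def potential_nodes(arcs, order, slack):
    """Nodes lying on some route within `slack` of the optimal cost."""
    fd = forward_distances(arcs, order)
    bd = backward_distances(arcs, order)
    best = fd[order[-1]]
    return [n for n in order if fd[n] + bd[n] <= best + slack]


print(potential_nodes(ARCS, ORDER, slack=0))
print(potential_nodes(ARCS, ORDER, slack=10))
```

With slack 0 only the nodes on the optimal route remain; a slack of 10 also admits baltimore, the route that becomes optimal in the changed-rates example.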
17Delta Propagation
- Assumption: the new optimal path is within our search space
- The real-time Delta propagation procedure works correctly even if the non-tree edge with minimum cost is outside the search space
18On The Last Perturbation
- No more updates in the remaining sequence -> use Hirschberg's algorithm
- The same procedure also gives excellent performance with single or single-burst perturbations
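Hirschberg's algorithm avoids storing the full DP matrix by keeping only one row of scores at a time and locating where an optimal alignment crosses the middle of the first sequence. The sketch below shows these two building blocks only, not the full recursive alignment; the scoring values and function names are assumptions.

```python
def last_row_scores(s1, s2, match=1, mismatch=-1, gap=-1):
    """Scores of aligning all of s1 with each prefix of s2, in linear space."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * gap]
        for j in range(1, len(s2) + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur  # only the previous row is ever kept
    return prev


def hirschberg_split(s1, s2):
    """Find where an optimal alignment crosses the middle row of s1."""
    mid = len(s1) // 2
    left = last_row_scores(s1[:mid], s2)
    # Aligning the reversed second half gives scores for suffixes of s2.
    right = last_row_scores(s1[mid:][::-1], s2[::-1])
    n = len(s2)
    totals = [left[j] + right[n - j] for j in range(n + 1)]
    j = max(range(n + 1), key=totals.__getitem__)
    return mid, j, totals[j]


mid, j, score = hirschberg_split("ACGT", "AGT")
print(mid, j, score)
```

The best total over all split points equals the optimal global score, so the full algorithm can recurse on the two halves (s1[:mid], s2[:j]) and (s1[mid:], s2[j:]) while using memory proportional to the sequence length.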
19Experiment With 2 Techniques
- www.ncbi.nlm.nih.gov
- amino acid sequences
- definition/organism/locus vs. definition/organism/locus
- maximum 4 perturbation (substitution/insert/delete)
- length: 120 to 544
- streptomyces griseus/serine proteinase/1SGC vs. pig/native elastase/3EST
- winged bean/leghemoglobin/P27199 vs. homo sapiens/beta globin/NP_000509
- actinidia chinensis/actinidin precursor/CAA31435 vs. homo sapiens/cathepsin B preproprotein/NP_680093
- bacteria/serine protease/YP_273777 vs. bacteria/serine protease/YP_244613
- gallus gallus/alpha 2 globin/NP_001004376 vs. danio rerio/hemoglobin beta embryonic-3/NP_001015058
20Experiment With 2 Techniques
- 5 off-line resource
- Manage all perturbations
- Higher than 20 identity, where identity = no. of matches / alignment length
- Uses about 4.518 of the flops used by a typical approach, e.g., the popular Needleman-Wunsch method
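The identity measure defined above is straightforward to compute from the two rows of an alignment. A minimal sketch, assuming "-" marks a gap and that a column where both rows hold a gap does not count as a match (the function name is illustrative):

```python
def identity(row1, row2):
    """Fraction of alignment columns where both rows hold the same symbol."""
    if len(row1) != len(row2) or not row1:
        raise ValueError("rows must be non-empty and of equal length")
    matches = sum(a == b and a != "-" for a, b in zip(row1, row2))
    return matches / len(row1)


# The example alignment shown earlier in the deck.
row1 = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row2 = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(f"{identity(row1, row2):.2f}")
```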
21Conclusion
- Main contributions
- Made Dynamic Programming (directed acyclic graphs) reusable
- Best performance when dealing with updates
- Greatest performance improvement occurs with long sequences characterized by high identity values and relatively few perturbations
- Extensions
- Affine gap model
- Dealing with different scoring matrices