Title: Picking Alignments from Steiner Trees
1 Picking Alignments from (Steiner) Trees
Lior Pachter
Fumei Lam
Marina Alexandersson
2 Alignment
ATCG--G
A-CGTCA
biologically meaningful
Steiner Networks
Pair Hidden Markov Models
fast alignments based on HMM structure
3 Some basic definitions
Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G.
Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2. Vertices in the Manhattan network that are not in S are called Steiner points.
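The definition above can be checked mechanically on small examples. The following is a minimal sketch, assuming integer coordinates and assuming that segments meet only at their endpoints; the function names (`l1`, `network_distances`, `is_manhattan_network`) are illustrative and not from the talk. It tests whether a set of axis-parallel segments is a 1-spanner for S under the L1 (Manhattan) distance.

```python
import heapq
from collections import defaultdict


def l1(p, q):
    """Manhattan (L1) distance between two points in the plane."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def network_distances(segments, source):
    """Shortest-path distances from source along the segments (Dijkstra).

    Assumption: segments only touch at their endpoints, so the endpoints
    are the graph vertices and each segment is one weighted edge.
    """
    adj = defaultdict(list)
    for a, b in segments:
        if a[0] != b[0] and a[1] != b[1]:
            raise ValueError("segments must be horizontal or vertical")
        adj[a].append((b, l1(a, b)))
        adj[b].append((a, l1(a, b)))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def is_manhattan_network(segments, S):
    """True if every pair in S is joined by a path of length equal to its L1 distance."""
    for p in S:
        dist = network_distances(segments, p)
        if any(dist.get(q) != l1(p, q) for q in S):
            return False
    return True
```

For instance, the two segments (0,0)-(1,0) and (1,0)-(1,1) form a Manhattan network for S = {(0,0), (1,1)}, with (1,0) as a Steiner point. Adding the point (0,1) to S, connected only through (1,1), breaks the property because the network path from (0,0) to (0,1) is longer than their L1 distance.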
4 Example
S = red points
5 Gudmundsson-Levcopoulos-Narasimhan 2001: Find the shortest Manhattan network connecting the points
4-approximation in O(n^3) and 8-approximation in O(n log n)
6 Gudmundsson-Levcopoulos-Narasimhan 2001 proof outline
1. It suffices to work on the Hanan grid
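For reference, the Hanan grid of a point set is obtained by drawing a horizontal and a vertical line through every point and keeping the part inside the bounding box. A minimal sketch of that construction follows; the function name `hanan_grid` and the edge representation are assumptions for illustration, not taken from the talk.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid of the given points.

    Vertices are all intersections of the horizontal and vertical lines
    through the input points; edges join grid-adjacent vertices.
    """
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    # Horizontal edges between consecutive x-coordinates on each row.
    for y in ys:
        for x1, x2 in zip(xs, xs[1:]):
            edges.append(((x1, y), (x2, y)))
    # Vertical edges between consecutive y-coordinates on each column.
    for x in xs:
        for y1, y2 in zip(ys, ys[1:]):
            edges.append(((x, y1), (x, y2)))
    return vertices, edges
```

With n input points the grid has at most n^2 vertices, so restricting the search to it turns the continuous problem into a finite one.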
7 Gudmundsson-Levcopoulos-Narasimhan 2001 proof outline
2. Construct local slides (for all four orientations)
A(v) = {u : v is the topmost node below and to the left of u}
[Figure: a slide, with node v marked]
8 Gudmundsson-Levcopoulos-Narasimhan 2001 proof outline
3. Solve each slide
The minimum slide arborescence problem
Lingas-Pinter-Rivest-Shamir 1982
O(n^3) optimal solution using dynamic programming
9 Gudmundsson-Levcopoulos-Narasimhan 2001 proof outline
4. Proof of correctness
[Figure: nodes a, b, u, v]
10 What is an alignment?
ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC
11 Pair HMMs
Simple sequence-alignment PHMM
M = (mis)match, X = insert seq1, Y = insert seq2
12 Pair HMMs
transition probabilities
Hidden sequence: M M X Y M Y M
output probabilities
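A hidden state sequence determines an alignment. The sketch below shows the correspondence, assuming that M emits one character from each sequence, X emits a character of seq1 against a gap, and Y emits a character of seq2 against a gap; the function name `render_alignment` is illustrative only.

```python
def render_alignment(states, seq1, seq2):
    """Turn a hidden state path (string over M, X, Y) into two gapped rows."""
    i = j = 0
    row1, row2 = [], []
    for s in states:
        if s == "M":      # (mis)match: consume one character from each sequence
            row1.append(seq1[i])
            row2.append(seq2[j])
            i += 1
            j += 1
        elif s == "X":    # character of seq1 aligned to a gap
            row1.append(seq1[i])
            row2.append("-")
            i += 1
        elif s == "Y":    # character of seq2 aligned to a gap
            row1.append("-")
            row2.append(seq2[j])
            j += 1
        else:
            raise ValueError(f"unknown state {s!r}")
    if i != len(seq1) or j != len(seq2):
        raise ValueError("state path does not consume both sequences")
    return "".join(row1), "".join(row2)
```

Under these assumptions, the path M X M M Y Y M applied to ATCGG and ACGTCA gives the rows ATCG--G and A-CGTCA, which is the alignment shown at the start of the talk.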
13 Using the Pair HMM
In practice, we have observed sequences
ATCGG
ACGTCA
for which we wish to infer the underlying hidden states.
One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).
14 Viterbi in PHMM
Needleman-Wunsch
Match prob: pm
Mismatch prob: pr
Gap prob: pg
Match score: log(pm)
Mismatch score: log(pr)
Gap score: log(pg)
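The correspondence between Viterbi and Needleman-Wunsch can be illustrated with a short dynamic program that maximizes the sum of log scores. This is a simplified sketch, not the talk's implementation: it ignores state transition probabilities, the default values of `pm`, `pr`, and `pg` are hypothetical placeholders rather than a normalized model, and the function name `viterbi_align` is an assumption.

```python
import math


def viterbi_align(seq1, seq2, pm=0.8, pr=0.1, pg=0.1):
    """Global alignment maximizing the sum of log(pm), log(pr), log(pg) scores.

    Returns (best log score, state path over M/X/Y).
    """
    match, mismatch, gap = math.log(pm), math.log(pr), math.log(pg)
    n, m = len(seq1), len(seq2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only gaps in seq2
        score[i][0], back[i][0] = i * gap, "X"
    for j in range(1, m + 1):          # first row: only gaps in seq1
        score[0][j], back[0][j] = j * gap, "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "M"),
                (score[i - 1][j] + gap, "X"),
                (score[i][j - 1] + gap, "Y"),
            ]
            # On ties, max keeps the first candidate, so M is preferred.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the state path.
    states = []
    i, j = n, m
    while i > 0 or j > 0:
        s = back[i][j]
        states.append(s)
        if s == "M":
            i, j = i - 1, j - 1
        elif s == "X":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(states))
```

Calling `viterbi_align("ATCGG", "ACGTCA")` returns a log score and a state path; the path always contains as many M and X states as seq1 has characters, and as many M and Y states as seq2 has characters.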
15 Want to take into account that the sequences are genomic sequences
Example: a pair of syntenic genomic regions
16 PHMM
[Figure: PHMM state diagram; labels X, Y]
17
- A property of single-sequence states is that all paths in the Viterbi graph between two vertices have the same weight
18 Strategy for Alignment
[Figure: alignment grid for the sequences GATG and GATTACATTGATCAGACAGGTGAAGA]
19 The CD4 region
[Figure: human vs. mouse plot of the CD4 region; both axes run from 0 to 50000]
20
5' 3'
Splice site: GGTGAG
Splice site: CAG
Stop codon: TAG/TGA/TAA
Branchpoint: CTGAC
Translation initiation: ATG
21 Suggests a new Steiner problem: Find the shortest 1-spanner connecting reds to blues
22 Generalizes the Manhattan network problem (all points red and blue)
Generalizes the Rectilinear Steiner Arborescence problem
23 History of the Rectilinear Steiner Arborescence Problem
1985, Trubin - polynomial time algorithm
1992, Rao-Sadayappan-Hwang-Shor - error in Trubin
2000, Shi and Su - NP-complete!
24 Results for unlabeled problem
- An O(n^3) 2-approximation algorithm (implemented)
- An O(n log n) 4-approximation algorithm
- Testing on CD4 region in human/mouse
- Implementation (SLIM)
  - http://bio.math.berkeley.edu/slim/
- SLIM for SLAM (in progress)
  - http://bio.math.berkeley.edu/slam/
26 The Viterbi graph for a more complicated alignment PHMM
27 Comparison and Analysis of Performance
- Our method has two main steps (L = length of seqs, n = number of HSPs)
  - Building the network: O(n^3) or O(n log n)
  - Running the Viterbi algorithm for the HMM on the network: O(nL) worst case
- Banding algorithms are O(L^2) worst case for step 2.
- Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences.
- These strategies do not generalize well for more sophisticated HMMs.
28 Summary
Software
SLIM (network build): http://bio.math.berkeley.edu/slim/
SLAM (alignment): http://bio.math.berkeley.edu/slam/
Thanks: Nick Bray and Simon Cawley