Title: MAVID: Constrained Ancestral Alignment of Multiple Sequences
1MAVID: Constrained Ancestral Alignment of Multiple Sequences
- Authors: Nicholas Bray and Lior Pachter
2Outline
- AVID
- MAVID
- Progressive alignment
- Constraints
- Tree Building
- Experimental Results
3AVID A Global Alignment Program
- Fast
- Memory efficient
- Practical for sequence alignments of large genomic regions
- Sensitive in finding homologous regions
- Specific: avoids false-positive problems
4Algorithm
- Repeat Masking (Optional)
- Finding Matches Using Suffix Trees
- Anchor Selection
- Recursion
5[Flowchart of the AVID algorithm. Boxes: Repeat masking; Match finding; Anchor selection; Enough anchors?; Split sequences using anchors; Recursion; Base pair alignment]
6Repeat Masking (Optional)
- RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html)
- Repeat matches
- Clean matches
7Finding Matches Using Suffix Trees
8Finding Matches Using Suffix Trees
- Maximal repeated substring (match)
  - Every longer substring that contains it is not repeated in the string
- Maximal matches between two sequences
  - Pairs of matching subsequences whose flanking bases are mismatches
- Transform
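The definition of a maximal match between two sequences can be made concrete with a small sketch. This is not the suffix-tree method named on the slide: it is a naive dynamic-programming scan, quadratic in time, chosen only because it is short. The function name `maximal_matches` and the `min_len` parameter are hypothetical.

```python
def maximal_matches(s, t, min_len=2):
    """Return (i, j, length) for every maximal exact match between s and t.

    A match is maximal when it cannot be extended to the left or right:
    the flanking bases differ or a sequence end is reached.
    Naive O(len(s) * len(t)) scan, NOT a suffix-tree implementation.
    """
    matches = []
    # prev[j] = length of the common run ending at s[i-2], t[j-1]
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extending the run on the diagonal keeps it left-maximal.
                curr[j] = prev[j - 1] + 1
                # Right-maximal if the next pair of bases cannot extend it.
                at_end = i == len(s) or j == len(t)
                if (at_end or s[i] != t[j]) and curr[j] >= min_len:
                    length = curr[j]
                    matches.append((i - length, j - length, length))
        prev = curr
    return matches


if __name__ == "__main__":
    print(maximal_matches("ACGTTGCA", "TTACGTAG", min_len=3))
```

Each reported triple gives the start in the first sequence, the start in the second sequence, and the match length.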
9[Figure illustrating a maximal repeated substring, maximal matches between two sequences, and the transform]
10Anchor Selection
- Eliminate noisy matches (those less than half the length of the longest match)
- The remaining matches are ordered by priority
  - Long clean > short clean > long repeat > short repeat
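The filtering and ordering rule above can be sketched as follows. The `Match` record, the function name, and the `long_cutoff` parameter are hypothetical: the slides do not say where the boundary between "long" and "short" lies, so it is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    start_a: int
    start_b: int
    length: int
    clean: bool  # True if the match lies outside masked repeats


def order_candidate_anchors(matches, long_cutoff):
    """Drop noisy matches, then sort the rest by the slide's priority.

    long_cutoff is an assumed threshold separating long from short matches.
    """
    if not matches:
        return []
    longest = max(m.length for m in matches)
    # Noisy: less than half the length of the longest match.
    kept = [m for m in matches if 2 * m.length >= longest]

    def rank(m):
        is_long = m.length >= long_cutoff
        # Clean before repeat, then long before short, then longest first.
        return (not m.clean, not is_long, -m.length)

    return sorted(kept, key=rank)


if __name__ == "__main__":
    candidates = [
        Match(0, 0, 40, False),
        Match(50, 60, 100, True),
        Match(200, 210, 55, True),
        Match(400, 420, 80, False),
    ]
    for m in order_candidate_anchors(candidates, long_cutoff=70):
        print(m)
```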
11Anchor Selection
- A variant of the Smith-Waterman algorithm (no overlapping)
- Gap score 0
- Mismatch score 8
- Match score
10 bp
12Recursion
13Condition
- There are still significant matches
  - The anchor set is > 50% of the length of the sequence
    - Recursion
  - Otherwise
    - Needleman-Wunsch algorithm
- No significant matches
  - Short sequence (< 4 kb)
    - Needleman-Wunsch algorithm
  - Long sequence
    - Trivial alignment (gap)
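Read as a decision rule, the conditions on this slide might look like the sketch below. The function name, the returned labels, and the idea of passing anchor coverage as a fraction are assumptions for illustration; only the 50% and 4 kb thresholds come from the slide.

```python
def next_step(has_significant_matches, anchor_coverage, seq_len,
              coverage_cutoff=0.5, short_cutoff=4000):
    """Pick the next action for one region of the recursion.

    anchor_coverage: fraction of the sequence length covered by anchors.
    seq_len: region length in base pairs.
    """
    if has_significant_matches:
        if anchor_coverage > coverage_cutoff:
            return "recurse"
        return "needleman-wunsch"
    if seq_len < short_cutoff:
        return "needleman-wunsch"
    # Long region with nothing to anchor on: align it trivially as a gap.
    return "trivial-gap"


if __name__ == "__main__":
    print(next_step(True, 0.6, 100_000))
    print(next_step(False, 0.0, 50_000))
```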
14MAVID
- Rapidly aligning multiple large genomic regions
- Incorporating biologically meaningful heuristics
- Sound alignment strategies
15Method
- Core: progressive ancestral alignment, which incorporates preprocessed constraints
- Terminology
  - Match
    - A similar (not necessarily identical) region between two sequences
  - Constraint
    - A required order of positions in the alignment
16Standard progressive alignment
- Compute the distance matrix by aligning all pairs of sequences
- Build a phylogenetic tree (guide tree) from the distance matrix
  - Cluster
  - Midpoint method
- Progressively align the sequences according to the branching order in the guide tree
  - Aligning two alignments
  - An alignment is viewed as a sequence
17Method
18Key difference
- Instead of aligning alignments, we first infer ancestral sequences of alignments using maximum-likelihood estimation within a probabilistic evolutionary model
- Maximum-likelihood estimation
  - A popular statistical method used to make inferences about parameters of the underlying probability distribution of a given data set
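As a toy illustration of maximum-likelihood ancestral inference, the sketch below picks the most likely ancestral base for a single alignment column with two descendants. It assumes the Jukes-Cantor (JC69) substitution model, which is a swapped-in choice: the slides do not say which evolutionary model MAVID uses, and gaps are not handled here. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc69_prob(parent, child, t):
    """JC69 probability of observing `child` given `parent` after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if parent == child else 0.25 - 0.25 * e


def ml_ancestral_base(left, t_left, right, t_right):
    """Return the ancestral base maximizing the likelihood of both children."""
    def likelihood(anc):
        return jc69_prob(anc, left, t_left) * jc69_prob(anc, right, t_right)

    return max(BASES, key=likelihood)


if __name__ == "__main__":
    # The child on the shorter branch dominates the estimate.
    print(ml_ancestral_base("A", 0.05, "G", 0.5))
```

Applying such a rule column by column to an alignment yields one inferred ancestral sequence per alignment.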
19Key difference
- The ancestral sequences are then aligned with AVID
  - The scores of the Smith-Waterman step are assigned according to the branch lengths of the two alignments
- The alignment of the ancestral sequences is then used to glue the two alignments. Gaps in the ancestral sequences lead to gaps in the multiple alignment
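The gluing step can be sketched under one simplifying assumption: each ancestral sequence has exactly one symbol per column of the alignment it summarizes. Given the pairwise alignment of the two ancestral sequences as two gapped strings, a gap in one ancestor inserts an all-gap column into the corresponding alignment. The function name and the data layout (alignments as lists of equal-length strings) are hypothetical.

```python
def glue_alignments(rows_a, rows_b, anc_a_aligned, anc_b_aligned, gap="-"):
    """Merge two alignments using the pairwise alignment of their ancestors.

    rows_a, rows_b: lists of equal-length strings (one row per sequence).
    anc_a_aligned, anc_b_aligned: gapped ancestral sequences of equal length.
    Assumes one ancestral symbol per column of the original alignment.
    """
    if len(anc_a_aligned) != len(anc_b_aligned):
        raise ValueError("ancestral alignment rows must have equal length")
    out_a = [[] for _ in rows_a]
    out_b = [[] for _ in rows_b]
    col_a = col_b = 0
    for sym_a, sym_b in zip(anc_a_aligned, anc_b_aligned):
        if sym_a == gap:
            # Gap in ancestor A: every row of alignment A gets a gap.
            for row in out_a:
                row.append(gap)
        else:
            for row, src in zip(out_a, rows_a):
                row.append(src[col_a])
            col_a += 1
        if sym_b == gap:
            for row in out_b:
                row.append(gap)
        else:
            for row, src in zip(out_b, rows_b):
                row.append(src[col_b])
            col_b += 1
    return ["".join(row) for row in out_a + out_b]


if __name__ == "__main__":
    merged = glue_alignments(["ACG-T", "ACGAT"], ["AGT", "CGT"],
                             "ACGAT", "A-G-T")
    print("\n".join(merged))
```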
20[Figure: Alignment A and Alignment B are summarized by Ancestral A and Ancestral B, which are aligned with AVID]
21AVID with preprocessed data
- Gene predictions using GENSCAN
- Protein alignments using BLAT
- Finding exon matches without using suffix trees
- In addition, the exon matches can be used to shape the final multiple alignment
22MAVID (Constraints, Tree building, and Experimental results)
23Constraints(1/3)
- Notation: a_i ≤ b_j
- This means that position i in sequence a must appear before position j in sequence b in the multiple sequence alignment.
24Constraints(2/3)
[Figure: sequences a, c, and b, with positions a_i, c_x, c_y, and b_j marked]
If x ≤ y, then a_i ≤ c_x ≤ c_y ≤ b_j, and so a_i ≤ b_j by transitivity.
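The transitivity argument can be written as a small sketch. Here a constraint is read as "must not appear after", and constraints are stored as index pairs; both the representation and the function name are assumptions made for illustration.

```python
def implied_constraints(a_before_c, c_before_b):
    """Derive constraints between a and b through a third sequence c.

    a_before_c: pairs (i, x) meaning position i of a precedes position x of c.
    c_before_b: pairs (y, j) meaning position y of c precedes position j of b.
    Whenever x <= y, position i of a must precede position j of b.
    """
    return {
        (i, j)
        for (i, x) in a_before_c
        for (y, j) in c_before_b
        if x <= y
    }


if __name__ == "__main__":
    derived = implied_constraints([(10, 100), (50, 300)],
                                  [(200, 40), (250, 90)])
    print(sorted(derived))
```

In this example the derived pair (10, 90) already follows from (10, 40), because position 40 of b precedes position 90 of b. Redundancy of this kind is common among derived constraints.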
25Constraints(3/3)
- The above information can be used in the
alignment of the ancestral sequences by requiring
potential anchors between the sequences to
satisfy the constraints.
26Prime Constraints(1/4)
- Consider every triplet of sequences (a, b, c) with a in u, b in v, and c not in x.
- Every triplet can provide potential constraints for the alignment.
- If there are n sequences, there are O(n^3) such triplets.
- Too many constraints!
[Figure: tree showing nodes x, u, and v]
27Prime Constraints(2/4)
- Actually, we don't need to find all possible constraints, many of which will be redundant.
- Instead, we wish to find a set of prime constraints.
  - In this set, no constraint is implied by the others.
- Such a set can be inferred from the homology map.
28Illustration
29Prime Constraints(3/4)
- If there are m sets of orthologous exons, then at node x there can be at most O(m) prime constraints.
- The sets of all prime constraints can be found in O(mk^2), where k is the number of leaves below x.
30Prime Constraints(4/4)
- Matches between the ancestral sequences that are inconsistent with this set of constraints can be filtered out in time O(N log m), where N is the total number of matches.
- For typical values of m and k, the time taken computing and utilizing the constraints is negligible.
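One way to reach an O(N log m) filter is a binary search per match over sorted prime constraints. The sketch below is an illustration under stated assumptions, not the authors' procedure: it handles only constraints of the form "position p of ancestral sequence A must not appear after position q of ancestral sequence B", and it relies on the constraints being prime, so that when sorted by p their q values strictly increase. The function name and data layout are hypothetical.

```python
from bisect import bisect_left


def filter_matches(matches, constraints):
    """Keep matches (s, t) that do not contradict any constraint (p, q).

    matches: pairs (s, t) proposing to align A[s] with B[t].
    constraints: prime pairs (p, q), A[p] must not appear after B[q].
    A match is contradicted when some constraint has p >= s and q <= t,
    unless the match aligns exactly the constrained pair.
    """
    constraints = sorted(constraints)
    ps = [p for p, _ in constraints]
    kept = []
    for s, t in matches:
        # First constraint with p >= s; it has the smallest q among them.
        k = bisect_left(ps, s)
        if k < len(constraints):
            p, q = constraints[k]
            if q <= t and (p, q) != (s, t):
                continue  # inconsistent with the constraint set
        kept.append((s, t))
    return kept


if __name__ == "__main__":
    prime = [(10, 20), (30, 50)]
    candidates = [(5, 25), (12, 40), (10, 20), (15, 60), (40, 45)]
    print(filter_matches(candidates, prime))
```

Sorting costs O(m log m) once, and each of the N matches then needs a single O(log m) lookup. Constraints in the opposite direction (B before A) would need a symmetric second pass.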
31Tree Building(1/3)
- Most multiple alignment programs require pairwise alignments of all the sequences to build an initial guide tree. (Quadratic number of sequence alignments)
- We utilize an iterative method to obtain a guide tree using only a linear number of alignments.
32Tree Building(2/3)
- The initial guide tree is selected randomly from the set of complete binary trees.
- The sequences are aligned using this random tree, and then a phylogenetic tree is inferred from the resulting multiple alignment.
- The above process is iterated until the alignment and tree are satisfactory.
33Tree Building(3/3)
- Instead of computing all pairwise alignments, only O(nk) alignments are necessary to perform n iterations with k sequences.
- We found that for typical alignment problems, only a small number of iterations were necessary.
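The saving can be checked with simple counting. A binary guide tree with k leaves has k - 1 internal nodes, so one progressive pass performs k - 1 alignments, while an all-pairs distance matrix needs k(k - 1)/2. The function names below are hypothetical, and the number of iterations in the example is chosen arbitrarily.

```python
def all_pairs_alignments(k):
    """Pairwise alignments needed to fill a full distance matrix."""
    return k * (k - 1) // 2


def iterative_alignments(k, n_iterations):
    """Alignments needed for n progressive passes over a binary guide tree.

    Each pass performs one alignment per internal node, i.e. k - 1.
    """
    return n_iterations * (k - 1)


if __name__ == "__main__":
    k = 21  # number of sequences
    print(all_pairs_alignments(k), iterative_alignments(k, n_iterations=3))
```

The gap widens as k grows, since the first count is quadratic in k and the second is linear for a fixed number of iterations.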
34Experimental Results 1
- A human, mouse, and rat whole-genome multiple alignment.
- A homology map for the genomes was built by C. Dewey, and was used to generate gene anchors and constraints.
- Human chromosome 20 was chosen because it aligns almost completely with mouse chromosome 2.
35Experimental Results 1 (cont.)
Coverage of human chromosome 20 RefSeq exons by the MAVID alignments. Of a total of 3927 exons, only six were not in the homology map. A total of 53.5% of the exons were covered by precomputed exon anchors in either mouse or rat. The remaining exons are mostly aligned by MAVID, resulting in 93.6% of the exons covered by alignment in either mouse or rat.
36Experimental Results 2
- Alignment of 21 Organisms
- We aligned 1.8 Mb of human sequence together with the homologous regions from 20 other organisms, for a total of 23 Mb of sequence.
- Baboon, cat, chicken, chimp, cow, dog, dunnart, fugu, hedgehog, horse, lemur, macaque, mouse, opossum, pig, platypus, rabbit, rat, tetraodon, and zebrafish.
37Experimental Results 2(cont.)
- The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003).
- MLAGAN is the only other program we know of that is able to align the 21 sequences in a reasonable period of time.
38Experimental Results 2(cont.)
- MAVID and MLAGAN both aligned sequences correctly.
- MAVID took 40 min, while MLAGAN took roughly 6 h.