MAVID: Constrained Ancestral Alignment of Multiple Sequences


1
MAVID: Constrained Ancestral Alignment of Multiple Sequences

Authors: Nicholas Bray and Lior Pachter

2
Outline

AVID
MAVID
Progressive alignment
Constraints
Tree Building
Experimental Results

3
AVID: A Global Alignment Program

Fast
Memory efficient
Practical for alignments of large genomic regions
Sensitive in finding homologous regions
Specific, avoiding false-positive problems

4
Algorithm

Repeat Masking (Optional)
Finding Matches Using Suffix Trees
Anchor Selection
Recursion

5
[Flowchart of the AVID algorithm. Boxes: repeat masking, match finding, anchor selection, "enough anchors?", split sequences using anchors, recursion, base pair alignment]
6
Repeat Masking (Optional)

RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html)
Repeat matches
Clean matches

7
Finding Matches Using Suffix Trees
8
Finding Matches Using Suffix Trees

Maximal repeated substring (match)
A repeated substring such that every longer substring containing it is not repeated in the string
Maximal matches between two sequences
Pairs of matching subsequences whose flanking bases are mismatches
Transform
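
The definition of a maximal match between two sequences can be illustrated with a brute-force sketch. This is not the suffix-tree method the slides refer to; it is a quadratic-time stand-in that only shows what "flanking bases are mismatches" means. The function name and the `min_len` parameter are assumptions made for illustration.

```python
def maximal_matches(s, t, min_len=2):
    """Return (start_in_s, start_in_t, length) for every maximal exact match.

    A match is maximal when it cannot be extended to the left or to the
    right: the flanking bases differ, or a sequence boundary is reached.
    Brute force, for illustration only (suffix trees do this much faster).
    """
    found = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            if k >= min_len:
                found.append((i, j, k))
    return found


print(maximal_matches("ACGTAC", "TACGT"))
```

For the two short strings above, the sketch reports the match "ACGT" (position 0 in the first string, position 1 in the second) and the match "TAC" (position 3 and position 0).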

9
[Figure: maximal repeated substring; maximal matches between two sequences; transform]
10
Anchor Selection

Eliminate noisy matches (those less than half the length of the longest match)
The remaining matches are ordered by
Long clean > short clean > long repeat > short repeat
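
The filtering and ordering step can be sketched as follows. The slide does not say where "long" ends and "short" begins, so this sketch assumes a simple ordering: all clean matches before all repeat matches, longer before shorter within each group. The `Match` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    start_a: int     # start position in the first sequence
    start_b: int     # start position in the second sequence
    length: int
    is_repeat: bool  # True if the match lies in repeat-masked sequence


def candidate_anchors(matches):
    """Drop noisy matches, then order the rest for anchor selection."""
    if not matches:
        return []
    longest = max(m.length for m in matches)
    # Noisy: less than half the length of the longest match.
    kept = [m for m in matches if 2 * m.length >= longest]
    # Clean before repeat (False sorts before True), longer before shorter.
    return sorted(kept, key=lambda m: (m.is_repeat, -m.length))
```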

11
Anchor Selection

A variant of the Smith-Waterman algorithm (no overlapping matches)
Gap score: 0
Mismatch score: 8
Match score

10 bp
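
To give a feel for selecting a consistent, non-overlapping set of anchors, here is a minimal chaining sketch. It is not the Smith-Waterman variant named on the slide: it is a plain dynamic program that scores each match by its length (an assumption), charges nothing for the gaps between matches, and never pairs mismatching matches. The tuple layout `(start_a, start_b, length)` is also an assumption.

```python
def best_chain(matches):
    """Pick the highest-scoring chain of non-overlapping, collinear matches.

    matches: list of (start_a, start_b, length) tuples.
    A match may follow another only if it starts after the other ends
    in both sequences. The score of a chain is the sum of match lengths.
    """
    if not matches:
        return []
    ms = sorted(matches)                 # predecessors sort first by start_a
    score = [m[2] for m in ms]           # best chain score ending at each match
    prev = [-1] * len(ms)                # back-pointers for reconstruction
    for k, (a, b, length) in enumerate(ms):
        for h in range(k):
            pa, pb, pl = ms[h]
            if pa + pl <= a and pb + pl <= b and score[h] + length > score[k]:
                score[k] = score[h] + length
                prev[k] = h
    k = max(range(len(ms)), key=score.__getitem__)
    chain = []
    while k != -1:
        chain.append(ms[k])
        k = prev[k]
    return chain[::-1]
```

The quadratic inner loop is fine for a sketch; a production tool would use a faster sparse dynamic program.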
12
Recursion
13
Condition

If there are still significant matches
The anchor set is >50% of the length of the sequence: recursion
Otherwise: Needleman-Wunsch algorithm
If there are no significant matches
Short sequence (<4 kb): Needleman-Wunsch algorithm
Long sequence: trivial alignment (gap)
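
One reading of the conditions on this slide can be written as a small decision function. The function name, argument names, and return labels are hypothetical, and the nesting of the conditions is an interpretation of the flattened slide text rather than a confirmed description of AVID.

```python
def next_step(has_significant_matches, anchor_bp, seq_len, short_limit=4000):
    """Decide how to handle a region, following one reading of the slide.

    anchor_bp: total length covered by the selected anchors.
    seq_len: length of the region being aligned.
    short_limit: the 4 kb threshold from the slide, in base pairs.
    """
    if has_significant_matches:
        if anchor_bp > 0.5 * seq_len:
            return "recurse"
        return "needleman-wunsch"
    if seq_len < short_limit:
        return "needleman-wunsch"
    return "trivial-gap"
```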

14
MAVID

Rapidly aligning multiple large genomic regions
Incorporating biologically meaningful heuristics
Sound alignment strategies

15
Method

Core: progressive ancestral alignment, which incorporates preprocessed constraints
Terminology
Match: a similar (not necessarily identical) region between two sequences
Constraint: a required order of positions in the alignment

16
Standard progressive alignment

Compute the distance matrix by aligning all pairs
of sequences
Build a phylogenetic tree (guide tree) from the
distance matrix
Cluster
Midpoint method
Progressively align the sequences according to the branching order in the guide tree
Aligning two alignments
An alignment is viewed as a sequence

17
Method
18
Key difference

Instead of aligning alignments, we first infer ancestral sequences of the alignments using maximum-likelihood estimation within a probabilistic evolutionary model
Maximum-likelihood estimation: a popular statistical method for inferring the parameters of the probability distribution underlying a given data set
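
As a toy illustration of maximum-likelihood ancestral inference, the sketch below picks the most likely ancestral base for a single alignment column with two descendants. It assumes the Jukes-Cantor substitution model, which the slides do not name, and it ignores gaps; the function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(parent, child, t):
    """Jukes-Cantor probability of observing `child` given `parent`
    after a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if parent == child else 0.25 - 0.25 * e


def ml_ancestor(x, y, tx, ty):
    """Most likely ancestral base for descendants x and y, which sit at
    the ends of branches of length tx and ty."""
    return max(BASES, key=lambda a: jc_prob(a, x, tx) * jc_prob(a, y, ty))


print(ml_ancestor("A", "C", 0.1, 0.5))
```

When the two descendants disagree, the base on the shorter branch wins, because less time has passed for it to change.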

19
Key difference

The ancestral sequences are then aligned with AVID
The scores of the Smith-Waterman step are assigned according to the branch lengths of the two alignments
The alignment of the ancestral sequences is then used to glue the two alignments together. Gaps in the ancestral sequences lead to gaps in the multiple alignment
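
The gluing step can be sketched as follows, assuming each ancestral sequence has exactly one character per column of the alignment it summarizes. The function name and argument layout are hypothetical; `anc_a` and `anc_b` are the two rows of the pairwise alignment of the ancestral sequences, with `-` for gaps.

```python
def glue(aln_a, aln_b, anc_a, anc_b):
    """Merge two alignments using the pairwise alignment of their ancestors.

    aln_a, aln_b: lists of equal-length rows (strings).
    anc_a, anc_b: gapped ancestral rows of equal length. Each non-gap
    character stands for the next column of the corresponding alignment.
    """
    if len(anc_a) != len(anc_b):
        raise ValueError("ancestral rows must have the same length")
    out_a = [[] for _ in aln_a]
    out_b = [[] for _ in aln_b]
    ia = ib = 0  # next unused column in aln_a and aln_b
    for ca, cb in zip(anc_a, anc_b):
        if ca == "-":
            # Gap in ancestor A: every row of alignment A gets a gap.
            for row in out_a:
                row.append("-")
        else:
            for row, src in zip(out_a, aln_a):
                row.append(src[ia])
            ia += 1
        if cb == "-":
            for row in out_b:
                row.append("-")
        else:
            for row, src in zip(out_b, aln_b):
                row.append(src[ib])
            ib += 1
    return ["".join(row) for row in out_a + out_b]


print(glue(["AC-T", "ACGT"], ["AGT"], "ACGT", "A-GT"))
```

In the example, the gap in the second ancestral row becomes a gap column in every sequence of the second alignment.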

20
[Figure: Alignment A and Alignment B are summarized by Ancestral A and Ancestral B, which are aligned with AVID]
21
AVID with preprocessed data

Gene predictions using GENSCAN
Protein alignments using BLAT
Finding exon matches without using suffix trees
In addition, the exon matches can be used to shape the final multiple alignment

22
MAVID (Constraints, Tree Building, and Experimental Results)

Speaker ???
2005/12/07

23
Constraints(1/3)

Notation: a_i ≤ b_j
This means that position i in sequence a must appear before position j in sequence b in the multiple sequence alignment.

24
Constraints(2/3)
[Figure: sequences a, c, and b, with positions a_i, c_x, c_y, and b_j marked]
If x ≤ y, then a_i ≤ c_x ≤ c_y ≤ b_j, and so a_i ≤ b_j by transitivity.
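
The transitivity argument can be checked with a few lines of code. The function name and the tuple layout are hypothetical: the first constraint says position i of a comes before position x of c, and the second says position y of c comes before position j of b.

```python
def implied_constraint(a_before_c, c_before_b):
    """Combine two constraints through a third sequence c.

    a_before_c = (i, x): position i of a is before position x of c.
    c_before_b = (y, j): position y of c is before position j of b.
    Returns (i, j) if the pair implies that position i of a is before
    position j of b, otherwise None.
    """
    i, x = a_before_c
    y, j = c_before_b
    # Positions within c are ordered, so x <= y links the two constraints.
    return (i, j) if x <= y else None
```

For example, `implied_constraint((10, 40), (55, 7))` yields `(10, 7)`, while swapping the order of the two positions in c yields no constraint.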
25
Constraints(3/3)

The above information can be used in the
alignment of the ancestral sequences by requiring
potential anchors between the sequences to
satisfy the constraints.

26
Prime Constraints(1/4)

Consider every triplet of sequences (a, b, c) with a in u, b in v, and c not in x.
Every triplet can provide potential constraints for the alignment.
If there are n sequences, there are O(n^3) such triplets.
Too many constraints!

[Figure: tree node x with child subtrees u and v]
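
To see how quickly the number of triplets grows, the sketch below enumerates them for one node. The function name and the use of species names as leaf labels are assumptions for illustration.

```python
from itertools import product


def triplets(u_leaves, v_leaves, all_leaves):
    """All (a, b, c) with a below u, b below v, and c outside the node
    whose children are u and v."""
    inside = set(u_leaves) | set(v_leaves)
    outside = [c for c in all_leaves if c not in inside]
    return list(product(u_leaves, v_leaves, outside))


u = ["human", "chimp"]
v = ["mouse", "rat"]
everything = u + v + ["chicken", "fugu"]
print(len(triplets(u, v, everything)))  # 2 * 2 * 2 triplets
```

The count is the product of the three group sizes, which is where the cubic bound comes from.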
27
Prime Constraints(2/4)

Actually, we don't need to find all possible constraints, many of which will be redundant.
Instead, we wish to find a set of prime constraints.
In this set, no constraint is implied by the others.
Such a set can be inferred from the homology map.

28
Illustration
29
Prime Constraints(3/4)

If there are m sets of orthologous exons, then at node x there can be at most O(m) prime constraints.
The set of all prime constraints can be found in O(mk^2) time, where k is the number of leaves below x.

30
Prime Constraints(4/4)

Matches between the ancestral sequences that are inconsistent with this set of constraints can be filtered out in O(N log m) time, where N is the total number of matches.
For typical values of m and k, the time taken to compute and use the constraints is negligible.
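
One way to reach the stated bound is a sort plus a binary search per match, sketched below. This is an assumed approach, not necessarily the one MAVID uses. It handles constraints in one direction only, reads each constraint pair (i, j) as "position i of the first sequence must fall in a strictly earlier column than position j of the second", and treats a match (p, q) as placing positions p and q in the same column. All names are hypothetical.

```python
from bisect import bisect_left


def filter_matches(matches, constraints):
    """Keep only the matches that are consistent with every constraint.

    matches: list of (p, q) position pairs.
    constraints: list of (i, j) pairs, as described above.
    A match (p, q) is violated by a constraint with i >= p and j <= q.
    """
    cons = sorted(constraints)
    firsts = [i for i, _ in cons]
    # suffix_min[k] is the smallest j among cons[k:].
    suffix_min = [float("inf")] * (len(cons) + 1)
    for k in range(len(cons) - 1, -1, -1):
        suffix_min[k] = min(cons[k][1], suffix_min[k + 1])
    kept = []
    for p, q in matches:
        k = bisect_left(firsts, p)  # first constraint with i >= p
        if suffix_min[k] <= q:      # some such constraint has j <= q
            continue
        kept.append((p, q))
    return kept
```

Sorting the m constraints once and doing one binary search per match gives O((N + m) log m) work in total.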

31
Tree Building(1/3)

Most multiple alignment programs require pairwise alignments of all the sequences to build an initial guide tree (a quadratic number of sequence alignments).
We use an iterative method to obtain a guide tree using only a linear number of alignments.

32
Tree Building(2/3)

The initial guide tree is selected randomly from
the set of complete binary trees.
The sequences are aligned using this random tree,
and then a phylogenetic tree is inferred from the
resulting multiple alignment.
The above process is iterated until the alignment
and tree are satisfactory.
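
The iteration can be sketched as a loop around two caller-supplied steps. `align_with_tree` and `infer_tree` are hypothetical callables standing in for the progressive alignment and the tree inference; trees are nested tuples of sequence names. The stopping rule (tree unchanged, or an iteration cap) is an assumption, since the slide only says "satisfactory".

```python
import random


def random_guide_tree(names, rng=random):
    """Build a random binary tree by repeatedly joining two random nodes."""
    nodes = list(names)
    while len(nodes) > 1:
        i, j = sorted(rng.sample(range(len(nodes)), 2))
        right = nodes.pop(j)  # pop the larger index first
        left = nodes.pop(i)
        nodes.append((left, right))
    return nodes[0]


def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def iterate_tree(names, align_with_tree, infer_tree, max_iter=5, rng=random):
    """Alternate between aligning on the current tree and re-inferring it."""
    tree = random_guide_tree(names, rng)
    alignment = None
    for _ in range(max_iter):
        alignment = align_with_tree(tree)
        new_tree = infer_tree(alignment)
        if new_tree == tree:
            break  # the tree no longer changes
        tree = new_tree
    # If the cap is hit first, `alignment` was built on the previous tree.
    return tree, alignment
```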

33
Tree Building(3/3)

Instead of computing all pairwise alignments,
only O(nk) alignments are necessary to perform n
iterations with k sequences.
We found that for typical alignment problems,
only a small number of iterations were necessary.
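
A quick count shows why the iterative approach is cheaper. The sketch assumes one pairwise alignment per internal node of a binary guide tree, so each pass over k sequences costs k - 1 alignments; the function names are hypothetical.

```python
def all_pairs_alignments(k):
    """Alignments needed to compare every pair of k sequences."""
    return k * (k - 1) // 2


def iterative_alignments(k, n):
    """Alignments needed for n passes over a binary guide tree with
    k leaves (k - 1 internal nodes, one alignment each)."""
    return n * (k - 1)


print(all_pairs_alignments(21), iterative_alignments(21, 3))
```

For 21 sequences, all pairs require 210 alignments, while three iterations require 60.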

34
Experimental Results 1

A human, mouse, and rat whole-genome multiple
alignment.
A homology map for the genomes was built by C.
Dewey, and was used to generate gene anchors and
constraints.
Chromosome 20 was chosen because it aligns almost
completely with mouse chromosome 2.

35
Experimental Results 1 (cont.)
Coverage of human chromosome 20 RefSeq exons by the MAVID alignments. Of a total of 3927 exons, only six were not in the homology map. A total of 53.5% of the exons were covered by precomputed exon anchors in either mouse or rat. The remaining exons are mostly aligned by MAVID, resulting in 93.6% of the exons covered by alignment in either mouse or rat.
36
Experimental Results 2

Alignment of 21 Organisms
We aligned 1.8 Mb of human sequence together with the homologous regions from 20 other organisms, for a total of 23 Mb of sequence.
Baboon, cat, chicken, chimp, cow, dog, dunnart, fugu, hedgehog, horse, lemur, macaque, mouse, opossum, pig, platypus, rabbit, rat, tetraodon, and zebrafish.

37
Experimental Results 2(cont.)

The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003).
MLAGAN is the only other program we know of that
is able to align the 21 sequences in a reasonable
period of time.

38
Experimental Results 2(cont.)