Title: Whole Genome Alignment
1Whole Genome Alignment
- Arvind Gupta, Ladislav Stacho, Alireza Khodabakhshi
- School of Computing Science, SFU
- Department of Mathematics, SFU
2Overview
- What is sequence alignment ?
- Motivation
- Definition and Examples
- Local vs. Global Alignment
- Applications
- Sequence alignment techniques
- Needleman-Wunsch algorithm (Optimal Alignment)
- Heuristic based techniques (Approximate Alignment)
- Whole Genome Alignment using Anchors
- Method description
- Results
3What is Sequence Alignment ?
- Genetic information changes in time (mutates).
- Three chief ways of mutation
- Insertion and Deletion of amino acids /
nucleotides
- Substitution of one nucleotide by another
- Two nucleotides are said to be homologous if they
are descendants of a common ancestral nucleotide,
through copying and substitution.
- Alignment is the process of pairing up homologous
nucleotides or amino acids.
4Example
- Sequence1 ACGTACGTACGT
- Sequence2 AGTAACGTCCGT
Alignment
ACGT-ACGTACGT
A-GTAACGTCCGT
Nucleotides appearing in the same column are
assumed to be homologous.
5An Informal Definition
- A sequence alignment is a scheme of writing one
sequence on top of another where each residue in one
sequence is paired with either a residue in the
other one or a dash.
- There are many possible alignments between two
sequences, but we are interested in the one that
best reveals the closeness of the two sequences.
ACGT-ACGTACGT
A-GTAACGTCCGT

ACGTACGTACGT
AGTAACGTCCGT
6How to find the alignment
- General idea
- A) Define a score for a particular alignment
of sequences
- B) Find the highest scoring alignment among all
possible alignments
- Typical scoring
- Define a score for each kind of pair (Match,
Mismatch and Indel scores).
- The score of the alignment is the sum of the scores
of its pairs.
7Example
- Match score 1
- Mismatch score -2
- Indel score -1
ACGTACGTACGT
AGTAACGTCCGT
Column scores: 1 -2 -2 -2 1 1 1 1 -2 1 1 1
Total score: 0

ACGT-ACGTACGT
A-GTAACGTCCGT
Column scores: 1 -1 1 1 -1 1 1 1 1 -2 1 1 1
Total score: 6
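The scoring rule in this example can be checked with a few lines of Python. This is a minimal sketch: the function name `alignment_score` and its parameter names are illustrative and not taken from the slides; only the three score values come from the example.

```python
# Scores from the example: match 1, mismatch -2, indel -1.
MATCH, MISMATCH, INDEL = 1, -2, -1

def alignment_score(row1, row2, match=MATCH, mismatch=MISMATCH, indel=INDEL):
    """Sum the column scores of two aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += indel        # a residue paired with a dash
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # two different residues
    return total

print(alignment_score("ACGTACGTACGT", "AGTAACGTCCGT"))    # 0
print(alignment_score("ACGT-ACGTACGT", "A-GTAACGTCCGT"))  # 6
```

The two printed totals agree with the totals of the example: the alignment with two dashes scores higher than the one without.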
8Global vs. Local alignment
- Sometimes we are interested in regions of the
sequences with a high degree of similarity.
- Local alignment is the alignment of two
substrings of the sequences that results in a
high scoring alignment.
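One common way to compute a best local alignment score is the Smith-Waterman dynamic program. The slides do not name a local alignment method, so treat the following as an illustrative sketch under that assumption. The function name `best_local_alignment` is hypothetical, and the default scores are borrowed from the scoring example (match 1, mismatch -2, indel -1).

```python
def best_local_alignment(a, b, match=1, mismatch=-2, indel=-1):
    """Return (best local score, (end_in_a, end_in_b)) using Smith-Waterman.

    H[i][j] is the best score of an alignment of a substring ending at
    a[i-1] with a substring ending at b[j-1]; it is never allowed to
    drop below 0, which lets a local alignment start anywhere.
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_end = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + indel, H[i][j - 1] + indel)
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end

# The shared region "ACG" is found even though the flanks are unrelated.
print(best_local_alignment("TTTACGTTT", "GGGACGGGG"))  # (3, (6, 6))
```

The end positions are 1-based prefix lengths; tracing back from that cell until a 0 is reached would recover the two substrings themselves.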
9Applications
- An alignment is useful to
- Decide whether sequences are related. This can
be used for inferring gene function.
- Estimate evolutionary distance between sequences.
This estimate can be used as a starting point
for inferring e.g. a phylogenetic tree relating
the sequences.
- Find conserved domains and support protein
structure prediction.
10Alignment Techniques
- Needleman-Wunsch algorithm (A DP approach)
- Global optimum built by recursion from optimal
alignments of smaller sequences.
- Involves the construction of a dynamic
programming matrix, M.
- Comprises initialization, recursion and
trace-back phases.
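The three phases can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the helper `pair`, and the default scores (match 1, mismatch -2, indel -1, taken from the scoring example) are assumptions made for the sketch.

```python
def needleman_wunsch(a, b, match=1, mismatch=-2, indel=-1):
    """Return (optimal score, aligned a, aligned b) for a global alignment."""
    def pair(x, y):
        return match if x == y else mismatch

    m, n = len(a), len(b)
    # Initialization: a prefix aligned against nothing costs only indels.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * indel
    for j in range(1, n + 1):
        M[0][j] = j * indel

    # Recursion: each cell extends the best alignment of shorter prefixes.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + pair(a[i - 1], b[j - 1]),
                          M[i - 1][j] + indel,
                          M[i][j - 1] + indel)

    # Trace-back: walk from M[m][n] to M[0][0], rebuilding one optimal alignment.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + pair(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return M[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))

score, top, bottom = needleman_wunsch("ACGTACGTACGT", "AGTAACGTCCGT")
print(score)
print(top)
print(bottom)
```

When several alignments share the optimal score, the trace-back above returns just one of them; which one depends on the order in which the three moves are tested.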
11Needleman-Wunsch algorithm
12Needleman-Wunsch algorithm
Optimal score is stored here
13Complexity of Needleman-Wunsch algorithm
- Requires O(mn) space and time, where m and n are
the lengths of the two sequences.
- There is a variation of this algorithm that needs
O(n) space but still takes O(mn) time.
- It is not efficient for aligning long sequences
or for searching for a sequence in a database of
sequences.
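The space saving is easiest to see for the score alone: each row of the matrix depends only on the previous row, so two rows are enough. The sketch below computes only the optimal global score in linear space; recovering the alignment itself in linear space needs an additional divide-and-conquer step (Hirschberg's algorithm), which is not shown. The function name and default scores are assumptions for illustration.

```python
def global_score_linear_space(a, b, match=1, mismatch=-2, indel=-1):
    """Optimal global alignment score using two rows of the DP matrix."""
    if len(b) > len(a):
        a, b = b, a  # scoring is symmetric, so keep the shorter string along the row
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + indel, curr[j - 1] + indel)
        prev = curr  # the older row is no longer needed
    return prev[-1]

print(global_score_linear_space("ACGT", "ACGT"))  # 4
print(global_score_linear_space("", "ACG"))       # -3
```

The running time is still proportional to the product of the two lengths; only the memory footprint shrinks.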
14Heuristic based methods
- Word or K-tuple methods (FASTA, BLAST)
- Statistical methods (Hidden Markov Model)
- Anchor based methods (MUMmer, AVID, OURS, MGA).
15Anchor based methods
Consists of three main steps
- Finding maximal matches that are longer than a
certain value, say k, in both sequences using an
efficient data structure called a suffix tree.
- Finding the maximum set of consistent matches
(non-crossing matches).
- Filling the gaps between the consistent matches.
16Anchor based methods
17Suffix tree
A suffix tree T for an m-character string S is a
rooted directed tree with exactly m leaves having
the following properties
- Each internal node, other than the root, has at
least two children and each edge is labeled with
a nonempty substring of S.
- No two edges out of a node can have edge labels
beginning with the same character.
- For any leaf i, the concatenation of the edge
labels on the path from the root to leaf i
exactly spells out the suffix of S that starts at
position i.
- The concatenation of the edge labels on the
path from the root to an internal node represents
a repeated substring of S, and the number of
occurrences equals the number of leaves beneath
that node.
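The last property can be illustrated with a much simpler structure than a real suffix tree. The sketch below builds an uncompressed suffix trie (one edge per character) by inserting every suffix, which takes quadratic time and space, so it is only a teaching aid and not the linear-size tree defined above. The names `build_suffix_trie`, `count_occurrences` and the `passing` counter are hypothetical.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a character trie.

    Each node is a dict {"children": {...}, "passing": int}, where
    "passing" counts the suffixes that go through the node. When no
    suffix is a prefix of another (e.g. s ends in a unique character),
    this equals the number of leaves beneath the node.
    """
    root = {"children": {}, "passing": 0}
    for start in range(len(s)):
        node = root
        node["passing"] += 1
        for ch in s[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "passing": 0})
            node["passing"] += 1
    return root

def count_occurrences(root, pattern):
    """Number of times pattern occurs in the indexed string."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return 0
    return node["passing"]

trie = build_suffix_trie("ababc")
print(trie["passing"])                  # 5 suffixes, one per leaf
print(count_occurrences(trie, "ab"))    # 2
print(count_occurrences(trie, "abc"))   # 1
```

A compacted suffix tree merges every chain of single-child nodes into one edge labeled with a substring, which is what brings the size down to linear.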
18Suffix tree of the string ababc
19Constructing the suffix tree
- The suffix tree of a string can be built in O(n) time
and can be stored in O(n) space, where n is the
length of the string.
- The first linear time algorithm was given by
Weiner (1973), and two more efficient algorithms
were given by McCreight and Ukkonen later on.
20Finding the maximal matches using suffix tree.
To find the maximal matches of two strings A and
B
- Construct the suffix tree for the string S =
A#B$, where # and $ stand for two characters that
do not occur anywhere in A or B.
- Examine all the internal nodes to find the
matches; to check the maximal property, refer
to the original strings.
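To make the notion of a maximal match concrete, the sketch below finds the same kind of matches by direct comparison instead of a suffix tree. It is a deliberately naive substitute that takes time proportional to the product of the two lengths, so it only suits short strings. The function name and the `(start_a, start_b, length)` tuple layout are assumptions for illustration.

```python
def maximal_matches(a, b, k):
    """Return all maximal matches of length >= k as (start_a, start_b, length).

    A match is maximal if it cannot be extended to the left or to the
    right without hitting a mismatch or the end of a string.
    """
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            # Skip positions that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= k:
                matches.append((i, j, length))
    return matches

print(maximal_matches("xxACGTyy", "zzACGTww", 3))  # [(2, 2, 4)]
print(maximal_matches("ABAB", "AB", 2))            # [(0, 0, 2), (2, 0, 2)]
```

The second call shows that a maximal match need not be unique: the same substring of one sequence can match two places in the other.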
21Finding the maximal consistent matches
- Construct the adjacency matrix of the bipartite
graph representing the matches.
- Add a directed edge from each 1 in the matrix
to every 1 to the right of and below it; this
results in a DAG.
- Find the longest path in the resulting DAG.
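The longest-path step can be sketched as a quadratic dynamic program over the matches. This is a simplified illustration under stated assumptions: matches are `(start_a, start_b, length)` tuples, one match may follow another only if it starts after the other ends in both sequences (so overlapping matches are never chained), and a path is weighted by the total length of its matches. The function name `best_chain` is hypothetical.

```python
def best_chain(matches):
    """Return (covered length, chain) for a heaviest set of consistent matches.

    matches: list of (start_a, start_b, length) with length >= 1.
    There is an implicit DAG edge from p to q when q lies strictly to
    the right of and below p; the heaviest path is found by DP.
    """
    if not matches:
        return 0, []
    order = sorted(matches)            # predecessors always sort earlier
    best = [m[2] for m in order]       # best[i]: heaviest chain ending at order[i]
    prev = [None] * len(order)
    for i, (ai, bi, li) in enumerate(order):
        for j in range(i):
            aj, bj, lj = order[j]
            if aj + lj <= ai and bj + lj <= bi and best[j] + li > best[i]:
                best[i] = best[j] + li
                prev[i] = j
    # Walk back from the heaviest end point to list the chain in order.
    i = max(range(len(order)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:
        chain.append(order[i])
        i = prev[i]
    return total, chain[::-1]

candidates = [(0, 0, 3), (5, 10, 4), (4, 4, 2), (10, 6, 5)]
print(best_chain(candidates))
# (10, [(0, 0, 3), (4, 4, 2), (10, 6, 5)])
```

In the example, `(5, 10, 4)` is dropped because it crosses `(10, 6, 5)`: it comes earlier in the first sequence but later in the second.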
22Example
23Complexity and Improvements
- The time complexity of this algorithm is O(e^2),
where e is the number of edges in the bipartite
graph, i.e. the number of matches.
- But we can do it in O(e log e) time and O(e) space in
the average case.
- If the matrix isn't sparse we can do it in O(mn),
which is proportional to O(e) in this case.
24Filling the gaps
There are two ways to align the sequences between
the matches.
- Recursively call the whole procedure using a
shorter minimum length for the matches. We
usually do this when the gap is long.
- Use Needleman-Wunsch or an approximate
method (for short gaps).
25Comparing the algorithms
- MUMmer
  - Uses Maximal Unique Matches, so the matches don't
    cover a large fraction of the sequences.
  - Doesn't use an efficient algorithm for finding
    the set of consistent matches.
- AVID
  - Uses Maximal Matches, but it doesn't allow
    overlap in the set of consistent matches.
  - Doesn't use an efficient algorithm for finding
    the set of consistent matches, and it doesn't
    necessarily produce the maximum covering set of
    matches.
- Ours
  - Uses Maximal Matches, and it does allow overlap
    in the set of consistent matches.
  - Produces a set of consistent matches that covers
    the maximum possible fraction of the sequences.
- MGA
  - Uses Maximal Matches, but it doesn't allow
    overlap in the set of consistent matches.
  - Uses an efficient algorithm for finding the set
    of consistent matches, but it doesn't necessarily
    produce the maximum covering set of matches.
  - Uses a more efficient implementation of the suffix
    tree structure.
26Results. MUMmer vs. ours (ALIGNER)
Cat         MUMmer   ALIGNER
Coverage1   56707    58793
Coverage2   56776    58812
Match Num   1996     2115

Chicken     MUMmer   ALIGNER
Coverage1   2793     3412
Coverage2   2796     3410
Match Num   103      132
27Results. MUMmer vs. ours (ALIGNER)
Chimp       MUMmer   ALIGNER
Coverage1   791911   823739
Coverage2   792217   823884
Match Num   7834     8388

Cow         MUMmer   ALIGNER
Coverage1   52503    63538
Coverage2   52487    63545
Match Num   1831     2242
28Results. MUMmer vs. ours (ALIGNER)
Dog         MUMmer   ALIGNER
Coverage1   49427    51014
Coverage2   49420    51089
Match Num   1748     1820

Pig         MUMmer   ALIGNER
Coverage1   101592   120991
Coverage2   101556   121084
Match Num   3564     4293

            MUMmer   ALIGNER
Total time  3'22"    5'41"
29Results. AVID vs. ours (ALIGNER)
Chicken (Len1 1254651, Len2 1034310)
         ALIGNER     AVID
Score    -4755676    -6935493
Time     122         111

Chimp (Len1 1100481, Len2 990574)
         ALIGNER     AVID
Score    4380981     3172357
Time     036         026

Cow (Len1 2573067, Len2 3241677)
         ALIGNER     AVID
Score    -7258532    -8684808
Time     156         414
30Results. AVID vs. ours (ALIGNER)
Dog (Len1 1530045, Len2 2495979)
         ALIGNER     AVID
Score    -2853058    -3359416
Time     657         352

Pig (Len1 3805947, Len2 6120150)
         ALIGNER     AVID
Score    -11711531   -1235622
Time     316         858

Celera - EDGP (Len1 2695268, Len2 2626764)
         ALIGNER           AVID
Score    14546685          14311088
Time     A few minutes!    ??????