Whole Genome Alignment - PowerPoint PPT Presentation

Provided by: Alireza7

1
Whole Genome Alignment

Arvind Gupta, Ladislav Stacho, Alireza Khodabakhshi
School of Computing Science, SFU
Department of Mathematics, SFU

2
Overview

What is sequence alignment?
Motivation
Definition and Examples
Local vs. Global Alignment
Applications
Sequence alignment techniques
Needleman-Wunsch algorithm (Optimal Alignment)
Heuristic-based techniques (Approximate Alignment)
Whole Genome Alignment using Anchors
Method description
Results

3
What is Sequence Alignment?

Genetic information changes over time (mutates).
Three chief kinds of mutation
Insertion of amino acids / nucleotides
Deletion of amino acids / nucleotides
Substitution of one nucleotide by another
Two nucleotides are said to be homologous if they
are descendants of a common ancestral nucleotide,
through copying and substitution.
Alignment is the process of pairing up homologous
nucleotides or amino acids.

4
Example

Sequence1 ACGTACGTACGT
Sequence2 AGTAACGTCCGT

Alignment
ACGT-ACGTACGT
A-GTAACGTCCGT
Nucleotides appearing in the same column are
assumed to be homologous.
5
An Informal Definition

A sequence alignment is a scheme of writing one
sequence on top of another so that each residue in
one sequence is paired with either a residue in the
other sequence or a dash (a gap).
There are many possible alignments between two
sequences, but we are interested in the one that
best reveals how close the two sequences are.

ACGT-ACGTACGT
A-GTAACGTCCGT

ACGTACGTACGT
AGTAACGTCCGT
6
How to find the alignment

General idea
A) Define a score for a particular alignment
of the sequences.
B) Find the highest-scoring alignment among all
possible alignments.

Typical scoring
Define a score for each kind of pair (match,
mismatch and indel scores).
The score of an alignment is the sum of the scores
of its pairs.

7
Example

Match score 1
Mismatch score -2
Indel score -1

ACGTACGTACGT
AGTAACGTCCGT
Column scores: 1 -2 -2 -2 1 1 1 1 -2 1 1 1
Total: 0

ACGT-ACGTACGT
A-GTAACGTCCGT
Column scores: 1 -1 1 1 -1 1 1 1 1 -2 1 1 1
Total: 6
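
The two totals in this example can be checked with a short script. This is a minimal sketch, not code from the presentation: the function name `score_alignment` and its parameter names are made up for illustration, and the default scores are the ones given on this slide.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, indel=-1):
    """Sum the column scores of two aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gaps")
        if a == "-" or b == "-":
            total += indel        # residue paired with a dash
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


print(score_alignment("ACGTACGTACGT", "AGTAACGTCCGT"))    # 0
print(score_alignment("ACGT-ACGTACGT", "A-GTAACGTCCGT"))  # 6
```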
8
Global vs. Local alignment

Sometimes we are only interested in regions of the
sequences with a high degree of similarity.
Local alignment is the alignment of two
substrings of the sequences that yields a
high-scoring alignment.

9
Applications

An alignment is useful to
Decide whether sequences are related. This can
be used for inferring gene function.
Estimate evolutionary distance between sequences.
This estimate can be used as a starting point
for inferring e.g. a phylogenetic tree relating
the sequences.
Find conserved domains and support protein
structure prediction.

10
Alignment Techniques

Needleman-Wunsch algorithm (A DP approach)
Global optimum built by recursion from optimal
alignment of smaller sequences.
Involves the construction of a dynamic
programming matrix, M
Comprised of initialization, recursion and
trace-back phases
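
The three phases can be sketched as follows. This is an illustrative implementation under stated assumptions, not the presenters' code: the name `needleman_wunsch` is hypothetical, the scores default to the match/mismatch/indel values from the earlier example, and when several alignments tie for the best score the trace-back simply returns one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-2, indel=-1):
    """Return (optimal score, aligned row for a, aligned row for b)."""
    m, n = len(a), len(b)

    def pair_score(x, y):
        return match if x == y else mismatch

    # Initialization: a prefix aligned against nothing costs only indels.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * indel
    for j in range(1, n + 1):
        M[0][j] = j * indel

    # Recursion: each cell is the best of the three ways to end an alignment.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                M[i - 1][j] + indel,
                M[i][j - 1] + indel,
            )

    # Trace-back: walk from the bottom-right cell to the top-left cell.
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            row1.append(a[i - 1])
            row2.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + indel:
            row1.append(a[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(b[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(row1)), "".join(reversed(row2))


score, top, bottom = needleman_wunsch("ACGTACGTACGT", "AGTAACGTCCGT")
print(score)  # 6
print(top)
print(bottom)
```

The optimal score sits in the bottom-right cell `M[m][n]`.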

11
Needleman-Wunsch algorithm
12
Needleman-Wunsch algorithm
Optimal score is stored here
13
Complexity of Needleman-Wunsch algorithm

Requires O(mn) space and time, where m and n are
the lengths of the two sequences.
There is a variation of this algorithm that needs
only O(n) space but still takes O(mn) time.
It is not efficient for aligning long sequences
or for searching for a sequence in a database of
sequences.
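
The space saving comes from the fact that each row of the matrix depends only on the row above it. The sketch below keeps two rows and computes the optimal score only; it is not the full linear-space variation, which also has to recover the alignment itself. The function name and default scores are assumptions carried over from the earlier example.

```python
def nw_score_linear_space(a, b, match=1, mismatch=-2, indel=-1):
    """Optimal global alignment score using two rows instead of a full matrix."""
    # The score is symmetric in a and b, so keep the shorter one along the row.
    if len(b) > len(a):
        a, b = b, a
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + indel, curr[j - 1] + indel)
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(nw_score_linear_space("ACGTACGTACGT", "AGTAACGTCCGT"))  # 6
```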

14
Heuristic based methods

Word or k-tuple methods (FASTA, BLAST)
Statistical methods (Hidden Markov Models)
Anchor-based methods (MUMmer, AVID, ours, MGA)

15
Anchor based methods
Consist of three main steps

Find the maximal matches longer than a certain
value, say k, that occur in both sequences, using
an efficient data structure called a suffix tree.
Find the maximum set of consistent
(non-crossing) matches.
Fill the gaps between the consistent matches.

16
Anchor based methods
17
Suffix tree
A suffix tree T for an m-character string S is a
rooted directed tree with exactly m leaves that has
the following properties

Each internal node, other than the root, has at
least two children, and each edge is labeled with
a nonempty substring of S.
No two edges out of a node can have edge labels
beginning with the same character.
For any leaf i, the concatenation of the edge
labels on the path from the root to leaf i
spells out exactly the suffix of S that starts at
position i.
The concatenation of the edge labels on the
path from the root to an internal node represents
a repeated substring of S, and the number of
occurrences equals the number of leaves beneath
that node.
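
The last property can be illustrated with a much simpler structure. The sketch below builds an uncompressed suffix trie (one character per edge) rather than a true suffix tree, so it takes quadratic time and space and is only suitable for toy inputs. All names in it are hypothetical. Counting the leaves beneath the node reached by a pattern gives the number of occurrences of that pattern, as in a suffix tree.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie (quadratic size)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # leaf marker: the suffix starting here ends at this node
    return root


def leaves_below(node):
    """Count the leaf markers in the subtree rooted at node."""
    count = 0
    for key, child in node.items():
        if key is None:
            count += 1
        else:
            count += leaves_below(child)
    return count


def count_occurrences(root, pattern):
    """Number of times pattern occurs in the indexed string."""
    node = root
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    return leaves_below(node)


trie = build_suffix_trie("ababc")
print(leaves_below(trie))             # 5 leaves, one per suffix
print(count_occurrences(trie, "ab"))  # 2
print(count_occurrences(trie, "abc")) # 1
```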

18
Suffix tree of the string ababc
19
Constructing the suffix tree

The suffix tree of a string can be built in O(n) time
and stored in O(n) space, where n is the
length of the string.
The first linear-time algorithm was given by
Weiner (1973); two more efficient algorithms
were given later by McCreight and Ukkonen.

20
Finding the maximal matches using a suffix tree
To find the maximal matches of two strings A and
B

Construct the suffix tree for the string S = A#B$,
where # and $ are two characters that do not
occur anywhere in A or B.
Examine all the internal nodes to find the
matches; to check that a match is maximal, refer
back to the original strings.
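
To make the notion of a maximal match concrete, here is a brute-force sketch that does not use a suffix tree at all. It compares the two strings directly, so it runs in roughly quadratic time or worse and only illustrates what the suffix-tree step computes. The function name `maximal_matches` and the tuple layout are assumptions.

```python
def maximal_matches(a, b, k):
    """Return (i, j, length) for each exact match a[i:i+length] == b[j:j+length]
    of length >= k that cannot be extended to the left or to the right."""
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip positions that extend to the left: they belong to a longer match.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters keep agreeing.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= k:
                matches.append((i, j, length))
    return matches


print(maximal_matches("GATTACA", "TTACAG", 3))  # [(2, 0, 5)]
```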

21
Finding the maximal consistent matches

Construct the adjacency matrix of the bipartite
graph representing the matches.
Add a directed edge from each 1 in the matrix
to every 1 to the right of and below it; this
results in a DAG.
Find the longest path in the resulting DAG.
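
The longest-path step can be sketched with a simple dynamic program over the matches. The assumptions here are not from the slides: each match is reduced to a point `(i, j)` with a weight (for example its length), one match may follow another only if it lies strictly to the right of and below it, and overlaps between matches are ignored. The name `best_chain_quadratic` is hypothetical.

```python
def best_chain_quadratic(matches):
    """matches: list of (i, j, weight). Returns (total weight, chain of matches)."""
    if not matches:
        return 0, []
    # Sorting by (i, j) is a topological order: edges only go to larger i.
    order = sorted(range(len(matches)), key=lambda t: matches[t][:2])
    best = {}    # best[t]: weight of the heaviest path ending at match t
    parent = {}  # parent[t]: previous match on that path, or None
    for pos, t in enumerate(order):
        i, j, w = matches[t]
        best[t], parent[t] = w, None
        for u in order[:pos]:
            pi, pj, _ = matches[u]
            if pi < i and pj < j and best[u] + w > best[t]:
                best[t], parent[t] = best[u] + w, u
    # Walk back from the best endpoint to recover the chain.
    t = max(best, key=best.get)
    total = best[t]
    chain = []
    while t is not None:
        chain.append(matches[t])
        t = parent[t]
    return total, chain[::-1]


points = [(0, 0, 3), (2, 5, 4), (4, 2, 2), (6, 6, 5), (7, 3, 1)]
print(best_chain_quadratic(points))
# (12, [(0, 0, 3), (2, 5, 4), (6, 6, 5)])
```

Every pair of matches is compared once, which is where the quadratic cost in the number of matches comes from.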

22
Example
23
Complexity and Improvements

The time complexity of this algorithm is O(e^2),
where e is the number of edges in the bipartite
graph, i.e. the number of matches.
But we can do it in O(e log e) time and O(e) space
in the average case.
If the matrix is not sparse we can do it in O(mn)
time, which is proportional to e in this case.
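
The slides do not say which technique reaches the O(e log e) bound. One common way to get it, shown here purely as an assumption, is to sweep the matches in row order and keep the best chain weight per column in a Fenwick (binary indexed) tree that answers prefix-maximum queries. As in the simple model above, matches are points `(i, j)` with positive weights and one match can follow another only if both coordinates are strictly larger. The sketch returns the best total weight only.

```python
from itertools import groupby


def best_chain_weight(matches):
    """matches: list of (i, j, weight) with positive weights."""
    cols = sorted({j for _, j, _ in matches})
    rank = {j: r + 1 for r, j in enumerate(cols)}  # 1-based tree positions
    tree = [0] * (len(cols) + 1)

    def prefix_max(r):
        """Best chain weight ending in any column of rank 1..r."""
        out = 0
        while r > 0:
            out = max(out, tree[r])
            r -= r & -r
        return out

    def update(r, value):
        while r < len(tree):
            tree[r] = max(tree[r], value)
            r += r & -r

    answer = 0
    for _, group in groupby(sorted(matches), key=lambda m: m[0]):
        # Score the whole row before updating, so matches in the same row
        # are never chained to each other.
        scored = [(rank[j], prefix_max(rank[j] - 1) + w) for _, j, w in group]
        for r, value in scored:
            update(r, value)
            answer = max(answer, value)
    return answer


points = [(0, 0, 3), (2, 5, 4), (4, 2, 2), (6, 6, 5), (7, 3, 1)]
print(best_chain_weight(points))  # 12
```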

24
Filling the gaps
There are two ways to align the sequences between
the matches.

Recursively call the whole procedure using a
shorter minimum length for the matches. We
usually do this when the gap is long.
Use Needleman-Wunsch or an approximate
method (for short gaps).

25
Comparing the algorithms

MUMmer
Uses Maximal Unique Matches, so its matches do not
cover a large fraction of the sequences.
Does not use an efficient algorithm for finding
the set of consistent matches.
AVID
Uses Maximal Matches, but does not allow
overlap in the set of consistent matches.
Does not use an efficient algorithm for finding
the set of consistent matches, and does not
necessarily produce the maximum covering set of
matches.
Ours
Uses Maximal Matches, and allows overlap
in the set of consistent matches.
Produces a set of consistent matches that covers
the maximum possible fraction of the sequences.
MGA
Uses Maximal Matches, but does not allow
overlap in the set of consistent matches.
Uses an efficient algorithm for finding the set
of consistent matches, but does not necessarily
produce the maximum covering set of matches.
Uses a more efficient implementation of the suffix
tree structure.

26
Results. MUMmer vs. ours (ALIGNER)
Cat          MUMmer    ALIGNER
Coverage1    56707     58793
Coverage2    56776     58812
Match Num    1996      2115

Chicken      MUMmer    ALIGNER
Coverage1    2793      3412
Coverage2    2796      3410
Match Num    103       132
27
Results. MUMmer vs. ours (ALIGNER)
Chimp        MUMmer    ALIGNER
Coverage1    791911    823739
Coverage2    792217    823884
Match Num    7834      8388

Cow          MUMmer    ALIGNER
Coverage1    52503     63538
Coverage2    52487     63545
Match Num    1831      2242
28
Results. MUMmer vs. ours (ALIGNER)
Dog          MUMmer    ALIGNER
Coverage1    49427     51014
Coverage2    49420     51089
Match Num    1748      1820

Pig          MUMmer    ALIGNER
Coverage1    101592    120991
Coverage2    101556    121084
Match Num    3564      4293

             MUMmer    ALIGNER
Total time   3'22"     5'41"
29
Results. AVID vs. ours (ALIGNER)
Chicken (Len1 1254651, Len2 1034310)
         ALIGNER     AVID
Score    -4755676    -6935493
Time     122         111

Chimp (Len1 1100481, Len2 990574)
         ALIGNER     AVID
Score    4380981     3172357
Time     036         026

Cow (Len1 2573067, Len2 3241677)
         ALIGNER     AVID
Score    -7258532    -8684808
Time     156         414
30
Results. AVID vs. ours (ALIGNER)
Dog (Len1 1530045, Len2 2495979)
         ALIGNER     AVID
Score    -2853058    -3359416
Time     657         352

Pig (Len1 3805947, Len2 6120150)
         ALIGNER     AVID
Score    -11711531   -1235622
Time     316         858

Celera - EDGP (Len1 2695268, Len2 2626764)
         ALIGNER           AVID
Score    14546685          14311088
Time     A few minutes!    ??????
