Whole Genome Alignment - PowerPoint PPT Presentation

1
Whole Genome Alignment
  • Arvind Gupta, Ladislav Stacho, Alireza
    Khodabakhshi
  • School of Computing Science, SFU
  • Department of Mathematics, SFU

2
Overview
  • What is sequence alignment ?
  • Motivation
  • Definition and Examples
  • Local vs. Global Alignment
  • Applications
  • Sequence alignment techniques
  • Needleman-Wunsch algorithm (Optimal Alignment)
  • Heuristic-based techniques (Approximate
    Alignment)
  • Whole Genome Alignment using Anchors
  • Method description
  • Results

3
What is Sequence Alignment ?
  • Genetic information changes over time (mutates).
  • Three chief kinds of mutation
  • Insertion and deletion of amino acids /
    nucleotides
  • Substitution of one nucleotide by another
  • Two nucleotides are said to be homologous if they
    are descendants of a common ancestral nucleotide,
    through copying and substitution.
  • Alignment is the process of pairing up homologous
    nucleotides or amino acids.

4
Example
  • Sequence1 ACGTACGTACGT
  • Sequence2 AGTAACGTCCGT

Alignment
ACGT-ACGTACGT
A-GTAACGTCCGT
Nucleotides appearing in the same column are
assumed to be homologous.
5
An Informal Definition
  • A sequence alignment is a scheme of writing one
    sequence on top of another where a residue in one
    sequence is either paired with a residue in the
    other one or a dash.
  • There are many possible alignments between two
    sequences, but we are interested in the one that
    best reveals how close the two sequences are.

ACGT-ACGTACGT
A-GTAACGTCCGT

ACGTACGTACGT
AGTAACGTCCGT
6
How to find the alignment
  • General idea
  • A) Define a score for a particular alignment
    of sequences
  • B) Find highest scoring alignment among all
    possible
  • Typical scoring
  • Define a score for each kind of pair (match,
    mismatch and indel scores).
  • The score of the alignment is the sum of the
    scores of its pairs.

7
Example
  • Match score 1
  • Mismatch score -2
  • Indel score -1

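Ungapped alignment (total score 0)
 A  C  G  T  A  C  G  T  A  C  G  T
 A  G  T  A  A  C  G  T  C  C  G  T
 1 -2 -2 -2  1  1  1  1 -2  1  1  1

Gapped alignment (total score 6)
 A  C  G  T  -  A  C  G  T  A  C  G  T
 A  -  G  T  A  A  C  G  T  C  C  G  T
 1 -1  1  1 -1  1  1  1  1 -2  1  1  1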
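
The column-by-column scoring used in this example can be sketched in a few lines of Python. The function name `score_alignment` and its keyword parameters are illustrative choices, not part of the original slides; the default scores are the ones given on this slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, indel=-1):
    """Sum the per-column scores of two equal-length aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # a residue paired with a dash
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # two different residues
    return total


ungapped = score_alignment("ACGTACGTACGT", "AGTAACGTCCGT")
gapped = score_alignment("ACGT-ACGTACGT", "A-GTAACGTCCGT")
print(ungapped, gapped)
```

With the scores above, the ungapped pairing totals 0 and the gapped pairing totals 6, matching the two rows of numbers on the slide.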
8
Global vs. Local alignment
  • A global alignment pairs up the two sequences
    over their entire length.
  • Sometimes we are only interested in regions of the
    sequences with a high degree of similarity.
  • A local alignment is an alignment of two
    substrings of the sequences that yields a
    high-scoring alignment.
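
As a rough illustration of local alignment, the sketch below computes the best local score with the standard Smith-Waterman recurrence. The slides do not name this algorithm, so treat it as a swapped-in textbook technique; the function name and the default scores (borrowed from the earlier scoring example) are assumptions.

```python
def local_alignment_score(a, b, match=1, mismatch=-2, indel=-1):
    """Best score over all pairs of substrings of a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0: a local alignment may start anywhere.
            cur[j] = max(0, prev[j - 1] + pair, prev[j] + indel, cur[j - 1] + indel)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("XXACGTXX", "YYACGTYY"))
```

Here the only similar region is the shared substring ACGT, so the best local score is 4 even though the flanking characters never match.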

9
Applications
  • An alignment is useful to
  • Decide whether sequences are related. This can
    be used for inferring gene function.
  • Estimate evolutionary distance between sequences.
    This estimate can be used as a starting point
    for inferring e.g. a phylogenetic tree relating
    the sequences.
  • Find conserved domains and support protein
    structure prediction.

10
Alignment Techniques
  • Needleman-Wunsch algorithm (a dynamic programming
    approach)
  • The global optimum is built by recursion from
    optimal alignments of shorter sequences.
  • Involves the construction of a dynamic
    programming matrix, M
  • Consists of initialization, recursion and
    trace-back phases

11
Needleman-Wunsch algorithm
12
Needleman-Wunsch algorithm
The optimal score is stored in the bottom-right cell of the matrix M.
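
The three phases (initialization, recursion, trace-back) can be sketched as follows. This is a minimal version under stated assumptions: a linear indel penalty, the scores from the earlier example as defaults, and an arbitrary tie-breaking order in the trace-back. The function name is illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-2, indel=-1):
    """Return (optimal score, aligned a, aligned b) for a global alignment."""
    m, n = len(a), len(b)
    # Initialization: aligning a prefix against nothing costs one indel per residue.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * indel
    for j in range(1, n + 1):
        M[0][j] = j * indel
    # Recursion: each cell extends the best of three smaller alignments.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + pair,
                          M[i - 1][j] + indel,
                          M[i][j - 1] + indel)
    # Trace-back: walk from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return M[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = needleman_wunsch("ACGTACGTACGT", "AGTAACGTCCGT")
print(score)
print(row1)
print(row2)
```

For the two example sequences the optimal score is 6. Several alignments reach that score, so the rows printed depend on the tie-breaking order chosen here.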
13
Complexity of Needleman-Wunsch algorithm
  • Requires O(mn) space and time, where m and n are
    the lengths of the two sequences
  • There is a variation of this algorithm that needs
    only O(n) space but still takes O(mn) time.
  • It is not efficient for aligning long sequences
    or for searching for a sequence in a database of
    sequences
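
To see where the space saving comes from, note that each row of the matrix depends only on the row above it. The sketch below keeps two rows and returns the optimal score only; recovering the alignment itself in linear space needs an additional divide-and-conquer step (Hirschberg's algorithm), which is not shown. The function name and default scores are assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-2, indel=-1):
    """Optimal global alignment score using two rows of the DP matrix."""
    if len(b) > len(a):
        a, b = b, a             # keep the rows as short as possible
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair, prev[j] + indel, cur[j - 1] + indel)
        prev = cur              # the older row is no longer needed
    return prev[len(b)]


print(global_alignment_score("ACGTACGTACGT", "AGTAACGTCCGT"))
```

The running time is still proportional to the product of the two lengths; only the memory footprint shrinks.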

14
Heuristic-based methods
  • Word or k-tuple methods (FASTA, BLAST)
  • Statistical methods (Hidden Markov Models)
  • Anchor-based methods (MUMmer, AVID, ours, MGA).

15
Anchor-based methods
Consist of three main steps
  • Finding maximal matches between the two sequences
    that are longer than a certain value, say k,
    using an efficient data structure called a
    suffix tree.
  • Finding the maximum set of consistent matches
    (non-crossing matches).
  • Filling the gaps between the consistent matches.

16
Anchor based methods
17
Suffix tree
A suffix tree T for an m-character string S is a
rooted directed tree with exactly m leaves, having
the following properties
  • Each internal node, other than the root, has at
    least two children, and each edge is labeled with
    a nonempty substring of S.
  • No two edges out of a node can have edge labels
    beginning with the same character.
  • For any leaf i, the concatenation of the edge
    labels on the path from the root to leaf i
    spells out exactly the suffix of S that starts at
    position i.
  • The concatenation of the edge labels on the
    path from the root to an internal node represents
    a repeated substring of S, and the number of
    occurrences equals the number of leaves beneath
    that node.

18
Suffix tree of the string ababc
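
The leaf-counting property can be checked on this string with a small sketch. Note the substitution: the code builds an uncompressed suffix trie (one character per edge, quadratic size) rather than a true suffix tree, because it is much shorter to write; the leaf counts are the same. All names here are illustrative.

```python
LEAF = None  # marker key: a suffix ends at this node


def build_suffix_trie(s):
    """Naive suffix trie as nested dicts; O(len(s)^2) time and space."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start      # remember where this suffix begins
    return root


def count_leaves(node):
    """Number of suffixes that end at or below this node."""
    total = 1 if LEAF in node else 0
    for key, child in node.items():
        if key is not LEAF:
            total += count_leaves(child)
    return total


def count_occurrences(root, pattern):
    """Occurrences of pattern = leaves beneath the node it leads to."""
    node = root
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)


trie = build_suffix_trie("ababc")
print(count_occurrences(trie, "ab"), count_occurrences(trie, "abc"))
```

In ababc the substring ab is followed once by a and once by c, so its node branches and has two leaves beneath it, while abc leads to a single leaf.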
19
Constructing the suffix tree
  • The suffix tree of a string can be built in O(n)
    time and stored in O(n) space, where n is the
    length of the string
  • The first linear-time algorithm was given by
    Weiner (1973), and two more efficient algorithms
    were given later by McCreight and Ukkonen.

20
Finding the maximal matches using a suffix tree
To find the maximal matches of two strings A and
B
  • Construct the suffix tree for a string S formed
    by concatenating A and B, each followed by its own
    terminator; the two terminators are characters
    that do not occur anywhere in A or B.
  • Examine all the internal nodes to find the
    matches, and refer back to the original strings
    to check that each match is maximal.
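
To make the notion of a maximal match concrete, here is a brute-force sketch that does not use a suffix tree at all. It is far slower than the method on this slide and is only meant to show what output is expected: triples (i, j, length) that cannot be extended to the left or to the right. The function name and the 0-based positions are assumptions.

```python
def maximal_matches(a, b, k):
    """All maximal exact matches of length >= k, as (i, j, length) triples."""
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= k:
                matches.append((i, j, length))
    return matches


print(maximal_matches("xabcy", "zabcw", 2))
print(maximal_matches("abab", "ab", 2))
```

The first call reports the single shared block abc; the second reports two matches, because ab in the second string matches both occurrences in the first.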

21
Finding the maximal consistent matches
  • Construct the adjacency matrix of the bipartite
    graph representing the matches.
  • Add a directed edge from each 1 in the matrix
    to every 1 to the right of and below it; this
    results in a DAG.
  • Find the longest path in the resulting DAG.
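
The longest-path step can be sketched with a simple quadratic dynamic program. The sketch makes two simplifying assumptions that the slides do not state: each match is reduced to a point (i, j), its position in the two sequences, and the path length is the number of matches rather than the number of bases covered. The function name is illustrative.

```python
def longest_consistent_chain(matches):
    """Longest chain of (i, j) points that increase strictly in both coordinates."""
    pts = sorted(matches)           # every predecessor of a point sorts before it
    n = len(pts)
    if n == 0:
        return []
    best = [1] * n                  # best[q]: longest path ending at pts[q]
    prev = [-1] * n                 # back-pointers for recovering the path
    for q in range(n):
        for p in range(q):
            # DAG edge p -> q: q lies strictly to the right of and below p.
            if pts[p][0] < pts[q][0] and pts[p][1] < pts[q][1] and best[p] + 1 > best[q]:
                best[q] = best[p] + 1
                prev[q] = p
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(pts[end])
        end = prev[end]
    return chain[::-1]


print(longest_consistent_chain([(0, 5), (1, 1), (2, 2), (3, 0), (4, 4)]))
```

In this example the points (0, 5) and (3, 0) cross the others, so the longest consistent chain keeps (1, 1), (2, 2) and (4, 4).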

22
Example
23
Complexity and Improvements
  • The time complexity of this algorithm is O(e^2),
    where e is the number of edges in the bipartite
    graph, i.e. the number of matches.
  • But we can do it in O(e log e) time and O(e) space
    in the average case.
  • If the matrix is not sparse, we can do it in O(mn),
    which is proportional to e in this case.
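
One well-known way to get an O(e log e) bound for the unweighted point version of this problem is to sort the matches and compute a longest strictly increasing subsequence with binary search. This is a standard textbook technique offered as an illustration; the slides do not say which method the authors use, and the sketch returns only the chain length.

```python
from bisect import bisect_left


def consistent_chain_length(matches):
    """Size of the largest set of (i, j) points increasing in both coordinates."""
    # Sort by i; for equal i put larger j first so two points sharing
    # the same i can never both be chosen.
    ordered = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails = []  # tails[t]: smallest final j of any chain of length t + 1
    for _, j in ordered:
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)     # j extends the longest chain found so far
        else:
            tails[pos] = j      # j gives a better ending for a shorter chain
    return len(tails)


print(consistent_chain_length([(0, 5), (1, 1), (2, 2), (3, 0), (4, 4)]))
```

Sorting dominates the cost, and each match then needs one binary search, which gives O(e log e) time and O(e) space in the worst case for this simplified version.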

24
Filling the gaps
There are two ways to align the sequences between
the matches.
  • Recursively call the whole procedure using a
    shorter minimum length for the matches. We
    usually do this when the gap is long.
  • Use Needleman-Wunsch or other approximate
    methods (for short gaps).

25
Comparing the algorithms
  • MUMmer
  • Uses Maximal Unique Matches, so its matches do not
    cover a large fraction of the sequences.
  • Does not use an efficient algorithm for finding
    the set of consistent matches.
  • AVID
  • Uses Maximal Matches, but it does not allow
    overlap in the set of consistent matches.
  • Does not use an efficient algorithm for finding
    the set of consistent matches, and it does not
    necessarily produce the maximum covering set of
    matches.
  • Ours
  • Uses Maximal Matches, and it does allow overlap
    in the set of consistent matches.
  • Produces a set of consistent matches that covers
    the maximum possible fraction of the sequences.
  • MGA
  • Uses Maximal Matches, but it does not allow
    overlap in the set of consistent matches.
  • Uses an efficient algorithm for finding the set
    of consistent matches, but it does not necessarily
    produce the maximum covering set of matches.
  • Uses a more efficient implementation of the suffix
    tree structure

26
Results. MUMmer vs. ours (ALIGNER)
Cat            MUMmer   ALIGNER
Coverage1       56707     58793
Coverage2       56776     58812
Match Num        1996      2115

Chicken        MUMmer   ALIGNER
Coverage1        2793      3412
Coverage2        2796      3410
Match Num         103       132
27
Results. MUMmer vs. ours (ALIGNER)
Chimp          MUMmer   ALIGNER
Coverage1      791911    823739
Coverage2      792217    823884
Match Num        7834      8388

Cow            MUMmer   ALIGNER
Coverage1       52503     63538
Coverage2       52487     63545
Match Num        1831      2242
28
Results. MUMmer vs. ours (ALIGNER)
Dog            MUMmer   ALIGNER
Coverage1       49427     51014
Coverage2       49420     51089
Match Num        1748      1820

Pig            MUMmer   ALIGNER
Coverage1      101592    120991
Coverage2      101556    121084
Match Num        3564      4293

Total time      3'22"     5'41"
29
Results. AVID vs. ours (ALIGNER)
Chicken (Len1 1254651, Len2 1034310)
           ALIGNER        AVID
Score     -4755676    -6935493
Time           122         111

Chimp (Len1 1100481, Len2 990574)
           ALIGNER        AVID
Score      4380981     3172357
Time           036         026

Cow (Len1 2573067, Len2 3241677)
           ALIGNER        AVID
Score     -7258532    -8684808
Time           156         414
30
Results. AVID vs. ours (ALIGNER)
Dog (Len1 1530045, Len2 2495979)
           ALIGNER        AVID
Score     -2853058    -3359416
Time           657         352

Pig (Len1 3805947, Len2 6120150)
           ALIGNER        AVID
Score    -11711531    -1235622
Time           316         858

Celera - EDGP (Len1 2695268, Len2 2626764)
           ALIGNER            AVID
Score     14546685        14311088
Time      A few minutes!    ??????