Sequence Alignment - PowerPoint PPT Presentation

Slides: 26
Provided by: csF8
Learn more at: https://cs.fit.edu
Transcript and Presenter's Notes


1
Sequence Alignment
2
Motivation: Types
  • Two sequences of the same length in which some
    characters differ (database search)
  • aagtacggaga
  • aagcaccgaga
  • Two sequences of different lengths, with possible
    gaps in one of them (database search)
  • aaccaccgaga
  • aa-caccgaga

3
Motivation: Types
  • Match the longest prefix of one sequence with a
    suffix of the other (fragment assembly)
  • aaacgtcgata
  • gatacgatg
  • Local alignment: the longest matching substrings
    across the two sequences (homolog search)
  • gatacgatgctagtttacg
  • agagcgatgcataattcgaatga

4
Motivation: Types
  • Multiple sequence alignment
  • (p. 71) (comparative studies of sequences)

5
Formalizing sequence comparison
  • Either a character matches the corresponding
    character in an alignment (+1),
  • or it does not (-1),
  • or a gap needs to be inserted (-2)
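
The three cases above can be sketched as a column-by-column scorer. This is only an illustration: the function name score_alignment is made up here, "-" is assumed to mark a gap, and the sample strings are the lower-cased examples from the motivation slide.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned rows of equal length, column by column."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # a gap was inserted in one row
        elif x == y:
            total += match      # characters agree
        else:
            total += mismatch   # characters differ
    return total


# Ten matching columns and one gap column: 10 * 1 + 1 * (-2)
print(score_alignment("aaccaccgaga", "aa-caccgaga"))  # 8
```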

6
Global Alignment
  • Needleman-Wunsch (1970) dynamic programming
    algorithm (Smith-Waterman, 1981, is the local
    variant)
  • Scoring matrix for alignment (p. 31)
  • Initialize the boundaries of the scoring matrix for
    gaps in front of either string
  • Meaning of an entry in the matrix
  • Corner element is the final score

7
Global Alignment
  • Three alternatives in each iteration
  • Order of calculation: row-wise or column-wise
  • The algorithm (p. 52)
  • Recursive recovery of the alignment from the corner
    element (constants m and n, the string lengths)
  • Variable len returned by the algorithm
  • Convention for tie breaking
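
A minimal sketch of the global dynamic programming scheme outlined on these two slides (commonly known as the Needleman-Wunsch algorithm), assuming the +1 / -1 / -2 scores given earlier. The function name global_align is hypothetical, the recovery is written as a loop rather than recursively, and the tie-breaking order (diagonal, then up, then left) is just one possible convention, not necessarily the one used in the lecture.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment."""
    n, m = len(s), len(t)
    # a[i][j] holds the best score for aligning s[:i] with t[:j].
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * gap           # s[:i] against gaps only
    for j in range(1, m + 1):
        a[0][j] = j * gap           # t[:j] against gaps only
    for i in range(1, n + 1):       # row-wise fill
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,   # align s[i] with t[j]
                          a[i - 1][j] + gap,     # s[i] against a gap
                          a[i][j - 1] + gap)     # t[j] against a gap

    # Recover one optimal alignment, starting from the corner element.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return a[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


score, top, bottom = global_align("aaccaccgaga", "aacaccgaga")
print(score)
print(top)
print(bottom)
```

The corner element a[n][m] is the final score; changing the order of the three checks in the recovery loop changes which of several equally good alignments is reported.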

8
Local alignment
  • The alignment may stop anywhere
  • So the minimum score is zero, even on the boundaries
  • The best local alignment ends where the score is
    maximal in the matrix
  • Recovery starts from that max value and stops at a
    zero value
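
The local variant (commonly attributed to Smith and Waterman) can be sketched as follows, again assuming the +1 / -1 / -2 scores. The name local_align and the test strings are illustrative only.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, piece_of_s, piece_of_t) for one best local alignment."""
    n, m = len(s), len(t)
    # Boundaries stay at zero: an alignment may start anywhere.
    a = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                      # never drop below zero
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
            if a[i][j] > best:
                best, bi, bj = a[i][j], i, j

    # Recovery starts at the maximum and stops at a zero entry.
    row_s, row_t = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(local_align("xxxacgtyyy", "zzacgtww"))  # (4, 'acgt', 'acgt')
```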

9
Semi-global (as-required) alignment
  • Four alternatives for penalty-free gaps: in front
    of string s, in front of t, at the back of s, or
    at the back of t
  • Prefix-suffix matching by combining these
    alternatives
  • E.g., suffix of s with prefix of t: gaps at the
    back of s and in the front of t

10
Semi-global alignment
  • Example: p. 56
  • Gaps in front: zeros in the row or column
    representing the string
  • Gaps at the back: recovery starts from the max of
    the row or column representing the string
  • The above may be combined as required
  • Exercise: how to combine them to match a suffix of
    s with a prefix of t
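
One possible way to combine the two rules for the suffix-of-s / prefix-of-t case is sketched below. It assumes that rows index s and columns index t, so free gaps in front of t become a zero first column and free gaps at the back of s become a maximum over the last row. The name overlap_score is hypothetical, and the strings are the fragment-assembly example from the motivation slide.

```python
def overlap_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best score of a suffix of s against a prefix of t.

    Returns (score, j), where t[:j] is the prefix of t that was used.
    """
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # Column 0 stays zero: skipping a prefix of s (gaps in front of t) is free.
    for j in range(1, m + 1):
        a[0][j] = j * gap           # gaps in front of s are still charged
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
    # Gaps at the back of s are free: take the max over the last row.
    best_j = max(range(m + 1), key=lambda j: a[n][j])
    return a[n][best_j], best_j


# The suffix "gata" of s lines up with the prefix "gata" of t.
print(overlap_score("aaacgtcgata", "gatacgatg"))  # (4, 4)
```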

11
Generalized gap penalty
  • A run of k gaps is penalized the same as a single
    gap, or by some formula w(k)
  • Each block of consecutive gaps is treated as
    one unit (like a character)
  • Boundary (first row and column) initialization
    with w(k)

12
Generalized gap penalty
  • Three interacting matrices:
  • One for character matching, with p(i,j)
  • One for gaps in s
  • One for gaps in t
  • Formula: p. 63

13
Affine gap penalty
  • A generalized gap penalty with
  • w(k) = h + gk; the first gap costs more (h + g)
  • The formula changes slightly with a known w(k)
  • Block gap matrices compare only the previous
    elements, so the complexity is reduced
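
A sketch of the three-matrix scheme with the affine penalty w(k) = h + gk. The values h = 2 and g = 1, the +1 / -1 character scores, and the name affine_global_score are assumptions for illustration; the lecture's exact formula is the one on the cited page.

```python
NEG = float("-inf")


def affine_global_score(s, t, h=2, g=1, match=1, mismatch=-1):
    """Global alignment score where a block of k gaps costs h + g*k."""
    n, m = len(s), len(t)
    # a: alignments ending with s[i] matched against t[j]
    # b: alignments ending with a gap in s (t[j] against a gap)
    # c: alignments ending with a gap in t (s[i] against a gap)
    a = [[NEG] * (m + 1) for _ in range(n + 1)]
    b = [[NEG] * (m + 1) for _ in range(n + 1)]
    c = [[NEG] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(h + g * j)      # boundary initialised with w(j)
    for i in range(1, n + 1):
        c[i][0] = -(h + g * i)      # boundary initialised with w(i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Opening a block costs h + g; extending one costs only g.
            b[i][j] = max(a[i][j - 1] - (h + g),
                          b[i][j - 1] - g,
                          c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g),
                          b[i - 1][j] - (h + g),
                          c[i - 1][j] - g)
    return max(a[n][m], b[n][m], c[n][m])


# Ten matches and one gap block of length 1: 10 - (2 + 1)
print(affine_global_score("aaccaccgaga", "aacaccgaga"))  # 7
```

Each gap matrix looks only at neighbouring entries, which is why the affine case is cheaper than a general w(k) that would have to scan all earlier block lengths.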

14
Multiple sequence alignment
  • Scoring function for each column: a character or
    a gap for each sequence
  • Combinatorics: 2^k - 1 for k sequences (the -1
    excludes a column made up only of gaps)
  • But . . .

15
Multiple sequence alignment
  • The order of arguments to the function should not
    matter: f(I,-,v) = f(I,v,-)
  • Score pairwise on a column
  • Combinatorics: (k choose 2)
  • For k = 10: 2^k - 1 = 1023, C(k,2) = 45
  • We now need gap-to-gap scoring

16
Multiple sequence alignment
  • The total score can be measured either way:
  • Sum over all columns, or
  • Sum over all pairs of sequences
  • If p(-, -) = 0, then both scores above are the
    same
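
The sum-of-pairs idea and the counts quoted earlier can be checked with a small sketch. The scoring function p below assumes +1 / -1 / -2 with p(-, -) = 0, the helper names are made up, and the three aligned rows are an invented example. The pairwise version drops columns where both rows hold a gap, as in the pairwise alignment induced by the multiple alignment.

```python
from itertools import combinations
from math import comb


def p(x, y, match=1, mismatch=-1, gap=-2):
    """Pairwise column score, with gap-to-gap scored as zero."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_by_columns(rows):
    """Sum over all columns of the pairwise scores within each column."""
    return sum(p(x, y) for col in zip(*rows) for x, y in combinations(col, 2))


def sp_by_pairs(rows):
    """Sum over all pairs of rows of their induced pairwise alignment score."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        for x, y in zip(r1, r2):
            if x == "-" and y == "-":
                continue            # column vanishes from the induced alignment
            total += p(x, y)
    return total


rows = ["ac-gt", "a--gt", "atcg-"]
print(sp_by_columns(rows), sp_by_pairs(rows))   # -6 -6

k = 10
print(2 ** k - 1, comb(k, 2))                   # 1023 45
```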

17
Multiple sequence alignment
  • Consider a 3-sequence alignment of s1, s2, and s3
  • The (i, j, k)-th entry of the scoring matrix is for
    aligning s1[1..i], s2[1..j], s3[1..k]
  • 3D matrix of dimension (n x m x l), for |s1| = n,
    |s2| = m, |s3| = l

18
Multiple sequence alignment
  • Each entry in the scoring matrix is at a corner
    of a 3D box
  • The optimal score is calculated over the other 7
    corners (max):
  • A[i-1, j, k], A[i, j-1, k], A[i, j, k-1],
  • A[i-1, j-1, k], A[i-1, j, k-1], A[i, j-1, k-1],
  • A[i-1, j-1, k-1]
  • Vector(i, j, k): a bit-vector
  • In each case the sum-of-pairs score for the column
    is added (EXAMPLE)
  • Initialization: A[i, 0, 0] = (-4)i for 1 <= i <= n,
    for two gaps against substrings of s1; likewise for
    s2 and s3
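
A compact sketch of the three-sequence recurrence, assuming sum-of-pairs column scores built from +1 / -1 / -2 with p(-, -) = 0. The seven bit-vectors pick which of the seven neighbouring corners to come from; the names align3_score and sp_column are hypothetical. The boundary values are not set separately here because they fall out of the same recurrence.

```python
from itertools import product

NEG = float("-inf")


def p(x, y, match=1, mismatch=-1, gap=-2):
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_column(x, y, z):
    """Sum-of-pairs score of one column of three symbols."""
    return p(x, y) + p(x, z) + p(y, z)


def align3_score(s1, s2, s3):
    n, m, l = len(s1), len(s2), len(s3)
    A = [[[NEG] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    A[0][0][0] = 0
    # Seven non-zero bit-vectors: bit = 1 means "consume a character".
    moves = [b for b in product((0, 1), repeat=3) if any(b)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue    # that corner lies outside the matrix
                    x = s1[i - 1] if di else "-"
                    y = s2[j - 1] if dj else "-"
                    z = s3[k - 1] if dk else "-"
                    cand = A[pi][pj][pk] + sp_column(x, y, z)
                    if cand > A[i][j][k]:
                        A[i][j][k] = cand
    return A[n][m][l]


print(align3_score("acg", "acg", "acg"))   # 9: three columns, three matching pairs each
print(align3_score("ac", "", ""))          # -8: the (-4)i boundary with i = 2
```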

19
Multiple sequence alignment
  • For k sequences, a k-dimensional matrix
  • Each entry is a calculation over the 2^k - 1 other
    corners of the box
  • Formula: p. 72

20
Alignment improvements
  • Alignment can also be computed from the back
  • s[i+1..n], t[j+1..m]
  • Front and back alignments can be combined to
    cut the alignment: compute the two matrices, add
    them, and align according to the summed matrix
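
The front-plus-back idea can be illustrated with a small sketch, assuming the +1 / -1 / -2 scores. The backward matrix is obtained here by running the ordinary forward fill on the reversed strings, which is only one way to do it; all function names are hypothetical. In every row of the summed matrix, the largest entry equals the optimal global score and marks a cell that some optimal alignment passes through.

```python
def score_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Forward matrix: f[i][j] = best score of s[:i] against t[:j]."""
    n, m = len(s), len(t)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i * gap
    for j in range(1, m + 1):
        f[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + p, f[i - 1][j] + gap, f[i][j - 1] + gap)
    return f


def backward_matrix(s, t):
    """Backward matrix: b[i][j] = best score of s[i:] against t[j:]."""
    n, m = len(s), len(t)
    rev = score_matrix(s[::-1], t[::-1])
    return [[rev[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def summed_matrix(s, t):
    f, b = score_matrix(s, t), backward_matrix(s, t)
    return [[f[i][j] + b[i][j] for j in range(len(t) + 1)]
            for i in range(len(s) + 1)]


s, t = "aaccaccgaga", "aacaccgaga"
total = summed_matrix(s, t)
best = score_matrix(s, t)[-1][-1]
# For each row i, a column where an optimal alignment can be cut.
cuts = [max(range(len(t) + 1), key=row.__getitem__) for row in total]
print(best, cuts)
```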

21
Alignment improvements
  • When the lengths of the two sequences are
    comparable and a good global alignment is expected,
  • Retrieval is mostly along the diagonal
  • Computation can focus on a strip of fixed width k
    around the diagonal: the k-band
  • More efficient
  • Uses the relevant cells only
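
A sketch of the k-band restriction, assuming the +1 / -1 / -2 scores and a band defined by |i - j| <= k. The name kband_score and the dictionary used to hold only the band cells are illustrative choices. The result is the best score among alignments that stay inside the band, so it can fall below the true optimum if k is too small.

```python
NEG = float("-inf")


def kband_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= k."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        raise ValueError("the corner (n, m) lies outside the band")
    a = {(0, 0): 0}                 # only band cells are ever stored

    def get(i, j):
        return a.get((i, j), NEG)   # cells outside the band count as -infinity

    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                p = match if s[i - 1] == t[j - 1] else mismatch
                best = get(i - 1, j - 1) + p
            if i > 0:
                best = max(best, get(i - 1, j) + gap)
            if j > 0:
                best = max(best, get(i, j - 1) + gap)
            a[(i, j)] = best
    return a[(n, m)]


print(kband_score("aaccaccgaga", "aacaccgaga", 1))  # 8
```

Only about (2k + 1) cells per row are filled, instead of all m + 1.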

22
Multiple sequence alignment: Star alignment
  • One sequence at the center; all others are pairwise
    aligned against it
  • Which sequence to put at the center?
  • Try each
  • Create a 2D similarity matrix for all pairs and
    pick the best row (least of summed), p. 79
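
Choosing the center can be sketched as below. Two assumptions are made loudly here: pairwise similarity is taken to be the global alignment score with +1 / -1 / -2, and "best" is read as the largest row sum, since higher similarity is better. If the matrix held distances instead, the smallest sum would be the one to pick. The names global_score and pick_center and the three sample sequences are invented for illustration.

```python
def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score, keeping only two rows of the matrix."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + p, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pick_center(seqs):
    """Return (index_of_center, similarity_matrix)."""
    k = len(seqs)
    sim = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            sim[i][j] = sim[j][i] = global_score(seqs[i], seqs[j])
    totals = [sum(row) for row in sim]
    # Assumption: similarity scores, so the largest row sum wins.
    center = max(range(k), key=totals.__getitem__)
    return center, sim


center, sim = pick_center(["acgt", "acga", "ccga"])
print(center)   # 1
print(sim)
```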

23
Multiple sequence alignment: Tree alignment
  • A spanning tree over the sequences: nodes are
    sequences
  • Each edge is labeled with the similarity between
    its pair of nodes
  • The total tree cost, or aggregate over edges,
    should be max
  • A star is a special tree

24
PAM matrix for matching residues
25
BLAST search engine