Multiple Sequence Alignment - PowerPoint PPT Presentation

About This Presentation
Title:

Multiple Sequence Alignment

Slides: 41
Provided by: erict9
Learn more at: http://www.cse.msu.edu
Transcript and Presenter's Notes

Title: Multiple Sequence Alignment


1
Multiple Sequence Alignment
  • Motivation
  • Definition
  • Scoring (Sum of Pairs scoring)
  • Algorithms
  • Family Representations

2
Multiple Sequence Alignment
  • Motivation
  • What are we trying to accomplish?
  • Definition
  • Scoring (Sum of Pairs scoring)
  • Algorithms
  • Family Representations

3
Multiple Sequence Alignment
  • Motivation
  • Representation of protein families
  • Identification and representation of conserved
    features of DNA/protein sequences that correlate
    with structure or function
  • Deduction of evolutionary history from
    DNA/protein sequences
  • Read pages 333-342
  • A lot of this is done by heuristic or
    intuition and is difficult to automate

4
Biological Motivation
  • Previous First Fact of Biological Sequence
    Comparison
  • In biomolecular sequences (DNA, RNA, amino acid
    sequences), high sequence similarity usually
    implies significant functional or structural
    similarity
  • Second Fact of Biological Sequence Comparison
  • Evolutionarily and functionally related molecular
    strings can differ significantly throughout the
    string yet preserve the same 3D structure(s), 2D
    substructure(s), active sites, or dispersed
    residues

5
2 strings versus multiple strings
  • 2 strings
  • Based on first fact
  • Find unknown biological relationships using
    string similarity
  • Method: database searching
  • Multiple strings
  • Based loosely on second fact
  • Given known biological relationships (function,
    structure, etc), identify unknown conserved
    subpatterns in a set of strings
  • These subpatterns can then be used as a known
    pattern for other database searches

6
Multiple Sequence Alignment
  • Motivation
  • Definition
  • Scoring (Sum of Pairs scoring)
  • Algorithms
  • Family Representations

7
Definition
  • A global alignment of a set of k > 2 strings Si
    is obtained
  • by inserting spaces (dashes) into each Si so that
    every string has the same length at the end, and
  • placing the strings in rows, one character
    (or dash) per column.
  • Note: ALL positions of every string are involved
    (as all positions of both S and T were in the
    pairwise case)
  • A local alignment of a set of k > 2 strings Si is
    obtained
  • by selecting one substring S'i from each string
    Si
  • globally aligning those substrings

8
Example
  • Strings abca, ababa, accb, cbbc
  • a b c - a
  • a b a b a
  • a c c b -
  • c b - b c

9
Multiple Sequence Alignment
  • Motivation
  • Definition
  • Scoring (Sum of Pairs scoring)
  • Induced pairwise alignments
  • Definition of sum of pair (SP) scoring
  • Justification (or lack thereof)
  • Algorithms
  • Family Representations

10
Scoring MSAs
  • Key fact: there is no universally accepted score
    function
  • My impression is that people evaluate MSAs by
    feel (they know a good one when they see it)
  • Definitions
  • Given a MSA M, the induced pairwise alignment of
    Si and Sj is obtained from M by removing all rows
    except the two rows for Si and Sj. Opposing
    spaces can be removed if desired.

11
Definitions
  • Definitions
  • Given a MSA M, the induced pairwise alignment of
    Si and Sj is obtained from M by removing all rows
    except the two rows for Si and Sj. Opposing
    spaces can be removed if desired.
  • The score of an induced pairwise alignment is
    determined using any chosen scoring scheme for
    two-string alignment in the standard manner.

12
Example
  • Example
  • a b c - a
  • a b a b a
  • a c c b -
  • c b - b c
  • Induced alignment
  • a b c - a
  • a c c b -
  • Score
  • 0 + 1 + 0 + 1 + 1 = 3
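The induced alignment and its score can be reproduced with a few lines of Python. This is a minimal sketch, not the presenter's code: it assumes a unit-cost scheme (0 for a match, 1 for a mismatch or a character opposite a space), which is the scheme the score above is consistent with. The function names `induced_pair` and `pair_score` are made up for illustration.

```python
def induced_pair(msa, i, j, drop_opposing_spaces=True):
    """Return rows i and j of the MSA as a pairwise alignment."""
    columns = list(zip(msa[i], msa[j]))
    if drop_opposing_spaces:
        # Columns where both rows hold a space carry no information.
        columns = [(x, y) for x, y in columns if not (x == "-" and y == "-")]
    row_i = "".join(x for x, _ in columns)
    row_j = "".join(y for _, y in columns)
    return row_i, row_j


def pair_score(row_i, row_j):
    """Assumed unit cost: 0 for identical symbols, 1 otherwise."""
    return sum(1 for x, y in zip(row_i, row_j) if x != y)


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
top, bottom = induced_pair(msa, 0, 2)
print(top, bottom, pair_score(top, bottom))  # abc-a accb- 3
```

Any other two-string scoring scheme could be substituted for `pair_score` without changing `induced_pair`.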

13
Sum of Pairs (SP)
  • Definition: The SP score of a MSA M is the sum of
    the scores of pairwise global alignments induced
    by M
  • Example
  • a b c - a
  • a b a b a
  • a c c b -
  • c b - b c
  • SP score: 2 + 3 + 4 + 3 + 3 + 4 = 19
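The SP total can be checked mechanically. The sketch below assumes the same unit-cost scheme as the induced-alignment example (match 0, mismatch or character-versus-space 1, two opposing spaces 0); `sp_score` is an illustrative name.

```python
from itertools import combinations


def pair_cost(x, y):
    """Assumed unit cost; two opposing spaces are identical and cost 0."""
    return 0 if x == y else 1


def induced_scores(msa):
    """Score of every induced pairwise alignment, in row order (1,2), (1,3), ..."""
    return [
        sum(pair_cost(x, y) for x, y in zip(row_i, row_j))
        for row_i, row_j in combinations(msa, 2)
    ]


def sp_score(msa):
    return sum(induced_scores(msa))


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
print(induced_scores(msa))  # [2, 3, 4, 3, 3, 4]
print(sp_score(msa))        # 19
```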

14
Justification
  • Difficult to give a sound biological
    justification for SP or any other scoring scheme
  • Main reasons for studying it
  • It is easy to work with
  • It has been used by many people in studying MSA
  • It is used in several packages

15
Multiple Sequence Alignment
  • Motivation
  • Definition
  • Scoring (Sum of Pairs scoring)
  • Algorithms
  • Exact, NP-hard problem
  • Approximation Algorithm (Center Star)
  • Heuristic Methods
  • Family Representations

16
Formal Problem
  • Input
  • k strings Si
  • Scoring function
  • Output
  • MSA of Si with minimum (maximum) SP score
  • Observation
  • Exact solution is NP-hard
  • Dynamic programming takes $O(n^k)$ time, so solving
    exactly for more than even 6 strings of typical
    length is often not feasible
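To make the exponential blow-up concrete, here is a small sketch of the exact dynamic program over the k-dimensional lattice. It assumes unit costs and minimizes the SP score; the names `optimal_sp` and `column_cost` are illustrative. Each lattice node has up to 2^k - 1 successors, which is why this only works for a handful of short strings.

```python
from functools import lru_cache
from itertools import combinations, product
from math import prod


def column_cost(column):
    """SP cost of one alignment column under the assumed unit-cost scheme."""
    return sum(0 if x == y else 1 for x, y in combinations(column, 2))


def optimal_sp(strings):
    """Minimum SP score over all global multiple alignments of `strings`."""
    k = len(strings)
    lengths = tuple(len(s) for s in strings)
    steps = [s for s in product((0, 1), repeat=k) if any(s)]

    @lru_cache(maxsize=None)
    def best(pos):
        # pos[i] = number of characters of strings[i] already aligned.
        if pos == lengths:
            return 0
        result = float("inf")
        for step in steps:
            nxt = tuple(p + s for p, s in zip(pos, step))
            if any(q > n for q, n in zip(nxt, lengths)):
                continue
            column = [strings[i][pos[i]] if step[i] else "-" for i in range(k)]
            result = min(result, column_cost(column) + best(nxt))
        return result

    return best((0,) * k)


strings = ["abca", "ababa", "accb", "cbbc"]
print(prod(len(s) + 1 for s in strings))  # 750 lattice nodes for this tiny input
print(optimal_sp(["kitten", "sitting"]))  # 3, the ordinary edit distance when k = 2
```

For k = 2 the recurrence reduces to the usual pairwise edit-distance table.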

17
Heuristic Speedup
  • View problem as a shortest path problem with
    $O(n^k)$ nodes
  • Given an upper bound on the actual value, we can
    eliminate exploration of many nodes using branch
    and bound ideas
  • Key is to send values forward rather than
    backwards
  • Backwards: All nodes will eventually be evaluated
  • Forwards: Limit to those nodes whose value can
    possibly be less than the current estimate of the
    optimum
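One way to picture the forward idea is a shortest-path search that pushes values from each node to its successors and discards any value above a known upper bound. The sketch below is a simplified stand-in, not the method from the slides: it uses a priority queue (Dijkstra-style) with unit costs, and the pruning rule is only "value exceeds the bound". The name `forward_search` and the returned node count are illustrative.

```python
import heapq
from itertools import combinations, product


def column_cost(column):
    """SP cost of one column under the assumed unit-cost scheme."""
    return sum(0 if x == y else 1 for x, y in combinations(column, 2))


def forward_search(strings, upper_bound):
    """Return (optimal SP score or None, number of lattice nodes given a value)."""
    k = len(strings)
    goal = tuple(len(s) for s in strings)
    start = (0,) * k
    steps = [s for s in product((0, 1), repeat=k) if any(s)]
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        value, pos = heapq.heappop(heap)
        if value > best[pos]:
            continue  # stale heap entry
        if pos == goal:
            return value, len(best)
        for step in steps:
            nxt = tuple(p + s for p, s in zip(pos, step))
            if any(q > n for q, n in zip(nxt, goal)):
                continue
            column = [strings[i][pos[i]] if step[i] else "-" for i in range(k)]
            new_value = value + column_cost(column)
            if new_value > upper_bound:
                continue  # cannot lie on a path that meets the bound
            if new_value < best.get(nxt, float("inf")):
                best[nxt] = new_value
                heapq.heappush(heap, (new_value, nxt))
    return None, len(best)


strings = ["abca", "ababa", "accb", "cbbc"]
print(forward_search(strings, upper_bound=19))            # 19 is the SP score of a known MSA
print(forward_search(strings, upper_bound=float("inf")))  # same optimum, no pruning
```

A tighter rule would add a lower bound on the remaining cost (for example from pairwise suffix alignments) before comparing with the upper bound; that refinement is left out to keep the sketch short.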

18
Backwards
19
Forwards
20
Approximation Algorithms
  • Given the hardness of computing the exact
    solution, how about developing algorithms that
    compute a solution that is guaranteed to be close
    to optimal
  • Goal: Find a polynomial-time algorithm A that
    minimizes
  • $\sup_I A(I)/\mathrm{OPT}(I)$
  • Only computer scientists seem interested in this
  • Biologists seem to do things more heuristically

21
Alignments consistent with a tree
  • D(Si,Sj) is the optimal weighted edit distance
    between Si and Sj
  • Definition: Let T be a tree where each node is
    labeled with a string from Si. Then a multiple
    alignment of Si is consistent with T if the
    induced pairwise alignment of Si and Sj has score
    D(Si,Sj) for each pair of strings (Si, Sj) that
    are connected by an edge in T.

22
Example
[Figure: tree T with nodes labeled AYZ, AXZ, AXYZ, XYZ, AYXYZ]
  • -AX-Z
  • -A-YZ
  • -AXYZ
  • --XYZ
  • AYXYZ
  • All edge alignment scores are optimal
  • Alignments of non-adjacent pairs need not be,
    such as AYXYZ with -AXYZ
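Consistency with a tree can be checked directly from the definition. The sketch below assumes unit-cost edit distance for D and assumes the tree edges AXZ-AXYZ, AYZ-AXYZ, AXYZ-XYZ and XYZ-AYXYZ; the edges are not spelled out in the transcript, so treat them as a guess. `is_consistent` is an illustrative name.

```python
def edit_distance(s, t):
    """Unit-cost edit distance D(s, t) computed row by row."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def induced_score(row_i, row_j):
    """Unit-cost score of the induced alignment; opposing spaces cost 0."""
    return sum(1 for x, y in zip(row_i, row_j) if x != y)


def is_consistent(msa, edges):
    """msa maps each string to its gapped row; edges are pairs of strings."""
    return all(
        induced_score(msa[s], msa[t]) == edit_distance(s, t) for s, t in edges
    )


msa = {"AXZ": "-AX-Z", "AYZ": "-A-YZ", "AXYZ": "-AXYZ", "XYZ": "--XYZ", "AYXYZ": "AYXYZ"}
# Assumed tree edges (not given explicitly in the transcript).
edges = [("AXZ", "AXYZ"), ("AYZ", "AXYZ"), ("AXYZ", "XYZ"), ("XYZ", "AYXYZ")]
print(is_consistent(msa, edges))                # True
print(is_consistent(msa, [("AYXYZ", "AXYZ")]))  # False: induced score 2, D = 1
```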

23
Theorem
  • For any Si and any tree T whose nodes are
    labeled with distinct strings of Si, we can
    efficiently find an MSA M(T) of Si that is
    consistent with T.
  • Proof
  • Incrementally align any two adjacent nodes
  • Two aligned gaps have zero cost
  • Add gaps as necessary to other already aligned
    sequences

24
Example
[Figure: tree T with nodes labeled AYZ, AXZ, AXYZ, XYZ, AYXYZ]
  • Align AXYZ and XYZ
  • AXYZ
  • -XYZ
  • Align AYXYZ and -XYZ
  • A-XYZ  or  -AXYZ
  • --XYZ      --XYZ
  • AYXYZ      AYXYZ

25
Triangle Inequality
  • Assume an alphabet-weighted scoring scheme s(x,y)
  • x and y could be any character (or a space)
  • A scoring scheme satisfies the triangle
    inequality if for any three characters (including
    a space) x, y, and z,
  • $s(x,z) \le s(x,y) + s(y,z)$
  • Note, not all scoring schemes used in biology
    satisfy this triangle inequality property
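Whether a given alphabet-weighted scheme satisfies the property can be tested by brute force over all triples. In this sketch `unit_cost` is the usual 0/1 scheme and `skewed_cost` is a made-up scheme that violates the inequality; neither comes from the slides.

```python
from itertools import product


def satisfies_triangle_inequality(alphabet, s):
    """Check s(x,z) <= s(x,y) + s(y,z) for all characters, including the space."""
    symbols = list(alphabet) + ["-"]
    return all(
        s(x, z) <= s(x, y) + s(y, z) for x, y, z in product(symbols, repeat=3)
    )


def unit_cost(x, y):
    return 0 if x == y else 1


def skewed_cost(x, y):
    """Hypothetical scheme: swapping a and c is much dearer than going via b."""
    if x == y:
        return 0
    return 5 if {x, y} == {"a", "c"} else 1


print(satisfies_triangle_inequality("abc", unit_cost))    # True
print(satisfies_triangle_inequality("abc", skewed_cost))  # False: s(a,c)=5 > s(a,b)+s(b,c)=2
```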

26
Center Star Method
  • For Si, define Sc to be the string that
    minimizes $\sum_j D(S_c, S_j)$, summed over all
    strings Sj
  • Define the center star to be the star where the
    center node is labeled with Sc
  • Define Mc to be an MSA of Si that is consistent
    with the center star
  • Define d(Si, Sj) to be the score of the pairwise
    alignment of Si and Sj induced by Mc.
  • Denote the score of an alignment M as d(M).

27
Example
[Figure: tree with nodes labeled AYZ, AXZ, AXYZ, XYZ, AYXYZ]
  • $\sum_j D(\mathrm{AXYZ}, S_j) = 4$, summed over all strings Sj
  • Mc before AYXYZ added
  • AXYZ
  • AX-Z
  • A-YZ
  • -XYZ
  • Mc after AYXYZ added
  • A-XYZ
  • A-X-Z
  • A--YZ
  • --XYZ
  • AYXYZ

28
Example continued
[Figure: tree with nodes labeled AYZ, AXZ, AXYZ, XYZ, AYXYZ]
  • Mc after AYXYZ added
  • A-XYZ
  • A-X-Z
  • A--YZ
  • --XYZ
  • AYXYZ
  • d(AYZ, AYXYZ) = 2
  • d(Mc) = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 2 + 2 = 16
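The construction of Mc above can be sketched in code. This is an illustrative implementation under unit costs, not the presenter's: it picks the center by summed edit distance, aligns every other string to the center, and merges the pairwise alignments by adding gap columns to the rows already placed. Ties in the pairwise traceback are broken in favor of the diagonal, which is an arbitrary choice. It also assumes the input strings are distinct.

```python
def align_pair(s, t):
    """Optimal unit-cost global alignment: (distance, gapped s, gapped t)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback, preferring the diagonal on ties
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))


def center_star(strings):
    """Return (strings in row order, gapped rows); row 0 is the center."""
    center = min(strings, key=lambda s: sum(align_pair(s, t)[0] for t in strings))
    order, rows = [center], [center]
    for s in strings:
        if s == center:
            continue
        _, new_center, new_s = align_pair(center, s)
        merged, new_row, p, q = [[] for _ in rows], [], 0, 0
        while p < len(rows[0]) or q < len(new_center):
            old_gap = p < len(rows[0]) and rows[0][p] == "-"
            new_gap = q < len(new_center) and new_center[q] == "-"
            take_old = not (new_gap and not old_gap)  # keep an existing column
            take_new = not (old_gap and not new_gap)  # consume a column of the new pair
            for row, out in zip(rows, merged):
                out.append(row[p] if take_old else "-")
            new_row.append(new_s[q] if take_new else "-")
            p += take_old
            q += take_new
        rows = ["".join(r) for r in merged] + ["".join(new_row)]
        order.append(s)
    return order, rows


order, rows = center_star(["AXYZ", "AXZ", "AYZ", "XYZ", "AYXYZ"])
print(order[0])  # AXYZ
print(rows)      # ['A-XYZ', 'A-X-Z', 'A--YZ', '--XYZ', 'AYXYZ']
```

With this tie-breaking the output matches the Mc shown above; a different traceback choice could yield a different but equally valid star-consistent alignment.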

29
Results
  • Lemma: Assuming the triangle inequality,
  • $d(S_i, S_j) \le d(S_i, S_c) + d(S_c, S_j) = D(S_i, S_c) + D(S_c, S_j)$
  • Definition: Let $M^*$ be the optimal alignment of
    Si and $d^*(S_i, S_j)$ be the score of the pairwise
    alignment of Si and Sj induced by $M^*$.
  • Theorem: $d(M_c) / d(M^*) \le 2(k-1)/k < 2$
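The transcript does not carry the proof itself, so here is a sketch of the standard counting argument, offered as a reconstruction rather than the presenter's slide. It writes $M^*$ for the optimal alignment and $d^*$ for its induced pairwise scores, sums over ordered pairs $i \ne j$ (so each unordered pair is counted twice), and uses $D(S_c, S_c) = 0$.

```latex
\begin{align*}
2\,d(M_c) &= \sum_{i \ne j} d(S_i, S_j)
  \le \sum_{i \ne j} \bigl[ D(S_i, S_c) + D(S_c, S_j) \bigr]
  = 2(k-1) \sum_{j} D(S_c, S_j), \\
2\,d(M^*) &= \sum_{i} \sum_{j \ne i} d^*(S_i, S_j)
  \ge \sum_{i} \sum_{j \ne i} D(S_i, S_j)
  \ge k \sum_{j} D(S_c, S_j), \\
\frac{d(M_c)}{d(M^*)} &\le \frac{2(k-1)}{k} < 2 .
\end{align*}
```

The first line uses the lemma, the second uses that no induced alignment beats the optimal pairwise distance and that Sc minimizes the row sum. In the five-string example, d(Mc) = 16 and k = 5, so the bound says the optimum is at least 16 / 1.6 = 10.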

30
Proof
31
Weighted SP
  • Each induced pairwise score is multiplied by a
    weight w(i,j).
  • Optimal weighted SP can be computed in
    exponential time (in k) using dynamic programming
  • Little is known about approximation of weighted
    SP
  • Why doesn't center star give a guaranteed bound
    here?
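Weighted SP only changes how the induced pairwise scores are combined. A minimal sketch, again assuming unit costs; the weight table `w` and its default of 1 are illustrative choices, not something the slides specify.

```python
from itertools import combinations


def weighted_sp(msa, w):
    """Sum of induced pairwise scores, each multiplied by w[(i, j)] (default 1)."""
    total = 0
    for i, j in combinations(range(len(msa)), 2):
        pair = sum(1 for x, y in zip(msa[i], msa[j]) if x != y)
        total += w.get((i, j), 1) * pair
    return total


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
print(weighted_sp(msa, {}))           # 19, the unweighted SP score
print(weighted_sp(msa, {(0, 1): 2}))  # 21, the first pair (score 2) counted twice
```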

32
Heuristic Techniques
  • In practice, people tend to use more heuristic
    methods with no proven performance guarantees
  • Basic idea
  • Do some form of iterative or progressive
    alignment
  • For example, do an alignment based on a minimum
    spanning tree of some sort
  • Find two closest nodes and join them
  • how should we define closeness?
  • then iteratively add closest non-aligned node to
    the alignment

34
One method of defining closeness
  • sd(i,j) scores
  • given a scoring scheme
  • Compute D(Si, Sj)
  • 100 times do
  • Jumble Si and Sj and compute D(jum(Si),
    jum(Sj))
  • Compute mean and standard deviation of these 100
    jumbled comparisons
  • Define sd(i,j) = D(Si, Sj) / standard deviation (no
    mean?)
  • Intuition
  • Strings Si and Sj contain non-random structures
    (hopefully secondary structure) in common if
    sd(i,j) is high
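The jumbling procedure is easy to sketch. The code below assumes unit-cost edit distance for D and follows the slide literally by dividing by the standard deviation only; it also returns the jumbled mean so that a mean-corrected variant can be formed. The function names and the fixed seed are illustrative. Note that with a distance (lower is closer) rather than a similarity score, "high" and "low" swap roles.

```python
import random
import statistics


def edit_distance(s, t):
    """Unit-cost edit distance, standing in for D(Si, Sj)."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def jumble(s, rng):
    """Random permutation of the characters of s."""
    chars = list(s)
    rng.shuffle(chars)
    return "".join(chars)


def sd_score(s, t, trials=100, seed=0):
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    observed = edit_distance(s, t)
    jumbled = [edit_distance(jumble(s, rng), jumble(t, rng)) for _ in range(trials)]
    mean = statistics.mean(jumbled)
    stdev = statistics.pstdev(jumbled)
    ratio = observed / stdev if stdev > 0 else None  # the slide's sd(i,j)
    return {"observed": observed, "mean": mean, "stdev": stdev, "ratio": ratio}


result = sd_score("ACGTACGTAC", "ACGTTCGTAC")
print(result["observed"], result["mean"], result["stdev"], result["ratio"])
```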

35
Multiple Sequence Alignment
  • Motivation
  • Definition
  • Scoring (Sum of Pairs scoring)
  • Algorithms
  • Family Representations
  • Profiles
  • Regular expressions/motifs

36
Representation Problem
  • Input
  • family of sequences that typically have a known
    biological similarity
  • Desired output
  • Representation of this family of sequences that
    reveals any string/sequence similarities that
    hopefully are related to their biological
    similarity

37
Profiles
  • Strings abca, ababa, accb, cbbc
  • a b c - a
  • a b a b a
  • a c c b -
  • c b - b c
  • Profile (% of each character per column)
  •      1   2   3   4   5
  •  a  75   0  25   0  50
  •  b   0  75   0  75   0
  •  c  25  25  50   0  25
  •  -   0   0  25  25  25
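The profile is just the per-column character frequencies of the alignment. A minimal sketch; `build_profile` is an illustrative name, and frequencies are returned as fractions rather than percentages.

```python
from collections import Counter


def build_profile(msa, symbols="abc-"):
    """Map each symbol to its list of per-column frequencies."""
    rows = len(msa)
    profile = {c: [] for c in symbols}
    for column in zip(*msa):
        counts = Counter(column)
        for c in symbols:
            profile[c].append(counts[c] / rows)
    return profile


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for symbol, freqs in build_profile(msa).items():
    print(symbol, freqs)
# a [0.75, 0.0, 0.25, 0.0, 0.5]
# b [0.0, 0.75, 0.0, 0.75, 0.0]
# c [0.25, 0.25, 0.5, 0.0, 0.25]
# - [0.0, 0.0, 0.25, 0.25, 0.25]
```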

38
Log odds ratio
  • Strings abca, ababa, accb, cbbc
  • a b c - a
  • a b a b a
  • a c c b -
  • c b - b c
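  • Profile (% of each character per column)
  •      1   2   3   4   5
  •  a  75   0  25   0  50
  •  b   0  75   0  75   0
  •  c  25  25  50   0  25
  •  -   0   0  25  25  25
  • p(a) = 6/20 = 30% (frequency of a in the whole alignment)
  • p(a,1) = 3/4 = 75% (frequency of a in column 1)
  • Entry for character x in column j is log(p(x,j)/p(x))
  • Example (ratios, without logs)
  •       1     2     3     4     5
  •  a   2.5    0    .83    0    1.7
  •  b    0    2.5    0    2.5    0
  •  c    1     1     2     0     1
  •  -    0     0    1.7   1.7   1.7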
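The ratio table can be recomputed from the alignment. In this sketch the background frequency p(x) is taken over all cells of the alignment, spaces included, which is what the 6/20 figure implies. The names are illustrative, and the log of a zero ratio is reported as negative infinity, an assumption the slides do not address.

```python
import math


def odds_ratios(msa, symbols="abc-"):
    """p(x,j)/p(x) for every symbol x and column j."""
    cells = "".join(msa)
    rows = len(msa)
    table = {}
    for x in symbols:
        background = cells.count(x) / len(cells)
        table[x] = [col.count(x) / rows / background for col in zip(*msa)]
    return table


def log_odds(ratios):
    """Natural log of each ratio; zero ratios map to -inf (an assumption)."""
    return {
        x: [math.log(r) if r > 0 else float("-inf") for r in row]
        for x, row in ratios.items()
    }


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
ratios = odds_ratios(msa)
for x, row in ratios.items():
    print(x, [round(r, 2) for r in row])
# a [2.5, 0.0, 0.83, 0.0, 1.67]
# b [0.0, 2.5, 0.0, 2.5, 0.0]
# c [1.0, 1.0, 2.0, 0.0, 1.0]
# - [0.0, 0.0, 1.67, 1.67, 1.67]
```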

39
Nice feature of profiles
  • Natural extension of alignment and scoring of
    strings to profiles
  • Aligning a string to a profile
  • We can generalize notions of pairwise string
    alignment
  • Scoring
  • Compute a weighted sum based on frequency of
    characters in the column
  • Can generalize to profile to profile alignments
  • Optimal alignment
  • Dynamic programming can solve

40
Signature representations
  • Signature or motif signature
  • pattern contained as a substring in most members
    of a family
  • typically represented as a regular expression
  • Such a regular expression might be derived given
    a multiple sequence alignment
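One naive way to derive such an expression from an alignment is to turn each column into a character class and mark columns containing a space as optional. This is only an illustrative sketch, and the result is far more permissive than a curated motif would be; `regex_from_alignment` is a made-up name.

```python
import re


def regex_from_alignment(msa):
    """Build a regular expression with one (possibly optional) class per column."""
    parts = []
    for column in zip(*msa):
        letters = sorted(set(column) - {"-"})
        if not letters:
            continue  # column of spaces only
        escaped = "".join(re.escape(c) for c in letters)
        piece = escaped if len(letters) == 1 else "[" + escaped + "]"
        if "-" in column:
            piece += "?"  # some family members skip this column
        parts.append(piece)
    return "".join(parts)


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
pattern = regex_from_alignment(msa)
print(pattern)  # [ac][bc][ac]?b?[ac]?
print(all(re.fullmatch(pattern, row.replace("-", "")) for row in msa))  # True
```

To search for the signature as a substring of a longer sequence, `re.search` would be used instead of `re.fullmatch`.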