Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

Slides: 40
Provided by: john244
Learn more at: https://www.cse.sc.edu

1
Bioinformatics Algorithms and Data Structures
  • Chapter 14.1-5 Multiple String Comparisons
  • Lecturer Dr. Rose
  • Slides by Dr. Rose
  • March 1, 2007

2
Multiple String Comparisons
  • Q: Why are we interested in multiple string
    comparisons?
  • A: At one level we are data-mining.
  • Looking for similarities
  • Common evolution
  • Common functionality
  • Significance of similarity may not be clear with
    only two strings.
  • Multiple string comparison is accomplished by
    multiple alignment.

3
Multiple String Comparisons
  • Defn. A global multiple alignment of k > 2 strings
    is:
  • A generalization of the alignment of 2 strings
  • Strings S1, S2, ..., Sk are inflated with spaces to
    achieve strings S1', S2', ..., Sk' of uniform
    length l.
  • Strings are arrayed in k rows of l columns.
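
The definition above can be checked mechanically. The following is a minimal sketch, not taken from the slides: the function name, the `-` space character, and the three short strings are all hypothetical choices made for illustration.

```python
def is_global_multiple_alignment(strings, rows, space="-"):
    """Return True if `rows` is a global multiple alignment of `strings`."""
    if not rows or len(rows) != len(strings):
        return False
    # Every inflated string must have the same length l.
    if len({len(row) for row in rows}) != 1:
        return False
    # Deleting the inserted spaces must give back each original string.
    return all(row.replace(space, "") == s for row, s in zip(rows, strings))


# Hypothetical example: k = 3 strings arrayed in 3 rows of l = 4 columns.
strings = ["ACGT", "AGT", "CGT"]
rows = ["ACGT", "A-GT", "-CGT"]
print(is_global_multiple_alignment(strings, rows))
```

The check only covers the two conditions stated in the definition (uniform length, and spaces as the only insertions); it says nothing about how good the alignment is.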

4
Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...

5
Example

6
Multiple String Comparisons
  • Consider the relation between two-string
    comparison and biological function:
  • Two-string alignments are used to find
    unsuspected biological relationships from apparent
    string similarity.
  • This follows from the first fact of biological
    sequence comparison: sequence similarity implies
    functional or structural similarity.

7
Multiple String Comparisons
  • Consider the relation between multiple string
    comparison and biological function:
  • Multiple string alignments are used to find
    unknown string similarities from known biological
    relationships.
  • This isn't as obvious, since there is a tendency
    to focus on one-dimensional sequences and not the
    corresponding three-dimensional structures or
    two-dimensional substructures.

8
Multiple String Comparisons
  • This follows from the second fact of biological
    sequences:
  • Strings that are functionally related can appear
    very different and yet preserve the same
    important three-dimensional and two-dimensional
    features.
  • There are several levels of abstraction entailed:
  • Three-dimensional structure
  • Functionality
  • Amino-acid sequence

9
Multiple String Comparisons
  • These different levels of abstraction are
    preserved/conserved to different degrees:
  • Three-dimensional structure is most preserved
  • Functionality is somewhat conserved
  • Amino-acid sequence is less likely to be conserved
  • Q: What point are we trying to make?
  • A: The significance is that similarity of
    structure may not be blatantly apparent at the
    sequence level.
  • ⇒ Comparison of multiple sequences highlights
    less apparent similarity.

10
Multiple String Comparisons
  • Example from text: Hemoglobin
  • 4 chains of about 140 amino acids apiece
  • Found in organisms from insects to mammals
  • Insects and vertebrates diverged 600 million
    years BP
  • Large number of amino acid mutations (about 100) per
    chain between the two sequences (insect and vertebrate)

11
Multiple String Comparisons
  • Comparison of two mammalian hemoglobin sequences:
  • Exhibits high amino-acid similarity (our cousin
    the chimpanzee shares the identical sequence)
  • Suggests similar functionality
  • Comparison of mammalian and insect hemoglobin
    sequences:
  • Exhibits little amino-acid similarity
  • However, they have similar functionality

12
Multiple String Comparisons
  • The important point is that while
  • sequence similarity ⇒ functional & structural
    similarity
  • The converse
  • functional & structural similarity ⇒ sequence
    similarity
  • is not true, i.e.,
  • functional & structural similarity ⇏ sequence
    similarity

13
Family & Superfamily Representation
  • Data Mining Problem
  • Given a set of biologically similar strings ⇒
    find the commonalities that characterize the
    family.
  • Why would we want to do this?
  • Conserved features may explain function &
    structure.
  • Characterization of the family may make it easy
    to recognize new members.
  • Characterization may also make it easier to
    exclude nonmembers.

14
Family & Superfamily Representation
  • Example: protein families
  • The similarity may be functionality or
  • Two- or three-dimensional structure
  • Specific Examples
  • globins (hemoglobins, myoglobins)
  • immunoglobulin (antibody) proteins

15
Family & Superfamily Representation
  • Q: Why would we be interested in identifying the
    family to which a protein belongs?
  • A: Family membership immediately clues us in on:
  • Physical structure
  • Biological functionality
  • The text suggests there are 100,000 proteins in
    humans but only 1000 or fewer protein families

16
Family & Superfamily Representation
  • Q: If we suspect that a new protein belongs to
    some family, how do we check?
  • Align the new protein sequence with a
    representative member of the family?
  • Align the new protein sequence with several
    representative members of the family?
  • Align the new protein sequence with a
    generalization of members of the family?
  • A: Align the new protein sequence with a
    generalization of members of the family.

17
Family & Superfamily Representation
  • Q: What is the representation of the
    generalization of members of the family?
  • Consider:
  • We want to match family members while
  • Excluding non-family members
  • This is an established area in machine learning.
  • In general, the key is that the representation
    language must be sufficiently expressive to
    distinguish between + and - examples.
  • Conjecture: amino acid strings lack sufficient
    expressiveness

18
Family & Superfamily Representation
  • Three common currently used representations:
  • Profile (based on multiple alignment)
  • Consensus sequence (based on multiple alignment)
  • Signature (some based on multiple alignment, some
    not)

19
Profile Representation
  • Defn. A profile (aka weight matrix) for a multiple
    alignment specifies the frequency of each
    character in each column.
  • Consider the following multiple alignment:
      a b c - a
      a b a b a
      a c c b -
      c b - b c
  • The corresponding extracted profile:
           C1   C2   C3   C4   C5
      a   .75       .25       .50
      b        .75       .75
      c   .25  .25  .50       .25
      -             .25  .25  .25
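
Extracting a profile from an alignment is a per-column frequency count. A minimal sketch, assuming the four aligned rows are written with `-` for the spaces (the function name `extract_profile` is hypothetical):

```python
from collections import Counter


def extract_profile(rows):
    """Return one {character: frequency} dict per column of the alignment."""
    k = len(rows)
    # zip(*rows) walks the alignment column by column.
    return [
        {ch: count / k for ch, count in Counter(column).items()}
        for column in zip(*rows)
    ]


alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
profile = extract_profile(alignment)
for j, column in enumerate(profile, start=1):
    print(f"C{j}", column)
```

Characters that never occur in a column are simply absent from that column's dict, which corresponds to the blank cells of the profile table.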

20
Profile Representation
  • Log-odds ratios: profile entries are sometimes
    expressed in this form.
  • Let p(y, j) denote the frequency of
    character y in column j.
  • Let p(y) denote the frequency of
    character y anywhere in the multiply aligned
    sequences.
  • log( p(y, j) / p(y) ) is the log-odds ratio for cell
    (y, j) of the profile (weight matrix).
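
The log-odds form can be sketched as follows. Several choices here are assumptions rather than anything the slides specify: the natural logarithm is used, the space character `-` is counted like any other character when computing p(y), and cells with p(y, j) = 0 are omitted instead of being assigned negative infinity or a pseudocount.

```python
import math
from collections import Counter


def log_odds_profile(rows):
    """Return one {character: log(p(y, j) / p(y))} dict per column."""
    k = len(rows)
    # p(y): frequency of y over every cell of the alignment.
    totals = Counter("".join(rows))
    n_cells = sum(totals.values())
    background = {y: count / n_cells for y, count in totals.items()}

    table = []
    for column in zip(*rows):
        # p(y, j): frequency of y within this column.
        freq = {y: count / k for y, count in Counter(column).items()}
        table.append({y: math.log(p / background[y]) for y, p in freq.items()})
    return table


table = log_odds_profile(["abc-a", "ababa", "accb-", "cb-bc"])
print(table[0])
```

A positive entry means the character is over-represented in that column relative to the alignment as a whole; a negative entry means it is under-represented.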

21
Profile Representation
  • Alignment of string S with profile P:
  • Insertion of spaces into S is allowed
  • Use regular string alignment?
  • Let C be a string of profile column positions
  • Align S by inserting spaces into S and C.

22
Profile Representation
  • Example: S = aabbc, P is the profile extracted
    earlier
           C1   C2   C3   C4   C5
      a   .75       .25       .50
      b        .75       .75
      c   .25  .25  .50       .25
      -             .25  .25  .25
  • Alignment of S and C:
      S  a a b - b c
      C  1 - 2 3 4 5
  • Q: How do we score such an alignment?

23
Profile Representation
  • Q: How do we score profile alignments?
  • Assume we have an alphabet-weight scoring scheme,
    e.g.,
           a    b    c    -
      a    2    1   -3   -1
      b    1    2   -1   -1
      c   -3   -1    2   -1
      -   -1   -1   -1    0
  • Column score: compute the weighted sum of scores
    based on the frequency of characters in the
    column.
  • Alignment score: sum the column scores.

24
Profile Representation
  • Alphabet-weight scoring scheme:
           a    b    c    -
      a    2    1   -3   -1
      b    1    2   -1   -1
      c   -3   -1    2   -1
      -   -1   -1   -1    0
  • Profile:
           C1   C2   C3   C4   C5
      a   .75       .25       .50
      b        .75       .75
      c   .25  .25  .50       .25
      -             .25  .25  .25
  • Compute the weighted sum of scores based on the
    frequency of characters in the column.
      S  a a b - b c
      C  1 - 2 3 4 5
  • Column 1 (a): 0.75 × 2 + 0.25 × (-3) = 0.75
  • Column 2 (b): 0.75 × 2 + 0.25 × (-1) = 1.25
  • Column 3 (-): 0.25 × 0 + 0.50 × (-1) + 0.25 × (-1) = -0.75
  • Column 4 (b): 0.75 × 2 + 0.25 × (-1) = 1.25
  • Column 5 (c): 0.50 × (-3) + 0.25 × 2 + 0.25 × (-1) = -1.25
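
The column scores above can be reproduced with a few lines of code. This is a sketch under stated assumptions: the scoring scheme is taken to be symmetric, with the entries the worked column scores rely on (s(a,c) = -3, s(b,c) = -1, every character-space score -1), and s(a,b) = 1 as printed. Scoring the character of S that sits opposite a space in C as s(x, -) is also an assumption.

```python
SPACE = "-"

# Assumed symmetric alphabet-weight scheme (upper triangle listed once).
_upper = {
    ("a", "a"): 2, ("a", "b"): 1, ("a", "c"): -3, ("a", SPACE): -1,
    ("b", "b"): 2, ("b", "c"): -1, ("b", SPACE): -1,
    ("c", "c"): 2, ("c", SPACE): -1,
    (SPACE, SPACE): 0,
}
s = {**_upper, **{(y, x): v for (x, y), v in _upper.items()}}

# Profile columns C1..C5; absent characters have frequency 0.
profile = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, SPACE: 0.25},
    {"b": 0.75, SPACE: 0.25},
    {"a": 0.50, "c": 0.25, SPACE: 0.25},
]


def column_score(x, column):
    """Weighted sum of s(x, y) over the characters y of one profile column."""
    return sum(s[(x, y)] * p for y, p in column.items())


def alignment_score(s_row, c_row):
    """Sum the scores of an alignment of S against column positions C."""
    total = 0.0
    for x, j in zip(s_row, c_row):
        if j is None:
            total += s[(x, SPACE)]  # character of S opposite a space in C
        else:
            total += column_score(x, profile[j - 1])
    return total


s_row = ["a", "a", "b", "-", "b", "c"]
c_row = [1, None, 2, 3, 4, 5]
print(alignment_score(s_row, c_row))
```

Under these assumptions the five column scores are 0.75, 1.25, -0.75, 1.25 and -1.25, and the unmatched `a` contributes -1.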

25
Profile Representation
  • Q: How do we find optimal alignments?
  • A: Use dynamic programming to maximize
    similarity.
  • As before:
  • s(x, y) denotes the alphabet-weight assignment
    for aligning x & y.
  • p(y, j) denotes the frequency of letter y in
    column j.
  • Then let S(x, j) denote Σ_y s(x, y) × p(y, j),
    the score for aligning x with column j.

26
Profile Representation
  • Defn. Let V(i, j) denote the value of the optimal
    alignment of S[1..i] with the first j columns of
    C.
  • Then V(0, j) = Σ_{k ≤ j} S(_, k)
  • And V(i, 0) = Σ_{k ≤ i} S(S1(k), _)
  • Here S1(k) denotes the kth character of the first
    string argument, i.e., of S, and S(x, _) = s(x, _)
    is the score for aligning x with a space.

27
Profile Representation
  • The general recurrence is then
  • V(i, j) = max of:
  • V(i - 1, j - 1) + S(S1(i), j)   (match the ith letter with the jth column)
  • V(i - 1, j) + S(S1(i), _)   (insert a gap in the profile)
  • V(i, j - 1) + S(_, j)   (insert a gap in S1)
  • Q: What is the time complexity for solving this
    recurrence using DP?

28
Profile Representation
  • The time complexity is O(σmn) for DP
  • Where
  • n is the length of the string S,
  • m is the length of the profile, and
  • σ is the size of the alphabet.
  • O(σmn) is more costly than sequence-to-sequence
    alignment. (Do you recall what that cost was?)
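
A minimal sketch of the string-to-profile recurrence, showing where the three factors of the running time come from: the table has (n + 1) × (m + 1) cells, and each cell evaluates a weighted sum over at most σ characters. The function names, the dict-based profile, and the scoring values (a symmetric scheme with s(a,c) = -3, s(b,c) = -1, character-space -1, s(a,b) = 1) are assumptions made for illustration.

```python
SPACE = "-"


def symmetric(upper):
    """Expand an upper-triangle score table into a symmetric lookup."""
    return {**upper, **{(y, x): v for (x, y), v in upper.items()}}


def align_to_profile(S, profile, s):
    """Return V(n, m), the optimal score of aligning S with the profile."""
    n, m = len(S), len(profile)

    def col(x, j):
        # Score of x against column j (1-based): O(sigma) work per call.
        return sum(s[(x, y)] * p for y, p in profile[j - 1].items())

    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):            # V(0, j): spaces against columns
        V[0][j] = V[0][j - 1] + col(SPACE, j)
    for i in range(1, n + 1):            # V(i, 0): characters against spaces
        V[i][0] = V[i - 1][0] + s[(S[i - 1], SPACE)]
    for i in range(1, n + 1):
        x = S[i - 1]
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + col(x, j),      # x matched with column j
                V[i - 1][j] + s[(x, SPACE)],      # gap in the profile
                V[i][j - 1] + col(SPACE, j),      # gap in S
            )
    return V[n][m]


scores = symmetric({
    ("a", "a"): 2, ("a", "b"): 1, ("a", "c"): -3, ("a", SPACE): -1,
    ("b", "b"): 2, ("b", "c"): -1, ("b", SPACE): -1,
    ("c", "c"): 2, ("c", SPACE): -1, (SPACE, SPACE): 0,
})
example_profile = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, SPACE: 0.25},
    {"b": 0.75, SPACE: 0.25},
    {"a": 0.50, "c": 0.25, SPACE: 0.25},
]
print(align_to_profile("aabbc", example_profile, scores))
```

The sketch returns only the optimal value; recovering the alignment itself would need the usual traceback pointers, which are left out to keep the example short.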

29
Signature Representation
  • This representation is used by protein databases
    such as
  • PROSITE
  • BLOCKS
  • The core idea is that families of proteins are
    characterized by motifs or sequence signatures.
  • Q: What is a motif?
  • A: (Webster) A usually repeating salient thematic
    element

30
Signature Representation
  • Example from text
  • [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
  • Where
  • A bracket indicates alternative amino acids
  • & denotes any of I, L, V, M, F, Y, W
  • x denotes any amino acid.
  • The subscript denotes the length of the string; n
    denotes an arbitrary length.

31
Signature Representation
  • Example from text
  • [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
  • Observations
  • The representation is a generalization
  • The generalization is a regular expression

32
Signature Representation
  • Signature
  • [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
  • Matches
  • HADDITIIIIQGIIIIIIIA
  • IADDITIIIIQGIIIIIIIA
  • LADDITIIIIQGIIIIIIIA
  • VADDITIIIIQGIIIIIIIA
  • MADDITIIIIQGIIIIIIIA

33
Signature Representation
  • Regular expression representation
  • use regular expression pattern matching.
  • no need to worry about mismatches/errors.
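
Because a signature is a regular expression, it can be handed to an ordinary regex engine. The sketch below assumes the signature reads [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A], with & standing for any of I, L, V, M, F, Y, W, and it makes two further assumptions: x ranges over the 20 standard amino-acid letters, and "arbitrary length" includes zero. The function name is hypothetical.

```python
import re

AMP = "ILVMFYW"                  # the group abbreviated by '&'
ANY = "[ACDEFGHIKLMNPQRSTVWY]"   # x: any standard amino acid

SIGNATURE = re.compile(
    f"[{AMP}H][{AMP}A]D[DE]"     # [&H][&A]D[DE]
    f"{ANY}*"                    # x_n, arbitrary length (assumed n >= 0)
    f"[TSN]{ANY}{{4}}"           # [TSN]x_4
    f"[QK]G{ANY}{{7}}"           # [QK]Gx_7
    f"[{AMP}A]"                  # [&A]
)


def matches_signature(protein):
    """Return True if the signature occurs anywhere in the sequence."""
    return SIGNATURE.search(protein) is not None


for seq in ["HADDITIIIIQGIIIIIIIA", "MADDITIIIIQGIIIIIIIA", "HADDITIIIQGIIIIIIIA"]:
    print(seq, matches_signature(seq))
```

`search` is used rather than `fullmatch` on the assumption that a signature marks a region inside a longer protein; the third sequence has only three residues where x_4 requires four, so it is not expected to match.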

34
Computing Multiple Alignments
  • Recall that two-string local alignment was defined in
    terms of global alignment of substrings.
  • We take the same approach for multiple string
    local alignment.
  • Defn. A local multiple alignment of a set S of
    strings is obtained by selecting one substring
    Si' from each Si ∈ S and then globally aligning
    these substrings.

35
Computing Multiple Alignments
  • Q: Global vs. local alignment: which should we
    prefer?
  • Wait for someone to respond!
  • Gusfield notes that for both
  • Pairs of sequences and
  • Multiple sequences
  • there are biological justifications for
    preferring local over global alignment.
  • But...

36
Computing Multiple Alignments
  • But...
  • The best (computer science) theoretical results
    are for global alignment.
  • Like the joke about the lost wallet, Gusfield
    chooses to emphasize global alignment.

37
Computing Multiple Alignments
  • Q: How can we generalize the concept of score to
    multiple alignments?
  • In other words, what objective function should we use?
  • We will consider three types of objective
    functions
  • Sum-of-pairs
  • Consensus
  • Tree

38
Computing Multiple Alignments
  • First we define the concepts of induced pairwise
    alignment and its corresponding score.
  • Defn. The induced pairwise alignment of strings
    Si and Sj is obtained from the global alignment M
    by removing all other rows.
  • Note: instances of matching spaces can be removed
    from the induced alignment.
  • Note: to score an induced pairwise alignment, any
    two-string alignment scoring scheme can be used.

39
Computing Multiple Alignments
  • Consider the following pairwise scoring scheme:
  • score = #mismatches + #spaces
  • In the following example:
      1  A A T - G G T T T
      2  A A - C G T T A T
      3  T A T C G - A A T
  • score(1,2) = 4
  • score(1,3) = 5
  • score(2,3) = 4
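
The three induced pairwise scores can be recomputed directly from the rows. A minimal sketch, assuming each mismatch and each space opposite a character costs 1 and that columns where both rows hold a space are dropped; the function names are hypothetical, and the final sum over all pairs is shown only as one natural way to combine the pairwise scores.

```python
from itertools import combinations


def induced_pair_score(row_a, row_b, space="-"):
    """Count mismatches plus spaces in the alignment induced on two rows."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == space and y == space:
            continue  # matching spaces are removed from the induced alignment
        if x != y:
            score += 1  # a mismatch, or a character opposite a space
    return score


def sum_over_pairs(rows):
    """Add up the induced pairwise scores of every pair of rows."""
    return sum(induced_pair_score(a, b) for a, b in combinations(rows, 2))


rows = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]
print(induced_pair_score(rows[0], rows[1]))  # rows 1 and 2
print(induced_pair_score(rows[0], rows[2]))  # rows 1 and 3
print(induced_pair_score(rows[1], rows[2]))  # rows 2 and 3
print(sum_over_pairs(rows))
```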