Multiple Sequence Alignment - PowerPoint PPT Presentation

Slides: 48
Provided by: sch17
Transcript and Presenter's Notes

1
Multiple Sequence Alignment
2
(No Transcript)
3
  • A highly conserved region in a multiple sequence
    alignment (MSA) may indicate functional
    importance.

4
Families
  • gene family: a set of homologous genes
  • protein family: a set of homologous
    proteins
  • examples
  • globin gene family
  • HOX gene family
  • serine/threonine kinase family

5
Protein families
  • The large majority of proteins come from no more
    than one thousand families (Chothia 1994)

6
Protein structure
  • amino acid sequence (primary)
  • three-dimensional structure
  • small scale (secondary)
  • alpha-helix, beta sheet, fold
  • large scale (tertiary)
  • domain
  • fully functional protein (quaternary)

7
Domains
  • A protein may be composed of several domains
  • Each domain carries a specific function
  • Structure is more likely to be conserved than
    sequence
  • One exon might represent one domain

8
Domains
9
Related Motivation
  • Gain insight into evolutionary history
  • By looking at the number of mutations necessary
    to go from one sequence to another, one can
    assess the time of divergence

10
(No Transcript)
11
(No Transcript)
12
Alternatives to SP score
  • What we have now (log-likelihood ratio)
  • A natural extension for aligning 3 sequences
    (can be unrealistically over-parameterized)
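
For context, the SP (sum-of-pairs) score that these alternatives replace can be sketched as follows. This is a minimal illustration only: the match, mismatch and gap values are toy assumptions, not values from the slides, and the alignment of VSNS, SNA and AS is hand-made for the example.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (toy values, assumed for illustration)."""
    if a == "-" and b == "-":
        return 0                      # gap-gap pairs are conventionally scored 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum-of-pairs score: sum pair scores over every column and every pair of rows."""
    assert len({len(row) for row in alignment}) == 1, "rows must have equal length"
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

# One possible (not necessarily optimal) alignment of VSNS, SNA and AS.
example = ["VSNS",
           "-SNA",
           "--AS"]
print(sp_score(example))  # -9
```

Because every pair of rows is scored independently, the SP score grows quadratically in the number of sequences, which is one motivation for looking at alternatives.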

13
(No Transcript)
14
(No Transcript)
15
Example
  • VSNS
  • SNA
  • AS

16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Carrillo-Lipman Algorithm
  • An attempt to reduce the volume of the dynamic
    programming matrix

21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
3 or more sequences
The optimal alignment path is contained in a
"polyhedron" close to the main diagonal. Here, a
polyhedron is a solid bounded by plane faces or by
more complicated two-dimensional surfaces. For
better visualization, the polyhedron's shadows
are displayed. While visiting a node and looking
for the minimum along all the incoming edges, we
can ignore those edges that are "coming from
outside the polyhedron", as in the top part of the
inset. On its top-left side, the cube is
"covered" by the polyhedron. Edges 1, 2, 3, 6
and 7 come from the inside, and edges 4 and
5 can be ignored.
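
The bound that defines such a polyhedron can be stated briefly. The block below is a sketch in cost-minimizing form (matching the "minimum along all the incoming edges" wording); the symbols $C$, $U$ and $\hat{a}^{kl}$ are notation assumed here and do not appear in the slides.

```latex
% a^*        : an optimal multiple alignment under the sum-of-pairs cost
% a^{*kl}    : its projection onto sequences k and l
% \hat a^{kl}: an optimal pairwise alignment of sequences k and l
% U          : cost of any heuristic multiple alignment (an upper bound on C(a^*))
\begin{align*}
C(a^*) &= \sum_{k' < l'} C\!\left(a^{*k'l'}\right) \le U,
\qquad
C\!\left(a^{*k'l'}\right) \ge C\!\left(\hat a^{k'l'}\right) \\
\Rightarrow\quad
C\!\left(a^{*kl}\right) &\le U - \sum_{(k',l') \ne (k,l)} C\!\left(\hat a^{k'l'}\right)
\end{align*}
```

Only cells whose best pairwise path through them satisfies this inequality for every pair $(k, l)$ need to be visited, which is what restricts the search to a region near the main diagonal.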
27
Progressive Alignment Methods
  • Most commonly used approach to multiple alignment

28
Progressive Methods
  • Start with the most closely related sequences, then
    progressively add less related sequence(s) to the
    initial alignment

29
Guide Tree for Progressive Methods
Do NOT confuse with a phylogenetic tree
30
Ad hoc Guide Tree Building
First construct a distance matrix of all pairwise
distances
31
Joining Nearest Neighbors
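
The two steps named on slides 30 and 31 (build a pairwise distance matrix, then repeatedly join the nearest clusters) can be sketched as below. The slides do not say which distance or linkage rule is used, so edit distance and average linkage are assumptions made purely for illustration, and `guide_tree` is a hypothetical helper name.

```python
from itertools import combinations

def edit_distance(s, t):
    """Levenshtein distance, an assumed stand-in for the pairwise distance."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]

def guide_tree(seqs):
    """Return (distance matrix, guide tree as nested tuples of sequence indices)."""
    d = [[edit_distance(a, b) for b in seqs] for a in seqs]

    def avg(x, y):
        # Average distance between the members of two clusters.
        return sum(d[i][j] for i in x[1] for j in y[1]) / (len(x[1]) * len(y[1]))

    clusters = [(i, [i]) for i in range(len(seqs))]    # (subtree, member indices)
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(((x[0], y[0]), x[1] + y[1]))   # join the nearest pair
    return d, clusters[0][0]

d, tree = guide_tree(["GATTACA", "GATTACC", "CCCCGGGG"])
print(tree)  # (2, (0, 1)): sequences 0 and 1 are joined first
```

The resulting tree only fixes the order in which sequences are added to the alignment, which is why it should not be read as a phylogeny.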
32
Preserving and Adding Gaps
33
  • An Example of Progressive Multiple Alignment

34
(No Transcript)
35
Problems of Progressive Alignment
  • No guarantee of a globally optimal multiple
    alignment
  • Initial choice of sequences affects the final
    alignment
  • When sequences are highly divergent, the
    progressive approach becomes less reliable

36
The CLUSTALW program
  • Fine-tuned version of the above algorithm
  • Sequences are weighted to account for biased
    representation in large sub-families.
  • Substitution matrix is chosen flexibly
  • Manipulation of gap penalties

37
Motif Representations
CGGCGCACTCTCGCCCG
CGGGGCAGACTATTCCG
CGGCGGCTTCTAATCCG
...
CGGGGCAGACTATTCCG
  1. Consensus: CGGNGCACANTCNTCCG
  2. Frequency Matrix
  3. Logo
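
The first two representations can be computed directly from aligned sites. The sketch below uses only the three distinct sequences visible on the slide (the full set is elided there), and the rule "write N when no base exceeds 50%" is an assumption, so its output is not expected to reproduce the slide's consensus exactly.

```python
from collections import Counter

def frequency_matrix(seqs, alphabet="ACGT"):
    """Per-position relative frequency of each base in equal-length sequences."""
    assert len({len(s) for s in seqs}) == 1, "sequences must be aligned"
    n = len(seqs)
    matrix = []
    for column in zip(*seqs):
        counts = Counter(column)
        matrix.append({b: counts[b] / n for b in alphabet})
    return matrix

def consensus(seqs, threshold=0.5):
    """Most frequent base per column, or 'N' if none exceeds `threshold` (assumed rule)."""
    out = []
    for freqs in frequency_matrix(seqs):
        base, f = max(freqs.items(), key=lambda kv: kv[1])
        out.append(base if f > threshold else "N")
    return "".join(out)

sites = ["CGGCGCACTCTCGCCCG",
         "CGGGGCAGACTATTCCG",
         "CGGCGGCTTCTAATCCG"]
print(consensus(sites))
print(frequency_matrix(sites)[3])   # base frequencies at the fourth position
```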
38
Logo explanation
  • The characters representing the sequence are
    stacked on top of each other for each position in
    the aligned sequences.
  • The height of each letter is made proportional to
    its frequency, with the most common one on top.
  • The height of the entire stack is then adjusted
    to signify the information content of the
    sequences at that position.

39
Information Content
  • Uncertainty
  • Information
  • Thomas D. Schneider and R. Michael Stephens,
    Nucleic Acids Research 18: 6097-6100 (1990)
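
The formulas for these two quantities did not survive in the transcript. As a hedged sketch of the usual definitions for DNA logos, uncertainty is the Shannon entropy of a column in bits and information is the maximum entropy (2 bits for four bases) minus that uncertainty. The small-sample correction used by Schneider and Stephens is deliberately omitted here, and the function names are hypothetical.

```python
import math
from collections import Counter

def information_content(column, alphabet="ACGT"):
    """Bits of information at one position: log2(|alphabet|) minus Shannon entropy."""
    n = len(column)
    counts = Counter(column)
    uncertainty = -sum((counts[b] / n) * math.log2(counts[b] / n)
                       for b in alphabet if counts[b])
    return math.log2(len(alphabet)) - uncertainty

def letter_heights(column, alphabet="ACGT"):
    """Height of each letter in the logo stack: frequency times information content."""
    r = information_content(column, alphabet)
    n = len(column)
    counts = Counter(column)
    return {b: counts[b] / n * r for b in alphabet if counts[b]}

print(information_content("AAAA"))  # 2.0 (fully conserved)
print(information_content("AACC"))  # 1.0
print(letter_heights("AACC"))       # {'A': 0.5, 'C': 0.5}
```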

40
Other MSA methods
  • Phylogenetic tree building
  • Alignment using the Sum-of-Pairs scoring scheme
    can be accomplished in a more probabilistic
    framework using a profile HMM
  • EM algorithm

41
Motif Sampler (EM)
  • Lawrence et al. 1993, Liu et al. 1995
  • Model the distribution of residues with
    multinomial distributions
  • One multinomial distribution per position within the motif
  • One background distribution for positions outside the motif
  • The motif location is missing!

42
Problem Description
  • Given a set of N sequences $S_1, \ldots, S_N$
  • of lengths $n_k$ ($k = 1, \ldots, N$)
  • Identify a single pattern of fixed width W
    within each of the N input sequences
  • $A = \{a_k\}$ ($k = 1, \ldots, N$): a set of starting positions
    for the common pattern within each sequence,
    $a_k = 1, \ldots, n_k - W + 1$
  • Objective: to find the best, defined as the
    most probable, common pattern

43
Algorithm: Initialization (1)
  • Choose random starting positions $a_k$ within
    the various sequences
  • $A = \{a_k\}$ ($k = 1, \ldots, N$): a set of starting
    positions for the common pattern within each
    sequence, $a_k = 1, \ldots, n_k - W + 1$

44
X X X X X X X X X X X X X ? ? Z
X X X X X X M X X X X
X X X X M X X X X X X X
X X X X X X M X X X
X X X X X X X X M X
X X M X X X X X X X

A G G A G C A A G A
A C A T C C A A G T
T C A T G A T A G T
T G T A A T G T C A
A A T G T G G T C A
$N = 6$, $W = 10$; $q_{1,A} = 3/5$, $q_{2,G} = 2/5$, $q_{1,G} = 0$
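
The frequencies quoted on this slide can be checked by counting bases column by column over the five aligned motif instances (the sixth sequence, Z, is left out). This is a minimal sketch with no pseudocounts, which is what the slide's values imply; `pattern_frequencies` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

# The five aligned motif instances shown on the slide, spaces removed.
motif_rows = ["AGGAGCAAGA",
              "ACATCCAAGT",
              "TCATGATAGT",
              "TGTAATGTCA",
              "AATGTGGTCA"]

def pattern_frequencies(rows, alphabet="ACGT"):
    """q[i][j]: fraction of rows with base j at motif position i (0-based i)."""
    n = len(rows)
    table = []
    for column in zip(*rows):
        counts = Counter(column)
        table.append({b: Fraction(counts[b], n) for b in alphabet})
    return table

q = pattern_frequencies(motif_rows)
print(q[0]["A"], q[1]["G"], q[0]["G"])  # 3/5 2/5 0
```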
45
Algorithm: Predictive Update (2)
  • One of the N sequences, Z, is chosen either at
    random or in a specified order.
  • The pattern description $q_{ij}$ and background
    frequencies $q_{0j}$ are then calculated excluding Z.

46
  • Calculate the new multinomial frequencies $q_{ij}$ if the
    motif starts at a given location in Z
  • The background frequencies $q_{0j}$ are calculated analogously,
    with counts taken over all non-motif positions
  • Find the most reasonable location in Z
  • Iterate!
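
Taken together, the initialization and update steps can be sketched as the loop below. Several choices are assumptions rather than details from the slides: a pseudocount of 1 per base, visiting the sequences in order, 0-based start positions, and drawing the new position at random in proportion to the ratio of pattern to background probability (taking the maximum instead would be a greedy variant). Sequences are assumed to contain only A, C, G and T.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def frequencies(seqs, starts, width, skip, pseudo=1.0):
    """Pattern q[i][j] and background q0[j], computed without sequence `skip`."""
    motif = [Counter() for _ in range(width)]
    background = Counter()
    for k, (s, a) in enumerate(zip(seqs, starts)):
        if k == skip:
            continue
        for pos, base in enumerate(s):
            if a <= pos < a + width:
                motif[pos - a][base] += 1      # inside the current motif window
            else:
                background[base] += 1          # any non-motif position

    def normalize(c):
        total = sum(c[b] for b in ALPHABET) + pseudo * len(ALPHABET)
        return {b: (c[b] + pseudo) / total for b in ALPHABET}

    return [normalize(c) for c in motif], normalize(background)

def gibbs_motif_sampler(seqs, width, iterations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting positions a_k (0-based here, 1-based on the slides).
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):             # Step 2: pick Z in a fixed order
            q, q0 = frequencies(seqs, starts, width, skip=z)
            weights = []
            for x in range(len(seqs[z]) - width + 1):
                ratio = 1.0
                for i, base in enumerate(seqs[z][x:x + width]):
                    ratio *= q[i][base] / q0[base]
                weights.append(ratio)          # pattern vs. background at start x
            # Sample a new start for Z in proportion to the ratios.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

toy = ["ACGTACGTTTGACA", "TTGACAACGTACGG", "GGTTGACACCGTAA"]
print(gibbs_motif_sampler(toy, width=6, iterations=50, seed=1))
```

The pseudocounts keep every ratio strictly positive, so a base that has not yet been seen at a motif position does not rule a location out entirely.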

47
X X X X X X X X X X X X ? ? Z
X X X X X M X X X X
X X M X X X X X X X
X X X X X X M X X X
X X X X X X X X M X
X X M X X X X X X X
A T G G C T A A G C C A T T A A T C G C
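$Q_x = q_{3,G} \times q_{4,G} \times q_{5,C} \times q_{6,T} \times q_{7,A} \times q_{8,A} \times q_{9,G} \times q_{10,C} \times q_{11,C} \times q_{12,A}$
$B_x = q_{0,G} \times q_{0,G} \times q_{0,C} \times q_{0,T} \times q_{0,A} \times q_{0,A} \times q_{0,G} \times q_{0,C} \times q_{0,C} \times q_{0,A}$
$A_x = Q_x / B_x$
Select a set of $a_k$'s that maximizes the product
of these ratios, or $F$:
$F = \sum_{i=1}^{W} \sum_{j \in \{A,T,G,C\}} c_{i,j} \log(q_{ij} / q_{0j})$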
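
The objective F can be evaluated for any candidate set of aligned motif instances. The sketch below assumes that the pattern frequencies are plain relative counts (no pseudocounts) and that the background frequencies are supplied by the caller; the uniform background in the usage lines is only an illustration, and `alignment_score` is a hypothetical name.

```python
import math
from collections import Counter

def alignment_score(rows, q0, alphabet="ACGT"):
    """F = sum over positions i and bases j of c_ij * log(q_ij / q_0j)."""
    n = len(rows)
    f = 0.0
    for column in zip(*rows):                # one column per motif position i
        counts = Counter(column)
        for base in alphabet:
            c = counts[base]                 # c_ij: count of base j at position i
            if c:                            # a zero count contributes nothing
                f += c * math.log((c / n) / q0[base])
    return f

uniform = {b: 0.25 for b in "ACGT"}          # assumed background for the example
print(alignment_score(["AC", "AC"], uniform))            # 4 * log(4), fully conserved
print(alignment_score(["A", "C", "G", "T"], uniform))    # 0.0, no signal
```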