Multiple Sequence Alignment

Slides: 48

Provided by: sch17

Transcript and Presenter's Notes

1
Multiple Sequence Alignment
2
(No Transcript)
3

A highly conserved region in a multiple sequence
alignment (MSA) may indicate functionally
important residues.

4
Families

gene family: a set of homologous genes
protein family: a set of homologous proteins
examples:
globin gene family
HOX gene family
serine/threonine kinase family

5
Protein families

The large majority of proteins come from no more
than one thousand families (Chothia 1994)

6
Protein structure

amino acid sequence (primary)
three-dimensional structure
small scale (secondary)
alpha-helix, beta sheet, fold
large scale (tertiary)
domain
fully functional protein (quaternary)

7
Domains

A protein is composed of several domains
Each domain carries a specific function
Structure is more likely to be conserved than
sequence
One exon might represent one domain

8
Domains
9
Related Motivation

Gain insight into evolutionary history
By looking at the number of mutations necessary
to go from one sequence to another, one can
assess the time of divergence

10
(No Transcript)
11
(No Transcript)
12
Alternatives to SP score

What we have now (log-likelihood ratio)
A natural extension for aligning 3 sequences
(can be unrealistically over-parameterized)

13
(No Transcript)
14
(No Transcript)
15
Example

VSNS
SNA
AS
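
The three example sequences above are short enough to align exactly. The following is a minimal sketch of dynamic programming over a three-dimensional lattice with a sum-of-pairs column score. The scoring values (match 1, mismatch -1, gap -2, gap against gap 0) and the function names are assumptions made for illustration; the slides do not state which values were used.

```python
from itertools import product

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols (assumed toy scoring scheme)."""
    if a == GAP and b == GAP:
        return 0  # a gap aligned with a gap costs nothing
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def column_score(col):
    """Sum-of-pairs score of a single alignment column."""
    return sum(pair_score(col[i], col[j])
               for i in range(len(col)) for j in range(i + 1, len(col)))

def align3(s1, s2, s3):
    """Exact alignment of three sequences; returns (score, aligned rows)."""
    seqs = (s1, s2, s3)
    # The 7 possible moves: advance in at least one of the three sequences.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    end = (len(s1), len(s2), len(s3))
    best = {(0, 0, 0): (0, None)}  # cell -> (best score, move that reached it)
    # Lexicographic order guarantees every predecessor is filled in first.
    for cell in product(*(range(n + 1) for n in end)):
        if cell == (0, 0, 0):
            continue
        candidates = []
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue
            col = [seqs[k][prev[k]] if m[k] else GAP for k in range(3)]
            candidates.append((best[prev][0] + column_score(col), m))
        best[cell] = max(candidates)
    # Trace back from the far corner to the origin.
    rows, cell = [[], [], []], end
    while cell != (0, 0, 0):
        m = best[cell][1]
        prev = tuple(c - d for c, d in zip(cell, m))
        for k in range(3):
            rows[k].append(seqs[k][prev[k]] if m[k] else GAP)
        cell = prev
    return best[end][0], ["".join(reversed(r)) for r in rows]

score, rows = align3("VSNS", "SNA", "AS")
print(score)
print("\n".join(rows))
```

The table has (n1+1)(n2+1)(n3+1) cells and each cell looks at 7 predecessors, which is why the exact method becomes impractical as the number or length of sequences grows.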

16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Carrillo-Lipman Algorithm

-- an attempt to reduce the volume of the dynamic
programming matrix

21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
3 or more sequences
The optimal alignment path is contained in a
"polyhedron" close to the main diagonal. Here, a
polyhedron is a solid formed by plane faces, or
more complicated 2-dimensional surfaces. For
better visualization, the polyhedron's shadows
are displayed. While visiting a node and looking
for the minimum along all the incoming edges, we
can ignore those edges that are "coming from
outside the polyhedron", as in the top part of the
inset. On its top-left side, the cube is
"covered" by the polyhedron. The edges 1, 2, 3, 6
and 7 are coming from the inside, and edges 4 and
5 can be ignored.
27
Progressive Alignment Methods

Most commonly used approach to multiple alignment

28
Progressive Methods

Start with the most closely related sequences, then
progressively add less related sequence(s) to the
initial alignment

29
Guide Tree for Progressive Methods
Do NOT confuse with a phylogenetic tree
30
Ad hoc Guide Tree Building
First construct a distance matrix of all pairwise
distances
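
A minimal sketch of this first step follows. The slide does not say which pairwise distance is used, so plain edit distance stands in here as an assumption; real tools usually derive distances from pairwise alignment scores. The function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of all pairwise distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = edit_distance(seqs[i], seqs[j])
    return dist

for row in distance_matrix(["VSNS", "SNA", "AS"]):
    print(row)
```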
31
Joining Nearest Neighbors
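
Joining nearest neighbors can be sketched as repeatedly merging the two closest clusters until one tree remains. The average-linkage rule, the nested-tuple tree format, and the example distances below are assumptions for illustration; the slide does not specify the joining rule.

```python
def build_guide_tree(names, dist):
    """Repeatedly join the two closest clusters (average linkage)."""
    # Each cluster is (subtree, indices of the sequences it contains).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def cluster_distance(a, b):
        pairs = [(i, j) for i in a[1] for j in b[1]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(
            ((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
            key=lambda pq: cluster_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical distances between four sequences A, B, C, D.
names = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 7],
        [2, 0, 5, 8],
        [6, 5, 0, 3],
        [7, 8, 3, 0]]
print(build_guide_tree(names, dist))
```

The nested tuples give the order in which a progressive aligner would combine sequences and partial alignments, innermost pairs first.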
32
Preserving and Adding Gaps
33

An Example of Progressive Multiple Alignment

34
(No Transcript)
35
Problems of Progressive Alignment

No guarantee of the global optimal multiple
alignment
Initial choice of sequences affects the final
alignment
When sequences are highly divergent, the
progressive approach becomes less reliable

36
The CLUSTALW program

A fine-tuned version of the above algorithm
Sequences are weighted to account for biased
representation in large sub-families.
Substitution matrix is chosen flexibly
Manipulation of gap penalties

37
Motif Representations
CGGCGCACTCTCGCCCG
CGGGGCAGACTATTCCG
CGGCGGCTTCTAATCCG
...
CGGGGCAGACTATTCCG

Consensus
Frequency Matrix
Logo

Consensus: CGGNGCACANTCNTCCG
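
Two of these representations, the frequency matrix and the consensus, can be sketched in a few lines. The code uses only the four sites visible on the slide (the elided ones are unknown), so its consensus need not match the one shown. The 0.5 threshold for writing N is an assumption, as are the function names.

```python
from collections import Counter

def frequency_matrix(sites, alphabet="ACGT"):
    """Per-position base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: counts[b] / len(sites) for b in alphabet})
    return matrix

def consensus(matrix, threshold=0.5):
    """Most frequent base per position, or N if it is not above threshold."""
    out = []
    for freqs in matrix:
        base, freq = max(freqs.items(), key=lambda kv: kv[1])
        out.append(base if freq > threshold else "N")
    return "".join(out)

sites = [
    "CGGCGCACTCTCGCCCG",
    "CGGGGCAGACTATTCCG",
    "CGGCGGCTTCTAATCCG",
    "CGGGGCAGACTATTCCG",
]
matrix = frequency_matrix(sites)
print(consensus(matrix))
```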
38
Logo explanation

The characters representing the sequence are
stacked on top of each other for each position in
the aligned sequences.
The height of each letter is made proportional to
its frequency; the most common one is on top.
The height of the entire stack is then adjusted
to signify the information content of the
sequences at that position.

39
Information Content

Uncertainty
Information
Thomas D. Schneider and R. Michael Stephens,
Nucleic Acids Research 18: 6097-6100 (1990)
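
The formulas on this slide did not survive in the transcript. As a rough sketch of the idea, for DNA the uncertainty at a position is the Shannon entropy of its base frequencies, and the information is the maximum of 2 bits minus that uncertainty. The code below makes the simplifying assumption of ignoring the small-sample correction used in the cited paper; function names are illustrative.

```python
import math

def uncertainty(column, alphabet="ACGT"):
    """Shannon entropy (bits) of the base frequencies in one column."""
    n = len(column)
    h = 0.0
    for base in alphabet:
        f = column.count(base) / n
        if f > 0:
            h -= f * math.log2(f)
    return h

def information_content(column, alphabet="ACGT"):
    """Maximum possible uncertainty minus observed uncertainty (bits)."""
    return math.log2(len(alphabet)) - uncertainty(column, alphabet)

def letter_heights(column, alphabet="ACGT"):
    """Logo letter heights: frequency times the column's information."""
    info = information_content(column, alphabet)
    return {b: column.count(b) / len(column) * info for b in alphabet}

print(information_content("AAAA"))   # fully conserved column
print(information_content("ACGT"))   # all four bases equally frequent
print(letter_heights("AACC"))
```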

40
Other MSA methods

Phylogenetic tree building
Alignment using the Sum-of-Pairs scoring scheme
can be accomplished in a more probabilistic
framework using profile HMM
EM algorithm

41
Motif Sampler (EM)

Lawrence et al. 1993, Liu et al. 1995
Model the distribution of residues with
multinomial distributions
One multinomial distribution per position within the motif
One background distribution for positions outside the motif
The motif location is missing!

42
Problem Description

Given a set of N sequences S_1, ..., S_N
of lengths n_k (k = 1, ..., N)
Identify a single pattern of fixed width W
within each of the N input sequences
A = {a_k} (k = 1, ..., N): a set of starting positions
for the common pattern within each sequence,
a_k = 1, ..., n_k - W + 1
Objective: to find the best, defined as the
most probable, common pattern

43
Algorithm- Initialization (1)

Choose random starting positions a_k within
the various sequences
A = {a_k} (k = 1, ..., N): a set of starting
positions for the common pattern within each
sequence, a_k = 1, ..., n_k - W + 1
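
A minimal sketch of this initialization, assuming 1-based positions as on the slide and that every sequence is at least W long. The function name and the example sequences are made up for illustration.

```python
import random

def random_start_positions(seqs, width, rng=random):
    """Draw one start a_k per sequence, uniformly from 1..n_k - W + 1."""
    for s in seqs:
        if len(s) < width:
            raise ValueError("every sequence must be at least W long")
    # randint includes both endpoints, matching the 1-based range above.
    return [rng.randint(1, len(s) - width + 1) for s in seqs]

# Hypothetical input sequences.
seqs = ["ATGGCTAAGCCATTAATCGC", "GGATCCAAGTTTAC", "TTCATGATAGTAAGGC"]
print(random_start_positions(seqs, width=10, rng=random.Random(0)))
```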

44
X X X X X X X X X X X X X ? ? Z
X X X X X X M X X X X
X X X X M X X X X X X X
X X X X X X M X X X
X X X X X X X X M X
X X M X X X X X X X

A G G A G C A A G A
A C A T C C A A G T
T C A T G A T A G T
T G T A A T G T C A
A A T G T G G T C A
N = 6, W = 10; q_{1,A} = 3/5, q_{2,G} = 2/5, q_{1,G} = 0
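
The frequencies quoted on this slide can be reproduced by counting bases column by column over the five aligned segments (sequence Z is left out). This sketch uses raw frequencies, as the slide does; published descriptions of this kind of sampler typically add pseudocounts, which is not shown here. The function name is illustrative.

```python
from fractions import Fraction

def pattern_frequencies(segments, alphabet="ACGT"):
    """q[i][j]: fraction of segments with base j at motif position i."""
    n = len(segments)
    return [{b: Fraction(column.count(b), n) for b in alphabet}
            for column in zip(*segments)]

# The five aligned motif segments from the slide.
segments = [
    "AGGAGCAAGA",
    "ACATCCAAGT",
    "TCATGATAGT",
    "TGTAATGTCA",
    "AATGTGGTCA",
]
q = pattern_frequencies(segments)
print(q[0]["A"], q[1]["G"], q[0]["G"])  # positions 1 and 2 of the motif
```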
45
Algorithm- Predictive Update (2)

One of the N sequences, Z, is chosen either at
random or in a specified order.
The pattern description q_{i,j} and background
frequency q_{0,j} are then calculated excluding Z.

Calculate the new multinomial frequencies if the
motif starts at a given location in Z
(the background frequencies are calculated analogously,
with counts taken over all non-motif positions)
Find the most reasonable location in Z
Iterate!
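
One pass of this update might look like the sketch below. It is written under several assumptions not stated on the slides: 0-based start positions, a pseudocount of 1 per base, and picking the single best location in Z. A sampling variant would instead draw the location with probability proportional to the ratio. All names and the example sequences are illustrative.

```python
from collections import Counter

ALPHABET = "ACGT"

def update_one(seqs, starts, z, width, pseudo=1.0):
    """Return a new 0-based motif start for sequence z, given the others."""
    motif_counts = [Counter() for _ in range(width)]
    background = Counter()
    for k, (s, a) in enumerate(zip(seqs, starts)):
        if k == z:
            continue  # Z is excluded from both count tables
        for i in range(width):
            motif_counts[i][s[a + i]] += 1
        background.update(s[:a] + s[a + width:])  # non-motif positions
    n_other = len(seqs) - 1
    denom = n_other + pseudo * len(ALPHABET)
    q = [{b: (c[b] + pseudo) / denom for b in ALPHABET} for c in motif_counts]
    bg_total = sum(background[b] for b in ALPHABET)
    q0 = {b: (background[b] + pseudo) / (bg_total + pseudo * len(ALPHABET))
          for b in ALPHABET}

    def ratio(x):
        """Motif probability over background probability for start x."""
        r = 1.0
        for i in range(width):
            base = seqs[z][x + i]
            r *= q[i][base] / q0[base]
        return r

    return max(range(len(seqs[z]) - width + 1), key=ratio)

# Hypothetical sequences with the motif TATAAT planted in each.
seqs = ["GGGGTATAATGGGG", "CCTATAATCCCCCC", "GCGCGCTATAATGC", "CGTATAATCGCGCG"]
starts = [4, 2, 6, 0]          # the last entry is a deliberately wrong guess
starts[3] = update_one(seqs, starts, z=3, width=6)
print(starts)
```

Iterating means cycling z over all sequences and repeating until the starts stop changing or a fixed number of rounds has passed.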

47
X X X X X X X X X X X X ? ? Z
X X X X X M X X X X
X X M X X X X X X X
X X X X X X M X X X
X X X X X X X X M X
X X M X X X X X X X
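A T G G C T A A G C C A T T A A T C G C
Q_x = q_{3,G} × q_{4,G} × q_{5,C} × q_{6,T} × q_{7,A} × q_{8,A} × q_{9,G} × q_{10,C} × q_{11,C} × q_{12,A}
B_x = q_{0,G} × q_{0,G} × q_{0,C} × q_{0,T} × q_{0,A} × q_{0,A} × q_{0,G} × q_{0,C} × q_{0,C} × q_{0,A}
A_x = Q_x / B_x
Select a set of a_k's that maximizes the product
of these ratios, or F:
F = \sum_{i=1}^{W} \sum_{j \in \{A,T,G,C\}} c_{i,j} \log(q_{i,j} / q_{0,j})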
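
To compare two candidate sets of start positions, the score F can be computed directly from the counts. This is a sketch under assumptions the slide does not state: 0-based starts, a pseudocount of 1 per base in both the motif and background frequencies, and natural logarithms. The function name and sequences are illustrative.

```python
import math
from collections import Counter

def motif_score(seqs, starts, width, alphabet="ACGT", pseudo=1.0):
    """F = sum over positions i and bases j of c_ij * log(q_ij / q_0j)."""
    n = len(seqs)
    # c_ij: how often base j occurs at motif position i.
    motif_counts = [Counter(s[a + i] for s, a in zip(seqs, starts))
                    for i in range(width)]
    background = Counter()
    for s, a in zip(seqs, starts):
        background.update(s[:a] + s[a + width:])  # non-motif positions
    bg_total = sum(background[b] for b in alphabet)
    q0 = {b: (background[b] + pseudo) / (bg_total + pseudo * len(alphabet))
          for b in alphabet}
    score = 0.0
    for counts in motif_counts:
        for b in alphabet:
            q = (counts[b] + pseudo) / (n + pseudo * len(alphabet))
            score += counts[b] * math.log(q / q0[b])
    return score

# Hypothetical sequences with the motif TATAAT planted in each.
seqs = ["GGGGTATAATGGGG", "CCTATAATCCCCCC", "GCGCGCTATAATGC", "CGTATAATCGCGCG"]
print(motif_score(seqs, [4, 2, 6, 2], width=6))  # starts on the planted motif
print(motif_score(seqs, [0, 0, 0, 0], width=6))  # arbitrary starts
```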
