Title: Multiple Sequence Alignment
1Multiple Sequence Alignment
3- A highly conserved region in a multiple sequence
alignment (MSA) may indicate functionally important
sites.
4Families
- gene family: a set of homologous genes
- protein family: a set of homologous proteins
- examples
- globin gene family
- HOX gene family
- serine/threonine kinase family
5Protein families
- The large majority of proteins come from no more
than one thousand families (Chothia 1994)
6Protein structure
- amino acid sequence (primary)
- three-dimensional structure
- small scale (secondary)
- alpha-helix, beta sheet, fold
- large scale (tertiary)
- domain
- fully functional protein (quaternary)
7Domains
- A protein is composed of several domains
- Each domain carries a specific function
- Structure is more likely to be conserved than
sequence
- One exon might represent one domain
8Domains
9Related Motivation
- Gain insight into evolutionary history
- By looking at the number of mutations necessary
to go from one sequence to another, one can
assess the time of divergence
12Alternatives to SP score
- What we have now (log-likelihood ratio)
- A natural extension for aligning 3 sequences
(can be unrealistically over-parameterized)
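For reference, the SP (sum-of-pairs) score that these alternatives are measured against can be sketched as follows. This is a minimal illustration, not the scoring used in the lecture: the function name `sp_score` and the match, mismatch and gap values are assumptions, and a real scorer would take pair scores from a substitution matrix.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    The match/mismatch/gap values are illustrative assumptions; a real
    scorer would look pair scores up in a substitution matrix.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        # Score every unordered pair of characters in this column.
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue          # gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                total += gap      # residue aligned against a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

Because the score is a sum over all pairs of rows, its cost per column grows quadratically with the number of sequences.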
15Example
20Carrillo-Lipman Algorithm
- An attempt to reduce the volume of the dynamic
programming matrix
26 3 or more sequences
The optimal alignment path is contained in a
"polyhedron" close to the main diagonal. Here, a
polyhedron is a solid formed by plane faces, or
more complicated 2-dimensional surfaces. For
better visualization, the polyhedron's shadows
are displayed. While visiting a node and looking
for the minimum along all the incoming edges, we
can ignore those edges that are "coming from
outside the polyhedron", as in the top part of the
inset. On its top-left side, the cube is
"covered" by the polyhedron. The edges 1, 2, 3, 6
and 7 are coming from the inside, and edges 4 and
5 can be ignored.
27Progressive Alignment Methods
- Most commonly used approach to multiple alignment
28Progressive Methods
- Start with the most closely related sequences, then
progressively add less related sequence(s) to the
initial alignment
29Guide Tree for Progressive Methods
Do NOT confuse the guide tree with a phylogenetic tree
30Ad hoc Guide Tree Building
First construct a distance matrix of all pairwise
distances
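The distance matrix step can be sketched as below. The choice of distance is an assumption made for illustration: plain edit distance (unit cost for substitution, insertion and deletion) stands in for whatever pairwise alignment score a real program would use. The names `edit_distance` and `distance_matrix` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of all pairwise distances."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j])
    return d
```

For N sequences this needs N(N-1)/2 pairwise comparisons, which is usually the cheap part compared with aligning all sequences simultaneously.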
31Joining Nearest Neighbors
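One way to join nearest neighbors into a guide tree is sketched below. It assumes average linkage between clusters (UPGMA-style), which is only one of several possible joining rules. The function name `guide_tree` and the nested-tuple output format are hypothetical.

```python
def guide_tree(names, dist):
    """Build a guide tree by repeatedly joining the two closest clusters.

    names: list of sequence labels.
    dist:  symmetric distance matrix (list of lists), same order as names.
    Average linkage is an assumption made for this sketch.
    Returns a nested tuple, e.g. ("C", ("A", "B")).
    """
    # Each cluster is (subtree, list of member indices into dist).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def average_distance(c1, c2):
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        # Replace the two joined clusters by the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]
```

The order in which clusters are merged is the order in which a progressive method would align them.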
32Preserving and Adding Gaps
33- An Example of Progressive Multiple Alignment
35Problems of Progressive Alignment
- No guarantee of the global optimal multiple
alignment
- Initial choice of sequences affects the final
alignment
- When sequences are highly divergent, the
progressive approach becomes less reliable
36The CLUSTALW program
- Fine-tuned version of the above algorithm
- Sequences are weighted to account for biased
representation in large sub-families
- Substitution matrix is chosen flexibly
- Manipulation of gap penalties
37Motif Representations
CGGCGCACTCTCGCCCG
CGGGGCAGACTATTCCG
CGGCGGCTTCTAATCCG
...
CGGGGCAGACTATTCCG
- Consensus: CGGNGCACANTCNTCCG
- Frequency Matrix
- Logo
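A frequency matrix and a consensus string can be derived from aligned sites roughly as follows. This sketch uses only the three distinct sites listed on the slide, so its consensus need not match the one shown for the full set. The function names and the 0.5 threshold for writing N are assumptions.

```python
from collections import Counter

# The three distinct sites listed on the slide (the full set is elided there).
sites = [
    "CGGCGCACTCTCGCCCG",
    "CGGGGCAGACTATTCCG",
    "CGGCGGCTTCTAATCCG",
]


def frequency_matrix(sites, alphabet="ACGT"):
    """Per-position relative frequencies for equal-length aligned sites."""
    n = len(sites)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({base: counts[base] / n for base in alphabet})
    return matrix


def consensus(matrix, threshold=0.5):
    """Most frequent base per position, or 'N' if none exceeds threshold.

    The threshold value is an illustrative assumption.
    """
    out = []
    for freqs in matrix:
        base, freq = max(freqs.items(), key=lambda kv: kv[1])
        out.append(base if freq > threshold else "N")
    return "".join(out)


matrix = frequency_matrix(sites)
print(consensus(matrix))
```

The consensus discards how strongly each position is conserved, which is what the frequency matrix and the logo retain.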
38Logo explanation
- The characters representing the sequence are
stacked on top of each other for each position in
the aligned sequences.
- The height of each letter is made proportional to
its frequency; the most common one is on top.
- The height of the entire stack is then adjusted
to signify the information content of the
sequences at that position.
39Information Content
- Uncertainty
- Information
- Thomas D. Schneider and R. Michael Stephens,
Nucleic Acids Research, 18: 6097-6100 (1990)
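The two quantities can be computed per position as sketched below, using the Shannon uncertainty H = -sum f log2 f and the information R = log2(4) - H for DNA. The small-sample correction term used by Schneider and Stephens is deliberately left out here, and the function names are hypothetical.

```python
import math


def uncertainty(freqs):
    """Shannon uncertainty H = -sum(f * log2 f) in bits; 0*log(0) counts as 0."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)


def information_content(freqs, alphabet_size=4):
    """R = log2(alphabet_size) - H, in bits.

    Assumption: the small-sample correction term is omitted in this sketch.
    """
    return math.log2(alphabet_size) - uncertainty(freqs)


def letter_heights(freqs_by_base, alphabet_size=4):
    """Logo letter heights: each base's frequency times the stack height R."""
    r = information_content(freqs_by_base.values(), alphabet_size)
    return {base: f * r for base, f in freqs_by_base.items()}
```

A fully conserved DNA position carries 2 bits, and a position where all four bases are equally frequent carries 0 bits, so its logo stack has zero height.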
40Other MSA methods
- Phylogenetic tree building
- Alignment using the Sum-of-Pairs scoring scheme
can be accomplished in a more probabilistic
framework using profile HMMs
- EM algorithm
41Motif Sampler (EM)
- Lawrence et al. 1993, Liu et al. 1995
- Model the distribution of residues with
multinomial distributions
- One multinomial distribution per position within the motif
- One background distribution for positions outside the motif
- The motif locations are the missing data!
42Problem Description
- Given a set of $N$ sequences $S_1, \dots, S_N$
of length $n_k$ ($k = 1, \dots, N$)
- Identify a single pattern of fixed width $W$
within each of the $N$ input sequences
- $A = \{a_k\}$ ($k = 1, \dots, N$): a set of starting positions
for the common pattern within each sequence,
$a_k = 1, \dots, n_k - W + 1$
- Objective: to find the best, defined as the
most probable, common pattern
43Algorithm- Initialization (1)
- Choose random starting positions $a_k$ within
the various sequences
- $A = \{a_k\}$ ($k = 1, \dots, N$): a set of starting
positions for the common pattern within each
sequence, $a_k = 1, \dots, n_k - W + 1$
44X X X X X X X X X X X X X ? ? Z
X X X X X X M X X X X
X X X X M X X X X X X X
X X X X X X M X X X
X X X X X X X X M X
X X M X X X X X X X
A G G A G C A A G A
A C A T C C A A G T
T C A T G A T A G T
T G T A A T G T C A
A A T G T G G T C A
$N = 6$, $W = 10$; $q_{1,A} = 3/5$, $q_{2,G} = 2/5$, $q_{1,G} = 0$
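The frequencies quoted above can be checked directly from the five aligned motif segments on the slide, that is, all sequences except Z. The sketch below uses raw counts with no pseudocounts, which matches the quoted values but is an assumption about how the slide computed them. The name `pattern_frequencies` is hypothetical.

```python
from fractions import Fraction

# The five aligned motif segments from the slide (sequence Z is excluded).
segments = [
    "AGGAGCAAGA",
    "ACATCCAAGT",
    "TCATGATAGT",
    "TGTAATGTCA",
    "AATGTGGTCA",
]


def pattern_frequencies(segments, alphabet="ACGT"):
    """q[i][j]: fraction of segments with base j at motif position i (0-based).

    Raw counts only; pseudocounts are deliberately omitted in this sketch.
    """
    n = len(segments)
    return [
        {base: Fraction(column.count(base), n) for base in alphabet}
        for column in zip(*segments)
    ]


q = pattern_frequencies(segments)
print(q[0]["A"], q[1]["G"], q[0]["G"])  # slide positions 1 and 2 are indices 0 and 1
```

With raw counts some entries are exactly zero, as $q_{1,G}$ is here. A sampler would normally add pseudocounts so that no candidate location is ruled out entirely.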
45Algorithm- Predictive Update (2)
- One of the $N$ sequences, Z, is chosen either at
random or in specified order.
- The pattern description $q_{i,j}$ and background
frequency $q_{0,j}$ are then calculated excluding Z.
46- Calculate the new multinomial frequencies if the
motif starts at a given location in Z
- Background frequencies are calculated analogously, with counts taken over all
non-motif positions
- Find the most reasonable location in Z
- Iterate!
47X X X X X X X X X X X X ? ? Z
X X X X X M X X X X
X X M X X X X X X X
X X X X X X M X X X
X X X X X X X X M X
X X M X X X X X X X
A T G G C T A A G C C A T T A A T C G C
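$Q_x = q_{3,G} \times q_{4,G} \times q_{5,C} \times q_{6,T} \times q_{7,A} \times q_{8,A} \times q_{9,G} \times q_{10,C} \times q_{11,C} \times q_{12,A}$
$B_x = q_{0,G} \times q_{0,G} \times q_{0,C} \times q_{0,T} \times q_{0,A} \times q_{0,A} \times q_{0,G} \times q_{0,C} \times q_{0,C} \times q_{0,A}$
$A_x = Q_x / B_x$
Select a set of $a_k$'s that maximizes the product
of these ratios, or $F$:
$F = \sum_{1 \le i \le W} \sum_{j \in \{A,T,G,C\}} c_{i,j} \log(q_{i,j} / q_{0,j})$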
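The initialization, predictive update and iteration steps can be put together in a rough sketch like the one below. Several choices here are assumptions rather than the lecture's method: start positions are 0-based, a pseudocount of 1 is added to every count, and the new location in Z is drawn at random in proportion to $A_x$ (the Gibbs sampling choice) rather than taken as the single best location. All function names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def profile(seqs, starts, w, skip=None, pseudo=1.0):
    """Pattern frequencies q[i][j] and background q0[j], leaving out seqs[skip]."""
    counts = [{b: pseudo for b in ALPHABET} for _ in range(w)]
    background = {b: pseudo for b in ALPHABET}
    for k, (seq, a) in enumerate(zip(seqs, starts)):
        if k == skip:
            continue
        for pos, base in enumerate(seq):
            if a <= pos < a + w:
                counts[pos - a][base] += 1   # inside the motif window
            else:
                background[base] += 1        # non-motif position
    q = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    total = sum(background.values())
    q0 = {b: background[b] / total for b in ALPHABET}
    return q, q0


def gibbs_sampler(seqs, w, iterations=200, seed=0):
    """Return 0-based motif start positions, one per sequence."""
    rng = random.Random(seed)
    # Initialization: random start a_k in each sequence.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            # Predictive update: estimate q and q0 without sequence Z.
            q, q0 = profile(seqs, starts, w, skip=z)
            seq = seqs[z]
            # Weight A_x = Q_x / B_x for every candidate start x in Z.
            weights = [
                math.prod(q[i][seq[x + i]] / q0[seq[x + i]] for i in range(w))
                for x in range(len(seq) - w + 1)
            ]
            # Assumption: sample the new start in proportion to A_x.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


def score_f(seqs, starts, w, pseudo=1.0):
    """F = sum_i sum_j c_ij * log(q_ij / q_0j) for the current alignment."""
    q, q0 = profile(seqs, starts, w, pseudo=pseudo)
    return sum(
        math.log(q[i][seq[a + i]] / q0[seq[a + i]])
        for seq, a in zip(seqs, starts)
        for i in range(w)
    )
```

Because the sampler is stochastic, it is usually run from several random seeds, keeping the set of start positions with the highest `score_f`.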