Sequence Similarity - PowerPoint PPT Presentation

About This Presentation
Title:

Sequence Similarity

Description:

PSI-BLAST. Given a sequence query x, and database D ... BLAST Variants. BLASTN genomic sequences. BLASTP proteins ... Score of the global map is maximum ... – PowerPoint PPT presentation

Number of Views:26
Avg rating:3.0/5.0
Slides: 43
Provided by: root
Learn more at: https://web.stanford.edu
Category:
Tags: blast | map | sequence | similarity | y1

less

Transcript and Presenter's Notes

Title: Sequence Similarity


1
Sequence Similarity
2
The Viterbi algorithm for alignment
  • Compute the following matrices (DP)
  • M(i, j) most likely alignment of x1xi with
    y1yj ending in state M
  • I(i, j) most likely alignment of x1xi with
    y1yj ending in state I
  • J(i, j) most likely alignment of x1xi with
    y1yj ending in state J
  • M(i, j) log( Prob(xi, yj) )
  • max M(i-1, j-1) log(1-2?),
  • I(i-1, j) log(1-?),
  • J(i, j-1) log(1-?)
  • I(i, j) max M(i-1, j) log ?,
  • I(i-1, j) log ?

log(1 2?)
M P(xi, yj)
log Prob(xi, yj)
log(1 ?)
log(1 ?)
log ?
log ?
I P(xi)
J P(yj)
log ?
log ?
3
One way to view the state paths State M

y1
yn
x1

xm
4
State I

y1
yn
x1

xm
5
State J

y1
yn
x1

xm
6
Putting it all together
States I(i, j) are connected with states J and M
(i-1, j) States J(i, j) are connected with
states I and M (i-1, j) States M(i, j) are
connected with states J and I (i-1, j-1)

y1
yn
x1

xm
7
Putting it all together
States I(i, j) are connected with states J and M
(i-1, j) States J(i, j) are connected with
states I and M (i-1, j) States M(i, j) are
connected with states J and I (i-1, j-1) Optimal
solution is the best scoring path from top-left
to bottom-right corner This gives the likeliest
alignment according to our HMM

y1
yn
x1

xm
8
Yet another way to represent this model
Ix
Ix
BEGIN
END
Iy
Iy
Mx1
Mxm
Sequence X
We are aligning, or threading, sequence Y through
sequence X Every time yj lands in state xi, we
get substitution score s(xi, yj) Every time yj
is gapped, or some xi is skipped, we pay gap
penalty
9
From this model, we can compute additional
statistics
  • P(xi yj x, y) The probability that positions
    i, j align, given that sequences x and y align
  • P(xi yj x, y) ?a alignmentP(a x, y) 1(xi
    yj in a)
  • We will not cover the details, but
  • this quantity can also be
  • calculated with DP

log(1 2?)
M P(xi, yj)
log Prob(xi, yj)
log(1 ?)
log(1 ?)
log ?
log ?
I P(xi)
J P(yj)
log ?
log ?
10
Fast database search BLAST
  • (Basic Local Alignment Search Tool)
  • Main idea
  • Construct a dictionary of all the words in the
    query
  • Initiate a local alignment for each word match
    between query and DB
  • Running Time O(MN)
  • However, orders of magnitude faster than
    Smith-Waterman

query
DB
11
BLAST ? Original Version
  • Dictionary
  • All words of length k (11 nucl. 4 aa)
  • Alignment initiated between words of alignment
    score ? T (typically T k)
  • Alignment
  • Ungapped extensions until score below
    statistical threshold
  • Output
  • All local alignments with score gt statistical
    threshold


query

scan
DB
query
12
PSI-BLAST
  • Given a sequence query x, and database D
  • Find all pairwise alignments of x to sequences in
    D
  • Collect all matches of x to y with some minimum
    significance
  • Construct position specific matrix M
  • Each sequence y is given a weight so that many
    similar sequences cannot have much influence on a
    position (Henikoff Henikoff 1994)
  • Using the matrix M, search D for more matches
  • Iterate 14 until convergence

Profile M
13
BLAST Variants
  • BLASTN genomic sequences
  • BLASTP proteins
  • BLASTX translated genome versus proteins
  • TBLASTN proteins versus translated genomes
  • TBLASTX translated genome versus translated
    genome
  • PSIBLAST iterated BLAST search
  • http//www.ncbi.nlm.nih.gov/BLAST

14
Multiple Sequence Alignments
15
Protein Phylogenies
  • Proteins evolve by both duplication and species
    divergence

16
(No Transcript)
17
(No Transcript)
18
Definition
  • Given N sequences x1, x2,, xN
  • Insert gaps (-) in each sequence xi, such that
  • All sequences have the same length L
  • Score of the global map is maximum
  • A faint similarity between two sequences becomes
    significant if present in many
  • Multiple alignments can help improve the pairwise
    alignments

19
Scoring Function Sum Of Pairs
  • Definition Induced pairwise alignment
  • A pairwise alignment induced by the multiple
    alignment
  • Example
  • x AC-GCGG-C
  • y AC-GC-GAG
  • z GCCGC-GAG
  • Induces
  • x ACGCGG-C x AC-GCGG-C y AC-GCGAG
  • y ACGC-GAC z GCCGC-GAG z GCCGCGAG

20
Sum Of Pairs (contd)
  • Heuristic way to incorporate evolution tree

Human
Mouse
Duck
Chicken
  • Weighted SOP
  • S(m) ?kltl wkl s(mk, ml)
  • wkl weight decreasing with distance

21
A Profile Representation
- A G G C T A T C A C C T G T
A G C T A C C A - - - G C A G
C T A C C A - - - G C A G C
T A T C A C G G C A G C T A
T C G C G G A 1 1
.8 C .6 1 .4 1 .6
.2 G 1 .2 .2 .4
1 T .2 1 .6 .2 - .2
.8 .4 .8 .4
  • Given a multiple alignment M m1mn
  • Replace each column mi with profile entry pi
  • Frequency of each letter in ?
  • gaps
  • Optional gap openings, extensions, closings
  • Can think of this as a likelihood of each
    letter in each position

22
Multiple Sequence Alignments
  • Algorithms

23
Multidimensional DP
  • Generalization of Needleman-Wunsh
  • S(m) ?i S(mi)
  • (sum of column scores)
  • F(i1,i2,,iN) Optimal alignment up to (i1, ,
    iN)
  • F(i1,i2,,iN) max(all neighbors of
    cube)(F(nbr)S(nbr))

24
Multidimensional DP
  • Example in 3D (three sequences)
  • 7 neighbors/cell

F(i,j,k) max F(i-1,j-1,k-1)S(xi, xj,
xk), F(i-1,j-1,k )S(xi, xj, -
), F(i-1,j ,k-1)S(xi, -, xk), F(i-1,j
,k )S(xi, -, - ), F(i ,j-1,k-1)S( -,
xj, xk), F(i ,j-1,k )S( -, xj,
xk), F(i ,j ,k-1)S( -, -, xk)
25
Multidimensional DP
  • Running Time
  • Size of matrix LN
  • Where L length of each sequence
  • N number of sequences
  • Neighbors/cell 2N 1
  • Therefore O(2N LN)

26
Multidimensional DP
  • How do gap states generalize?
  • VERY badly!
  • Require 2N states, one per combination of
    gapped/ungapped sequences
  • Running time O(2N ? 2N ? LN) O(4N LN)
  • Running Time
  • Size of matrix LN
  • Where L length of each sequence
  • N number of sequences
  • Neighbors/cell 2N 1
  • Therefore O(2N LN)

Y
YZ
XY
XYZ
Z
X
XZ
27
Progressive Alignment
x
pxy
y
z
pxyzw
pzw
w
  • When evolutionary tree is known
  • Align closest first, in the order of the tree
  • In each step, align two sequences x, y, or
    profiles px, py, to generate a new alignment with
    associated profile presult
  • Weighted version
  • Tree edges have weights, proportional to the
    divergence in that edge
  • New profile is a weighted average of two old
    profiles

28
Progressive Alignment
x
y
Example Profile (A, C, G, T, -) px (0.8, 0.2,
0, 0, 0) py (0.6, 0, 0, 0, 0.4) s(px, py)
0.80.6s(A, A) 0.20.6s(C, A)
0.80.4s(A, -) 0.20.4s(C, -) Result pxy
(0.7, 0.1, 0, 0, 0.2) s(px, -) 0.81.0s(A, -)
0.21.0s(C, -) Result px- (0.4, 0.1, 0, 0,
0.5)
z
w
  • When evolutionary tree is known
  • Align closest first, in the order of the tree
  • In each step, align two sequences x, y, or
    profiles px, py, to generate a new alignment with
    associated profile presult
  • Weighted version
  • Tree edges have weights, proportional to the
    divergence in that edge
  • New profile is a weighted average of two old
    profiles

29
Progressive Alignment
x
y
?
z
w
  • When evolutionary tree is unknown
  • Perform all pairwise alignments
  • Define distance matrix D, where D(x, y) is a
    measure of evolutionary distance, based on
    pairwise alignment
  • Construct a tree
  • Align on the tree

30
Heuristics to improve alignments
  • Iterative refinement schemes
  • A-based search
  • Consistency
  • Simulated Annealing

31
Iterative Refinement
  • One problem of progressive alignment
  • Initial alignments are frozen even when new
    evidence comes
  • Example
  • x GAAGTT
  • y GAC-TT
  • z GAACTG
  • w GTACTG

Frozen!
Now clear correct y GA-CTT
32
Iterative Refinement
  • Algorithm (Barton-Stenberg)
  • For j 1 to N,
  • Remove xj, and realign to x1xj-1xj1xN
  • Repeat 4 until convergence

z
x
y
33
Iterative Refinement
  • Example align (x,y), (z,w), (xy, zw)
  • x GAAGTTA
  • y GAC-TTA
  • z GAACTGA
  • w GTACTGA
  • After realigning y
  • x GAAGTTA
  • y G-ACTTA 3 matches
  • z GAACTGA
  • w GTACTGA

Variant Refinement on a tree tree
partitioning
34
Iterative Refinement
  • Example align (x,y), (z,w), (xy, zw)
  • x GAAGTTA
  • y GAC-TTA
  • z GAACTGA
  • w GTACTGA
  • After realigning y
  • x GAAGTTA
  • y G-ACTTA 3 matches
  • z GAACTGA
  • w GTACTGA

35
Iterative Refinement
  • Example not handled well
  • x GAAGTTA
  • y1 GAC-TTA
  • y2 GAC-TTA
  • y3 GAC-TTA
  • z GAACTGA
  • w GTACTGA
  • Realigning any single yi changes nothing

36
Some Resources
  • http//www.ncbi.nlm.nih.gov/BLAST
  • BLAST PSI-BLAST
  • http//www.ebi.ac.uk/clustalw/
  • CLUSTALW most widely used
  • http//phylogenomics.berkeley.edu/cgi-bin/muscle/i
    nput_muscle.py
  • MUSCLE most scalable
  • http//probcons.stanford.edu/
  • PROBCONS most accurate

37
MUSCLE at a glance
  • Fast measurement of all pairwise distances
    between sequences
  • DDRAFT(x, y) defined in terms of common k-mers
    (k3) O(N2 L logL) time
  • Build tree TDRAFT based on DDRAFT, with a
    hierarchical clustering method (UPGMA)
  • Progressive alignment over TDRAFT, resulting in
    multiple alignment MDRAFT
  • Measure distances D(x, y) based on MDRAFT
  • Build tree T based on D
  • Progressive alignment over T, to build M
  • Iterative refinement for many rounds, do
  • Tree Partitioning Split M on one branch and
    realign the two resulting profiles
  • If new alignment M has better sum-of-pairs score
    than previous one, accept

38
PROBCONS Probabilistic Consistency-based
Multiple Alignment of Proteins
xi
yj
MATCH
INSERT X
INSERT Y
?
yj
39
A pair-HMM model of pairwise alignment
x
ABRACA-DABRA AB-ACARDI---
y
  • Parameterizes a probability distribution, P(A),
    over all possible alignments of all possible
    pairs of sequences
  • Transition probabilities gap penalties
  • Emission probabilities substitution matrix

40
Computing Pairwise Alignments
P(a)
aviterbi
P(a x, y)
  • The Viterbi algorithm
  • conditional distribution P(a x, y) reflects
    models uncertainty over the correct alignment
    of x and y
  • identifies highest probability alignment,
    aviterbi, in O(L2) time
  • Caveat the most likely alignment is not the most
    accurate
  • Alternative find the alignment of maximum
    expected accuracy

41
The Lazy-Teacher Analogy
  • 10 students take a 10-question true-false quiz
  • How do you make the answer key?
  • Approach 1 Use the answer sheet of the best
    student!
  • Approach 2 Weighted majority vote!

42
Viterbi vs. Maximum Expected Accuracy (MEA)
  • Viterbi
  • picks single alignment with highest chance of
    being completely correct
  • mathematically, finds the alignment a that
    maximizes
  • Ea1a a
  • Maximum Expected Accuracy
  • picks alignment with highest expected number of
    correct predictions
  • mathematically, finds the alignment a that
    maximizes
  • Eaaccuracy(a, a)

4. F
4. T
4. F
4. F
4. F
A-
A
B
A-
A
4. F
4. F
4. F
4. F
4. T
C
B
B
B-
B-
Write a Comment
User Comments (0)
About PowerShow.com