Computational Genomics Lecture - PowerPoint PPT Presentation

About This Presentation
Title:

Computational Genomics Lecture

Description:

substitutions, insertions, and deletions. Where does the scoring function come from ? ... x) is the score of inserting x. But how do we come up with the ' ... – PowerPoint PPT presentation

Number of Views:17
Avg rating:3.0/5.0
Slides: 60
Provided by: bch7
Category:

less

Transcript and Presenter's Notes

Title: Computational Genomics Lecture


1
Computational GenomicsLecture 3
  • Crash intro to ML
  • Scoring functions and DNA and AAs
  • Multiple sequence alignment

Background Readings Chapters 2.5, 2.7 in the
text book, Biological Sequence Analysis, Durbin
et al., 2001. Chapters 3.5.1- 3.5.3, 3.6.2 in
Introduction to Computational Molecular Biology,
Setubal and Meidanis, 1997.  Chapter 15 in
Gusfields book.
Much of this class has been edited from Nir
Friedmans lecture which is available at
www.cs.huji.ac.il/nir. Changes made by Dan
Geiger, then Shlomo Moran, and finally Benny Chor.
2
Scoring Functions, Reminder
  • So far, we discussed dynamic programming
    algorithms for
  • global alignment
  • local alignment
  • All of these assumed a scoring function
  • that determines the value of perfect matches,
  • substitutions, insertions, and deletions.

3
Where does the scoring function come from ?
  • We have defined an additive scoring function by
    specifying a function ?( ?, ? ) such that
  • ?(x,y) is the score of replacing x by y
  • ?(x,-) is the score of deleting x
  • ?(-,x) is the score of inserting x
  • But how do we come up with the correct score ?

Answer By encoding experience of what are
similar sequences for the task at hand.
Similarity depends on time, evolution trends, and
sequence types.
4
Why probability setting is appropriate to define
and interpret a scoring function ?
  • Similarity is probabilistic in nature because
    biological changes like mutation, recombination,
    and selection, are random events.
  • We could answer questions such as
  • How probable it is for two sequences to be
    similar?
  • Is the similarity found significant or spurious?
  • How to change a similarity score when, say,
    mutation rate of a specific area on the
    chromosome becomes known ?

5
A Probabilistic Model
  • For starters, will focus on alignment without
    indels.
  • For now, we assume each position (nucleotide
    /amino-acid) is independent of other positions.
  • We consider two options
  • M the sequences are Matched (related)
  • R the sequences are Random (unrelated)

6
Unrelated Sequences
  • Our random model R of unrelated sequences is
    simple
  • Each position is sampled independently from a
    distribution over the alphabet ?
  • We assume there is a distribution q(?) that
    describes the probability of letters in such
    positions.
  • Then

7
Related Sequences
  • We assume that each pair of aligned positions
    (si,ti) evolved from a common ancestor
  • Let p(a,b) be a distribution over pairs of
    letters.
  • p(a,b) is the probability that some ancestral
    letter evolved into this particular pair of
    letters.

Compare to
8
Odd-Ratio Test for Alignment
If Q gt 1, then the two strings s and t are more
likely to be related (M) than unrelated (R). If
Q lt 1, then the two strings s and t are more
likely to be unrelated (R) than related (M).
9
Log Odd-Ratio Test for Alignment
  • Taking logarithm of Q yields

Score(si,ti)
If log Q gt 0, then s and t are more likely to be
related. If log Q lt 0, then they are more likely
to be unrelated. How can we relate this
quantity to a score function ?
10
Probabilistic Interpretation of Scores
  • We define the scoring function via
  • Then, the score of an alignment is the log-ratio
    between the two models
  • Score gt 0 ? Model is more likely
  • Score lt 0 ? Random is more likely

11
Modeling Assumptions
  • It is important to note that this interpretation
    depends on our modeling assumption!!
  • For example, if we assume that the letter in each
    position depends on the letter in the preceding
    position, then the likelihood ratio will have a
    different form.
  • If we assume, for proteins, some joint
    distribution of letters that are nearby in 3D
    space after protein folding, then likelihood
    ratio will again be different.

12
Estimating Probabilities
  • Suppose we are given a long string s1..n of
    letters from ?
  • We want to estimate the distribution q() that
    generated the sequence
  • How should we go about this?
  • We build on the theory of parameter estimation in
    statistics using either maximum likelihood
    (today) estimation or the Bayesian approach
    (later on).

13
Estimating q(?)
  • Suppose we are given a long string s1..n of
    letters from ?
  • s can be the concatenation of all sequences in
    our database
  • We want to estimate the distribution q(?)
  • That is, q is defined per single letters

Likelihood function
14
Estimating q(?) (cont.)
  • How do we define q? Intuitively

Likelihood function
ML parameters (Maximum Likelihood)
15
Crash Course on Maximum Likelihood Binomial
Experiment
Head
Tail
  • When tossed, this device (thumbtack) can land
    in one of two positions Head or Tail
  • We denote by ? the (unknown) probability P(H).
  • Estimation task
  • Given a sequence of toss samples
  • x1, x2, , xM
  • we want to estimate the probabilities
  • P(H) ? and P(T) 1 - ?

16
Statistical Parameter Fitting
  • Consider instances
  • x1, x2, , xM , such that
  • The set of values that x can take is known
  • Each is sampled from the same distribution
  • Each sampled independently of the rest
  • The task is to find a vector of parameters ? that
    have generated the given data. This vector
    parameter ? can be used to predict future data.

17
The Likelihood Function
  • How good is a particular ??It depends on how
    likely it is to generate the observed data
  • The likelihood for the sequence H,T,T,H,H is

18
Sufficient Statistics
  • To compute the likelihood in the thumbtack
    example we only require NH and NT (the number
    of heads and the number of
  • tails)
  • NH and NT are sufficient statistics for the
    binomial distribution

19
Sufficient Statistics
  • A sufficient statistic is a function of the data
    that summarizes the relevant information for the
    likelihood

20
Maximum Likelihood Estimation
  • MLE Principle
  • Choose parameters that maximize the likelihood
    function
  • This is one of the most commonly used estimators
    in statistics
  • Intuitively appealing
  • One usually maximizes the log-likelihood function
    defined as lD(?) loge LD(?)

21
Example MLE in Binomial Data
  • Applying the MLE principle (taking derivative) we
    get

22
Estimating q(?) (cont.)
  • How do we define q?

Likelihood function
ML parameters (now) (Maximum Likelihood)
MAP parameters (later on) (Maximum A posteriori
Probability)
23
Estimating p(,)
  • Intuition
  • Find pair of aligned sequences s1..n, t1..n,
  • Estimate probability of pairs
  • The sequences s and t can be the concatenation of
    many aligned pairs from the database

Number of times a is aligned with b in (s,t)
24
Problems in Estimating p(,)
  • How do we find pairs of aligned sequences?
  • How far is the ancestor ?
  • earlier divergence ? low sequence similarity
  • recent divergence ? high sequence similarity
  • Does one letter mutate to the other or are they
    both mutations of a common ancestor having yet
    another residue/nucleotide acid ?

25
Scoring MatricesDeal with DNA first
(simpler)then AA (not too bad either)
26
What is it why ?
  • Let alphabet contain N letters
  • N 4 and 20 for nucleotides and amino acids
  • N x N matrix
  • (i,j) shows the relationship between i-th and
    j-th letters.
  • Positive number if letter i is likely to mutate
    into letter j
  • Negative otherwise
  • Magnitude shows the degree of proximity
  • Symmetric

27
Scoring Matrices for DNA
Transitions transversions
BLAST
identity
28
The BLOSUM45 Matrix
  • A R N D C Q E G H I L K M F P S
    T W Y V
  • A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1
    0 -2 -2 0
  • R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1
    -1 -2 -1 -2
  • N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1
    0 -4 -2 -3
  • D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0
    -1 -4 -2 -3
  • C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1
    -1 -5 -3 -1
  • Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0
    -1 -2 -1 -3
  • E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0
    -1 -3 -2 -3
  • G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0
    -2 -2 -3 -3
  • H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1
    -2 -3 2 -3
  • I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2
    -1 -2 0 3
  • L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3
    -1 -2 0 1
  • K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1
    -1 -2 -1 -2
  • M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2
    -1 -2 0 1
  • F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2
    -1 1 3 0
  • P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1
    -1 -3 -3 -3
  • S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4
    2 -4 -2 -1
  • T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2
    5 -3 -1 0
  • W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4
    -3 15 3 -3

29
Scoring Matrices for Amino Acids
  • Chemical similarities
  • Non-polar, Hydrophobic (G, A, V, L, I, M, F, W,
    P)
  • Polar, Hydrophilic (S, T, C, Y, N, Q)
  • Electrically charged (D, E, K, R, H)
  • Requires expert knowledge
  • Genetic code Nucleotide substitutions
  • E GAA, GAG
  • D GAU, GAC
  • F UUU, UUC
  • Actual substitutions
  • PAM
  • BLOSUM

30
Scoring Matrices Actual Substitutions
  • Manually align proteins
  • Look for amino acid substitutions
  • Entry log (freq(observed) / freq(expected))
  • Log-odds matrices

31
BLOSUM BLOcks Substitution Matrices
  • Henikoff Henikoff, 1992

Next slides taken from lecture notes by Tamer
Kahveci, CISE Department, University of Florida
(www.cise.ufl.edu/tamer/teaching/
fall2004/lectures/03-CAP5510-Fall04.ppt
32
BLOSUM Matrix
  • Begin with a set of protein sequences and obtain
    aligned blocks.
  • 2000 blocks from 500 families of related
    proteins
  • A block is the ungapped alignment of a highly
    conserved region of a family of proteins.
  • MOTIF program is used to find blocks
  • Substitutions in these blocks are used to compute
    BLOSUM matrix

block 1
block 2
block 3
WWYIR CASILRKIYIYGPV GVSRLRTAYGGRKNRG WFYV
R CASILRHLYHRSPA GVGSITKIYGGRKRNG WYYVR
AAAVARHIYLRKTV GVGRLRKVHGSTKNRG WYFIR
AASICRHLYIRSPA GIGSFEKIYGGRRRRG
33
Constructing the Matrix
  • Count the frequency of occurrence of each amino
    acid.
  • This gives the background distribution pa
  • Count the number of times amino acid a is aligned
    with amino acid b fab
  • A block of width w and depth s contributes
    ws(s-1)/2 pairs.
  • Denote by np the total number of pairs.
  • Compute the occurrence probability of each pair
    qab fab/ np
  • Compute the expected probability of occurrence of
    each pair
  • eab 2papb, if a ? b
  • papb otherwise
  • Compute twice (?) the log likelihood ratios,
    normalize, and round to nearest integer.
  • 2 log2 qab / eab

i
j gt i
a?b
34
Computation of BLOSUM-X
  • The amount of similarity in blocks has a great
    effect on the BLOSUM score. BLOSUM-X is generated
    by taking only blocks with X identity.
  • For example, a BLOSUM62 matrix is calculated from
    protein blocks with 62 identity.
  • So BLOSUM80 represents closer sequences (more
    recent divergence) than BLOSUM62.
  • On the web, Blast uses BLOSUM80, BLOSUM62 (the
    default), or BLOSUM45.

a
b
35
BLOSUM 62 Matrix
Check scores for
M I L V -small hydrophobic N D E Q -acid,
hydrophilic H R K -basic F Y W -aromatic S T P
A G -small hydrophilic C -sulphydryl
36
PAM vs. BLOSUM
Equivalent PAM and BLOSSUM matrices PAM100
Blosum90 PAM120 Blosum80 PAM160
Blosum60 PAM200 Blosum52 PAM250
Blosum45 BLOSUM62 is the default matrix to use.
37
And NowLadies and GentlemenBoys and Girlsthe
holy grailMultiple Sequence Alignment
38
Multiple Sequence Alignment
S1AGGTC
S2GTTCG
S3TGAAC
39
Multiple Sequence Alignment
Aligning more than two sequences.
  • Definition Given strings S1, S2, ,Sk a
    multiple (global) alignment map them to strings
    S1, S2, ,Sk that may contain blanks, where
  • S1 S2 Sk
  • The removal of spaces from Si leaves Si

40
Multiple alignments
  • We use a matrix to represent the alignment of k
    sequences, K(x1,...,xk). We assume no columns
    consists solely of blanks.

The common scoring functions give a score to each
column, and set score(K) ?i score(column(i))
For k10, a scoring function has 2k -1 gt 1000
entries to specify. The scoring function is
symmetric - the order of arguments need not
matter score(I,_,I,V) score(_,I,I,V).
41
SUM OF PAIRS
A common scoring function is SP sum of scores
of the projected pairwise alignments
SPscore(K)?iltj score(xi,xj).
Note that we need to specify the score(-,-)
because a column may have several blanks (as long
as not all entries are blanks).
In order for this score to be written as ?i
score(column(i)), we set score(-,-) 0. Why ?
Because these entries appear in the sum of
columns but not in the sum of projected pairwise
alignments (lines).
42
SUM OF PAIRS
43
Example
Consider the following alignment
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance and
for , this alignment has a SP
value of
44
Multiple Sequence Alignment
  • Given k strings of length n, there is a natural
    generalization of the dynamic programming
    algorithm that finds an alignment that maximizes
  • SP-score(K) ?iltj
    score(xi,xj).

Instead of a 2-dimensional table, we now have a
k-dimensional table to fill.
45
The idea via K2
Recall the notation and the following
recurrence for V
Note that the new cell index (i1,j1) differs
from previous indices by one of 2k-1 non-zero
binary vectors (1,1), (1,0), (0,1).
46
Multiple Sequence Alignment
Given k strings of length n, there is a
generalization of the dynamic programming
algorithm that finds an optimal SP alignment.
  • Computational Cost
  • Instead of a 2-dimensional table we now have a
    k-dimensional table to fill.
  • Each dimensions size is n1. Each entry depends
    on 2k-1 adjacent entries.

47
Complexity of the DP approach
Number of cells nk. Number of adjacent cells
O(2k). Computation of SP score for each
column(i,b) is o(k2)
Total run time is O(k22knk) which is totally
unacceptable ! Maybe one can do better?
48
But MSA is Intractable
Not much hope for a polynomial algorithm because
the problem has been shown to be NP complete
(proof is quite Tricky and recent. Some previous
proofs were bogus). Look at Isaac Elias
presentation of NP completeness proof. Need
heuristic or approximation to reduce time.
49
Multiple Sequence Alignment Approximation
Algorithm
Now we will see an O(k2n2) multiple alignment
algorithm for the SP-score that approximate the
optimal solutions score by a factor of at most
2(1-1/k) lt 2.
50
Star Alignments
Rather then summing up all pairwise alignments,
select a fixed sequence S1 as a center, and set
Star-score(K) ?jgt0score(S1,Sj). The
algorithm to find optimal alignment at each
step, add another sequence aligned with S1,
keeping old gaps and possibly adding new ones.
51
Multiple Sequence Alignment Approximation
Algorithm
  • Polynomial time algorithm
  • assumption the function d is a distance
    function

  • (triangle inequality)
  • Let D(S,T) be the value of the minimum global
    alignment between S and T.

52
Multiple Sequence Alignment Approximation
Algorithm (cont.)
  • 2. Call the remaining strings S2, ,Sk.
  • 3. Add a string to the multiple alignment that
    initially contains only S1 as follows
  • Suppose S1, ,Si-1 are already aligned as S1,
    ,Si-1. Add Si by running dynamic programming
    algorithm on S1 and Si to produce S1 and Si.
  • Adjust S2, ,Si-1 by adding spaces to those
    columns where spaces were added to get S1 from
    S1.
  • Replace S1 by S1.

53
Multiple Sequence Alignment Approximation
Algorithm (cont.)
  • Time analysis
  • Choosing S1 running dynamic programming
    algorithm
  • times O(k2n2)
  • When Si is added to the multiple alignment, the
    length of S1
  • is at most in, so the time to add all k
    strings is

54
Multiple Sequence Alignment Approximation
Algorithm (cont.)
  • Performance analysis
  • M - The alignment produced by this algorithm.
  • d(i,j) - the distance M induces on the pair
    Si,Sj.
  • M - optimal alignment.

55
Multiple Sequence Alignment Approximation
Algorithm (cont.)
Triangle inequality
Performance analysis
56
Multiple Sequence Alignment Approximation
Algorithm
Algorithm relies heavily on scoring function
being a distance. It produced an alignment whose
SP score is at most twice the minimum. What if
scoring function was similarity? Can we get an
efficient algorithm whose score is half the
maximum? Third of maximum? We dunno !
57
Tree Alignments
  • Assume that there is a tree T(V,E) whose leaves
    are the sequences.
  • Associate a sequence in each internal node.
  • Tree-score(K) ?(i,j)?Escore(xi,xj).
  • Finding the optimal assignment of sequences to
    the internal nodes is NP Hard.

We will meet again this problem in the study
of Phylogenetic trees (it is related to the
parsimony problem).
58
Multiple Sequence Alignment Heuristics
Example - 4 sequences A, B, C, D.
A.
B D A C
A B C D
Perform all 6 pair wise alignments. Find
scores. Build a similarity tree.
similar
distant
B.
Multiple alignment following the tree from A.
B
Align most similar pairs allowing gaps to
optimize alignment.
D
A
Align the next most similar pair.
C
Now, align the alignments, introducing gaps if
necessary to optimize alignment of (BD) with
(AC).
59
The tree-based progressive method for multiple
sequence alignment, used in practice (Clustal)
(a) a tree (dendrogram) obtained by cluster
analysis (b) pairwise alignment of 2 sequences
alignments.
(a)
L W R D G R G A L Q L W R G G R G A A Q D W R - G
R T A S G
(b)
DEHUG3 DEPGG3 DEBYG3 DEZYG3 DEBSGF
L R R - A R T A S A L - R G A R A A A E
(modified from Speeds ppt presentation,see p.
81 in Kanehisas book)
60
Visualization of Alignment Helps
Write a Comment
User Comments (0)
About PowerShow.com