1
Alignments, Matrices & Markov Models
  • Chris Bailey
  • Bacterial Pathogenesis & Genomics Unit
  • cmb036@bham.ac.uk

2
The Matrix
  • When analysing nucleotide sequences
  • Nucleotides either match (G = G)
  • Or they don't (G ≠ C)
  • No one nucleotide substitution is more relevant than
    another
  • (Except in the light of what that nucleotide
    sequence ends up coding for)

3
The Matrix
  • But in proteins
  • Some amino acids are readily substitutable (e.g.
    one hydrophobic residue for another)
  • And others are badly tolerated (e.g. a hydrophobic
    residue for a charged residue)
  • Assuming you want the 2° and 3° structure to be the
    same

4
Accommodating Change
  • So we adapt our scoring function (s)
  • So that scores for matches and mismatches take
    account of amino acid substitutability
  • We do this using protein substitution matrices

5
What's a matrix?
  • A lookup grid (20 x 20), with each row/column
    corresponding to an amino acid

6
What's a Matrix?
  • Values in the matrix are scores of one amino acid
    vs another
  • For example
  • Big positive score: amino acid 1 is readily
    substitutable for amino acid 2
  • Big negative score: amino acid 1 is not
    substitutable for amino acid 3
    (a minimal lookup sketch follows below)
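As a small illustration of such a lookup (the residues and scores below are invented for the example, not taken from a real matrix), a substitution matrix can be held as a keyed table and queried symmetrically:

    # Minimal sketch of a substitution-matrix lookup; illustrative values only.
    SUBS = {
        ("L", "I"): 2,   # hydrophobic for hydrophobic: readily substitutable
        ("L", "D"): -4,  # hydrophobic for charged: badly tolerated
        ("R", "K"): 3,   # one basic residue for another
    }

    def score(a, b, default=-1):
        """Look up s(a, b); the matrix is symmetric, so try both orderings."""
        return SUBS.get((a, b), SUBS.get((b, a), default))

    print(score("I", "L"))   # 2  (big positive score: readily substitutable)
    print(score("D", "L"))   # -4 (big negative score: badly tolerated)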

7
Part of a Matrix
8
How do we fill the table?
  • It is obvious what the values should represent
    (protein similarity)
  • But how do we calculate the actual numbers?
  • Since we are looking for homology, an evolutionary
    model is a good place to start

9
Why look at evolution?
  • Homology is a binary property
  • Similarity ≠ Homology
  • Homology indicates two proteins had a common
    ancestor

10
The PAM Matrix
  • PAM = Point Accepted Mutation
  • Construct 71 phylogenetic trees of protein
    families
  • Observe amino acid substitutions on each branch
    of the tree
  • Also need the probability of occurrence of each
    amino acid (p_a)

11
The PAM Matrix
  • Using the substitution data, calculate f_ab, the
    observed frequency of the mutation a → b
  • Also note that f_ab = f_ba
  • Using this information calculate f_a, the total
    number of mutations in which a is involved

12
The PAM Matrix
  • And also calculate f, the total number of
    occurrences of amino acid substitutions
  • From here we go on to calculate relative
    mutability
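Written out in the slides' notation, those two definitions amount to:

    f_a = \sum_{b \neq a} f_{ab}, \qquad f = \sum_a f_a

(with this convention each accepted substitution between a and b contributes to both f_a and f_b).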

13
The PAM Matrix
  • Relative mutability (m_a) = the probability that a
    given amino acid will change in the evolutionary
    period of interest
  • Now we calculate the matrix
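The slide does not give the formula. In the usual Dayhoff-style construction (stated here as an assumption, not taken from the slide), the relative mutability compares how often a residue is seen to change with how often it occurs:

    m_a \propto \frac{f_a}{p_a}

with the constant of proportionality chosen so that the matrix corresponds to the evolutionary period of interest (for PAM1, one accepted substitution per 100 residues).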

14
The PAM Matrix
  • A 20 x 20 matrix, where M_ab is the probability of
    amino acid a changing into amino acid b
  • M_aa = 1 - m_a
  • M_ab is more complicated: it requires conditional
    probability
  • E.g. P(A and B) = P(A) P(B|A)

15
The PAM Matrix
  • In this case (equation shown on the slide)
  • Or (alternative form shown on the slide)
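The equations here were slide images and are missing from the transcript. A standard Dayhoff-style reconstruction, consistent with M_aa = 1 - m_a and with the conditional-probability hint P(A and B) = P(A) P(B|A) above (offered as an assumption, not as the slide's exact expression), is:

    M_{ab} = m_a \cdot \frac{f_{ab}}{f_a}

i.e. the probability that a changes at all (m_a), multiplied by the probability that, given a changes, it changes into b (f_ab / f_a).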

16
The PAM Matrix
  • These equations allow us to calculate a PAM1 matrix
  • The number after PAM is the number of accepted amino
    acid substitutions per 100 residues
  • PAM40 = 40 substitutions per 100 residues
  • PAM250 = 250 substitutions per 100 residues
  • All higher-numbered matrices are calculated by
    repeated multiplication of the PAM1 matrix
    (sketched below)
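A minimal sketch of that multiplication, assuming the PAM1 mutation probabilities are available as a 20 x 20 NumPy array (the toy array below is not real PAM data):

    import numpy as np

    def pam_n(pam1, n):
        """Raise the PAM1 mutation-probability matrix to the n-th power.
        PAM250 is pam_n(pam1, 250), i.e. PAM1 multiplied by itself 250 times."""
        return np.linalg.matrix_power(pam1, n)

    # Toy stand-in: 99% chance of no change per residue, with the remaining
    # 1% spread evenly over the other 19 residues (rows sum to 1).
    pam1 = np.full((20, 20), 0.01 / 19)
    np.fill_diagonal(pam1, 0.99)

    pam250 = pam_n(pam1, 250)
    print(pam250.shape)   # (20, 20)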

17
The PAM matrix
  • The final scores in a PAM matrix are expressed as
    lod (log-odds) scores
  • Compare the probability of the mutation vs the
    probability of random occurrence
  • This gives an odds ratio
  • The scoring matrix S is calculated by the equation
    shown on the slide (see below)
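The scoring equation was a slide image. The standard PAM log-odds form (a reconstruction, assuming the usual Dayhoff scaling of ten times the base-10 logarithm, rounded to integers) is:

    S_{ab} = 10 \log_{10} \frac{M_{ab}}{p_b}

i.e. the probability of seeing b aligned to a under the mutation model, compared with the probability of seeing b by chance.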

18
The full PAM 250 matrix
19
The BLOSUM Matrix
  • Derived using similar mathematical principles to
    a PAM matrix
  • However, the substitution data is derived in a
    different manner

20
The BLOSUM Matrix
  • Substitution data comes from multiple alignments
  • Straightforward count of the number of substitutions
    in the alignment
  • The number after BLOSUM (e.g. 62) denotes the minimum
    level (as a percentage) of similarity between
    sequences within the alignments

21
BLAST
  • The Basic Local Alignment Search Tool
  • It is a heuristic (i.e. it does not guarantee
    optimal results)
  • Why?

22
BLAST
  • GenBank currently stores about 10^10 residues of
    protein data
  • Trying to form alignments against such a huge
    database is unfeasible (even with vast computing
    power)
  • So we need a shortcut

23
PAM Vs BLOSUM
  • PAM100 ≈ BLOSUM90
  • PAM120 ≈ BLOSUM80
  • PAM160 ≈ BLOSUM60
  • PAM200 ≈ BLOSUM52
  • PAM250 ≈ BLOSUM45

24
How BLAST Works
  • 3 steps
  • Compile a list of high-scoring words
  • Search for hits; hits give seeds
  • Extend the seeds

25
Step 1 - Preprocessing
  • BLAST creates a series of words of length W, where
  • W is 2..4 for proteins
  • W is >10 for DNA
  • These words are based on subsequences of the
    query (Q)

26
Step 1 - Preprocessing
  • Get each subsequence of Q, that is of length W
  • E.g. for the query LVNRKPVVP
  • LVN
  • VNR
  • NRK
  • RKP
  • etc. (a small sketch of this step follows below)
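A minimal sketch of this word-listing step in Python (W = 3, as for proteins):

    # List every length-W subsequence (word) of the query.
    def query_words(query, w=3):
        return [query[i:i + w] for i in range(len(query) - w + 1)]

    print(query_words("LVNRKPVVP"))
    # ['LVN', 'VNR', 'NRK', 'RKP', 'KPV', 'PVV', 'VVP']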

27
Step 1 - Preprocessing
  • For each of these words, find similar words
  • How does BLAST define similarity?
  • It uses the scoring matrix to score the 2 words,
    position by position
  • E.g.
  • W1: R K P
  • W2: R R P
  • Scores: 9, -1, 7 → Total = 15

28
Step 1 - Preprocessing
  • Words are similar if their score is greater than
    a threshold value T (T = 12 usually)
  • For RKP, the following are examples of high-scoring
    words:
  • QKP, KKP, RQP, REP, RRP, RKP (see the sketch below)
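A sketch of the word-scoring and threshold idea. The stand-in matrix below scores +5 for an identity and -1 for any mismatch; it is not a real PAM/BLOSUM matrix, so only the exact word clears T = 12 here, whereas with a real matrix conservative substitutions (K/R, Q/K, ...) score positively and words such as QKP or RRP also pass:

    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    # Illustrative stand-in matrix: +5 for an identity, -1 for any mismatch.
    MATRIX = {(a, b): 5 if a == b else -1 for a in AMINO_ACIDS for b in AMINO_ACIDS}

    def word_score(w1, w2, matrix=MATRIX):
        """Score two equal-length words position by position using the matrix."""
        return sum(matrix[(a, b)] for a, b in zip(w1, w2))

    def neighbourhood(word, t=12, matrix=MATRIX):
        """Every word of the same length scoring above the threshold T against `word`."""
        candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
        return [c for c in candidates if word_score(word, c, matrix) > t]

    print(word_score("RKP", "RRP"))   # 5 - 1 + 5 = 9 with this toy matrix
    print(neighbourhood("RKP"))       # ['RKP'] only, given the toy matrix and T = 12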

29
Step 2 Looking for Hits
  • formatdb creates a hash lookup table
  • E.g. KKP → 12054345, 23451635, 23452152
  • Maps each word to entries in the database of
    proteins
  • Allows us to retrieve sequences which have a word
    match in constant time (sketched below)
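A sketch of that lookup as a plain Python dictionary keyed by word (the toy database entries and positions are invented; the real formatdb builds its own binary index files):

    from collections import defaultdict

    def build_word_index(database, w=3):
        """Map every length-w word to the (sequence id, offset) positions where it occurs."""
        index = defaultdict(list)
        for seq_id, seq in database.items():
            for i in range(len(seq) - w + 1):
                index[seq[i:i + w]].append((seq_id, i))
        return index

    db = {"P1": "MKKPLVNRKP", "P2": "AKKPGG"}   # toy database
    index = build_word_index(db)
    print(index["KKP"])   # [('P1', 1), ('P2', 1)]: constant-time retrieval of hits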

30
Step 3 Extending the matches
  • Starting at the word match, extend the alignment
    in both directions
  • The alignment is scored using an adapted
    Smith-Waterman algorithm
  • Extension is stopped once the score has dropped below
    the specified threshold (a simplified sketch follows)
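A much-simplified sketch of the extension idea, ungapped and in one direction only, stopping once the running score drops a fixed amount below the best score seen so far (real BLAST extends in both directions from the seed, and newer versions allow gapped extension):

    # Extend a seed match to the right; `matrix` is a dict as sketched earlier.
    def extend_right(query, subject, q_start, s_start, matrix, drop=10):
        score = best = best_len = 0
        i = 0
        while q_start + i < len(query) and s_start + i < len(subject):
            score += matrix[(query[q_start + i], subject[s_start + i])]
            i += 1
            if score > best:
                best, best_len = score, i
            elif best - score > drop:   # stop once we fall too far below the best score
                break
        return best, best_len           # best score and the extension length achieving it

    M = {(a, b): 2 if a == b else -1
         for a in "ACDEFGHIKLMNPQRSTVWY" for b in "ACDEFGHIKLMNPQRSTVWY"}
    print(extend_right("LVNRKPVVP", "AANRKPVQV", 2, 2, M))   # (10, 5)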

31
Multiple alignments
  • The problem
  • Pairwise alignment requires n^2 time
  • Multiple alignment requires n^x, where x is the
    number of sequences to align

32
Progressive alignment
  • Pairwise alignment of each combination of sequence
    pairs (requires roughly x^2 n^2 time, one n^2
    alignment per pair)
  • Use the alignment scores to produce a dendrogram
    using the neighbour-joining method
  • Align the sequences sequentially, using the
    relationships from the tree

33
Progressive alignment
  • For 5 sequences, align 1v2, 1v3 ... 4v5 (every pair)
  • Make a tree

(Tree diagram on slide, with leaves 1, 3, 4, 5 and 2)
34
Progressive alignment
(Diagram on slide: the sequences are aligned step by step, steps 1 to 4, following the guide tree over leaves 1, 3, 4, 5 and 2)
35
Hidden Markov models
  • A prediction tool
  • Two related things, X and Y
  • Use information about X and Y to build a model
  • Predict Y using X, or vice versa

36
Markov Models
  • Defines a series of states and the probability of
    moving from one state to another
  • E.g. the weather

(State diagram on slide: Sun, Rain and Cloud, with transition arrows between them)
37
Markov Models
  • Example probabilities (table shown on the slide)
  • This is a state transition matrix (A)

38
Markov Models
  • To initialise the system, use a vector of initial
    probabilities (the π vector)
  • In this case, the type of weather on day 1 (see the
    worked sketch below)
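A worked sketch of these two ingredients; the numbers below are invented for illustration (the slide's actual values were in a table image):

    import numpy as np

    states = ["Sun", "Cloud", "Rain"]

    # State transition matrix A: A[i, j] = P(tomorrow = states[j] | today = states[i]).
    # Illustrative values only; each row sums to 1.
    A = np.array([[0.6, 0.3, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.4, 0.4]])

    # Initial probability (pi) vector: the weather on day 1.
    pi = np.array([0.5, 0.3, 0.2])

    # Probability distribution over the weather on day 2.
    day2 = pi @ A
    for s, p in zip(states, day2):
        print(s, round(float(p), 3))   # Sun 0.43, Cloud 0.35, Rain 0.22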

39
Hidden Markov Models
  • Use something observable to predict something
    hidden
  • E.g. barometric pressure and the weather (if you
    couldn't see outside)
  • Every observable state is connected to every
    hidden state

40
Hidden Markov Models
(Diagram on slide: observable barometer readings of 28, 29, 30 and 31 inHg, each connected to the hidden weather states)
41
Hidden Markov Models
  • We also need a probability matrix for the connections
    between observable and hidden states (the confusion
    matrix, B)

42
Hidden Markov Models
  • You can do many different things using HMMs
  • Match a series of observations to an HMM
    (Evaluation)
  • Determine the hidden sequence most likely to have
    generated a sequence of observations (Decoding)
  • Determine the parameters (π, A and B) most
    likely to have generated a sequence of
    observations (Learning)

43
Hidden Markov models
  • There is a series of algorithms that you can use
    to
  • Evaluate (Forward algorithm)
  • Decode (Viterbi algorithm, sketched below)
  • Learn (Forward-Backward algorithm)
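A minimal Viterbi (decoding) sketch, continuing the weather example; the π, A and B values below are illustrative, not taken from the slides:

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely hidden-state path for a sequence of observation indices.
        pi: initial probabilities, A: state transition matrix,
        B: confusion/emission matrix, B[i, k] = P(observable k | hidden state i)."""
        T, n = len(obs), len(pi)
        delta = np.zeros((T, n))              # best path probability ending in each state
        back = np.zeros((T, n), dtype=int)    # back-pointers for the traceback
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            for j in range(n):
                probs = delta[t - 1] * A[:, j]
                back[t, j] = probs.argmax()
                delta[t, j] = probs.max() * B[j, obs[t]]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    hidden = ["Sun", "Cloud", "Rain"]
    pi = np.array([0.5, 0.3, 0.2])
    A = np.array([[0.6, 0.3, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.4, 0.4]])
    # Observables are barometer readings binned as 28, 29, 30, 31 inHg.
    B = np.array([[0.05, 0.15, 0.30, 0.50],
                  [0.20, 0.30, 0.30, 0.20],
                  [0.50, 0.30, 0.15, 0.05]])

    readings = [3, 3, 0, 1]   # 31, 31, 28, 29 inHg
    print([hidden[s] for s in viterbi(readings, pi, A, B)])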

44
Hidden Markov Models