Lecture outline - PowerPoint PPT Presentation

About This Presentation
Title:

Lecture outline

Description:

O(max(m,n)) if only similarity score is needed. More complicated 'divide-and-conquer' algorithm that ... log2(3)=1.585 bits or loge(3) = ln(3) = 1.1 nats ... – PowerPoint PPT presentation

Number of Views:21
Avg rating:3.0/5.0
Slides: 50
Provided by: ambuj4
Category:
Tags: lecture | loge | outline

less

Transcript and Presenter's Notes

Title: Lecture outline


1
Lecture outline
  • Database searches
  • BLAST
  • FASTA
  • Statistical Significance of Sequence Comparison
    Results
  • Probability of matching runs
  • Karin-Altschul statistics
  • Extreme value distribution

2
DP Alignment Complexity
  • O(mn) time
  • O(mn) space
  • O(max(m,n)) if only similarity score is needed
  • More complicated divide-and-conquer algorithm
    that doubles time complexity and uses O(min(m,n))
    space Hirschberg, JACM 1977

3
Time and space bottlenecks
  • Comparing two one-megabase genomes.
  • Space
  • An entry 4 bytes
  • Table 4 106 106 4 T bytes memory.
  • Time
  • 1000 MHz CPU 1M entries/second
  • 1012 entries 1M seconds 10 days.

4
BLAST
  • Basic Local Alignment Search Tool
  • Altschul et al. 1990,1994,1997
  • Heuristic method for local alignment
  • Designed specifically for database searches
  • Idea good alignments contain short lengths of
    exact matches

5
Steps of BLAST
  • Query words of length 4 (for proteins) or 11 (for
    DNA) are created from query sequence using a
    sliding window
  • Scan each database sequence for an exact match to
    query words. Each match is a seed for an ungapped
    alignment.

MEFPGLGSLGTSEPLPQFVDPALVSS
MEFP EFPG FPGL PGLG GLGS
6
Steps of BLAST
  • 3. (Original BLAST) extend matching words to
    the left and right using ungapped alignments.
    Extension continues as long as score does not
    fall below a given threshold. This is an HSP
    (high scoring pair).
  • (BLAST2) Extend the HSPs using gapped alignment.

7
Steps of BLAST
  • 4. Using a cutoff score S, keep only the
    extended matches that have a score at
    least S.
  • 5. Determine statistical significance of each
    remaining match.

8
Example BLAST run
  • BLAST website
  • http//www.ncbi.nlm.nih.gov/BLAST/

9
FASTA
  • Derived from logic of the dot plot
  • compute best diagonals from all frames of
    alignment
  • Word method looks for exact matches between words
    in query and test sequence
  • construct word position tables
  • DNA words are usually 6 bases
  • protein words are 1 or 2 amino acids
  • only searches for diagonals in region of word
    matches faster searching

10
Steps of FASTA
  1. Find k-tups in the two sequences (k1-2 for
    proteins, 4-6 for DNA sequences)
  2. Create a table of positions for those k-tups

11
The offset table
position 1 2 3 4 5 6 7 8 9 10 11 proteinA n c s p
t a . . . . . proteinB . . . . . a c s p r k
position in offset amino
acid protein A protein B pos A -
posB ---------------------------------------------
-------- a 6 6
0 c 2 7
-5 k - 11 n
1 - p 4 9
-5 r - 10 s
3 8 -5 t
5 - -------------------------
---------------------------- Note the common
offset for the 3 amino acids c,s and p A possible
alignment is thus quickly found - protein 1 n c s
p t a protein 2 a c s p r k
12
FASTA
  • 3. Select top 10 scoring local diagonals
    with matches and mismatches but no gaps.
  • 4. Rescan top 10 diagonals (representing
    alignments), score with PAM250 (proteins) or DNA
    scoring matrix. Trim off the ends of the regions
    to achieve highest scores.

13
FASTA Algorithm
14
FASTA
  • 5. After finding the best initial region, FASTA
    performs a DP global alignment centered on the
    best initial region.

15
FASTA Alignments
16
History of sequence searching
  • 1970 NW
  • 1981 SW
  • 1985 FASTA
  • 1990 BLAST
  • 1997 BLAST2

17
The purpose of sequence alignment
  • Homology
  • Function identification
  • about 70 of the genes of M. jannaschii were
    assigned a function using sequence similarity
    (1997)

18
Similarity
  • How much similar do the sequences have to be to
    infer homology?
  • Two possibilities when similarity is detected
  • The similarity is by chance
  • They evolved from a common ancestor hence, have
    similar functions

19
Measures of similarity
  • Percent identity
  • 40 similar, 70 similar
  • problems with percent identity?
  • Scoring matrices
  • matching of some amino acids may be more
    significant than matching of other amino acids
  • PAM matrix in 1970, BLOSUM in 1992
  • problems?

20
Statistical Significance
  • Goal to provide a universal measure for
    inferring homology
  • How different is the result from a random match,
    or a match between unrelated requences?
  • Given a set of sequences not related to the query
    (or a set of random sequences), what is the
    probability of finding a match with the same
    alignment score by chance?
  • Different statistical measures
  • p-value
  • E-value
  • z-score

21
Statistical significance measures
  • p-value the probability that at least one
    sequence will produce the same score by chance
  • E-value expected number of sequences that will
    produce same or better score by chance
  • z-score measures how much standard deviations
    above the mean of the score distribution

22
How to compute statistical significance?
  • Significance of a match-run
  • Erdös-Renyí
  • Significance of local alignments without gaps
  • Karlin-Altschul statistics
  • Scoring matrices revisited
  • Significance of local alignments with gaps
  • Significance of global alignments

23
Analysis of coin tosses
  • Let black circles indicate heads
  • Let p be the probability of a head
  • For a fair coin, p 0.5
  • Probability of 5 heads in a row is (1/2)50.031
  • The expected number of times that 5H occurs in
    above 14 coin tosses is 100.031 0.31

24
Analysis of coin tosses
  • The expected number of a length l run of heads in
    n tosses.
  • What is the expected length R of the longest
    match in n tosses?

25
Analysis of coin tosses
  • (Erdös-Rényi) If there are n throws, then the
    expected length R of the longest run of heads is
  • R log1/p (n)

26
Example
  • Example Suppose n 20 for a fair coin
  • Rlog2(20)4.32
  • In other words in 20 coin tosses we expect a run
    of heads of length 4.32, once.
  • Trick is how to model DNA (or amino acid)
    sequence alignments as coin tosses.

27
Analysis of an alignment
  • Probability of an individual match p 0.05
  • Expected number of matches 10x8x0.05 4
  • Expected number of two successive matches
  • 10x8x0.05x0.05 0.2

28
Matching runsin sequence alignments
  • Consider two sequences a1..m and b1..n
  • If the probability of occurrence for every symbol
    is p, then a match of a residue ai with bj is p,
    and a match of length l from ai,bj to
    ail-1,bjl-1 is pl.
  • The head-run problem of coin tosses corresponds
    to the longest run of matches along the diagonals

29
Matching runsin sequence alignments
  • There are m-l1 x n-l1 places where the match
    could start
  • The expected length of the longest match can be
    approximated as
  • Rlog1/p(mn)
  • where m and n are the lengths of the two
    sequences.

30
Matching runsin sequence alignments
  • So suppose m n 10 and were looking at DNA
    sequences
  • Rlog4(100)3.32
  • This analysis makes assumptions about the base
    composition (uniform) and no gaps, but its a
    good estimate.

31
Statistics for matching runs
  • Statistics of matching runs
  • Length versus score?
  • Consider all mismatches receive a negative score
    of -8 and aibj match receives a positive score of
    si,j.
  • What is the expected number of matching runs with
    a score x or higher?
  • Using this theory of matching runs, Karlin and
    Altschul developed a theory for statistics of
    local alignments without gaps (extended this
    theory to allow for mismatches).

32
Statistics of local alignments without gaps
  • A scoring matrix which satisfy the following
    constraint
  • The expected score of a single match obtained by
    a scoring matrix should be negative.
  • Otherwise?
  • Arbitrarily long random sequences will get higher
    scores just because they are long, not because
    theres a significant match.
  • If this requirement is met then the expected
    number of alignments with score x or higher is
    given by

33
Statistics of local alignments without gaps
  • K lt 1 is a proportionality constant that corrects
    the mn space factor for the fact that there are
    not really mn independent places that could have
    produced score S x.
  • K has little effect on the statistical
    significance of a similarity score
  • ? is closely related to the scoring matrix used
    and it takes into account that the scoring
    matrices do not contain actual probabilities of
    co-occurence, but instead a scaled version of
    those values. To understand how ? is computed, we
    have to look at the construction of scoring
    matrices.

34
Scoring Matrices
  • In 1970s there were few protein sequences
    available. Dayhoff used a limited set of families
    of protein sequences multiply aligned to infer
    mutation likelihoods.

35
Scoring Matrices
  • Dayhoff represented the similarity of amino acids
    as a log odds ratio
  • where qij is the observed frequency of
    co-occurrence, and pi, pj are the individual
    frequencies.

36
Example
  • If M occurs in the sequences with 0.01 frequency
    and L occurs with 0.1 frequency. By random
    pairing, you expect 0.001 amino acid pairs to be
    M-L. If the observed frequency of M-L is actually
    0.003, score of matching M-L will be
  • log2(3)1.585 bits or loge(3) ln(3) 1.1 nats
  • Since, scoring matrices are usually provided as
    integer matrices, these values are scaled by a
    constant factor. ? is approximately the inverse
    of the original scaling factor.

37
How to compute ?
  • Recall that
  • and

Sum of observed frequencies is 1.
Given the frequencies of individual amino acids
and the scores in the matrix, ? can be estimated.
38
Extreme value distribution
  • Consider an experiment that obtains the maximum
    value of locally aligning a random string with
    query string (without gaps). Repeat with another
    random string and so on. Plot the distribution
    of these maximum values.
  • The resulting distribution is an extreme value
    distribution, called a Gumbel distribution.

39
Normal vs. Extreme Value Distribution
Normal distribution y (1/v2p)e-x2/2
Normal
Extreme value distribution y e-x e-x
Extreme Value
40
Local alignments with gaps
  • The EVD distribution
  • is not always observed.
  • Theory of local alignments
  • with gaps is not well studied
  • as in without gaps.
  • Mostly empirical results.
  • For example, BLAST allows
  • only a certain range of
  • gap penalties.

41
BLAST statistics
  • Pre-computed ? and K values for different scoring
    matrices and gap penalties are used for faster
    computation.
  • Raw score is converted to bit score
  • E-value is computed using
  • m is query size, n is database size and L is the
    typical length of maximal scoring alignment.

42
FASTA Statistics
  • FASTA tries to estimate the probability
    distribution of alignments for every query.
  • For any query sequence, a large collection of
    scores is gathered during the search of the
    database.
  • They estimate the parameters of the EVD
    distribution based on the histogram of scores.
  • Advantages
  • reliable statistics for different parameters
  • different databases, different gap penalties,
    different scoring matrices, queries with
    different amino acid compositions.

43
Statistical significanceanother example
  • Suppose, we have a huge graph with weighted edges
    and we want to find strongly connected clusters
    of nodes.
  • Suppose, an algorithm for this task is given.
  • The algorithms gives you the best hundred
    clusters in this graph.
  • How do you define best?
  • Cluster size?
  • Total weight of edges?

44
Statistical significance
  • How different is a found cluster of size N from a
    random cluster of the same size?
  • This measure will enable comparison of clusters
    of different sizes.

45
Statistical significance of a cluster
  • Use maximum spanning tree weight of a cluster as
    a quantitative representation of that cluster.
  • And see what
  • values random
  • clusters get.
  • (sample many
  • random
  • clusters)

46
Statistical significance of a cluster
Looks like an exponential decay. We may fit an
exponential distribution on this histogram.
47
Fitting an exponential
48
Statistical significance of a cluster
After we fit an exponential distribution, we
compute the probability that another random
cluster gets a higher score than the score of
found cluster.
49
Examples
  • ?5 1.7 for clusters of size 5 and ?20 0.36
    for clusters of size 20.
  • Suppose you have found a cluster of size 5 with
    weights of its edges sum up to 15 and you have
    found a cluster of size 20 with weight 45 which
    one would you prefer?
Write a Comment
User Comments (0)
About PowerShow.com