Probabilities and Probabilistic Models - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Probabilities and Probabilistic Models


1
Probabilities and Probabilistic Models
2
Probabilistic models
  • A model means a system that simulates an object
    under consideration.
  • A probabilistic model is a model that produces
    different outcomes with different probabilities;
    it can simulate a whole class of objects,
    assigning each an associated probability.
  • In bioinformatics, the objects are usually DNA or
    protein sequences, and a model might describe a
    family of related sequences.

3
Examples
  • The roll of a six-sided die: six parameters p1,
    p2, ..., p6, where pi is the probability of rolling
    the number i. For probabilities, pi > 0 and
    p1 + p2 + ... + p6 = 1.
  • Three rolls of a die: the model might be that the
    rolls are independent, so that the probability of
    a sequence such as (2, 4, 6) would be p2 p4 p6.
  • An extremely simple model of any DNA or protein
    sequence is a string over a 4 (nucleotide) or 20
    (amino acid) letter alphabet. Let qa denote the
    probability that residue a occurs at a given
    position, at random, independent of all other
    residues in the sequence. Then, for a given
    length n, the probability of the sequence
    x1, x2, ..., xn is
    P(x1, ..., xn) = q_x1 q_x2 ... q_xn.
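To make the i.i.d. sequence model concrete, here is a minimal Python sketch
(not part of the original slides) that computes the probability of a sequence
from assumed background frequencies qa:

    from functools import reduce

    # Assumed uniform background frequencies q_a (illustration only).
    q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

    def sequence_probability(seq, q):
        # P(x1, ..., xn) = q_x1 * q_x2 * ... * q_xn, positions independent.
        return reduce(lambda p, a: p * q[a], seq, 1.0)

    print(sequence_probability("GACCGT", q))  # 0.25 ** 6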

4
Conditional, joint, and marginal probabilities
  • Two dice D1 and D2. For j = 1, 2, assume that the
    probability of using die Dj is P(Dj), and for
    i = 1, 2, ..., 6, assume that the probability of
    rolling an i with die Dj is P(i | Dj).
  • In this simple two-dice model, the conditional
    probability of rolling an i with die Dj is P(i | Dj).
  • The joint probability of picking die Dj and
    rolling an i is P(i, Dj) = P(i | Dj) P(Dj).
  • The probability of rolling i (the marginal
    probability) is P(i) = sum over j of P(i | Dj) P(Dj).
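As an illustration, a small Python sketch of the two-dice model; the numbers
for the second die are made up for the example:

    # P(Dj): probability of picking die Dj (illustrative values).
    P_die = {"D1": 0.5, "D2": 0.5}
    # P(i | Dj): conditional probability of rolling i with die Dj.
    P_roll = {"D1": [1/6] * 6,                 # fair die
              "D2": [1/10] * 5 + [1/2]}        # die loaded toward 6

    def joint(i, j):
        # Joint probability P(i, Dj) = P(i | Dj) * P(Dj).
        return P_roll[j][i - 1] * P_die[j]

    def marginal(i):
        # Marginal probability P(i) = sum over j of P(i | Dj) * P(Dj).
        return sum(joint(i, j) for j in P_die)

    print(joint(6, "D2"), marginal(6))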

5
Maximum likelihood estimation
  • Probabilistic models have parameters that are
    usually estimated from large sets of trusted
    examples, called a training set.
  • For example, the probability qa for seeing amino
    acid a in a protein sequence can be estimated as
    the observed frequency fa of a in a database of
    known protein sequences, such as SWISS-PROT.
  • This way of estimating models is called maximum
    likelihood estimation, because it can be shown
    that using the observed frequencies maximizes the
    total probability of the training set, given the
    model.
  • In general, given a model with parameters θ and a
    set of data D, the maximum likelihood estimate
    (MLE) for θ is the value of θ which maximizes P(D | θ).
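A minimal sketch of maximum likelihood estimation by counting, using a toy
training set in place of a real database such as SWISS-PROT:

    from collections import Counter

    def mle_frequencies(training_sequences):
        # MLE: q_a is the observed frequency f_a of residue a in the training set.
        counts = Counter()
        for seq in training_sequences:
            counts.update(seq)
        total = sum(counts.values())
        return {a: n / total for a, n in counts.items()}

    # Hypothetical toy training set (three short "protein" sequences).
    print(mle_frequencies(["ACDE", "AACD", "DDEE"]))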

6
Model comparison problem
  • An occasionally dishonest casino uses two kinds
    of dice, of which 99% are fair, but 1% are
    loaded, so that a 6 appears 50% of the time.
  • We pick up a die and roll 6, 6, 6. This looks
    like a loaded die, but is it? This is an example of a
    model comparison problem.
  • I.e., our hypothesis Dloaded is that the die is
    loaded. The other alternative is Dfair. Which
    model fits the observed data better? We want to
    calculate
  • P(Dloaded | 6, 6, 6).

7
Prior and posterior probability
  • P(Dloaded | 6, 6, 6) is the posterior
    probability that the die is loaded, given the
    observed data.
  • Note that the prior probability of this
    hypothesis is 1/100; "prior" because it is our
    best guess about the die before having seen any
    information about it.
  • The likelihood of the hypothesis Dloaded is
    P(6, 6, 6 | Dloaded) = 0.5 × 0.5 × 0.5 = 0.125.
  • Posterior probability, using Bayes' theorem:
    P(Dloaded | 6, 6, 6) =
    P(6, 6, 6 | Dloaded) P(Dloaded) / P(6, 6, 6).

8
Comparing models using Bayes' theorem
  • We set X = Dloaded and Y = (6, 6, 6) in Bayes'
    theorem, thus obtaining
    P(Dloaded | 6, 6, 6) =
    P(6, 6, 6 | Dloaded) P(Dloaded) / P(6, 6, 6).
  • The probability P(Dloaded) of picking a loaded
    die is 0.01.
  • The probability P(6, 6, 6 | Dloaded) of rolling
    three sixes using a loaded die is 0.5^3 = 0.125.
  • The total probability P(6, 6, 6) of three sixes
    is
    P(6, 6, 6 | Dloaded) P(Dloaded) + P(6, 6,
    6 | Dfair) P(Dfair).
  • Now,
    P(Dloaded | 6, 6, 6) =
    (0.125 × 0.01) / (0.125 × 0.01 + (1/6)^3 × 0.99) ≈ 0.21.
  • Thus, the die is probably fair.
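The same calculation, written out in Python (the numbers are those given on
the slide):

    p_loaded = 0.01            # prior P(Dloaded)
    p_fair = 0.99              # prior P(Dfair)
    lik_loaded = 0.5 ** 3      # P(6, 6, 6 | Dloaded)
    lik_fair = (1 / 6) ** 3    # P(6, 6, 6 | Dfair)

    # Bayes' theorem: posterior = likelihood * prior / total probability.
    posterior = (lik_loaded * p_loaded) / (lik_loaded * p_loaded + lik_fair * p_fair)
    print(posterior)           # roughly 0.21, so the die is probably fair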

9
Biological example
  • Let's assume that extracellular (ext) proteins
    have a slightly different composition than
    intracellular (int) ones. We want to use this to
    judge whether a new protein sequence x1, ..., xn is
    ext or int.
  • To obtain training data, classify all proteins in
    SWISS-PROT into ext, int and unclassifiable ones.
  • Determine the frequencies q_a^ext and q_a^int of
    each amino acid a in ext and int proteins,
    respectively.
  • To be able to apply Bayes' theorem, we need to
    determine the priors pext and pint, i.e. the
    probability that a new (unexamined) sequence is
    extracellular or intracellular, respectively.

10
Biological example - cont.
  • We have P(x | ext) = q_x1^ext ... q_xn^ext and
    P(x | int) = q_x1^int ... q_xn^int.
  • If we assume that any sequence is either
    extracellular or intracellular, then we have
  • P(x) = pext P(x | ext) + pint P(x | int).
  • By Bayes' theorem, we obtain
    P(ext | x) = pext P(x | ext) / P(x),
  • the posterior probability that a sequence is
    extracellular.
  • (In reality, many transmembrane proteins have
    both intra- and extracellular components, and more
    complex models such as HMMs are appropriate.)
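A sketch in Python of the classifier described above; the frequency tables
q_ext and q_int and the priors are assumed inputs, and log-space is used only
to avoid numerical underflow on long sequences:

    import math

    def posterior_ext(x, q_ext, q_int, p_ext, p_int):
        # P(ext | x) = p_ext * P(x | ext) / (p_ext * P(x | ext) + p_int * P(x | int)),
        # with P(x | class) the product of per-residue frequencies.
        log_ext = math.log(p_ext) + sum(math.log(q_ext[a]) for a in x)
        log_int = math.log(p_int) + sum(math.log(q_int[a]) for a in x)
        m = max(log_ext, log_int)
        return math.exp(log_ext - m) / (math.exp(log_ext - m) + math.exp(log_int - m))

    # Toy example with made-up frequencies over a two-letter alphabet.
    print(posterior_ext("AAB", {"A": 0.7, "B": 0.3}, {"A": 0.4, "B": 0.6}, 0.5, 0.5))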

11
Probability vs. likelihood
  • If we consider P(X | Y) as a function of X,
    then this is called a probability.
  • If we consider P(X | Y) as a function of Y,
    then this is called a likelihood.

12
Sequence comparison by compression
13
Motivation
  • Similarity is used as a marker for homology, and
    homology is used to infer function.
  • Sometimes, we are only interested in a numerical
    distance between two sequences, for example, to
    infer a phylogeny.

Figure adapted from http://www.inf.ethz.ch/personal/gonnet/acat2000/side2.html
14
Text vs. DNA compression
  • compress, gzip or zip are routinely used to
    compress text files. They can also be applied to
    a text file containing DNA.
  • E.g., for a text file F containing chromosome 19 of
    human in FASTA format, |F| = 61 MB, but
    |compress(F)| = 8.5 MB.
  • 8 bits are used for each character. However, DNA
    consists of only 4 different bases, so 2 bits per
    base are enough: A = 00, C = 01, G = 10, and T = 11.
  • Applying a standard compression algorithm to a
    file containing DNA encoded using two bits per
    base will usually not be able to compress the
    file further.
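For illustration, a small Python sketch of two-bit packing (a real format
would also have to record the sequence length and handle characters outside
A, C, G, T):

    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack_2bit(seq):
        # Pack each base into 2 bits: A=00, C=01, G=10, T=11.
        bits = 0
        for base in seq:
            bits = (bits << 2) | CODE[base]
        nbytes = (2 * len(seq) + 7) // 8      # round up to whole bytes
        return bits.to_bytes(nbytes, "big")

    print(pack_2bit("GACCTTCATT").hex())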

15
The repetitive nature of DNA
  • Take advantage of the repetitive nature of DNA!!
  • LINEs (Long Interspersed Nuclear Elements) and
    SINEs (Short Interspersed Nuclear Elements).
  • UCSC Genome Browser: http://genome.ucsc.edu

16
DNA compression
  • DNA sequences are very compressible, especially
    for higher eukaryotes: they contain many repeats
    of different sizes, with different numbers of
    instances and different amounts of identity.
  • A first idea: While processing the DNA string
    from left to right, detect exact repeats and/or
    palindromes (reverse-complemented repeats) that
    possess previous instances in the already
    processed text, and encode them by the length and
    position of an earlier occurrence. For stretches
    of sequence for which no significant repeat is
    found, use two-bit encoding. (The program
    Biocompress is based on this idea.)
  • Data structure for fast access to sequence
    patterns already encountered.
  • Sliding window along unprocessed sequence.

17
DNA compression
  • A second idea: Build a suffix tree for the whole
    sequence and use it to detect maximal repeats of
    some fixed minimum size. Then code all repeats as
    above and use two-bit encoding for bases not
    contained inside repeats. (The program Cfact is
    based on this idea.)
  • Both of these algorithms are lossless, meaning
    that the original sequences can be precisely
    reconstructed from their encodings. A number of
    lossy algorithms exist, which we will not discuss
    here.
  • In the following, we will discuss the GenCompress
    algorithm due to Xin Chen, Sam Kwong and Ming Li.

18
Edit operations
  • The main idea of GenCompress is to use inexact
    matching, followed by edit operations. In other
    words, instances of inexact repeats are encoded
    by a reference to an earlier instance of the
    repeat, followed by some edit operations that
    modify the earlier instance to obtain the current
    instance.
  • Three standard edit operations:
  • Replace (R, i, char): replace the character at
    position i by character char.
  • Insert (I, i, char): insert character char at
    position i.
  • Delete (D, i): delete the character at position
    i.

19
Edit operations
  • Two different edit operation sequences can convert
    gaccgtcatt into gaccttcatt (C = copy):
    (a)  C C C C R C C C C C        (b)  C C C C D C I C C C C
         g a c c g t c a t t             g a c c g t - c a t t
         g a c c t t c a t t             g a c c - t t c a t t
  • There is an infinite number of ways to convert one
    string into another.
  • Given two strings q and p, an edit transcript
    λ(q, p) is a list of edit operations that
    transforms q into p.
  • E.g., in case (a) the edit transcript is
    λ(gaccgtcatt, gaccttcatt) = (R, 4, t),
  • whereas in case (b) it is λ(gaccgtcatt,
    gaccttcatt) = (D, 4), (I, 4, t).
  • (Positions start at 0 and are given relative to the
    current state of the string, as obtained by
    application of any preceding edit operations.)
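A small Python sketch that applies such an edit transcript (positions are
0-based and refer to the current state of the string, as on the slide):

    def apply_transcript(q, ops):
        s = list(q)
        for op in ops:
            if op[0] == "R":        # (R, i, char): replace character at position i
                _, i, ch = op
                s[i] = ch
            elif op[0] == "I":      # (I, i, char): insert char at position i
                _, i, ch = op
                s.insert(i, ch)
            elif op[0] == "D":      # (D, i): delete the character at position i
                _, i = op
                del s[i]
        return "".join(s)

    print(apply_transcript("gaccgtcatt", [("R", 4, "t")]))            # gaccttcatt
    print(apply_transcript("gaccgtcatt", [("D", 4), ("I", 4, "t")]))  # gaccttcatt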

20
Encoding DNA
  • Using the two-bit encoding method, gaccttcatt can
    be encoded in 20 bits: 10 00 01 01 11 11 01 00
    11 11.
  • The following three encodings represent gaccttcatt
    relative to gaccgtcatt.
  • In the exact matching method we use a pair of
    numbers (repeat position, repeat length) to
    represent an exact repeat. We can encode
    gaccttcatt as (0, 4), t, (5, 5), relative to
    gaccgtcatt. Let 4 bits encode an integer, 2 bits
    encode a base, and one bit indicate whether the
    next part is a pair (indicating a repeat) or a
    base. We obtain an encoding in 21
    bits: 0 0000 0100 1 11 0 0101 0101.

21
Encoding DNA
  • In the approximate matching method we can encode
    gaccttcatt as (0, 10), (R, 4, t) relative to
    gaccgtcatt. Let us encode R by 00, I by 01, D by
    11, and use a single bit to indicate whether the
    next part is a pair or an edit operation. We obtain
    an encoding in 18 bits: 0 0000 1010 1 00 0100 11.
  • For approximate matching, we could also use the
    edit sequence (D, 4), (I, 4, t), for example,
    yielding the relative encoding (0, 10), (D, 4),
    (I, 4, t), which uses 25 bits: 0 0000
    1010 1 11 0100 1 01 0100 11.
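The bit counts on these two slides can be checked with a short Python sketch
that uses the same assumed widths (1 control bit, 4 bits per integer, 2 bits
per base, 2 bits per edit code):

    def cost_exact(items):
        # items: ("pair", pos, length) or ("base", char)
        return sum(1 + 4 + 4 if it[0] == "pair" else 1 + 2 for it in items)

    def cost_approx(items):
        # items: ("pair", pos, length), ("R"/"I", pos, char) or ("D", pos)
        bits = 0
        for it in items:
            if it[0] == "pair":
                bits += 1 + 4 + 4          # control bit + position + length
            elif it[0] == "D":
                bits += 1 + 2 + 4          # control bit + op code + position
            else:
                bits += 1 + 2 + 4 + 2      # R and I also carry a base
        return bits

    print(cost_exact([("pair", 0, 4), ("base", "t"), ("pair", 5, 5)]))    # 21
    print(cost_approx([("pair", 0, 10), ("R", 4, "t")]))                  # 18
    print(cost_approx([("pair", 0, 10), ("D", 4), ("I", 4, "t")]))        # 25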

22
GenCompress
  • GenCompress is a one-pass algorithm based on
    approximate matching.
  • For a given input string w, assume that a part v
    has already been compressed and the remaining
    part is u, with w = vu. The algorithm finds an
    optimal prefix p of u that approximately
    matches some substring of v such that p can be
    encoded economically. After outputting the
    encoding of p, remove the prefix p from u and
    append it to v. If no optimal prefix is found,
    output the next base and then move it from u to
    v. Repeat until u = ε.

(Figure: the input w is split into the already processed part v and the
unprocessed part u; the prefix p of u approximately matches an earlier
instance of p inside v.)
23
The condition C
  • How do we find an optimal prefix p? The
    following condition will be used to limit the
    search.
  • Given two numbers k and b, let p be a prefix of
    the unprocessed part u and q a substring of the
    processed part v. If |q| > k, then an edit transcript
    λ(q, p) is said to satisfy the condition C = (k,
    b) for compression if its number of edit
    operations is ≤ b.
  • Experimental studies indicate that C = (k, b) =
    (12, 3) gives good results.
  • In other words, when attempting to determine the
    optimal prefix for compression, we will only
    consider repeats of length at least k that require
    at most b edit operations.

24
The compression gain function
  • We define a compression gain function G to
    determine whether a particular approximate repeat
    (q, p) and edit transcript λ are beneficial for the
    encoding:
  • G(q, p, λ) = max { 2|p| - |(i, |q|)| - w |λ(q,
    p)| - c, 0 },
  • where
  • p is a prefix of the unprocessed part u,
  • q is a substring of the processed part v of
    length |q| that starts at position i,
  • 2|p| is the number of bits that the two-bit
    encoding would use,
  • |(i, |q|)| is the encoding size of the pair (i, |q|),
  • w is the cost of encoding an edit operation,
  • |λ(q, p)| is the number of edit operations in λ(q,
    p),
  • and c is the overhead proportional to the size of
    control bits.
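A sketch of the gain function under assumed bit widths (4 bits per integer in
the pair, and w and c chosen only for illustration); the real GenCompress uses
its own encoding sizes:

    def gain_bits(p, transcript, w=2, c=1, int_bits=4):
        # G(q, p, lambda) = max(2|p| - |(i, |q|)| - w*|lambda(q, p)| - c, 0)
        pair_bits = 2 * int_bits                 # bits to encode the pair (i, |q|)
        saved = 2 * len(p)                       # bits the plain two-bit encoding would need
        cost = pair_bits + w * len(transcript) + c
        return max(saved - cost, 0)

    print(gain_bits("gaccttcatt", [("R", 4, "t")]))  # positive, so encoding the repeat pays off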

25
The GenCompress algorithm
  • Input: A DNA sequence w, parameter C = (k, b)
  • Output: A lossless compressed encoding of w
  • Initialization: u := w and v := ε
  • while u ≠ ε do
  •   Search for an optimal prefix p of u.
  •   if an optimal prefix p with repeat q in v is
      found then
  •     Encode the repeat representation (i, |q|), where
        i is the position of q in v, together with the
        shortest edit transcript λ(q, p). Output the code.
  •   else Set p equal to the first character of u,
      encode and output it.
  •   Remove the prefix p from u and append it to v.
  • end
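A heavily simplified Python sketch in the spirit of the loop above; it
searches only for exact repeats (no edit operations, no gain function), so it
illustrates the one-pass structure rather than GenCompress itself:

    def greedy_compress(w, min_len=4):
        v, u, out = "", w, []
        while u:                                   # loop until u is empty
            best = None
            for length in range(len(u), min_len - 1, -1):
                i = v.find(u[:length])             # longest prefix of u seen earlier in v
                if i != -1:
                    best = (i, length)
                    break
            if best:
                i, length = best
                out.append(("repeat", i, length))  # (position, length) of earlier instance
                v, u = v + u[:length], u[length:]
            else:
                out.append(("base", u[0]))         # fall back to emitting a single base
                v, u = v + u[0], u[1:]
        return out

    print(greedy_compress("gaccgtcattgaccttcatt"))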