Algorithmic Foundations of Computational Biology: Course 5

Slides: 37
Provided by: applerae
1
Algorithmic Foundations of Computational Biology
Course 5 -
  • Statistical Significance in Bioinformatics
  • Statistics
  • Probability Theory

2
SIGNIFICANT SIMILARITY FOR TWO DNA SEQUENCES
  • ACTACCGCGTAAATTCTAAC
  • ACACTTACGTTAACCCGGGA

Size of sequences: 20. Number of matches: 8.
If the sequences were generated at random with the 4
letters A, C, G, T having equal probability of
occurrence at any position, then the two
sequences should agree at about ¼ of their
positions: 20/4 = 5. But we observe 8
agreements! Is this significant?
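
The observed count of 8 agreements and the expected count of 20/4 = 5 can be checked directly. The sketch below is only an illustration; the helper name `count_matches` is an assumption and does not come from the slides.

```python
def count_matches(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


seq1 = "ACTACCGCGTAAATTCTAAC"
seq2 = "ACACTTACGTTAACCCGGGA"

observed = count_matches(seq1, seq2)
# Under the equal-probability assumption, each position matches
# with probability 1/4, so the expected count is n / 4.
expected = len(seq1) / 4

print(observed, expected)
```
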
3
WHAT ARE THE ASSUMPTIONS ?
  • How unlikely is this outcome if the sequences
    were generated at random ?
  • Assumption: equal probabilities for A, C, G, T at
    any site
  • Assumption: independence of all A, C, G, T
    involved
  • Clearly in our case, something other than chance
    is going on!!!

4
STATISTICS
  • Optimal methods for analyzing data generated by a
    random process
  • What to measure ?
  • ACTACCGCGTAAATTCTAAC
  • ACACTTACGTTAACCCGGGT

Number of matches: 8. Longest run of consecutive matches: 3.
5
ACCURACY OF ASSUMPTIONS
  • The probability is calculated from the
    assumptions made about the data (equal probability
    at any site, and independence)
  • Accuracy of conclusions of statistical analysis
    depends on the accuracy of assumptions made

6
SIMPLIFYING ASSUMPTIONS
  • We need to make simplifying assumptions, even
    when they do not hold.
  • Required by the complex computations involved

7
RANDOM VARIABLES
  • A discrete random variable is a numerical
    quantity that in some experiment that involves
    randomness takes one value from some discrete set
    of values
  • Rolling two six-sided dice: the random variable
    X = sum of the two outcomes
  • Tossing a fair coin: the random variable
    Y = number of tosses until the first head
    appears

8
Number of Matches
  • The number of matches between two random DNA
    sequences of length 20 is a random variable,
    denoted Y
  • The observed value of Y in our example, denoted
    y, equals 8

9
PROBABILITY DISTRIBUTION OF A RANDOM VARIABLE
  • Is the set of values that this random variable
    can take together with their associated
    probabilities
  • Example: toss a fair coin twice. Let X be the
    random variable X = the number of heads
    obtained

Values of X:    0    1    2
Probabilities: .25  .5   .25
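
The distribution in this example can be recovered by listing the four equally likely outcomes of two tosses and counting heads. This is a minimal sketch assuming a fair coin; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses of a fair coin.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give each number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Turn the counts into probabilities.
distribution = {
    heads: Fraction(count, len(outcomes))
    for heads, count in sorted(counts.items())
}

for heads, prob in distribution.items():
    print(heads, float(prob))
```
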
10
INDEPENDENCE
  • A central concept in probability and statistics
  • Two or more events are independent if the outcome
    of one event does not affect in any way any other
    event
  • Discrete random variables are independent if the
    value of one does not affect in any way the
    probabilities associated with the possible values
    of any other random variable

11
Examples
  • Different rolls of a die are independent
  • Different tosses of a coin are independent

12
The BERNOULLI Random Variable
  • A Bernoulli trial is a single trial with two
    outcomes, called success and failure
  • The probability of success is denoted p and the
    probability of failure is q = 1 - p
  • The Bernoulli random variable is
    Y = number of successes obtained in this trial

13
Bernoulli Probability Distribution
  • P(Y = 1) = p and P(Y = 0) = q = 1 - p
14
The BINOMIAL Distribution
  • A Binomial random variable is the number of
    successes in a fixed number n of independent
    Bernoulli trials with the same probability of
    success for each trial
  • The number of heads in some fixed number of
    tosses of a coin is an example of a binomial
    random variable

15
ASSUMPTIONS: the 4 conditions
  • Each trial must result in one of two possible
    outcomes: success or failure
  • Trials must be independent
  • The probability of success must be the same on
    all trials
  • The number n of trials must be fixed in advance,
    not determined by the outcomes of the trials

16
The BINOMIAL Probability Distribution
  • The Binomial random variable is the variable
    Y = number of successes in n trials
  • $P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y}$, for y = 0, 1, ..., n

$\binom{n}{y}$ is read "n choose y", also known as the Binomial
coefficient
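
As an illustration, the Binomial probabilities can be evaluated with the standard formula C(n, y) p^y (1 - p)^(n - y), where C(n, y) is the Binomial coefficient. The function name `binomial_pmf` is an assumption made for this sketch.

```python
from math import comb


def binomial_pmf(y: int, n: int, p: float) -> float:
    """P(Y = y) for Y ~ Binomial(n, p): C(n, y) * p^y * (1 - p)^(n - y)."""
    if not 0 <= y <= n:
        return 0.0
    return comb(n, y) * p**y * (1 - p) ** (n - y)


# Number of heads in 2 tosses of a fair coin (n = 2, p = 0.5).
pmf = [binomial_pmf(y, 2, 0.5) for y in range(3)]
print(pmf)
```

With n = 1 the same function reduces to the Bernoulli case: `binomial_pmf(1, 1, p)` is p and `binomial_pmf(0, 1, p)` is 1 - p.
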
17
Observations
  • Bernoulli distribution is a special case of the
    Binomial distribution (when n = 1)
  • p is often an unknown parameter

18
Careful when using Binomial distribution
  • Are the 4 conditions satisfied ?
  • When comparing two DNA sequences, our question
    about whether 8 matches are due to chance or not
    is based on the assumption that the number of
    matches follows a Binomial distribution
  • Success is the event that two nucleotides in
    corresponding positions in the two sequences
    match
  • ACTACCGCGTAAATTCTAAC
  • ACACTTACGTTAACCCGGGT
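
If the Binomial assumption is accepted as an approximation, the question about 8 matches becomes a tail probability: the chance of 8 or more successes in n = 20 trials with p = 1/4. The sketch below is one way to compute it; it is valid only under the equal-probability and independence assumptions, and the names are illustrative.

```python
from math import comb

n, p, y_obs = 20, 0.25, 8


def binomial_pmf(y: int, n: int, p: float) -> float:
    """P(Y = y) for Y ~ Binomial(n, p)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


# P(Y >= 8): probability of at least as many matches as were observed.
tail = sum(binomial_pmf(y, n, p) for y in range(y_obs, n + 1))
print(round(tail, 4))
```

Under these assumptions the tail probability comes out at roughly 0.10.
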

19
Careful (cont)
  • It is not necessarily true that the probability
    of success is the same at all sites
  • It is not necessarily true that independence
    holds: population genetics shows that
    nucleotide frequencies at nearby sites tend to
    evolve in a dependent fashion, leading to dependence
    between successes observed at very close sites
  • Thus 2 of the 4 conditions for a Binomial
    distribution may not hold for our comparison of
    a pair of DNA sequences

20
SIMPLIFICATIONS ARE A MUST
  • Still, it might be desirable to make these
    incorrect assumptions as approximations
  • Constructing models implies making simplifying
    assumptions about the process generating the data

21
The UNIFORM Distribution
  • The simplest probability distribution
  • A uniformly distributed random variable Y takes
    the values 1, 2, ..., m, each with the same
    probability 1/m

22
The GEOMETRIC Distribution
  • Suppose a sequence of independent Bernoulli
    trials is performed, each having probability of
    success p
  • The geometrically distributed random variable is the
    variable Y = the number of trials before, but not
    including, the first failure
  • The possible values of the random variable
    are 0, 1, 2, 3, ...

23
The GEOMETRIC Distribution (cont)
  • The probability of several independent events is
    the product of their probabilities
  • For Y = y, there must be y successes followed by
    one failure, so P(Y = y) = p^y (1 - p)
  • The length of a successful run
  • ACTACCGCGTAAATTCTAAC
  • ACACTTACGTTAACCCGGGT
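
Multiplying the probabilities of y successes and one failure gives P(Y = y) = p^y (1 - p) for y = 0, 1, 2, .... The sketch below evaluates that formula and measures the longest run of matches in the two sequences above; the function names are assumptions made for illustration.

```python
def geometric_pmf(y: int, p: float) -> float:
    """P(Y = y): y successes (probability p each) followed by one failure."""
    if y < 0:
        return 0.0
    return p**y * (1 - p)


def longest_match_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive matching positions."""
    best = current = 0
    for x, y in zip(a, b):
        # Extend the current run on a match, reset it on a mismatch.
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


seq1 = "ACTACCGCGTAAATTCTAAC"
seq2 = "ACACTTACGTTAACCCGGGT"

run = longest_match_run(seq1, seq2)
print(run, geometric_pmf(run, 0.25))
```
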

24
The NEGATIVE BINOMIAL Distribution
  • A sequence of independent Bernoulli trials each
    with a probability p of success
  • The Binomial distribution has n such trials with
    n fixed in advance, and the random variable is
    the number of successes in these n random trials
  • In the Generalized Geometric distribution, the
    number of successes is fixed in advance, at some
    value m, and the random variable is N, the number
    of trials up to and including the m-th success
  • N is said to have the negative binomial
    distribution

25
The NEGATIVE BINOMIAL Distribution (cont)
  • The probability that N = n is the probability that
    the first n-1 trials result in exactly m-1
    successes and n-m failures, and that trial n
    results in success
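
Written out, this argument gives P(N = n) = C(n-1, m-1) p^m (1 - p)^(n-m) for n = m, m+1, ...: a Binomial probability for the first n-1 trials times p for the final success. The sketch below evaluates it; the function name is an assumption.

```python
from math import comb


def negative_binomial_pmf(n: int, m: int, p: float) -> float:
    """P(N = n): the m-th success occurs exactly on trial n."""
    if m < 1 or n < m:
        return 0.0
    # m-1 successes among the first n-1 trials, then a success on trial n.
    return comb(n - 1, m - 1) * p**m * (1 - p) ** (n - m)


# Probability that the 2nd head of a fair coin appears on the 3rd toss.
example = negative_binomial_pmf(3, 2, 0.5)
print(example)
```
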

26
PROBABILITY THEORY
  • Probability measures uncertainty
  • Experiments are performed involving chance or
    randomness; they are things that can be repeated.
  • Suppose you roll a pair of dice once.
  • You get a pair of numbers (a,b) such that
    a = 1, ..., 6 and b = 1, ..., 6

Sample space (the 36 possible outcomes):
    (1,1),(1,2),(1,3),(1,4),(1,5),(1,6),
    (2,1),(2,2),(2,3),(2,4),(2,5),(2,6),
    (3,1),(3,2),(3,3),(3,4),(3,5),(3,6),
    (4,1),(4,2),(4,3),(4,4),(4,5),(4,6),
    (5,1),(5,2),(5,3),(5,4),(5,5),(5,6),
    (6,1),(6,2),(6,3),(6,4),(6,5),(6,6)
27
PROBABILITY THEORY (cont)
  • The things that we measure are called events
  • Rolling a 7: {(1,6), (2,5), (3,4),
    (4,3), (5,2), (6,1)}
  • We say that the experiment of rolling a pair
    of dice gives rise to a sample space S, which is
    just the 36 possible outcomes, and an event is
    just a set of some of these outcomes.
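
The sample space and the event "rolling a 7" can be written as plain sets. This is a small sketch; the variable names are illustrative.

```python
from itertools import product

# Sample space S: all ordered pairs (a, b) with a, b in 1..6.
sample_space = set(product(range(1, 7), repeat=2))

# The event "rolling a 7" is the subset of outcomes whose sum is 7.
rolling_seven = {(a, b) for (a, b) in sample_space if a + b == 7}

print(len(sample_space), sorted(rolling_seven))
```
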

28
PROBABILITY THEORY (cont)
  • Tossing a coin twice
  • Outcome example: (H,T)
  • Sample space S = {(H,H), (H,T), (T,H), (T,T)}
  • Event A: at least one head occurs
  • A = {(H,H), (H,T), (T,H)}

29
PROBABILITY THEORY (cont)
  • Sample space provides a mathematical model of
    real-life situations for which it is supposed to
    be an abstraction
  • Mathematical analyses can only be performed on
    the abstract objects of the sample space and not
    on the real-life situation itself
  • Since the abstraction resembles the real world, you
    may think that the mathematical relationships you
    found have something to do with the real world
  • You can now perform scientific experiments to
    check out the real-world situation

30
PROBABILITY THEORY (cont)
  • If you were successful, the mathematical model
    helped you decipher the real world; you will
    know this because the results of your experiments
    are consistent with the mathematical
    relationships you obtained from the model
  • It could, of course, also happen that your
    mathematical model was too simple, or otherwise
    in error, and did not give a true picture of the
    real world. In such a case, the mathematical
    relationships, while true for the model, cannot
    be verified by the laboratory experiments. We
    then need another, better model.

31
PROBABILITY THEORY (cont)
  • The sample space constructed to model a real-life
    situation is a figment of the imagination of the
    observer of that situation; it depends on what
    the observer thinks is important. It is not in
    general unique, and it depends on the subjective
    interpretation of what is the relevant
    information.

32

Tyche, or Fortuna, the Goddess of Probability

33
PROBABILITY THEORY (cont)
  • Consider the Sample Space S, say with the 36
    outcomes of rolling a pair of dice.
  • To each outcome in the sample space,
    associate a number between 0 and 1 such that the
    sum of these numbers over all outcomes is equal
    to 1.
  • The number associated with a particular outcome
    is called the probability of the outcome, and the
    entire assignment of probabilities to outcomes is
    called a probability distribution on S.

34
PROBABILITY THEORY (cont)
  • We now define the probability for any event A in
    the sample space S.
  • If A is the empty set, P(A) = 0.
  • If A is not empty, then P(A) is the sum of the
    probabilities of the outcomes in A.
  • So given the probability distribution on S we can
    figure out the probabilities of all events in S.
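
A probability distribution on S can be stored as a mapping from outcomes to probabilities, and the probability of an event is then a sum over its outcomes. The sketch below assumes fair dice and a fair coin, so every outcome gets the same probability; that uniform assignment and the function name `event_probability` are assumptions, not something the slides fix.

```python
from fractions import Fraction
from itertools import product


def event_probability(event, distribution):
    """P(A): sum of the probabilities of the outcomes in A (0 if A is empty)."""
    return sum((distribution[o] for o in event), Fraction(0))


# Pair of dice: 36 outcomes, each with probability 1/36 (fair dice assumed).
dice = list(product(range(1, 7), repeat=2))
dice_dist = {o: Fraction(1, 36) for o in dice}
p_seven = event_probability([o for o in dice if sum(o) == 7], dice_dist)

# Two coin tosses: 4 outcomes, each with probability 1/4 (fair coin assumed).
coin = list(product("HT", repeat=2))
coin_dist = {o: Fraction(1, 4) for o in coin}
p_head = event_probability([o for o in coin if "H" in o], coin_dist)

print(p_seven, p_head)
```
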

35
PROBABILITY SPACE
  • The sample space with its probability
    distribution is called a probability space

36
The Car and Goat Problem
  • Monty Hall, the master of ceremonies at the
    Let's Make a Deal game show, confronts you with
    three closed doors, one of which hides the car of
    your dreams. Behind each of the other two doors,
    however, is standing a smelly goat. You will
    choose a door and win whatever is behind it.
  • You decide on a door, and announce your choice.
  • Your host then opens one of the other two doors
    and reveals a goat.
  • He then asks you whether you would like to switch
    your choice to the unopened door that you did not
    at first choose.
  • Is it to your advantage to switch?
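
One way to explore the question is to enumerate the probability space explicitly. The sketch below rests on assumptions the slide does not state: the car is equally likely to be behind each door, the host knows where it is and always opens a goat door other than your pick, and he chooses at random when two such doors are available.

```python
from fractions import Fraction
from itertools import product

doors = (0, 1, 2)
p_stay = Fraction(0)
p_switch = Fraction(0)

# Each (car position, initial pick) pair has probability 1/9.
for car, pick in product(doors, repeat=2):
    # Doors the host may open: not the player's pick and not the car.
    openable = [d for d in doors if d != pick and d != car]
    for opened in openable:
        weight = Fraction(1, 9) / len(openable)
        # The door the player would move to when switching.
        switched = next(d for d in doors if d not in (pick, opened))
        p_stay += weight * (pick == car)
        p_switch += weight * (switched == car)

print(p_stay, p_switch)
```

Under different assumptions about the host's behaviour the two probabilities can change, so the result applies only to the model stated above.
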


Monty Hall's game show
Let's Make a Deal