Title: Testing observations against an expected distribution
1. Testing observations against an expected distribution
You mate two banana slugs that are yellow, but heterozygous for a recessive pink allele, and observe 40 offspring.
How can you test whether these results disagree with a model where pink color is recessive to yellow color?
Stats terminology:
H0: The data is consistent with the model
HA: The data is not consistent
Goal: Can we reject H0?
2. What is the chi-square test?
- The chi-square test is used to test if a sample of data came from a population with a specific distribution. [1]
- From Wikipedia [2]: It tests a null hypothesis that the relative frequencies of occurrence of observed events follow a specified frequency distribution. The events are assumed to be independent and have the same distribution, and the outcomes of each event must be mutually exclusive. A simple example is the hypothesis that an ordinary six-sided die is "fair", i.e., all six outcomes occur equally often.
[1] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
[2] Wikipedia, Pearson's chi-square test, http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test
3. Pearson chi-square test
- Summed over the k different outcomes: chi-square = sum of (observed - expected)^2 / expected
- What does this chi-square value mean?
- How can we determine if this chi-square value is
significant?
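As a minimal sketch of the calculation, assuming the standard Pearson statistic (the sum over the k outcomes of (observed - expected)^2 / expected), the Python below applies it to the fair-die hypothesis. The function name `chi_square_statistic` and the roll counts are made up for illustration; they are not data from these slides.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square: sum over outcomes of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 60 rolls of a die (illustrative only).
rolls = [8, 12, 9, 11, 10, 10]
# A fair die predicts each face equally often: 60 / 6 = 10 per face.
fair = [sum(rolls) / len(rolls)] * len(rolls)

chi2_die = chi_square_statistic(rolls, fair)
print(f"chi-square = {chi2_die:.2f}")
```

A value of 0 would mean the counts match the model exactly; the further the counts drift from the expected values, the larger the statistic.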
4. Chi-square tests: Example 1
- You mate two banana slugs that are yellow, but heterozygous for a recessive pink allele, and observe 40 offspring
- Given the following phenotypic classes, does this data significantly differ from the expectation for a Mendelian dominant/recessive allele?
- Expect a 3:1 dominant:recessive ratio, so 40 x ¼ = 10 pink and 40 x ¾ = 30 yellow
- How do we know if this chi-square value means the
data was significantly different than our
expected values?
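One way to work this example through in Python, taking as observed counts the 28 yellow and 12 pink offspring quoted in the degrees-of-freedom slide. The variable names are illustrative, and this is a sketch of the arithmetic rather than a general-purpose routine.

```python
n_offspring = 40
observed = {"yellow": 28, "pink": 12}

# 3:1 dominant:recessive ratio -> 3/4 yellow, 1/4 pink.
expected = {"yellow": n_offspring * 3 / 4, "pink": n_offspring * 1 / 4}

# Each class contributes (observed - expected)^2 / expected.
contributions = {
    phenotype: (observed[phenotype] - expected[phenotype]) ** 2 / expected[phenotype]
    for phenotype in observed
}
chi2_slugs = sum(contributions.values())

for phenotype, value in contributions.items():
    print(f"{phenotype}: expected {expected[phenotype]:.0f}, contributes {value:.3f}")
print(f"chi-square = {chi2_slugs:.3f}")
```

The rarer class (pink) contributes most of the total, because the same deviation of 2 offspring is divided by a smaller expected count.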
5. What is a significant chi-square result?
Significance (aka p-value): how often would you expect to find a chi-square value that large by chance?
Exact method: You can determine the probability from the distribution using a statistics program/package (or calculators on various websites from Google). Excel: CHIDIST(chi-square value, degrees of freedom)
Typical usage: Compare the chi-square value against known critical values: the chi-square values that correspond to certain useful probabilities (0.01, 0.001, etc.)
Links: http://www.stat.tamu.edu/west/applets/chisqdemo1.html will let you see the distribution for various df (degrees of freedom)
Graph source: Hyperstat (http://davidmlane.com/hyperstat/A100557.html)
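A sketch of the "exact method" using only the Python standard library. The helper name `chi2_p_value` is hypothetical; it uses the closed forms for 1 and 2 degrees of freedom plus a standard recurrence in df, and is only meant for positive integer df. Statistics packages offer the same upper-tail probability (for example `scipy.stats.chi2.sf`), which is what Excel's CHIDIST reports.

```python
import math


def chi2_p_value(chi2, df):
    """Upper-tail probability P(X >= chi2) for a chi-square variable
    with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if chi2 <= 0:
        return 1.0
    half = chi2 / 2.0
    # Closed forms: df = 1 -> erfc(sqrt(x/2)); df = 2 -> exp(-x/2).
    if df % 2 == 1:
        p, nu = math.erfc(math.sqrt(half)), 1
    else:
        p, nu = math.exp(-half), 2
    # Recurrence: Q(x; nu+2) = Q(x; nu) + (x/2)^(nu/2) * e^(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        p += half ** (nu / 2) * math.exp(-half) / math.gamma(nu / 2 + 1)
        nu += 2
    return p


# Example 1: 28 yellow and 12 pink against 30 and 10 expected gives
# a chi-square of about 0.533 with 1 degree of freedom.
print(f"p = {chi2_p_value(0.533, 1):.2f}")
```

For Example 1 this should agree with the p-value of about 0.47 quoted in these slides.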
6. Sidetrack: Degrees of freedom
- Degrees of freedom refers to how many values you are free to choose in your dataset
- For 1 variable (e.g., genotypes): Since the total has to equal N, if you have k categories (genotypes), once you know the number of observations for the first k-1 categories, the kth category must be N minus the sum
- In the previous example: you have 40 individuals, and once you know there are 28 yellow then there have to be 12 pink; thus, there is only 1 value you can choose, or 1 degree of freedom
- Why it matters: if you have many degrees of freedom, you will get a higher chi-square value by chance, and thus the same chi-square value will be less significant (it will have a larger p-value)
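To see why degrees of freedom matter, the simulation sketch below draws random samples from a model that is true by construction and averages the resulting chi-square values. It assumes k equally likely categories and 120 observations per sample; those settings and the function name `simulated_mean_chi2` are illustrative choices, not taken from the slides.

```python
import random


def simulated_mean_chi2(k, n=120, trials=2000, seed=0):
    """Average Pearson chi-square over random samples drawn from a model
    with k equally likely categories (so the null hypothesis is true)."""
    rng = random.Random(seed)
    expected = n / k
    total = 0.0
    for _ in range(trials):
        counts = [0] * k
        for _ in range(n):
            counts[rng.randrange(k)] += 1
        total += sum((c - expected) ** 2 / expected for c in counts)
    return total / trials


for k in (2, 4, 6):
    mean_chi2 = simulated_mean_chi2(k)
    print(f"{k} categories (df = {k - 1}): mean chi-square ~ {mean_chi2:.2f}")
```

The averages should land close to k - 1, the degrees of freedom: with more categories, chance alone produces larger chi-square values, so a given value is weaker evidence against the model.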
7. Example critical values
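A minimal sketch of where such critical values come from, limited to 1 and 2 degrees of freedom because those two cases have simple closed forms. The function name `critical_value` is hypothetical; other df need numerical inversion or a statistics package.

```python
import math
from statistics import NormalDist


def critical_value(alpha, df):
    """Chi-square value whose upper-tail probability is alpha (df 1 or 2 only)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    if df == 1:
        # A chi-square variable with 1 df is a squared standard normal.
        return NormalDist().inv_cdf(1 - alpha / 2) ** 2
    if df == 2:
        # With 2 df the upper tail is exp(-x/2), which inverts directly.
        return -2 * math.log(alpha)
    raise NotImplementedError("this sketch only covers df = 1 and df = 2")


alphas = (0.05, 0.025, 0.01, 0.001)
print("df  " + "  ".join(f"p={a:<6}" for a in alphas))
for df in (1, 2):
    print(f"{df:<3} " + "  ".join(f"{critical_value(a, df):8.3f}" for a in alphas))
```

The df = 1 entry at 0.025 should match the critical value of 5.024 used in the sample-size slide.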
8. Example 1
- Given the following phenotypic classes, does this data significantly differ from the expectation for a Mendelian dominant/recessive allele?
- Expect a 3:1 dominant:recessive ratio, so 40 x ¼ = 10 pink and 40 x ¾ = 30 yellow
- This data has 1 degree of freedom, so look up the table for df = 1 (actual p = 0.47)
9. Sample size matters!
- Expect a 3:1 dominant:recessive ratio, so 400 x ¼ = 100 pink and 400 x ¾ = 300 yellow
- Critical value is 5.024 for 0.025 significance with 1 degree of freedom (actual p-value is 0.021)
- Thus, now this is significant at a 0.025 significance threshold (even though the ratio of yellow:pink is the same as the previous example)
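The effect of sample size can be checked directly. The sketch below runs the same 3:1 test on 28:12 offspring and on 280:120, the same yellow:pink ratio scaled to 400 offspring. The function name `slug_test` is hypothetical, and the p-value formula used here is valid only for 1 degree of freedom.

```python
import math


def slug_test(n_yellow, n_pink):
    """Chi-square statistic and p-value against a 3:1 yellow:pink model (df = 1)."""
    total = n_yellow + n_pink
    exp_yellow, exp_pink = total * 3 / 4, total * 1 / 4
    chi2 = ((n_yellow - exp_yellow) ** 2 / exp_yellow
            + (n_pink - exp_pink) ** 2 / exp_pink)
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


for counts in ((28, 12), (280, 120)):
    chi2, p = slug_test(*counts)
    print(f"N = {sum(counts)}: chi-square = {chi2:.3f}, p = {p:.3f}")
```

Multiplying every count by 10 multiplies the chi-square value by 10, which is why the same ratio moves from p of about 0.47 to about 0.021.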
10. Example 2: Two traits
- Analyzing 2 traits (A/a and B/b)
- Given the following phenotypic class frequencies, does this data significantly differ from the expectation for a two-trait Mendelian dominant/recessive model?
- If that hypothesis is true, you expect a 9:3:3:1 phenotypic ratio
11. Example 2: Two traits
- In this case, there are 3 degrees of freedom (as there are 4 possible phenotypic classes, but the 4th must be 160 minus the sum of the first three)
- Thus, you would reject the model at the 0.01 significance level, but not at the 0.001 significance level (actual p-value = 0.0024)
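A short sketch of the bookkeeping for this example: the expected counts implied by a 9:3:3:1 ratio among 160 offspring, and the decision at each threshold given the quoted p-value. The observed counts are not listed in the text, so no chi-square value is recomputed here; the class labels and the function name `expected_counts` are illustrative.

```python
def expected_counts(ratio, total):
    """Split `total` observations according to a ratio such as 9:3:3:1."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]


# Illustrative phenotype labels; the text gives only the ratio and N = 160.
classes = ["A_B_", "A_bb", "aaB_", "aabb"]
expected = expected_counts((9, 3, 3, 1), 160)
for name, count in zip(classes, expected):
    print(f"{name}: expect {count:.0f}")

df = len(classes) - 1  # the last class is fixed once the other three are known
p_reported = 0.0024    # p-value quoted for this example
for alpha in (0.01, 0.001):
    verdict = "reject" if p_reported < alpha else "do not reject"
    print(f"alpha = {alpha}: {verdict} the model (df = {df})")
```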
12. Multiple hypothesis testing
- What does a significance of 0.01 mean?
- You expect to observe a spurious result as significant as your result 1 out of every 100 times
- What happens if you test many thousands of different genes/traits/etc.?
13.
- As an example: you want to find differentially expressed genes on a microarray
- Take a typical Affymetrix human expression microarray: 20,000 genes
- Using some algorithm, you find 300 genes significantly enriched at p < 0.01
- The problem: p < 0.01 means that 1 of every 100 genes will be that significant by chance; thus, you would expect 0.01 x 20,000 = 200 genes to be that significant just by random chance! In other words, about two-thirds of your 300 genes are likely false positives that probably won't be interesting for further study
- Multiple hypothesis correction: many different methods, depending on your specific topic/question (more complicated than required for this class; for more info consult a statistics person/textbook)
- Simplest (and most conservative) method: Bonferroni correction
- If you are testing 20,000 genes, multiply every p-value by 20,000 to get an adjusted p-value
- Thus, if you want an adjusted error rate of 0.01, you would look for genes with an original p < 5x10^-7
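The arithmetic on this slide can be sketched as follows. The three raw p-values are hypothetical and only show how the adjustment behaves; the function name `bonferroni` is likewise an illustrative choice.

```python
n_genes = 20_000
alpha = 0.01
n_hits = 300

# With no real effects, about alpha * n_genes tests pass the threshold by chance.
expected_false_positives = alpha * n_genes
print(f"expected by chance: {expected_false_positives:.0f} of {n_hits} hits "
      f"({expected_false_positives / n_hits:.0%})")


def bonferroni(p_values, n_tests=None):
    """Multiply each p-value by the number of tests, capping at 1."""
    n = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * n) for p in p_values]


# Hypothetical raw p-values for three genes (illustrative, not real data).
raw = [2e-7, 5e-6, 0.004]
adjusted = bonferroni(raw, n_tests=n_genes)
for p, adj in zip(raw, adjusted):
    flag = "significant" if adj < alpha else "not significant"
    print(f"raw p = {p:.1e} -> adjusted p = {adj:.3g} ({flag})")

print(f"equivalent raw threshold: {alpha / n_genes:.0e}")
```

Comparing adjusted p-values against 0.01 is the same as comparing raw p-values against 0.01 / 20,000 = 5x10^-7.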