Transcript and Presenter's Notes

Title: Statistics


1
Statistics
  • Achim Tresch
  • Gene Center
  • LMU Munich

2
Topics
  • I. Descriptive Statistics
  • II. Test theory
  • III. Common tests
  • IV. Bivariate Analysis
  • V. Regression

3
III. Common Tests
Two group comparisons
4
III. Common Tests
Two group comparisons
Question/Hypothesis: Is the expression of gene g
in group 1 less than the expression in group 2?
Data: expression of gene g in different samples
(absolute scale)
Test statistic d: e.g. the difference of the group means
Decision for "less expressed" if the test statistic falls below a critical value
5
III. Common Tests
Two group comparisons
Problem: d is not scale invariant.
Solution: divide d by (an estimate of) its standard deviation.
This is essentially the two-sample t-statistic
(for unpaired samples).
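The formula itself did not survive the transcript. As a reference point (my assumption, matching the equal-variance case described here), the standard unpaired two-sample t-statistic is:

```latex
d = \bar{x}_1 - \bar{x}_2, \qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}, \qquad
s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}
```

Here s_p is the pooled standard deviation; dividing by it removes the dependence on the measurement scale.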
6
III. Common Tests
The t-test
  • There are variants of the t-test
  • One group:
  • one-sample t-test (is the group mean equal to µ?), based on the sample mean and variance
  • Two group comparisons:
  • paired t-test (are the group means equal?) -> later
  • two-sample t-test assuming equal variances
  • two-sample t-test not assuming equal variances
    (Welch test)
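A minimal sketch of these variants with SciPy; the data are toy samples (the same values reused on the Wilcoxon slides below), and the population mean of 10 is an arbitrary value chosen for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([18., 3., 6., 9., 5.])    # group 1 (toy data)
y = np.array([15., 10., 8., 7., 12.])  # group 2 (toy data)

stats.ttest_1samp(x, popmean=10.0)      # one-sample t-test: is the group mean equal to 10?
stats.ttest_rel(x, y)                   # paired t-test on pairs (x_i, y_i)
stats.ttest_ind(x, y, equal_var=True)   # two-sample t-test assuming equal variances
stats.ttest_ind(x, y, equal_var=False)  # Welch's t-test, not assuming equal variances
```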

7
III. Common Tests
The t-test
Requirement: the data is approximately normally
distributed in both groups (there are ways to
check this, e.g. with the Kolmogorov-Smirnov test)
Decision rule for the unpaired t-test or the Welch test: reject if the t-statistic exceeds the critical value of the corresponding t-distribution
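One way to check the normality requirement in practice, sketched with SciPy. Note that plugging estimated parameters into the Kolmogorov-Smirnov test makes its p-value only approximate; the Shapiro-Wilk test is a common alternative for small samples:

```python
import numpy as np
from scipy import stats

x = np.array([18., 3., 6., 9., 5.])   # one group of measurements (toy data)

# Kolmogorov-Smirnov test against a normal with estimated mean and sd
stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

# Shapiro-Wilk test, often preferred for small samples
stats.shapiro(x)
```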
8
III. Common Tests
Wilcoxon rank sum test (Mann-Whitney test, U-test)
Non-parametric test for the comparison of two groups:
Is the distribution in group 1 systematically shifted relative to group 2?
Data:
Group 1: 18, 3, 6, 9, 5
Group 2: 15, 10, 8, 7, 12
Original scale (pooled, sorted): 3 5 6 7 8 9 10 12 15 18
Rank scale: 1 2 3 4 5 6 7 8 9 10
Rank sum group 1: 1 + 2 + 3 + 6 + 10 = 22
Rank sum group 2: 4 + 5 + 7 + 8 + 9 = 33
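The rank sums can be reproduced directly from the pooled sample, for example with scipy.stats.rankdata; a sketch using the data from this slide:

```python
import numpy as np
from scipy.stats import rankdata

g1 = np.array([18, 3, 6, 9, 5])
g2 = np.array([15, 10, 8, 7, 12])

ranks = rankdata(np.concatenate([g1, g2]))  # ranks over the pooled sample (ties get mid-ranks)
w1 = ranks[:g1.size].sum()                  # rank sum of group 1 -> 22.0
w2 = ranks[g1.size:].sum()                  # rank sum of group 2 -> 33.0
```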
9
III. Common Tests
Wilcoxon rank sum test (Mann-Whitney test, U-test)
The test statistic is the rank sum W of group 1.
The corresponding p-value can be calculated
exactly for small group sizes. Approximations
are available for larger group sizes (N > 20).
P(W <= 22 | H0) ~ 0.15
The Wilcoxon test can be carried out as a
one-sided or as a two-sided test (default)
[Figure: null distribution of the rank sum W for group 1 (group sizes 5 and 5); x-axis: Wilcoxon W, roughly 15 to 40; the observed value W = 22 is marked]
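The exact probability P(W <= 22 | H0) can be checked by brute force: under H0 every choice of 5 of the 10 ranks for group 1 is equally likely. A small sketch (scipy.stats.mannwhitneyu with method='exact' gives the same one-sided p-value):

```python
from itertools import combinations

n1, n2 = 5, 5
# all possible rank sums of group 1 under H0 (every 5-subset of ranks 1..10 equally likely)
all_rank_sums = [sum(c) for c in combinations(range(1, n1 + n2 + 1), n1)]
p = sum(w <= 22 for w in all_rank_sums) / len(all_rank_sums)
print(p)   # 39/252 ~ 0.155, matching the ~0.15 quoted above
```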
10
III. Common Tests
Tests for paired samples
Reminder, paired data: there are coupled
measurements (xi, yi) of the same kind.
Data: x1, x2, ..., xn and y1, y2, ..., yn
Essential step: calculate the pairwise differences
d1 = x1 - y1, d2 = x2 - y2, ..., dn = xn - yn.
Then perform a one-sample t-test on the data (dj) with µ = 0.
Advantage over unpaired analysis: removal of inter-individual intra-group variance.
NB: approximately normal distribution of the data in
both groups is a requirement.
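A sketch of the paired procedure in SciPy, using hypothetical pulse measurements; testing the differences against µ = 0 is equivalent to calling scipy.stats.ttest_rel directly:

```python
import numpy as np
from scipy import stats

untrained = np.array([72., 78., 81., 69., 75.])  # hypothetical paired pulse values
trained   = np.array([68., 74., 80., 66., 73.])

d = untrained - trained
stats.ttest_1samp(d, popmean=0.0)     # one-sample t-test on the differences
stats.ttest_rel(untrained, trained)   # the same paired t-test in one call
```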
11
III. Common Tests
t-Test for paired samples
Graphical description: [boxplots of pulse for trained vs. untrained individuals, plus a boxplot of the paired differences]
12
III. Common Tests
Wilcoxon signed rank test
Non-parametric version for paired samples:
Are the values in group 1 smaller than in group 2?
Data:
Group 1: 18, 3, 6, 9, 5
Group 2: 15, 10, 8, 7, 12
Differences (group 2 - group 1): -3, 7, 2, -2, 7
Idea: if the groups are not different, the
mirrored distribution of the differences with a
negative sign should be similar to the
distribution of the differences with a positive
sign. Check this similarity with a Wilcoxon rank sum
test comparing the negative and the positive differences.
13
III. Common Tests
Wilcoxon signed rank test
Negative differences: -3, -2; positive differences: 2, 7, 7
Absolute values (pooled, sorted): 2, 2, 3, 7, 7
Ranks of the absolute values: 1.5, 1.5, 3, 4.5, 4.5
Rank sums: group 1 (negative differences) 1.5 + 3 = 4.5;
group 2 (positive differences) 1.5 + 4.5 + 4.5 = 10.5
-> Perform a Wilcoxon rank sum test for group 1 (n = 2) vs. group 2 (n = 3).
In case of k ties (k identical values), replace
their ranks j, ..., j+k-1 by the common value
j + (k-1)/2.
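A sketch of the signed rank test in SciPy on the differences from this example; with only five pairs and tied absolute values, SciPy falls back to a normal approximation and may emit a warning:

```python
from scipy.stats import wilcoxon

g1 = [18, 3, 6, 9, 5]
g2 = [15, 10, 8, 7, 12]
diffs = [b - a for a, b in zip(g1, g2)]   # -3, 7, 2, -2, 7 as above

wilcoxon(diffs)                           # two-sided test (default)
wilcoxon(diffs, alternative='greater')    # one-sided: are the group-2 values larger?
```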
14
III. Common Tests
Summary: group comparison of a continuous endpoint
Question: Are group 1 and group 2 identical with
respect to the distribution of the endpoint?
Does the data follow a Gaussian distribution?
  yes -> Paired data?
           yes -> t-test for paired data
           no  -> t-test for unpaired data
  no  -> Paired data?
           yes -> Wilcoxon signed rank test
           no  -> Wilcoxon rank sum test
15
III. Common Tests
Comparison of two binary variables, unpaired data:
Fisher's exact test
Are there differences between the distributions of
variable Y in the two groups defined by variable X?

              Var. Y = 0   Var. Y = 1   Sum
Var. X = 0        a            b        a+b
Var. X = 1        c            d        c+d
Sum              a+c          b+d       N = a+b+c+d

The row and column totals (margins) are regarded as given.
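A minimal sketch with SciPy; the 2x2 counts are hypothetical:

```python
from scipy.stats import fisher_exact

table = [[8, 2],   # hypothetical counts: [[a, b],
         [1, 9]]   #                       [c, d]]
odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(odds_ratio, p)
```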
16
III. Common Tests
Comparison of two binary variables, paired data:
McNemar test (mnemonic: "Sparsamer Schotte", the thrifty Scot)
Are the measurements under the two conditions concordant or discordant?
Example: clinical trial, placebo vs. verum (each
individual receives both treatments on different
occasions)

                         Effect of placebo
                         yes        no
Effect of verum   yes     31        15
Effect of verum   no       2        14

Concordant pairs: 31 and 14 (same response under both treatments);
discordant pairs: 15 and 2
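The McNemar statistic only uses the discordant pairs. A sketch of the chi-squared approximation for the table above (dedicated implementations, e.g. in statsmodels, also offer an exact binomial version):

```python
from scipy.stats import chi2

b, c = 15, 2                      # discordant pairs from the table above
stat = (b - c) ** 2 / (b + c)     # chi-squared approximation, 1 degree of freedom
p = chi2.sf(stat, df=1)
print(stat, p)                    # ~9.94, p ~ 0.002
```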
17
III. Common Tests
Comparison of two binary variables, paired data:
McNemar test (mnemonic: "Sparsamer Schotte", the thrifty Scot)
Are the measurements in measurement 1 and measurement 2 concordant or discordant?

                                Var. X, measurement 2
                                0          1          Sum
Var. X, measurement 1    0      a          b          a+b
Var. X, measurement 1    1      c          d          c+d
                         Sum   a+c        b+d         N = a+b+c+d
18
III. Common Tests
Comparison of two categorical variables, unpaired data:
chi-squared test (χ²-test)
H0: the two variables are independent.

              Var. Y = 1   ...   Var. Y = s   Sum
Var. X = 1       n11       ...      n1s       n1.
   ...           ...       njk      ...       ...
Var. X = r       nr1       ...      nrs       nr.
   Sum           n.1       ...      n.s       N

Under H0 (independence), the expected cell counts are e_jk = n_j. * n_.k / N.
Idea: measure the deviation of the observed counts from these expected counts,
χ² = Σ_jk (n_jk - e_jk)² / e_jk
This test statistic asymptotically follows a
χ²-distribution. -> Requirement: each cell
contains at least 5 (expected) counts.
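A sketch with SciPy; the r x s table of counts is hypothetical. chi2_contingency computes the expected counts, the χ² statistic, and the asymptotic p-value:

```python
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([[20, 30, 25],    # hypothetical 2 x 3 contingency table
                   [40, 15, 20]])
chi2_stat, p, dof, expected = chi2_contingency(counts)
print(chi2_stat, p, dof)
print(expected)   # check that all expected counts are at least 5
```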
19
III. Common Tests
Summary: comparison of two categorical variables
Question: Is there a difference in the frequency
distributions of one variable w.r.t. the values
of the second variable?
Binary data?
  yes -> Paired data?
           yes -> McNemar test
           no  -> Fisher's exact test
  no  -> Paired data?
           yes -> (bivariate symmetry tests)
           no  -> chi-squared (χ²) test
20
III. Common Tests
Summary: description and testing (two-sample comparison)
(built up row by row on slides 20-25)

variable     design    description (numerical)               description (graphical)   test
continuous   unpaired  median, quartiles                     2 boxplots                Wilcoxon rank sum / unpaired t-test
continuous   paired    median, quartiles of the differences  boxplot of differences    Wilcoxon signed rank / paired t-test
binary       unpaired  cross table, row %                    (3D) barplot              Fisher's exact test
binary       paired    cross table (McNemar table)           (3D) barplot              McNemar test
categorical  unpaired  cross table, row %                    (3D) barplot              χ²-test

t-tests: for Gaussian distributions or at least symmetric
distributions (skewness < 1)
26
III. Common Tests
Caveat
  • Statistical significance ≠ relevance

For large sample sizes, very small differences
may become significant. For small sample sizes,
an observed difference may be relevant, but not
statistically significant.
27
III. Common Tests
Multiple Testing Problems
  • Examples
  • Simultaneous testing of many endpoints (e.g.
    genes in a microarray study)
  • Simultaneous pairwise comparison of many (k)
    groups (k(k-1)/2 pairwise tests)

Although each individual test keeps the
significance level (say α = 5%), the probability
of obtaining (at least one) false positive
increases dramatically with the number of tests.
For 6 independent tests, the probability of at least one
false positive is already about 26% (if there are no true
positives).
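For independent tests with no true positives, the probability of at least one false positive is 1 - (1 - α)^m. A small sketch of how fast it grows:

```python
alpha = 0.05
for m in (1, 2, 6, 10, 20, 100):
    fwer = 1 - (1 - alpha) ** m    # P(at least one false positive), independent tests
    print(m, round(fwer, 3))
# m = 6 already gives ~0.265; m = 20 gives ~0.64
```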
28
III. Common Tests
Multiple Testing Problems
One possible solution: p-value correction for
multiple testing, e.g. the Bonferroni correction.
Each single test is performed at the
level α/m (local significance level α/m), where
m is the number of tests. The probability of
obtaining (at least one) false positive is then
at most α (multiple/global significance level α).
Example: m = 6, desired multiple level α = 5% ->
local level α/m = 5%/6 ~ 0.83%
Other solutions: Bonferroni-Holm; Benjamini-Hochberg,
i.e. control of the false discovery rate (FDR) instead of
control at the group level (family-wise error rate, FWER); SAM
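A sketch of these corrections with statsmodels; the raw p-values are hypothetical. 'bonferroni' and 'holm' control the FWER, 'fdr_bh' is the Benjamini-Hochberg FDR procedure:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.02, 0.04, 0.25, 0.7]   # hypothetical raw p-values, m = 6

for method in ('bonferroni', 'holm', 'fdr_bh'):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, list(reject), p_adj.round(3))
```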
29
IV. Bivariate Analysis
(Relation of two variables)
Example:
How do we quantify a relation between two continuous
variables?
From A. Wakolbinger
30
IV. Bivariate Analysis
  • Pearson correlation coefficient rxy
  • Useful for Gaussian variables X, Y (but not only
    for those)
  • Measures the degree of linear dependence
  • Properties:
  • -1 <= rxy <= 1
  • |rxy| = 1: perfect linear dependence.
    The sign indicates the direction of
    the relation (positive/negative dependence)

From A.Wakolbinger
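The defining formula is not legible in the transcript; for reference, the usual sample Pearson correlation coefficient is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```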
31
IV. Bivariate Analysis
Pearson-Correlation coefficient rxy
  • The closer rxy is to 0, the weaker the (linear)
    dependence

From A.Wakolbinger
32
IV. Bivariate Analysis
Pearson correlation coefficient rxy
  • The closer rxy is to 0, the weaker the (linear)
    dependence (illustrated on further scatterplot
    examples on slides 32-35)

From A.Wakolbinger
36
IV. Bivariate Analysis
Pearson-Correlation coefficient rxy
  • The closer rxy is to 0, the weaker the (linear)
    dependence
  • rxy = ryx (symmetry)

From A.Wakolbinger
37
IV. Bivariate Analysis
Pearson correlation coefficient rxy
Example: relation of body height to body weight and to arm length
rxy = 0.38
rxy = 0.84
The closer the data scatter around the
regression line (see later), the larger |rxy|.
From A.Wakolbinger
38
IV. Bivariate Analysis
Pearson correlation coefficient rxy
How large is r in these examples?
[Three scatterplots with pronounced non-linear structure, each with rxy ~ 0]
The Pearson correlation coefficient cannot
measure non-linear dependence properly.
39
IV. Bivariate Analysis
Spearman correlation sxy
Idea: calculate the Pearson correlation
coefficient on rank-transformed data.
sxy = 0.95, rxy = 0.88
[Scatterplots: Y vs. X on the original scale and Rank(Y) vs. Rank(X) on the rank scale]
Spearman correlation measures the monotonicity of a
dependence.
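A sketch contrasting the two coefficients on a monotone but non-linear relation; the data are simulated, so the exact values vary:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.exp(x) + rng.normal(scale=0.05, size=200)   # monotone, clearly non-linear relation

r, _ = pearsonr(x, y)
s, _ = spearmanr(x, y)    # Pearson correlation of the rank-transformed data
print(round(r, 2), round(s, 2))   # Spearman close to 1, Pearson noticeably smaller
```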
40
IV. Bivariate Analysis
Pearson vs. Spearman correlation
Original scale
41
IV. Bivariate Analysis
Pearson vs. Spearman correlation
Pearson correlation:
NM_001767 NM_000734 NM_001049 NM_006205
NM_001767 1.00000000 0.94918522 -0.04559766 0.04341766
NM_000734 0.94918522 1.00000000 -0.02659545 0.01229839
NM_001049 -0.04559766 -0.02659545 1.00000000 -0.85043885
NM_006205 0.04341766 0.01229839 -0.85043885 1.00000000
42
IV. Bivariate Analysis
Pearson vs. Spearman correlation
Rank transformed data
43
IV. Bivariate Analysis
Pearson vs. Spearman correlation
Spearman correlation
NM_001767 NM_000734 NM_001049 NM_006205
NM_001767 1.00000000 0.9529094 -0.10869080 -0.17821449
NM_000734 0.9529094 1.00000000 -0.11247013 -0.20515650
NM_001049 -0.10869080 -0.11247013 1.00000000 0.03386758
NM_006205 -0.17821449 -0.20515650 0.03386758 1.00000000
44
IV. Bivariate Analysis
Pearson vs. Spearman correlation
Conclusion: Spearman correlation is more robust
against outliers and insensitive to (monotone)
changes of scale. In the case of an (expected) linear
dependence, however, Pearson correlation is more
sensitive.
45
IV. Bivariate Analysis
Pearson vs. Spearman correlation
  • Summary
  • Pearson correlation is a measure of linear
    dependence
  • Spearman correlation is a measure of monotone
    dependence
  • Correlation coefficients alone do not establish
    the (non-)existence of a functional dependence
  • Correlation coefficients tell nothing about
    causal relations between two variables X and Y (on the
    contrary, they are symmetric in X and Y)
  • Correlation coefficients hardly tell anything
    about the shape of a scatterplot

46
IV. Bivariate Analysis
Spurious correlation, confounding
Example: foot size vs. income, r = 0.6
[Scatterplot: income vs. foot size]
Confounder: a variable that explains (part of)
the dependence between two other variables.
47
IV. Bivariate Analysis
Spurious correlation, confounding
Partial correlation: the correlation remaining
after correcting for a third variable (here, gender)
r_(XY.gender): partial correlation
After correction for gender: r = 0.03
[Scatterplot: income vs. foot size, adjusted for gender]
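For the linear/Pearson case, the partial correlation that remains after adjusting for a third variable Z (here gender) can be written as:

```latex
r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}
                      {\sqrt{\bigl(1 - r_{XZ}^2\bigr)\bigl(1 - r_{YZ}^2\bigr)}}
```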
48
V. Regression
(Approximation of one variable by a function of
a second variable)
[Figure: population with an unknown functional dependence between the two variables]
49
V. Regression
Regression: The Method
  • Choose a family of functions that you think is
    capable of capturing the functional dependence of
    the two variables, e.g. the set of linear
    functions, f(x) = ax + b, or the set of quadratic
    functions, f(x) = ax² + bx + c

50
V. Regression
Regression: The Method
  • Choose a loss function: the quality measure for
    the approximation. For continuous data this is
    usually the quadratic loss (RSS, residual sum of
    squares),
    RSS = Σi (yi - f(xi))²
51
V. Regression
Regression: The Method
  • Identify the (a) function from the family of
    candidates that approximates the response
    variable best in terms of the loss function.

[Figure: four candidate fits with RSS = 8.0, RSS = 1.1, RSS = 1.7, RSS = 3]
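A sketch of the whole procedure with NumPy: pick candidate families (linear and quadratic polynomials), fit each by least squares, and compare their residual sums of squares. The data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)   # simulated response

linear    = np.polyfit(x, y, deg=1)   # least-squares fit within each family
quadratic = np.polyfit(x, y, deg=2)

rss_linear    = np.sum((y - np.polyval(linear, x)) ** 2)
rss_quadratic = np.sum((y - np.polyval(quadratic, x)) ** 2)  # never larger than rss_linear
print(rss_linear, rss_quadratic)
```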
52
V. Regression
Univariate linear Regression
Brain weight as a function of body weight
Brain-/Body weights of 62 mammals
[Scatterplot: brain weight vs. body weight, original scale]
53
V. Regression
Univariate linear Regression
Brain weight as a function of body weight
Brain-/Body weights of 62 mammals
[Scatterplot: log10(brain weight) vs. log10(body weight)]
54
V. Regression
Univariate linear Regression
Brain weight as a function of body weight
Brain-/Body weights of 62 mammals
[Scatterplot: log10(brain weight) vs. log10(body weight)]
55
V. Regression
Univariate linear Regression
Brain weight as a function of body weight
[Residual plot: residuals vs. log10(body weight); the point for Chironectes minimus (water opossum) is labelled]
56
V. Regression
Goodness of Fit, Model Selection
  • Questions:
  • Did I choose a proper class of approximation
    functions (a good regression model)?
  • How well does my regression function fit the
    data?

One possible measure: the coefficient of
determination R². It measures the fraction of the
variance that is explained by the regression function:
R² = 1 - unexplained variance / total variance
   = 1 - RSS / TSS (total sum of squares)
For simple linear regression, R² is simply the squared
Pearson correlation of X and Y, R² = rxy²
(and hence 0 <= R² <= 1)
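A sketch computing R² for a simple linear fit and checking that it equals the squared Pearson correlation; the data are simulated:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 4.0 + rng.normal(size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
rss = np.sum((y - (slope * x + intercept)) ** 2)   # unexplained variation
tss = np.sum((y - y.mean()) ** 2)                  # total variation
r_squared = 1.0 - rss / tss

print(r_squared, pearsonr(x, y)[0] ** 2)   # identical for simple linear regression
```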
57
II.10 Quality Criteria
Questions?
58
What is chance?
"Beauty is the absence of chance." (Felix Magath)
"Chance is the pseudonym God chooses when he wants
to remain incognito." (Albert Schweitzer)
"A crate of beer holds 24 bottles, a day has 24
hours. That cannot be a coincidence." (Anonymous)
59
End of Part II