Title: STATISTICS WORKSHOP - 2
1. STATISTICS WORKSHOP - 2
- Contingency tables
- Correlation
- Analysis of variance
2. Why relations between variables are important
- The ultimate goal of every research or scientific analysis is to find relations between variables.
- The philosophy of science teaches us that there is no other way of representing meaning except in terms of relations between some quantities or qualities; either way involves relations between variables.
- The advancement of science must always involve finding new relations between variables.
3. Qualitative Data (Contingency Table)
Example: This is the test to use if we have, say, different classes of patients (e.g., six types of cancer) and, for a set of 1000 markers, the presence/absence of each marker in each patient. This would yield 1000 contingency tables of dimensions 6x2 (each marker by cancer type).
4. Contingency Table
Question: Is there evidence in the data for an association between the categorical variables?
For cross-classified data, the Pearson chi-square test for independence and Fisher's exact test can be used to test the null hypothesis that the row and column classification variables of a two-way contingency table are independent.
5. Chi-Square Test
For a 2x2 table with cell counts a, b in the first row and c, d in the second row:
Chi-square statistic: X^2 = sum over cells of (Observed - Expected)^2 / Expected
Odds Ratio (OR) = (ad) / (bc)
Relative Risk (RR) = [a/(a+b)] / [c/(c+d)] = a(c+d) / [c(a+b)]
(A worked sketch in code follows below.)
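A minimal sketch of these quantities in Python, using scipy for the tests. The 2x2 counts here are hypothetical example values, not data from the workshop.

```python
# Chi-square test, Fisher's exact test, odds ratio and relative risk for a 2x2 table.
# The counts are hypothetical example values, not data from this workshop.
from scipy.stats import chi2_contingency, fisher_exact

a, b = 30, 70    # row 1: e.g. marker present -> outcome yes / no
c, d = 15, 85    # row 2: e.g. marker absent  -> outcome yes / no
table = [[a, b], [c, d]]

# Pearson chi-square test of independence (Yates' continuity correction is applied by default for 2x2)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, dof = {dof}")

# Fisher's exact test (useful when expected counts are small)
_, p_fisher = fisher_exact(table)
print(f"Fisher's exact test: p = {p_fisher:.4f}")

# Odds ratio and relative risk from the formulas on the slide
odds_ratio = (a * d) / (b * c)
relative_risk = (a / (a + b)) / (c / (c + d))    # = a(c+d) / [c(a+b)]
print(f"OR = {odds_ratio:.2f}, RR = {relative_risk:.2f}")
```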
6. Contingency Table
Chi-Square Test Example: 3500 people were observed and classified by whether or not they snore. Is there an association between snoring and gender?
7. Contingency Table
Example: Is there an association between snoring and gender?
8. Contingency Table
Odds ratio = 1.58, 95% CI 1.39 to 1.81
(A sketch of how such a confidence interval is computed follows below.)
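A minimal sketch of computing an odds ratio with an approximate 95% confidence interval, using the usual standard error of the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d). The counts are made up for illustration; the snoring-by-gender table itself is not reproduced in this transcript, so this does not reproduce the 1.58 (1.39 to 1.81) result above.

```python
# Odds ratio with an approximate 95% CI for a 2x2 table (hypothetical counts).
from math import exp, log, sqrt

a, b = 1500, 500    # e.g. men:   snorers, non-snorers (made-up numbers)
c, d = 1100, 400    # e.g. women: snorers, non-snorers (made-up numbers)

or_hat = (a * d) / (b * c)
se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)        # standard error of log(OR)
lower = exp(log(or_hat) - 1.96 * se_log_or)
upper = exp(log(or_hat) + 1.96 * se_log_or)
print(f"OR = {or_hat:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```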
9. Contingency Table
Is there evidence of differences in smoking
pattern between the sexes?
10. Contingency Table
11. Measuring treatment differences with a Y/N response
- For outcomes such as reduction in blood pressure there are obvious summaries of treatment effect, such as the difference between the averages of the two groups.
- For yes/no outcomes like death or cure the choice of summary is not so obvious.

            Dead: Y    Dead: N    % dead
aspirin        804       7783       9.4
placebo       1016       7584      11.8
TOTAL         1820      15367
12. Relative Risk or Risk Ratio
- Relative risk (risk ratio) = the risk of death in the aspirin group divided by the risk in the placebo group.
- Relative Risk = 9.4 / 11.8 = 0.80
- Mortality is reduced by 20%.
- Relative risk estimates are likely to generalise well from one population to another.
13. Absolute Risk Difference
- Absolute risk difference is the proportion of deaths in the aspirin group minus the proportion in the placebo group.
- Risk difference = 9.4% - 11.8% = -2.4%
- "About 2.4 lives saved for each 100 patients treated"
- Risk difference has a more direct clinical interpretation, especially when considering cost-effectiveness.
14. Odds Ratio
- Odds ratio = the odds of death in the aspirin group divided by the odds in the placebo group.
- Odds Ratio = (9.4/90.6) / (11.8/88.2) = 0.77
- "Reduction of 23% in the odds of death"
- The odds ratio has some purely mathematical advantages, but it is not much used in randomised studies.
(The sketch below computes all three summaries directly from the table on slide 11.)
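A short sketch computing the three summaries from the aspirin/placebo counts given on slide 11 (small differences from the slide values are just rounding).

```python
# Relative risk, absolute risk difference and odds ratio from the counts on slide 11
# (aspirin: 804 dead / 7783 alive; placebo: 1016 dead / 7584 alive).
dead_asp, alive_asp = 804, 7783
dead_pla, alive_pla = 1016, 7584

risk_asp = dead_asp / (dead_asp + alive_asp)     # about 0.094
risk_pla = dead_pla / (dead_pla + alive_pla)     # about 0.118

relative_risk = risk_asp / risk_pla              # about 0.79-0.80
risk_difference = risk_asp - risk_pla            # about -0.024, i.e. roughly 2.4 per 100 treated

odds_asp = risk_asp / (1 - risk_asp)
odds_pla = risk_pla / (1 - risk_pla)
odds_ratio = odds_asp / odds_pla                 # about 0.77

print(f"RR = {relative_risk:.2f}")
print(f"Risk difference = {100 * risk_difference:.1f} per 100 patients")
print(f"OR = {odds_ratio:.2f}")
```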
15. Berkson's Fallacy
- It is a treatment-seeking bias, so called because Berkson showed that individuals with more than one disorder are more likely to seek clinical services than those with only one disorder.
- This leads to an erroneously high estimate of the association between these disorders, compared with what would be seen if each single disorder independently led the patient to seek care.
16. Berkson's Fallacy
- 2784 individuals were surveyed to determine whether each subject suffered from disease A, disease B, or both. 257 of the 2784 were hospitalised for their condition.

Hospitalised patients (n = 257)
                  Disease B: Yes   Disease B: No   Total
Disease A: Yes           7               29           36
Disease A: No           13              208          221
Total                   20              237          257

All surveyed individuals (n = 2784)
                  Disease B: Yes   Disease B: No   Total
Disease A: Yes          22              171          193
Disease A: No          202             2389         2591
Total                  224             2560         2784

- Hospitalised sample: P < 0.025, so there appears to be some association between having disease A and having disease B.
- Full sample: P > 0.1, so there is no evidence of an association between having disease A and having disease B.
(A sketch reproducing both tests follows below.)
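A minimal sketch running the chi-square test on both tables above; the p-values should fall on opposite sides of the thresholds quoted on the slide, which is the point of the fallacy.

```python
# Chi-square tests for the two Berkson tables: association appears in the
# hospitalised subsample but not in the full survey.
from scipy.stats import chi2_contingency

hospitalised = [[7, 29], [13, 208]]       # rows: disease A yes/no, columns: disease B yes/no
all_surveyed = [[22, 171], [202, 2389]]

for label, table in [("hospitalised", hospitalised), ("all surveyed", all_surveyed)]:
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{label}: chi-square = {chi2:.2f}, p = {p:.3f}")
```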
17. Gene Association Studies Typically Wrong
- Evolution of the strength of an association as more information is accumulated. The strength of the association is shown as an estimate of the odds ratio (OR) without confidence intervals.
- a: Eight topics in which the results of the first study or studies differed beyond chance (P < 0.05) from the results of the subsequent studies.
- b: Eight topics in which the first study or studies did not claim formal statistical significance for the genetic association, but formal significance was reached by the end of the meta-analysis.
- Each trajectory starts at the OR of the first study or studies. Updated cumulative OR estimates are obtained at the end of each subsequent year, summarizing all information to that time.
(Adapted from J. P. Ioannidis et al., Nature Genetics 29:306-309, 2001)
19. Studies of disease association
- Given the number of potentially identifiable genetic markers and the multitude of clinical outcomes to which they may be linked, the testing and validation of statistical hypotheses in genetic epidemiology is a task of unprecedented scale.
20. Testing for equality of two proportions
- Example: Two groups of genes: (1) genes for transcription and translation, (2) genes in the immune system.
- Question: Do they have similar purine-pyrimidine compositions?
- The question asks whether the percentage of purines (or pyrimidines) in group 1 is the same as the percentage in group 2.
- To form the null and alternative hypotheses, let G1 = the percentage of purines in group 1 and G2 = the percentage of purines in group 2. Then H0: G1 = G2 and H1: G1 ≠ G2 (i.e., G1 > G2 or G2 > G1). A two-proportion test sketch follows below.
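A minimal sketch of the two-sided two-proportion z-test for H0: G1 = G2. The purine counts and totals are hypothetical placeholders, not values from the workshop.

```python
# Two-proportion z-test for equality of purine proportions in two groups of genes.
# The counts below are made-up placeholder values.
from math import sqrt
from scipy.stats import norm

purines_1, bases_1 = 5400, 10000   # group 1: transcription/translation genes (hypothetical)
purines_2, bases_2 = 5150, 10000   # group 2: immune-system genes (hypothetical)

p1 = purines_1 / bases_1
p2 = purines_2 / bases_2
p_pooled = (purines_1 + purines_2) / (bases_1 + bases_2)

se = sqrt(p_pooled * (1 - p_pooled) * (1 / bases_1 + 1 / bases_2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))      # two-sided p-value
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```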
21. Correlation
- Correlation can be used to summarise the amount of linear association between two continuous variables x and y.
- Let (x1, y1), (x2, y2), ..., (xn, yn) denote the data points.
- A scatter plot gives a "cloud" of points.
[Scatter plots illustrating positive correlation, negative correlation and no correlation]
22. Positive and Negative Association
- If the points are nearly in a straight line then knowing the value of one variable helps you to predict the value of the other.
- If there is little or no association, the "cloud" is more spread out and information about one variable doesn't tell you much about the other.
23. A simple correlation formula
- Suppose there are n points altogether and that n(A) is the number in region A, and similarly for n(B), n(C) and n(D).
- Give a value of +1/n to every point in A or C and -1/n to every point in B or D.
- Define cor = [n(A) + n(C) - n(B) - n(D)] / n
- What are the properties of cor? (A small sketch below computes this quantity.)
[Diagram: scatter plot divided into four regions A, B, C and D around the centre of the point cloud]
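A small sketch of the quadrant-count measure, assuming the regions are taken around the means of x and y: points whose deviations have the same sign (regions A and C) contribute +1/n, points with opposite-sign deviations (B and D) contribute -1/n.

```python
# Crude quadrant-count correlation: +1/n for same-sign deviations, -1/n for opposite signs.
# Points lying exactly on a mean line are ignored in this sketch.
def quadrant_cor(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = 0.0
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        if dx * dy > 0:        # region A or C
            total += 1 / n
        elif dx * dy < 0:      # region B or D
            total -= 1 / n
    return total

# Tiny made-up example: a roughly increasing cloud of points gives a positive value.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 5, 4, 7]
print(quadrant_cor(xs, ys))    # always lies between -1 and 1
```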
24. The Pearson product moment correlation coefficient
- The formula for cor works, but it is rather crude. For example, both of the diagrams below would give cor = 1.
[Two scatter-plot diagrams, both giving cor = 1]
- The deviations (xi - x̄) and (yi - ȳ) are positive or negative in the different regions, and so is their product (xi - x̄)(yi - ȳ).
- The sum of these products will not lie between -1 and 1. It depends on:
  - the scale of x and y
  - the number of points
25. Correlation formula
r = Σ (xi - x̄)(yi - ȳ) / sqrt[ Σ (xi - x̄)² × Σ (yi - ȳ)² ]
where x̄ and ȳ are the sample means of x and y. (A sketch of this calculation follows below.)
Partial correlation: the correlation between 2 variables that controls for the effects of one or more other variables.
Rank correlation: see the Spearman correlation coefficient (slide 32).
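A minimal sketch computing r directly from the formula above and checking it against numpy's built-in corrcoef; the data are arbitrary example values.

```python
# Pearson correlation coefficient from the definition, checked against numpy.corrcoef.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 6.3])   # arbitrary example values

dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(f"r (formula)  = {r:.4f}")
print(f"r (corrcoef) = {np.corrcoef(x, y)[0, 1]:.4f}")   # the two values should match
```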
26. Pearson Correlation Coefficient
- A measure of linear association between two variables, denoted as r.
- Values of the correlation coefficient range from -1 to 1.
- The sign of the coefficient indicates the direction of the relationship, and its absolute value indicates the strength, with larger absolute values indicating stronger relationships.
27. Interpretation of correlation
- r measures the extent of linear association between two continuous variables.
- Association does not imply causation: both variables may be affected by a third variable.
- If r = 0, there is no linear association between X and Y.
- r does not indicate the extent of non-linear associations.
- The value of r can be affected by outliers.
- Correlations do not establish causality.
- Example: when a gene is isolated that has some positive correlation with cancer, the claim is often made that it enhances susceptibility to the disease; the correlation alone does not show that it causes the disease.
28. Some misconceptions
- Misconception: when the value of the correlation coefficient is large (small), the relation between the two variates is close to linear; thus, when r = 0.9 or 0.95 the relation is nearly linear.
- Misconception: when the value of the correlation coefficient is zero or near zero, the two variates have no or almost no functional relation.
- Misconception: when the value of the correlation coefficient is positive (negative), the value of Y becomes larger (smaller) as a whole as the value of X becomes large.
- Example: let (X,Y) take the values (1,-1), (2,-2), (3,-3), (4,-4), (5,20), each with probability 1/5. Then Cor(X,Y) = 0.62.
- For the first four points Y decreases as X increases. This example shows that even when the correlation coefficient between X and Y is positive, Y does not always increase as a whole as X increases. (The sketch below verifies the 0.62 value.)
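A one-line check of the example above, computing the correlation for the five given points with numpy.

```python
# Verify Cor(X, Y) for the points (1,-1), (2,-2), (3,-3), (4,-4), (5,20) from the slide.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([-1, -2, -3, -4, 20], dtype=float)

r = np.corrcoef(x, y)[0, 1]
print(f"Cor(X, Y) = {r:.2f}")   # positive (about 0.62) even though Y decreases over the first four points
```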
29. Examples
- Eg1. In Australia, total alcohol consumption and the number of ministers of religion have both increased over time and would be positively correlated, but the increase in one has not caused the increase in the other (both are related to the total population size).
- Eg2. In Japanese schoolchildren, shoe size was reported to be correlated (positively) with scores on a test of mathematical ability.
- Eg3. Extracting informative genes with negative correlation for accurate cancer classification.
30. Effectiveness of the first Cold-War arms agreement
- "Most important, the negative correlation between
the mutation rate and the parental year of birth
among those born between 1950 and 1956 provides
experimental evidence for change in human
germline mutation rate with declining exposure to
ionizing radiation and therefore shows that the
Moscow treaty banning nuclear weapon tests in the
atmosphere (August 1963) has been effective in
reducing genetic risk to the affected
population."
31. Example - Heights and weights of 6 female students
- The table below shows the heights and weights of 6 female students. How closely related are the heights and the weights?
The correlation coefficient = 0.904
32. Spearman Correlation Coefficient
- A commonly used nonparametric measure of correlation between two ordinal variables. For all of the cases, the values of each of the variables are ranked from smallest to largest, and the Pearson correlation coefficient is computed on the ranks. (A short sketch follows below.)
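A minimal sketch of Spearman's coefficient computed exactly as described: rank each variable, then take the Pearson correlation of the ranks, checked against scipy's spearmanr. The two score lists are made-up values, not the 10-student biology data on the next slide.

```python
# Spearman correlation = Pearson correlation of the ranks (ties get averaged ranks).
import numpy as np
from scipy.stats import rankdata, spearmanr

lab_scores     = np.array([72, 85, 60, 91, 78, 66, 88, 74, 69, 95], dtype=float)   # made-up
lecture_scores = np.array([70, 80, 65, 89, 72, 61, 90, 77, 68, 93], dtype=float)   # made-up

lab_ranks = rankdata(lab_scores)          # ranks from smallest to largest
lec_ranks = rankdata(lecture_scores)

r_from_ranks = np.corrcoef(lab_ranks, lec_ranks)[0, 1]
rho, p_value = spearmanr(lab_scores, lecture_scores)

print(f"Pearson on ranks = {r_from_ranks:.4f}")
print(f"scipy spearmanr  = {rho:.4f} (p = {p_value:.4f})")   # the two agree
```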
33. Rank Correlation
- 10 students, arranged in alphabetical order, were ranked according to their achievements in both the laboratory and lecture sections of a biology course. Find the coefficient of rank correlation.
Rank correlation = 0.8545
34. Thoughts
"Patterns often emerge before the reasons for them become apparent." - Vasant Dhar
"If you do not expect, you cannot find the unexpected." - Heraclitus
"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." - R. A. Fisher