Data Analysis Using R: 2. Descriptive Statistics


1
Data Analysis Using R: 2. Descriptive Statistics
  • Tuan V. Nguyen
  • Garvan Institute of Medical Research,
  • Sydney, Australia

2
Overview
  • Measurements
  • Population vs sample
  • Summary of data: mean, variance, standard
    deviation, standard error
  • Graphical analyses
  • Transformation

3
Scales of Measurement
  • In general, most observable behaviors can be
    measured on a ratio-scale
  • In general, many unobservable psychological
    qualities (e.g., extraversion) are measured on
    interval scales
  • We will mostly concern ourselves with the simple
    distinction between categorical (nominal) and
    continuous (ordinal, interval, ratio) variables

[Diagram: variables divide into categorical and continuous; continuous variables divide into ordinal, interval, and ratio]
4
Ordinal Measurement
  • Ordinal: designates an ordering (quasi-ranking)
  • Does not assume that the intervals between
    numbers are equal.
  • Example: finishing place in a race (first place,
    second place)

[Figure: finishing places (1st to 4th place) marked along a time axis running from 1 hour to 8 hours]
5
Interval and Ratio Measurement
  • Interval: designates an equal-interval ordering
  • The distance between, for example, a 1 and a 2 is
    the same as the distance between a 4 and a 5
  • Example: common IQ tests are assumed to use an
    interval metric
  • Ratio: designates an equal-interval ordering with
    a true zero point (i.e., the zero implies an
    absence of the thing being measured)
  • Example: number of intimate relationships a
    person has had
  • 0 quite literally means none
  • a person who has had 4 relationships has had
    twice as many as someone who has had 2

6
Statistics: Enquiry to the unknown
Population -> Sample
Parameter -> Estimate
7
Estimate the population mean
  • Population height: mean = 160 cm
  • Standard deviation = 5.0 cm

ht <- rnorm(10, mean=160, sd=5)
mean(ht)
ht <- rnorm(10, mean=160, sd=5)
mean(ht)
ht <- rnorm(100, mean=160, sd=5)
mean(ht)
ht <- rnorm(1000, mean=160, sd=5)
mean(ht)
ht <- rnorm(10000, mean=160, sd=5)
mean(ht)
hist(ht)
The larger the sample, the closer the sample mean
tends to be to the population mean!
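The effect of sample size can be made concrete with a minimal sketch. It assumes the same population as the slide (mean 160, SD 5); the seed, the helper name sim_mean_error, and the number of repetitions are illustrative choices that do not come from the slides.

```r
set.seed(123)  # arbitrary seed, used only for reproducibility

# Hypothetical helper: average absolute error of the sample mean
# over `reps` simulated samples of size n.
sim_mean_error <- function(n, reps = 200, mu = 160, sigma = 5) {
  errs <- replicate(reps, abs(mean(rnorm(n, mean = mu, sd = sigma)) - mu))
  mean(errs)
}

sizes <- c(10, 100, 1000, 10000)
res <- data.frame(
  n              = sizes,
  mean_abs_error = sapply(sizes, sim_mean_error),
  theoretical_se = 5 / sqrt(sizes)  # SD / sqrt(n)
)
print(res)
```

Both error columns should shrink roughly by a factor of sqrt(10) each time n grows tenfold.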
8
Estimate the population proportion
  • Population proportion of males: 0.50
  • Take n samples of k individuals each, and record
    the number of males in each sample
  • rbinom(n, k, prob)

males <- rbinom(10, 10, 0.5)
males
mean(males)
males <- rbinom(20, 100, 0.5)
males
mean(males)
males <- rbinom(1000, 100, 0.5)
males
mean(males)
The larger the sample, the more accurate the
estimate is!
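Note that mean(males) is an average count, not a proportion. A short sketch of converting the counts into proportions follows; the seed and the variable names size and p_hat are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

size  <- 100                       # individuals per sample
males <- rbinom(1000, size, 0.5)   # number of males in each of 1000 samples
p_hat <- males / size              # proportion of males in each sample

mean(p_hat)  # should be close to the population proportion 0.5
sd(p_hat)    # should be close to sqrt(0.5 * 0.5 / size) = 0.05
```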
9
Summary of Continuous Data
  • Measures of central tendency
  • Mean, median, mode
  • Measures of dispersion or variability
  • Variance, standard deviation, standard error
  • Interquartile range

R commands: length(x), mean(x), median(x),
var(x), sd(x), summary(x)
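The commands listed above do not cover the standard error, the interquartile range, or the mode. A minimal sketch of one way to obtain them in base R, using made-up data chosen only for illustration:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up illustrative data

n   <- length(x)
se  <- sd(x) / sqrt(n)  # standard error of the mean
iqr <- IQR(x)           # interquartile range (3rd quartile - 1st quartile)

# Base R has no function for the statistical mode (mode() returns the
# storage type), so one common workaround tabulates the values.
tab    <- table(x)
x_mode <- as.numeric(names(tab)[which.max(tab)])

c(se = se, iqr = iqr, mode = x_mode)
```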
10
R example
  • height <- rnorm(1000, mean=55, sd=8.2)
  • mean(height)
  • [1] 55.30948
  • median(height)
  • [1] 55.018
  • var(height)
  • [1] 68.02786
  • sd(height)
  • [1] 8.2479
  • summary(height)
  •  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
  • 28.34   49.97  55.02 55.31   60.78 85.05

11
Graphical Summary: Box plot
boxplot(height)
[Figure: box plot annotated, from top to bottom, with the 95th percentile, 75th percentile, median (50th percentile), 25th percentile, and 5th percentile]
12
Strip chart
13
Histogram
14
Implications of the mean and SD
  • In the Vietnamese population aged 30 years, the
    average weight was 55.0 kg, with an SD of
    8.2 kg.
  • What does this mean?
  • About 68% of individuals will have weight between
    55 +/- 8.2*1 = 46.8 to 63.2 kg
  • About 95% of individuals will have weight between
    55 +/- 8.2*1.96 = 38.9 to 71.1 kg
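The two ranges can be checked with a few lines of R. This sketch assumes weight is normally distributed, which is what the 68% and 95% figures rely on; the variable names are illustrative.

```r
m <- 55.0  # mean weight (kg), from the slide
s <- 8.2   # standard deviation (kg), from the slide

range68 <- m + c(-1, 1) * s         # mean +/- 1 SD
range95 <- m + c(-1, 1) * 1.96 * s  # mean +/- 1.96 SD
round(range68, 1)
round(range95, 1)

# Share of a normal distribution that falls inside each range
cover68 <- pnorm(1) - pnorm(-1)
cover95 <- pnorm(1.96) - pnorm(-1.96)
round(c(cover68, cover95), 3)
```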

15
Implications of the mean and SD
  • The distribution of weight of the entire
    population can be shown to be

[Figure: distribution curve of weight with the mean +/- 1 SD and mean +/- 1.96 SD ranges marked]
16
Summary of Categorical Data
  • Categorical data
  • Gender: male, female
  • Race: Asian, Caucasian, African
  • Semi-quantitative data
  • Severity of disease: mild, moderate, severe
  • Stages of cancer: I, II, III, IV
  • Preference: dislike very much, dislike,
    equivocal, like, like very much

17
Mean and variance of a proportion
  • For an individual consumer i, the probability
    that he/she prefers A is $p_i$. Assuming that all
    consumers are independent and share a common
    probability, $p_i = p$.
  • Variance of $p_i$: $\mathrm{var}(p_i) = p(1-p)$
  • For a sample of n consumers, of whom k prefer A,
    the estimated probability of preference for A is

$\bar{p} = k/n$

and the variance of p_bar is

$\mathrm{var}(\bar{p}) = p(1-p)/n$
18
Normal approximation of a binomial distribution
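  • For an individual consumer i, the probability
    that he/she prefers A is $p_i$. Assuming that all
    consumers are independent and share a common
    probability, $p_i = p$.
  • Variance of $p_i$: $\mathrm{var}(p_i) = p(1-p)$
  • For a sample of n consumers, of whom k prefer A,
    the estimated probability of preference for A is

$\bar{p} = k/n$

and the variance of p_bar is

$\mathrm{var}(\bar{p}) = p(1-p)/n$

and standard deviation

$s = \sqrt{p(1-p)/n}$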
19
Normal approximation of a binomial distribution -
example
  • 10 consumers, 8 preferred product A.
  • Proportion of preference for A: p = 0.8
  • Variance: var(p) = 0.8(0.2)/10 = 0.016
  • Standard deviation of p: s = 0.126
  • 95% CI of p: 0.8 ± 1.96(0.126) = 0.55 to 1.00
    (upper limit capped at 1.00)
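The arithmetic in this example can be reproduced with a short sketch. The variable names are illustrative, and capping the interval to the range 0 to 1 is an assumption about how the slide arrived at an upper limit of 1.00.

```r
k <- 8   # consumers who preferred product A
n <- 10  # consumers in the sample

p_hat <- k / n
var_p <- p_hat * (1 - p_hat) / n  # variance of the estimated proportion
se_p  <- sqrt(var_p)              # its standard deviation

ci        <- p_hat + c(-1, 1) * 1.96 * se_p  # normal-approximation 95% CI
ci_capped <- pmin(pmax(ci, 0), 1)            # a proportion cannot exceed 1

round(c(var = var_p, sd = se_p), 3)
round(ci_capped, 2)
```

With only 10 consumers the normal approximation is rough, which is why the uncapped upper limit overshoots 1.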

20
Descriptive Analyses: Continuous data
21
Paired t-test
  • Continuous data
  • Normally distributed
  • Two samples are NOT independent

22
Paired t-test: an example
  • The problem: viewing certain meats under red
    light might enhance judges' preferences for meat.
    12 judges were asked to score the redness of meat
    under red light and white light

Results
Judge  Red  White
1      20   22
2      18   19
3      19   17
4      22   18
5      17   21
6      20   23
7      19   19
8      16   20
9      21   22
10     17   20
11     23   27
12     18   24
23
Paired t-test analysis
Judge  Red light  White light  Difference (White - Red)
1      20         22            2
2      18         19            1
3      19         17           -2
4      22         18           -4
5      17         21            4
6      20         23            3
7      19         19            0
8      16         20            4
9      21         22            1
10     17         20            3
11     23         27            4
12     18         24            6
Mean   19.2       21.0          1.83
SD     2.1        2.8           2.82
Mean difference = 1.83, SD = 2.82
Standard error (SE) = SD/sqrt(n) = 2.82/sqrt(12) = 0.81
T-test = (1.83 - 0)/0.81 = 2.25
P-value = 0.0459
Conclusion: there was a significant effect
of light colour.
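The hand calculation can be recomputed directly from the 12 pairs of scores. This is a minimal sketch in which the variable names (d, se_d, t_stat) are illustrative.

```r
red   <- c(20, 18, 19, 22, 17, 20, 19, 16, 21, 17, 23, 18)
white <- c(22, 19, 17, 18, 21, 23, 19, 20, 22, 20, 27, 24)

d      <- white - red   # within-judge differences
n      <- length(d)     # 12 judges
mean_d <- mean(d)
sd_d   <- sd(d)
se_d   <- sd_d / sqrt(n)  # standard error of the mean difference
t_stat <- mean_d / se_d   # test of H0: mean difference = 0

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(c(mean = mean_d, sd = sd_d, se = se_d, t = t_stat, p = p_value), 4)
```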
24
Paired t-test: R analysis
  • red <- c(20,18,19,22,17,20,19,16,21,17,23,18)
  • white <- c(22,19,17,18,21,23,19,20,22,20,27,24)
  • t.test(red, white, paired=TRUE)

data: red and white
t = -2.2496, df = 11, p-value = 0.04592
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.6270234 -0.0396433
sample estimates:
mean of the differences
-1.833333
25
Two-sample t-test
Sample       Group 1  Group 2
1            x1       y1
2            x2       y2
3            x3       y3
4            x4       y4
5            x5       y5
n            xn       yn
Sample size  n1       n2
Mean         x_bar    y_bar
SD           sx       sy
Mean difference: D = x_bar - y_bar
Variance of D
T-statistic
95% confidence interval
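The formulas for the variance of D, the t-statistic, and the confidence interval are not present in this transcript. One standard set, given here as an assumption rather than as the author's own, is the unequal-variance (Welch) form that R's t.test() uses by default:

```latex
\begin{aligned}
D &= \bar{x} - \bar{y}, \qquad
\operatorname{Var}(D) = \frac{s_x^2}{n_1} + \frac{s_y^2}{n_2}, \\
t &= \frac{D}{\sqrt{\operatorname{Var}(D)}}, \qquad
\nu = \frac{\left( s_x^2/n_1 + s_y^2/n_2 \right)^2}
           {\dfrac{(s_x^2/n_1)^2}{n_1 - 1} + \dfrac{(s_y^2/n_2)^2}{n_2 - 1}}, \\
95\%\ \text{CI} &: \; D \pm t_{0.975,\,\nu} \sqrt{\operatorname{Var}(D)}
\end{aligned}
```

Here $\nu$ is the Welch-Satterthwaite approximation to the degrees of freedom. A pooled-variance version also exists and may be what the original slide showed.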
26
Two-group comparison: an example
20 consumers rated their preference for two rice
desserts (A and B)
  • ID A B
  • 1 3 3
  • 2 7 1
  • 3 1 2
  • 4 9 4
  • 5 3 5
  • 6 4 2
  • 7 1 2
  • 8 2 5
  • 9 6 3
  • 10 7 2

  • ID A B
  • 11 5 3
  • 12 8 4
  • 13 5 2
  • 14 9 3
  • 15 4 5
  • 16 6 4
  • 17 4 3
  • 18 3 1
  • 19 9 3
  • 20 5 2
27
Unpaired t-test using R
  • a <- c(3,7,1,9,3,4,1,2,6,7,5,8,5,9,4,6,4,3,9,5)
  • b <- c(3,1,2,4,5,2,2,5,3,2,3,4,2,3,5,4,3,1,3,2)
  • t.test(a, b)

Welch Two Sample t-test
data: a and b
t = 3.3215, df = 27.478, p-value = 0.002539
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8037895 3.3962105
sample estimates:
mean of x mean of y
5.05 2.95
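The Welch output can be reproduced step by step from the two sets of ratings. This is a sketch for illustration; the intermediate names (va, vb, se, df) are not from the slides.

```r
a <- c(3, 7, 1, 9, 3, 4, 1, 2, 6, 7, 5, 8, 5, 9, 4, 6, 4, 3, 9, 5)
b <- c(3, 1, 2, 4, 5, 2, 2, 5, 3, 2, 3, 4, 2, 3, 5, 4, 3, 1, 3, 2)

D  <- mean(a) - mean(b)      # difference in means
va <- var(a) / length(a)     # squared standard error of mean(a)
vb <- var(b) / length(b)     # squared standard error of mean(b)
se <- sqrt(va + vb)          # standard error of D (unequal variances)
t_stat <- D / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

p_value <- 2 * pt(-abs(t_stat), df)
ci <- D + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value, lower = ci[1], upper = ci[2]), 4)
```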
28
Transformation of data: multiplicative effects
  • The following data represent lysozyme levels in
    the gastric juice of 29 patients with peptic
    ulcer and of 30 normal controls. The question of
    interest was whether lysozyme levels differed
    between the two groups.
  • Group 1
  • 0.2 0.3 0.4 1.1 2.0 2.1 3.3 3.8
    4.5 4.8 4.9 5.0 5.3 7.5 9.8 10.4 10.9
    11.3 12.4 16.2 17.6 18.9 20.7 24.0 25.4
    40.0 42.2 50.0 60.0
  • Group 2
  • 0.2 0.3 0.4 0.7 1.2 1.5 1.5 1.9 2.0 2.4
    2.5 2.8 3.6 4.8 4.8 5.4 5.7 5.8 7.5 8.7
    8.8 9.1 10.3 15.6 16.1 16.5 16.7 20.0 20.7
    33.0

29
Unpaired t-test by R
  • g1 <- c(0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8,
    4.5, 4.8, 4.9, 5.0, 5.3, 7.5, 9.8, 10.4,
    10.9, 11.3, 12.4, 16.2, 17.6, 18.9, 20.7,
    24.0, 25.4, 40.0, 42.2, 50.0, 60.0)
  • g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9,
    2.0, 2.4, 2.5, 2.8, 3.6, 4.8, 4.8, 5.4, 5.7,
    5.8, 7.5, 8.7, 8.8, 9.1, 10.3, 15.6, 16.1,
    16.5, 16.7, 20.0, 20.7, 33.0)
  • t.test(g1, g2)

data: g1 and g2
t = 2.0357, df = 40.804, p-value = 0.04831
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.05163216 13.20239083
sample estimates:
mean of x mean of y
14.310345 7.683333
30
Exploration of data
  • par(mfrow=c(1,2))
  • hist(g1)
  • hist(g2)

Group 1: mean(g1) = 14.3, sd(g1) = 15.7
Group 2: mean(g2) = 7.7, sd(g2) = 7.8
31
Re-analysis of lysozyme data
  • log.g1 <- log(g1)
  • log.g2 <- log(g2)
  • t.test(log.g1, log.g2)

data: log.g1 and log.g2
t = 1.406, df = 55.714, p-value = 0.1653
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2182472 1.2453165
sample estimates:
mean of x mean of y
1.921094 1.407559
exp(1.921 - 1.407) = 1.67
Group 1's mean is 67% higher than group 2's (a ratio
of geometric means, because the comparison is made
on the log scale)
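Back-transforming the log-scale results gives a ratio of geometric means together with its confidence interval. This is a minimal sketch using the lysozyme data from the earlier slide; the names fit, ratio, and ratio_ci are illustrative.

```r
g1 <- c(0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8, 4.5, 4.8, 4.9, 5.0,
        5.3, 7.5, 9.8, 10.4, 10.9, 11.3, 12.4, 16.2, 17.6, 18.9,
        20.7, 24.0, 25.4, 40.0, 42.2, 50.0, 60.0)
g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9, 2.0, 2.4, 2.5, 2.8,
        3.6, 4.8, 4.8, 5.4, 5.7, 5.8, 7.5, 8.7, 8.8, 9.1, 10.3,
        15.6, 16.1, 16.5, 16.7, 20.0, 20.7, 33.0)

fit <- t.test(log(g1), log(g2))  # Welch t-test on the log scale

# exp(difference of log means) = ratio of geometric means
ratio    <- exp(unname(fit$estimate[1] - fit$estimate[2]))
ratio_ci <- as.numeric(exp(fit$conf.int))  # 95% CI for the ratio

round(c(ratio = ratio, lower = ratio_ci[1], upper = ratio_ci[2]), 2)
```

An interval for the ratio that includes 1 corresponds to a log-scale interval that includes 0, which matches the non-significant p-value on the log scale.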
32
Descriptive analysis: Categorical data
33
Comparison of two proportions - theory
Group                 1   2
Sample size           n1  n2
Number of events      e1  e2
Proportion of events  p1  p2

Difference: $D = p_1 - p_2$
SE of difference: $SE = \left[ p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2 \right]^{1/2}$
$Z = D / SE$
95% CI: $D \pm 1.96(SE)$
With $(n_1 + n_2) > 20$, and if $Z > 2$, it is possible to reject the null hypothesis.
34
Comparison of two proportions - example
Thirty-day mortality rate (%) of rats exposed to
heroin or cocaine (100 rats per group).

Group             Heroin  Cocaine
Sample size       100     100
Number of deaths  90      36
Mortality rate    0.90    0.36

Analysis
Difference: D = 0.90 - 0.36 = 0.54
SE(D) = [0.9(0.1)/100 + 0.36(0.64)/100]^(1/2) = 0.057
Z = 0.54 / 0.057 = 9.54
95% CI: 0.54 ± 1.96(0.057) = 0.43 to 0.65
Conclusion: reject the null hypothesis.
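The calculation for the rat mortality example can be written out in R as follows. This is a sketch of the normal-approximation formulas from the theory slide; the variable names are illustrative.

```r
e <- c(90, 36)    # deaths in each group
n <- c(100, 100)  # rats in each group

p  <- e / n                          # mortality proportions
D  <- p[1] - p[2]                    # difference in proportions
se <- sqrt(sum(p * (1 - p) / n))     # SE of the difference
z  <- D / se
ci <- D + c(-1, 1) * 1.96 * se       # 95% CI for the difference

round(c(D = D, SE = se, Z = z, lower = ci[1], upper = ci[2]), 3)
```

Z comes out at 9.54 only when the unrounded SE (about 0.0566) is used; dividing by the rounded 0.057 gives a slightly smaller value.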
35
Comparison of two proportions - R
  • deaths <- c(90, 36)
  • total <- c(100, 100)
  • prop.test(deaths, total)

2-sample test for equality of proportions with continuity correction
data: deaths out of total
X-squared = 60.2531, df = 1, p-value = 8.341e-15
alternative hypothesis: two.sided
95 percent confidence interval:
0.4190584 0.6609416
sample estimates:
prop 1 prop 2
0.90 0.36
36
Comparison of >2 proportions: Chi-square
analysis
  • table(sex, ethnicity)
  •         ethnicity
  • sex     African  Asian  Caucasian  Others
  • Female  4        43     22         0
  • Male    4        17     8          2

females <- c(4, 43, 22, 0)
total <- c(8, 60, 30, 2)
prop.test(females, total)
37
Comparison of >2 proportions: Chi-square
analysis
  • 4-sample test for equality of proportions
    without continuity correction
  • data: females out of total
  • X-squared = 6.2646, df = 3, p-value = 0.09942
  • alternative hypothesis: two.sided
  • sample estimates:
  • prop 1 prop 2 prop 3 prop 4
  • 0.5000000 0.7166667 0.7333333 0.0000000
  • Warning message:
  • Chi-squared approximation may be incorrect in:
    prop.test(females, total)
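The warning arises because some expected cell counts are small. The sketch below rebuilds the 2 x 4 table, shows the expected counts, and runs Fisher's exact test. Using fisher.test() here is an assumption about a reasonable alternative, not something the slides prescribe.

```r
counts <- rbind(
  Female = c(African = 4, Asian = 43, Caucasian = 22, Others = 0),
  Male   = c(African = 4, Asian = 17, Caucasian = 8,  Others = 2)
)

# Pearson chi-square on the 2 x 4 table gives the same statistic as
# prop.test(females, total); the warning is suppressed so it can be
# examined through the expected counts instead.
chi <- suppressWarnings(chisq.test(counts))
chi$statistic
round(chi$expected, 2)  # several cells fall below 5

# Exact test that does not rely on the chi-square approximation
fisher <- fisher.test(counts)
fisher$p.value
```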

38
Summary
  • Examine the distribution of data
  • Mean and variance: systematic difference?
  • Normally distributed?
  • Transformation?
  • Present confidence intervals (and p-values)