An Introduction to Nonparametric Statistics

1
Bridget Neville, MPH
Dana-Farber Cancer Institute, Boston, MA
An Introduction to Nonparametric Statistics
Using SAS
2
What are we going to talk about today?

Working with continuous variables
Descriptive statistics
Testing for normality
Using nonparametric tests
Signed rank (paired data)
Wilcoxon rank sum (2 samples)
Kruskal-Wallis (3 or more samples)
Spearman correlation
Interpretation of test results

3
You already learned about categorical variables...

Now for continuous variables
Numbers typically serve as values
Can take on any range of values
Equal differences between values do have equal
quantitative meaning
Examples
Age
Length, weight
Distance

4
Descriptive statistics: Measures of central
tendency and spread

Central tendency: a value/number that reveals
where the CENTER of the distribution is located
in that sample
mean
median
Spread (or variability): represents extent of
dispersion in a distribution
Standard deviation
Interquartile range (IQR)

5
How to report measures

Always report a measure of central tendency with
a measure of spread
Mean (standard deviation)
Patients in this sample had a mean age of 65
years (standard deviation = 4).
Median (IQR)
Patients in this sample had a median age of 72
years (IQR = 45, 85).
Note that 45 is the 25th percentile and 85 is
the 75th percentile. It is best to represent the
IQR this way instead of as one number that is the
difference between the 25th and 75th
percentiles.

6
Mean vs. Median: When to report which

Mean (standard deviation)
When distribution is normal or approximately
normal/fairly symmetrical
Large sample size
Strongly influenced by outliers
Median (IQR)
When distribution is non-normal (skewed)
Small sample size
Robust to outliers
Sometimes with ordinal data
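
The difference between "strongly influenced by outliers" and "robust to outliers" can be seen with a minimal Python sketch. The ages below are hypothetical illustration values, not data from this presentation, and Python is used only as a neutral calculator alongside the SAS procedures.

```python
import statistics

# Hypothetical ages (illustration only, not data from the presentation)
ages = [61, 63, 64, 65, 66, 67, 69]
ages_with_outlier = ages + [120]   # add one extreme value

mean_before = statistics.mean(ages)
median_before = statistics.median(ages)
mean_after = statistics.mean(ages_with_outlier)
median_after = statistics.median(ages_with_outlier)

# The mean shifts by several years; the median barely moves.
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

With these made-up values a single outlier moves the mean from 65 to about 71.9, while the median only moves from 65 to 65.5.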

7
SAS Procedures for Mean and Median

proc means data=dsname n mean std median p25 p75 min max;
  var varname;
run;
std (or stddev) = standard deviation
p25 = 25th percentile, p75 = 75th percentile
proc univariate data=dsname;
  var varname;
run;

8
What does normal look like?

Normal distribution: points are as likely to
occur on one side of the mean as the other

9
Non-normal distribution

Asymmetrical because natural limit prevents
outcomes on one side
Peak of distribution is off-center and toward
limit and tail stretches away from it
Called right- or left-skewed according to
direction of tail

10
How to assess normality (use a combination of
these methods)

Visual inspection of distributions of data in
each group (skewed = NOT normal)
Histogram, normal probability plot, stem and leaf
plot, box plot
Comparison of mean vs. median (different = NOT
normal)
Test hypothesis that data are from a normal
distribution (p < 0.05 = NOT normal)
Knowledge of variable (ordinal usually NOT
normal)
Small sample usually NOT normal

11
Proc Univariate: Visual Inspection

   Stem Leaf                     #  Boxplot
      2 6                        1     0
      1 46                       2
      0 133344566788            12   ----
     -0 95                       2   -----
     -1 43                       2
     -2 2                        1     0
        ----------------
   Multiply Stem.Leaf by 10**+1

Normal Probability Plot (figure not recoverable from
the transcript; vertical axis runs from -25 to 25,
horizontal axis from -2 to 2)
Plot labels: Normality, Your data
12
Proc Univariate: Mean vs. Median
The SAS System
Testing for significant difference in scores
The UNIVARIATE Procedure
Variable: scordiff (Differences in exam scores)

Moments
N                  20            Sum Weights         20
Mean               2.55          Sum Observations    51
Std Deviation      10.9711344    Variance            120.365789
Skewness           -0.3418653    Kurtosis            0.81164646
Uncorrected SS     2417          Corrected SS        2286.95
Coeff Variation    430.240564    Std Error Mean      2.45322023

Basic Statistical Measures
Location                 Variability
Mean      2.550000       Std Deviation          10.97113
Median    4.000000       Variance               120.36579
Mode      3.000000       Range                  48.00000
                         Interquartile Range    9.50000
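
As a cross-check, the 20 score differences can be read back off the stem-and-leaf plot on the previous slide (Stem.Leaf multiplied by 10) and summarized outside SAS. The following is a minimal Python sketch, not part of the original presentation. The helper `sas_default_percentile` is a hypothetical name, and it assumes the quartiles are computed from the empirical distribution function with averaging, a rule that happens to reproduce the printed interquartile range.

```python
import math
import statistics

# Score differences read off the stem-and-leaf plot (Stem.Leaf x 10)
scordiff = [26, 14, 16,
            1, 3, 3, 3, 4, 4, 5, 6, 6, 7, 8, 8,
            -9, -5, -14, -13, -22]

def sas_default_percentile(values, p):
    """Percentile for 0 < p < 1 using the empirical distribution function
    with averaging (assumed rule; hypothetical helper name)."""
    ordered = sorted(values)
    position = len(ordered) * p
    whole = math.floor(position)
    if position == whole:
        # n*p is an integer: average the two neighbouring order statistics
        return (ordered[whole - 1] + ordered[whole]) / 2
    # otherwise take the next order statistic up
    return ordered[whole]

n = len(scordiff)
mean = statistics.mean(scordiff)
std_dev = statistics.stdev(scordiff)          # n - 1 in the denominator
median = statistics.median(scordiff)
mode = statistics.mode(scordiff)
uncorrected_ss = sum(x * x for x in scordiff)
std_error = std_dev / math.sqrt(n)
coeff_variation = 100 * std_dev / mean
q1 = sas_default_percentile(scordiff, 0.25)
q3 = sas_default_percentile(scordiff, 0.75)

print(f"N={n} mean={mean:.2f} sd={std_dev:.5f} median={median} mode={mode}")
print(f"range={max(scordiff) - min(scordiff)} IQR=({q1}, {q3}) width={q3 - q1}")
```

Under these assumptions the sketch agrees with the Moments table, and the quartiles come out as -2 and 7.5, the IQR reported in the signed rank write-up later in the deck.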
13
Proc Univariate: Test normality hypothesis
Tests for Normality
Test                  --Statistic---     -----p Value------
Shapiro-Wilk          W      0.942009    Pr < W       0.2616
Kolmogorov-Smirnov    D      0.216359    Pr > D       0.0153
Cramer-von Mises      W-Sq   0.142641    Pr > W-Sq    0.0270
Anderson-Darling      A-Sq   0.687072    Pr > A-Sq    0.0646
Ho: Data come from a normal distribution
Ha: Data NOT normal
Note that SAS will not report Shapiro-Wilk for
samples with n > 2000
14
Nonparametric Statistics for Non-normal data

Parameter-free or distribution-free: do not rely
on estimation of parameters such as mean and
standard deviation to describe distribution of
population variables.
Assumption-free: do not rely on assumptions
about variable distributions (like assumption of
normality, approximate normality or symmetry).

15
Summary of analytic strategies
Univariate tests
16
Analytic Strategies (cont.)
Bivariate tests: Tests of equivalence of
distributions
17
Analytic strategies (cont.)
Bivariate tests (continued): Test of homogeneity
of means/medians across groups
18
Analytic strategies (cont.)
Tests of linear association
19
Tests for paired data

You want to compare the mean/median difference
between 2 continuous measurements on the same
person/unit of analysis
Examples
Repeated obs on same person
Before and after (like diet studies)
Body parts (right knee vs. left knee)
2 test scores
Sibling studies

20
Nonparametric: Signed rank test

proc univariate data=sta6207 normal;
  var scordiff;
run;
Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   1.03945      Pr > |t|     0.3116
Sign           M   5            Pr >= |M|    0.0414
Signed Rank    S   33           Pr >= |S|    0.2265

S = test statistic. SAS reports the sum of the
positive ranks minus its expected value under Ho,
n(n+1)/4. Note: For a 1-sided p-value, we would
divide the 2-sided p-value by 2.
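
The location statistics in this output can be retraced by hand from the 20 score differences in the earlier stem-and-leaf plot. The following is a minimal Python sketch under two assumptions that reproduce the printed values: the sign statistic is M = (n+ - n-)/2 with an exact two-sided binomial p-value, and the signed rank statistic is the sum of the positive ranks minus n(n+1)/4, using average ranks for ties. The exact signed rank p-value is not recomputed here.

```python
import math
import statistics
from math import comb

# Score differences read off the stem-and-leaf plot (Stem.Leaf x 10)
scordiff = [26, 14, 16,
            1, 3, 3, 3, 4, 4, 5, 6, 6, 7, 8, 8,
            -9, -5, -14, -13, -22]

# Student's t: mean divided by its standard error
t_stat = statistics.mean(scordiff) / (statistics.stdev(scordiff) / math.sqrt(len(scordiff)))

# Zero differences carry no sign information and are dropped (none occur here)
nonzero = [d for d in scordiff if d != 0]
n = len(nonzero)
n_pos = sum(d > 0 for d in nonzero)
n_neg = n - n_pos

# Sign test: M and the exact two-sided binomial p-value
m_stat = (n_pos - n_neg) / 2
tail = sum(comb(n, k) for k in range(min(n_pos, n_neg) + 1)) / 2 ** n
sign_p = min(1.0, 2 * tail)

# Signed rank: rank the absolute differences, averaging ranks over ties
abs_sorted = sorted(abs(d) for d in nonzero)

def average_rank(value):
    first = abs_sorted.index(value) + 1           # 1-based rank of first tie
    last = first + abs_sorted.count(value) - 1    # 1-based rank of last tie
    return (first + last) / 2

positive_rank_sum = sum(average_rank(abs(d)) for d in nonzero if d > 0)
s_stat = positive_rank_sum - n * (n + 1) / 4      # centered at its Ho expectation

print(f"t = {t_stat:.5f}")
print(f"M = {m_stat}, sign test p = {sign_p:.4f}")
print(f"sum of positive ranks = {positive_rank_sum}, S = {s_stat}")
```

With these data the sum of the positive ranks is 138 and its expected value under Ho is 105, which gives the printed S of 33.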
21
Signed Rank Test: Interpretation

Ho: Median difference = 0
Ha: Median difference ≠ 0
Interpretation/write-up: We performed the signed
rank test to determine if there was a significant
difference in test scores. The median difference
in test scores (test2-test1) was 4 (IQR -2,
7.5), although it was not significant at
alpha = 0.05 (p = 0.2265).

22
Tests for 2 independent samples

You want to compare mean/median of a continuous
variable between 2 independent categorical groups
Examples
Age of men vs. women
Days between diagnosis and treatment for
hematological vs. solid cancers
Income in more vs. less educated individuals

23
2-sample tests for normality

proc sort data=work.tumor;
  by group;
run;
proc univariate data=work.tumor plot normal;
  by group;
  var mass;
run;

Need to test normality in each group!
If at least 1 group is NOT normal, use
nonparametric analysis

24
Wilcoxon rank sum test

proc npar1way data=tumor wilcoxon;
  class group;
  var mass;
run;
Also known as Mann-Whitney U-test

25
Wilcoxon rank sum output

The NPAR1WAY Procedure
Wilcoxon Scores (Rank Sums) for Variable mass
Classified by Variable group

                  Sum of    Expected    Std Dev     Mean
group     N       Scores    Under H0    Under H0    Score
A         5       33.0      25.0        4.065437    6.60
B         4       12.0      20.0        4.065437    3.00
Average scores were used for ties.

Wilcoxon Two-Sample Test
Statistic               12.0000
Normal Approximation
Z                       -1.8448
One-Sided Pr < Z        0.0325
Two-Sided Pr > |Z|      0.0651
t Approximation

26
Wilcoxon rank sum output: Explanations

1. Group: levels of the variable that defines the groups
2. N: number of observations in each group
3. Sum of scores: sum of Wilcoxon scores in
each group (these are ranks)
4. Expected/Std Dev under Ho: Wilcoxon scores
expected under Ho of no difference between
groups
5. Mean score: average score (rank) for each
group
6. Statistic: sum of scores for the group with
the smaller sample size
7. Normal approximation: for larger samples
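
The normal approximation in the output can be retraced from the table entries. The following is a minimal Python sketch, assuming the Z statistic includes a continuity correction of 0.5 toward the expected value (an assumption that reproduces the printed Z); the tie-corrected standard deviation is copied from the output rather than recomputed.

```python
from statistics import NormalDist

# Values copied from the NPAR1WAY output
n_a, n_b = 5, 4
rank_sum_b = 12.0        # "Statistic": rank sum of the smaller group (B)
std_dev_h0 = 4.065437    # tie-corrected Std Dev Under H0, taken as printed

total_n = n_a + n_b
expected_b = n_b * (total_n + 1) / 2     # expected rank sum under Ho

# Assumed continuity correction: move the statistic 0.5 toward its expectation
deviation = rank_sum_b - expected_b
if deviation > 0:
    deviation -= 0.5
elif deviation < 0:
    deviation += 0.5
z = deviation / std_dev_h0

lower_tail = NormalDist().cdf(z)
one_sided_p = min(lower_tail, 1 - lower_tail)
two_sided_p = 2 * one_sided_p

print(f"expected = {expected_b}, Z = {z:.4f}")
print(f"one-sided p = {one_sided_p:.4f}, two-sided p = {two_sided_p:.4f}")
```

The expected rank sum of 20 for group B matches the "Expected Under H0" column, and the resulting Z and p-values agree with the Normal Approximation lines of the output.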

27
Wilcoxon rank sum test: Interpretation

Ho: Distribution of continuous variable is the
same in both groups
Ha: Distribution not the same
Interpretation/write-up: We performed the
Wilcoxon rank sum test to determine if there was
a significant difference in tumor size between
groups A and B. The median tumor sizes were 2.5
(IQR 2.2, 2.7) and 0.5 (IQR 0, 1.65) for groups A
and B, respectively. While group A resulted in
more tumor growth than B, this difference was not
significant at alpha = 0.05 (p = 0.1023).

28
Tests for more than 2 samples

You want to compare mean/median of a continuous
variable between 3 or more independent
categorical groups
Examples
Age of patients in 4 treatment groups: RT, chemo,
surgery or best supportive care
Batting average in 1st base vs. 2nd base vs. 3rd
base players

29
More than 2 sample tests for normality

proc sort data=work.baseball;
  by position;
run;
proc univariate data=work.baseball plot normal;
  by position;
  var bat_avg;
run;
Need to test normality in each group!
If at least 1 group is NOT normal, use
nonparametric analysis

30
Kruskal-Wallis test

proc npar1way data=baseball wilcoxon;
  class position;
  var bat_avg;
run;
OUTPUT
Kruskal-Wallis Test
Chi-Square          10.4955
DF                  2
Pr > Chi-Square     0.0053
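
The p-value in this output is the upper tail of a chi-square distribution with (number of groups - 1) degrees of freedom. For an even number of degrees of freedom that tail has a closed form, so it can be checked without statistical software. The following is a minimal Python sketch; `chi2_upper_tail_even_df` is a hypothetical helper name and it deliberately handles even df only.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable for even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive, even df")
    half = x / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms

chi_square = 10.4955   # Kruskal-Wallis statistic from the output
df = 2                 # 3 positions - 1
p_value = chi2_upper_tail_even_df(chi_square, df)

print(f"Pr > Chi-Square = {p_value:.4f}")
```

With df = 2 the formula reduces to exp(-x/2), which reproduces the printed 0.0053.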

31
Kruskal-Wallis test: Interpretation

Ho: Distribution of continuous variable is the
same in all groups.
Ha: Distribution different in at least 1 group.
Interpretation/write-up: We performed the
Kruskal-Wallis test to determine if there was a
significant difference in batting average between
player positions. The median batting averages
were 286 (IQR 269, 319), 201 (IQR 194, 252), and
303 (IQR 278, 337) for 1st, 2nd and 3rd base,
respectively. Batting average was significantly
different overall at alpha = 0.05 (p = 0.0053).
Pairwise comparisons were then conducted using
Wilcoxon rank-sum tests, which found both 1st and
3rd base players had significantly higher batting
averages than 2nd base players (1st: p = 0.0226;
3rd: p = 0.0120).

32
Correlation tests

You want to see how 2 continuous variables are
related to each other (with respect to strength
and direction)
Examples
Relationship between weight lost and
hours exercising
calorie consumption
Relationship between baby birth weight and
mother's age at birth

33
Correlation tests for normality

proc univariate data=wt_loss plot normal;
  var kg_lost hours calories;
run;
Need to test normality of each variable that
contributes to the correlation test!
If at least 1 variable is NOT normal, use
nonparametric analysis

34
Spearman correlation

proc corr data=wt_loss spearman fisher;
  var kg_lost;
  with hours calories;
run;
Magnitude, direction and significance are
equally and separately important
Magnitude = numerical value
Direction = positive or negative
Significance = p-value

35
Spearman correlation output

Spearman Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

             kg_lost
hours        0.51521
             0.0168
             21
calories     -0.56577
             0.0061
             22

36
Spearman output (cont.): 95% Confidence Interval

Achieved with fisher option on proc statement
Spearman Correlation Statistics (Fisher's z Transformation)

            With                                      p Value for
Variable    Variable    95% Confidence Limits         H0:Rho=0
kg_lost     hours        0.094658     0.769409        0.0156
kg_lost     calories    -0.792255    -0.176304        0.0052
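
The confidence limits in this table can be approximated by hand from the coefficients and sample sizes on the previous slide. The following is a minimal Python sketch with a hypothetical helper, `fisher_ci`. It assumes a standard error of 1/sqrt(n - 3) on the z scale and a bias adjustment of r / (2(n - 1)); these assumptions reproduce the printed limits but are not stated in the presentation.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Confidence limits for a correlation via Fisher's z transformation.

    Assumptions (chosen because they match the printed output):
    standard error 1/sqrt(n - 3) and bias adjustment r / (2 * (n - 1)).
    """
    z = math.atanh(r)                         # Fisher's z transformation
    bias = r / (2 * (n - 1))
    critical = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = critical / math.sqrt(n - 3)
    lower = math.tanh(z - bias - half_width)  # back-transform to the r scale
    upper = math.tanh(z - bias + half_width)
    return lower, upper

hours_ci = fisher_ci(0.51521, 21)
calories_ci = fisher_ci(-0.56577, 22)

print(f"hours:    {hours_ci[0]:.4f} to {hours_ci[1]:.4f}")
print(f"calories: {calories_ci[0]:.4f} to {calories_ci[1]:.4f}")
```

Because the interval is built on the z scale and then back-transformed, it is not symmetric around r, which is why the limits for hours (about 0.09 to 0.77) sit unevenly around 0.52.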

37
Spearman correlation test: Interpretation

Ho: There is no monotonic (rank-based) correlation
between the 2 continuous variables
Ha: There is a monotonic correlation
Interpretation/write-up: We calculated Spearman
correlation coefficients of kg lost with hours
exercised and calories consumed. At alpha = 0.05,
there is significant evidence of a moderate,
positive association between kg lost and
hours (r = 0.52) with a 95% CI of 0.09 to 0.77,
p = 0.0168. There is a significant negative
association between kg lost and calories
(r = -0.57) with a 95% CI of -0.79 to -0.18,
p = 0.0061.

38
To sum it all up

No hard and fast rules for determining parametric
vs. nonparametric
Use combination of evidence from previously
mentioned assessments of normality
Descriptive stats: always report a measure of
central tendency with a measure of spread
Don't do nonparametric "just in case" without
attempting a normality assessment
Try assessing normality and, if you are not able
to come to a conclusion, then nonparametric is a
safe and conservative bet
If you are wrong, you will lose only a small
amount of power
If normality assumptions ARE met, then
nonparametric tests are not as powerful as their
parametric counterparts.
