Title: An Introduction to Nonparametric Statistics
1 An Introduction to Nonparametric Statistics Using SAS
Bridget Neville, MPH
Dana-Farber Cancer Institute, Boston, MA
2 What are we going to talk about today?
- Working with continuous variables
  - Descriptive statistics
  - Testing for normality
- Using nonparametric tests
  - Signed rank (paired data)
  - Wilcoxon rank sum (2 samples)
  - Kruskal-Wallis (3 or more samples)
  - Spearman correlation
- Interpretation of test results
3 You already learned about categorical variables...
- Now for continuous variables
  - Numbers typically serve as values
  - Can take on any value within a range
  - Equal differences between values have equal quantitative meaning
- Examples
  - Age
  - Length, weight
  - Distance
4 Descriptive statistics: Measures of central tendency and spread
- Central tendency: a value that reveals where the CENTER of the distribution is located in that sample
  - Mean
  - Median
- Spread (or variability): represents the extent of dispersion in a distribution
  - Standard deviation
  - Interquartile range (IQR)
5 How to report measures
- Always report a measure of central tendency with a measure of spread
- Mean (standard deviation)
  - Patients in this sample had a mean age of 65 years (standard deviation = 4).
- Median (IQR)
  - Patients in this sample had a median age of 72 years (IQR: 45, 85).
  - Note that 45 is the 25th percentile and 85 is the 75th percentile. It is best to report the IQR this way rather than as a single number (the difference between the 25th and 75th percentiles).
6 Mean vs. Median: When to report which
- Mean (standard deviation)
  - When the distribution is normal or approximately normal/fairly symmetrical
  - Large sample size
  - Strongly influenced by outliers
- Median (IQR)
  - When the distribution is non-normal (skewed)
  - Small sample size
  - Robust to outliers
  - Sometimes with ordinal data
7 SAS Procedures for Mean and Median
    proc means data=dsname n mean std median p25 p75 min max;
      var varname;
    run;
- std (or stddev) = standard deviation
- p25 = 25th percentile; p75 = 75th percentile
    proc univariate data=dsname;
      var varname;
    run;
8 What does normal look like?
- Normal distribution: points are as likely to occur on one side of the mean as the other
9 Non-normal distribution
- Asymmetrical because a natural limit prevents outcomes on one side
- Peak of the distribution is off-center and toward the limit, and the tail stretches away from it
- Called right- or left-skewed according to the direction of the tail
10 How to assess normality (use a combination of these methods)
- Visual inspection of the distribution of data in each group (skewed = NOT normal)
  - Histogram, normal probability plot, stem-and-leaf plot, box plot
- Comparison of mean vs. median (different = NOT normal)
- Test the hypothesis that the data are from a normal distribution (p < 0.05 = NOT normal)
- Knowledge of the variable (ordinal: usually NOT normal)
- Small sample: usually NOT normal
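The visual checks and the hypothesis tests listed above can be requested together in a single PROC UNIVARIATE step. The following is a minimal sketch rather than a prescribed workflow; it uses sashelp.class (a sample data set that ships with SAS) and its height variable purely as stand-ins for your own data.

```sas
/* sashelp.class and height are stand-ins; substitute your own data set and variable. */
proc univariate data=sashelp.class normal plot;
  var height;
  /* Histogram with a fitted normal curve overlaid */
  histogram height / normal;
  /* Q-Q plot against a normal distribution with estimated mean and SD */
  qqplot height / normal(mu=est sigma=est);
run;
```

The normal option adds the tests for normality, and the plot option adds the stem-and-leaf plot, box plot and normal probability plot to the output.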
11 Proc Univariate: Visual Inspection
    Stem Leaf             #  Boxplot
       2 6                1     0
       1 46               2
       0 133344566788    12   ----
      -0 95               2   -----
      -1 43               2
      -2 2                1     0
         ----------------
    Multiply Stem.Leaf by 10**+1
[Normal probability plot: vertical axis from -25 to 25, horizontal axis from -2 to +2; plot labels: "Normality" and "Your data"]
12 Proc Univariate: Mean vs. Median
    The SAS System
    Testing for significant difference in scores

    The UNIVARIATE Procedure
    Variable: scordiff (Differences in exam scores)

    Moments
    N                         20    Sum Weights               20
    Mean                    2.55    Sum Observations          51
    Std Deviation     10.9711344    Variance          120.365789
    Skewness          -0.3418653    Kurtosis          0.81164646
    Uncorrected SS          2417    Corrected SS         2286.95
    Coeff Variation   430.240564    Std Error Mean    2.45322023

    Basic Statistical Measures
    Location                Variability
    Mean      2.550000      Std Deviation          10.97113
    Median    4.000000      Variance              120.36579
    Mode      3.000000      Range                  48.00000
                            Interquartile Range     9.50000
13 Proc Univariate: Test normality hypothesis
    Tests for Normality
    Test                  Statistic           p Value
    Shapiro-Wilk          W      0.942009     Pr < W      0.2616
    Kolmogorov-Smirnov    D      0.216359     Pr > D      0.0153
    Cramer-von Mises      W-Sq   0.142641     Pr > W-Sq   0.0270
    Anderson-Darling      A-Sq   0.687072     Pr > A-Sq   0.0646
- Ho: Data come from a normal distribution
- Ha: Data are NOT normal
- Note that SAS will not report Shapiro-Wilk for samples with n > 2000
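Because the four tests can disagree, as they do in the output above, it can help to capture the table in a data set and flag each test against alpha. This is only a sketch: sashelp.class and height are stand-ins for your own data, the output data set names are hypothetical, and the column names (Test, Stat, pValue) are those I recall for the TestsForNormality ODS table, so it is worth confirming them with proc contents.

```sas
/* sashelp.class ships with SAS; height stands in for your own variable. */
ods output TestsForNormality=work.normtests;
proc univariate data=sashelp.class normal;
  var height;
run;

/* Flag each test at alpha = 0.05 (column names assumed; verify with proc contents). */
data work.normflags;
  set work.normtests;
  length decision $ 24;
  if pValue < 0.05 then decision = 'reject normality';
  else decision = 'do not reject normality';
run;

proc print data=work.normflags noobs;
  var Test Stat pValue decision;
run;
```

A mixed set of decisions is a cue to lean on the other assessments (plots, mean vs. median, sample size) rather than on any single p-value.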
14 Nonparametric Statistics for Non-normal data
- Parameter-free or distribution-free: do not rely on estimation of parameters such as the mean and standard deviation to describe the distribution of population variables.
- Assumption-free: do not rely on assumptions about variable distributions (like the assumption of normality, approximate normality or symmetry).
15 Summary of analytic strategies
Univariate tests
16 Analytic strategies (cont.)
Bivariate tests: Tests of equivalence of distributions
17 Analytic strategies (cont.)
Bivariate tests (continued): Tests of homogeneity of means/medians across groups
18 Analytic strategies (cont.)
Tests of linear association
19 Tests for paired data
- You want to compare the mean/median difference between 2 continuous measurements on the same person/unit of analysis
- Examples
  - Repeated obs on same person
  - Before and after (like diet studies)
  - Body parts (right knee vs. left knee)
  - 2 test scores
  - Sibling studies
20 Nonparametric: Signed rank test
    proc univariate data=sta6207 normal;
      var scordiff;
    run;
    Tests for Location: Mu0=0
    Test           Statistic        p Value
    Student's t    t   1.03945      Pr > |t|    0.3116
    Sign           M   5            Pr > |M|    0.0414
    Signed Rank    S   33           Pr > |S|    0.2265
- S = test statistic, based on the sum of the positive ranks (SAS reports that sum minus its expected value under Ho, n(n+1)/4, where n is the number of nonzero differences)
- Note: For a 1-sided p-value, we would divide the 2-sided p-value by 2.
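The procedure above analyses a single difference variable, so the paired measurements have to be subtracted first. A minimal sketch of that step follows; the data set name work.exams, the variables test1 and test2, and the scores themselves are made up for illustration and are not the data behind the output shown.

```sas
/* Made-up paired scores for illustration only (not the slide's data). */
data work.exams;
  input id test1 test2;
  scordiff = test2 - test1;   /* difference analysed by the signed rank test */
  label scordiff = 'Differences in exam scores';
  datalines;
1 70 74
2 65 63
3 80 88
4 72 75
5 90 85
6 68 77
;
run;

/* Show only the Tests for Location table (t, sign and signed rank tests) */
ods select TestsForLocation;
proc univariate data=work.exams mu0=0;
  var scordiff;
run;
```

Halving the 2-sided p-value for a 1-sided test is only appropriate when the observed difference lies in the hypothesized direction.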
21 Signed Rank Test: Interpretation
- Ho: Median difference = 0
- Ha: Median difference ≠ 0
- Interpretation/write-up: We performed the signed rank test to determine if there was a significant difference in test scores. The median difference in test scores (test2 - test1) was 4 (IQR: -2, 7.5), although it was not significant at alpha = 0.05 (p = 0.2265).
22 Tests for 2 independent samples
- You want to compare the mean/median of a continuous variable between 2 independent categorical groups
- Examples
  - Age of men vs. women
  - Days between diagnosis and treatment for hematological vs. solid cancers
  - Income in more vs. less educated individuals
23 2-sample tests for normality
    proc sort data=work.tumor;
      by group;
    run;
    proc univariate data=work.tumor plot normal;
      by group;
      var mass;
    run;
- Need to test normality in each group!
- If at least 1 group is NOT normal: use nonparametric analysis
24 Wilcoxon rank sum test
    proc npar1way data=tumor wilcoxon;
      class group;
      var mass;
    run;
- Also known as the Mann-Whitney U-test
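For small samples such as the tumor example, an exact p-value can be requested alongside the large-sample approximations. The sketch below adds an EXACT statement to the same kind of step; it uses sashelp.class (shipped with SAS) with sex as the grouping variable and height as the outcome, which are stand-ins and not the tumor data.

```sas
/* sashelp.class, sex and height are stand-ins for tumor, group and mass. */
proc npar1way data=sashelp.class wilcoxon;
  class sex;
  var height;
  exact wilcoxon;   /* request exact p-values in addition to the approximations */
run;
```

Exact computation can become slow as sample sizes grow, so this is mainly useful when groups are small.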
25 Wilcoxon rank sum output
    The NPAR1WAY Procedure
    Wilcoxon Scores (Rank Sums) for Variable mass
    Classified by Variable group

                    Sum of     Expected     Std Dev      Mean
    group     N     Scores     Under H0     Under H0     Score
    A         5       33.0         25.0     4.065437      6.60
    B         4       12.0         20.0     4.065437      3.00
    Average scores were used for ties.

    Wilcoxon Two-Sample Test
    Statistic               12.0000
    Normal Approximation
    Z                       -1.8448
    One-Sided Pr < Z         0.0325
    Two-Sided Pr > |Z|       0.0651
    t Approximation
26 Wilcoxon rank sum output: Explanations
- 1. Group: levels of the variable that defines the groups
- 2. N: lists the number of obs in each group
- 3. Sum of scores: sum of Wilcoxon scores in each group (these are ranks)
- 4. Expected/Std Dev under Ho: lists Wilcoxon scores expected under Ho of no difference between groups
- 5. Mean score: average score (rank) for each group
- 6. Statistic: sum of scores for the group with the smaller sample size
- 7. Normal approximation: for larger samples
27 Wilcoxon rank sum test: Interpretation
- Ho: Distribution of the continuous variable is the same in both groups
- Ha: Distribution is not the same
- Interpretation/write-up: We performed the Wilcoxon rank sum test to determine if there was a significant difference in tumor size between groups A and B. The median tumor sizes were 2.5 (IQR: 2.2, 2.7) and 0.5 (IQR: 0, 1.65) for groups A and B, respectively. While group A resulted in more tumor growth than B, this difference was not significant at alpha = 0.05 (p = 0.1023).
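The group medians and IQRs quoted in a write-up like this one can be produced with a CLASS statement in PROC MEANS. A small sketch, again using sashelp.class with sex and height as stand-ins for the tumor data set, group and mass:

```sas
/* Median and 25th/75th percentiles per group; names are stand-ins. */
proc means data=sashelp.class n median p25 p75 maxdec=1;
  class sex;
  var height;
run;
```

Reporting p25 and p75 directly gives the two IQR endpoints in the form recommended earlier, rather than their difference.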
28 Tests for more than 2 samples
- You want to compare the mean/median of a continuous variable between 3 or more independent categorical groups
- Examples
  - Age of patients in 4 treatment groups: RT, chemo, surgery or best supportive care
  - Batting average in 1st base vs. 2nd base vs. 3rd base players
29 More than 2 sample tests for normality
    proc sort data=work.baseball;
      by position;
    run;
    proc univariate data=work.baseball plot normal;
      by position;
      var bat_avg;
    run;
- Need to test normality in each group!
- If at least 1 group is NOT normal: use nonparametric analysis
30 Kruskal-Wallis test
    proc npar1way data=baseball wilcoxon;
      class position;
      var bat_avg;
    run;
- OUTPUT
    Kruskal-Wallis Test
    Chi-Square           10.4955
    DF                         2
    Pr > Chi-Square       0.0053
31 Kruskal-Wallis test: Interpretation
- Ho: Distribution of the continuous variable is the same in all groups.
- Ha: Distribution is different in at least 1 group.
- Interpretation/write-up: We performed the Kruskal-Wallis test to determine if there was a significant difference in batting average between player positions. The median batting averages were 286 (IQR: 269, 319), 201 (IQR: 194, 252), and 303 (IQR: 278, 337) for 1st, 2nd and 3rd base, respectively. Batting average was significantly different overall at alpha = 0.05 (p = 0.0053). Pairwise comparisons were then conducted using Wilcoxon rank-sum tests, which found that both 1st and 3rd base players had significantly higher batting averages than 2nd base players (1st: p = 0.0226; 3rd: p = 0.0120).
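One way to run the pairwise follow-up described in the write-up is to repeat PROC NPAR1WAY with a WHERE statement that keeps two groups at a time. The sketch below uses sashelp.cars (shipped with SAS), with origin (Asia, Europe, USA) and mpg_city standing in for position and bat_avg; it illustrates the pattern and is not the analysis behind the p-values above.

```sas
/* Overall Kruskal-Wallis test across the three groups (stand-in data). */
proc npar1way data=sashelp.cars wilcoxon;
  class origin;
  var mpg_city;
run;

/* Pairwise Wilcoxon rank sum tests: restrict to two groups at a time. */
proc npar1way data=sashelp.cars wilcoxon;
  where origin in ('Asia', 'Europe');
  class origin;
  var mpg_city;
run;

proc npar1way data=sashelp.cars wilcoxon;
  where origin in ('Asia', 'USA');
  class origin;
  var mpg_city;
run;

proc npar1way data=sashelp.cars wilcoxon;
  where origin in ('Europe', 'USA');
  class origin;
  var mpg_city;
run;
```

When several pairwise tests are run, an adjustment for multiple comparisons (for example, comparing each p-value with 0.05/3) is commonly considered; the write-up does not say whether one was applied.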
32 Correlation tests
- You want to see how 2 continuous variables are related to each other (with respect to strength and direction)
- Examples
  - Relationship between weight lost and
    - hours exercising
    - calorie consumption
  - Relationship between baby birth weight and mother's age at birth
33 Correlation tests for normality
    proc univariate data=wt_loss plot normal;
      var kg_lost hours calories;
    run;
- Need to test normality of each variable that contributes to the correlation test!
- If at least 1 variable is NOT normal: use nonparametric analysis
34 Spearman correlation
    proc corr data=wt_loss spearman fisher;
      var kg_lost;
      with hours calories;
    run;
- Magnitude, direction and significance are equally and separately important
  - Magnitude = numerical value
  - Direction = positive or negative
  - Significance = p-value
35 Spearman correlation output
    Spearman Correlation Coefficients
    Prob > |r| under H0: Rho=0
    Number of Observations

                 kg_lost
    hours        0.51521
                  0.0168
                      21
    calories    -0.56577
                  0.0061
                      22
36 Spearman output (cont.): 95% Confidence Interval
- Achieved with the fisher option on the proc statement
    Spearman Correlation Statistics (Fisher's z Transformation)
                With                                     p Value for
    Variable    Variable    95% Confidence Limits        H0:Rho=0
    kg_lost     hours        0.094658     0.769409         0.0156
    kg_lost     calories    -0.792255    -0.176304         0.0052
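To see where confidence limits like these come from, the Fisher z calculation can be retraced by hand in a data step. This sketch plugs in the coefficient and sample size reported for kg_lost with hours; the bias adjustment term r/(2(n-1)) is my assumption about the PROC CORR default, and work.fisher_ci is a hypothetical data set name.

```sas
/* r and n are taken from the Spearman output for kg_lost with hours. */
data work.fisher_ci;
  r = 0.51521;
  n = 21;
  z     = 0.5 * log((1 + r) / (1 - r));   /* Fisher's z transformation */
  bias  = r / (2 * (n - 1));              /* bias adjustment (assumed PROC CORR default) */
  se    = 1 / sqrt(n - 3);                /* approximate standard error of z */
  zcrit = probit(0.975);                  /* normal quantile for a 95% interval */
  lower = tanh(z - bias - zcrit * se);    /* back-transform to the correlation scale */
  upper = tanh(z - bias + zcrit * se);
run;

proc print data=work.fisher_ci noobs;
  var r n lower upper;
  format lower upper 8.4;
run;
```

Under that assumption the printed limits should come out close to those reported by PROC CORR for this pair; small differences would reflect rounding of r.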
37 Spearman correlation test: Interpretation
- Ho: There is no monotonic (rank-based) correlation between the 2 continuous variables
- Ha: There is a monotonic correlation
- Interpretation/write-up: We calculated Spearman correlation coefficients of kg lost with hours exercised and calories consumed. At alpha = 0.05, there is significant evidence of a moderate, positive monotonic association between kg lost and hours (r = 0.52) with a 95% CI of 0.09 to 0.77, p = 0.0168. There is a significant negative monotonic association between kg lost and calories (r = -0.57) with a 95% CI of -0.79 to -0.18, p = 0.0061.
38 To sum it all up
- No hard and fast rules for determining parametric vs. nonparametric
- Use a combination of evidence from the previously mentioned assessments of normality
- Descriptive stats: always report a measure of central tendency with a measure of spread
- Don't do nonparametric "just in case" without attempting a normality assessment
- Try assessing normality, and if you are not able to come to a conclusion, then nonparametric is a safe and conservative bet
  - If you are wrong, you will lose only a small amount of power
  - If normality assumptions ARE met, then nonparametric tests are not as powerful as their parametric counterparts.