Title: An Introduction to Nonparametric Statistics


1
Bridget Neville, MPH
Dana-Farber Cancer Institute, Boston, MA
An Introduction to Nonparametric Statistics
Using SAS
2
What are we going to talk about today?
  • Working with continuous variables
  • Descriptive statistics
  • Testing for normality
  • Using nonparametric tests
  • Signed rank (paired data)
  • Wilcoxon rank sum (2 samples)
  • Kruskal-Wallis (3 or more samples)
  • Spearman correlation
  • Interpretation of test results

3
You already learned about categorical variables...
  • Now for continuous variables
  • Numbers typically serve as values
  • Can take on any range of values
  • Equal differences between values do have equal
    quantitative meaning
  • Examples
  • Age
  • Length, weight
  • Distance

4
Descriptive statistics: Measures of central
tendency and spread
  • Central tendency: a value/number that reveals
    where the CENTER of the distribution is located
    in that sample
  • mean
  • median
  • Spread (or variability): represents extent of
    dispersion in a distribution
  • Standard deviation
  • Interquartile range (IQR)

5
How to report measures
  • Always report a measure of central tendency with
    a measure of spread
  • Mean (standard deviation)
  • Patients in this sample had a mean age of 65
    years (standard deviation = 4).
  • Median (IQR)
  • Patients in this sample had a median age of 72
    years (IQR = 45, 85).
  • Note that 45 is the 25th percentile and 85 is
    the 75th percentile. It is best to represent the
    IQR this way instead of as one number that is the
    difference between the 25th and 75th
    percentiles.

6
Mean vs. Median: When to report which
  • Mean (standard deviation)
  • When distribution is normal or approximately
    normal/fairly symmetrical
  • Large sample size
  • Caution: strongly influenced by outliers
  • Median (IQR)
  • When distribution is non-normal (skewed)
  • Small sample size
  • Robust to outliers
  • Sometimes with ordinal data

7
SAS Procedures for Mean and Median
  • proc means data=dsname
  •   n mean std median p25 p75 min max;
  •   var varname;
  • run;
  • std (or stddev) = standard deviation
  • p25 = 25th percentile; p75 = 75th percentile
  • proc univariate data=dsname;
  •   var varname;
  • run;
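As a minimal sketch of the two reporting styles, the step below builds a small dataset and asks PROC MEANS for both summaries at once. The dataset name work.patients and the ten ages are made up for illustration; they are not from the presentation.

```sas
/* Hypothetical data: ages (years) for 10 patients, invented for this sketch */
data work.patients;
  input age @@;
  datalines;
45 52 58 61 65 68 72 79 85 91
;
run;

/* Mean with standard deviation, and median with 25th/75th percentiles */
proc means data=work.patients n mean std median p25 p75 min max maxdec=1;
  var age;
run;
```

The mean and std columns support a "mean (standard deviation)" write-up, while median, p25 and p75 support a "median (IQR)" write-up.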

8
What does normal look like?
  • Normal distribution: points are as likely to
    occur on one side of the mean as the other

9
Non-normal distribution
  • Asymmetrical because natural limit prevents
    outcomes on one side
  • Peak of distribution is off-center and toward
    limit and tail stretches away from it
  • Called right- or left-skewed according to
    direction of tail

10
How to assess normality (Use combination of these
methods)
  • Visual inspection of distributions of data in
    each group (skewed = NOT normal)
  • Histogram, normal probability plot, stem and leaf
    plot, box plot
  • Comparison of mean vs. median (different = NOT
    normal)
  • Test hypothesis that data are from a normal
    distribution (p < 0.05 = NOT normal)
  • Knowledge of variable (ordinal = usually NOT
    normal)
  • Small sample: usually assume NOT normal

11
Proc Univariate: Visual Inspection
  • Stem-and-leaf plot and box plot

      Stem Leaf            #  Boxplot
         2 6               1     0
         1 46              2
         0 133344566788   12   ----
        -0 95              2   -----
        -1 43              2
        -2 2               1     0
      Multiply Stem.Leaf by 10**+1

  • Normal probability plot
    [Plot not recoverable: vertical axis from -25 to
    25, horizontal axis from -2 to 2; legend:
    Normality, Your data]

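A sketch of how output like this could be regenerated. The 20 values below are read off the stem-and-leaf display on this slide (Stem.Leaf times 10); their sum (51) and sum of squares (2417) agree with the Moments output on the next slide, so they appear to be the original scordiff data. The dataset name work.scores is an assumption, as is the added HISTOGRAM statement.

```sas
/* scordiff values read off the stem-and-leaf display (Stem.Leaf x 10) */
data work.scores;
  input scordiff @@;
  datalines;
26 16 14 8 8 7 6 6 5 4 4 3 3 3 1 -5 -9 -13 -14 -22
;
run;

/* normal: tests for normality                                   */
/* plot:   stem-and-leaf plot, box plot, normal probability plot */
proc univariate data=work.scores normal plot;
  var scordiff;
  histogram scordiff / normal;   /* histogram with a fitted normal curve */
run;
```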
12
Proc Univariate: Mean vs. Median
The SAS System
Testing for significant difference in scores
The UNIVARIATE Procedure
Variable: scordiff (Differences in exam scores)

Moments
  N                 20            Sum Weights        20
  Mean              2.55          Sum Observations   51
  Std Deviation     10.9711344    Variance           120.365789
  Skewness          -0.3418653    Kurtosis           0.81164646
  Uncorrected SS    2417          Corrected SS       2286.95
  Coeff Variation   430.240564    Std Error Mean     2.45322023

Basic Statistical Measures
  Location              Variability
  Mean     2.550000     Std Deviation         10.97113
  Median   4.000000     Variance              120.36579
  Mode     3.000000     Range                 48.00000
                        Interquartile Range    9.50000
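As a quick check on how the Moments entries relate to one another, the quantities can be recomputed from N, the sum, and the uncorrected sum of squares reported above (standard textbook formulas; rounding is mine):

```latex
\begin{align*}
\bar{x} &= \frac{\sum x_i}{n} = \frac{51}{20} = 2.55 \\
\text{Corrected SS} &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
                     = 2417 - \frac{51^2}{20} = 2286.95 \\
s^2 &= \frac{\text{Corrected SS}}{n-1} = \frac{2286.95}{19} \approx 120.3658,
       \qquad s \approx 10.9711 \\
\text{SE}(\bar{x}) &= \frac{s}{\sqrt{n}} = \frac{10.9711}{\sqrt{20}} \approx 2.4532 \\
\text{CV} &= 100 \cdot \frac{s}{\bar{x}} = 100 \cdot \frac{10.9711}{2.55} \approx 430.24
\end{align*}
```

The mean (2.55) sits below the median (4.0) and the skewness is negative, which is the kind of mean vs. median disagreement the earlier slide lists as a sign of non-normality.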
13
Proc Univariate: Test normality hypothesis
Tests for Normality
  Test                  --Statistic---     -----p Value------
  Shapiro-Wilk          W      0.942009    Pr < W      0.2616
  Kolmogorov-Smirnov    D      0.216359    Pr > D      0.0153
  Cramer-von Mises      W-Sq   0.142641    Pr > W-Sq   0.0270
  Anderson-Darling      A-Sq   0.687072    Pr > A-Sq   0.0646
Ho: Data come from normal distribution
Ha: Data NOT normal
Note that SAS will not report Shapiro-Wilk for
samples with n > 2000
14
Nonparametric Statistics for Non-normal data
  • Parameter-free or distribution-free: do not rely
    on estimation of parameters such as mean and
    standard deviation to describe distribution of
    population variables.
  • Assumption-free: do not rely on assumptions
    about the shape of variable distributions (like
    assumption of normality, approximate normality
    or symmetry).

15
Summary of analytic strategies
Univariate tests
16
Analytic Strategies (cont.)
Bivariate tests: Tests of equivalence of
distributions
17
Analytic strategies (cont.)
Bivariate tests (continued): Test of homogeneity
of means/medians across groups
18
Analytic strategies (cont.)
Tests of linear association
19
Tests for paired data
  • You want to compare the mean/median difference
    between 2 continuous measurements on the same
    person/unit of analysis
  • Examples
  • Repeated obs on same person
  • Before and after (like diet studies)
  • Body parts (right knee vs. left knee)
  • 2 test scores
  • Sibling studies

20
Nonparametric: Signed rank test
  • proc univariate data=sta6207 normal;
  •   var scordiff;
  • run;
  • Tests for Location: Mu0=0

      Test           -Statistic-     -----p Value------
      Student's t    t   1.03945     Pr > |t|     0.3116
      Sign           M         5     Pr >= |M|    0.0414
      Signed Rank    S        33     Pr >= |S|    0.2265

S = test statistic: SAS reports the sum of the
positive ranks minus its expected value under Ho,
n(n+1)/4.
Note: For a 1-sided p-value, we would divide the
2-sided p-value by 2.
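A worked check of the Sign and Signed Rank statistics, assuming the 20 scordiff values are the ones shown in the earlier stem-and-leaf display (15 positive, 5 negative, no zeros). Ranking the absolute differences with average ranks for ties gives a positive rank sum of 138; that figure is my own calculation from those values, not a number printed in the output.

```latex
\begin{align*}
M &= \frac{n^{+} - n^{-}}{2} = \frac{15 - 5}{2} = 5 \\
T^{+} &= \sum_{d_i > 0} \operatorname{rank}\lvert d_i \rvert = 138 \\
S &= T^{+} - \frac{n(n+1)}{4} = 138 - \frac{20 \cdot 21}{4} = 138 - 105 = 33
\end{align*}
```

Both results match the M = 5 and S = 33 reported by PROC UNIVARIATE, which suggests S is the positive rank sum centered at its expected value under Ho rather than the raw sum.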
21
Signed Rank Test: Interpretation
  • Ho: Median difference = 0
  • Ha: Median difference ≠ 0
  • Interpretation/write-up: We performed the signed
    rank test to determine if there was a significant
    difference in test scores. The median difference
    in test scores (test2-test1) was 4 (IQR -2,
    7.5), although it was not significant at
    alpha = 0.05 (p = 0.2265).

22
Tests for 2 independent samples
  • You want to compare mean/median of a continuous
    variable between 2 independent categorical groups
  • Examples
  • Age of men vs. women
  • Days between diagnosis and treatment for
    hematological vs. solid cancers
  • Income in more vs. less educated individuals

23
2-sample tests for normality
  • proc sort data=work.tumor;
  •   by group;
  • proc univariate data=work.tumor plot normal;
  •   by group;
  •   var mass;
  • run;
  • Need to test normality in each group!
  • If at least 1 group is NOT normal, use
    nonparametric analysis

24
Wilcoxon rank sum test
  • proc npar1way data=tumor wilcoxon;
  •   class group;
  •   var mass;
  • run;
  • Also known as Mann-Whitney U-test
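A self-contained sketch of the same call. The tumor masses below are invented for illustration (the presentation does not list its raw data), and the EXACT statement is an addition of mine: with groups this small, an exact p-value can be requested alongside the normal and t approximations.

```sas
/* Hypothetical tumor masses for two groups, invented for this sketch */
data work.tumor;
  input group $ mass @@;
  datalines;
A 2.1 A 2.4 A 2.6 A 2.9 A 3.3
B 0.0 B 0.4 B 0.9 B 2.5
;
run;

proc npar1way data=work.tumor wilcoxon;
  class group;
  var mass;
  exact wilcoxon;   /* exact p-value in addition to the approximations */
run;
```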

25
Wilcoxon rank sum output
The NPAR1WAY Procedure
Wilcoxon Scores (Rank Sums) for Variable mass
Classified by Variable group

  group    N    Sum of    Expected    Std Dev     Mean
                Scores    Under H0    Under H0    Score
  A        5    33.0      25.0        4.065437    6.60
  B        4    12.0      20.0        4.065437    3.00
  Average scores were used for ties.

Wilcoxon Two-Sample Test
  Statistic                12.0000
  Normal Approximation
  Z                        -1.8448
  One-Sided Pr <  Z         0.0325
  Two-Sided Pr > |Z|        0.0651
  t Approximation


26
Wilcoxon rank sum output: Explanations
  • 1. Group: levels of the variable that defines
    the groups
  • 2. N: number of obs in each group
  • 3. Sum of scores: sum of Wilcoxon scores in
    each group (these are ranks)
  • 4. Expected/Std Dev under Ho: the sum of Wilcoxon
    scores expected under Ho of no difference between
    groups, and its standard deviation
  • 5. Mean score: average score (rank) for each
    group
  • 6. Statistic: sum of scores for group with
    smaller sample size
  • 7. Normal approximation: for larger samples
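The entries in the output can be tied together with the usual rank sum formulas. The sketch below uses the reported numbers for group B (the smaller group, n_B = 4, with N = 9 observations overall) and assumes the default 0.5 continuity correction:

```latex
\begin{align*}
E(W_B) &= \frac{n_B (N+1)}{2} = \frac{4 \cdot 10}{2} = 20,
  \qquad E(W_A) = \frac{5 \cdot 10}{2} = 25 \\
\text{SD}_{\text{no ties}} &= \sqrt{\frac{n_A n_B (N+1)}{12}}
  = \sqrt{\frac{5 \cdot 4 \cdot 10}{12}} \approx 4.0825 \\
Z &= \frac{W_B - E(W_B) + 0.5}{\text{SD}}
  = \frac{12 - 20 + 0.5}{4.065437} \approx -1.8448
\end{align*}
```

The reported standard deviation (4.065437) is slightly below the no-ties value, which is consistent with the tie correction noted in the output. The write-up on the next slide quotes p = 0.1023 rather than the normal-approximation 0.0651; that value is consistent with a two-sided t approximation, whose figures are cut off in the output above, though the source does not say so.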

27
Wilcoxon rank sum test: Interpretation
  • Ho: Distribution of continuous variable is same
    in both groups
  • Ha: Distribution not the same
  • Interpretation/write-up: We performed the
    Wilcoxon rank sum test to determine if there was
    a significant difference in tumor size between
    groups A and B. The median tumor sizes were 2.5
    (IQR 2.2, 2.7) and 0.5 (IQR 0, 1.65) for groups A
    and B, respectively. While group A resulted in
    more tumor growth than B, this difference was not
    significant at alpha = 0.05 (p = 0.1023).

28
Tests for more than 2 samples
  • You want to compare mean/median of a continuous
    variable between 3 or more independent
    categorical groups
  • Examples
  • Age of patients in 4 treatment groups RT, chemo,
    surgery or best supportive care
  • Batting average in 1st base vs. 2nd base vs. 3rd
    base players

29
More than 2 sample tests for normality
  • proc sort data=work.baseball;
  •   by position;
  • run;
  • proc univariate data=work.baseball
  •   plot normal;
  •   by position;
  •   var bat_avg;
  • run;
  • Need to test normality in each group!
  • If at least 1 group is NOT normal, use
    nonparametric analysis

30
Kruskal-Wallis test
  • proc npar1way data=baseball wilcoxon;
  •   class position;
  •   var bat_avg;
  • run;
  • OUTPUT
  • Kruskal-Wallis Test
  • Chi-Square 10.4955
  • DF 2
  • Pr > Chi-Square 0.0053

31
Kruskal-Wallis test: Interpretation
  • Ho: Distribution of continuous variable is same
    in all groups.
  • Ha: Distribution different in at least 1 group.
  • Interpretation/write-up: We performed the
    Kruskal-Wallis test to determine if there was a
    significant difference in batting average between
    player positions. The median batting averages
    were 286 (IQR 269, 319), 201 (IQR 194, 252), and
    303 (IQR 278, 337) for 1st, 2nd and 3rd base,
    respectively. Batting average was significantly
    different overall at alpha = 0.05 (p = 0.0053).
    Pairwise comparisons were then conducted using
    Wilcoxon rank-sum tests, which found both 1st and
    3rd base players had significantly higher batting
    averages than 2nd base players (1st: p = 0.0226;
    3rd: p = 0.0120).
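One way the overall test and a pairwise follow-up could be run is sketched below. The batting averages and the position codes 1B/2B/3B are invented for illustration, and restricting the data with a WHERE statement is my assumption about how the pairwise Wilcoxon tests might be set up; the presentation does not show that step.

```sas
/* Hypothetical batting averages; position codes 1B/2B/3B are assumed */
data work.baseball;
  input position $ bat_avg @@;
  datalines;
1B 286 1B 269 1B 319 1B 301 1B 275
2B 201 2B 194 2B 252 2B 210 2B 233
3B 303 3B 278 3B 337 3B 295 3B 320
;
run;

/* Overall Kruskal-Wallis test across the three positions */
proc npar1way data=work.baseball wilcoxon;
  class position;
  var bat_avg;
run;

/* One pairwise follow-up: 1st base vs. 2nd base only */
proc npar1way data=work.baseball wilcoxon;
  where position in ('1B', '2B');
  class position;
  var bat_avg;
run;
```

With two class levels the same call yields the Wilcoxon two-sample test. The source does not say whether the pairwise p-values were adjusted for multiple comparisons.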

32
Correlation tests
  • You want to see how 2 continuous variables are
    related to each other (with respect to strength
    and direction)
  • Examples
  • Relationship between weight lost and
  • hours exercising
  • calorie consumption
  • Relationship between baby birth weight and
    mother's age at birth

33
Correlation tests for normality
  • proc univariate data=wt_loss plot normal;
  •   var kg_lost hours calories;
  • run;
  • Need to test normality of each variable that
    contributes to the correlation test!
  • If at least 1 variable is NOT normal, use
    nonparametric analysis

34
Spearman correlation
  • proc corr data=wt_loss spearman fisher;
  •   var kg_lost;
  •   with hours calories;
  • run;
  • Magnitude, direction and significance are
    equally and separately important
  • Magnitude = numerical value
  • Direction = positive or negative
  • Significance = p-value
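A self-contained sketch of the call, with Pearson requested next to Spearman so the two coefficients can be compared. The eight rows below are invented for illustration; one hours value is deliberately missing (the period) to show why the number of observations can differ between pairs of variables.

```sas
/* Hypothetical weight-loss data, invented for this sketch */
data work.wt_loss;
  input kg_lost hours calories;
  datalines;
0.5 1.0 2900
1.2 2.0 2750
2.0 1.5 2600
2.8 3.5 2400
3.1  .  2300
4.0 4.0 2100
4.4 5.5 2250
6.3 6.0 1800
;
run;

proc corr data=work.wt_loss pearson spearman;
  var kg_lost;
  with hours calories;
run;
```

By default PROC CORR drops missing values pair by pair, so in this sketch the kg_lost and hours coefficient uses 7 rows while kg_lost and calories uses all 8.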

35
Spearman correlation output
  Spearman Correlation Coefficients
  Prob > |r| under H0: Rho=0
  Number of Observations

                 kg_lost
  hours          0.51521
                 0.0168
                 21
  calories       -0.56577
                 0.0061
                 22

36
Spearman output (cont.): 95% Confidence Interval
  • Achieved with fisher option on proc statement
  Spearman Correlation Statistics (Fisher's z
  Transformation)

              With                                      p Value for
  Variable    Variable    95% Confidence Limits         H0:Rho=0
  kg_lost     hours        0.094658     0.769409        0.0156
  kg_lost     calories    -0.792255    -0.176304        0.0052
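A sketch of how the confidence limits for kg_lost with hours can be reproduced from r = 0.51521 and n = 21. The bias adjustment r/(2(n-1)) is my assumption about the default behavior of the fisher option; it is used here because it appears to reproduce the reported limits.

```latex
\begin{align*}
z &= \tanh^{-1}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}
   = \tfrac{1}{2}\ln\frac{1.51521}{0.48479} \approx 0.5698 \\
z_{\text{adj}} &= z - \frac{r}{2(n-1)} = 0.5698 - \frac{0.51521}{40} \approx 0.5569 \\
\text{SE} &= \frac{1}{\sqrt{n-3}} = \frac{1}{\sqrt{18}} \approx 0.2357 \\
z_{\text{adj}} \pm 1.96 \cdot \text{SE} &\approx (0.0949,\ 1.0189) \\
\tanh(0.0949) &\approx 0.0947, \qquad \tanh(1.0189) \approx 0.7694
\end{align*}
```

These agree with the reported limits of 0.094658 and 0.769409 to three decimal places.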

37
Spearman correlation test: Interpretation
  • Ho: There is no monotonic (rank-based)
    correlation between the 2 continuous variables
  • Ha: There is a monotonic correlation
  • Interpretation/write-up: We calculated Spearman
    correlation coefficients of kg lost with hours
    exercised and calories consumed. At alpha = 0.05,
    there is significant evidence of a moderate,
    positive monotonic association between kg lost
    and hours (r = 0.52) with a 95% CI of 0.09 to
    0.77, p = 0.0168. There is a significant negative
    monotonic association between kg lost and
    calories (r = -0.57) with a 95% CI of -0.79 to
    -0.18, p = 0.0061.

38
To sum it all up
  • No hard and fast rules for determining parametric
    vs. nonparametric
  • Use combination of evidence from previously
    mentioned assessments of normality
  • Descriptive stats: always report measure of
    central tendency with spread
  • Don't do nonparametric "just in case" without
    attempting a normality assessment
  • Try assessing normality, and if you are not able
    to come to a conclusion, then nonparametric is a
    safe and conservative bet
  • If you are wrong, you will lose only a small
    amount of power
  • If normality assumptions ARE met, then
    nonparametric tests are not as powerful as their
    parametric counterparts.
