Analyzing Categorical Data - PowerPoint PPT Presentation

Slides: 40
Provided by: mickey9
Transcript and Presenter's Notes

1
Chapter 3
  • Analyzing Categorical Data

2
Recoding Data
  • In some instances, we want to produce data in a different format. For example:
  • We may want to create a variable so that 1 = Male, 2 = Female.
  • We may want to create a variable, say agegrp, that will be 1 if age ... 50 inclusive and 3 otherwise.
  • There are a few ways to go about this.

3
Example 1
  • data example;
  • input g;
  • length gender $ 6; /* long enough that 'Female' is not truncated */
  • if g=1 then gender='Male';
  • else gender='Female';
  • datalines;
  • 1
  • 2
  • 2
  • 1
  • 1
  • We created a variable so that 1 = Male, 2 = Female.

4
Example 2
  • data example;
  • input gender $ age $;
  • if gender='male' and age='young' then grp=1;
  • else if gender='female' and age='young' then grp=2;
  • else if gender='male' and age='old' then grp=3;
  • else if gender='female' and age='old' then grp=4;
  • datalines;
  • male old
  • male old
  • female young
  • female old
  • male young
  • 4 groups created: (male and young), (female and young), (male and old), (female and old).

5
Example 3 - Inequalities
  • data example;
  • input age;
  • if age ... then agegrp=1;
  • else if age ... then agegrp=2;
  • else agegrp=3;
  • datalines;
  • 44
  • 31
  • 15
  • 22
  • 23
  • The available inequalities are <, <=, >, and >=. ^= (or ne) is used for not equal.
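
As a sketch of how the inequality version might look in full, the step below assumes hypothetical cutoffs of 30 and 50 (the cutoffs on the original slide are not recoverable from the transcript) and a hypothetical data set name, agedemo. It reuses the five ages from the example.

```sas
data agedemo;                         /* hypothetical data set name */
  input age;
  /* cutoffs 30 and 50 are assumptions chosen for illustration */
  if age < 30 then agegrp = 1;
  else if age <= 50 then agegrp = 2;  /* 30 through 50, inclusive */
  else agegrp = 3;
  datalines;
44
31
15
22
23
;
run;

proc print data=agedemo;
run;
```

With these assumed cutoffs, ages 44 and 31 fall in group 2 and ages 15, 22, and 23 fall in group 1.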

6
Adding Labels
  • Since SAS restricts the length of variable names (8 characters in older releases, 32 in current ones), we sometimes can't be as descriptive as we'd like.
  • We can make labels for the variables, which will show up in the output for procedures.

7
Example
  • data new;
  • input ageenter gender $ gpahs;
  • label ageenter='Age Upon Entering College'
  • gpahs='High School GPA';
  • datalines;
  • 20 male 3.4
  • 19 female 2.91
  • 22 male 3.8
  • proc means;
  • var ageenter;
  • run;

8
Proc Format
  • Suppose that categorical variables have been coded with numbers. For example:
  • Major: 1=Math, 2=English, 3=Accounting
  • Gender: 1=Male, 2=Female
  • Hair Color: 1=Blonde, 2=Brown, 3=Red, 4=Black
  • We want to recode these numbers.

9
  • proc format;
  • value majfmt 1='Math' 2='English' 3='Accounting';
  • value genfmt 1='Male' 2='Female';
  • value hairfmt 1='Blonde' 2='Brown' 3='Red' 4='Black';
  • run;
  • data new;
  • input major gender hairc;
  • format major majfmt. gender genfmt. hairc hairfmt.;
  • datalines;

10
Two Way Frequency Table
  • To have SAS create a two-way frequency table, use proc freq.

11
  • data new;
  • input gender $ party $ count;
  • datalines;
  • male dem 20
  • male rep 15
  • female dem 30
  • female rep 40
  • proc freq;
  • tables gender party gender*party;
  • weight count;
  • run;

12
Math 210 - Review
  • Suppose we want to test H0: μ = 4 versus H1: μ ≠ 4 at the α = 0.05 significance level. We can draw a conclusion in 3 possible ways:
  • Compare the test statistic (TS) to the critical value (CV).
  • Compare the p-value to the level of significance.
  • Obtain a 95% confidence interval for the population mean and check to see if 4 is in the interval.
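
The three routes can be written side by side. This is only a sketch: it assumes a two-sided one-sample t test of whether the population mean equals 4, with sample mean x-bar, sample standard deviation s, and sample size n as notation introduced here. The course may have used a z statistic instead, in which case the normal critical value replaces the t critical value.

```latex
\begin{align*}
\text{TS vs. CV:} \quad & t = \frac{\bar{x} - 4}{s/\sqrt{n}}, \quad \text{reject } H_0 \text{ if } |t| > t_{0.025,\,n-1} \\
\text{p-value:} \quad & p = 2\,P\!\left(T_{n-1} \ge |t|\right), \quad \text{reject } H_0 \text{ if } p < 0.05 \\
\text{Confidence interval:} \quad & \bar{x} \pm t_{0.025,\,n-1}\,\frac{s}{\sqrt{n}}, \quad \text{reject } H_0 \text{ if } 4 \text{ is not in the interval}
\end{align*}
```

All three rules lead to the same decision for a given sample.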

13
Chi-Square Test
  • Example: A survey of 436 workers showed that 192 of them feel that it is seriously unethical to monitor employee email. When 121 senior-level bosses were surveyed, 40 said that it was seriously unethical.

14
(No Transcript)
15
Total Independence
  • Note that:
  • The response of a worker doesn't affect the response of any other worker (random sample!).
  • The response of a boss doesn't affect the response of any other boss (random sample!).
  • Also, the response of a worker doesn't affect the response of any boss, and vice versa.

16
  • The question of interest: Is the proportion of workers that think it's seriously unethical to monitor employee email equal to the proportion of senior-level bosses that think it's seriously unethical to monitor employee email?

17
To run the Chi-Square test
  • data chisqex;
  • input emptype $ resp $ count;
  • datalines;
  • worker agree 192
  • worker disagree 244
  • boss agree 40
  • boss disagree 81
  • proc freq data=chisqex;
  • weight count;
  • tables emptype*resp / chisq;
  • run;

18
  • SAS will provide a test statistic and p-value.
    Reject the null hypothesis if the p-value is less
    than the significance level.
  • This same test was seen in Math 210 except a
    different test statistic was used. The standard
    normal distribution was used to calculate the
    p-value rather than the Chi-Square distribution.
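
As a worked check of that relationship, the derivation below computes the Pearson chi-square statistic for the email survey by hand and compares it with the two-proportion z statistic. It assumes the uncorrected forms of both statistics (no continuity correction). The letters a = 192, b = 244, c = 40, d = 81 for the four cell counts, n = 557 for the total, and the pooled proportion p-hat = 232/557 are notation introduced here, not taken from the slides.

```latex
\begin{align*}
\chi^2 &= \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
        = \frac{557\,(192 \cdot 81 - 244 \cdot 40)^2}{436 \cdot 121 \cdot 232 \cdot 325}
        \approx 4.70 \\[4pt]
z &= \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{436} + \frac{1}{121}\right)}}
   = \frac{0.4404 - 0.3306}{\sqrt{0.4165 \cdot 0.5835 \cdot 0.01056}}
   \approx 2.17, \qquad z^2 \approx 4.70 = \chi^2
\end{align*}
```

Both routes give a p-value of about 0.03, so at the 0.05 level we would reject the hypothesis of equal proportions.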

19
  • The Chi-Square test can also be used to determine if the row and column variables are independent.
  • For example, there might be a dependence between hair color and eye color.
  • But we wouldn't expect a dependence between hair color and university major.

20
McNemar's Test For Paired Data
  • Example: Forty-five couples are asked whether they approve of one of their U.S. Senators.

21
  • The question is this: Is the proportion of wives that support one of their U.S. Senators equal to the proportion of husbands that support one of their U.S. Senators?
  • In the Chi-Square test before, we had independence.

22
  • The response of one husband doesn't affect the response of any other husband.
  • The response of one wife doesn't affect the response of any other wife.
  • However, the response of one husband may affect the response of his own wife, and vice versa. It seems reasonable that they may talk about politics.

23
To run McNemar's test
  • data mcnex;
  • input husbresp $ wiferesp $ count;
  • datalines;
  • yes yes 20
  • yes no 5
  • no yes 10
  • no no 10
  • proc freq data=mcnex;
  • weight count;
  • tables husbresp*wiferesp / agree;
  • run;

24
  • SAS will supply a test statistic and p-value.
  • As with any statistical test, we reject the null hypothesis if the p-value is less than a pre-determined level of significance, say 0.01, 0.05, etc.
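
For reference, the statistic for the couples data can be checked by hand. The sketch below assumes the usual large-sample McNemar statistic without a continuity correction, which uses only the couples who disagree with each other. The letters b = 5 (husband yes, wife no) and c = 10 (husband no, wife yes) are notation introduced here.

```latex
\begin{align*}
\hat{p}_{\text{husband}} &= \frac{20 + 5}{45} \approx 0.556, \qquad
\hat{p}_{\text{wife}} = \frac{20 + 10}{45} \approx 0.667 \\[4pt]
\chi^2_{\text{McNemar}} &= \frac{(b - c)^2}{b + c} = \frac{(5 - 10)^2}{5 + 10} = \frac{25}{15} \approx 1.67
\quad (1 \text{ degree of freedom})
\end{align*}
```

This corresponds to a p-value of roughly 0.20, so we would not reject the null hypothesis at the 0.05 level.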

25
Odds Ratio
  • Suppose that there is a p = 10% chance of rain tomorrow. Then the odds of it raining are defined as
  • $\text{odds} = \dfrac{p}{1-p} = \dfrac{0.1}{0.9} = \dfrac{1}{9}$
  • So the odds of it raining are 1 to 9 (or 1:9). The odds of it not raining are 9 to 1 (or 9:1).

26
  • Example: Suppose that in a random sample of 100 men, 90 have drunk beer and in a sample of 100 women, 20 have drunk beer.

27
  • The proportion of men that drink beer is 90/100 = 0.9 and the proportion of women that drink beer is 20/100 = 0.2.
  • The odds that a man drinks beer are 0.9/0.1 = 9.
  • The odds that a woman drinks beer are 0.2/0.8 = 0.25.
  • So the odds ratio is OR = 9/0.25 = 36.

28
  • The odds ratio can be any value between 0 and infinity.
  • If OR < 1, then men are less likely than women to drink beer.
  • If OR = 1, then men and women are equally likely to drink beer.
  • If OR > 1, then men are more likely than women to drink beer.
  • Since OR = 36, men are much more likely than women to drink beer.

29
  • The question of interest: Who is more likely to drink beer?
  • data oddsr;
  • input gender $ beerresp $ count;
  • datalines;
  • men 1-yes 90
  • men 2-no 10
  • women 1-yes 20
  • women 2-no 80
  • proc freq data=oddsr;
  • weight count;
  • tables gender*beerresp / chisq cmh;
  • run;

30
  • In the output, SAS will provide a confidence interval for the population odds ratio. If the confidence interval contains 1, then, statistically speaking, we cannot conclude that men and women differ in how likely they are to drink beer.
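
As a rough check on what to expect, the interval for the beer data can be approximated by hand. The sketch below assumes the standard large-sample (Wald) interval on the log-odds scale at 95% confidence. The limits SAS prints may differ slightly depending on the method it reports.

```latex
\begin{align*}
\widehat{OR} &= \frac{90 \cdot 80}{10 \cdot 20} = 36, \qquad
SE\!\left(\ln \widehat{OR}\right) = \sqrt{\tfrac{1}{90} + \tfrac{1}{10} + \tfrac{1}{20} + \tfrac{1}{80}} \approx 0.417 \\[4pt]
\ln 36 \pm 1.96 \cdot 0.417 &\approx 3.584 \pm 0.817
\;\Longrightarrow\; \left(e^{2.767},\, e^{4.400}\right) \approx (15.9,\ 81.5)
\end{align*}
```

Because this interval lies well above 1, the data indicate that men have higher odds of drinking beer than women.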

31
  • In the SAS code, notice that I used 1-yes/2-no rather than simply using yes/no.
  • The reason is that I want the odds for both men and women to represent the odds of drinking beer rather than the odds of not drinking beer.
  • N comes before Y alphabetically, but by using 1-yes and 2-no, I have the correct ordering.

32
  • If I were interested in obtaining an odds ratio with women in the numerator, I'd have to use 1-women/2-men because M comes before W alphabetically.
  • The results wouldn't be any less correct without this trick, but in some instances we want results in a particularly meaningful order.
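
The relabeling described above might look like the following sketch. It reuses the beer counts from the earlier example. The data set name oddsr2 is hypothetical, and the proc freq options are simply copied from the earlier step.

```sas
data oddsr2;                          /* hypothetical data set name */
  input gender $ beerresp $ count;
  datalines;
1-women 1-yes 20
1-women 2-no 80
2-men 1-yes 90
2-men 2-no 10
;
run;

proc freq data=oddsr2;
  weight count;
  /* the 1-/2- prefixes put women in row 1 and "yes" in column 1 */
  tables gender*beerresp / chisq cmh;
run;
```

With women in the first row, the odds ratio becomes 0.25/9 = 1/36, or about 0.028, which is the reciprocal of the earlier value of 36.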

33
Relative Risk (RR)
  • Relative Risk is a statistic commonly used in clinical trials to determine the risk of one group getting a certain disease, for instance, versus another group.

34
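  • Suppose that 20% of smokers develop lung cancer while 10% of non-smokers develop lung cancer. The relative risk is defined as
  • $RR = \dfrac{P(\text{cancer} \mid \text{smoker})}{P(\text{cancer} \mid \text{non-smoker})} = \dfrac{0.20}{0.10} = 2$
  • which is to say that smokers are twice as likely to develop lung cancer as non-smokers.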

35
  • Example: A group of 90 patients suffering from respiratory problems was selected for a study. They were randomly divided into 2 groups. One group received a new treatment (TRT) and the other group received a placebo. After a period of time, it was determined whether each patient showed improvement.

36

37
  • For the TRT group, the proportion that showed improvement is 29/45 = 0.6444.
  • For the placebo group, the proportion that showed improvement is 14/45 = 0.3111.
  • The relative risk for improvement is 0.6444/0.3111 ≈ 2.071.
  • SAS will provide the RR for improvement, but it will also provide the RR for not improving.
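
Both relative risks can be worked out by hand for comparison with the SAS output. The counts of patients who did not improve (45 - 29 = 16 for TRT and 45 - 14 = 31 for placebo) follow from the group sizes. The subscripted RR labels are notation introduced here.

```latex
\begin{align*}
RR_{\text{improve}} &= \frac{29/45}{14/45} = \frac{29}{14} \approx 2.071 \\[4pt]
RR_{\text{not improve}} &= \frac{16/45}{31/45} = \frac{16}{31} \approx 0.516
\end{align*}
```

Note that the two values are not reciprocals of each other, so it matters which column of the table is treated as the outcome of interest.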

38
  • data rr;
  • input group $ respimp $ count;
  • datalines;
  • 1-trt 1-yes 29
  • 1-trt 2-no 16
  • 2-plac 1-yes 14
  • 2-plac 2-no 31
  • proc freq data=rr;
  • weight count;
  • tables group*respimp / all nocol nopct;
  • run;

39
  • Again, notice the trick of using 1-trt/2-plac and 1-yes/2-no.