Title: Categorical Data
1 Categorical Data
2 Lesson Objective
- Understand basic rules of probability.
- Calculate marginal and conditional probabilities.
- Determine if two categorical variables are independent.
3 Recall: Rule of Thumb
Quantitative variables: averages or differences have meaning.
- Ex.: weight, height, income, age
4 Recall: Rule of Thumb
Categorical variables: classify people or things.
- Ex.: gender, race, occupation, political affiliation, country of origin
5 Note: Sometimes quantitative variables are expressed as categorical.
Income (Family Economic Income)
Class   Definition
1       Less than 30,000
2       30,000 but less than 100,000
3       100,000 or more
6 Relationships between variables
7 Relationship between two quantitative variables?
Is the relationship linear (scatterplot)?
- Yes: use correlation and least squares regression.
- No: try data transformations.
8 Recall: Boxplots
- Best graphical tool for examining the relationship between a quantitative variable and a categorical variable (i.e., comparing distributions).
Example: Weight vs. Country of Origin
A boxplot can be used to answer: Do the distributions of weights vary for different countries of origin?
9 Relationship between two categorical
variables?
Use two-way frequency tables
- Look at marginal probabilities and
conditional probabilities.
10 STATISTICS
is the science of transforming data into information to make decisions in the face of uncertainty.
11 Probability
How do we measure "uncertainty"?
- A numerical measure of the likelihood that an outcome or an event occurs.
- P(A) = probability of event A
12 Three Methods for Assessing Probability
- Classical
- Relative Frequency
- Subjective
13 Probability requirements for discrete variables
1. The probability of each possible outcome must be between 0 and 1.
2. Sum of the probabilities of all possible outcomes must equal 1. (Binomial, Poisson)
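A minimal sketch of how the requirements for a discrete distribution could be checked in code: every probability lies between 0 and 1, and the probabilities sum to 1. The function name `is_valid_distribution` and the coin-toss numbers are illustrative assumptions, not part of the slides.

```python
import math

def is_valid_distribution(probs, tol=1e-9):
    """Return True if probs meets both requirements for a discrete distribution."""
    # Requirement 1: every probability lies between 0 and 1.
    each_in_range = all(0.0 <= p <= 1.0 for p in probs)
    # Requirement 2: the probabilities sum to 1 (within a rounding tolerance).
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return each_in_range and sums_to_one

# Hypothetical example: number of heads in two fair coin tosses
# (a binomial distribution with n = 2, p = 0.5).
coin_probs = [0.25, 0.50, 0.25]
print(is_valid_distribution(coin_probs))        # True
print(is_valid_distribution([0.7, 0.5, -0.2]))  # False: one value is negative
```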
14 Conditional probability: the chance one event happens, given that another event will occur.
P(A | B)
15 Problem: Credit Card Manager
- New credit test to determine creditworthiness.
- Credit test checked against 500 previous customers.
16 Credit Test A
Credit History   Failed (F)   Passed (P)   Total
Good (G)             50          350        400
Default (D)          80           20        100
Total               130          370        500
17 What is the probability of a customer defaulting?
P(Defaults)
What is the probability of a customer defaulting
given that he fails test A?
P(Defaults given failed test A)
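The two questions can be worked directly from the Test A counts. The sketch below assumes the table reads as 350 good customers passing, 50 good customers failing, 20 defaulters passing, and 80 defaulters failing (the only assignment under which good customers mostly pass); the variable names are illustrative.

```python
# Counts from the Credit Test A table (500 previous customers).
# Assumed layout: key = (credit history, test result).
counts_a = {
    ("Good", "Passed"): 350, ("Good", "Failed"): 50,
    ("Default", "Passed"): 20, ("Default", "Failed"): 80,
}
total = sum(counts_a.values())  # 500

# Marginal probability: all defaulters out of all customers.
defaults = counts_a[("Default", "Passed")] + counts_a[("Default", "Failed")]
p_default = defaults / total

# Conditional probability: restrict to customers who failed the test,
# then ask what share of that group defaulted.
failed = counts_a[("Good", "Failed")] + counts_a[("Default", "Failed")]
p_default_given_failed = counts_a[("Default", "Failed")] / failed

print(f"P(D)     = {p_default:.3f}")               # 100/500 = 0.200
print(f"P(D | F) = {p_default_given_failed:.3f}")  # 80/130  = 0.615
```

The denominator changes from 500 to 130 once the condition is applied; that change of denominator is what distinguishes a conditional probability from a marginal one.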
18 General Rules
- P(A and B) = P(A) P(B | A) = P(B) P(A | B)
- P(A or B) = P(A) + P(B) - P(A and B)
19 P(Fails AND Defaults) = P(F) P(D | F)
20 P(Fails OR Defaults) = P(F) + P(D) - P(D AND F)
Note: The overlap group would be counted twice if no subtraction.
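As a worked instance of the multiplication and addition rules, the sketch below plugs in the Test A counts, assuming 130 customers failed, 100 defaulted, and 80 did both. Exact fractions are used so the arithmetic is easy to follow; the variable names are illustrative.

```python
from fractions import Fraction

# Assumed Test A counts: 500 customers, 130 failed, 100 defaulted, 80 did both.
total = 500
n_failed = 130
n_default = 100
n_failed_and_default = 80

p_f = Fraction(n_failed, total)
p_d = Fraction(n_default, total)
p_d_given_f = Fraction(n_failed_and_default, n_failed)

# Multiplication rule: P(F and D) = P(F) * P(D | F)
p_f_and_d = p_f * p_d_given_f

# Addition rule: P(F or D) = P(F) + P(D) - P(F and D)
p_f_or_d = p_f + p_d - p_f_and_d

print(p_f_and_d, float(p_f_and_d))  # 4/25 0.16
print(p_f_or_d, float(p_f_or_d))    # 3/10 0.3
```

Counting directly gives the same answer for the "or" case: 50 + 80 + 20 = 150 customers failed or defaulted, and 150/500 = 0.3. Without the subtraction, the 80 customers in the overlap would be counted twice.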
21 Does knowledge of the test A result help you make a better decision?
Compare P(D) with P(D | F).
Do you want to know the test A results before you give the loan?
Credit test A results and defaulting are ____________ on each other.
22 A Newer Credit Test. Is it even better?
A different sample of 500 credit records
23 Credit Test B
Credit History   Failed (F)   Passed (P)   Total
Good (G)             60          340        400
Default (D)          15           85        100
Total                75          425        500
24 What is the probability of a customer defaulting?
P(Defaults)
What is the probability of a customer defaulting
given that he fails test B?
P(Defaults given failed test B)
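The same calculation for Test B can be sketched with a small helper that optionally restricts the table to one test result. The helper name `prob_default` is an illustrative assumption, and the counts assume the table reads as 340 and 60 good customers and 85 and 15 defaulters in the two test-result columns, with the smaller column (75 customers) being the failures.

```python
# Counts from the Credit Test B table; key = (credit history, test result).
counts_b = {
    ("Good", "Passed"): 340, ("Good", "Failed"): 60,
    ("Default", "Passed"): 85, ("Default", "Failed"): 15,
}

def prob_default(counts, test_result=None):
    """P(Default), optionally conditioned on a test result ("Passed" or "Failed")."""
    # Keep only the cells that meet the condition (all cells if no condition).
    kept = {key: n for key, n in counts.items()
            if test_result is None or key[1] == test_result}
    n_default = sum(n for (history, _), n in kept.items() if history == "Default")
    return n_default / sum(kept.values())

print(prob_default(counts_b))            # 100/500 = 0.2
print(prob_default(counts_b, "Failed"))  # 15/75   = 0.2
print(prob_default(counts_b, "Passed"))  # 85/425  = 0.2
```

Under this reading, the default rate is the same whichever column is conditioned on, so the result does not depend on which column is labeled "Failed".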
25 Does knowledge of the test B result help you make a better decision?
Compare P(D) with P(D | F).
Test B tells me ____________.
Credit test B results and defaulting are ____________ of each other.
26 Independence
27 Two events are independent if the occurrence, or
non-occurrence, of one does not affect the
chances of the other occurring, or not occurring.
- Otherwise, we say the events are dependent.
28 If A and B are independent, then
- P(A and B) = P(A) P(B)
- P(A or B) = P(A) + P(B) - P(A) P(B)
- P(A | B) = P(A)
- P(B | A) = P(B)
Note: The condition does NOT change the probability.
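The product rule gives a simple numerical check for independence. The sketch below applies it to counts, with A = "fails the test" and B = "defaults"; the function name `events_independent` is hypothetical, and the counts assume 130 failures with 80 defaults among them for Test A, and 75 failures with 15 defaults among them for Test B.

```python
import math

def events_independent(n_a, n_b, n_a_and_b, n_total, tol=1e-9):
    """Check whether P(A and B) equals P(A) * P(B), using counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_a_and_b = n_a_and_b / n_total
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=tol)

# Test A: P(F and D) = 0.16, but P(F) * P(D) = 0.26 * 0.20 = 0.052
print(events_independent(130, 100, 80, 500))  # False

# Test B: P(F and D) = 0.03, and P(F) * P(D) = 0.15 * 0.20 = 0.03
print(events_independent(75, 100, 15, 500))   # True
```

With real sample data the two sides rarely match exactly, so in practice the question is whether they are close, not whether they are equal.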
29 Survey of randomly selected people (voters) in Jan. 2001
Q1: Did you vote in the 2000 election?
Q2: Do you favor an amendment to require a balanced budget?
Q3: To which political party do you belong?
30 Do you favor an amendment for a balanced budget?
Political Party   Yes    No   Total
Republican         90    82    172
Democrat           44   104    148
Other              48    32     80
Total             182   218    400
31 Sample size: 400
Marginal totals for party: 172 (Republican), 148 (Democrat), 80 (Other).
Marginal totals for opinion: 182 (Yes), 218 (No).
32 What proportion favor the amendment?
What proportion claim to be Republican?
What proportion favor the amendment and are Other?
33 What proportion favor the amendment, given those that claim to be Republican?
Considering only those opposed, what proportion are not Republican?
Of those that claim to be Democrat, what proportion favor the amendment?
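Each of these questions is a ratio of two counts from the survey table; what changes is the denominator. The sketch below reads the table as Republican 90 Yes / 82 No, Democrat 44 Yes / 104 No, Other 48 Yes / 32 No (400 respondents in total). The dictionary layout and labels are illustrative choices.

```python
# Survey counts: party -> opinion on the balanced-budget amendment.
table = {
    "Republican": {"Yes": 90, "No": 82},
    "Democrat":   {"Yes": 44, "No": 104},
    "Other":      {"Yes": 48, "No": 32},
}

n = sum(sum(row.values()) for row in table.values())    # 400
yes_total = sum(row["Yes"] for row in table.values())   # 182
no_total = sum(row["No"] for row in table.values())     # 218
rep_total = sum(table["Republican"].values())           # 172
dem_total = sum(table["Democrat"].values())             # 148

answers = {
    # Marginal and joint proportions: divide by the whole sample.
    "P(Yes)":           yes_total / n,
    "P(Rep)":           rep_total / n,
    "P(Yes and Other)": table["Other"]["Yes"] / n,
    # Conditional proportions: divide by the size of the restricted group.
    "P(Yes | Rep)":     table["Republican"]["Yes"] / rep_total,
    "P(not Rep | No)":  (no_total - table["Republican"]["No"]) / no_total,
    "P(Yes | Dem)":     table["Democrat"]["Yes"] / dem_total,
}

for label, value in answers.items():
    print(f"{label} = {value:.3f}")
# P(Yes) = 0.455
# P(Rep) = 0.430
# P(Yes and Other) = 0.120
# P(Yes | Rep) = 0.523
# P(not Rep | No) = 0.624
# P(Yes | Dem) = 0.297
```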
34 Conditional Distribution
Restrict subjects to only those that meet a condition. Within this restricted group, what is the distribution of some other variable?
Distribution of opinion given those that claim to be Republican:
P( Yes | Rep. ) = .523, P( No | Rep. ) = .477
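A conditional distribution amounts to dividing one row of the two-way table by its own total. A minimal sketch, assuming the Republican row holds 90 Yes and 82 No responses; the function name `conditional_distribution` is illustrative.

```python
def conditional_distribution(counts):
    """Turn the counts for one restricted group into proportions that sum to 1."""
    group_total = sum(counts.values())
    return {category: n / group_total for category, n in counts.items()}

# Restrict to respondents who claim to be Republican (172 people).
republican_counts = {"Yes": 90, "No": 82}
dist = conditional_distribution(republican_counts)

print({opinion: round(p, 3) for opinion, p in dist.items()})
# {'Yes': 0.523, 'No': 0.477}
```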
35 Is there a relationship between the party and the opinion on the amendment?
- What would you expect to happen if no relationship existed?
36 Three Conditional Distributions
Marginal Distribution: P( Yes ) = .455, P( No ) = .545
Is there a relationship? Why or why not?
37 If there is NO relationship (i.e., independence) between the party and the opinion, then the three conditional probabilities should be close to each other and close to the marginal probability.
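One way to make this comparison concrete is to compute P(Yes | party) for each party and subtract the marginal P(Yes). The sketch below assumes the survey counts Republican 90/82, Democrat 44/104, Other 48/32 (Yes/No); the function name `yes_rate_gaps` is hypothetical, and no numerical threshold for "close" is implied.

```python
table = {
    "Republican": {"Yes": 90, "No": 82},
    "Democrat":   {"Yes": 44, "No": 104},
    "Other":      {"Yes": 48, "No": 32},
}

def yes_rate_gaps(table):
    """Return (marginal P(Yes), {party: P(Yes | party) - P(Yes)})."""
    n = sum(row["Yes"] + row["No"] for row in table.values())
    p_yes = sum(row["Yes"] for row in table.values()) / n
    gaps = {}
    for party, row in table.items():
        p_yes_given_party = row["Yes"] / (row["Yes"] + row["No"])
        gaps[party] = p_yes_given_party - p_yes
    return p_yes, gaps

p_yes, gaps = yes_rate_gaps(table)
print(f"P(Yes) = {p_yes:.3f}")  # 0.455
for party, gap in gaps.items():
    print(f"P(Yes | {party}) differs from the marginal by {gap:+.3f}")
# Republican: +0.068, Democrat: -0.158, Other: +0.145
```

If party and opinion were unrelated, all three differences would be near zero.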
38 Three Conditional Probabilities
P( Yes | Rep. ) = .523
P( Yes | Demo. ) = .297
P( Yes | Other ) = .600
Are these close to each other?
Marginal Probability: P( Yes ) = .455
AND close to the marginal?
Not close; therefore, party and the opinion are ____________.
39 Create with Pivot Tables in Excel.
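40 Barchart - Clustered
[Figure: clustered bar chart of Frequency; labels: Yes, Rep., Demo., Other]
41 Barchart - Stacked
[Figure: stacked bar chart of Frequency; labels: Yes, Rep., Demo., Other]
42 Barchart - Percents
[Figure: bar chart of Percent; labels: Yes, Rep., Demo., Other]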
43 Summary: For two categorical variables
- Must use conditional probabilities to determine if a relationship exists.
- Cannot use correlation.
- Visual display: stacked percentage bar charts
44 Associations between TWO Variables
Variables           Numerical                                               Graphical
Quant. vs. Quant.   LS regression line, r, r-sq, std error                  Scatterplot, residual plots
Quant. vs. Cat.     X-bar and s for each category                           Side-by-side box plots
Cat. vs. Cat.       Two-way table, conditional and marginal distributions   Bar chart: stacked, percent
45 The End