Title: Chapter 10 Categorical Data
2Chapter 10 Categorical Data
- Inference on one proportion π
- Sample size n, number of successes y
- Sample proportion: π̂ = y/n
- Adjusted estimator: π̃ = (y + 2)/(n + 4)
The second equation is called the Wilson estimator (1927).
3Confidence Intervals
Example 10.1: Five-year survival rate for cancer patients, n = 870, y = 330.
Or, without the 4 and 2, i.e.,
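The two intervals for Example 10.1 can be checked numerically. This is a minimal sketch, assuming the usual normal-approximation form (estimate ± z·sqrt(estimate·(1 − estimate)/n)) and assuming the adjusted estimator is (y + 2)/(n + 4), as the reference to "the 4 and 2" suggests. The function name proportion_ci is illustrative and does not come from the slides.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(y, n, conf=0.95, adjusted=False):
    """Normal-approximation confidence interval for one proportion.

    adjusted=True uses the (y + 2) / (n + 4) estimator (an assumption
    about the missing slide formula); adjusted=False uses y / n.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    if adjusted:
        y, n = y + 2, n + 4
    p = y / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Example 10.1: n = 870 patients, y = 330 five-year survivors
print(proportion_ci(330, 870, adjusted=True))  # with the 4 and 2
print(proportion_ci(330, 870))                 # without the 4 and 2
```

With n this large the two intervals nearly coincide, both running from roughly 0.35 to 0.41.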
4Confidence interval for the extremes
- When y = 0, the 100(1-α)% C.I. for π is (0, 1 - (α/2)^(1/n)) (p.457)
- Example: a sample of 1000 parts with no defective parts gives a 95% CI for the defect rate of (0, 1 - (0.025)^0.001) = (0, 0.0037)
- When y = n, the 100(1-α)% C.I. for π is ((α/2)^(1/n), 1) (p.457)
- Sample size for a given error margin E with 100(1-α)% confidence: n = z_{α/2}^2 π(1-π)/E^2
- The most conservative size is obtained by letting π = 0.5 (p.458)
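The extreme-case limits are easy to verify directly. The sketch below assumes the two cases are y = 0 and y = n; the helper names are hypothetical.

```python
def zero_success_upper(n, alpha=0.05):
    """Upper limit of the 100(1 - alpha)% CI for the proportion when y = 0."""
    return 1 - (alpha / 2) ** (1 / n)


def all_success_lower(n, alpha=0.05):
    """Lower limit of the 100(1 - alpha)% CI for the proportion when y = n."""
    return (alpha / 2) ** (1 / n)


# 1000 sampled parts, none defective: 95% upper limit for the defect rate
print(round(zero_success_upper(1000), 4))  # 0.0037
```

The two limits are mirror images: the lower limit when every trial succeeds equals one minus the upper limit when none does.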
5Example
- In most popularity polls, the accuracy is usually set at ±3%, which means that a 95% CI is π̂ ± 0.03. What is the required sample size?
The key is that a random sample can be collected.
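The required sample size for the ±3% poll can be worked out with the standard formula n = z² π(1 − π)/E², taking the conservative choice π = 0.5. This is a sketch under that assumption; the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_margin(E, conf=0.95, pi=0.5):
    """Smallest n whose normal-approximation margin of error is at most E.

    pi = 0.5 is the most conservative choice because it maximizes pi * (1 - pi).
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * pi * (1 - pi) / E ** 2)


print(sample_size_for_margin(0.03))  # 1068
```

Halving the margin to ±1.5% would roughly quadruple the sample size, since n grows with 1/E².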
6Successful stories of polls
1992 US Presidential election predictions
Source: a newspaper, a few days before the election.
7More on polls
Source: USA Today, Nov. 5 (morning of election day)
In both 2000 and 2004, the candidates (Bush vs Gore, Bush vs Kerry) were too close to call (within ±3%). The actual results showed the same.
It is difficult to reduce the ±3% margin by sample size alone. From mathematics to practice: obtaining a random sample, voters changing their minds, voters not revealing their minds.
8The next two elections, 2000 (Bush vs Gore) and 2004 (Bush vs Kerry), were too close to call before the election. The final results confirmed this. Now the 2008 election.
- This map was drawn by the New York Times 1 to 3 days before the election. All the state projections were correct. Toss-up states were extremely close.
- It also predicted that Obama would get 52% ± 2% and McCain 41% ± 2%, with 7% undecided.
- The actual result was Obama 52.5% and McCain 46%.
- The total number of votes was 124,471,000.
9Hypothesis Testing with One Proportion (p.458)
This is a large-sample result. We need nπ0(1-π0) ≥ 5.
10Example 10.4
- Car failing rate at the inspection station: 30%. Is the failing rate for SUVs higher?
- n = 150, y = 60 fail.
Conclusion: There is strong evidence (p = 0.0035) that the failing rate for SUVs was indeed higher.
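The test in Example 10.4 can be reproduced with a short calculation. This sketch assumes the usual large-sample statistic z = (y/n − π0)/sqrt(π0(1 − π0)/n) and an upper-tailed alternative; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(y, n, pi0):
    """Upper-tailed large-sample test of H0: pi <= pi0 against Ha: pi > pi0."""
    p_hat = y / n
    # The standard error uses the null value pi0, not the sample proportion.
    z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Example 10.4: 60 of 150 SUVs fail, compared with an overall rate of 30%
z, p = one_proportion_z_test(60, 150, 0.30)
print(f"z = {z:.2f}, one-sided p-value = {p:.4f}")
```

The statistic comes out near 2.67 with a one-sided p-value a little under 0.004, in line with the reported conclusion; small differences from the quoted p-value can arise from rounding z before looking it up in a table.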
11Sample size in hypothesis testing: The second part of the key question
- The natural cure rate of a disease is 50%, and a drug has been invented. We want to conduct a clinical trial to determine whether this drug is effective. How many patients should I recruit for this clinical trial?
- You tell your boss:
- There is no 100% correct statistical decision.
- If the drug is only marginally effective, say with a cure rate of 0.5001, it would be very difficult to detect.
- Any reasonable person will agree.
12Key Concept in Statistical Decision: Natural cure rate 50%
Where does the no jump to yes?
13There is no 100% correct statistical decision
Risk: the risk of making a wrong decision. Accidental death rate: 10^-6 per day in the USA.
How many patients should we recruit in the
beginning?
14What you need to ask your boss
- What risk can you take on a wrong claim (claiming an ineffective drug is effective)?
- What do you consider a good drug that needs to be detected with high probability?
- Let the first answer be α = 0.05.
- Let the second answer be: if the cure rate becomes larger than 0.6 (p1), I want at least 0.9 (1-β) probability of detecting it.
15The solution (1)
16The solution (2)
17The solution (3)
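The content of the solution slides did not survive, so the following is only a sketch of one standard approach, not necessarily the derivation used there. It assumes a one-sided large-sample test of a null cure rate of 0.5 against an alternative of 0.6 and uses the textbook formula n = [(z_α·sqrt(π0(1 − π0)) + z_β·sqrt(π1(1 − π1)))/(π1 − π0)]². The function and parameter names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_one_sided(pi0, pi1, alpha=0.05, power=0.90):
    """Patients needed to detect cure rate pi1 > pi0 with the given power.

    Assumes a one-sided normal-approximation test at level alpha.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # controls the false-claim risk
    z_beta = NormalDist().inv_cdf(power)       # controls the detection probability
    numerator = z_alpha * sqrt(pi0 * (1 - pi0)) + z_beta * sqrt(pi1 * (1 - pi1))
    return ceil((numerator / (pi1 - pi0)) ** 2)


print(sample_size_one_sided(0.5, 0.6))     # good drug: cure rate 0.6
print(sample_size_one_sided(0.5, 0.5001))  # marginal drug: cure rate 0.5001
```

Under these assumptions the first call gives about 211 patients, while the second runs into the hundreds of millions, which illustrates why a marginally effective drug is so hard to detect.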
18Two proportion inference (p.465)
19Hypothesis Testing with Two Proportions (p.466)
This is a large-sample result. We need n_i π_i (1-π_i) ≥ 5 for both i = 1, 2.
20Example 10.6
- Two English teaching methods (computer, traditional) are measured by the success rate in passing exams. Find whether there is a difference using hypothesis testing and a confidence interval. Use α = 0.05.
- Data: n1 = 125, passing 94 (computer); n2 = 175, passing 113 (traditional).
Conclusion: We can say that computer teaching is better (p < 0.05), and a 95% CI for the difference in passing rates is 0.106 ± 0.104 = (0.002, 0.210).
Note: The book used a one-sided test. Here a two-sided test is used.
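The numbers in Example 10.6 can be reproduced as follows. This is a sketch assuming the common convention of a pooled standard error for the test and an unpooled standard error for the confidence interval; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(y1, n1, y2, n2, conf=0.95):
    """Two-sided large-sample z test and CI for the difference pi1 - pi2."""
    p1, p2 = y1 / n1, y2 / n2
    diff = p1 - p2

    # Test: pooled estimate of the common proportion under H0: pi1 = pi2
    pooled = (y1 + y2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error
    se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    half_width = NormalDist().inv_cdf(1 - (1 - conf) / 2) * se_unpooled
    return z, p_value, diff, half_width


# Example 10.6: 94 of 125 pass (computer), 113 of 175 pass (traditional)
z, p, diff, hw = two_proportion_test(94, 125, 113, 175)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
print(f"difference = {diff:.3f} +/- {hw:.3f}")
```

The two-sided p-value falls just below 0.05, so the conclusion is borderline; the interval's lower end sitting barely above zero tells the same story.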