Title: STA 3024 Introduction to Statistics 2
1. STA 3024 Introduction to Statistics 2
2. What Should Have Been Covered in STA2023
According to the Course Description
- Basic Probability
- Random Variables
- Sampling Distributions
- Confidence Intervals for a single population (mean and proportion)
- Hypothesis testing for a single population (mean and proportion)
- Comparison of 2 populations
- Simple Linear Regression (usually not covered)
These are in Chapters 1-10 of Agresti and Franklin's book.
3. A key question of statistics
- Key questions: easy to understand, not easy to answer without the knowledge of a special field.
- Example 1. A key question of mathematics: the three angles of a triangle.
- Example 2. A key question of calculus: the volume of a ball is 4πr^3/3.
4. A Key Question in Statistics
Natural cure rate of a disease: 50%. A drug is invented and is to be tested.
Suppose we have the following responses.
How about 2 patients with 2 cures?
How about 100 patients with 100 cures?
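To put numbers on these two scenarios, here is a minimal sketch. It assumes each patient recovers independently with the natural 50% cure rate when the drug does nothing; the function name is illustrative and not from the slides.

```python
def prob_all_cured(n_patients, natural_rate=0.5):
    """Chance that all n patients recover if the drug has no effect.

    Assumes independent patients, each cured with probability natural_rate.
    """
    return natural_rate ** n_patients


for n in (2, 100):
    print(f"{n} patients, {n} cures: probability by luck alone = {prob_all_cured(n):.3g}")
```

Two cures out of two happen by luck one time in four, so they say little about the drug. One hundred cures out of one hundred are practically impossible unless the drug really works.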
5. Key Concept in Statistics
Natural cure rate of a disease: 50%. A drug is invented and is to be tested.
Suppose we have the following responses.
Where does the "no" jump to "yes"?
6. There is no 100% correct statistical decision
Risk: the risk of making a wrong decision.
Accidental death rate: 10^-6/day in the USA.
How many patients should we recruit in the beginning?
7. Another key question
- Suppose I wish to know the percentage of voters who support public health insurance.
- I have no hypothesis to test, but I am interested in estimating this proportion.
- Suppose I asked 100 randomly selected persons and 65 answered yes. Or suppose I asked 1000 persons and 650 answered yes. In both cases my estimate would be 65%, but I know their accuracies are different. By how much? Do we need a larger sample to make the estimate even more accurate?
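One way to compare the two accuracies is the standard error of a sample proportion, sqrt(p(1 - p)/n). This is a sketch under the usual large-sample normal approximation, with the sample proportion plugged in for p. The helper name is an assumption for illustration.

```python
import math


def standard_error(yes, n):
    """Approximate standard error of the sample proportion yes/n."""
    p_hat = yes / n
    return math.sqrt(p_hat * (1 - p_hat) / n)


for yes, n in ((65, 100), (650, 1000)):
    se = standard_error(yes, n)
    # 1.96 is the usual normal multiplier for roughly 95% coverage.
    print(f"n={n}: estimate {yes / n:.2f}, SE {se:.4f}, approx. 95% margin +/-{1.96 * se:.3f}")
```

Both samples give the same estimate, but ten times the sample size shrinks the margin only by a factor of sqrt(10), about 3.2: from roughly 9 percentage points to roughly 3.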
8. Beyond cure rates
- Survival time improved by a drug.
- Patient differences in age, gender, tumor size and/or genetic markers.
- Cure rate in medicine corresponds to affected rate in plants, accident death rate in car insurance, and response rate to a stimulus.
- Survival time in medicine corresponds to fruit weight in plants, accident payment in car insurance, and response time to a stimulus.
9. Applications of Statistics
Key: when there is uncertainty in the response; when the decision cannot be 100% correct.
- Effectiveness of new drugs or treatments
- DNA evidence in court
- Estimating the bowhead whale population
- Corn yield by different fertilizers
- Quality control of light bulbs
- Public opinion by polls
10. Success stories of polls
1992 US Presidential election predictions
Source: a newspaper, a few days before the election.
11. More on polls
Source: USA Today, Nov. 5 (Election Day morning)
In both 2000 and 2004, the races (Bush vs. Gore, Bush vs. Kerry) were too close to call (within ±3%). The actual results showed the same.
It is difficult to reduce ±3% by sample size alone. From mathematics to practice: random sampling, people changing their minds, people not revealing their minds.
12. The next two elections, 2000 (Bush vs. Gore) and 2004 (Bush vs. Kerry), were too close to call before the election. The final results confirmed this fact. Now the 2008 election.
- This map was drawn by the New York Times 1 to 3 days before the election. All the state projections were correct. Toss-up states were extremely close.
- It also predicted that Obama would get 52±2% and McCain 41±2%, with 7% undecided.
- The actual result was Obama 52.5% and McCain 46%.
- The total number of votes was 124,471,000.
13. Danger of treatment based on screening (I)
- Source: New England Journal of Medicine, Sep. 12, 2002, pp. 781-789.
- Randomized clinical trial in early prostate cancer: radical prostatectomy group (n = 347), watchful waiting group (n = 348). (Duration 1989-1999, median follow-up time 6.2 years.)
- It is obvious that there were fewer deaths due to prostate cancer in the surgical group, because the prostate had been removed. To claim effectiveness based on 62 vs. 53 is unreasonable.
- Neither expense nor change in quality of life is reflected in this table.
14. Danger of treatment based on screening (II)
- Source: The Lancet, 2000, 355: 129-43. The Lancet, 2001, 358: 1340-42.
- Randomized clinical trials in mammography for breast cancer.
- Malmö (Sweden) study (1988-97; screened 21,088, control 21,195)
- Canada study (1981-97; screened 44,925, control 44,910)
15. Solution to the key question
What do you need to know beforehand?
- What risk you can take of making a wrong claim (claiming an ineffective drug is effective).
- What you consider a good drug, one that needs to be detected with high probability.
- Let the first answer be α = 0.05.
- Let the second answer be: if the cure rate becomes larger than 0.6 (p1), I want at least 0.9 (1 - β) probability of detecting it.
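With these two answers fixed, a required sample size can be sketched. The block below uses the standard normal-approximation formula for a one-sided test of a single proportion, n = ((z_α sqrt(p0(1 - p0)) + z_β sqrt(p1(1 - p1))) / (p1 - p0))^2, with p0 = 0.5 as the natural cure rate. The one-sided setup and the function name are assumptions here; the derivation on the later slides is not in this transcript and may differ in detail.

```python
import math
from statistics import NormalDist


def sample_size_one_sided(p0, p1, alpha, power):
    """Approximate n for a one-sided test of H0: p = p0 against p = p1 > p0.

    Normal approximation; alpha is the type I error rate and
    power = 1 - beta is the chance of detecting a true cure rate of p1.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(p1 * (1 - p1))
    return math.ceil((numerator / (p1 - p0)) ** 2)


n = sample_size_one_sided(p0=0.5, p1=0.6, alpha=0.05, power=0.9)
print(f"Recruit about {n} patients")
```

Under these assumptions the answer is a little over 200 patients. Demanding a smaller α, a higher power, or detection of a smaller improvement than 0.6 would each push the number up.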
16. Two Key Distributions and Their Properties
17. A Derivation Used All the Time
18. (No Transcript)
19. (No Transcript)
20. More on the Normal Distribution
21. (No Transcript)
22. The Binomial Distribution
23. Elements in Hypothesis Testing (pp. 413-415)
24. From page 421
25. From page 432
26. Examples in the Book
- The therapeutic guess example (pp. 409, 422)
- The dog smell cancer example, p.417
- The American working hours/week example, p. 428
- The anorexic example p. 433
27. One-Sided or Two-Sided Test?
- In a 2004 survey, 868 working women were asked; the sample mean was 39.11 hours/week and the sample standard deviation was 14.6. Can we conclude that women workers, on average, worked less than 40 hours/week? (pp. 429-431 of the book)
28. One-Sided or Two-Sided Test?
- The key is that hypotheses should be formed before you look at the data.
- The correct sequence is:
- You have a hypothesis you wish to confirm or discard.
- You collect data to make a decision.
- If it is a statistical decision, you report your conclusion with a p-value.
- Let us use the female working hours example (p. 428). If, before we did the survey, we had been interested in whether American female workers work less than 40 hours/week, then it is a one-sided test. But if, before the survey, we wished to know whether female workers worked more or less than 40 hours, it is a two-sided test.
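The difference between the two choices can be sketched with the survey figures quoted on slide 27 (n = 868, sample mean 39.11, sample standard deviation 14.6, null value 40). The block uses a normal approximation to the test statistic, which is an assumption made to keep the example dependency-free; the book works with the t distribution, which is nearly identical at this sample size.

```python
import math
from statistics import NormalDist

n, sample_mean, sample_sd, mu0 = 868, 39.11, 14.6, 40.0

# Test statistic: how many standard errors the sample mean lies from 40.
z = (sample_mean - mu0) / (sample_sd / math.sqrt(n))

# One-sided alternative (mean < 40): probability in the lower tail only.
p_one_sided = NormalDist().cdf(z)

# Two-sided alternative (mean != 40): probability in both tails.
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.2f}, one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
```

The two-sided p-value is exactly twice the one-sided one. With these data the one-sided value falls below 0.05 while the two-sided value does not, which is why the side has to be chosen before the data are seen.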
29. The Conclusion of a Two-Sided Test Is Usually One-Sided
However, you cannot use a one-sided test to make this one-sided conclusion, because the hypotheses should be formed before you look at the data. (See next page.)
30. Forming Hypotheses after Seeing the Data Can Seriously Bias the Conclusion
Actually, the risk is much higher if the hypotheses were formed after you saw the two 1s. (See next page.)
31. Since we usually do not know the side before seeing the data, most tests are two-sided, although the final conclusion can be one-sided. (This is also the book's view. See p. 431, conclusion on women's working hours.)
32. When There Is No Hypothesis to Test, Just Fact Finding
- With 65 yes answers in a sample of 100, we feel the real percentage is 65%.
- With 650 yes answers in a sample of 1000, we feel the real percentage is 65%.
- Which one is more accurate?
- Idea: for a single sample we do not know, but on the whole the larger sample gives a more accurate estimate.
- How do we quantify this concept?
- Let Y be the number of yes answers in a sample of size n, and let p be the true proportion of yes answers in the population.
33. Sacrifice for Environment Protection Example (pp. 362-365)
- In the 2000 General Social Survey, out of n = 1154 respondents, Y = 518 said yes to the question "Are you willing to pay higher gasoline prices to protect the environment?"
- The sample proportion is 518/1154 = 0.45. How accurate is this estimate?
- We are not 100% sure that the real proportion is 45%, but we may be able to say that we have high confidence that the real proportion is between, say, 0.44 and 0.46.
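For this example the interval can be computed directly. The sketch below uses the large-sample interval for a proportion, p̂ ± z sqrt(p̂(1 - p̂)/n), at the 95% level. The confidence level and the function name are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def proportion_ci(yes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = yes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_ci(518, 1154)
print(f"95% C.I. for the proportion: ({low:.3f}, {high:.3f})")
```

With n = 1154 the 95% interval runs from about 0.42 to 0.48, somewhat wider than the "0.44 to 0.46" used above purely as an illustration of the idea.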
34. Interval Estimate (Confidence Interval, C.I.)
35. Interval Estimate for the Mean
36. eBay Example (pp. 378-379)
- Sale of the Palm M515 PDA (handheld computer). Seven buyers chose the buy-it-now option (vs. waiting for other bidders).
- The data (n = 7): 235, 225, 225, 240, 250, 250, 210.
- Find a 95% C.I. for the mean.
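A sketch of the calculation, assuming the usual small-sample interval x̄ ± t · s/sqrt(n) with n - 1 = 6 degrees of freedom. The critical value 2.447 is taken from a standard t table and hard-coded so that the example needs only the Python standard library.

```python
import math
import statistics

prices = [235, 225, 225, 240, 250, 250, 210]
n = len(prices)

mean = statistics.mean(prices)
sd = statistics.stdev(prices)  # sample standard deviation (divides by n - 1)

# Assumption: t-table value for 95% confidence with 6 degrees of freedom.
t_crit = 2.447

margin = t_crit * sd / math.sqrt(n)
low, high = mean - margin, mean + margin
print(f"mean = {mean:.2f}, s = {sd:.2f}, 95% C.I. = ({low:.1f}, {high:.1f})")
```

With only seven observations the interval is wide: roughly 220 to 247 around a sample mean of about 233.6.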
37. Logic Behind the Intervals (p. 359)
38. Logic Behind the Intervals (p. 359)
39. Confidence Interval for the Mean (p. 378)
40. Any Advantage of Knowing the Derivation?
41. Sample Size Determination in Interval Estimation
42. Importance of Sample Size Determination
- In most practical situations, an investigator needs to collect the data. It is unlikely that the data already exist somewhere for you to fetch. This is especially true for new ideas that require new experiments.
- Before you collect the data, you need to determine the sample size.
- How to determine the sample size in estimation? (v)
- How to determine the sample size in hypothesis testing?
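For the estimation case, a common rule can be sketched as follows: to estimate a proportion within a margin m at a given confidence level, take n ≥ z² p(1 - p)/m², using p = 0.5 as the worst case when nothing is known in advance. The margins tried below and the function name are illustrative assumptions, not values from the slides.

```python
import math
from statistics import NormalDist


def sample_size_for_margin(margin, p_guess=0.5, confidence=0.95):
    """Smallest n so that a proportion estimate is within +/-margin.

    Uses the normal approximation; p_guess = 0.5 gives the largest
    (most conservative) sample size.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


for m in (0.03, 0.01):
    print(f"margin +/-{m:.2f}: n = {sample_size_for_margin(m)}")
```

A margin of ±3 percentage points needs roughly a thousand respondents, while ±1 point needs about nine times as many, because the required n grows with 1/m².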
43. Importance of Type II Error (9.6, pp. 453-457)
44. Importance of Type II Error (continued)
45. (No Transcript)
46. The solution (2)
47. The solution (3)