Oct. 17 Statistic for the Day: In 1996, the percentages of 16-24 yr old high school finishers enrolled in college were 49% for lower income families, 63% for middle income families, and 78% for higher income families

Slides: 52
Provided by: tomhettma
Transcript and Presenter's Notes

1
Oct. 17 Statistic for the Day: In 1996, the
percentages of 16-24 yr old high school finishers
enrolled in college were 49% for lower income
families, 63% for middle income families, and 78% for
higher income families.
Assignment: Review for Exam 2, Wednesday, Oct.
19. Chapters 10, 11, 12, 13, 16
2
Arby's sandwiches
No.  Sandwich                           Weight (g)  Calories
 1   Big Montana                           309        590
 2   Giant Roast Beef                      224        450
 3   Regular Roast Beef                    154        320
 4   Beef 'n Cheddar                       195        440
 5   Super Roast Beef                      230        440
 6   Junior Roast Beef                     125        270
 7   Chicken Breast Fillet                 233        500
 8   Chicken Bacon 'n Swiss                209        550
 9   Roast Chicken Club                    228        470
10   Market Fresh Turkey Ranch Bacon       379        830
11   Market Fresh Ultimate BLT             293        780
12   Market Fresh Roast Beef Swiss         357        780
13   Market Fresh Roast Ham Swiss          357        700
14   Market Fresh Roast Turkey Swiss       357        720
15   Market Fresh Chicken Salad            322        770
3
This type of plot, with two measurements per
subject, is called a scatterplot (see p. 166).
4
The correlation measures the strength of the
linear relationship between weight and
calories. More on this in the next class.
5
The best-fitting line through the data is called
the regression line. How should we describe this
line?
6
The intercept is 18 in this case and the slope is
2.1. In this class, you don't need to know how to
calculate the slope and intercept (but see p. 195
if you like formulas).
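For readers who do like formulas, here is a minimal sketch that fits a least-squares line to the 15 sandwiches from slide 2. It assumes the usual textbook formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x); the function name `fit_line` is illustrative and not from the course materials.

```python
# Sandwich data from slide 2: weights in grams and calories, in the same order.
weights = [309, 224, 154, 195, 230, 125, 233, 209, 228, 379, 293, 357, 357, 357, 322]
calories = [590, 450, 320, 440, 440, 270, 500, 550, 470, 830, 780, 780, 700, 720, 770]

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy: how x and y vary together; Sxx: how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(weights, calories)
print(f"intercept = {intercept:.1f}, slope = {slope:.2f}")
```

Rounded, the result should agree with the slide's intercept of about 18 and slope of about 2.1.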
7
calories = 18 + (2.1)(weight in grams)
(intercept = 18, slope = 2.1)
-------------------------------------------------
For example, if you have a 200g sandwich, on
average you expect to get about
18 + (2.1)(200) = 18 + 420 = 438 calories.
-------------------------------------------------
For a 350g sandwich: 18 + (2.1)(350) = 18 + 735 =
753 calories.
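The two predictions above can be reproduced with a short sketch. The function name `predicted_calories` is made up for illustration; the default intercept and slope are the rounded values from the slide.

```python
def predicted_calories(weight_g, intercept=18.0, slope=2.1):
    """Predicted calories for a sandwich of the given weight, using the slide's line."""
    return intercept + slope * weight_g

for w in (200, 350):
    print(w, "g ->", round(predicted_calories(w)), "calories")
```

Both weights fall inside the observed range of the data (125 g to 379 g), so neither prediction is an extrapolation.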
8
calories = 18 + (2.1)(weight in grams)
(intercept = 18, slope = 2.1)
For every extra gram of weight, you expect an
increase of 2.1 calories in your Arby's sandwich.
Interpretation of slope: Expected increase in
response for every unit increase (increase of
one) in explanatory variable.
9
Facts about Correlation
  • +1 means perfect increasing linear relationship
  • -1 means perfect decreasing linear relationship
  • 0 means no linear relationship
  • + means increasing together
  • - means one increases and the other decreases
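These facts can be checked on small made-up data sets. The sketch below assumes the standard Pearson correlation formula; the function name `correlation` and the example values are illustrative only.

```python
import math

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(correlation(x, [2 * v + 1 for v in x]))   # points on an increasing line
print(correlation(x, [10 - 3 * v for v in x]))  # points on a decreasing line
print(correlation(x, [4, 1, 0, 1, 4]))          # symmetric curve, no linear trend
```

The third data set has an obvious pattern (a U shape) yet a correlation of zero, which is why the list says "no linear relationship" rather than "no relationship".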

10
Strength vs. statistical significance
  • Even a weak relationship can be statistically
    significant (if it is based on a large sample)
  • Even a strong relationship can be statistically
    insignificant (if it is based on a small sample)

11
Regression potential pitfalls
Sometimes we see a strong relationship in absurd
examples: two seemingly unrelated variables have
a high correlation. This signals the presence of
a third variable that is highly correlated with
the other two (confounding). Remember that
correlation does not imply causation. Also: If
you use a regression for prediction, do not
extrapolate too far beyond the range of the
observed data.
12
Vocabulary vs Shoe Size
13
Outliers
Outliers are data that are not compatible with
the bulk of the data. They show up in graphical
displays as detached or stray points. Sometimes
they indicate errors in data input. Some
experts estimate that roughly 5% of all data
entered is in error. Sometimes they are the most
important data points.
14
Put Options (NYTimes, September 26, 2001)
Put options on stocks give buyers the right to
sell stock at a specified price during a certain
time. They rise in value if the underlying stock
falls below the strike price.
The value of puts on airline stocks soared on
Sept. 17 when U.S. stock and options markets
reopened after a four-day closure, as airline
stocks slid as much as 40 percent.
American Airlines was at $32 prior to the attack.
Suppose a terrorist buys a put option (at say $5
per share) to have the right to sell at $25. The
price after the attack was at $16. That put
option is now more valuable.
15
R wins machine (D minus R negative for machine).
D wins absentee (D minus R positive for absentee).
From story on p. 442
16
Outliers affect regression lines and correlation
(these data aren't real)
Red line: Without A, with B
Black line: With A and B
Green line: Without A or B
17
Two categorical variables
Explanatory variable: Sex
Response variable: Body Pierced or Not
Survey question: Have you pierced any other part
of your body? (Except for ears)
Research Question: Is there a significant difference
between women and men at PSU in terms of body
pierces?
18
Data
Response: Body Pierced?
Explanatory: Sex
From STAT 100, fall 2005 (missing responses
omitted)
19
Percentages
Response: body pierced?
          no      yes     All
female    62.32   37.68   100.00
male      93.90    6.10   100.00
All       74.09   25.91   100.00
62.32% = 86 / 138; 93.90% = 77 / 82
Research question: Is there a significant
difference between women and men?
(i.e., between 62.32% and 93.90%)
20
The Debate
The research advocate claims that there is a
significant difference. The skeptic claims
there is no real difference: the differences in
the data simply happen by chance, since we've
selected a random sample.
21
The strategy for determining statistical
significance
  • First, figure out what you expect to see if there
    is no difference between females and males
  • Second, figure out how far the data is from what
    is expected.
  • Third, decide if the distance in the second step
    is large.
  • Fourth, if large then claim there is a
    statistically significant difference.

22
Exercise: Follow the 4 steps and answer
the Research Question: Is there a statistically
significant difference between males and females
in terms of the percent who have used marijuana?
Data from STAT 100 fall 2005
Rows: Sex   Columns: Marijuana
          No    Yes   All
Female    56    76    132
Male      31    46     77
All       87   122    209
23
Step 1: Find expected counts if the skeptic is
correct
This step is based on the marginal totals.
Cell A (Female, No): expected count =
(132)(87) / 209 = 54.95
(Repeat for B, C, D)
24
Step 1, cont'd
Repeat the process for B (and then C and D).
Or you can simply subtract: 132 - 54.95 = 77.05
(cell B)
25
Step 1, cont'd
Observed counts, with expected counts if the
skeptic is correct shown beneath them.
Marijuana?           No      Yes     All
Female   observed    56      76      132
         expected    54.95   77.05   132.00
Male     observed    31      46      77
         expected    32.05   44.95   77.00
Total                87      122     209
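The expected counts come from the marginal totals: each one is (row total)(column total) / (grand total). A minimal sketch of that calculation follows; the function name `expected_counts` is illustrative.

```python
def expected_counts(observed):
    """Expected counts if the skeptic is correct: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Rows: Female, Male. Columns: No, Yes.
observed = [[56, 76], [31, 46]]
for row in expected_counts(observed):
    print([round(value, 2) for value in row])
```

Run on the marijuana data, this should reproduce the expected counts on the slide: 54.95 and 77.05 for females, 32.05 and 44.95 for males.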
26
Step 2: How far are the data (observed counts)
from what is expected?
Compare the observed counts with the expected
counts if the skeptic is correct.
Chi-Sq = 0.020 + 0.014 + 0.034 + 0.025 = 0.093
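Each term in that sum is (observed - expected)^2 / expected for one cell of the table. The sketch below assumes that standard definition; the function name `chi_squared` is illustrative.

```python
def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            total += (obs - expected) ** 2 / expected
    return total

statistic = chi_squared([[56, 76], [31, 46]])
print(round(statistic, 3))
```

Computed without rounding the expected counts, the statistic comes out near 0.094 rather than 0.093. The small gap is presumably due to rounding in the slide's intermediate values, and it does not affect the conclusion.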
27
Step 3: Is the distance in step 2 large?
Something is large when it is in the outer 5%
tail of the appropriate distribution.
Chi-squared distribution with 1 degree of freedom
If the chi-squared statistic is larger than 3.84, it
is declared large and the research advocate wins.
Our chi-squared value: 0.093 (from Step 2)
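To see where the 3.84 cutoff sits, the tail area can be computed directly. This sketch relies on a fact not stated on the slide: with 1 degree of freedom, a chi-squared variable is the square of a standard normal variable, so the upper-tail probability can be written with the complementary error function. The function name `p_value_1df` is illustrative.

```python
import math

def p_value_1df(chi_sq):
    """Upper-tail probability for a chi-squared statistic with 1 degree of freedom."""
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for a standard normal Z.
    return math.erfc(math.sqrt(chi_sq / 2))

print(round(p_value_1df(3.84), 3))   # the cutoff used on the slide
print(round(p_value_1df(0.093), 2))  # the marijuana example
```

The cutoff 3.84 leaves about 5% in the tail, while a statistic of 0.093 leaves roughly three quarters of the distribution beyond it, so it is nowhere near large.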
28
Step 4: If distance is large, claim
statistically significant difference.
Rows: Sex   Columns: marijuana
                   No      Yes     All
Female   count     56      76      132
         row %     42.4    57.6    100.0
Male     count     31      46      77
         row %     40.3    59.7    100.0
Hence, the difference (57.6% of women versus
59.7% of men) is not statistically significant in
this case. (Sample size has been automatically
considered!)
29
How many degrees of freedom here?
Degrees of freedom (df) always equal (Number of
rows - 1) x (Number of columns - 1)
30
Health studies and risk
Research question: Do strong electromagnetic
fields cause cancer? 50 dogs randomly split into
two groups: no field, yes field. The response is
whether they get lymphoma.
Rows: mag field   Columns: cancer
        no    yes   All
no      20     5    25
yes     10    15    25
All     30    20    50
31
Terminology and jargon
In the mag field group, 15/25 of the dogs got
cancer. Therefore, the following are all
equivalent:
  • 60% of the dogs in this group got cancer.
  • The proportion of dogs in this group that got
    cancer is 0.6.
  • The probability that a dog in this group got
    cancer is 0.6.
  • The risk of cancer in this group is 0.6.

And one more: The odds of cancer in this group
are 3/2.
32
More terminology and jargon
  • Identify the bad response category: In this
    example, cancer
  • Treatment risk: 15 / 25 or .60 or 60%
  • Baseline risk: 5 / 25 or .20 or 20%
  • Relative risk: Treatment risk over Baseline risk:
    .60 / .20 = 3. That is, the treatment risk is
    three times as large as the baseline risk.
  • Increased risk: By how much does the risk
    increase for treatment as compared to control?
    (.60 - .20) / .20 = 2 or 200%. That is, the
    risk is 200% higher in the treatment group.
  • Odds ratio: Ratio of treatment odds to baseline
    odds. (15/10) / (5/20)
    turns out to be 6. That is, the treatment odds
    are six times as large as the baseline odds.
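The risk measures in this list can be worked out from the four counts in the dog study. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the dog study: cancer / no cancer in each group.
treatment_yes, treatment_no = 15, 10   # magnetic field group
baseline_yes, baseline_no = 5, 20      # no-field group

treatment_risk = Fraction(treatment_yes, treatment_yes + treatment_no)
baseline_risk = Fraction(baseline_yes, baseline_yes + baseline_no)

relative_risk = treatment_risk / baseline_risk
increased_risk = (treatment_risk - baseline_risk) / baseline_risk

# Odds compare "yes" to "no" rather than "yes" to the whole group.
treatment_odds = Fraction(treatment_yes, treatment_no)
baseline_odds = Fraction(baseline_yes, baseline_no)
odds_ratio = treatment_odds / baseline_odds

print(treatment_risk, baseline_risk)              # 3/5 1/5
print(relative_risk, increased_risk, odds_ratio)  # 3 2 6
```

Note that the relative risk (3) and the odds ratio (6) differ even though they come from the same table, because risk divides by the group size while odds divide by the number without cancer.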

33
Final note: When the chi-squared test is
statistically significant, it makes sense to
compute the various risk statements. If there is
no statistical significance, then the
skeptic wins: there is no evidence in the data
for differences in risk for the categories of
the explanatory variable.
34
Recall: marijuana example
Marijuana?           No      Yes     All
Female   observed    56      76      132
         expected    54.95   77.05   132.00
Male     observed    31      46      77
         expected    32.05   44.95   77.00
Total                87      122     209
Chi-Sq = 0.020 + 0.014 + 0.034 + 0.025 = 0.093
SO THE SKEPTIC WINS. But what if we observed a
much larger sample? Say, 100 times larger?
35
Marijuana example, larger sample
Marijuana?           No      Yes     All
Female   observed    5600    7600    13200
         expected    5495    7705    13200
Male     observed    3100    4600    7700
         expected    3205    4495    7700
Total                8700    12200   20900
Chi-Sq = 2.0 + 1.4 + 3.4 + 2.5 = 9.3
NOW THE RESEARCH ADVOCATE WINS.
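The effect of sample size can be seen by running the same calculation on both tables. This sketch uses a standard shortcut formula for 2x2 tables, n(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)], which is not on the slides but gives the same statistic (with no continuity correction). The function name `chi_squared_2x2` is illustrative.

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

small = chi_squared_2x2(56, 76, 31, 46)
large = chi_squared_2x2(5600, 7600, 3100, 4600)

print(round(small, 3), round(large, 1))
print("small sample significant:", small > 3.84)
print("large sample significant:", large > 3.84)
```

Multiplying every count by 100 multiplies the statistic by 100 while the percentages stay exactly the same, which is how an unchanged difference crosses the 3.84 cutoff.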
36
Practical significance
In the marijuana example, 58% of women and 60% of
men reported that they had tried marijuana. This
size of difference, even if it is really in the
population, is probably uninteresting. Yet we
have seen that a large sample size can make it
statistically significant. Hence, in the
interpretation of statistical significance, we
should also address the issue of practical
significance. In other words, we should answer
the skeptic's second question: WHO CARES?
37
Simpson's paradox (for quantitative variables)
Example 11.4, pp. 204-205
Correlation: -.312
38
Simpson's paradox (for quantitative variables)
Example 11.4, pp. 204-205
Correlation: -.312
H: Correlation .348; S: Correlation .637
39
Simpson's paradox for categorical variables, as
seen in video
Overall admitted to City U.
Business (hard)
Law (easy)
Women better in each, but more men apply to
easier law school!
40
Rules for combining probabilities
0 ≤ Probability ≤ 1
  • If there are only two possible outcomes, then
    their probabilities must sum to 1.
  • If two events cannot happen at the same time,
    they are called mutually exclusive. The
    probability of at least one happening (one or the
    other) is the sum of their probabilities. Rule
    1 is a special case of this.
  • If two events do not influence each other, they
    are called independent. The probability that
    they happen at the same time is the product of
    their probabilities.
  • If the occurrence of one event forces the
    occurrence of another event, then the probability
    of the second event is always at least as large
    as the probability of the first event.

41
Rule 1: If there are only two possible outcomes,
then their probabilities must sum to 1.
According to Example 3, page 302: P(lost
luggage) = 1/176 = .0057. Thus, P(luggage not
lost) = 1 - 1/176 = 175/176 = .9943. The point of
rule 1 is that P(lost) + P(not lost) = 1, so if
we know P(lost), then we can find P(not lost).
Sounds simple, right? It can be surprisingly
powerful.
42
Rule 2: If two events cannot happen at the same
time, they are called mutually exclusive.
In this case, the probability of at least one
happening is the sum of their probabilities.
Rule 1 is a special case of this.
Example 5, page 303: Suppose P(A in stat) =
.50 and P(B in stat) = .30. Then P(A or B in
stat) = .50 + .30 = .80. Note that the events "A
in stat" and "B in stat" are mutually exclusive.
Do you see why?
43
Rule 3: If two events do not influence each
other, they are called independent. In this
case, the probability that they happen at the
same time is the product of their probabilities.
Example 8, page 303: Suppose you believe that
P(A in stat) = .5 and P(A in history) =
.6. Further, you believe that the two events are
independent, so that they do not influence each
other. Then P(A in stat and A in history) =
(.5)(.6) = .3
Is this a reasonable assumption?
44
Rule 4: If the occurrence of one event forces the
occurrence of another event, then the probability
of the second event is always at least as large
as the probability of the first event.
If event A forces event B to occur, then P(A) ≤
P(B)
Special case: P(E and F) ≤ P(E) and P(E and F) ≤
P(F) (because "E and F" forces E to occur, and
likewise F).
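The four rules can be checked numerically. The sketch below reuses the numbers from the examples for Rules 1 to 3; the two-coin-flip check of Rule 4 is a made-up illustration, not taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Rule 1: two possible outcomes, so the probabilities sum to 1.
p_lost = Fraction(1, 176)
p_not_lost = 1 - p_lost

# Rule 2: mutually exclusive events, so probabilities add.
p_a_or_b_in_stat = Fraction(50, 100) + Fraction(30, 100)

# Rule 3: independent events (an assumption, as on the slide), so probabilities multiply.
p_a_in_both = Fraction(5, 10) * Fraction(6, 10)

# Rule 4: "E and F" forces E, so P(E and F) cannot exceed P(E).
# Illustrated with two fair coin flips: E = first is heads, F = second is heads.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event over equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_e = prob(lambda o: o[0] == "H")
p_e_and_f = prob(lambda o: o[0] == "H" and o[1] == "H")

print(p_not_lost, p_a_or_b_in_stat, p_a_in_both)  # 175/176 4/5 3/10
print(p_e_and_f, "<=", p_e)                        # 1/4 <= 1/2
```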
45
Two laws (only one of them valid)
  • Law of large numbers: Over the long haul, we
    expect about 50% heads (this is true).
  • Law of small numbers: If we've seen a lot of
    tails in a row, we're more likely to see heads on
    the next flip (this is completely bogus).

Remember: The law of large numbers OVERWHELMS;
it does not COMPENSATE.
46
The game of Odd Man
Consider the odd man game. Three people at
lunch toss a coin. The odd man has to pay the
bill. You are the odd man if you get a head and
the other two have tails or if you get a tail and
the other two have heads. Notice that there will
not always be an odd man: there is none if the
flips come up HHH or TTT.
P(no odd man) = P(HHH or TTT)
  = P(HHH) + P(TTT)      since HHH, TTT are mutually exclusive
  = (1/2)^3 + (1/2)^3    since H,H,H are independent (as are T,T,T)
  = 1/8 + 1/8
  = .25
Thus, P(there is an odd man) = 1 -
P(no odd man) = 1 - .25 = .75
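The same answer can be confirmed by listing all eight equally likely results of three fair coin tosses. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely results of three fair coin tosses, e.g. ('H', 'T', 'T').
outcomes = list(product("HT", repeat=3))

# No odd man when all three coins match: HHH or TTT.
no_odd_man = [o for o in outcomes if len(set(o)) == 1]

p_no_odd_man = Fraction(len(no_odd_man), len(outcomes))
p_odd_man = 1 - p_no_odd_man

print(p_no_odd_man, p_odd_man)  # 1/4 3/4
```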
47
Play until there is an odd man. What is the
probability this will take exactly three tries?
P(odd man occurs on the third try)
  = P(miss, miss, hit)         in that order! That's the only
                               way. (See why?)
  = P(miss) P(miss) P(hit)     since each try is independent of the others
  = P(miss)^2 P(hit)
  = (.25)^2 (.75)
  = .047
This is the final answer: the probability
that the odd man occurs exactly on the third try
(after two unsuccessful tries).
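The same reasoning works for any number of tries: multiply the miss probability once for each unsuccessful try, then multiply by the hit probability. A minimal sketch, with the function name `first_hit_on_try` chosen for illustration:

```python
from fractions import Fraction

def first_hit_on_try(k, p_hit):
    """P(first success happens on try k) when tries are independent."""
    return (1 - p_hit) ** (k - 1) * p_hit

p_third = first_hit_on_try(3, Fraction(3, 4))
print(p_third, float(p_third))  # 3/64 0.046875
```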
48
Expectation
What if you bet $10 on a game of craps? What is
your expected profit?
  • (Probability of winning: 244/495, or 49.3%)

You win $10 with probability .493. You lose $10
with probability .507. Expected profit = .493($10) +
.507(-$10) = -$0.14
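The expected profit is each possible profit multiplied by its probability, then summed. The sketch below uses the exact winning probability 244/495 instead of the rounded .493; the variable names are illustrative.

```python
from fractions import Fraction

p_win = Fraction(244, 495)
bet = 10  # dollars

# Win +$10 with probability p_win, lose $10 otherwise.
expected_profit = p_win * bet + (1 - p_win) * (-bet)

print(expected_profit, round(float(expected_profit), 2))  # -14/99 -0.14
```

With the exact probability the expected loss is about 14.1 cents per game, which matches the 14 cents above after rounding.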
49
Casino winnings, 10,000 games per day
Expectation: $1400
50
Casino winnings, 100,000 games a day
Expectation: $14,000. Note: Now all values are
positive
51
Your winnings, a single game
We already calculated the expectation to be a loss
of 14 cents. But you can't lose 14 cents in one game:
you either win 10 dollars or lose 10 dollars.
Thus, the expected value does not have to be a
possible value for any individual case.