Mar. 15 Statistics for the day: Highest temp ever recorded in State College: 102 degrees (July 9, 1936 and July 17, 1988). Lowest temp ever recorded in State College: -18 degrees (January 19-20, 1994)

About This Presentation
Slides: 65
Provided by: tomhettma
Learn more at: http://personal.psu.edu
Transcript and Presenter's Notes


1
Mar. 15 Statistics for the day:
Highest temp ever recorded in State College: 102 degrees (July 9, 1936 and July 17, 1988)
Lowest temp ever recorded in State College: -18 degrees (January 19-20, 1994)
Source: http://pasc.met.psu.edu
Review: Exam Friday, March 19. Chapters 10, 11, 12, 15, 16, 17
These slides were created by Tom Hettmansperger and in some cases modified by David Hunter
2
(No Transcript)
3
Best fitting line through the data: called the REGRESSION LINE
Strength of relationship: measured by CORRELATION
4
calories = -10 + 60(serving size in oz)
-------------------------------------------------
For example, if you have a 6 oz sandwich, on the average you expect to get about
-10 + 60(6) = -10 + 360 = 350 calories
-------------------------------------------------
For a 10 oz sandwich: -10 + 60(10) = -10 + 600 = 590 calories
5
calories = -10 + 60(serving size in oz)
  • -10 is called the intercept
  • 60 is called the slope
  • One way to interpret slope: For every extra oz
    of serving you get an increase of 60 calories
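The regression line above can be evaluated directly. The following is a minimal sketch; the function name `predicted_calories` is hypothetical and only the intercept (-10) and slope (60) come from the slide.

```python
def predicted_calories(serving_oz, intercept=-10.0, slope=60.0):
    """Predict calories from serving size: calories = intercept + slope * oz."""
    return intercept + slope * serving_oz


six_oz = predicted_calories(6)    # -10 + 60 * 6
ten_oz = predicted_calories(10)   # -10 + 60 * 10
# The slope is the change in the prediction for one extra ounce.
extra_per_oz = predicted_calories(7) - predicted_calories(6)

print(six_oz, ten_oz, extra_per_oz)
```

The difference between any two predictions one ounce apart is always the slope, which is the interpretation given in the last bullet.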

6
Facts about correlation, measured for two
quantitative variables
  • +1 means perfect increasing linear relationship
  • -1 means perfect decreasing linear relationship
  • 0 means no linear relationship
  • + means one increases as the other increases
  • - means one increases as the other decreases

7
Outliers
Outliers are data that are not compatible with
the bulk of the data. They show up in graphical
displays as detached or stray points. Sometimes
they indicate errors in data input. Experts
estimate that roughly 5% of all data entered is
in error. Sometimes they are the most
important data points.
8
Example
9
(No Transcript)
10
A bad outlier
11
Another bad outlier
12
The Moral
There can be good outliers: Election fraud. We
use them to identify important parts of the
data. Or in analyzing put options for extreme
cases.
More often the outliers are bad. They
can depress the correlation and make you think
the relationship is weaker than it really
is. They can increase the correlation and make
it appear that the relationship is stronger than
it really is.
IMPORTANT: Always look at a
scatter plot as well as compute the correlation.
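Both effects of a bad outlier can be seen with a few numbers. The sketch below uses made-up illustrative data (not the data plotted on the slides) and a hand-written Pearson correlation; the function name `correlation` is hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Illustrative data only: a perfect line, then the same line plus one stray point.
line_x, line_y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
r_line = correlation(line_x, line_y)
r_line_outlier = correlation(line_x + [6], line_y + [0])

# Illustrative data only: an unrelated cloud, then the cloud plus one far point.
cloud_x, cloud_y = [1, 2, 1, 2], [2, 1, 1, 2]
r_cloud = correlation(cloud_x, cloud_y)
r_cloud_outlier = correlation(cloud_x + [10], cloud_y + [10])

print(round(r_line, 3), round(r_line_outlier, 3))
print(round(r_cloud, 3), round(r_cloud_outlier, 3))
```

In the first pair one stray point pulls a perfect correlation down sharply; in the second pair one far point makes unrelated data look strongly correlated. A scatter plot would reveal both.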
13
Another problem
Sometimes we see a strong relationship in absurd
examples. Two seemingly unrelated variables
have a high correlation. This signals the
presence of a third variable that is highly
correlated with the other two. (Confounding or
interaction)
14
A third variable: vocabulary vs shoe size
15
How can we have such a high correlation
between shoe size and vocabulary?
Easy: Both increase with age and hence age is a hidden
variable. Age is positively correlated with
both shoe size and with vocabulary.
16
Two categorical variables
Explanatory variable: Gender
Response variable: Body Pierced or Not
Survey question: Have you pierced any other part
of your body? (Except for ears)
Research Question: Is there a significant difference
between women and men in terms of body pierces?
17
Data
Explanatory: Gender?   Response: Pierced?
          No    Yes   Total
Women     84    51    135
Men       96     3     99
Total    180    54    234
From Stat 100.2, spring 2004 (missing responses
omitted)
18
Percentages
62.22% = 84/135     96.97% = 96/99
Response: body pierced?
          no       yes      All
female    62.22    37.78    100.00
male      96.97     3.03    100.00
All       76.92    23.08    100.00
Research question: Is there a significant
difference between women and men?
(i.e., between 62.22% and 96.97%)
19
The Debate
The research advocate claims that there is a
significant difference. The skeptic claims
there is no real difference. The data
differences simply happen by chance.
20
The strategy for determining statistical
significance
  • First, figure out what you expect to see if there
    is no difference between females and males
  • Second, figure out how far the data is from what
    is expected.
  • Third, decide if the distance in the second step
    is large.
  • Fourth, if large then claim there is a
    statistically significant difference.

21
Research Advocate: OK. Suppose there is really
no difference in the population as you, the
Skeptic, claim. We will compare what you, the
Skeptic, expect to see and what you actually do
see in the data.
Skeptic: How do we figure out
what we expect to see?
22
          No    Yes   Total
Women                 135
Men                    99
Total    180    54    234
23
Rows: gender. Columns: body pierces.
Top lines of numbers are observed; bottom lines are
expected (by skeptic).
          no        yes       All
female    84        51        135
          103.85    31.15     135.00
male      96        3         99
          76.15     22.85     99.00
All       180       54        234
          180.00    54.00     234.00
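The skeptic's expected counts can be computed from the margins alone. The sketch below assumes the standard rule that each expected count is (row total × column total) / grand total; the function name `expected_counts` is hypothetical.

```python
def expected_counts(observed):
    """Expected counts under 'no difference': row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed piercing counts: rows are women, men; columns are no, yes.
observed = [[84, 51], [96, 3]]
expected = expected_counts(observed)

for row in expected:
    print([round(value, 2) for value in row])
```

The printed rows should match the bottom lines of the table for female and male.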
24
How to measure the distance between what
the research advocate observes in the table and
what the skeptic expects
Add up the following for each cell:
(observed - expected)^2 / expected
For the piercing table this sum, the chi-squared
statistic, is 38.85.
Now how do we decide if 38.85 is large or not?
If it is large enough the skeptic concedes to the
research advocate and agrees there is a
statistically significant difference. How large
is enough?
25
Chi-squared distribution with 1 degree of freedom
If chi-squared statistic is larger than 3.84, it
is declared large and the research advocate wins.
But our chi-squared is 38.85 so the research
advocate easily wins! There is a statistically
significant difference between men and women.
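The whole test for a table of counts fits in a few lines. This is a minimal sketch, assuming the usual chi-squared statistic (the sum over cells of (observed - expected)^2 / expected); the names `chi_squared` and `CRITICAL_1_DF` are hypothetical, and 3.84 is the cutoff quoted on the slide.

```python
def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp
    return statistic


CRITICAL_1_DF = 3.84  # cutoff from the slide for 1 degree of freedom

piercing = chi_squared([[84, 51], [96, 3]])
advocate_wins = piercing > CRITICAL_1_DF
print(round(piercing, 2), advocate_wins)
```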
26
Why 1 degree of freedom?
          No    Yes   Total
Women                 136
Men                   101
Total     26    211   237
Note that the black box is the ONLY one we can fill
arbitrarily. Once that box is filled, all others
are determined by margins!
27
How many degrees of freedom?
          Always    Sometimes   Never   Total
Women     One df    Two df              136
Men                                     101
Total     106       105         26      237
Degrees of freedom (df) always equal (Number of
rows - 1) times (Number of columns - 1)
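As a quick check of the rule, here is a small sketch; the function name `degrees_of_freedom` is hypothetical.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df for a two-way table: (rows - 1) * (columns - 1), margins excluded."""
    return (n_rows - 1) * (n_cols - 1)


df_two_by_two = degrees_of_freedom(2, 2)     # women/men by no/yes
df_two_by_three = degrees_of_freedom(2, 3)   # women/men by always/sometimes/never
print(df_two_by_two, df_two_by_three)
```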
28
Health studies and risk
Research question: Do strong electromagnetic
fields cause cancer?
50 dogs randomly split into two groups: no field, yes field.
The response is whether they get lymphoma.
Rows: mag field. Columns: cancer.
         no    yes   All
no       20     5    25
yes      10    15    25
All      30    20    50
29
Rows: mag field. Columns: cancer. Observed counts are
above the expected counts.
         no       yes      All
no       20       5        25
         15.00    10.00    25.00
yes      10       15       25
         15.00    10.00    25.00
All      30       20       50
         30.00    20.00    50.00
Chi-Square = 8.333 (compare to 3.84)
Research advocate wins!
30
Terminology and jargon
  • Identify the bad response category: yes cancer
  • Risk for categories of explanatory variable
  • Identify treatment category
  • Identify baseline (control) category
  • Treatment risk: 15/25 or .60 or 60%
  • Baseline risk: 5/25 or .20 or 20%
  • Relative risk: Treatment risk over Baseline risk
    = .60/.20 = 3
  • So risk due to mag field is 3 times higher
    than baseline risk.
  • One more on the next page

31
Increased risk (percentage change in risk)
Increased risk = (treatment risk - baseline risk) / baseline risk
= (.60 - .20)/.20 = 2.00
So the percentage change is 200%: a 200% increase
in treatment risk over baseline risk for getting
cancer.
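The risk statements for the dog study can be reproduced as follows. This is a sketch with hypothetical variable names; exact fractions are used so the ratios come out as whole numbers rather than floating-point approximations.

```python
from fractions import Fraction

# Counts from the magnetic-field table: cancer cases out of dogs in each group.
treatment_risk = Fraction(15, 25)   # field: yes
baseline_risk = Fraction(5, 25)     # field: no

relative_risk = treatment_risk / baseline_risk
increased_risk = (treatment_risk - baseline_risk) / baseline_risk
percent_increase = increased_risk * 100

print(float(treatment_risk), float(baseline_risk))
print(relative_risk, increased_risk, percent_increase)
```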
32
Final note: When the chi-squared test is
statistically significant then it makes sense to
compute the various risk statements. If there is
no statistical significance then the
skeptic wins. There is no evidence in the data
for differences in risk for the categories of
the explanatory variable.
33
Research question: Is ghost sighting related to
age? Do young and old people differ in ghost
sighting?
  • The skeptic responds by saying he doesn't believe
    that there is any difference between the age
    groups.

We need to see the data to resolve the debate.
Then we can consider assessing the risk.
Exercise 9, p219 of the text.
34
Expected counts are printed below observed
          yes      no        Total
young     212      1313      1525
          174.9    1350.1
old       465      3913      4378
          502.1    3875.9
Total     677      5226      5903
Chi-Sq = 7.870 + 1.020 + 2.742 + 0.355 = 11.987
The research advocate wins and skeptic
loses. There is evidence in the data that there
are differences in the population.
35
The percent of young who saw a ghost: 212/1525 = .139. Answer: 13.9%
The proportion of old who saw a ghost: 465/4378 = .106. Answer: .106
The risk of young seeing ghost: Answer: 212/1525 or .139 or 13.9%
Odds ratio?
36
Odds
  • The odds of something happening are given by a
    ratio: the chance that it happens divided by the
    chance that it does not.
  • For example, if you flip a fair coin, the odds of
    heads are 1 (or sometimes 1 to 1).
  • An odds ratio is the ratio of two odds!

37
The odds that a young person saw a ghost:
212/1313 = .161
The odds that an older person saw a ghost:
465/3913 = .119
The odds ratio: Answer: .161/.119 = 1.35
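The odds and the odds ratio come straight from the yes and no counts in the ghost table. A minimal sketch with hypothetical variable names:

```python
# Counts from the ghost-sighting table: (yes, no) for each age group.
young_yes, young_no = 212, 1313
old_yes, old_no = 465, 3913

young_odds = young_yes / young_no   # odds = yes count / no count
old_odds = old_yes / old_no
odds_ratio = young_odds / old_odds

print(round(young_odds, 3), round(old_odds, 3), round(odds_ratio, 2))
```

Working from the unrounded counts gives an odds ratio of about 1.36; dividing the rounded odds .161 and .119 gives 1.35.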
38
Relative risk of young person seeing a ghost
compared to older person: Answer: .139/.106 = 1.31
We would say that the risk that a younger
person sees a ghost is 1.31 times higher than the
risk that an older person sees a ghost.
The increased risk that a young person sees a ghost
over that of an older person: Answer: (.139 -
.106)/.106 = .31
Hence we would say that young
people have a 31% higher risk of seeing a ghost
than older people.
39
Statistical significance
  • Statistical significance is related to the size of
    the sample. But that makes sense. More data, more
    information, more precise inference.
  • So statistical significance is related to two
    things:
  • 1. The size of the difference between the
    percentages. Big differences are more likely to
    show stat. significance.
  • 2. The size of the sample. Bigger samples are
    more likely to show statistical significance
    irrespective of the size of the difference in
    percentages.
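The effect of sample size can be seen by keeping the percentages fixed and multiplying every count by 100. The tables below are made-up illustrative numbers (52% vs 48%), not data from the course, and `chi_squared` is a hypothetical helper implementing the usual sum of (observed - expected)^2 / expected.

```python
def chi_squared(observed):
    """Chi-squared statistic for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / grand_total) ** 2
        / (row_totals[i] * col_totals[j] / grand_total)
        for i, row in enumerate(observed)
        for j, obs in enumerate(row)
    )


# Illustrative data only: same 52% vs 48% split, 100 vs 10,000 per group.
small_sample = [[52, 48], [48, 52]]
large_sample = [[5200, 4800], [4800, 5200]]

chi_small = chi_squared(small_sample)
chi_large = chi_squared(large_sample)
print(round(chi_small, 2), round(chi_large, 2))
```

With identical percentages, the small table falls well below the 3.84 cutoff while the large one is far above it.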

40
Practical significance
Even if the difference in percentages is
uninteresting and of no practical interest, the
difference may be statistically significant
because we have a large sample. Hence, in the
interpretation of statistical significance, we
must also address the issue of practical
significance. In other words, you must answer
the skeptic's second question: WHO CARES?
41
Probability
Relative Frequency
Personal Opinion
Experiment Repeated Sampling
Experience Non-repeatable Event
Physical World Assumptions
Estimate Probability Repeated Sampling
Check by Repeated Sampling
42
Rules For combining probabilities
0 ≤ Probability ≤ 1
  • 1. If there are only two possible outcomes, then
    their probabilities must sum to 1.
  • 2. If two events cannot happen at the same time,
    they are called mutually exclusive. The probability
    of at least one happening (one or the other) is the
    sum of their probabilities. Rule 1 is a special
    case of this.
  • 3. If two events do not influence each other, they
    are called independent. The probability that they
    happen at the same time is the product of their
    probabilities.
  • 4. If the occurrence of one event forces the
    occurrence of another event, then the probability
    of the second event is always at least as large as
    the probability of the first event.

43
Are mutually exclusive events independent or
dependent?
  • Remember the tests:
  • Two events are mutually exclusive if they cannot
    happen at the same time.
  • Two events are independent if the occurrence of
    one does not alter the probability of the other
    occurring.
  • Or, put another way, two events are dependent if
    the probability of the occurrence of one event
    changes when we find out whether the other event
    occurred or not.

44
New Rule
Suppose we are considering a series of events.
The probability of at least one of the events
occurring is
Pr(at least one) = 1 - Pr(none)
This follows directly from Rule 1 since at least
one or none has to occur.
45
Long Run Behavior
We CANNOT predict individual outcomes. BUT we
CAN predict quite accurately long run behavior.
------------------------------------------------------------
Standard example: We cannot predict the outcome of a
single toss of a coin very precisely: Pr(head) = .50.
But in the long run we expect about 50%
heads and tails.
46
(No Transcript)
47
Two laws (only one of them valid)
  • Law of large numbers: Over the long haul, we
    expect about 50% heads (this is true).
  • Law of small numbers: If we've seen a lot of
    tails in a row, we're more likely to see heads on
    the next flip (this is completely bogus).

Remember: The law of large numbers OVERWHELMS;
it does not COMPENSATE.
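A short simulation shows the long-run behavior. This is a sketch, assuming a fair coin modeled with Python's pseudo-random generator; the seed and the function name `proportion_heads` are arbitrary choices made for illustration.

```python
import random


def proportion_heads(n_flips, rng):
    """Flip a simulated fair coin n_flips times and return the fraction of heads."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {n: proportion_heads(n, rng) for n in (10, 1000, 100000)}

for n, proportion in results.items():
    print(n, proportion)
```

Short runs can stray far from 50%, while long runs settle close to it. Early streaks are not cancelled out by later flips; they are simply swamped by the sheer number of flips.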
48
When will it happen? (p264 text) Odd Man
Consider the odd man game. Three people toss a
coin. The odd man has to pay for the
drinks. You are the odd man if you get a head
and the other two have tails or if you get a
tail and the other two have heads.
Pr(no odd man) = Pr(HHH or TTT)
  = Pr(HHH) + Pr(TTT)      (Rule 2)
  = (1/2)^3 + (1/2)^3      (Rule 3)
  = 1/8 + 1/8 = 1/4 = .25
Pr(odd man) = 1 - Pr(no odd man) = 1 - .25 = .75      (Rule 1)
49
Pr(odd man occurs on the third try)
  = Pr(miss, miss, hit)
  = Pr(miss) Pr(miss) Pr(hit)      (Rule 3)
  = Pr(miss)^2 Pr(hit) = .25^2 × .75 = .047
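The odd man probabilities can be checked by listing all eight outcomes of three coin tosses. A minimal sketch with hypothetical variable names:

```python
from itertools import product

# All 8 equally likely outcomes of three fair coin tosses.
outcomes = list(product("HT", repeat=3))

# No odd man exactly when all three coins agree (HHH or TTT).
no_odd_man = [o for o in outcomes if len(set(o)) == 1]

p_no_odd_man = len(no_odd_man) / len(outcomes)
p_odd_man = 1 - p_no_odd_man

# First odd man on the third try: miss, miss, hit (independent rounds).
p_third_try = p_no_odd_man ** 2 * p_odd_man

print(p_no_odd_man, p_odd_man, round(p_third_try, 3))
```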
50
Expectation: Insurance
Example 14 p267 extended.
Suppose my insurance company has 10,000 policy
holders and they are all skateboarders. I
collect a $500 premium each year. I pay off
$1500 for a claim of a skateboard
accident. From past experience I know 10% (i.e.
1000) will file a claim. How much do I expect to
make per customer?
51
Pr(claim) = .10; loss is 1500 - 500 = 1000, recorded as -1000
Pr(no claim) = .90; gain is 500
------------------------------------------------------------
Expected value = .10 × (-1000) + .90 × (500)
  = -100 + 450 = 350 dollars per customer
------------------------------------------------------------
Expected value for the 10,000 customers
  = 10,000 × 350 = 3,500,000 dollars per year
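The expected value is a probability-weighted average of the outcomes. The sketch below uses a hypothetical helper `expected_value` with the probabilities and dollar amounts from the slide.

```python
def expected_value(outcomes):
    """Probability-weighted average of (probability, value) pairs."""
    return sum(probability * value for probability, value in outcomes)


premium, payout = 500, 1500
per_customer = expected_value([
    (0.10, premium - payout),   # claim filed: net loss of 1000
    (0.90, premium),            # no claim: keep the premium
])
total_per_year = 10_000 * per_customer

print(round(per_customer, 2), round(total_per_year, 2))
```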
52
Efron Dice
Pr(B beats A) = 2/3
Pr(D beats C) = 2/3
Pr(A beats D) = 2/3
Pr(C beats B) = 2/3
Hence, there is NO best die! You can always pick
a winner.
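The transcript does not preserve the faces of the dice, so the face values below are an assumption: a commonly cited set of Efron dice. The labels W, X, Y, Z are hypothetical and are not claimed to match the slide's A, B, C, D. The sketch counts, over all 36 pairs of faces, how often one die beats another.

```python
from fractions import Fraction
from itertools import product


def prob_beats(first, second):
    """Probability that a roll of `first` is higher than a roll of `second`."""
    wins = sum(1 for a, b in product(first, second) if a > b)
    return Fraction(wins, len(first) * len(second))


# Assumed face values (a commonly cited Efron set), not taken from the slide.
dice = {
    "W": [4, 4, 4, 4, 0, 0],
    "X": [3, 3, 3, 3, 3, 3],
    "Y": [6, 6, 2, 2, 2, 2],
    "Z": [5, 5, 5, 1, 1, 1],
}

# Each die beats the next one, and the last beats the first: a cycle.
cycle = [("W", "X"), ("X", "Y"), ("Y", "Z"), ("Z", "W")]
results = {pair: prob_beats(dice[pair[0]], dice[pair[1]]) for pair in cycle}

for (winner, loser), probability in results.items():
    print(winner, "beats", loser, "with probability", probability)
```

Because the wins form a cycle, whichever die an opponent picks first, there is another die that beats it two times out of three.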
53
Cancer testing: confusion of the inverse
Suppose we have a cancer test for a certain type
of cancer.
Sensitivity of the test: If you have
cancer then the probability of a positive test is
.98. Pr(+ given you have C) = .98
Specificity of the test: If you do not have cancer
then the probability of a negative test is .95.
Pr(- given you do not have C) = .95
Base rate: The percent of the population who has the
cancer. This is the probability that someone has
C. Suppose for our example it is 1%. Hence,
Pr(C) = .01.
54
Percent table
                    + (Positive)   - (Negative)   Base Rate
C (Cancer)          .98            .02            .01
no C (no Cancer)    .05            .95            .99
Sensitivity: .98. Specificity: .95.
False positive: .05. False negative: .02.
Suppose you go in for a test and it comes back
positive. What is the probability that you have
cancer?
55
Count table from a percent table
          +       -       Base Rate
C         .98     .02     .01
no C      .05     .95     .99

          +       -       Total
C         98      2       100
no C      495     9405    9,900
Total     593     9407    10,000
Pr(C given a + test) = 98/593 = .165
56
Tree diagrams: A possible tool for solving
problems like the rare disease problem
All people like you
  With disease (.01)
    Positive (.98): .01 × .98 = .0098
    Negative (.02)
  Without disease (.99)
    Positive (.05): .99 × .05 = .0495
    Negative (.95)
Pr(Positive) = .0098 + .0495 = .0593
Pr(Disease given Positive) = .0098/.0593 = .165
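The tree calculation can be written as a small function. This is a sketch; the function name `prob_disease_given_positive` is hypothetical, and the inputs are the base rate, sensitivity, and specificity from the cancer testing example.

```python
def prob_disease_given_positive(base_rate, sensitivity, specificity):
    """Follow the two 'Positive' branches of the tree and compare them."""
    true_positive = base_rate * sensitivity                # disease and positive
    false_positive = (1 - base_rate) * (1 - specificity)   # no disease and positive
    return true_positive / (true_positive + false_positive)


posterior = prob_disease_given_positive(
    base_rate=0.01, sensitivity=0.98, specificity=0.95
)
print(round(posterior, 3))
```

Even with a sensitive test, a positive result here means only about a 1 in 6 chance of disease, because the disease is rare and false positives outnumber true positives.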
57
Recall: earlier quiz we didn't have
  • Mary likes earrings and spends time at festivals
    shopping for jewelry. Her boyfriend and several of
    her close girl friends have tattoos. They have
    encouraged her to also get a tattoo.
  • Unknown to you, Mary will be sitting next to you
    in the next stat100.2 class.
  • Which of the following do you think is more
    likely and why?
  • A. Mary is a physics major.
  • B. Mary is a physics major with pierced ears.

58
An answer of B (Mary is a physics major with
pierced ears) is impossible, since B can never be
more likely than A, and illustrates the
Conjunction fallacy: assigning higher
probability to a detailed scenario involving the
conjunction of events than to one of the simple
events that make up the conjunction.

A possible cause of this fallacy is
the Representativeness heuristic, which leads people to
assign higher probabilities than are warranted to
scenarios that are representative of how we
imagine things would happen.

59
Exercise 1, page 309 (sort of)
Suppose you flip four coins.
  • Which is more likely, HHHH or HTTH?
  • Which is more likely, four total heads or two
    total heads?

Note: These questions are not the same! One of
these questions is often mistakenly answered due
to belief in the Law of small numbers (also
known as the Gambler's Fallacy).
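The two questions can be told apart by counting. The sketch below assumes four independent flips of a fair coin; the variable names are hypothetical.

```python
from math import comb

n_flips = 4
p_any_exact_sequence = 0.5 ** n_flips   # HHHH and HTTH are each one specific sequence

# Number of sequences with a given total number of heads, times the
# probability of each sequence.
p_four_heads = comb(n_flips, 4) * p_any_exact_sequence
p_two_heads = comb(n_flips, 2) * p_any_exact_sequence

print(p_any_exact_sequence, p_four_heads, p_two_heads)
```

Any specific sequence is equally likely, but there are six sequences with two total heads and only one with four.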
60
Flip a coin repeatedly. Which of the following
is more likely?
  • Your first seven flips are HHTHTTH
  • Your first six flips are all heads

(By the way, how do you calculate the exact
probability of each of these events?)
61
Exercise 15, page 311. What's the difference
between these two statements?
  • I'm confident that there is at least one set of
    matching birthdays in this room
  • I'm confident that there is at least one person
    in this room whose birthday matches my birthday

Which statement is more likely to be true? How
many possible pairs of people are eligible for
matching in each case? Assume 50 people are in
the room.
62
With 50 people in the room
  • There are 49 possible pairs with me.
  • There are 49 + 48 + 47 + ... + 1 = 1225 total
    possible pairs.
  • Pr(No match with my birthday) = (364/365)^49 = .874
  • Pr(No match at all) = .030 (and we can estimate
    by (364/365)^1225 = .035)
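These birthday numbers can be reproduced as follows. The sketch assumes 365 equally likely birthdays and no leap years; the function and variable names are hypothetical.

```python
def count_pairs(n_people):
    """Number of distinct pairs among n_people: n * (n - 1) / 2."""
    return n_people * (n_people - 1) // 2


n = 50
total_pairs = count_pairs(n)

# Nobody among the other 49 people shares my birthday.
p_no_match_with_me = (364 / 365) ** (n - 1)

# Exact chance that all 50 birthdays are different.
p_no_match_exact = 1.0
for k in range(n):
    p_no_match_exact *= (365 - k) / 365

# Rough estimate that treats the 1225 pairs as if they were independent.
p_no_match_estimate = (364 / 365) ** total_pairs

print(total_pairs, round(p_no_match_with_me, 3))
print(round(p_no_match_exact, 3), round(p_no_match_estimate, 3))
```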

63
Randomized Response: A technique for asking
sensitive questions
Question 1: Have you ever smoked marijuana?
Question 2: Is your mother's birthday in Jan through May?
If your father's birthday is in July through Dec,
answer question 1. Otherwise answer question 2.
64
Conditional Probabilities
        no        yes       Base rate
Q1      1-p       p         6/12
Q2      7/12      5/12      6/12
Unconditional Probabilities
        no          yes
Q1      .5(1-p)     .5p
Q2      .292        .208
Total               .208 + .5p
Solve for p: .208 + .5p = proportion of observed
yeses in sample
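Solving the last equation for p is a one-line rearrangement. The sketch below uses the slide's base rates (half the class answers each question, and 5/12 of mothers' birthdays fall in Jan through May); the observed proportion of 35% is a made-up illustrative value, and the function name `estimate_p` is hypothetical.

```python
from fractions import Fraction

P_ANSWER_Q1 = Fraction(6, 12)   # father's birthday in July through Dec
P_ANSWER_Q2 = Fraction(6, 12)   # otherwise
P_YES_TO_Q2 = Fraction(5, 12)   # mother's birthday in Jan through May


def estimate_p(observed_yes_proportion):
    """Solve  P_ANSWER_Q2 * P_YES_TO_Q2 + P_ANSWER_Q1 * p = observed  for p."""
    return (observed_yes_proportion - P_ANSWER_Q2 * P_YES_TO_Q2) / P_ANSWER_Q1


# Illustrative value only: suppose 35% of all answers in the sample were "yes".
example_p = estimate_p(Fraction(35, 100))
print(example_p, round(float(example_p), 3))
```

No individual answer reveals which question was answered, yet the overall proportion of yeses still allows p to be estimated.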