Title: Reading
Slide 1: Lecture 11
Slide 2: Reading
- You should have completed chapter 6.
Slide 3: Evidence of a correlation
- The margin of error for red depends on the number of red marbles in the sample: 500 red marbles, so ±0.05.
- f(LR) = 0.65
- P(LR) = 0.65 ± 0.05
- So 0.60 to 0.70 of the red population are large.
- 100 non-red marbles, so ±0.10.
- f(LN) = 0.35
- P(LN) = 0.35 ± 0.10
- So 0.25 to 0.45 of the non-red population are large. (Fig. 6.3)
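The interval arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `interval_estimate` is made up for this sketch, and the margins of error are taken from the slide rather than computed.

```python
def interval_estimate(sample_frequency, margin_of_error):
    """Return the (low, high) interval estimate for a population proportion.

    Values are rounded to two decimals to hide floating-point noise.
    """
    low = round(sample_frequency - margin_of_error, 2)
    high = round(sample_frequency + margin_of_error, 2)
    return low, high


# Figures from the slide: 500 red marbles (±0.05), 100 non-red marbles (±0.10).
red = interval_estimate(0.65, 0.05)
non_red = interval_estimate(0.35, 0.10)
print("red:", red)
print("non-red:", non_red)
```

The smaller non-red group gets the wider interval because its margin of error is larger.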
Slide 4: Lack of evidence for a correlation
- Sample size 600 (fig.)
- 500 red, 100 non-red
- 270/500 large, f(LR) = 270/500 = 0.54
- P(LR) = 0.54 ± 0.05
- So 0.49 to 0.59 of the red population are large.
- 46/100 large, f(LN) = 46/100 = 0.46
- P(LN) = 0.46 ± 0.10
- So 0.36 to 0.56 of the non-red population are large.
Slide 5: Estimating the Strength of Correlation
- The maximum allowed difference is between the top of the higher interval and the bottom of the lower interval.
- The minimum allowed difference is between the bottom of the higher interval and the top of the lower interval.
- Fig. 6.3: 0.70 - 0.25 = 0.45 and 0.60 - 0.45 = 0.15
- Estimated strength of the correlation is (0.45, 0.15)
- Fig. 6.4: 0.59 - 0.36 = 0.23 and 0.49 - 0.56 = -0.07
- Estimated strength of the correlation is (0.23, -0.07)
- The negative number indicates that there may be no correlation.
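As a minimal sketch of the rule above, the two differences can be computed from a pair of interval estimates. The function name `correlation_strength` is hypothetical; the interval values are the ones quoted for Fig. 6.3 and Fig. 6.4.

```python
def correlation_strength(higher, lower):
    """Return (max_difference, min_difference) for two interval estimates.

    Each argument is a (low, high) pair. `higher` is the interval with the
    larger sample frequency, `lower` the one with the smaller.
    """
    max_diff = round(higher[1] - lower[0], 2)  # top of higher - bottom of lower
    min_diff = round(higher[0] - lower[1], 2)  # bottom of higher - top of lower
    return max_diff, min_diff


fig_6_3 = correlation_strength((0.60, 0.70), (0.25, 0.45))
fig_6_4 = correlation_strength((0.49, 0.59), (0.36, 0.56))
print("Fig. 6.3:", fig_6_3)
print("Fig. 6.4:", fig_6_4)
```

A negative minimum difference means the intervals overlap, so a difference of zero (no correlation) is among the allowed values.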
Slide 6
- The standard deviation of the difference is less than the standard deviation of the estimates.
- This is because we are getting evidence from the whole sample of 600 marbles combined.
- So rather than a certainty of 95%, our estimate of the strength of correlation is 99% certain.
Slide 7: Statistical Significance
- A correlation in the sample may be due to a correlation in the population, or it may be due to an accident of sampling.
- A correlation is statistically significant iff it is unlikely to be an accident of sampling.
- The differences in sample frequencies are statistically significant iff the corresponding interval estimates do not overlap.
- Iff the interval estimates do overlap, the differences in sample frequencies are not statistically significant.
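The overlap test can be written as a short check. This is a sketch under one assumption not stated on the slide: intervals that merely touch at an endpoint are counted as overlapping. The function names are illustrative only.

```python
def intervals_overlap(a, b):
    """True if the (low, high) intervals a and b share at least one point.

    Assumption: touching endpoints count as overlap.
    """
    return a[0] <= b[1] and b[0] <= a[1]


def is_statistically_significant(a, b):
    """Apply the slide's rule: significant iff the intervals do not overlap."""
    return not intervals_overlap(a, b)


# Interval estimates from the two marble examples earlier in the lecture.
first_example = is_statistically_significant((0.60, 0.70), (0.25, 0.45))
second_example = is_statistically_significant((0.49, 0.59), (0.36, 0.56))
print(first_example, second_example)
```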
Slide 9: Evaluating Statistical Hypotheses: The Real World Population
- The data only tell us about the population actually sampled.
- This is different from the population of interest.
- E.g. a phone survey of voting intentions doesn't include people without phones.
- The population actually sampled must have members who are all equally likely to be in the sample (random sampling).
Slide 10: The Sample Data
- When evaluating a hypothesis, it will usually
only be one part of the sample data that is
relevant. That part must be identified.
Slide 11: The Statistical Model
- Evaluating a statistical hypothesis consists in comparing the real world to a model.
- Sometimes the model is suggested by the data, though not always.
- The possible models we have are proportions, distributions and correlations.
Slide 12: Random Sampling
- How well does the study fit random sampling?
- Random sampling:
- a) All members of the population have an equal chance of being selected.
- b) There is no correlation between the outcome of one selection and another.
- This is an ideal that is only approximated in practice.
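To make the ideal concrete, here is a small sketch of drawing a simple random sample with Python's standard `random` module. The population of 6000 labelled marbles, the sample size and the seed are all hypothetical. Note that `random.sample` draws without replacement, so selections are not strictly independent; when the population is much larger than the sample, this is a close approximation to condition (b).

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; each member is equally likely to be chosen."""
    rng = random.Random(seed)  # a seeded generator makes the draw repeatable
    return rng.sample(population, k)


# Hypothetical population: 6000 labelled marbles.
marbles = [f"marble-{i}" for i in range(1, 6001)]
sample = simple_random_sample(marbles, 600, seed=11)
print(len(sample), "marbles drawn,", len(set(sample)), "distinct")
```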
Slide 13: Evaluating the Hypothesis
- Assuming random sampling, what does the data tell us?
- What estimate for a proportion do we have, with what margin of error?
- Is there evidence of a correlation?
- Is it strong evidence?
Slide 14: Summary
- How well does the data support the evaluation of stage 5?
- Is the sample random enough for the hypothesis to be supported?
- This depends on how random the sampling procedure was, and how strong the evidence is.
Slide 15: Non-random sampling 1: Stratified Sampling
- The sample should contain the same proportion of each sub-section as the population.
- Example: If the population of America is 20% Hispanic, the sample should be 20% Hispanic.
- Advantages: More precision.
- Can be administratively easier to focus on certain groups.
- Disadvantages: Requires identifying appropriate strata.
- More complex to analyse results.
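Proportional allocation across strata can be sketched as follows. The function name, the two-stratum breakdown and the sample size of 1000 are assumptions for illustration; only the 20% figure comes from the slide.

```python
def proportional_allocation(strata_shares, sample_size):
    """Return how many sample members to draw from each stratum.

    strata_shares maps a stratum name to its share of the population.
    Counts are rounded, so in general they may not sum exactly to sample_size.
    """
    return {name: round(share * sample_size)
            for name, share in strata_shares.items()}


# Hypothetical two-stratum breakdown and sample size.
shares = {"Hispanic": 0.20, "non-Hispanic": 0.80}
allocation = proportional_allocation(shares, 1000)
print(allocation)
```

Each stratum would then be sampled at random up to its allocated count.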
Slide 16: Non-random Sampling 2: Cluster Sampling
- If some areas of the population are out of reach, sample a cluster of individuals from those within reach.
- Rather than sample one child from each of 30 schools, sample 30 children from one school.
- Each cluster should be a small-scale representation of the population.
- The individuals within the cluster should be as heterogeneous as possible. (This is a weakness of the HBSC study.)
- Advantage: Cheaper / more practical. Saves travel time.
- Disadvantages: Higher margin of error.
- Clusters may be similar to each other.
Slide 17: Case Study: Health Behaviour in School-Aged Children
- Aim: To discover the behaviour and concerns of school-aged children related to health.
- Hopeful outcomes:
- Identify specific groups at risk.
- Understand the factors that dispose people to develop health problems.
- Develop effective intervention strategies.
- (Did they interview the same students at a later date?)
Slide 18
- How do you get a random sample of 11-15 year olds?
- They used the school systems.
- This excluded from the study all those not in the school system:
- Home-schooled
- In detention centres
- Homeless
Slide 19
- A sample of over 123,000 from 28 countries was obtained.
- This consisted of students answering survey questions.
- What does this tell us about the proportion of children who like school a lot?
Slide 20: The Real World Population
- The population of interest is children in the 28 countries between 11 and 15.
- The population sampled is children aged 11, 13 and 15 who were in school during the testing period and were competent enough in their national language.
Slide 21: The Sample Data
- The relevant piece of data is that 24% of respondents reported liking school a lot (p. 169).
Slide 22: The Statistical Model
- The model suggested is that the proportion of children who like school a lot is 24%.
[Chart: like school a lot = 0.24; the remainder don't like school a lot]
Slide 23: Random Sampling
- How well does the study fit random sampling?
- Random sampling:
- a) All members of the population have an equal chance of being selected.
- Not satisfied. Cluster sampling: only certain schools were selected.
- b) There is no correlation between the outcome of one selection and another.
- Not fully satisfied. There is less variation within classes than between them, e.g. students grouped by ability.
Slide 24: Evaluating the Hypothesis
- Assuming random sampling, what does the data tell us?
- For a sample of 120,000, the margin of error is about 1%.
- So the data suggests that 23-25% of children like school a lot.
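The slides quote margins of error without saying how they are obtained. One common rule of thumb, assumed here and not taken from the lecture, is that the 95% margin of error for a proportion is at most about 1/sqrt(n). The sketch below applies it to the sample sizes used in this lecture. The lecture's values appear to be rounded, and for n = 120,000 the rule gives a value well under 1%, so "about 1%" is on the cautious side.

```python
import math


def rule_of_thumb_margin(n):
    """Approximate 95% margin of error for a proportion: 1 / sqrt(n).

    Assumption: this rule of thumb is not stated in the lecture itself.
    """
    return 1 / math.sqrt(n)


# Sample sizes that appear in the lecture.
for n in (100, 250, 500, 120_000):
    print(n, round(rule_of_thumb_margin(n), 3))
```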
Slide 25: Children 11-15
[Figure: estimate that 0.23-0.25 of children aged 11-15 like school a lot; in the sample, 0.24 like school a lot and the remainder don't; n = 120,000]
Slide 26: Summary
- How well does the data support the evaluation of stage 5?
- Is the sample random enough for the hypothesis to be supported?
- Cluster sampling rather than random sampling was used.
- We have to rely to some extent on the care taken by the researchers. Given that it is a large-scale project by academics, it is reasonable to assume it was close to random sampling.
- Thus, we have good evidence that the proportion of children who like school a lot is 23-25%.
Slide 27: Problems with survey sampling 1: Non-random sampling
- The Kinsey report (1948, 1953): Sample of convenience.
- 10% of the sample were homosexual.
- 50% of married males had extra-marital sex.
- Self-selection.
- 25% of the sample were in prison, and 5% were male prostitutes.
- Advantage: Much larger sample size than would otherwise be possible.
Slide 28
- 1936 election: The Literary Digest predicted that Alf Landon, the Republican candidate, would win by a landslide.
- But their sample was chosen from phone books and car registration details. The poor of the 1930s had neither.
- George Gallup constructed his sample more carefully, and predicted that Roosevelt would win.
Slide 29: Problems with survey sampling 2: False responses (liars)
- Sensitive subjects: sex, drugs, bullying.
- Telling people what they want to hear.
- The Shy-Tory Factor:
- Polls in the British 1992 election put Labour and the Tories at 38% and 39%.
- But the Tories won by 7.6%.
- An inquiry by the Market Research Society put the difference down to embarrassed Tories.
Slide 30: Example: Why do Fox News polls favour Republicans?
- Fox News shows Obama with a 7 point lead. http://www.foxnews.com/polls/
- Gallup shows him with a 10 point lead. www.gallup.org
- 1. It's Fox News you talk to him
- 2. Telling people what they want to hear.
Slide 31: Major expected sources of error in the current polls
- 1. Racism. As it is not socially acceptable in most of America to refuse to vote for a black candidate, voters who refuse to vote for a black candidate will not say so. This is the Bradley effect. (False responses.)
- 2. Samples are selected by phone numbers. Voters who only have cell phones may be under-represented. If there is a correlation between having only a cell phone and voting preference, the poll may be inaccurate.
Slide 32: Exercise 6.13
- Is there a correlation between having a college
education and not drinking?
Slide 33: 1. The Real World Population
Slide 34: 2. The Sample Data
- Among those with a college education, 75% classified themselves as either light or moderate drinkers.
- 49% of those with a high school education gave these responses.
Slide 35
[Figure: light or moderate drinkers make up 0.75 of the college-educated group and 0.49 of the high-school-educated group; the rest of each group are non-drinkers or heavy drinkers]
Slide 36: 3. The Statistical Model
- The model suggested is that there is a positive
correlation between having a college education
and being a light or moderate drinker.
Slide 37: 4. Random Sampling
- How well does the study fit random sampling?
- In-home interviews.
- Random sampling:
- a) All members of the population have an equal chance of being selected.
- We're not told how the homes are selected.
- The homeless are excluded.
- b) There is no correlation between the outcome of one selection and another.
- We're not told, but probably satisfied.
Slide 38: 5. Evaluating the Hypothesis
- Assuming random sampling, what does the data tell us?
- For a sample of 500 the margin of error is 4%.
- Complication: We're not told what proportion of the population had a college education.
- For n = 250, the margin of error is ±0.06.
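Slide 39
[Figure: n = 500 total. College education: 0.75 light or moderate drinkers, interval estimate 0.69 to 0.81. High school education: 0.49 light or moderate drinkers, interval estimate 0.42 to 0.55. The remainder of each group are non-drinkers or heavy drinkers.]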
Slide 40: Strength of Correlation
- 0.81 - 0.42 = 0.39
- 0.69 - 0.55 = 0.14
- So the estimated strength of correlation is (0.39, 0.14).
- As this is based on the whole sample of 500, we can be 99% certain of this conclusion.
Slide 41: 6. Summary
- How well does the data support the evaluation of stage 5?
- Is the sample random enough for the hypothesis to be supported?
- We have to decide based on the report and the context in which we find the report. It's reasonable to assume that this was a carefully conducted study, in which case we have good evidence for the conclusion that there is a moderate correlation between having a college education and light or moderate drinking.