Title: STAT 111 Introductory Statistics
1. STAT 111 Introductory Statistics
- Lecture 4: Collecting Data
- May 24, 2004
2. Today's Topics
- Relationships between categorical variables
- Collecting Data
- Designing experiments
- Choosing a sample
- Sampling distributions
3. Categorical Variables
- Recall that categorical variables separate
individuals into groups.
- We've seen that to see relationships between
quantitative variables, we use scatterplots.
- Similarly, to see relationships between
categorical explanatory variables and
quantitative responses, side-by-side boxplots are
quite useful.
- What do we use to see the relationship between
two categorical variables, though?
4. Contingency Table
- The contingency table is a two-way table with one
variable as the row variable and the other as the
column variable.
- The row totals and column totals in a two-way
table give the marginal distributions of the two
variables separately.
- The conditional distribution of the response
variable for each category of the explanatory
variable can be used to describe the association
between the two variables.
5. Contingency Table Example 1
- Titanic data: 2201 passengers, counts only
- Row variable: SEX. Column variable: SURVIVED.

Count      SURVIVED
SEX        no      yes     Total
female     126     344       470
male      1364     367      1731
Total     1490     711      2201
6. Joint and Marginal Distributions
[Figure labels: joint distribution; marginal distribution of SURVIVED; marginal distribution of SEX]
7. Conditional Distributions
[Figure labels: conditional distribution of survival given gender; conditional distribution of gender given survival]
8. Example from Contingency Table 1
- Joint distribution
- P(male and survived) = 16.67%
- P(female and survived) = 15.63%
- Marginal distribution
- P(survived) = 32.30%
- P(male) = 78.65%
- Conditional distribution of survival

                  yes       no
Given a female    73.19%    26.81%
Given a male      21.20%    78.80%
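The percentages above can be reproduced directly from the counts in the Titanic table. The sketch below is one way to do the arithmetic. The variable names and the `pct` helper are illustrative choices, not part of the lecture.

```python
# Counts from the Titanic table (rows: SEX, columns: SURVIVED).
counts = {
    "female": {"yes": 344, "no": 126},
    "male": {"yes": 367, "no": 1364},
}

grand_total = sum(n for row in counts.values() for n in row.values())

# Joint distribution: each cell count divided by the grand total.
joint = {
    (sex, outcome): n / grand_total
    for sex, row in counts.items()
    for outcome, n in row.items()
}

# Marginal distributions: row totals and column totals over the grand total.
marginal_sex = {sex: sum(row.values()) / grand_total for sex, row in counts.items()}
marginal_survived = {
    outcome: sum(row[outcome] for row in counts.values()) / grand_total
    for outcome in ("yes", "no")
}

# Conditional distribution of survival given sex: each cell over its row total.
survival_given_sex = {
    sex: {outcome: n / sum(row.values()) for outcome, n in row.items()}
    for sex, row in counts.items()
}


def pct(p):
    """Format a proportion as a percentage with two decimals."""
    return f"{100 * p:.2f}%"


print("P(male and survived)   =", pct(joint[("male", "yes")]))
print("P(female and survived) =", pct(joint[("female", "yes")]))
print("P(survived)            =", pct(marginal_survived["yes"]))
print("P(male)                =", pct(marginal_sex["male"]))
for sex, dist in survival_given_sex.items():
    print(f"Given {sex}: yes {pct(dist['yes'])}, no {pct(dist['no'])}")
```

Dividing by the grand total gives joint and marginal percentages. Dividing by a row total gives the conditional distribution for that row.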
9. Example from Contingency Table 1
- We see that of the people on board the ship,
female survivors and male survivors made up
roughly the same percentage.
- But the number of females on board was
substantially smaller than the number of males.
- Looking at each category, we see that the
percentage of females that survived is higher
than the percentage of males that survived.
- Survival and gender seem to be associated.
10. Lurking Variables
- We know that lurking variables can produce
nonsensical relationships between two
quantitative variables.
- Does the same hold true for relationships between
categorical variables?
- Example: We have the number of delayed and
on-time flights for two airlines, Alaska Airlines
(AA) and America West (AW). Which one has more
flights that leave on time?
11. Lurking Variables (cont.)
- Looking at the contingency table below, it looks
like America West has a larger percentage of
on-time flights. But...

Count (row %)   Status
Airline         delay            on-time           Total
AA               501 (13.27%)    3274 (86.73%)      3775
AW               787 (10.89%)    6438 (89.11%)      7225
Total           1288             9712              11000
12. Lurking Variables (cont.)
- Let's look at the data for the individual cities.
[Tables: one table per city for Los Angeles, Phoenix, San Diego, Seattle, and San Francisco]
13. Lurking Variables (cont.)
- For each individual city, the percentage of
flights that are on time is higher for Alaska
Airlines than it is for America West.
- On the other hand, the percentage of flights that
are on time is higher for America West than for
Alaska Airlines when we look at the aggregate.
- What's going on here?
14. Lurking Variables (cont.)
- An association or comparison that holds for all
of several groups can reverse direction when the
data are combined to form a single group. This
reversal is Simpson's paradox.
- Simpson's paradox is an extreme form of the fact
that observed associations can be misleading in
the presence of lurking variables.
- Our case is an example of Simpson's paradox, so
what is the lurking variable here?
15. Lurking Variables (cont.)
- The lurking variable here is the city, and in
particular, the weather of that city.
- Of the five cities listed, Seattle has the worst
weather, so flights are delayed more often at
this airport. Phoenix, on the other hand, is not
plagued with bad weather, so flights there are
more often on time.
- Most of Alaska Airlines' flights involve Seattle,
whereas America West's flights mostly involve
Phoenix!
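The reversal can be checked numerically. The sketch below is built on an assumption: the per-city counts are not given in these slides. They are the figures commonly cited for this Alaska Airlines and America West example, and they add up to the totals in the aggregate table. Treat them as illustrative.

```python
# (delayed, on_time) counts per airline and city.
# ASSUMPTION: per-city counts are not in the slides; they are the commonly
# cited figures for this example and sum to the aggregate table's totals.
flights = {
    "AA": {
        "Los Angeles": (62, 497),
        "Phoenix": (12, 221),
        "San Diego": (20, 212),
        "San Francisco": (102, 503),
        "Seattle": (305, 1841),
    },
    "AW": {
        "Los Angeles": (117, 694),
        "Phoenix": (415, 4840),
        "San Diego": (65, 383),
        "San Francisco": (129, 320),
        "Seattle": (61, 201),
    },
}


def on_time_rate(delayed, on_time):
    """Proportion of flights that were on time."""
    return on_time / (delayed + on_time)


cities = list(flights["AA"])

# One two-way comparison per level of the third variable (city).
per_city = {
    city: {airline: on_time_rate(*flights[airline][city]) for airline in flights}
    for city in cities
}

# Collapse over city to get the aggregate comparison.
totals = {
    airline: (
        sum(d for d, _ in by_city.values()),
        sum(o for _, o in by_city.values()),
    )
    for airline, by_city in flights.items()
}
aggregate = {airline: on_time_rate(*totals[airline]) for airline in flights}

for city in cities:
    print(f"{city:14s} AA {per_city[city]['AA']:.1%}  AW {per_city[city]['AW']:.1%}")
print(f"{'All cities':14s} AA {aggregate['AA']:.1%}  AW {aggregate['AW']:.1%}")
```

Most Alaska Airlines flights in these counts go through Seattle and most America West flights go through Phoenix. The aggregate rates are therefore weighted toward different cities, which is what allows the comparison to reverse.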
16. Contingency Tables Wrap-up
- Most often, the contingency tables you'll see
will be of categorical variables with two levels
each.
- Naturally, we can extend this to categorical
variables with more than two levels.
- Also, we can consider a contingency table
involving three variables. In this case, we
create a series of contingency tables
involving only the first two variables, one table
for each of the levels of the third variable.
17. Collecting Data
- We've discussed previously the idea of
exploratory data analysis.
- What do we see in our data?
- Formal statistical inference is another type of
data analysis.
- Here, we are more interested in answering
specific questions with a known degree of
confidence.
- Either way, successful statistical analysis
requires our data to be both reliable and
accurate.
18. Collecting Data (cont.)
- The reliability and accuracy of our data depend
on the method we use to collect our data. This
method is known as a design.
- Some popular sources of data are:
- Available data from libraries and the internet
(available data are data that were produced in
the past for some other purpose but that may help
answer a present question)
- Observational studies
- Experimental studies
19. Observational vs Experimental Studies
- In an observational study, we observe individuals
and measure variables of interest, but we do not
attempt to influence the responses.
- In an experiment, we deliberately impose some
treatment on individuals in order to observe
their responses.
- An observational study is generally poor at
gauging the effect of an intervention, but in
many situations, we have to use an observational
study.
20. Sample Surveys
- The sample survey is one specific type of
observational study.
- Why is it preferred to a census?
- Financial constraints
- Time
- A sample survey can be conducted using:
- Personal interviews
- Telephone interviews
- Self-administered questionnaires
21. Experiments
- Experimental units: individuals on which our
experiment is conducted
- Subjects: human experimental units
- Treatment: specific experimental condition
applied to our units
- In principle, experiments can give good evidence
of causation.
22. Principles in Designing Experiments
- Control the effects of lurking variables on the
response. The easiest way to do this is by comparing
two or more treatments. This can help reduce the
bias in a study.
- Randomize: use chance to assign experimental
units to treatments.
- Replicate each treatment on many units to reduce
chance variation in the results.
23. More on Experiments
- In an experiment, we hope to see a difference in
the responses so large that it is unlikely to happen
because of chance variation alone.
- In other words, we are looking for a
statistically significant effect.
- This term frequently appears in reports of
studies and tells you that the investigators
found good evidence for the effect they were
seeking.
- The most serious weakness of experiments, though,
is their lack of realism.
24. Types of Experimental Designs
- Completely randomized design: experimental units
are allocated at random among treatments.
This is the simplest design for experiments.
- Block design: blocks of experimental units are
formed, and random assignment of units to treatments
is carried out separately within each block.
- Matched pairs design: a special type of block
design that compares only two treatments by
choosing blocks of two units that are as closely
matched as possible.
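The difference between these designs is where the randomization happens. The sketch below illustrates it with hypothetical unit labels, treatment names, and a fixed seed so the run is repeatable. It is one possible way to randomize, not a prescribed procedure.

```python
import random


def completely_randomized(units, treatments, rng):
    """Shuffle all units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        assignment[treatments[i % len(treatments)]].append(unit)
    return assignment


def randomized_block(blocks, treatments, rng):
    """Run a separate completely randomized assignment inside each block."""
    return {
        name: completely_randomized(members, treatments, rng)
        for name, members in blocks.items()
    }


rng = random.Random(111)  # fixed seed, chosen arbitrarily
units = [f"unit{i:02d}" for i in range(1, 13)]  # hypothetical units
treatments = ["A", "B"]  # hypothetical treatments

crd = completely_randomized(units, treatments, rng)

# Hypothetical blocks: units grouped by a characteristic known in advance.
blocks = {"block1": units[:6], "block2": units[6:]}
rbd = randomized_block(blocks, treatments, rng)

# Matched pairs: blocks of exactly two units, one unit per treatment.
pairs = {f"pair{i + 1}": units[2 * i: 2 * i + 2] for i in range(6)}
matched = randomized_block(pairs, treatments, rng)

print(crd)
print(rbd)
print(matched)
```

In the block and matched pairs versions, every block contributes equally to each treatment, so differences between blocks cannot be confused with treatment effects.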
25. Review: Population vs Sample
- Population: the entire group of individuals that
we want information about
- Sample: the part of the population we actually
examine in order to gather information
- Parameter: a value that describes the population.
It is fixed, but generally unknown.
- Statistic: a value that describes the sample. It
is observed once a sample is obtained and can be
used to estimate an unknown parameter.
- We generally require that the sample be
representative of the population.
26. Sampling Designs
- Voluntary response sample
- Biased sampling scheme
- Simple random sample
- Stratified random sample
- Cluster sample (one-stage and two-stage)
27. Sampling Designs
- A voluntary response sample consists of people
who choose themselves by responding to a general
appeal.
- This type of sample is invariably biased
(contains a systematic error) and is not usually
representative of the general population. Why?
- The people who are willing to respond are the
only ones included in this sample, and usually
those are the ones with very strong opinions.
- So what we get are the extreme cases.
28. Sampling Designs (cont.)
- Better sampling designs choose individuals by
random chance so that selection bias is eliminated.
- A simple random sample (SRS) of size n consists
of n individuals from the population chosen in
such a way that every set of n individuals has an
equal chance to be the sample actually selected.
- How do we select an SRS?
- Assign a number to each individual in the
population.
- Randomly select sample numbers by using a random
number table or software package.
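The two steps for selecting an SRS can be sketched with Python's standard `random` module. The population of 50 names and the seed are hypothetical. `Random.sample` draws without replacement, so every set of n labels is equally likely.

```python
import random


def simple_random_sample(population, n, rng):
    """Select an SRS of size n from a list of individuals."""
    # Step 1: assign a number (label) to each individual.
    labels = range(len(population))
    # Step 2: let software pick n distinct labels at random.
    chosen = rng.sample(labels, n)
    return [population[i] for i in chosen]


rng = random.Random(2004)  # fixed seed, chosen arbitrarily
population = [f"person{i:02d}" for i in range(50)]  # hypothetical population
srs = simple_random_sample(population, 5, rng)
print(srs)
```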
29. Sampling Designs (cont.)
- A probability sample is a sample chosen by chance
and is the general framework for designs that use
chance to choose a sample. Possible samples and
the probability of each possible sample occurring
must be known.
- The SRS is the simplest type of probability
sample; it gives each member of the population an
equal chance of selection.
- More complex designs are better for sampling from
large populations.
30. Sampling Designs (cont.)
- To select a stratified random sample, divide the
population into groups of similar individuals,
called strata. Then choose a separate SRS in each
stratum and combine these SRSs to form the full
sample.
31. Sampling Designs (cont.)
- We typically choose the strata based on facts we
know prior to taking the sample.
- Strata for sampling are similar to blocks in
experiments.
- Overall, using a stratified random sample, we can
acquire information about:
- The whole population
- Each stratum
- The relationships among the strata
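A stratified random sample is a separate SRS inside each stratum, with the pieces then combined. The sketch below assumes hypothetical strata (class years), hypothetical sample sizes per stratum, and an arbitrary seed.

```python
import random


def stratified_random_sample(strata, sizes, rng):
    """Draw a separate SRS from each stratum and keep them keyed by stratum."""
    return {name: rng.sample(members, sizes[name]) for name, members in strata.items()}


rng = random.Random(31)  # fixed seed, chosen arbitrarily

# Hypothetical strata, formed from information known before sampling.
strata = {
    "freshman": [f"fr{i}" for i in range(40)],
    "sophomore": [f"so{i}" for i in range(30)],
    "junior": [f"jr{i}" for i in range(20)],
}
sizes = {"freshman": 4, "sophomore": 3, "junior": 2}  # hypothetical allocation

by_stratum = stratified_random_sample(strata, sizes, rng)

# Combine the per-stratum SRSs to form the full sample.
full_sample = [person for members in by_stratum.values() for person in members]
print(by_stratum)
print(full_sample)
```

Keeping the result keyed by stratum supports summaries for each stratum and comparisons among strata. The combined list serves for statements about the whole population.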
32. Sampling Designs (cont.)
- The SRS and stratified random sample both select
individuals from the population.
- On the other hand, the cluster sample selects
groups or clusters of individuals from the
population. A cluster is also referred to as a
primary sampling unit (PSU).
- In a one-stage cluster sample, all individuals
within the selected clusters are selected.
- In a two-stage cluster sample, an SRS of the
individuals within each selected cluster is drawn.
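The two kinds of cluster sample differ only in what happens after the clusters are chosen. The sketch below uses hypothetical clusters (city blocks of households) and assumes the clusters themselves are chosen by an SRS, which the slide does not specify.

```python
import random


def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Choose clusters at random and keep every individual in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Choose clusters at random, then draw an SRS within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}


rng = random.Random(32)  # fixed seed, chosen arbitrarily

# Hypothetical clusters (PSUs): 8 city blocks with 10 households each.
clusters = {
    f"block{b}": [f"block{b}-house{h}" for h in range(10)] for b in range(8)
}

one_stage = one_stage_cluster_sample(clusters, 3, rng)
two_stage = two_stage_cluster_sample(clusters, 3, 4, rng)
print({name: len(members) for name, members in one_stage.items()})
print(two_stage)
```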
33. Sampling Designs (cont.)
- A two-stage cluster sample is an example of a
multistage sampling design.
- This is a more complex design in which, as the
name suggests, a sample is obtained by sampling
in multiple stages.
- Basically, any sort of combination of an SRS,
stratified random sample, and cluster sample can
create a multistage sample.
34. Errors: Non-sampling vs Sampling
- Non-sampling errors occur due to mistakes made
during the process of data acquisition.
- Increasing sample size will not reduce this type
of error.
- There are three types of non-sampling errors:
- Errors in data acquisition, e.g., response bias
- Nonresponse errors
- Selection bias, such as undercoverage
35. Error in Data Acquisition
[Diagram: population and sample, with sampling error and data acquisition error labeled]
36. Nonresponse Error
[Diagram: population and sample. No response from part of the population may lead to biased results in the sample.]
37. Selection Bias
[Diagram: population and sample. When parts of the population cannot be selected, the sample cannot represent the whole population.]
38. Sampling Error
- Sampling error refers to differences between the
sample and the population, because of the
specific observations that happen to be selected.
- Sampling error is expected to occur when making a
statement about the population based on the
sample taken.
39. [Diagram: population with its population mean, sample with its sample mean, and the sampling error between the two means]
40. Sampling Distributions
- The sampling distribution of a statistic is the
distribution of values taken by the statistic in
all possible samples of the same size from the
same population.
- The bias of a statistic is the difference between
the mean of its sampling distribution and the
population parameter. A statistic with no bias is
called unbiased.
- The variability of a statistic is described by the
spread of its sampling distribution, which is
determined by the design and size of the sample.
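For a very small population, the sampling distribution can be written out in full by listing every possible sample. The sketch below does this for a hypothetical population of six values and samples of size 3. It compares the sample mean, used to estimate the population mean, with the sample maximum, used to estimate the population maximum.

```python
from itertools import combinations
from statistics import mean, pstdev

population = [2, 4, 4, 7, 9, 10]  # hypothetical population values
n = 3

# Every possible sample of size n; under an SRS each is equally likely.
samples = list(combinations(population, n))

# Sampling distributions of two statistics.
sample_means = [mean(s) for s in samples]
sample_maxes = [max(s) for s in samples]

# Bias = mean of the sampling distribution minus the population parameter.
bias_of_mean = mean(sample_means) - mean(population)
bias_of_max = mean(sample_maxes) - max(population)

# Variability = spread (standard deviation) of the sampling distribution.
spread_of_mean = pstdev(sample_means)

print("number of possible samples:", len(samples))
print("bias of sample mean:", round(bias_of_mean, 10))
print("bias of sample max: ", round(bias_of_max, 4))
print("spread of sample mean:", round(spread_of_mean, 4))
```

In this enumeration the sample mean has zero bias, apart from floating-point rounding. The sample maximum is biased downward, because a sample can never exceed the population maximum and often falls short of it.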
41. [Figure panels: high bias, low variability; low bias, high variability; high bias, high variability; low bias, low variability]
42. More on Sampling Errors
- We are often concerned with how to manage the
bias and variability of a statistic.
- To reduce the bias, we use random sampling.
- Generally speaking, estimates drawn from an SRS
are unbiased (which is why the SRS is so
attractive).
- To reduce the variability of a statistic from an
SRS, increase the sample size.
- In practice, there is often a trade-off between
bias and variability, however (i.e., it can be hard
to make both very small).
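The effect of sample size on variability can be seen in a small simulation. The sketch below assumes a hypothetical population of 5000 roughly normal values, arbitrary sample sizes, 500 repeated SRSs per size, and a fixed seed. It approximates the spread of the sampling distribution and does not compute it exactly.

```python
import random
from statistics import mean, pstdev


def spread_of_sample_mean(population, n, reps, rng):
    """Approximate the spread of the sample mean using repeated SRSs of size n."""
    means = [mean(rng.sample(population, n)) for _ in range(reps)]
    return pstdev(means)


rng = random.Random(2004)  # fixed seed, chosen arbitrarily

# Hypothetical population: 5000 values centered near 50 with spread near 10.
population = [rng.gauss(50, 10) for _ in range(5000)]

spreads = {
    n: spread_of_sample_mean(population, n, reps=500, rng=rng)
    for n in (10, 40, 160)
}
for n, s in spreads.items():
    print(f"n = {n:3d}: spread of sample mean is about {s:.2f}")
```

Each fourfold increase in sample size should roughly halve the spread. This matches the usual result that the standard deviation of the sample mean shrinks in proportion to the square root of n.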