STAT 111 Introductory Statistics - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: STAT 111 Introductory Statistics

1
STAT 111 Introductory Statistics

Lecture 4: Collecting Data
May 24, 2004

2
Today's Topics

Relationships between categorical variables
Collecting Data
Designing experiments
Choosing a sample
Sampling distributions

3
Categorical Variables

Recall that categorical variables separate
individuals into groups.
We've seen that to examine relationships between
quantitative variables, we use scatterplots.
Similarly, to see relationships between
categorical explanatory variables and
quantitative responses, side-by-side boxplots are
quite useful.
What do we use to see the relationship between
two categorical variables, though?

4
Contingency Table

The contingency table is a two-way table with one
variable as the row variable and the other as the
column variable.
The row totals and column totals in a two-way
table give the marginal distributions of two
variables separately.
Conditional distribution of the response variable
for each category of the explanatory variable
could be used to describe the association between
the two variables.

5
Contingency Table Example 1

Titanic data: 2201 passengers, only the counts

Row variable: SEX. Column variable: SURVIVED.

SEX        no      yes     Total
female     126     344     470
male       1364    367     1731
Total      1490    711     2201
6
Joint and Marginal Distributions
Joint Distribution
Marginal distribution of SURVIVED
Marginal distribution of SEX
7
Conditional Distributions
Conditional distribution of survival given gender
Conditional distribution of gender given survival
8
Example from Contingency Table 1

Joint distribution
P(male and survived) = 16.67%
P(female and survived) = 15.63%
Marginal distribution
P(survived) = 32.30%
P(male) = 78.65%
Conditional distribution of survival

                  yes       no
Given a female    73.19%    26.81%
Given a male      21.20%    78.80%
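
The percentages on this slide can be reproduced from the counts in the Titanic table. The sketch below assumes that the survivor counts are 344 for females and 367 for males, which is the reading consistent with the percentages above; the variable and function names are illustrative only.

```python
# Counts from the Titanic two-way table (rows: SEX, columns: SURVIVED).
# Assumption: 344 females and 367 males survived, matching the slide's percentages.
counts = {
    "female": {"yes": 344, "no": 126},
    "male": {"yes": 367, "no": 1364},
}

total = sum(sum(row.values()) for row in counts.values())


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Joint distribution: each cell divided by the grand total.
joint = {
    sex: {s: pct(n, total) for s, n in row.items()}
    for sex, row in counts.items()
}

# Marginal distributions: row or column totals divided by the grand total.
marginal_sex = {sex: pct(sum(row.values()), total) for sex, row in counts.items()}
marginal_survived = {
    s: pct(sum(row[s] for row in counts.values()), total) for s in ("yes", "no")
}

# Conditional distribution of survival given sex: each cell divided by its row total.
survival_given_sex = {
    sex: {s: pct(n, sum(row.values())) for s, n in row.items()}
    for sex, row in counts.items()
}

print("joint:", joint)
print("marginal (survived):", marginal_survived)
print("marginal (sex):", marginal_sex)
print("survival given sex:", survival_given_sex)
```

Dividing by the grand total gives joint and marginal percentages, while dividing by a row total gives the conditional percentages used to compare the two groups.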
9
Example from Contingency Table 1

We see that of the people on board the ship,
female survivors and male survivors made up
roughly the same percentage.
But the number of females on board was
substantially smaller than the number of males.
Looking at each category, we see that the
percentage of females that survived is higher
than the percentage of males that survived.
Survival and gender seem to be associated.

10
Lurking Variables

We know that lurking variables can produce
nonsensical relationships between two
quantitative variables.
Does the same hold true for relationships between
categorical variables?
Example: We have the number of delayed and
on-time flights for two airlines, Alaska Airlines
(AA) and America West (AW). Which one has more
flights that leave on-time?

11
Lurking Variables (cont.)

Looking at the contingency table below, it looks
like America West has a larger percentage of
on-time flights. But

Counts with row percentages. Row variable: Airline. Column variable: Status.

Airline    delay     on-time   Total
AA         501       3274      3775
           13.27%    86.73%
AW         787       6438      7225
           10.89%    89.11%
Total      1288      9712      11000
12
Lurking Variables (cont.)

Let's look at the data for the individual cities.

Los Angeles
Phoenix
San Diego
Seattle
San Francisco
13
Lurking Variables (cont.)

For each individual city, the percentage of
flights that are on-time is higher for Alaska
Airlines than it is for America West.
On the other hand, the percentage of flights that
are on-time is higher for America West than for
Alaska Airlines when we look at the aggregate.
What's going on here?

14
Lurking Variables (cont.)

An association or comparison that holds for all
of several groups can reverse direction when the
data are combined to form a single group. This
reversal is Simpson's paradox.
Simpson's paradox is an extreme form of the fact
that observed associations can be misleading in
the presence of lurking variables.
Our case is an example of Simpson's paradox, so
what is the lurking variable here?
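
To see how such a reversal can arise arithmetically, here is a minimal sketch with made-up counts. The numbers are hypothetical and are not the lecture's flight data; the airline and city labels are placeholders chosen only to mimic the situation described.

```python
# HYPOTHETICAL counts, invented for illustration; NOT the lecture's data.
# Each entry is (on-time flights, total flights).
flights = {
    "Airline A": {"bad-weather city": (80, 100), "good-weather city": (19, 20)},
    "Airline B": {"bad-weather city": (14, 20), "good-weather city": (90, 100)},
}


def on_time_rate(on_time, total):
    """Percentage of flights that were on time."""
    return 100 * on_time / total


# Within each city, compare the two airlines.
by_city = {
    airline: {city: on_time_rate(*cell) for city, cell in cities.items()}
    for airline, cities in flights.items()
}

# Combine the cities: add up the counts first, then take the percentage.
aggregate = {}
for airline, cities in flights.items():
    on_time = sum(o for o, _ in cities.values())
    total = sum(t for _, t in cities.values())
    aggregate[airline] = on_time_rate(on_time, total)

for airline in flights:
    print(airline, by_city[airline], f"overall: {aggregate[airline]:.1f}%")
```

In this toy example Airline A has the higher on-time percentage in each city, yet Airline B has the higher percentage overall, because most of Airline A's flights are in the bad-weather city.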

15
Lurking Variables (cont.)

The lurking variable here is the city, and in
particular, the weather of that city.
Of the five cities listed, Seattle has the worst
weather, so flights at this airport tend to be
delayed more often. Phoenix, on the other hand, is not
plagued with bad weather, so flights tend to be
on time more often.
Most of Alaska Airlines' flights involve Seattle,
whereas America West's flights mostly involve
Phoenix!

16
Contingency Tables Wrap-up

Most often, the contingency tables you'll see
will be of categorical variables with two levels
each.
Naturally, we can extend this to categorical
variables with more than two levels.
Also, we can consider a contingency table
involving three variables. What we do in this
case is create a series of contingency tables
involving only the first two variables, one table
for each of the levels of the third variable.

17
Collecting Data

We've discussed previously the idea of
exploratory data analysis.
What do we see in our data?
Formal statistical inference is another type of
data analysis.
Here, we are more interested in answering
specific questions with a known degree of
confidence.
Either way, successful statistical analysis
requires our data to be both reliable and
accurate.

18
Collecting Data (cont.)

The reliability and accuracy of our data depend
on the method we use to collect our data. This
method is known as a design.
Some popular sources of data are
Available data from libraries and the internet
(Available data are data that were produced in
the past for some other purpose but that may help
answer a present question.)
Observational studies
Experimental studies

19
Observational vs Experimental Studies

In an observational study, we observe individuals
and measure variables of interest, but we do not
attempt to influence the responses.
In an experiment, we deliberately impose some
treatment on individuals in order to observe
their responses.
An observational study is generally poor at
gauging the effect of an intervention, but in
many situations, we have to use an observational
study.

20
Sample Surveys

The sample survey is one specific type of
observational study.
Why is it preferred to a census?
Financial constraints
Time
A sample survey can be conducted using
Personal interviews
Telephone interviews
Self-administered questionnaires

21
Experiments

Experimental units: individuals on which our
experiment is conducted
Subjects: human experimental units
Treatment: specific experimental condition
applied to our units
In principle, experiments can give good evidence
of causation.

22
Principles in Designing Experiments

Control: limit the effects of lurking variables on
the response. The easiest way to do this is by
comparing two or more treatments. This can help
reduce the bias in a study.
Randomize: use chance to assign experimental
units to treatments.
Replicate: apply each treatment to many units to
reduce chance variation in the results.

23
More on Experiments

In an experiment, we hope to see a difference in
the responses so large that it is unlikely to
happen because of chance variation alone.
In other words, we are looking for a
statistically significant effect.
This term frequently appears in reports of
studies and tells you that the investigators
found good evidence for the effect they were
seeking.
The most serious weakness of experiments, though,
is their lack of realism.

24
Types of Experimental Designs

Completely randomized design: experimental units
are allocated at random among treatments.
Simplest design for experiments.
Block design: blocks of experimental units are
formed; random assignment of units to treatments
is carried out separately within each block.
Matched pairs design: special type of block
design that compares only two treatments by
choosing blocks of two units that are as closely
matched as possible.
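
As a rough sketch of how the random assignment in these designs could be carried out in software, the code below shuffles the units and deals them out to the treatments, either over all units at once or separately within each block. The function names, unit labels, and block names are hypothetical, and Python's standard `random` module is just one possible tool.

```python
import random


def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


def randomized_block(blocks, treatments, rng):
    """Carry out a separate random assignment inside each block."""
    return {
        name: completely_randomized(units, treatments, rng)
        for name, units in blocks.items()
    }


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
treatments = ["treatment", "control"]
units = [f"unit_{i}" for i in range(1, 13)]  # hypothetical unit labels

# Completely randomized design: all 12 units are randomized together.
crd = completely_randomized(units, treatments, rng)

# Block design: two hypothetical blocks of 6 units, randomized separately.
blocks = {"block_1": units[:6], "block_2": units[6:]}
rbd = randomized_block(blocks, treatments, rng)

print(crd)
print(rbd)
```

A matched pairs design corresponds to calling `randomized_block` with blocks of exactly two units and two treatments, so that chance decides which member of each pair gets which treatment.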

25
Review: Population vs Sample

Population: the entire group of individuals that
we want information about
Sample: the part of the population we actually
examine in order to gather information
Parameter: a value that describes the population.
It is fixed, but generally unknown.
Statistic: a value that describes the sample. It
is observed once a sample is obtained and can be
used to estimate an unknown parameter.
We generally require that the sample be
representative of the population.

26
Sampling Designs

Voluntary response sample
Biased sampling scheme
Simple random sample
Stratified random sample
Cluster sample (one-stage and two-stage)

27
Sampling Designs

A voluntary response sample consists of people
who choose themselves by responding to a general
appeal.
This type of sample is invariably biased
(contains a systematic error) and is not usually
representative of the general population. Why?
The people who are willing to respond are the
only ones included in this sample, and usually
those are the ones with very strong opinions.
So what we get are the extreme cases.

28
Sampling Designs (cont.)

Better sampling designs choose individuals by
random chance so that the bias is eliminated.
A simple random sample (SRS) of size n consists
of n individuals from the population chosen in
such a way that every set of n individuals has an
equal chance to be the sample actually selected.
How do we select an SRS?
Assign a number to each individual in the
population.
Randomly select sample numbers by using a random
numbers table or software package.
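
The two steps above can be sketched with a software package in place of a random numbers table. This is only an illustration: the population labels and the sample size are hypothetical, and `random.sample` from Python's standard library is one of many tools that draw without replacement so that every set of n individuals is equally likely.

```python
import random

# Step 1: assign a number (here, a numbered label) to each individual.
population = [f"individual_{i:03d}" for i in range(1, 201)]  # hypothetical

# Step 2: let software pick n of them at random, without replacement.
rng = random.Random(2004)  # fixed seed only so the sketch is repeatable
n = 10
srs = rng.sample(population, k=n)

print(srs)
```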

29
Sampling Designs (cont.)

A probability sample is a sample chosen by chance
and is the general framework for designs that use
chance to choose a sample. Possible samples and
the probability of each possible sample occurring
must be known.
The SRS is the simplest type of probability
sample; it gives each member of the population an
equal chance of selection.
More complex designs are better for sampling from
large populations.

30
Sampling Designs (cont.)

To select a stratified random sample, divide the
population into groups of similar individuals,
called strata. Then choose a separate SRS in each
stratum and combine these SRSs to form the full
sample.
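
A minimal sketch of this procedure, assuming the strata are already known: draw a separate SRS inside each stratum and then pool the results. The strata names, their sizes, and the per-stratum sample sizes below are hypothetical.

```python
import random


def stratified_sample(strata, sizes, rng):
    """Draw a separate SRS from each stratum and combine them."""
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))
    return sample


# Hypothetical strata: individuals grouped by a fact known before sampling.
strata = {
    "stratum_A": [f"A{i}" for i in range(1, 61)],
    "stratum_B": [f"B{i}" for i in range(1, 41)],
}
sizes = {"stratum_A": 6, "stratum_B": 4}  # hypothetical SRS size per stratum

rng = random.Random(1)  # fixed seed only so the sketch is repeatable
full_sample = stratified_sample(strata, sizes, rng)
print(full_sample)
```

Because each stratum is sampled on its own, every stratum is guaranteed to appear in the full sample, which a single SRS of the whole population does not guarantee.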

31
Sampling Designs (cont.)

We typically choose the strata based on facts we
know prior to taking the sample.
Strata for sampling are similar to blocks in
experiments.
Overall, using a stratified random sample, we can
acquire information about
The whole population
Each stratum
The relationships among the strata

32
Sampling Designs (cont.)

The SRS and stratified random sample both select
individuals from the population.
On the other hand, the cluster sample selects
groups or clusters of individuals from the
population. A cluster is also referred to as a
primary sampling unit (PSU).
In a one-stage cluster sample, all individuals
within the selected clusters are selected.
In a two-stage cluster sample, an SRS of the
individuals within each selected cluster is drawn.
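
The difference between the one-stage and two-stage versions can be sketched as follows. The cluster names, cluster sizes, and the helper function are hypothetical; the only point is that clusters are chosen at random first, and individuals are then either all taken or sampled within the chosen clusters.

```python
import random


def cluster_sample(clusters, n_clusters, rng, per_cluster=None):
    """Select clusters at random, then take all or an SRS of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)  # one-stage: everyone in the cluster
        else:
            sample.extend(rng.sample(members, per_cluster))  # two-stage: SRS
    return chosen, sample


# Hypothetical population: 4 clusters (PSUs) of 5 individuals each.
clusters = {
    f"cluster_{c}": [f"{c}{i}" for i in range(1, 6)] for c in "ABCD"
}

rng = random.Random(3)  # fixed seed only so the sketch is repeatable
chosen_one, one_stage = cluster_sample(clusters, 2, rng)
chosen_two, two_stage = cluster_sample(clusters, 2, rng, per_cluster=2)

print(chosen_one, one_stage)
print(chosen_two, two_stage)
```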

33
Sampling Designs (cont.)

A two-stage cluster sample is an example of a
multistage sampling design.
This is a more complex design in which, as the
name suggests, a sample is obtained by sampling
in multiple stages.
Basically, any sort of combination of an SRS,
stratified random sample, and cluster sample can
create a multistage sample.

34
Errors: Non-sampling vs Sampling

Non-sampling errors occur due to mistakes made
during the process of data acquisition.
Increasing sample size will not reduce this type
of error.
There are three types of non-sampling errors:
Errors in data acquisition, e.g., response bias
Nonresponse errors
Selection bias, such as undercoverage

35
Error in Data Acquisition
Diagram labels: Population, Sample, sampling error, data acquisition error.
36
Nonresponse Error
Diagram (Population, Sample): no response from part of the population may lead to biased results in the sample.
37
Selection Bias
Diagram (Population, Sample): when parts of the population cannot be selected, the sample cannot represent the whole population.
38
Sampling Error

Sampling error refers to differences between the
sample and the population, because of the
specific observations that happen to be selected.
Sampling error is expected to occur when making a
statement about the population based on the
sample taken.

39
Diagram (Population, Sample): the sampling error is shown between the population mean and the sample mean.
40
Sampling Distributions

The sampling distribution of a statistic is the
distribution of values taken by the statistic in
all possible samples of the same size from the
same population.
The bias of a statistic is the difference between
the mean of its sampling distribution and the
population parameter; a statistic with no bias is
called unbiased.
The variability of a statistic is described by the
spread of its sampling distribution, which is
determined by the design and size of the sample.
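
For a very small population, the sampling distribution can be listed exactly by enumerating every possible sample. The sketch below uses a hypothetical population of six values and samples of size 2; it compares the sample mean (as an estimate of the population mean) with the sample maximum (as an estimate of the population maximum). All names and numbers are illustrative.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10, 12]  # hypothetical population values
n = 2

# Every possible sample of size n, each equally likely under an SRS.
samples = list(combinations(population, n))

# Sampling distribution of the sample mean and of the sample maximum.
sample_means = [mean(s) for s in samples]
sample_maxima = [max(s) for s in samples]

# Bias = mean of the sampling distribution minus the population parameter.
bias_of_mean = mean(sample_means) - mean(population)
bias_of_max = mean(sample_maxima) - max(population)

print("number of possible samples:", len(samples))
print("bias of sample mean:", bias_of_mean)
print("bias of sample maximum:", bias_of_max)
```

In this example the sample mean has zero bias, while the sample maximum systematically underestimates the population maximum, so it is a biased statistic.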

41
High bias, low variability
Low bias, high variability
High bias, high variability
Low bias, low variability
42
More on Sampling Errors

We are often concerned with how to manage the
bias and variability of a statistic.
To reduce the bias, we use random sampling.
Generally speaking, estimates drawn from an SRS
are unbiased (which is why the SRS is so
attractive).
To reduce the variability of a statistic from an
SRS, increase the sample size.
There is a trade-off between bias and variability,
however (i.e., we cannot make both very small).
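
The effect of sample size on variability can be checked on a toy example by enumerating all possible SRSs of several sizes and measuring the spread of the sample mean each time. The population values and the helper name below are hypothetical; this is a sketch of the idea, not a general proof.

```python
from itertools import combinations
from statistics import pstdev

population = [2, 4, 6, 8, 10, 12, 14, 16]  # hypothetical population values


def spread_of_sample_mean(n):
    """Standard deviation of the sample mean over all SRSs of size n."""
    means = [sum(s) / n for s in combinations(population, n)]
    return pstdev(means)


spreads = {n: spread_of_sample_mean(n) for n in (2, 4, 6)}
for n, spread in spreads.items():
    print(f"n = {n}: spread of the sample mean = {spread:.3f}")
```

The spread of the sampling distribution shrinks as n grows, which is the sense in which a larger SRS gives a less variable statistic.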