STAT 111 Introductory Statistics

1
STAT 111 Introductory Statistics
  • Lecture 4: Collecting Data
  • May 24, 2004

2
Today's Topics
  • Relationships between categorical variables
  • Collecting Data
  • Designing experiments
  • Choosing a sample
  • Sampling distributions

3
Categorical Variables
  • Recall that categorical variables separate
    individuals into groups.
  • We've seen that to see relationships between
    quantitative variables, we use scatterplots.
  • Similarly, to see relationships between
    categorical explanatory variables and
    quantitative responses, side-by-side boxplots are
    quite useful.
  • What do we use to see the relationship between
    two categorical variables, though?

4
Contingency Table
  • The contingency table is a two-way table with one
    variable as the row variable and the other as the
    column variable.
  • The row totals and column totals in a two-way
    table give the marginal distributions of two
    variables separately.
  • Conditional distribution of the response variable
    for each category of the explanatory variable
    could be used to describe the association between
    the two variables.

5
Contingency Table Example 1
  • Titanic data: 2201 passengers, only the counts

                        SURVIVED (column variable)
SEX (row variable)        no      yes    Total
female                   126      344      470
male                    1364      367     1731
Total                   1490      711     2201

6
Joint and Marginal Distributions
Joint Distribution
Marginal distribution of SURVIVED
Marginal distribution of SEX
7
Conditional Distributions
Conditional distribution of survival given gender
Conditional distribution of gender given survival
8
Example from Contingency Table 1
  • Joint distribution
  • P(male and survived) = 16.67%
  • P(female and survived) = 15.63%
  • Marginal distribution
  • P(survived) = 32.30%
  • P(male) = 78.65%
  • Conditional distribution of survival

                    yes       no
Given a female    73.19%    26.81%
Given a male      21.20%    78.80%

9
Example from Contingency Table 1
  • We see that of the people on board the ship,
    female survivors and male survivors made up
    roughly the same percentage.
  • But the number of females on board was
    substantially smaller than the number of males.
  • Looking at each category, we see that the
    percentage of females that survived is higher
    than the percentage of males that survived.
  • Survival and gender seem to be associated.
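
The percentages in this example can be reproduced from the four cell counts alone. The following is a minimal Python sketch, not part of the original lecture; the variable names are illustrative, and the counts are those of the Titanic table (344 female and 367 male survivors out of 2201 people).

```python
# Counts from the Titanic table, stored as counts[sex][survived]
counts = {
    "female": {"yes": 344, "no": 126},
    "male": {"yes": 367, "no": 1364},
}

total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell count divided by the grand total
joint = {
    (sex, s): n / total
    for sex, row in counts.items()
    for s, n in row.items()
}

# Marginal distributions: row totals and column totals over the grand total
marginal_sex = {sex: sum(row.values()) / total for sex, row in counts.items()}
marginal_survived = {
    s: sum(counts[sex][s] for sex in counts) / total for s in ("yes", "no")
}

# Conditional distribution of survival given sex: each cell over its row total
survival_given_sex = {
    sex: {s: n / sum(row.values()) for s, n in row.items()}
    for sex, row in counts.items()
}

for sex, dist in survival_given_sex.items():
    print(sex, {s: f"{100 * p:.2f}%" for s, p in dist.items()})
```

The joint probabilities divide by the grand total, while the conditional ones divide by a row total. That difference is why female and male survivors are similar shares of everyone on board, yet survival rates within each sex differ sharply.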

10
Lurking Variables
  • We know that lurking variables can produce
    nonsensical relationships between two
    quantitative variables.
  • Does the same hold true for relationships between
    categorical variables?
  • Example: We have the number of delayed and
    on-time flights for two airlines, Alaska Airlines
    (AA) and America West (AW). Which one has the
    higher percentage of on-time flights?

11
Lurking Variables (cont.)
  • Looking at the contingency table below, it looks
    like America West has a larger percentage of
    on-time flights. But...

Counts, with row percentages in parentheses:

                        Status
Airline     delay             on-time            Total
AA           501 (13.27%)     3274 (86.73%)       3775
AW           787 (10.89%)     6438 (89.11%)       7225
Total       1288              9712               11000

12
Lurking Variables (cont.)
  • Let's look at the data for the individual cities.

[Per-city tables: Los Angeles, Phoenix, San Diego, Seattle, San Francisco]

13
Lurking Variables (cont.)
  • For each individual city, the percentage of
    flights that are on-time is higher for Alaska
    Airlines than it is for America West.
  • On the other hand, the percentage of flights that
    are on-time is higher for America West than for
    Alaska Airlines when we look at the aggregate.
  • What's going on here?

14
Lurking Variables (cont.)
  • An association or comparison that holds for all
    of several groups can reverse direction when the
    data are combined to form a single group. This
    reversal is Simpson's paradox.
  • Simpson's paradox is an extreme form of the fact
    that observed associations can be misleading in
    the presence of lurking variables.
  • Our case is an example of Simpson's paradox, so
    what is the lurking variable here?

15
Lurking Variables (cont.)
  • The lurking variable here is the city, and in
    particular, the weather of that city.
  • Of the five cities listed, Seattle has the worst
    weather, so flights tend to be more delayed in
    this airport. Phoenix, on the other hand, is not
    plagued with bad weather, so flights tend to be
    more on-time.
  • Most of Alaska Airlines' flights involve Seattle,
    whereas America West's flights mostly involve
    Phoenix!
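
The reversal can be checked numerically. The sketch below is illustrative Python, not part of the lecture. Note the assumption: the per-city counts do not appear in this transcript. They are the figures commonly reported for this textbook example, and they do add up to the aggregate table (501 delayed and 3274 on-time for AA, 787 and 6438 for AW).

```python
# (on_time, delayed) counts per city.
# ASSUMPTION: these per-city figures are not in the transcript; they are the
# commonly published values for this example and match the aggregate totals.
flights = {
    "Alaska Airlines": {
        "Los Angeles": (497, 62),
        "Phoenix": (221, 12),
        "San Diego": (212, 20),
        "San Francisco": (503, 102),
        "Seattle": (1841, 305),
    },
    "America West": {
        "Los Angeles": (694, 117),
        "Phoenix": (4840, 415),
        "San Diego": (383, 65),
        "San Francisco": (320, 129),
        "Seattle": (201, 61),
    },
}


def on_time_rate(on_time, delayed):
    """Fraction of flights that left on time."""
    return on_time / (on_time + delayed)


def aggregate(airline):
    """Total (on_time, delayed) counts for an airline over all cities."""
    on_time = sum(c[0] for c in flights[airline].values())
    delayed = sum(c[1] for c in flights[airline].values())
    return on_time, delayed


# City by city, Alaska Airlines has the higher on-time rate...
for city in flights["Alaska Airlines"]:
    aa = on_time_rate(*flights["Alaska Airlines"][city])
    aw = on_time_rate(*flights["America West"][city])
    print(f"{city:14s} AA {100 * aa:.1f}%  AW {100 * aw:.1f}%")

# ...but in the combined data America West comes out ahead.
for airline in flights:
    print(f"{airline:16s} overall {100 * on_time_rate(*aggregate(airline)):.2f}%")
```

The aggregate rate is a weighted average of the city rates, weighted by each airline's number of flights per city. Alaska Airlines' weights concentrate on Seattle (low on-time rates for everyone) and America West's on Phoenix (high on-time rates for everyone), which is what flips the comparison.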

16
Contingency Tables Wrap-up
  • Most often, the contingency tables you'll see
    will be of categorical variables with two levels
    each.
  • Naturally, we can extend this to categorical
    variables with more than two levels.
  • Also, we can consider a contingency table
    involving three variables. What we do in this
    case is create a series of contingency tables
    involving only the first two variables, one table
    for each of the levels of the third variable.

17
Collecting Data
  • We've discussed previously the idea of
    exploratory data analysis.
  • What do we see in our data?
  • Formal statistical inference is another type of
    data analysis.
  • Here, we are more interested in answering
    specific questions with a known degree of
    confidence.
  • Either way, successful statistical analysis
    requires our data to be both reliable and
    accurate.

18
Collecting Data (cont.)
  • The reliability and accuracy of our data depend
    on the method we use to collect our data. This
    method is known as a design.
  • Some popular sources of data are
  • Available data from libraries and the internet
    (Available data are data that were produced in
    the past for some other purpose but that may help
    answer a present question.)
  • Observational studies
  • Experimental studies

19
Observational vs Experimental Studies
  • In an observational study, we observe individuals
    and measure variables of interest, but we do not
    attempt to influence the responses.
  • In an experiment, we deliberately impose some
    treatment on individuals in order to observe
    their responses.
  • An observational study is generally poor at
    gauging the effect of an intervention, but in
    many situations, we have to use an observational
    study.

20
Sample Surveys
  • The sample survey is one specific type of
    observational study.
  • Why is it preferred to a census?
  • Financial constraints
  • Time
  • A sample survey can be conducted using
  • Personal interviews
  • Telephone interviews
  • Self-administered questionnaires

21
Experiments
  • Experimental units: individuals on which our
    experiment is conducted
  • Subjects: human experimental units
  • Treatment: specific experimental condition
    applied to our units
  • In principle, experiments can give good evidence
    of causation.

22
Principles in Designing Experiments
  • Control: limit the effects of lurking variables
    on the response. The easiest way to do this is by
    comparing two or more treatments. This can help
    reduce the bias in a study.
  • Randomize: use chance to assign experimental
    units to treatments.
  • Replicate each treatment on many units to reduce
    chance variation in the results.

23
More on Experiments
  • In an experiment, we hope to see a difference in the
    responses so large that it is unlikely to happen
    because of chance variation alone.
  • In other words, we are looking for a
    statistically significant effect.
  • This term frequently appears in reports of
    studies and tells you that the investigators
    found good evidence for the effect they were
    seeking.
  • The most serious weakness of experiments, though,
    is their lack of realism.

24
Types of Experimental Designs
  • Completely randomized design: experimental units
    are allocated at random among treatments. This is
    the simplest design for experiments.
  • Block design: blocks of experimental units are
    formed, and random assignment of units to
    treatments is carried out separately within each
    block.
  • Matched pairs design: a special type of block
    design that compares only two treatments by
    choosing blocks of two units that are as closely
    matched as possible.
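
As a rough illustration of the first two designs, here is a minimal Python sketch. The function names, unit labels, and treatment labels are hypothetical and not from the lecture. It assumes the units should be split as evenly as possible among the treatments.

```python
import random


def completely_randomized(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


def randomized_block(blocks, treatments, seed=None):
    """Carry out a separate random assignment within each block."""
    rng = random.Random(seed)
    return {
        name: completely_randomized(units, treatments, seed=rng.randrange(2**32))
        for name, units in blocks.items()
    }


units = [f"unit{i}" for i in range(12)]
print(completely_randomized(units, ["A", "B", "C"], seed=1))
print(randomized_block({"block1": units[:6], "block2": units[6:]}, ["A", "B"], seed=1))
```

A matched pairs design corresponds to calling `randomized_block` with blocks of exactly two units and two treatments, so that chance decides which member of each pair gets which treatment.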

25
Review: Population vs Sample
  • Population: the entire group of individuals that
    we want information about
  • Sample: the part of the population we actually
    examine in order to gather information
  • Parameter: a value that describes the population.
    It is fixed, but generally unknown.
  • Statistic: a value that describes the sample. It
    is observed once a sample is obtained and can be
    used to estimate an unknown parameter.
  • We generally require that the sample be
    representative of the population.

26
Sampling Designs
  • Voluntary response sample
  • Biased sampling scheme
  • Simple random sample
  • Stratified random sample
  • Cluster sample (one-stage and two-stage)

27
Sampling Designs
  • A voluntary response sample consists of people
    who choose themselves by responding to a general
    appeal.
  • This type of sample is invariably biased
    (contains a systematic error) and is not usually
    representative of the general population. Why?
  • The people who are willing to respond are the
    only ones included in this sample, and usually
    those are the ones with very strong opinions.
  • So what we get are the extreme cases.

28
Sampling Designs (cont.)
  • Better sampling designs choose individuals by
    random chance so that the bias is eliminated.
  • A simple random sample (SRS) of size n consists
    of n individuals from the population chosen in
    such a way that every set of n individuals has an
    equal chance to be the sample actually selected.
  • How do we select an SRS?
  • Assign a number to each individual in the
    population.
  • Randomly select sample numbers by using a random
    numbers table or software package.
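
The two steps above can be sketched in a few lines of Python, using the standard library as the "software package". The function name and the example population are hypothetical.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw an SRS of size n: every set of n individuals is equally likely."""
    rng = random.Random(seed)
    labels = range(len(population))   # step 1: number each individual
    chosen = rng.sample(labels, n)    # step 2: draw n distinct numbers at random
    return [population[i] for i in chosen]


students = [f"student{i:03d}" for i in range(250)]  # hypothetical population
print(simple_random_sample(students, 5, seed=2004))
```

`random.sample` draws without replacement, so no individual can be selected twice. Fixing `seed` only makes the draw reproducible; it is left as `None` for an actual survey.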

29
Sampling Designs (cont.)
  • A probability sample is a sample chosen by chance
    and is the general framework for designs that use
    chance to choose a sample. Possible samples and
    the probability of each possible sample occurring
    must be known.
  • The SRS is the simplest type of probability
    sample; it gives each member of the population an
    equal chance of selection.
  • More complex designs are better for sampling from
    large populations.

30
Sampling Designs (cont.)
  • To select a stratified random sample, divide the
    population into groups of similar individuals,
    called strata. Then choose a separate SRS in each
    stratum and combine these SRSs to form the full
    sample.

31
Sampling Designs (cont.)
  • We typically choose the strata based on facts we
    know prior to sampling.
  • Strata for sampling are similar to blocks in
    experiments.
  • Overall, using a stratified random sample, we can
    acquire information about
  • The whole population
  • Each stratum
  • The relationships among the strata
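
A stratified random sample amounts to one SRS per stratum, combined afterwards. Below is a minimal Python sketch. The function name, the strata, and the per-stratum sample sizes are hypothetical, and how to choose those sizes is a separate question that the lecture does not cover.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a separate SRS within each stratum and combine the results."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        sample[name] = rng.sample(list(members), sizes[name])
    return sample


strata = {
    "freshmen": [f"fr{i}" for i in range(400)],
    "seniors": [f"sr{i}" for i in range(100)],
}
by_stratum = stratified_sample(strata, {"freshmen": 8, "seniors": 2}, seed=4)
full_sample = [person for group in by_stratum.values() for person in group]
print(by_stratum)
print(full_sample)
```

Keeping the result grouped by stratum is what allows statements about each stratum and comparisons among strata, in addition to the whole population.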

32
Sampling Designs (cont.)
  • The SRS and stratified random sample both select
    individuals from the population.
  • On the other hand, the cluster sample selects
    groups or clusters of individuals from the
    population. A cluster is also referred to as a
    primary sampling unit (PSU).
  • In a one-stage cluster sample, all individuals
    within the selected clusters are selected.
  • In a two-stage cluster sample, an SRS of the
    individuals within each selected cluster is drawn.
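
The difference between the one-stage and two-stage versions can be sketched as follows. This is illustrative Python with hypothetical names and data. It assumes the clusters themselves are chosen by an SRS, which the slide does not state explicitly.

```python
import random


def cluster_sample(clusters, n_clusters, n_per_cluster=None, seed=None):
    """One-stage if n_per_cluster is None, otherwise two-stage.

    Returns the chosen clusters (PSUs) and the sampled individuals.
    """
    rng = random.Random(seed)
    # Stage 1: pick whole clusters (assumed here to be an SRS of clusters)
    chosen = rng.sample(list(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = list(clusters[name])
        if n_per_cluster is None:
            sample.extend(members)                             # take everyone
        else:
            sample.extend(rng.sample(members, n_per_cluster))  # stage 2: SRS
    return chosen, sample


# Hypothetical population: 6 city blocks with 10 households each
city_blocks = {f"block{b}": [f"b{b}-house{h}" for h in range(10)] for b in range(6)}

print(cluster_sample(city_blocks, 2, seed=3))                   # one-stage
print(cluster_sample(city_blocks, 2, n_per_cluster=3, seed=3))  # two-stage
```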

33
Sampling Designs (cont.)
  • A two-stage cluster sample is an example of a
    multistage sampling design.
  • This is a more complex design in which, as the
    name suggests, a sample is obtained by sampling
    in multiple stages.
  • Basically, any sort of combination of an SRS,
    stratified random sample, and cluster sample can
    create a multistage sample.

34
Errors: Non-sampling vs Sampling
  • Non-sampling errors occur due to mistakes made
    during the process of data acquisition.
  • Increasing sample size will not reduce this type
    of error.
  • There are three types of non-sampling errors
  • Errors in data acquisition, e.g., response bias
  • Nonresponse errors
  • Selection bias, such as undercoverage
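
To see why a larger sample does not help with non-sampling error, consider a small simulation of nonresponse. This is a hedged sketch in Python: the response probabilities and the true level of support are made-up values chosen only for illustration.

```python
import random


def survey_estimate(n_contacted, true_support=0.5,
                    respond_if_yes=0.8, respond_if_no=0.4, seed=0):
    """Estimated support among respondents when supporters respond more often.

    All probabilities are hypothetical illustration values.
    """
    rng = random.Random(seed)
    answers = []
    for _ in range(n_contacted):
        supports = rng.random() < true_support
        p_respond = respond_if_yes if supports else respond_if_no
        if rng.random() < p_respond:   # nonrespondents are never observed
            answers.append(supports)
    return sum(answers) / len(answers)


for n in (100, 10_000, 200_000):
    print(n, round(survey_estimate(n), 3))
```

Under these assumed rates, the share of supporters among respondents settles near 0.5 × 0.8 / (0.5 × 0.8 + 0.5 × 0.4) = 2/3 rather than the true 0.5. Contacting more people only makes the estimate settle more tightly around the wrong value.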

35
Error in Data Acquisition
[Diagram: Population and Sample, labeled "Sampling error" and "Data acquisition error"]

36
Nonresponse Error
  • No response from part of the population may lead
    to biased results in the sample.

37
Selection Bias
  • When parts of the population cannot be selected,
    the sample cannot represent the whole population.

38
Sampling Error
  • Sampling error refers to differences between the
    sample and the population, because of the
    specific observations that happen to be selected.
  • Sampling error is expected to occur when making a
    statement about the population based on the
    sample taken.

39
[Diagram: Population (population mean) and Sample (sample mean), labeled "Sampling error"]

40
Sampling Distributions
  • The sampling distribution of a statistic is the
    distribution of values taken by the statistic in
    all possible samples of the same size from the
    same population.
  • The bias of a statistic is the difference between
    the mean of its sampling distribution and the
    population parameter. A statistic with no bias is
    called unbiased.
  • The variability of a statistic is described by
    the spread of its sampling distribution, which is
    determined by the design and size of the sample.
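
For a very small population, the sampling distribution can be listed exhaustively, which makes the definition of bias concrete. The following Python sketch uses a made-up population of six values and all SRSs of size 2. The sample maximum is included only as a contrasting, hypothetical example of a biased statistic.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8, 10, 12]   # hypothetical toy population
n = 2

# Every possible sample of size n (drawn without replacement)
samples = list(combinations(population, n))


def mean(values):
    """Exact mean as a fraction, to avoid rounding noise."""
    values = list(values)
    return Fraction(sum(values), len(values))


# Sampling distributions: the statistic evaluated on every possible sample
sampling_dist_mean = [mean(s) for s in samples]
sampling_dist_max = [max(s) for s in samples]

# Bias = mean of the sampling distribution minus the population parameter
bias_of_mean = mean(sampling_dist_mean) - mean(population)
bias_of_max = mean(sampling_dist_max) - max(population)

print("number of samples:", len(samples))
print("bias of sample mean:", bias_of_mean)
print("bias of sample max:", bias_of_max)
```

The sample mean averages out to the population mean exactly, so its bias is zero. The sample maximum can never exceed the population maximum, so it underestimates it on average.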

41
High bias, low variability
Low bias, high variability
High bias, high variability
Low bias, low variability
42
More on Sampling Errors
  • We are often concerned with how to manage the
    bias and variability of a statistic.
  • To reduce the bias, we use random sampling.
  • Generally speaking, estimates drawn from an SRS
    are unbiased (which is why the SRS is so
    attractive).
  • To reduce the variability of a statistic from an
    SRS, increase the sample size.
  • There is often a trade-off between bias and
    variability, however (i.e., it can be hard to
    make both very small).
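
The effect of sample size on variability can be checked by simulation. This is a minimal Python sketch under stated assumptions: the population is made up (10,000 uniform values), and the spread of the sampling distribution is approximated by repeating the SRS 2,000 times rather than listing all possible samples.

```python
import random
import statistics

rng = random.Random(42)
population = [rng.uniform(0, 100) for _ in range(10_000)]  # hypothetical population


def spread_of_sample_means(n, repetitions=2000):
    """Standard deviation of the sample mean over repeated SRSs of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repetitions)]
    return statistics.stdev(means)


spread_small = spread_of_sample_means(10)
spread_large = spread_of_sample_means(100)

print("population mean:", round(statistics.fmean(population), 2))
print("spread of sample mean, n = 10: ", round(spread_small, 2))
print("spread of sample mean, n = 100:", round(spread_large, 2))
```

In both cases the sample means center on the population mean. Only the spread changes, and it shrinks as the sample size grows.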