Statistics

Slides: 58

Provided by: sandyb2

Learn more at: https://pages.stern.nyu.edu

Transcript and Presenter's Notes

1
Statistics & Data Analysis

Course Number: B01.1305
Course Section: 31
Meeting Time: Wednesday 6:00-8:50 pm

CLASS 5
2
Class 5 Outline

Understand random sampling and systematic bias
Derive theoretical distribution of summary
statistics
Understand the Central Limit Theorem
Use a normal probability plot to assess normality

3
Review of Last Class

Special Distributions
Counting problems
Binomial distribution problems
Normal distribution problems

4
CHAPTER 6

Random Sampling and Sampling Distributions

5
Chapter Goals

Explain why in many situations a sample is the
only way to learn something about a population
Explain the various methods of selecting a
sample
Define and construct sampling distribution of
sample means
Understand sources of bias or under-representation
in data

6
A Scenario

It's 9:00 AM on Wednesday and your boss sent you
an email asking how your firm's customers would
react to a new price discounting program
Your report is due tomorrow
It takes 10 minutes to interview a single
customer in your database of almost 2,000
What will you do????
Draw a sample of the customers
How will you draw the sample?
Need a representative sample
Does your database hold a representative sample???

7
Background

Some previous chapters emphasized methods for
describing data
Created frequency distributions, computed
averages and measures of dispersion
Started to lay foundation for inference by
studying probability
Counting, Binomial, and Normal Distributions
Probability distributions encompass all possible
outcomes of an experiment and the probability
associated with each outcome
So far, we've learned how to describe something
that has already occurred or evaluate something
that might occur

8
How are these similar?

QC department needs to check the tensile strength
of steel wire
Five small pieces are selected every 5 hours
Tensile strength of each piece is determined
Marketing needs to determine the sales potential
of a new drug named HappyPill.
452 consumers were asked to try it for a week
Each consumer completed a questionnaire
A polling agency selected 2,000 voters at random
and asked their approval rating of the President
In the study of insider trading, 25 CEOs were
identified by the SEC and their trades were
monitored for three years

9
Why Sample???

Destructive nature of some tests
Physical Impossibility of checking all items
Cost of studying all items
Adequacy of sample results
Contacting whole population would be too
time-consuming

10
Types of Samples

Cross-sectional samples are taken from an
underlying population at a particular time
Time-series samples are taken over time from a
random process
Enumerative studies: sampling from a
well-defined population
Analytic studies: look at the results of a
random process to predict future behavior

11
Why Sample???

We often need to know something about a large
population.
What is the average income of all Stern
students?
It's often too expensive and time-consuming to
examine the entire population
Solution: Choose a small random sample and use
the methods of statistical inference to draw
conclusions about the population
Sampling lets us dramatically cut the costs of
gathering information, but requires care. We need
to ensure that the sample is representative of
the population of interest
But how can any small sample be completely
representative?

12
Why Sample (cont.)

IT IS IMPORTANT TO REALIZE THAT SOME INFORMATION
IS LOST IF WE ONLY EXAMINE A SAMPLE OF THE ENTIRE
POPULATION
Why not just use the sample mean in place of µ?
For example, suppose that the average income of
100 randomly selected Stern students was $62,154
Can we conclude that the average income of ALL
Stern students (µ) is $62,154?
Can we conclude that µ > $60,000?
Fortunately, we can use probability theory to
understand how the process of taking a random
sample will blur the information in a population
But first, we need to understand why and how the
information is blurred

13
Sampling Variability

Although the average income of all Stern Langone
students is a fixed number, the average of a
sample of 100 students depends on precisely which
sample is taken. In other words, the sample mean
is subject to sampling variability
The problem is that by reporting the sample mean
alone, we don't take account of the variability
caused by the sampling procedure. If we had
polled different students, we might have gotten a
different average income
It would be a serious mistake to ignore this
sampling variability, and simply assume that the
mean income of all students is the same as the
average of the 100 incomes given in the sample

14
Populations and Samples

You are considering opening an Atomic Wings in
Bethlehem, PA
POPULATION: All residents
SAMPLE:
Every 35th person at the mall
Every 2,000th person in the phone book
Every person who leaves Burger King
Don't forget to include the college students!!!

15
Choosing a Representative Sample

REPRESENTATIVE: Each characteristic occurs the
same percentage of the time in the sample as
in the population
BIAS: Not representative
Bias will exist if there is a systematic tendency
to over/under represent some part of the
population
By deliberately not sampling based on any
specific characteristic, a randomly selected
sample will typically be free from bias
Randomly selecting subjects lets you make
probability statements about the results

16
Examples of Bias

Selection Bias
A telephone survey of households conducted
entirely between 9 a.m. and 5 p.m.
Using a customer complaint database to query on
the new discount program
Nonresponse Bias: Sample member refuses to
participate
Every market research program
Operational Definitions: Guiding a response
Do you agree that taxes are too high in New York?

17
Simple Random Sampling

Process where each possible sample of a given
size has the same probability of being selected
Example: IBM reported sales of $64.792 billion
and a net loss of $2.827 billion for 1991.
The number of individual transactions was
enormous
The auditors used statistics to choose a
representative sample of transactions to check in
detail

18
Choosing a Random Sample

Number every member in the population 1, ..., N
Use a random process to select the sample
R, flipping a coin, random number table: whatever
is appropriate
In this class we will use the computer

19
Sampling Statistics and Distributions

Once a sample is drawn, we summarize it with
sample statistics
The value of any summary statistic will vary from
sample to sample (a big problem, no?)
A sample statistic is itself a random variable
Hence, it has a theoretical probability
distribution called the sampling distribution
We can find the mean and standard deviation of
many random samples

20
Definition
21
Example

Suppose the long-run average of the number of
Medicare claims submitted per week to a regional
office is 62,000, and the standard deviation is
7,000.
If we assume that the weekly claims submissions
during a 4-week period constitute a random sample
of size 4, what are the expected value and
standard error of the average weekly number of
claims over a 4-week period?
NOTE: Standard error denotes the theoretically
derived standard deviation of the sampling
distribution of a statistic.
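
A short R sketch of the calculation, assuming the standard results that the sample mean has expected value µ and standard error σ/√n (the variable names are illustrative only):

```r
mu    <- 62000   # long-run mean weekly claims (from the example)
sigma <- 7000    # long-run standard deviation (from the example)
n     <- 4       # four weeks treated as a random sample

expected_value <- mu                # E(sample mean) = mu
standard_error <- sigma / sqrt(n)   # SE(sample mean) = sigma / sqrt(n)
expected_value   # 62000
standard_error   # 3500
```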

22
Standard Error

Standard Deviation of the statistic
Is interpreted just as you would any standard
deviation
Indicates approximately how far the observed
value of the statistic is from its mean
Literally, it indicates the standard deviation
you would find if you took a very large number of
samples, found the sample average for each one,
and worked with these sample averages as a data
set

23
Example

Suppose n = 200 randomly selected shoppers
interviewed in a mall say they plan to spend
an average of $19.42 today with a standard
deviation of $8.63
This tells you what shoppers typically plan to
spend, and that a typical, individual shopper
plans to spend about $8.63 more or less than this
amount
So far, this is no more than a description of the
individuals interviewed
We can say something about the unknown population
mean, which is the mean amount that all shoppers
in the mall today plan to spend, including those
not interviewed.
What is the standard error of the mean?
This tells us the variability when we use the
sample average of $19.42 as an estimate of the
unknown population mean
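
As a rough sketch in R, the standard error of the mean can be estimated by s/√n, using the sample standard deviation in place of the unknown population value (an assumption that is reasonable for a sample this large):

```r
n    <- 200     # shoppers interviewed
xbar <- 19.42   # sample mean planned spending, in dollars
s    <- 8.63    # sample standard deviation, in dollars

se <- s / sqrt(n)   # estimated standard error of the sample mean
round(se, 2)        # about 0.61 dollars
```

So the individual shoppers vary by about $8.63, but the sample average as an estimate of the population mean varies by only about $0.61.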

24
Sampling Distributions for Means and Sums

If a population distribution is Normal, then the
sampling distribution of sample means is also
Normal
Example: A timber company is planning to harvest
400 trees from a very large stand.
Each tree's yield is determined by its diameter
Distribution of diameters is normal with mean 44
inches and standard deviation of 4 inches
Find the probability that the average diameter of
the harvest trees is between 43.5 and 44.5
inches.
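
One way to work this in R, assuming the 400 trees behave like a random sample so that the sample mean is Normal with mean 44 and standard error 4/√400:

```r
mu    <- 44    # population mean diameter, inches
sigma <- 4     # population standard deviation, inches
n     <- 400   # trees harvested

se <- sigma / sqrt(n)   # 0.2 inches
p_between <- pnorm(44.5, mean = mu, sd = se) - pnorm(43.5, mean = mu, sd = se)
round(p_between, 4)     # about 0.9876
```

The limits are 2.5 standard errors on either side of the mean, which is why the probability is so close to 1.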

25
Example

It's OK if each beer isn't exactly 12 oz so long
as the average volume isn't too low or too high.
In your production facility, you know that the
volume of each beer follows a Normal
distribution, has a standard deviation of 0.5
ounces, representing variability about their mean
of 12.01 oz.
Any case (24 beers) that has an average volume
per beer less than 11.75 ounces will be rejected.
What fraction of cases will be rejected this way?
First find the mean and standard deviation of the
average of n = 24 beers
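
A sketch of the two steps in R, assuming the 24 beers in a case are independent draws from the stated Normal distribution:

```r
mu    <- 12.01   # mean volume per beer, ounces
sigma <- 0.5     # standard deviation per beer, ounces
n     <- 24      # beers per case

mean_of_avg <- mu                 # mean of the case average
se_of_avg   <- sigma / sqrt(n)    # standard deviation of the case average, about 0.102
p_reject    <- pnorm(11.75, mean = mean_of_avg, sd = se_of_avg)
p_reject                          # roughly 0.005, i.e. about half a percent of cases
```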

26
Central Limit Theorem

For any population, the sampling distribution of
the sample mean is approximately normal if the
sample size is sufficiently large

27
Simulation Example

Use R to draw 1000 samples each, with sample
sizes 4, 10, 30, and 60 from a highly
right-skewed distribution having mean and
standard deviation both equal to 1.
Display a histogram of the sample means
data <- numeric(0)
for (i in 1:1000) data[i] <- mean(rexp(4))
hist(data)
What type of process might follow this
distribution???
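
The loop above covers only n = 4. A sketch that repeats it for all four sample sizes is given here; it assumes rexp() with its default rate of 1 as the right-skewed population (mean and standard deviation both 1), and the object names are illustrative.

```r
set.seed(42)
sizes <- c(4, 10, 30, 60)

# One column per sample size; each column holds 1000 sample means
means <- sapply(sizes, function(n) replicate(1000, mean(rexp(n))))
colnames(means) <- paste0("n=", sizes)

round(colMeans(means), 3)        # each should be close to the population mean, 1
round(apply(means, 2, sd), 3)    # each should be close to 1 / sqrt(n)
```

Calling hist() on each column in turn shows the skewness fading as n grows, which is the Central Limit Theorem at work.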

28
Example of Use

An agency of the Commerce Department in a certain
state wishes to check the accuracy of weights in
supermarkets
They decide to weigh 9 packages of ground meat
labeled as 1 pound packages
They will investigate any supermarket where the
average weight of the packages is less than 15.5
oz
Assuming that the standard deviation of package
weights is 0.6 oz, what is the probability they
will investigate an honest market?
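
A possible R calculation, assuming an honest market has a true mean of 16 oz and that the average of 9 packages is approximately Normal (with n this small, that relies on individual weights being roughly Normal themselves):

```r
mu    <- 16     # true mean weight at an honest market, ounces (assumed)
sigma <- 0.6    # standard deviation of package weights, ounces
n     <- 9      # packages weighed

se <- sigma / sqrt(n)                         # 0.2 ounces
p_investigate <- pnorm(15.5, mean = mu, sd = se)
round(p_investigate, 4)                       # about 0.0062
```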

29
Normal Probability Plot

Plots actual versus expected values, assuming a
normal distribution
Nearly normal data will plot as a near straight
line
Right-skewed data plot as a curve, with the slope
getting steeper as one moves to the right
Left-skewed data plot as a curve, with the slope
getting flatter as one moves to the right
Symmetric but outlier-prone data plot as an
S-shape, with the slope steepest at both sides

30
R Examples

# Do not worry about the R commands
data <- rnorm(1000)               # normal data
hist(data)
qqnorm(data)
qqline(data)
data <- rexp(1000)                # right-skewed data
hist(data)
qqnorm(data)
qqline(data)
data <- 1 - rlnorm(1000) + 30     # left-skewed data
hist(data)
qqnorm(data)
qqline(data)
data <- rnorm(1000)               # normal data with two outliers
data[1] <- 5
data[2] <- 7
hist(data)
qqnorm(data)
qqline(data)

31
Point and Interval Estimation

Chapter 7

32
Review

Basic problem of statistical theory is how to
infer a population or process value given only
sample data
Any sample statistic will vary from sample to
sample
Any sample statistic will differ from the true,
population value
Must consider random error in sample statistic
estimation

33
Chapter Goals

Summarize sample data
Choosing an estimator
Unbiased estimator
Constructing confidence intervals for means with
known standard deviation
Constructing confidence intervals for
proportions
Determining how large a sample is needed
Constructing confidence intervals when standard
deviation is not known
Understanding key assumptions underlying
confidence interval methods

34
Reminder Statistical Inference

Problem of Inferential Statistics
Make inferences about one or more population
parameters based on observable sample data
Forms of Inference
Point estimation: single best guess regarding a
population parameter
Interval estimation: specifies a reasonable
range for the value of the parameter
Hypothesis testing: isolating a particular
possible value for the parameter and testing if
this value is plausible given the available data

35
Point Estimators

Computing a single statistic from the sample data
to estimate a population parameter
Choosing a point estimator
What is the shape of the distribution?
Do you suspect outliers exist?
Plausible choices
Mean
Median
Mode
Trimmed Mean

36
Technical Definitions
37
Example

I used R to draw 1,000 samples, each of size 30,
from a normally distributed population having
mean 50 and standard deviation 10.
For each sample the mean and median are
computed.
data.mean <- numeric(0)
data.median <- numeric(0)
for (i in 1:1000) {
  data <- rnorm(30, mean = 50, sd = 10)
  data.mean[i] <- mean(data)
  data.median[i] <- median(data)
}
Do these statistics appear unbiased?
Which is more efficient?
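
One way to answer both questions from the simulation is to compare the average and the spread of the two sets of estimates. The sketch below reruns the loop with a fixed seed (the seed and the summary lines are additions for illustration):

```r
set.seed(7)
reps <- 1000
data.mean   <- numeric(reps)
data.median <- numeric(reps)
for (i in 1:reps) {
  data <- rnorm(30, mean = 50, sd = 10)
  data.mean[i]   <- mean(data)
  data.median[i] <- median(data)
}

# Unbiased? Both averages should sit near the true value, 50
c(mean = mean(data.mean), median = mean(data.median))

# Efficient? The estimator with the smaller standard deviation wins
c(mean = sd(data.mean), median = sd(data.median))
```

For a Normal population the sample mean should show the smaller spread, close to 10/√30, while the median's spread is noticeably larger.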

38
Expressing Uncertainty
39
Confidence Interval

An interval with random endpoints which contains
the parameter of interest (in this case, µ) with
a pre-specified probability, denoted by 1 - α.
The confidence interval automatically provides a
margin of error to account for the sampling
variability of the sample statistic.
Example: A machine is supposed to fill 12 ounce
bottles of Guinness. To see if the machine is
working properly, we randomly select 100 bottles
recently filled by the machine, and find that the
average amount of Guinness is 11.95 ounces. Can
we conclude that the machine is not working
properly?

No! By simply reporting the sample mean, we are
neglecting the fact that the amount of beer
varies from bottle to bottle and that the value
of the sample mean depends on the luck of the
draw
It is possible that a value as low as 11.95 is
within the range of natural variability for the
sample mean, even if the average amount for all
bottles is in fact µ = 12 ounces.
Suppose we know from past experience that the
amounts of beer in bottles filled by the machine
have a standard deviation of σ = 0.05 ounces.
Since n = 100, we can assume (using the Central
Limit Theorem) that the sample mean is normally
distributed with mean µ (unknown) and standard
error σ/√n = 0.005
What does the Empirical Rule tell us about the
average volume of the sample mean?
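
A sketch of the Empirical Rule reasoning in R, using the rule-of-thumb multiplier 2 (rather than 1.96) for roughly 95% coverage:

```r
xbar  <- 11.95   # observed sample mean, ounces
sigma <- 0.05    # known standard deviation of fill amounts, ounces
n     <- 100     # bottles sampled

se    <- sigma / sqrt(n)   # 0.005 ounces
lower <- xbar - 2 * se     # 11.94
upper <- xbar + 2 * se     # 11.96
c(lower, upper)
(12 >= lower) && (12 <= upper)   # FALSE: 12 oz is outside the interval
```

With a standard error this small, a sample mean of 11.95 is ten standard errors below 12, so under these numbers the shortfall is far larger than sampling variability alone would explain.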

41
Why does it work?
42
Using the Empirical Rule Assuming Normality
43
Confidence Intervals

Statistics is never having to say you're
certain.
(Tee shirt, American Statistical Association).
Any sample statistic will vary from sample to
sample
Point estimates are almost inevitably in error to
some degree
Thus, we need to specify a probable range or
interval estimate for the parameter

44
Confidence Interval
45
Example

An airline needs an estimate of the average
number of passengers on a newly scheduled flight
Its experience is that data for the first month
of flights are unreliable, but thereafter the
passenger load settles down
The mean passenger load is calculated for the
first 20 weekdays of the second month after
initiation of this particular flight
If the sample mean is 112 and the population
standard deviation is assumed to be 25, find a
90% confidence interval for the true, long-run
average number of passengers on this flight
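
A sketch of the interval in R, assuming the usual known-sigma form: sample mean plus or minus z times σ/√n, with z taken from qnorm():

```r
xbar  <- 112    # sample mean passenger load
sigma <- 25     # assumed population standard deviation
n     <- 20     # weekdays sampled
conf  <- 0.90

z      <- qnorm(1 - (1 - conf) / 2)   # about 1.645
margin <- z * sigma / sqrt(n)         # about 9.2 passengers
ci     <- c(xbar - margin, xbar + margin)
round(ci, 1)                          # about 102.8 to 121.2
```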

46
Interpretation

The confidence level of the confidence interval
refers to the process of constructing confidence
intervals
Each particular confidence interval either does
or does not include the true value of the
parameter being estimated
We can't say that this particular estimate is
correct to within the error
So, we say that we have XX% confidence that the
population parameter is contained in the interval
Or, the interval is the result of a process that
in the long run has an XX% probability of being
correct
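
The long-run interpretation can be checked by simulation. The sketch below uses a hypothetical Normal population (mean 50, standard deviation 10, samples of size 30, all assumed values) and counts how often a 95% interval captures the true mean:

```r
set.seed(123)
mu <- 50; sigma <- 10; n <- 30   # hypothetical population and sample size
z  <- qnorm(0.975)               # z-value for 95% confidence

covered <- replicate(2000, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  half <- z * sigma / sqrt(n)
  (xbar - half <= mu) && (mu <= xbar + half)   # did this interval capture mu?
})
mean(covered)   # fraction of intervals containing mu; should be near 0.95
```

Each individual interval either contains µ or it does not; the 95% describes the fraction of intervals that succeed over many repetitions.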

47
Imagine Many Samples
48
Getting Realistic

The population standard deviation is rarely known
Usually both the mean and standard deviation must
be estimated from the sample
Estimate σ with s
However, with this added source of random error,
we need to handle this problem using the
t-distribution (later on)

49
Confidence Intervals for Proportions

We can also construct confidence intervals for
proportions of successes
Recall that the expected value and standard error
for the number of successes in a sample of size n
are nπ and √(nπ(1 - π)), where π is the population
proportion of successes
How can we construct a confidence interval for a
proportion?

50
Example

Suppose that in a sample of 2,200 households with
one or more television sets, 471 watch a
particular network's show at a given time.
Find a 95% confidence interval for the population
proportion of households watching this show.
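
A sketch in R using the large-sample (normal approximation) interval for a proportion, with the sample proportion substituted into the standard error:

```r
n     <- 2200   # households sampled
y     <- 471    # households watching the show
p_hat <- y / n  # sample proportion, about 0.214

z      <- qnorm(0.975)                        # about 1.96 for 95% confidence
se     <- sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error
ci     <- c(p_hat - z * se, p_hat + z * se)
round(ci, 3)                                  # about 0.197 to 0.231
```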

51
Example

The 1992 presidential election looked like a very
close three-way race at the time when news polls
reported that of 1,105 registered voters
surveyed:
Perot 33%
Bush 31%
Clinton 28%
Construct a 95% confidence interval for Perot.
What is the margin of error?
What happened here?
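
A sketch in R that computes the interval for Perot and, for comparison, the other two candidates. Treating each proportion separately with the normal approximation is a simplification adopted here for illustration.

```r
n     <- 1105
p_hat <- c(Perot = 0.33, Bush = 0.31, Clinton = 0.28)

z      <- qnorm(0.975)
margin <- z * sqrt(p_hat * (1 - p_hat) / n)   # margin of error per candidate
lower  <- p_hat - margin
upper  <- p_hat + margin
round(cbind(lower, p_hat, upper, margin), 3)
```

Perot's margin of error comes out near 2.8 percentage points, and the three intervals overlap, so the poll alone could not separate the candidates.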

52
Example

A survey conducted found that out of 800 people,
46% thought that Clinton's first approved budget
represented a major change in the direction of
the country.
Another 45% thought it did not represent a major
change.
Compute a 95% confidence interval for the percent
of people who had a positive response.
What is the margin of error?
Interpret

53
Choosing a Sample Size

Gathering information for a statistical study can
be expensive, time consuming, etc.
So, the question of how much information to gather
is very important
When considering a confidence interval for a
population mean µ, there are three quantities to
consider

54
Choosing a Sample Size (cont)

Tolerability Width: The margin of acceptable
error
±3%
±$10,000
Derive the required sample size using
Margin of error (tolerability width)
Level of Significance (z-value)
Standard deviation (given, assumed, or
calculated)

55
Example

Union officials are concerned about reports of
inferior wages being paid to employees of a
company under its jurisdiction
How large a sample is needed to obtain a 90%
confidence interval for the population mean
hourly wage µ with width equal to $1.00? Assume
that σ = 4.
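
A sketch of the sample size calculation in R. The helper sample_size_mean() is a hypothetical name, and because "width" can be read either as the total width of the interval or as the plus-or-minus margin, both readings are shown.

```r
# Sample size for a confidence interval on a mean with known sigma.
# margin is the plus-or-minus half-width of the interval.
sample_size_mean <- function(sigma, margin, conf = 0.90) {
  z <- qnorm(1 - (1 - conf) / 2)    # two-sided z-value, about 1.645 for 90%
  ceiling((z * sigma / margin)^2)   # round up to a whole observation
}

n_full_width <- sample_size_mean(sigma = 4, margin = 0.50)  # $1.00 read as total width
n_half_width <- sample_size_mean(sigma = 4, margin = 1.00)  # $1.00 read as plus-or-minus
n_full_width   # 174
n_half_width   # 44
```

Halving the margin roughly quadruples the required sample size, since n grows with the square of z·σ/margin.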

56
Example

A direct-mail company must determine its credit
policies very carefully.
The firm suspects that advertisements in a
certain magazine have led to an excessively high
rate of write-offs.
The firm wants to establish a 90% confidence
interval for this magazine's write-off proportion
that is accurate to ±2.0%
How many accounts must be sampled to guarantee
this goal?
If this many accounts are sampled and 10% of the
sampled accounts are determined to be write-offs,
what is the resulting 90% confidence interval?
What kind of difference do we see by using an
observed proportion over a conservative guess?
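
A sketch in R, assuming the accuracy target means a margin of plus or minus 0.02 and that "guarantee" calls for the conservative planning value of 0.5 for the proportion (the value that maximizes the standard error):

```r
z      <- qnorm(0.95)   # two-sided z-value for 90% confidence, about 1.645
margin <- 0.02          # target accuracy, read as plus-or-minus 2 percentage points

# Conservative sample size: a planning proportion of 0.5 gives the largest n
n_conservative <- ceiling(z^2 * 0.5 * 0.5 / margin^2)
n_conservative                                   # 1691

# Interval actually obtained if 10% of those accounts are write-offs
p_hat         <- 0.10
margin_actual <- z * sqrt(p_hat * (1 - p_hat) / n_conservative)
round(c(p_hat - margin_actual, p_hat + margin_actual), 3)   # about 0.088 to 0.112

# Sample size if 0.10 had been used as the planning value instead
n_planning <- ceiling(z^2 * p_hat * (1 - p_hat) / margin^2)
n_planning                                       # 609
```

The observed interval is narrower than the target because 0.10 is far from 0.5; planning with 0.10 would have required far fewer accounts, at the risk of missing the target if the true rate were higher.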