Statistics

Provided by: sandyb2

1
Statistics Data Analysis
  • Course Number: B01.1305
  • Course Section: 31
  • Meeting Time: Wednesday, 6:00-8:50 pm

CLASS 5
2
Class 5 Outline
  • Understand random sampling and systematic bias
  • Derive theoretical distribution of summary
    statistics
  • Understand the Central Limit Theorem
  • Use a normal probability plot to assess normality

3
Review of Last Class
  • Special Distributions
  • Counting problems
  • Binomial distribution problems
  • Normal distribution problems

4
CHAPTER 6
  • Random Sampling and Sampling Distributions

5
Chapter Goals
  • Explain why in many situations a sample is the
    only way to learn something about a population
  • Explain the various methods of selecting a
    sample
  • Define and construct sampling distribution of
    sample means
  • Understand sources of bias or under-representation
    in data

6
A Scenario
  • It's 9:00 AM on Wednesday and your boss sent you
    an email asking how your firm's customers would
    react to a new price discounting program
  • Your report is due tomorrow
  • It takes 10 minutes to interview a single
    customer in your database of almost 2,000
  • What will you do????
  • Draw a sample of the customers
  • How will you draw the sample?
  • Need a representative sample
  • Does your database hold a representative sample???

7
Background
  • Some previous chapters emphasized methods for
    describing data
  • Created frequency distributions, computed
    averages and measures of dispersion
  • Started to lay foundation for inference by
    studying probability
  • Counting, Binomial, and Normal Distributions
  • Probability distributions encompass all possible
    outcomes of an experiment and the probability
    associated with each outcome
  • So far, we've learned how to describe something
    that has already occurred or evaluate something
    that might occur

8
How are these similar?
  • QC department needs to check the tensile strength
    of steel wire
  • Five small pieces are selected every 5 hours
  • Tensile strength of each piece is determined
  • Marketing needs to determine the sales potential
    of a new drug named HappyPill.
  • 452 consumers were asked to try it for a week
  • Each consumer completed a questionnaire
  • A polling agency selected 2,000 voters at random
    and asked for their approval rating of the President
  • In the study of insider trading, 25 CEOs were
    identified by the SEC and their trades were
    monitored for three years

9
Why Sample???
  • Destructive nature of some tests
  • Physical Impossibility of checking all items
  • Cost of studying all items
  • Adequacy of sample results
  • Contacting whole population would be too
    time-consuming

10
Types of Samples
  • Cross-sectional samples are taken from an
    underlying population at a particular time
  • Time-series samples are taken over time from a
    random process
  • Enumerative studies: sampling from a
    well-defined population
  • Analytic studies: looking at the results of a
    random process to predict future behavior

11
Why Sample???
  • We often need to know something about a large
    population.
  • What is the average income of all Stern
    students?
  • It's often too expensive and time-consuming to
    examine the entire population
  • Solution: Choose a small random sample and use
    the methods of statistical inference to draw
    conclusions about the population
  • Sampling lets us dramatically cut the costs of
    gathering information, but requires care. We need
    to ensure that the sample is representative of
    the population of interest
  • But how can any small sample be completely
    representative?

12
Why Sample (cont.)
  • IT IS IMPORTANT TO REALIZE THAT SOME INFORMATION
    IS LOST IF WE ONLY EXAMINE A SAMPLE OF THE ENTIRE
    POPULATION
  • Why not just use the sample mean in place of µ?
  • For example, suppose that the average income of
    100 randomly selected Stern students was $62,154
  • Can we conclude that the average income of ALL
    Stern students (µ) is $62,154?
  • Can we conclude that µ > $60,000?
  • Fortunately, we can use probability theory to
    understand how the process of taking a random
    sample will blur the information in a population
  • But first, we need to understand why and how the
    information is blurred

13
Sampling Variability
  • Although the average income of all Stern Langone
    students is a fixed number, the average of a
    sample of 100 students depends on precisely which
    sample is taken. In other words, the sample mean
    is subject to sampling variability
  • The problem is that by reporting the sample mean
    alone, we don't take account of the variability
    caused by the sampling procedure. If we had
    polled different students, we might have gotten a
    different average income
  • It would be a serious mistake to ignore this
    sampling variability, and simply assume that the
    mean income of all students is the same as the
    average of the 100 incomes given in the sample

14
Populations and Samples
  • You are considering opening an Atomic Wings in
    Bethlehem, PA
  • POPULATION: All residents
  • SAMPLE:
  • Every 35th person at the mall
  • Every 2,000th person in the phone book
  • Every person who leaves Burger King
  • Don't forget to include the college students!!!

15
Choosing a Representative Sample
  • REPRESENTATIVE: Each characteristic occurs in
    the same percentage in the sample as
    in the population
  • BIAS: Not representative
  • Bias will exist if there is a systematic tendency
    to over/under represent some part of the
    population
  • By deliberately not sampling based on any
    specific characteristic, a randomly selected
    sample will typically be free from bias
  • Randomly selecting subjects lets you make
    probability statements about the results

16
Examples of Bias
  • Selection Bias
  • A telephone survey of households conducted
    entirely between 9 a.m. and 5 p.m.
  • Using a customer complaint database to query on
    the new discount program
  • Nonresponse Bias: Sample member refuses to
    participate
  • Every market research program
  • Operational Definitions: Guiding a response
  • Do you agree that taxes are too high in New York?

17
Simple Random Sampling
  • Process where each possible sample of a given
    size has the same probability of being selected
  • Example: IBM reported sales of $64.792 billion
    and a net loss of $2.827 billion for 1991.
  • The number of individual transactions was
    enormous
  • The auditors used statistics to choose a
    representative sample of transactions to check in
    detail

18
Choosing a Random Sample
  • Number every member in the population from 1 to N
  • Use a random process to select the sample
  • R, flipping a coin, a random number table: whatever
    is appropriate
  • In this class we will use the computer

19
Sampling Statistics and Distributions
  • Once a sample is drawn, we summarize it with
    sample statistics
  • The value of any summary statistic will vary from
    sample to sample (a big problem, no?)
  • A sample statistic is itself a random variable
  • Hence, it has a theoretical probability
    distribution called the sampling distribution
  • We can find the mean and standard deviation of
    many random samples

20
Definition
21
Example
  • Suppose the long-run average of the number of
    Medicare claims submitted per week to a regional
    office is 62,000, and the standard deviation is
    7,000.
  • If we assume that the weekly claims submissions
    during a 4-week period constitute a random sample
    of size 4, what are the expected value and
    standard error of the average weekly number of
    claims over a 4-week period?
  • NOTE: Standard error denotes the theoretically
    derived standard deviation of the sampling
    distribution of a statistic.
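
A short worked version of the Medicare question, assuming (as the slide does) that the four weeks behave like a random sample, so that the sample mean has expected value µ and standard error σ/√n. The variable names are chosen here for illustration.

```r
mu    <- 62000   # long-run mean of weekly claims
sigma <- 7000    # standard deviation of weekly claims
n     <- 4       # number of weeks in the sample

expected_mean  <- mu                # E(x-bar) = mu
standard_error <- sigma / sqrt(n)   # SE(x-bar) = sigma / sqrt(n)

c(expected = expected_mean, se = standard_error)   # 62000 and 3500
```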

22
Standard Error
  • Standard Deviation of the statistic
  • Is interpreted just as you would any standard
    deviation
  • Indicates approximately how far the observed
    value of the statistic is from its mean
  • Literally, it indicates the standard deviation
    you would find if you took a very large number of
    samples, found the sample average for each one,
    and worked with these sample averages as a data
    set

23
Example
  • Suppose n = 200 randomly selected shoppers
    interviewed in a mall say they plan to spend an
    average of $19.42 today, with a standard
    deviation of $8.63
  • This tells you what shoppers typically plan to
    spend, and that a typical individual shopper
    plans to spend about $8.63 more or less than this
    amount
  • So far, this is no more than a description of the
    individuals interviewed
  • We can say something about the unknown population
    mean, which is the mean amount that all shoppers
    in the mall today plan to spend, including those
    not interviewed.
  • What is the standard error of the mean?
  • This tells us the variability when we use the
    sample average of $19.42 as an estimate of the
    unknown population mean
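
One way to answer the standard error question, assuming the sample standard deviation s is used in place of the unknown population σ:

```r
n     <- 200     # shoppers interviewed
x_bar <- 19.42   # sample mean of planned spending (dollars)
s     <- 8.63    # sample standard deviation (dollars)

# Estimated standard error of the mean: s / sqrt(n)
se_mean <- s / sqrt(n)
round(se_mean, 2)   # about 0.61 dollars
```

The individual shoppers vary by roughly $8.63, but the average of 200 of them varies by only about $0.61 from sample to sample.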

24
Sampling Distributions for Means and Sums
  • If a population distribution is Normal, then the
    sampling distribution of sample means is also
    Normal
  • Example: A timber company is planning to harvest
    400 trees from a very large stand.
  • The yield of each tree is determined by its diameter
  • Distribution of diameters is normal with mean 44
    inches and standard deviation of 4 inches
  • Find the probability that the average diameter of
    the harvested trees is between 43.5 and 44.5
    inches.
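
A sketch of the timber calculation, treating the 400 trees as a random sample so that the sample mean is Normal with mean 44 and standard error 4/√400:

```r
mu    <- 44    # mean diameter (inches)
sigma <- 4     # standard deviation of diameters (inches)
n     <- 400   # trees harvested

se <- sigma / sqrt(n)   # 0.2 inches

# P(43.5 < x-bar < 44.5) under a Normal(mu, se) sampling distribution
prob <- pnorm(44.5, mean = mu, sd = se) - pnorm(43.5, mean = mu, sd = se)
round(prob, 4)          # about 0.9876
```

The limits sit 2.5 standard errors on either side of the mean, which is why the probability is so close to 1.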

25
Example
  • It's OK if each beer isn't exactly 12 oz so long
    as the average volume isn't too low or too high.
  • In your production facility, you know that the
    volume of each beer follows a Normal
    distribution, has a standard deviation of 0.5
    ounces, representing variability about their mean
    of 12.01 oz.
  • Any case (24 beers) that has an average volume
    per beer less than 11.75 ounces will be rejected.
  • What fraction of cases will be rejected this way?
  • First find the mean and standard deviation of the
    average of n = 24 beers
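
The two steps can be sketched as follows, assuming the 24 fills in a case are independent draws from the stated Normal distribution:

```r
mu    <- 12.01   # mean fill volume (oz)
sigma <- 0.5     # standard deviation of a single fill (oz)
n     <- 24      # beers per case

# Step 1: mean and standard deviation of the case average
mean_of_average <- mu
se_of_average   <- sigma / sqrt(n)   # about 0.102 oz

# Step 2: probability that the case average falls below 11.75 oz
p_reject <- pnorm(11.75, mean = mean_of_average, sd = se_of_average)
p_reject   # roughly half of one percent of cases
```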

26
Central Limit Theorem
  • For any population, the sampling distribution of
    the sample mean is approximately normal if the
    sample size is sufficiently large

27
Simulation Example
  • Use R to draw 1,000 samples for each of the sample
    sizes 4, 10, 30, and 60 from a highly
    right-skewed distribution having mean and
    standard deviation both equal to 1.
  • Display a histogram of the sample means
  • data <- numeric(0)
  • for (i in 1:1000) data[i] <- mean(rexp(4))
  • hist(data)
  • What type of process might follow this
    distribution???
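
The loop on the slide covers only n = 4. A possible extension to all four sample sizes is sketched below; the seed, the 2-by-2 plot layout, and the name `sim` are choices made here, not part of the original slide.

```r
set.seed(27)                 # arbitrary seed for reproducibility
sizes <- c(4, 10, 30, 60)    # sample sizes from the slide

# For each n, draw 1000 exponential samples (mean 1, sd 1) and keep the means.
sim <- lapply(sizes, function(n) replicate(1000, mean(rexp(n))))
names(sim) <- paste("n =", sizes)

# One histogram per sample size, arranged in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
for (label in names(sim)) {
  hist(sim[[label]], main = label, xlab = "sample mean")
}
par(old_par)

# Spread of the sample means shrinks roughly like 1 / sqrt(n).
sapply(sim, sd)
```

As n grows the histograms should look less skewed and more bell-shaped, which is the Central Limit Theorem at work.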

28
Example of Use
  • An agency of the Commerce Department in a certain
    state wishes to check the accuracy of weights in
    supermarkets
  • They decide to weigh 9 packages of ground meat
    labeled as 1 pound packages
  • They will investigate any supermarket where the
    average weight of the packages is less than 15.5
    oz
  • Assuming that the standard deviation of package
    weights is 0.6 oz, what is the probability they
    will investigate an honest market?
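
A sketch of the calculation, under two assumptions not spelled out on the slide: an honest market has a true mean weight of exactly 16 oz (1 pound), and the average of 9 packages is approximately Normal.

```r
mu        <- 16     # assumed true mean weight at an honest market (oz)
sigma     <- 0.6    # standard deviation of package weights (oz)
n         <- 9      # packages weighed
threshold <- 15.5   # investigate if the average is below this (oz)

se <- sigma / sqrt(n)   # 0.2 oz

# Probability that an honest market's average falls below the threshold
p_investigate <- pnorm(threshold, mean = mu, sd = se)
round(p_investigate, 4)   # about 0.0062
```

With only 9 packages the Central Limit Theorem gives limited help, so the Normal approximation here leans on package weights being roughly Normal to begin with.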

29
Normal Probability Plot
  • Plots actual versus expected values, assuming a
    normal distribution
  • Nearly normal data will plot as a near straight
    line
  • Right-skewed data plot as a curve, with the slope
    getting steeper as one moves to the right
  • Left-skewed data plot as a curve, with the slope
    getting flatter as one moves to the right
  • Symmetric but outlier-prone data plot as an
    S-shape, with the slope steepest at both sides

30
R Examples
  • data <- rnorm(1000)  # do not worry about the R commands
  • hist(data)
  • qqnorm(data)
  • qqline(data)
  • data <- rexp(1000)
  • hist(data)
  • qqnorm(data)
  • qqline(data)
  • data <- 1 - rlnorm(1000) + 30
  • hist(data)
  • qqnorm(data)
  • qqline(data)
  • data <- rnorm(1000); data[1] <- 5; data[2] <- 7
  • hist(data)
  • qqnorm(data)
  • qqline(data)

31
Point and Interval Estimation
  • Chapter 7

32
Review
  • Basic problem of statistical theory is how to
    infer a population or process value given only
    sample data
  • Any sample statistic will vary from sample to
    sample
  • Any sample statistic will differ from the true,
    population value
  • Must consider random error in sample statistic
    estimation

33
Chapter Goals
  • Summarize sample data
  • Choosing an estimator
  • Unbiased estimator
  • Constructing confidence intervals for means with
    known standard deviation
  • Constructing confidence intervals for
    proportions
  • Determining how large a sample is needed
  • Constructing confidence intervals when standard
    deviation is not known
  • Understanding the key assumptions underlying
    confidence interval methods

34
Reminder Statistical Inference
  • Problem of Inferential Statistics
  • Make inferences about one or more population
    parameters based on observable sample data
  • Forms of Inference
  • Point estimation: single best guess regarding a
    population parameter
  • Interval estimation: specifies a reasonable
    range for the value of the parameter
  • Hypothesis testing: isolating a particular
    possible value for the parameter and testing if
    this value is plausible given the available data

35
Point Estimators
  • Computing a single statistic from the sample data
    to estimate a population parameter
  • Choosing a point estimator
  • What is the shape of the distribution?
  • Do you suspect outliers exist?
  • Plausible choices
  • Mean
  • Median
  • Mode
  • Trimmed Mean

36
Technical Definitions
37
Example
  • I used R to draw 1,000 samples, each of size 30,
    from a normally distributed population having
    mean 50 and standard deviation 10.
  • For each sample the mean and median are
    computed.
  • data.mean <- numeric(0)
  • data.median <- numeric(0)
  • for (i in 1:1000) {
  •   data <- rnorm(30, mean = 50, sd = 10)
  •   data.mean[i] <- mean(data)
  •   data.median[i] <- median(data)
  • }
  • Do these statistics appear unbiased?
  • Which is more efficient?
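
One way to check both questions numerically is sketched below. It repeats the simulation with a fixed seed (an arbitrary choice made here) and then compares the centers and spreads of the two sets of estimates.

```r
set.seed(37)                 # arbitrary seed for reproducibility
reps <- 1000
data.mean   <- numeric(reps)
data.median <- numeric(reps)

for (i in 1:reps) {
  data <- rnorm(30, mean = 50, sd = 10)
  data.mean[i]   <- mean(data)
  data.median[i] <- median(data)
}

# Unbiasedness: both averages should sit close to the true value 50.
c(mean = mean(data.mean), median = mean(data.median))

# Efficiency: the estimator with the smaller spread is more efficient.
c(mean = sd(data.mean), median = sd(data.median))
```

For Normal data both estimators center on 50, but the sample mean typically shows the smaller standard deviation, so it is the more efficient of the two in this setting.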

38
Expressing Uncertainty
39
Confidence Interval
  • An interval with random endpoints which contains
    the parameter of interest (in this case, µ) with
    a pre-specified probability, denoted by 1 - α.
  • The confidence interval automatically provides a
    margin of error to account for the sampling
    variability of the sample statistic.
  • Example: A machine is supposed to fill 12 ounce
    bottles of Guinness. To see if the machine is
    working properly, we randomly select 100 bottles
    recently filled by the machine, and find that the
    average amount of Guinness is 11.95 ounces. Can
    we conclude that the machine is not working
    properly?

40
  • No! By simply reporting the sample mean, we are
    neglecting the fact that the amount of beer
    varies from bottle to bottle and that the value
    of the sample mean depends on the luck of the
    draw
  • It is possible that a value as low as 11.95 is
    within the range of natural variability for the
    sample mean, even if the average amount for all
    bottles is in fact µ = 12 ounces.
  • Suppose we know from past experience that the
    amounts of beer in bottles filled by the machine
    have a standard deviation of σ = 0.05 ounces.
  • Since n = 100, we can assume (using the Central
    Limit Theorem) that the sample mean is normally
    distributed with mean µ (unknown) and standard
    error 0.005
  • What does the Empirical Rule tell us about where
    the sample mean is likely to fall?
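
A rough sketch of the Empirical Rule reasoning for the Guinness example, using the observed sample mean of 11.95 ounces and the "about 95% within 2 standard errors" form of the rule:

```r
x_bar <- 11.95   # observed sample mean (oz)
sigma <- 0.05    # known standard deviation of fills (oz)
n     <- 100     # bottles sampled

se <- sigma / sqrt(n)   # 0.005 oz

# About 95% of sample means land within 2 standard errors of mu,
# so mu is plausibly within 2 standard errors of the observed mean.
lower <- x_bar - 2 * se
upper <- x_bar + 2 * se
c(lower = lower, upper = upper)   # roughly 11.94 to 11.96

# Distance between the target of 12 oz and the sample mean, in standard errors
(12 - x_bar) / se
```

Because 12 ounces lies far outside this range, the data do suggest a problem with the machine once sampling variability is taken into account.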

41
Why does it work?
42
Using the Empirical Rule Assuming Normality
43
Confidence Intervals
  • Statistics is never having to say you're
    certain.
  • (Tee shirt, American Statistical Association).
  • Any sample statistic will vary from sample to
    sample
  • Point estimates are almost inevitably in error to
    some degree
  • Thus, we need to specify a probable range or
    interval estimate for the parameter

44
Confidence Interval
45
Example
  • An airline needs an estimate of the average
    number of passengers on a newly scheduled flight
  • Its experience is that data for the first month
    of flights are unreliable, but thereafter the
    passenger load settles down
  • The mean passenger load is calculated for the
    first 20 weekdays of the second month after
    initiation of this particular flight
  • If the sample mean is 112 and the population
    standard deviation is assumed to be 25, find a
    90% confidence interval for the true, long-run
    average number of passengers on this flight
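
A sketch of the airline interval using the known-σ formula x̄ ± z·σ/√n, assuming the 20 weekday loads behave like a random sample with an approximately Normal mean:

```r
x_bar <- 112    # sample mean passenger load
sigma <- 25     # assumed population standard deviation
n     <- 20     # weekdays observed
conf  <- 0.90   # confidence level

z      <- qnorm(1 - (1 - conf) / 2)   # about 1.645
margin <- z * sigma / sqrt(n)

ci <- c(lower = x_bar - margin, upper = x_bar + margin)
round(ci, 1)    # about 102.8 to 121.2
```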

46
Interpretation
  • The confidence level of a confidence interval
    refers to the process of constructing confidence
    intervals
  • Each particular confidence interval either does
    or does not include the true value of the
    parameter being estimated
  • We can't say that this particular estimate is
    correct to within the error
  • So, we say that we have XX% confidence that the
    population parameter is contained in the interval
  • Or: the interval is the result of a process that
    in the long run has a XX% probability of being
    correct
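
The long-run reading can be illustrated by simulation. The sketch below invents a "true" mean of 112 and a Normal population purely for demonstration, builds a 90% interval from each of 10,000 samples, and counts how often the interval captures the true mean.

```r
set.seed(46)    # arbitrary seed for reproducibility
mu    <- 112    # hypothetical true mean (unknown in practice)
sigma <- 25     # known population standard deviation
n     <- 20     # sample size
z     <- qnorm(0.95)              # z-value for a 90% interval
margin <- z * sigma / sqrt(n)

# TRUE when the interval x-bar +/- margin contains mu
covered <- replicate(10000, {
  x_bar <- mean(rnorm(n, mean = mu, sd = sigma))
  abs(x_bar - mu) <= margin
})

mean(covered)   # proportion of intervals that contain mu, near 0.90
```

Any single interval either contains µ or it does not; the 90% describes the procedure's hit rate across repeated samples.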

47
Imagine Many Samples
48
Getting Realistic
  • The population standard deviation is rarely known
  • Usually both the mean and standard deviation must
    be estimated from the sample
  • Estimate σ with s
  • However, with this added source of random errors,
    we need to handle this problem using the
    t-distribution (later on)

49
Confidence Intervals for Proportions
  • We can also construct confidence intervals for
    proportions of successes
  • Recall that the expected value and standard error
    for the number of successes in a sample are np
    and √(np(1 - p)), where p is the population
    proportion of successes
  • How can we construct a confidence interval for a
    proportion?

50
Example
  • Suppose that in a sample of 2,200 households with
    one or more television sets, 471 watch a
    particular network's show at a given time.
  • Find a 95% confidence interval for the population
    proportion of households watching this show.
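
A sketch of the usual large-sample interval for a proportion, p̂ ± z·√(p̂(1 - p̂)/n), applied to the television example:

```r
n     <- 2200        # households sampled
x     <- 471         # households watching the show
p_hat <- x / n       # sample proportion, about 0.214

z      <- qnorm(0.975)                       # about 1.96 for 95% confidence
se     <- sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
margin <- z * se

ci <- c(lower = p_hat - margin, upper = p_hat + margin)
round(ci, 3)         # about 0.197 to 0.231
```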

51
Example
  • The 1992 presidential election looked like a very
    close three-way race at the time when news polls
    reported that of 1,105 registered voters
    surveyed:
  • Perot: 33%
  • Bush: 31%
  • Clinton: 28%
  • Construct a 95% confidence interval for Perot's
    support.
  • What is the margin of error?
  • What happened here?
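
A sketch of the margin-of-error calculation for the poll. It computes the interval for all three candidates at once (only Perot's is asked for) and treats the 1,105 voters as a simple random sample, which is an assumption about how the poll was run.

```r
n     <- 1105
p_hat <- c(Perot = 0.33, Bush = 0.31, Clinton = 0.28)

z      <- qnorm(0.975)                        # 95% confidence
margin <- z * sqrt(p_hat * (1 - p_hat) / n)   # one margin per candidate

round(rbind(lower    = p_hat - margin,
            estimate = p_hat,
            upper    = p_hat + margin), 3)
```

Perot's margin of error comes out a little under 3 percentage points, and his interval overlaps Bush's, which is what "a very close race" looks like in a poll of this size.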

52
Example
  • A survey found that out of 800 people,
    46% thought that Clinton's first approved budget
    represented a major change in the direction of
    the country.
  • Another 45% thought it did not represent a major
    change.
  • Compute a 95% confidence interval for the percent
    of people who had a positive response.
  • What is the margin of error?
  • Interpret

53
Choosing a Sample Size
  • Gathering information for a statistical study can
    be expensive, time consuming, etc.
  • So the question of how much information to gather
    is very important
  • When considering a confidence interval for a
    population mean µ, there are three quantities to
    consider

54
Choosing a Sample Size (cont)
  • Tolerability width: the margin of acceptable
    error
  • ±3
  • ±10,000
  • Derive the required sample size using
  • Margin of error (tolerability width)
  • Confidence level (z-value)
  • Standard deviation (given, assumed, or
    calculated)

55
Example
  • Union officials are concerned about reports of
    inferior wages being paid to employees of a
    company under its jurisdiction
  • How large a sample is needed to obtain a 90%
    confidence interval for the population mean
    hourly wage µ with width equal to $1.00? Assume
    that σ = $4.
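
A sketch of the sample size calculation n = (z·σ/E)², reading "width equal to 1.00" as the total width of the interval, so that the margin of error E is 0.50. That reading is an assumption; if 1.00 is meant as the margin itself, set E to 1.

```r
conf  <- 0.90
sigma <- 4       # assumed standard deviation of hourly wages
width <- 1.00    # desired total width of the interval
E     <- width / 2   # margin of error (half the width)

z <- qnorm(1 - (1 - conf) / 2)   # about 1.645

# Smallest n with z * sigma / sqrt(n) <= E, rounded up to a whole worker
n_required <- ceiling((z * sigma / E)^2)
n_required   # 174
```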

56
Example
  • A direct-mail company must determine its credit
    policies very carefully.
  • The firm suspects that advertisements in a
    certain magazine have led to an excessively high
    rate of write-offs.
  • The firm wants to establish a 90% confidence
    interval for this magazine's write-off proportion
    that is accurate to ± 2.0%
  • How many accounts must be sampled to guarantee
    this goal?
  • If this many accounts are sampled and 10% of the
    sampled accounts are determined to be write-offs,
    what is the resulting 90% confidence interval?
  • What kind of difference do we see by using an
    observed proportion over a conservative guess?
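
A sketch of both parts of the write-off example. The "guarantee" is read here as the conservative choice p = 0.5, which maximizes p(1 - p); the second part plugs in an observed proportion of 0.10.

```r
conf <- 0.90
E    <- 0.02                       # desired margin of error (2 points)
z    <- qnorm(1 - (1 - conf) / 2)  # about 1.6449

# Part 1: conservative sample size, using p = 0.5 (worst case)
n_required <- ceiling(z^2 * 0.5 * (1 - 0.5) / E^2)
n_required   # 1691 (the rounded table value z = 1.645 gives 1692)

# Part 2: interval when 10% of the sampled accounts are write-offs
p_hat      <- 0.10
half_width <- z * sqrt(p_hat * (1 - p_hat) / n_required)
ci <- c(lower = p_hat - half_width, upper = p_hat + half_width)
round(ci, 3)   # about 0.088 to 0.112
```

The resulting margin is about 1.2 points rather than 2, because an observed proportion near 0.10 has less variability than the worst case of 0.5 used for planning.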

57
Homework 5
  • Hildebrand/Ott
  • 6.4
  • 6.5
  • 6.8
  • 6.16
  • 6.17
  • 6.46
  • In (a) create a normal probability plot also and
    interpret
  • 7.1
  • 7.2
  • 7.14
  • 7.17
  • 7.18
  • 7.20
  • 7.21
  • 7.30
  • Read Chapter 11
  • Verzani