Review of Basic Statistical Concepts - PowerPoint PPT Presentation

Description:

Title: The Forecast Process, Data Considerations, and Model Selection Author: WIU Last modified by: WIU Created Date: 9/2/2003 3:43:05 PM Document presentation format – PowerPoint PPT presentation

Slides: 70
Provided by: wiu
Learn more at: http://faculty.wiu.edu
Transcript and Presenter's Notes

1
Review of Basic Statistical Concepts
  • Farideh Dehkordi-Vakil

2
Inferential Statistics
  • Introduction to Inference
  • The purpose of inference is to draw conclusions
    from data.
  • Conclusions take into account the natural
    variability in the data; therefore formal
    inference relies on probability to describe
    chance variation.
  • We will go over the two most prominent types of
    formal statistical inference:
  • Confidence intervals, for estimating the value of
    a population parameter.
  • Tests of significance, which assess the evidence
    for a claim.
  • Both types of inference are based on the sampling
    distribution of statistics.

3
Inferential Statistics
  • Since both methods of formal inference are based
    on sampling distributions, they require a
    probability model for the data.
  • The model is most secure and inference is most
    reliable when the data are produced by a properly
    randomized design.
  • When we use statistical inference we assume that
    the data come from a randomly selected sample or
    a randomized experiment.

4
Inferential Statistics
  • A market research firm interviews a random sample
    of 2500 adults. Result: 66% find shopping for
    clothes frustrating and time consuming.
  • That is the truth about the 2500 people in the
    sample.
  • What is the truth about almost 210 million
    American adults who make up the population?
  • Since the sample was chosen at random, it is
    reasonable to think that these 2500 people
    represent the entire population pretty well.

5
Inferential Statistics
  • Therefore, the market researchers turn the fact
    that 66% of the sample find shopping frustrating
    into an estimate that about 66% of all adults feel
    this way.
  • Using a fact about a sample to estimate the truth
    about the whole population is called statistical
    inference.
  • To think about inference, we must keep straight
    whether a number describes a sample or a
    population.

6
Inferential Statistics
  • Parameters and Statistics
  • A parameter is a number that describes the
    population.
  • A parameter is a fixed number, but in practice we
    do not know its value.
  • A statistic is a number that describes a sample.
  • The value of a statistic is known when we have
    taken a sample, but it can change from sample to
    sample.
  • We often use a statistic to estimate an unknown
    parameter.

7
Inferential Statistics
  • Changing consumer attitudes towards shopping are
    of great interest to retailers and makers of
    consumer goods.
  • One trend of concern to marketers is that fewer
    people enjoy shopping than in the past.
  • A market research firm conducts an annual survey
    of consumer attitudes.
  • The population is all U.S. residents aged 18 and
    over.

8
Example: Consumer attitude towards shopping
  • A recent survey asked a nationwide random sample
    of 2500 adults if they agreed or disagreed that
    "I like buying new clothes, but shopping is often
    frustrating and time consuming."
  • Of the respondents, 1650 said they agreed.
  • The proportion of the sample who agreed that
    clothes shopping is often frustrating is
    p̂ = 1650/2500 = 0.66.

9
Example: Consumer attitude towards shopping
  • The number p̂ = 0.66 is a statistic.
  • The corresponding parameter is the proportion
    (call it P) of all adult U.S. residents who would
    have said "agree" if asked the same question.
  • We don't know the value of the parameter P, so we
    use p̂ as its estimate.

10
Inferential Statistics
  • If the marketing firm took a second random sample
    of 2500 adults, the new sample would have
    different people in it.
  • It is almost certain that there would not be
    exactly 1650 positive responses.
  • That is, the value of p̂ will vary from sample
    to sample.
  • Random samples eliminate bias from the act of
    choosing a sample, but they can still be wrong
    because of the variability that results when we
    choose at random.

11
Inferential Statistics
  • The first advantage of choosing at random is that
    it eliminates bias.
  • The second advantage is that if we take lots of
    random samples of the same size from the same
    population, the variation from sample to sample
    will follow a predictable pattern.
  • All statistical inference is based on one idea:
    to see how trustworthy a procedure is, ask what
    would happen if we repeated it many times.

12
Inferential Statistics
  • Suppose that exactly 60% of adults find shopping
    for clothes frustrating and time consuming.
  • That is, the truth about the population is that
    P = 0.6.
  • What if we select an SRS of size 100 from this
    population and use the sample proportion p̂ to
    estimate the unknown value of the population
    proportion P?

13
Inferential Statistics
  • To answer this question:
  • Take a large number of samples of size 100 from
    this population.
  • Calculate the sample proportion p̂ for each
    sample.
  • Make a histogram of the values of p̂.
  • Examine the distribution displayed in the
    histogram for shape, center, and spread, as well
    as outliers or other deviations.

14
Inferential Statistics
  • The results of many SRSs have a regular pattern.
  • Here we draw 1000 SRSs of size 100 from the same
    population.
  • The histogram shows the distribution of the 1000
    sample proportions p̂.
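
The repeated-sampling experiment described on these slides can be sketched in Python. This is a minimal simulation, assuming the population is large enough that each respondent can be treated as an independent draw with P = 0.6; the seed, the helper name sample_proportion, and the numeric summary used in place of the histogram are choices made for this illustration and do not come from the slides.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

P_TRUE = 0.6        # population proportion assumed on the slides
N = 100             # size of each simple random sample
NUM_SAMPLES = 1000  # number of samples drawn

def sample_proportion(p, n):
    """Proportion of 'agree' answers in one simulated sample of size n."""
    return sum(random.random() < p for _ in range(n)) / n

p_hats = [sample_proportion(P_TRUE, N) for _ in range(NUM_SAMPLES)]

center = statistics.mean(p_hats)    # where the p-hat values are centered
spread = statistics.stdev(p_hats)   # how much they vary from sample to sample
theory_spread = (P_TRUE * (1 - P_TRUE) / N) ** 0.5  # sqrt(P(1 - P)/n)

print(f"mean of the p-hat values:  {center:.3f}")
print(f"stdev of the p-hat values: {spread:.3f} (theory: {theory_spread:.3f})")
```

The simulated values cluster around the true P, and their spread is close to the theoretical value sqrt(P(1 − P)/n), which is the regular pattern the histogram displays.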

15
Inferential Statistics
  • Sampling Distribution
  • The sampling distribution of a statistic is the
    distribution of values taken by the statistic in
    all possible samples of the same size from the
    same population.

16
Example: Mean income of American households
  • What is the mean income of households in the
    United States?
  • The Bureau of Labor Statistics contacted a random
    sample of 55,000 households in March 2001 for the
    Current Population Survey (CPS).
  • The mean income of the 55,000 households for the
    year 2000 was x̄ = $57,045.
  • $57,045 is a statistic that describes the CPS
    sample households.

17
Example: Mean income of American households
  • We use it to estimate an unknown parameter, the
    mean income of all 106 million American
    households.
  • We know that x̄ would take several different
    values if the Bureau of Labor Statistics had
    taken several samples in March 2001.
  • We also know that this sampling variability
    follows a regular pattern that can tell us how
    accurate the sample result is likely to be.
  • That pattern obeys the laws of probability.

18
Normal Density Curve
  • These density curves, called normal curves, are
  • Symmetric
  • Single peaked
  • Bell shaped
  • Normal curves describe normal distributions.

19
Normal Density Curve
  • The exact density curve for a particular normal
    distribution is described by giving its mean μ
    and its standard deviation σ.
  • The mean is located at the center of the
    symmetric curve and it is the same as the median.
  • The standard deviation σ controls the spread of a
    normal curve.

20
Normal Density Curve

21
The 68-95-99.7 Rule
  • Although there are many normal curves, they all
    have common properties. In particular, all Normal
    distributions obey the following rule.
  • In a normal distribution with mean μ and standard
    deviation σ:
  • About 68% of the observations fall within σ of
    the mean μ.
  • About 95% of the observations fall within 2σ of μ.
  • About 99.7% of the observations fall within 3σ of μ.
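
The three percentages in the rule can be checked numerically. The sketch below uses NormalDist from Python's standard statistics module; the helper name area_within is introduced only for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # the standard Normal curve

def area_within(k):
    """Area under the standard Normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s) of the mean: {area_within(k):.4f}")
```

Because any Normal distribution becomes standard Normal after standardizing, checking the standard curve is enough to confirm the rule for every μ and σ.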

22
The 68-95-99.7 Rule

23
The 68-95-99.7 Rule

24
Inferential Statistics
  • Standardizing and z-score
  • If x is an observation from a distribution that
    has mean μ and standard deviation σ, the
    standardized value of x is z = (x − μ)/σ.
  • A standardized value is often called a z-score.

25
Standard Normal Distribution
  • The standard Normal distribution is the Normal
    distribution N(0, 1) with mean μ = 0 and standard
    deviation σ = 1.

26
Standard Normal Distribution
  • If a variable x has any normal distribution N(μ,
    σ) with mean μ and standard deviation σ, then the
    standardized variable z = (x − μ)/σ
    has the standard Normal distribution.

27
The Standard Normal Table
  • Table A is a table of areas under the standard
    Normal curve. The table entry for each value z is
    the area under the curve to the left of z.

28
The Standard Normal Table
  • What is the area under the standard normal curve
    to the right of z = −2.15?
  • Compact notation: P(Z > −2.15)
  • P = 1 − P(Z < −2.15) = 1 − 0.0158 = 0.9842

29
The Standard Normal Table
  • What is the area under the standard normal curve
    between z = 0 and z = 2.3?
  • Compact notation: P(0 < Z < 2.3)
  • P = 0.9893 − 0.5 = 0.4893
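
The two table lookups above can be reproduced without a printed table. This sketch assumes Python's standard statistics.NormalDist, whose cdf method returns the area to the left of a value, the same quantity Table A tabulates.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal, N(0, 1)

# Area to the right of z = -2.15: one minus the area to its left.
area_right = 1 - Z.cdf(-2.15)

# Area between z = 0 and z = 2.30: difference of two left-hand areas.
area_between = Z.cdf(2.30) - Z.cdf(0)

print(f"P(Z > -2.15)    = {area_right:.4f}")
print(f"P(0 < Z < 2.30) = {area_between:.4f}")
```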

30
Example: Annual rate of return on stock indexes
  • The annual rate of return on stock indexes (which
    combine many individual stocks) is approximately
    Normal. Since 1954, the S&P 500 stock index has
    had a mean yearly return of about 12%, with a
    standard deviation of 16.5%. Take this Normal
    distribution to be the distribution of yearly
    returns over a long period. The market is down
    for the year if the return on the index is less
    than zero. In what proportion of years is the
    market down?

31
Example: Annual rate of return on stock indexes
  • State the problem
  • Call the annual rate of return for the S&P
    500 stock index x. The variable x has the N(12,
    16.5) distribution. We want the proportion of
    years with x < 0.
  • Standardize
  • Subtract the mean, then divide by the standard
    deviation, to turn x into a standard Normal z:
    x < 0 corresponds to
    z = (x − 12)/16.5 < (0 − 12)/16.5 = −0.73.
32
Example: Annual rate of return on stock indexes
  • Draw a picture to show the standard normal curve
    with the area of interest shaded.
  • Use the table
  • The proportion of observations less than
    −0.73 is 0.2327.
  • The market is down on an annual basis about
    23.27% of the time.

33
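Example: Annual rate of return on stock indexes
  • What percent of years have an annual return
    between 12% and 50%?
  • State the problem: we want the proportion of
    years with 12 ≤ x ≤ 50.
  • Standardize: (12 − 12)/16.5 ≤ z ≤ (50 − 12)/16.5,
    that is, 0 ≤ z ≤ 2.30.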

34
Example: Annual rate of return on stock indexes
  • Draw a picture.
  • Use the table.
  • The area between 0 and 2.30 is the area below
    2.30 minus the area below 0.
  • 0.9893 − 0.50 = 0.4893
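
Both stock-index questions can be answered directly from the N(12, 16.5) model. The sketch below is one way to do it with Python's standard statistics.NormalDist; it does not round z to two decimals first, so its answers differ slightly from the table-based figures on the slides.

```python
from statistics import NormalDist

returns = NormalDist(mu=12.0, sigma=16.5)  # yearly return, in percent

# Standardize the cutoff x = 0, as on the slide.
z_down = (0 - 12.0) / 16.5

# Proportion of years in which the market is down (x < 0).
p_down = returns.cdf(0)

# Proportion of years with a return between 12% and 50%.
p_between = returns.cdf(50) - returns.cdf(12)

print(f"z for x = 0:      {z_down:.2f}")
print(f"P(x < 0):         {p_down:.4f}")
print(f"P(12 <= x <= 50): {p_between:.4f}")
```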

35
Estimating with Confidence
  • Community banks are banks with less than a
    billion dollars of assets. There are
    approximately 7500 such banks in the United
    States. In many studies of the industry these
    banks are considered separately from banks that
    have more than a billion dollars of assets. The
    latter banks are called large institutions. The
    Community Bankers Council of the American Bankers
    Association (ABA) conducts an annual survey of
    community banks. For the 110 banks that make up
    the sample in a recent survey, the mean assets
    are x̄ = 220 (in millions of dollars). What can
    we say about μ, the mean assets of all community
    banks?

36
Estimating with Confidence
  • The sample mean x̄ is the natural estimator of
    the unknown population mean μ.
  • We know that x̄ is an unbiased estimator of μ.
  • The law of large numbers says that the sample
    mean must approach the population mean as the
    size of the sample grows.
  • Therefore, the value x̄ = 220 appears to be a
    reasonable estimate of the mean assets μ for all
    community banks.
  • But, how reliable is this estimate?

37
Estimating with Confidence
  • An estimate without an indication of its
    variability is of limited value.
  • Questions about the variation of an estimator are
    answered by looking at the spread of its sampling
    distribution.
  • According to the central limit theorem:
  • If the entire population of community bank assets
    has mean μ and standard deviation σ, then in
    repeated samples of size 110 the sample mean x̄
    approximately follows the N(μ, σ/√110)
    distribution.

38
Estimating with Confidence
  • Suppose that the true standard deviation σ is
    equal to the sample standard deviation s = 161.
  • This is not realistic, although it will give
    reasonably accurate results for samples as large
    as 100. Later on we will learn how to proceed
    when σ is not known.
  • Therefore, by the central limit theorem, in
    repeated sampling the sample mean x̄ is
    approximately normal, centered at the unknown
    population mean μ, with standard deviation
    σ/√n = 161/√110 ≈ 15.35 (millions of dollars).

39
Confidence Interval
  • A level C confidence interval for a parameter has
    two parts:
  • An interval calculated from the data, usually of
    the form
    estimate ± margin of error
  • A confidence level C, which gives the probability
    that the interval will capture the true parameter
    value in repeated samples.
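
The phrase "in repeated samples" can be made concrete with a small simulation. The sketch below is only an illustration: it assumes a Normal population whose true mean and standard deviation (220 and 161, borrowed from the bank example) are known to the simulation, which is never the case in practice, and it uses the interval x̄ ± z*σ/√n with z* taken from the standard Normal curve. The seed and the number of repetitions are arbitrary.

```python
import random
from statistics import NormalDist

random.seed(2)  # fixed seed so the sketch is repeatable

MU, SIGMA = 220.0, 161.0  # hypothetical "true" population values
N = 110                   # sample size
LEVEL = 0.95              # confidence level C
REPS = 5000               # number of repeated samples

# Critical value z* with area C between -z* and z*.
z_star = NormalDist().inv_cdf(0.5 + LEVEL / 2)
margin = z_star * SIGMA / N ** 0.5

hits = 0
for _ in range(REPS):
    x_bar = sum(random.gauss(MU, SIGMA) for _ in range(N)) / N
    if x_bar - margin <= MU <= x_bar + margin:
        hits += 1  # this interval captured the true mean

coverage = hits / REPS
print(f"share of intervals that captured the true mean: {coverage:.3f}")
```

The share of intervals that capture μ comes out close to the stated level C. Any single interval either contains μ or does not; the confidence level describes the long-run behavior of the procedure.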

40
Confidence Interval
  • We use the sampling distribution of the sample
    mean x̄ to construct a level C confidence
    interval for the mean μ of a population.
  • We assume that the data are an SRS of size n.
  • The sampling distribution is exactly N(μ, σ/√n)
    when the population has the N(μ, σ)
    distribution.
  • The central limit theorem says that this same
    sampling distribution is approximately correct
    for large samples whenever the population mean
    and standard deviation are μ and σ.

41
Confidence Interval for a Population Mean
  • Choose an SRS of size n from a population having
    unknown mean μ and known standard deviation σ. A
    level C confidence interval for μ is
    x̄ ± z* σ/√n
  • Here z* is the critical value with area C
    between −z* and z* under the standard Normal
    curve. The quantity z* σ/√n
    is the margin of error. The interval is exact
    when the population distribution is normal and is
    approximately correct when n is large in other
    cases.
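
As a worked instance of this formula, the sketch below computes a 95% interval for the mean assets of community banks from the figures given earlier (x̄ = 220, s = 161 used in place of σ, n = 110, all in millions of dollars). The function name z_confidence_interval is hypothetical, chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, level=0.95):
    """Level-C interval x_bar +/- z* sigma/sqrt(n), with sigma treated as known."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # area C between -z* and z*
    margin = z_star * sigma / sqrt(n)               # margin of error
    return x_bar - margin, x_bar + margin

low, high = z_confidence_interval(x_bar=220, sigma=161, n=110, level=0.95)
print(f"95% interval for mean assets: {low:.1f} to {high:.1f} (millions of dollars)")
```

Raising the confidence level widens the interval, while a larger sample narrows it, because n enters the margin of error only through √n.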

42
Example: Banks' loan-to-deposit ratio
  • The ABA survey of community banks also asked
    about the loan-to-deposit ratio (LTDR), a bank's
    total loans as a percent of its total deposits.
    The mean LTDR for the 110 banks in the sample is
    x̄ (value missing from this transcript), and the
    standard deviation is s = 12.3. This sample is
    sufficiently large for us to use s as the
    population σ here. Find a 95% confidence interval
    for the mean LTDR for community banks.

43
Tests of Significance
  • Confidence intervals are appropriate when our
    goal is to estimate a population parameter.
  • The second type of inference is directed at
    assessing the evidence provided by the data in
    favor of some claim about the population.
  • A significance test is a formal procedure for
    comparing observed data with a hypothesis whose
    truth we want to assess.
  • The hypothesis is a statement about the
    parameters in a population or model.
  • The results of a test are expressed in terms of a
    probability that measures how well the data and
    the hypothesis agree.

44
Example: Banks' net income
  • The community bank survey described previously
    also asked about net income and reported the
    percent change in net income between the first
    half of last year and the first half of this
    year. The mean change for the 110 banks in the
    sample is x̄ = 8.1%. Because the sample size
    is large, we are willing to use the sample
    standard deviation s = 26.4% as if it were the
    population standard deviation σ. The large sample
    size also makes it reasonable to assume that
    x̄ is approximately normal.

45
Example: Banks' net income
  • Is the 8.1% mean increase in a sample good
    evidence that the net income for all banks has
    changed?
  • The sample result might happen just by chance
    even if the true mean change for all banks is
    μ = 0.
  • To answer this question we ask another:
  • Suppose that the truth about the population is
    that μ = 0 (this is our hypothesis).
  • What is the probability of observing a sample
    mean at least as far from zero as 8.1%?

46
Example: Banks' net income
  • The answer is
  • Because this probability is so small, we see that
    the sample mean x̄ = 8.1 is incompatible with
    a population mean of μ = 0.
  • We conclude that the income of community banks
    has changed since last year.

47
Example: Banks' net income
  • The fact that the calculated probability is very
    small leads us to conclude that the average
    percent change in income is not in fact zero.
    Here is why.
  • If the true mean is μ = 0, we would see a sample
    mean as far away as 8.1 only six times per 10,000
    samples.
  • So there are only two possibilities:
  • μ = 0 and we have observed something very
    unusual, or
  • μ is not zero but has some other value that makes
    the observed data more probable.

48
Example: Banks' net income
  • We calculated a probability taking the first of
    these choices as true (μ = 0). That probability
    guides our final choice.
  • If the probability is very small, the data don't
    fit the first possibility and we conclude that
    the mean is not in fact zero.

49
Tests of Significance: Formal details
  • The first step in a test of significance is to
    state a claim that we will try to find evidence
    against.
  • Null Hypothesis H0
  • The statement being tested in a test of
    significance is called the null hypothesis.
  • The test of significance is designed to assess
    the strength of the evidence against the null
    hypothesis.
  • Usually the null hypothesis is a statement of no
    effect or no difference. We abbreviate null
    hypothesis as H0.

50
Tests of Significance: Formal details
  • A null hypothesis is a statement about a
    population, expressed in terms of some parameter
    or parameters.
  • The null hypothesis in our bank survey example is
  • H0: μ = 0
  • It is convenient also to give a name to the
    statement we hope or suspect is true instead of
    H0.
  • This is called the alternative hypothesis and is
    abbreviated as Ha.
  • In our bank survey example the alternative
    hypothesis states that the percent change in net
    income is not zero. We write this as
  • Ha: μ ≠ 0

51
Tests of Significance: Formal details
  • Since Ha expresses the effect that we hope to
    find evidence for, we often begin with Ha and then
    set up H0 as the statement that the hoped-for
    effect is not present.
  • Stating Ha is not always straightforward.
  • It is not always clear whether Ha should be
    one-sided or two-sided.
  • The alternative Ha: μ ≠ 0 in the bank net income
    example is two-sided.
  • In any given year, income may increase or
    decrease, so we include both possibilities in the
    alternative hypothesis.

52
Tests of Significance: Formal details
  • Test statistics
  • We will learn the form of significance tests in a
    number of common situations. Here are some
    principles that apply to most tests and that help
    in understanding the form of tests:
  • The test is based on a statistic that estimates
    the parameter appearing in the hypotheses.
  • Values of the estimate far from the parameter
    value specified by H0 give evidence against H0.

53
Example: Banks' income
  • The test statistic
  • In our banking example the null hypothesis is
    H0: μ = 0, and a sample gave x̄ = 8.1.
    The test statistic for this problem is the
    standardized version of x̄:
    z = (x̄ − 0)/(σ/√n)
  • This statistic is the distance between the sample
    mean and the hypothesized population mean in the
    standard scale of z-scores.

54
Tests of Significance: Formal details
  • The test of significance assesses the evidence
    against the null hypothesis and provides a
    numerical summary of this evidence in terms of
    probability.
  • P-value
  • The probability, computed assuming that H0 is
    true, that the test statistic would take a value
    as extreme or more extreme than that actually
    observed is called the P-value of the test. The
    smaller the P-value, the stronger the evidence
    against H0 provided by the data.
  • To calculate the P-value, we must use the
    sampling distribution of the test statistic.

55
Example: Banks' income
  • The P-value
  • In our banking example we found that the test
    statistic for testing H0: μ = 0 versus Ha: μ ≠ 0
    is z = (8.1 − 0)/(26.4/√110) = 3.22.
  • If the null hypothesis is true, we expect z to
    take a value not far from 0.
  • Because the alternative is two-sided, values of z
    far from 0 in either direction count as evidence
    against H0. So the P-value is
    P = P(Z ≤ −3.22 or Z ≥ 3.22) = 2 P(Z ≥ 3.22).

56
Example: Banks' income
  • The P-value for banks' income.
  • The two-sided P-value is the probability (when H0
    is true) that x̄ takes a value at least as far
    from 0 as the actually observed value.
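
The bank income calculation can be carried out step by step in Python. This is a sketch using the figures from the example (x̄ = 8.1, s = 26.4 used as σ, n = 110) and the standard statistics.NormalDist; the variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu_0, sigma, n = 8.1, 0.0, 26.4, 110

# Standardized distance between the sample mean and the hypothesized mean.
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Area in one tail beyond |z|, and the two-sided P-value (both tails).
upper_tail = 1 - NormalDist().cdf(abs(z))
p_two_sided = 2 * upper_tail

print(f"z = {z:.2f}")
print(f"one-tail area      = {upper_tail:.4f}")
print(f"two-sided P-value  = {p_two_sided:.4f}")
```

The one-tail area is about 0.0006, which appears to be the figure behind the "six times per 10,000" quoted earlier; counting both tails, as the two-sided alternative requires, roughly doubles it. Either way the probability is far below the usual significance levels, so the conclusion of the example is unchanged.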

57
Tests of Significance: Formal details
  • We know that smaller P-values indicate stronger
    evidence against the null hypothesis.
  • But how strong is strong evidence?
  • One approach is to announce in advance how much
    evidence against H0 we will require to reject H0.
  • We compare the P-value with a level that says
    this evidence is strong enough.
  • The decisive level is called the significance
    level.
  • It is denoted by the Greek letter α.

58
Tests of Significance: Formal details
  • If we choose α = 0.05, we are requiring that the
    data give evidence against H0 so strong that
    it would happen no more than 5% of the time (1 in
    20) when H0 is true.
  • Statistical significance
  • If the P-value is as small as or smaller than α,
    we say that the data are statistically significant
    at level α.

59
Tests of Significance: Formal details
  • You need not actually find the P-value to assess
    significance at a fixed level α.
  • You need only compare the observed statistic z
    with a critical value that marks off area α in
    one or both tails of the standard Normal curve.
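
A short sketch of the critical-value approach follows. The helper critical_value is hypothetical, written for this illustration; for a two-sided test it splits α equally between the two tails. The observed z of about 3.22 follows from the bank income figures (x̄ = 8.1, s = 26.4, n = 110).

```python
from statistics import NormalDist

def critical_value(alpha, two_sided=True):
    """z value that marks off total area alpha in one or both tails."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)

z_two_sided = critical_value(0.05, two_sided=True)   # about 1.96
z_one_sided = critical_value(0.05, two_sided=False)  # about 1.645

z_obs = 3.22  # bank income example
significant = abs(z_obs) >= z_two_sided

print(f"two-sided critical value at alpha = 0.05: {z_two_sided:.3f}")
print(f"one-sided critical value at alpha = 0.05: {z_one_sided:.3f}")
print(f"|z| = {abs(z_obs)} significant at the 5% level: {significant}")
```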

60
Test for a Population Mean
  • There are four steps in carrying out a
    significance test:
  • State the hypotheses.
  • Calculate the test statistic.
  • Find the p-value.
  • State your conclusion in the context of your
    specific setting.

61
Test for a Population Mean
  • Once you have stated your hypotheses and
    identified the proper test, you can do steps 2
    and 3 by following a recipe. Here is the recipe:
  • We have an SRS of size n drawn from a normal
    population with unknown mean μ and known standard
    deviation σ. We want to test the hypothesis that
    μ has a specified value. Call the specified value
    μ0. The null hypothesis is
  • H0: μ = μ0

62
Test for a Population Mean
  • The test is based on the sample mean x̄.
    Because Normal calculations require standardized
    variables, we will use as our test statistic the
    standardized sample mean
    z = (x̄ − μ0)/(σ/√n)
  • This one-sample z statistic has the standard
    Normal distribution when H0 is true.
  • The P-value of the test is the probability that z
    takes a value at least as extreme as the value
    for our sample.
  • What counts as extreme is determined by the
    alternative hypothesis Ha.

63
Example: Blood pressures of executives
  • The medical director of a large company is
    concerned about the effects of stress on the
    company's younger executives. According to the
    National Center for Health Statistics, the mean
    systolic blood pressure for males 35 to 44 years
    of age is 128 and the standard deviation in this
    population is 15. The medical director examines
    the records of 72 executives in this age group
    and finds that their mean systolic blood pressure
    is x̄ (value missing from this transcript). Is
    this evidence that the mean blood pressure for
    all the company's young male executives is higher
    than the national average?

64
Example: Blood pressures of executives
  • Hypotheses
  • H0: μ = 128
  • Ha: μ > 128
  • Test statistic: z = (x̄ − 128)/(15/√72)
  • P-value: P = P(Z ≥ z) ≈ 0.14

65
Example: Blood pressures of executives
  • Conclusion
  • About 14% of the time, an SRS of size 72 from the
    general male population would have a mean blood
    pressure at least as high as that of the
    executive sample. The observed x̄ is not
    significantly higher than the national average.
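
The executives' sample mean is not preserved in this transcript, so the example cannot be recomputed directly. The sketch below instead works backward from the reported P-value of about 14%, using only the figures that are given (μ0 = 128, σ = 15, n = 72). The implied z and implied sample mean are rough back-calculations from a rounded percentage, not values taken from the source.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n = 128.0, 15.0, 72
p_reported = 0.14  # "about 14% of the time", from the conclusion slide

std_error = sigma / sqrt(n)  # standard deviation of the sample mean

# One-sided test (Ha: mu > 128): the P-value is the area to the right of z,
# so the z that matches a given P-value has area 1 - P to its left.
z_implied = NormalDist().inv_cdf(1 - p_reported)
x_bar_implied = mu_0 + z_implied * std_error  # back-calculated, approximate

# One-sided critical value at the 5% level, for comparison.
z_crit = NormalDist().inv_cdf(0.95)

print(f"standard error of the mean: {std_error:.3f}")
print(f"z implied by P = 0.14:      {z_implied:.2f}")
print(f"implied sample mean:        {x_bar_implied:.1f} (approximate)")
print(f"critical value at 5%:       {z_crit:.3f}")
```

The implied z falls well short of the 5% critical value, which is consistent with the slide's conclusion that the result is not statistically significant.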

66
The t-distribution
  • Suppose we have a simple random sample of size n
    from a Normally distributed population with mean
    μ and standard deviation σ.
  • The standardized sample mean, or one-sample z
    statistic, z = (x̄ − μ)/(σ/√n),
    has the standard Normal distribution N(0, 1).
  • When we substitute the standard error of the
    mean, s/√n, for σ/√n, the statistic does not
    have a Normal distribution.

67
The t-distribution
  • It has a distribution called the t distribution.
  • The t distribution
  • Suppose that an SRS of size n is drawn from an
    N(μ, σ) population. Then the one-sample t
    statistic t = (x̄ − μ)/(s/√n)
    has the t distribution with n − 1 degrees of
    freedom.
  • There is a different t distribution for each
    sample size.
  • A particular t distribution is specified by
    giving the degrees of freedom.

68
The t-distribution
  • We use t(k) to stand for the t distribution with
    k degrees of freedom.
  • The density curves of the t distributions are
    symmetric about 0 and are bell shaped.
  • The spread of a t distribution is a bit greater
    than that of the standard Normal distribution.
  • As the degrees of freedom k increase, the t(k)
    density curve approaches the N(0, 1) curve.
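
The extra spread of the t distributions can be seen in a simulation. The sketch below draws many Normal samples, computes the one-sample t statistic for each, and counts how often it lands beyond ±1.96, the cutoff that leaves 5% outside for the standard Normal curve. The sample sizes, seed, and repetition count are arbitrary choices for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(3)  # fixed seed so the sketch is repeatable

def t_statistic(n, mu=0.0, sigma=1.0):
    """One-sample t statistic for a simulated Normal sample of size n."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))  # sample std dev
    return (x_bar - mu) / (s / sqrt(n))

def share_beyond(n, cutoff=1.96, reps=20000):
    """Share of simulated t statistics farther than cutoff from 0."""
    return sum(abs(t_statistic(n)) > cutoff for _ in range(reps)) / reps

normal_share = 2 * (1 - NormalDist().cdf(1.96))  # about 0.05 for N(0, 1)
tails = {n: share_beyond(n) for n in (5, 30)}

print(f"N(0, 1):         {normal_share:.3f}")
for n, share in tails.items():
    print(f"t with {n - 1:2d} d.f.:  {share:.3f}")
```

With n = 5 (4 degrees of freedom) noticeably more than 5% of the t values fall beyond ±1.96, and with n = 30 the share is much closer to the Normal figure, matching the statement that t(k) approaches N(0, 1) as k grows.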

69
The One-Sample t Confidence Interval
  • Suppose that an SRS of size n is drawn from a
    population having unknown mean μ. A level C
    confidence interval for μ is
    x̄ ± t* s/√n
  • where t* is the value for the t(n − 1) density
    curve with area C between −t* and t*. The margin
    of error is t* s/√n.
  • This interval is exact when the population
    distribution is Normal and is approximately
    correct for large n in other cases.