Review of Top 10 Concepts in Statistics (reordered slightly for review the interactive session) - PowerPoint PPT Presentation

Slides: 112
Provided by: Lap5138
Learn more at: http://ocw.smithw.org
Transcript and Presenter's Notes


1
Review of Top 10 Concepts in Statistics (reordered
slightly for review the interactive session)
  • NOTE: This PowerPoint file is not an
    introduction, but rather a checklist of topics to
    review

2
Top Ten #10
  • Qualitative vs. Quantitative

3
Qualitative
  • Categorical data
  • success vs. failure
  • ethnicity
  • marital status
  • color
  • zip code
  • 4 star hotel in tour guide

4
Qualitative
  • If you need an average, do not calculate the
    mean
  • However, you can compute the mode (average
    person is married, buys a blue car made in
    America)

5
Quantitative
  • Two cases
  • Case 1: discrete
  • Case 2: continuous

6
Discrete
  • (1) integer values (0, 1, 2, ...)
  • (2) example: binomial
  • (3) finite number of possible values
  • (4) counting
  • (5) number of brothers
  • (6) number of cars arriving at gas station

7
Continuous
  • Real numbers, such as decimal values (22.22)
  • Examples: Z, t
  • Infinite number of possible values
  • Measurement
  • Miles per gallon, distance, duration of time

8
Graphical Tools
  • Pie chart or bar chart: qualitative
  • Joint frequency table: qualitative (relate
    marital status vs. zip code)
  • Scatter diagram: quantitative (distance from CSUN
    vs. duration of time to reach CSUN)

9
Hypothesis Testing / Confidence Intervals
  • Quantitative: Mean
  • Qualitative: Proportion

10
Top Ten #9
  • Population vs. Sample

11
Population
  • Collection of all items (all light bulbs made at
    factory)
  • Parameter: measure of population
  • (1) population mean (average number of hours in
    life of all bulbs)
  • (2) population proportion (% of all bulbs that
    are defective)

12
Sample
  • Part of population (bulbs tested by inspector)
  • Statistic: measure of sample = estimate of
    parameter
  • (1) sample mean (average number of hours in life
    of bulbs tested by inspector)
  • (2) sample proportion (% of bulbs in sample that
    are defective)

13
Top Ten #1
  • Descriptive Statistics

14
Measures of Central Location
  • Mean
  • Median
  • Mode

15
Mean
  • Population mean: µ = Σx/N = (5+1+6)/3 = 12/3 = 4
  • Algebra: Σx = Nµ = 3(4) = 12
  • Sample mean: x-bar = Σx/n
  • Example: the number of hours spent on the
    Internet: 4, 8, and 9
  • x-bar = (4+8+9)/3 = 7 hours
  • Do NOT use if the number of observations is small
    or with extreme values
  • Ex: Do NOT use if 3 houses were sold this week,
    and one was a mansion

16
Median
  • Median = middle value
  • Example: 5, 1, 6
  • Step 1: Sort data: 1, 5, 6
  • Step 2: Middle value = 5
  • When there is an even number of observations, the
    median is computed by averaging the two
    observations in the middle.
  • OK even if there are extreme values
  • Home sales: 100K, 200K, 900K, so
  • mean = 400K, but median = 200K

17
Mode
  • Mode = most frequent value
  • Ex: female, male, female
  • Mode = female
  • Ex: 1, 1, 2, 3, 5, 8
  • Mode = 1
  • It may not be a very good measure; see the
    following example

18
Measures of Central Location - Example
  • Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23
  • Sample mean: x-bar = Σx/n = 100/10 = 10
  • Median = (8+9)/2 = 8.5
  • Mode = 0
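
As a quick check of the three measures on this sample, here is a minimal Python sketch using the standard library's statistics module. The variable names are illustrative and not part of the original slides; the home-sales figures are the ones from the Median slide.

```python
import statistics

# Sample from the slide above.
sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

sample_mean = statistics.mean(sample)      # sum of values divided by n
sample_median = statistics.median(sample)  # average of the two middle values (n is even)
sample_mode = statistics.mode(sample)      # most frequent value

# Home sales (in $K) from the Median slide: one extreme value pulls the mean up.
home_sales = [100, 200, 900]
sales_mean = statistics.mean(home_sales)
sales_median = statistics.median(home_sales)

print(sample_mean, sample_median, sample_mode)
print(sales_mean, sales_median)
```

The mode of 0 sits at the edge of the data, which is why the slides call it a poor measure for this sample.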

19
Relationship
  • Case 1: if probability distribution is symmetric
    (ex. bell-shaped, normal distribution),
  • Mean = Median = Mode
  • Case 2: if distribution is positively skewed, to
    the right (ex. incomes of employees in a large firm:
    a large number of relatively low-paid workers and a
    small number of high-paid executives),
  • Mode < Median < Mean

20
Relationship contd
  • Case 3: if distribution is negatively skewed, to
    the left (ex. the time taken by students to write
    exams: few students hand in their exams early and
    the majority of students turn in their exams at the
    end of the exam period),
  • Mean < Median < Mode

21
Dispersion: Measures of Variability
  • How much spread of data
  • How much uncertainty
  • Measures
  • Range
  • Variance
  • Standard deviation

22
Range
  • Range = Max - Min ≥ 0
  • But range is affected by unusual values
  • Ex: Santa Monica has a high of 105 degrees and a
    low of 30 once a century, but range would be
    105 - 30 = 75

23
Standard Deviation (SD)
  • Better than range because all data are used
  • Population SD = square root of variance = sigma (σ)
  • SD ≥ 0

24
Empirical Rule
  • Applies to mound or bell-shaped curves
  • Ex normal distribution
  • 68% of data within one SD of mean
  • 95% of data within two SD of mean
  • 99.7% of data within three SD of mean
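
The three percentages can be checked against the standard normal curve. This is a minimal sketch using Python's statistics.NormalDist; the names are illustrative, and the check applies only to an exactly normal distribution, which is an assumption beyond the slide's "mound or bell-shaped" wording.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal population within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} SD: {share:.1%}")
```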

25
Standard Deviation = Square Root of Variance
26
Sample Standard Deviation
x       x - mean       (x - mean) squared
6       6 - 8 = -2     (-2)(-2) = 4
6       6 - 8 = -2     4
7       7 - 8 = -1     (-1)(-1) = 1
8       8 - 8 = 0      0
13      13 - 8 = 5     (5)(5) = 25
Sum = 40   Sum = 0     Sum = 34
Mean = 40/5 = 8
27
Standard Deviation
  • Total variation = 34
  • Sample variance = 34/(5-1) = 34/4 = 8.5
  • Sample standard deviation = square root of 8.5 = 2.9

28
Measures of Variability - Example
  • The hourly wages earned by a sample of five
    students are
  • $7, $5, $11, $8, and $6
  • Range = 11 - 5 = 6
  • Variance
  • Standard deviation
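
The variance and standard deviation for this example can be worked out the same way as in the table two slides earlier. The sketch below is one way to do it in Python, assuming the sample formulas (divide by n - 1); the helper name describe_spread is hypothetical.

```python
import statistics

def describe_spread(data):
    """Return (range, sample variance, sample standard deviation)."""
    data_range = max(data) - min(data)
    variance = statistics.variance(data)  # sum of squared deviations / (n - 1)
    std_dev = statistics.stdev(data)      # square root of the sample variance
    return data_range, variance, std_dev

table_data = [6, 6, 7, 8, 13]    # values from the Sample Standard Deviation table
hourly_wages = [7, 5, 11, 8, 6]  # wages from this example

table_range, table_var, table_sd = describe_spread(table_data)
wage_range, wage_var, wage_sd = describe_spread(hourly_wages)

print(table_var, round(table_sd, 1))
print(wage_range, wage_var, round(wage_sd, 2))
```

For the wages, the mean is 7.4 and the squared deviations sum to 21.2, so the sample variance is 21.2/4 = 5.3 and the standard deviation is about 2.30.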

29
Graphical Tools
  • Line chart: trend over time
  • Scatter diagram: relationship between two
    variables
  • Bar chart: frequency for each category
  • Histogram: frequency for each class of measured
    data (graph of frequency distribution)
  • Box plot: graphical display based on quartiles,
    which divide data into 4 parts

30
Top Ten #8
  • Variation Creates Uncertainty

31
No Variation
  • Certainty, exact prediction
  • Standard deviation = 0
  • Variance = 0
  • All data exactly the same
  • Example: all workers in a minimum-wage job

32
High Variation
  • Uncertainty, unpredictable
  • High standard deviation
  • Ex 1: Workers in downtown L.A. have variation
    between CEOs and garment workers
  • Ex 2: New York temperatures in spring range from
    below freezing to very hot

33
Comparing Standard Deviations
  • Temperature example
  • Beach city: small standard deviation (single
    temperature reading close to mean)
  • High desert city: high standard deviation (hot
    days, cool nights in spring)

34
Standard Error of the Mean
  • Standard deviation of the sample mean
  • = standard deviation / square root of n
  • Ex: standard deviation = 10, n = 4, so standard
    error of the mean = 10/2 = 5
  • Note that 5 < 10, so standard error < standard
    deviation.
  • As n increases, standard error decreases.
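
The formula is short enough to try with a few sample sizes. A minimal sketch follows; the function name standard_error is illustrative.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / square root of n."""
    return std_dev / math.sqrt(n)

# Slide example: standard deviation 10, n = 4.
print(standard_error(10, 4))

# The standard error shrinks as the sample size grows.
for n in (4, 25, 100):
    print(n, standard_error(10, n))
```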

35
Sampling Distribution
  • Expected value of sample mean = population mean,
    but an individual sample mean could be smaller or
    larger than the population mean
  • Population mean is a constant parameter, but
    sample mean is a random variable
  • Sampling distribution is distribution of sample
    means

36
Example
  • Mean age of all students in the building is
    population mean
  • Each classroom has a sample mean
  • Distribution of sample means from all classrooms
    is sampling distribution

37
Central Limit Theorem (CLT)
  • If population standard deviation is known,
    sampling distribution of sample means is
    approximately normal if n > 30
  • CLT applies even if original population is skewed
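
A small simulation can make the CLT claim concrete. The sketch below is illustrative only: the exponential population (mean 1, standard deviation 1), the sample size of 36, the number of samples, and the seed are all assumptions chosen for the demonstration, not values from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 36    # n > 30
NUM_SAMPLES = 5000  # number of sample means to collect

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)  # should be close to the population mean, 1
sd_of_means = statistics.stdev(sample_means)    # should be close to 1 / sqrt(36)

print(round(mean_of_means, 3), round(sd_of_means, 3))
```

Even though the population is strongly skewed, the sample means cluster around the population mean with a spread close to the standard error.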

38
Top Ten #5
  • Expected Value

39
Expected Value
  • Expected Value: E(x) = Σ x P(x)
  • = x1 P(x1) + x2 P(x2) + ...
  • Expected value is a weighted average, also a
    long-run average

40
Example
  • Find the expected age at high school graduation
    if 11 students were 17 years old, 80 were 18 years
    old, and 5 were 19 years old
  • Step 1: 11 + 80 + 5 = 96

41
Step 2
x     P(x)            x × P(x)
17    11/96 = .115    17(.115) = 1.955
18    80/96 = .833    18(.833) = 14.994
19    5/96 = .052     19(.052) = .988
                      E(x) = 17.937
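
The two steps above can be sketched in Python as a weighted average. The dictionary layout and variable names are illustrative. Without rounding each probability to three decimals, the result is 17.9375, which matches the table's 17.937 up to that rounding.

```python
# Number of graduates at each age, from the example.
graduates_by_age = {17: 11, 18: 80, 19: 5}

# Step 1: total number of graduates.
total = sum(graduates_by_age.values())

# Step 2: weight each age by its probability and add up.
expected_age = sum(age * count / total for age, count in graduates_by_age.items())

print(total, expected_age)
```
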
42
Top Ten #4
  • Linear Regression

43
Linear Regression
  • Regression equation: y-hat = b0 + b1 x
  • y-hat = dependent variable = predicted value
  • x = independent variable
  • b0 = y-intercept = predicted value of y if x = 0
  • b1 = slope = regression coefficient
  • = change in y per unit change in x

44
Slope vs Correlation
  • Positive slope (b1 > 0): positive correlation
    between x and y (y increases if x increases)
  • Negative slope (b1 < 0): negative correlation (y
    decreases if x increases)
  • Zero slope (b1 = 0): no correlation (predicted value
    for y is the mean of y), no linear relationship
    between x and y

45
Simple Linear Regression
  • Simple: one independent variable, one dependent
    variable
  • Linear: graph of regression equation is a straight
    line

46
Example
  • y = salary (female manager, in thousands of
    dollars)
  • x = number of children
  • n = number of observations

47
Given Data
x y
2 48
1 52
4 33

48
Totals
x          y
2          48
1          52
4          33          n = 3
Sum = 7    Sum = 133
49
Slope (b1) = -6.5
  • Method of Least Squares formulas not on BUS 302
    exam
  • b1 = -6.5 (given)

Interpretation: If one female manager has 1 more
child than another, her expected salary is $6,500
lower; that is, the salary of female managers is
expected to decrease by 6.5 (in thousands of
dollars) per child.
50
Intercept (b0)
  • b0 = y-bar - b1(x-bar) = 44.33 - (-6.5)(2.33) = 59.5
  • If number of children is zero, expected salary is
    $59,500

51
Regression Equation
  • y-hat = 59.5 - 6.5x
52
Forecast Salary If 3 Children
  • 59.5 - 6.5(3) = 40
  • $40,000 expected salary
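
The slides give the slope as a known value and skip the least-squares formulas. For readers who want to see where -6.5 and 59.5 come from, here is a sketch that assumes the usual simple-regression formulas (slope = sum of cross-deviations / sum of squared x-deviations). The variable and function names are illustrative.

```python
# Data from the Given Data slide: x = number of children, y = salary in $ thousands.
xs = [2, 1, 4]
ys = [48, 52, 33]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

def predict(x):
    """Predicted salary (in $ thousands) for x children."""
    return b0 + b1 * x

print(round(b1, 2), round(b0, 2), round(predict(3), 2))
```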

53
Standard Error of Estimate
54
Standard Error of Estimate
(1) x   (2) y   (3) 59.5 - 6.5x   (4) = (2) - (3)   (5) = (4) squared
2       48      46.5              1.5               2.25
1       52      53                -1                1
4       33      33.5              -.5               .25
                                                    SSE = 3.5
55
Standard Error of Estimate
Actual salary is typically about $1,900 away from the
expected salary
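
The formula behind the $1,900 figure did not survive on this slide. The sketch below assumes the usual simple-regression form, the square root of SSE/(n - 2), and reuses the regression equation and data from the earlier slides; the variable names are illustrative.

```python
import math

xs = [2, 1, 4]     # number of children
ys = [48, 52, 33]  # actual salary in $ thousands
b0, b1 = 59.5, -6.5  # regression equation from the earlier slides

# Column (4) of the table: actual minus predicted salary.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Column (5) summed: error sum of squares.
sse = sum(r ** 2 for r in residuals)

# Assumed formula: n - 2 degrees of freedom for simple linear regression.
std_error_of_estimate = math.sqrt(sse / (len(xs) - 2))

print(sse, round(std_error_of_estimate, 2))
```

In thousands of dollars this comes to about 1.87, which the slide rounds to $1,900.
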
56
Coefficient of Determination
  • R² = % of total variation in y that can be
    explained by variation in x
  • Measure of how closely the linear regression line
    fits the points in a scatter diagram
  • R² = 1: max. possible value = perfect linear
    relationship between y and x (straight line)
  • R² = 0: min. value = no linear relationship

57
Sources of Variation (V)
  • Total V = Explained V + Unexplained V
  • SS = Sum of Squares = V
  • Total SS = Regression SS + Error SS
  • SST = SSR + SSE
  • SSR = Explained V, SSE = Unexplained V

58
Coefficient of Determination
  • R² = SSR / SST
  • R² = 197 / 200.5 = .98
  • Interpretation: 98% of total variation in salary
    can be explained by variation in number of
    children

59
0 ≤ R² ≤ 1
  • R² = 0: No linear relationship since SSR = 0
    (explained variation = 0)
  • R² = 1: Perfect linear relationship since SSR = SST
    (unexplained variation = SSE = 0), but this does not
    prove cause and effect

60
R = Correlation Coefficient
  • Case 1: slope (b1) < 0
  • R < 0
  • R is the negative square root of the coefficient of
    determination

61
Our Example
  • Slope b1 = -6.5
  • R² = .98
  • R = -.99
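
To tie the pieces together, the sketch below recomputes R² and R for the salary example from SST and SSE, then gives R the sign of the slope. The names are illustrative. Computed without rounding, SSR and SST come out near 197.17 and 200.67 rather than the 197 and 200.5 shown earlier; R² still rounds to .98.

```python
import math

xs = [2, 1, 4]
ys = [48, 52, 33]
b0, b1 = 59.5, -6.5  # regression equation from the earlier slides

y_bar = sum(ys) / len(ys)

sst = sum((y - y_bar) ** 2 for y in ys)                        # total variation
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))    # unexplained variation
ssr = sst - sse                                                # explained variation

r_squared = ssr / sst

# R takes the sign of the slope.
r = math.copysign(math.sqrt(r_squared), b1)

print(round(r_squared, 2), round(r, 4))
```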

62
Case 2: Slope > 0
  • R is the positive square root of the coefficient of
    determination
  • Ex: R² = .49
  • R = .70
  • R has no interpretation
  • R overstates relationship

63
Caution
  • Nonlinear relationship (parabola, hyperbola, etc.)
    can NOT be measured by R²
  • In fact, you could get R² = 0 with a nonlinear
    graph on a scatter diagram

64
Summary: Correlation Coefficient
  • Case 1: If b1 > 0, R is the positive square root
    of the coefficient of determination
  • Ex 1: y = 4 + 3x, R² = .36, R = .60
  • Case 2: If b1 < 0, R is the negative square root
    of the coefficient of determination
  • Ex 2: y = 80 - 10x, R² = .49, R = -.70
  • NOTE! Ex 2 has the stronger relationship, as measured
    by the coefficient of determination

65
Extreme Values
  • R = 1: perfect positive correlation
  • R = -1: perfect negative correlation
  • R = 0: zero correlation

66
MS Excel Output
  • Correlation Coefficient (-0.9912): Note that you
    need to change the sign because the sign of the slope
    (b1) is negative (-6.5)
  • Also labeled in the output: Coefficient of
    Determination, Standard Error of Estimate,
    Regression Coefficient
67
Top Ten #6
  • What Distribution to Use?

68
Use Binomial Distribution If
  • Random variable (x) is number of successes in n
    trials
  • Each trial is success or failure
  • Independent trials
  • Constant probability of success (p) on each trial
  • Sampling with replacement (in practice, people
    may use binomial w/o replacement, but theory is
    with replacement)

69
Success vs. Failure
  • The binomial experiment can result in only one of
    two possible outcomes
  • Male vs. Female
  • Defective vs. Non-defective
  • Yes or No
  • Pass (8 or more right answers) vs. Fail (fewer
    than 8)
  • Buy drink (21 or over) vs. Cannot buy drink

70
Binomial Is Discrete
  • Integer values
  • 0, 1, 2, ..., n
  • Binomial is often skewed, but may be symmetric
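
The point about skew versus symmetry can be seen by listing the probabilities for a small n. The slides do not give the binomial formula; the sketch below uses the standard one, and n = 4 and the two values of p are arbitrary illustrative choices.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 4
symmetric = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]  # p = .5
skewed = [binomial_pmf(k, n, 0.1) for k in range(n + 1)]     # p = .1

print([round(prob, 4) for prob in symmetric])
print([round(prob, 4) for prob in skewed])
```

With p = .5 the probabilities mirror each other around n/2; with p = .1 most of the probability sits at 0 and 1 successes, so the distribution is skewed to the right.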

71
Normal Distribution
  • Continuous, bell-shaped, symmetric
  • Mean = median = mode
  • Measurement (dollars, inches, years)
  • Cumulative probability under normal curve: use Z
    table if you know population mean and population
    standard deviation
  • Sample mean: use Z table if you know population
    standard deviation and either normal population
    or n > 30

72
t Distribution
  • Continuous, mound-shaped, symmetric
  • Applications similar to normal
  • More spread out than normal
  • Use t if normal population but population
    standard deviation not known
  • Degrees of freedom: df = n - 1 if estimating the
    mean of one population
  • t approaches z as df increases

73
Normal or t Distribution?
  • Use t table if normal population but population
    standard deviation (σ) is not known
  • If you are given the sample standard deviation
    (s), use t table, assuming normal population

74
Top Ten #3
  • Confidence Intervals: Mean and Proportion

75
Confidence Interval
  • A confidence interval is a range of values within
    which the population parameter is expected to
    occur.

76
Factors for Confidence Interval
  • The factors that determine the width of a
    confidence interval are
  1. The sample size, n
  2. The variability in the population, usually
    estimated by standard deviation.
  3. The desired level of confidence.

77
Confidence Interval: Mean
  • Use normal distribution (Z table) if
  • population standard deviation (sigma) known and
    either (1) or (2)
  • (1) Normal population
  • (2) Sample size > 30

78
Confidence Interval: Mean
  • If normal table, then
  • x-bar ± Z (σ / √n)

79
Normal Table
  • Tail = .5(1 - confidence level)
  • NOTE! Different statistics texts have different
    normal tables
  • This review uses the tail of the bell curve
  • Ex: 95% confidence, tail = .5(1 - .95) = .025
  • Z.025 = 1.96

80
Example
  • n = 49, Σx = 490, σ = 2, 95% confidence
  • x-bar = 490/49 = 10
  • 10 ± 1.96(2/√49) = 10 ± .56
  • 9.44 < µ < 10.56

81
Another Example
  • One of the SOM professors wants to estimate the mean
    number of hours worked per week by students. A
    sample of 49 students showed a mean of 24 hours.
    It is assumed that the population standard
    deviation is 4 hours. What is the population
    mean?

82
Another Example contd
  • 95 percent confidence interval for the
    population mean:
  • 24 ± 1.96(4/√49) = 24 ± 1.12

The confidence limits range from 22.88 to 25.12.
We estimate with 95 percent confidence that the
average number of hours worked per week by
students lies between these two values.
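
Both confidence-interval examples for the mean can be reproduced with a few lines of Python. This sketch assumes the Z-based interval, sample mean ± Z × σ/√n, with Z taken from the tail area as on the Normal Table slide; the function name z_interval is hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    tail = 0.5 * (1 - confidence)             # area in one tail, e.g. .025
    z = NormalDist().inv_cdf(1 - tail)        # e.g. about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# First example: n = 49, sum of x = 490, sigma = 2.
low1, high1 = z_interval(490 / 49, 2, 49, 0.95)

# Second example: hours worked per week, mean 24, sigma 4, n = 49.
low2, high2 = z_interval(24, 4, 49, 0.95)

print(round(low1, 2), round(high1, 2))
print(round(low2, 2), round(high2, 2))
```
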
83
Confidence Interval: Mean (t distribution)
  • Use if normal population but population standard
    deviation (σ) not known
  • If you are given the sample standard deviation
    (s), use t table, assuming normal population
  • If one population, n-1 degrees of freedom

84
Confidence Interval: Mean (t distribution)
85
Confidence Interval: Proportion
  • Use if success or failure
  • (ex: defective or not defective,
    satisfactory or unsatisfactory)
  • Normal approximation to binomial OK if
  • (n)(p) > 5 and (n)(1-p) > 5, where
  • n = sample size
  • p = population proportion
  • NOTE: NEVER use the t table for a proportion!!

86
Confidence Interval: Proportion
  • Ex: 8 defectives out of 100, so p = .08 and
  • n = 100, 95% confidence

87
Confidence Interval: Proportion
  • A sample of 500 people who own their house
    revealed that 175 planned to sell their homes
    within five years. Develop a 98% confidence
    interval for the proportion of people who plan to
    sell their house within five years.
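
The worked answers for the two proportion examples did not survive in this transcript. The sketch below shows one way to compute them, assuming the usual normal-approximation interval, p ± Z × √(p(1 - p)/n), with the sample proportion standing in for p. The function name proportion_interval is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - 0.5 * (1 - confidence))
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# 8 defectives out of 100, 95% confidence.
defect_low, defect_high = proportion_interval(8, 100, 0.95)

# 175 of 500 homeowners plan to sell, 98% confidence.
sell_low, sell_high = proportion_interval(175, 500, 0.98)

print(round(defect_low, 3), round(defect_high, 3))
print(round(sell_low, 2), round(sell_high, 2))
```

Under these assumptions the defective rate is estimated at roughly 2.7% to 13.3%, and the share planning to sell at roughly 30% to 40%.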

88
Interpretation
  • If 95% confidence, then 95% of all confidence
    intervals will include the true population
    parameter
  • NOTE! Never use the term probability when
    estimating a parameter!! (ex: Do NOT say
    "Probability that population mean is between 23
    and 32 is .95" because a parameter is not a random
    variable. In fact, the population mean is a fixed
    but unknown quantity.)

89
Point vs Interval Estimate
  • Point estimate: statistic (single number)
  • Ex: sample mean, sample proportion
  • Each sample gives a different point estimate
  • Interval estimate: range of values
  • Ex: Population mean = sample mean ± error
  • Parameter = statistic ± error

90
Width of Interval
  • Ex: sample mean = 23, error = 3
  • Point estimate = 23
  • Interval estimate = 23 ± 3, or (20, 26)
  • Width of interval = 26 - 20 = 6
  • Wide interval: point estimate unreliable

91
Wide Confidence Interval If
  • (1) small sample size (n)
  • (2) large standard deviation
  • (3) high confidence level (ex: 99% confidence
    interval is wider than 95% confidence interval)
  • If you want narrow interval, you need a large
    sample size or small standard deviation or low
    confidence level.

92
Top Ten #7
  • P-value

93
P-value
  • P-value = probability of getting a sample
    statistic as extreme as (or more extreme than) the
    sample statistic you got from your sample, given
    that the null hypothesis is true

94
P-value Example: one-tail test
  • H0: µ = 40
  • HA: µ > 40
  • Sample mean = 43
  • P-value = P(sample mean > 43, given H0 true)
  • Meaning: probability of observing a sample mean
    as large as 43 when the population mean is 40
  • How to use it: Reject H0 if p-value < α
    (significance level)

95
Two Cases
  • Suppose α = .05
  • Case 1: suppose p-value = .02, then reject H0
    (unlikely H0 is true; you believe population mean
    > 40)
  • Case 2: suppose p-value = .08, then do not reject
    H0 (H0 may be true; you have reason to believe
    that the population mean may be 40)

96
P-value Example: two-tail test
  • H0: µ = 70
  • HA: µ ≠ 70
  • Sample mean = 72
  • If two-tails, then P-value =
  • 2 × P(sample mean > 72) = 2(.04) = .08
  • If α = .05, p-value > α, so do not reject H0

97
Top Ten #2
  • Hypothesis Testing

98
H0: Null Hypothesis
  • Population mean = µ
  • Population proportion = p
  • A statement about the value of a population
    parameter
  • Never include sample statistic (such as, x-bar)
    in hypothesis

99
HA or H1: Alternative Hypothesis
  • ONE TAIL ALTERNATIVE
  • Right tail: µ > number (smog check)
  • p > fraction (defectives)
  • Left tail: µ < number (weight in box of crackers)
  • p < fraction (unpopular President's
    approval low)

100
One-Tailed Tests
  • A test is one-tailed when the alternate
    hypothesis, H1 or HA, states a direction, such as
  • H1: The mean yearly salary earned by full-time
    employees is more than $45,000. (µ > 45,000)
  • H1: The average speed of cars traveling on the
    freeway is less than 75 miles per hour. (µ < 75)
  • H1: Less than 20 percent of the customers pay
    cash for their gasoline purchase. (p < 0.2)

101
Two-Tail Alternative
  • Population mean not equal to number (too hot or
    too cold)
  • Population proportion not equal to fraction (%
    alcohol too weak or too strong)

102
Two-Tailed Tests
  • A test is two-tailed when no direction is
    specified in the alternate hypothesis
  • H1: The mean amount of time spent on the
    Internet is not equal to 5 hours. (µ ≠ 5)
  • H1: The mean price for a gallon of gasoline is
    not equal to $2.54. (µ ≠ 2.54)

103
Reject Null Hypothesis (H0) If
  • Absolute value of test statistic > critical
    value
  • Reject H0 if |Z value| > critical Z
  • Reject H0 if |t value| > critical t
  • Reject H0 if p-value < significance level (alpha)
  • Note that direction of inequality is reversed!
  • Reject H0 if very large difference between sample
    statistic and population parameter in H0

Test statistic: A value, determined from sample
information, used to determine whether or not to
reject the null hypothesis.
Critical value: The dividing point between the
region where the null hypothesis is rejected and
the region where it is not rejected.
104
Example: Smog Check
  • H0: µ = 80
  • HA: µ > 80
  • If test statistic = 2.2 and critical value = 1.96,
    reject H0, and conclude that the population mean
    is likely > 80
  • If test statistic = 1.6 and critical value =
    1.96, do not reject H0, and reserve judgment
    about H0

105
Type I vs Type II Error
  • Alpha = α = P(type I error) = significance level
    = probability that you reject a true null hypothesis
  • Beta = β = P(type II error) = probability you do
    not reject a null hypothesis, given H0 false
  • Ex: H0: Defendant innocent
  • α = P(jury convicts innocent person)
  • β = P(jury acquits guilty person)

106
Type I vs Type II Error
                    H0 true                          H0 false
Reject H0           Alpha = α = P(type I error)      1 - β (Correct Decision)
Do not reject H0    1 - α (Correct Decision)         Beta = β = P(type II error)
107
Example: Smog Check
  • H0: µ = 80
  • HA: µ > 80
  • If p-value = 0.01 and alpha = 0.05, reject H0,
    and conclude that the population mean is likely >
    80
  • If p-value = 0.07 and alpha = 0.05, do not reject
    H0, and reserve judgment about H0

108
Test Statistic
  • When testing for the population mean from a large
    sample and the population standard deviation is
    known, the test statistic is given by
  • z = (x-bar - µ) / (σ / √n)

109
Example
  • The processors of Best Mayo indicate on the
    label that the bottle contains 16 ounces of mayo.
    The standard deviation of the process is 0.5
    ounces. A sample of 36 bottles from last hour's
    production showed a mean weight of 16.12 ounces
    per bottle. At the .05 significance level, can
    we conclude that the mean amount per bottle is
    greater than 16 ounces?

110
Example contd
  • 1. State the null and the alternative hypotheses
  • H0: µ = 16, H1: µ > 16
  • 2. Select the level of significance. In this
    case, we selected the .05 significance level.
  • 3. Identify the test statistic. Because we know
    the population standard deviation, the test
    statistic is z.
  • 4. State the decision rule.
  • Reject H0 if z > 1.645 (= z.05)

111
Example contd
  • 5. Compute the value of the test statistic
  • z = (16.12 - 16.00) / (0.5 / √36) = 1.44
  • 6. Conclusion: Do not reject the null hypothesis,
    since 1.44 < 1.645.
    We cannot conclude the mean is greater than 16
    ounces.
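
The six steps for the mayo example can be sketched in Python as follows. The variable names are illustrative, and the p-value at the end is an addition to the slide's critical-value approach, included only to connect this example with the earlier P-value slides.

```python
import math
from statistics import NormalDist

# Values from the example.
hypothesized_mean = 16.0  # H0: mu = 16
sample_mean = 16.12
sigma = 0.5               # population (process) standard deviation
n = 36
alpha = 0.05

# Step 5: test statistic.
z = (sample_mean - hypothesized_mean) / (sigma / math.sqrt(n))

# Step 4: right-tail critical value for the chosen significance level.
critical_z = NormalDist().inv_cdf(1 - alpha)

# Step 6: decision by critical value, and the matching one-tail p-value.
reject_h0 = z > critical_z
p_value = 1 - NormalDist().cdf(z)

print(round(z, 2), round(critical_z, 3), reject_h0, round(p_value, 3))
```

The p-value of about .075 exceeds .05, so both decision rules agree: do not reject H0.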