Review of Top 10 Concepts in Statistics (reordered slightly for review in the interactive session)

Slides: 112

Provided by: Lap5138

Learn more at: http://ocw.smithw.org

Transcript and Presenter's Notes

1
Review of Top 10 Concepts in Statistics (reordered slightly for review in the interactive session)

NOTE: This PowerPoint file is not an introduction, but rather a checklist of topics to review

2
Top Ten 10

Qualitative vs. Quantitative

3
Qualitative

Categorical data
success vs. failure
ethnicity
marital status
color
zip code
4 star hotel in tour guide

4
Qualitative

If you need an average, do not calculate the mean
However, you can compute the mode (the "average" person is married, buys a blue car made in America)

5
Quantitative

Two cases
Case 1: discrete
Case 2: continuous

6
Discrete

(1) integer values (0, 1, 2, ...)
(2) example: binomial
(3) finite (or countable) number of possible values
(4) counting
(5) number of brothers
(6) number of cars arriving at gas station

7
Continuous

Real numbers, such as decimal values (22.22)
Examples: Z, t
Infinite number of possible values
Measurement
Miles per gallon, distance, duration of time

8
Graphical Tools

Pie chart or bar chart: qualitative
Joint frequency table: qualitative (relate marital status vs zip code)
Scatter diagram: quantitative (distance from CSUN vs duration of time to reach CSUN)

9
Hypothesis Testing / Confidence Intervals

Quantitative: Mean
Qualitative: Proportion

10
Top Ten 9

Population vs. Sample

11
Population

Collection of all items (all light bulbs made at factory)
Parameter: measure of population
(1) population mean (average number of hours in life of all bulbs)
(2) population proportion (% of all bulbs that are defective)

12
Sample

Part of population (bulbs tested by inspector)
Statistic: measure of sample = estimate of parameter
(1) sample mean (average number of hours in life of bulbs tested by inspector)
(2) sample proportion (% of bulbs in sample that are defective)

13
Top Ten 1

Descriptive Statistics

14
Measures of Central Location

Mean
Median
Mode

15
Mean

Population mean: µ = Σx/N = (5+1+6)/3 = 12/3 = 4
Algebra: Σx = Nµ = 3(4) = 12
Sample mean: x-bar = Σx/n
Example: the number of hours spent on the Internet: 4, 8, and 9
x-bar = (4+8+9)/3 = 7 hours
Do NOT use the mean if the number of observations is small or there are extreme values
Ex: Do NOT use if 3 houses were sold this week, and one was a mansion

16
Median

Median = middle value
Example: 5, 1, 6
Step 1: Sort data: 1, 5, 6
Step 2: Middle value = 5
When there is an even number of observations, the median is computed by averaging the two observations in the middle.
OK even if there are extreme values
Home sales: 100K, 200K, 900K, so mean = 400K, but median = 200K

17
Mode

Mode = most frequent value
Ex: female, male, female
Mode = female
Ex: 1, 1, 2, 3, 5, 8
Mode = 1
It may not be a very good measure; see the following example

18
Measures of Central Location - Example

Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23
Sample mean: x-bar = Σx/n = 100/10 = 10
Median = (8+9)/2 = 8.5
Mode = 0
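
The three measures for this sample can be checked with a short Python sketch. It uses only the standard-library `statistics` module; the variable names are illustrative and not from the slides.

```python
import statistics

sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

mean = statistics.mean(sample)      # sum of the values divided by n
median = statistics.median(sample)  # average of the two middle values when n is even
mode = statistics.mode(sample)      # most frequent value

print(mean, median, mode)
```

Here the mode (0) sits far from the mean and the median, which is why it can be a poor measure of central location.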

19
Relationship

Case 1: if the probability distribution is symmetric (ex: bell-shaped, normal distribution),
Mean = Median = Mode
Case 2: if the distribution is positively skewed, that is, skewed to the right (ex: incomes of employees in a large firm: a large number of relatively low-paid workers and a small number of high-paid executives),
Mode < Median < Mean

20
Relationship contd

Case 3: if the distribution is negatively skewed, that is, skewed to the left (ex: the time taken by students to write exams: a few students hand in their exams early and the majority of students turn in their exam at the end of the exam),
Mean < Median < Mode

21
Dispersion Measures of Variability

How much spread of data
How much uncertainty
Measures
Range
Variance
Standard deviation

22
Range

Range = Max - Min ≥ 0
But range is affected by unusual values
Ex: Santa Monica has a high of 105 degrees and a low of 30 once a century, but range would be 105 - 30 = 75

23
Standard Deviation (SD)

Better than range because all data used
Population SD = Square root of variance = sigma (σ)
SD ≥ 0

24
Empirical Rule

Applies to mound or bell-shaped curves
Ex normal distribution
68% of data within one SD of mean
95% of data within two SD of mean
99.7% of data within three SD of mean

25
Standard Deviation = Square Root of Variance
26
Sample Standard Deviation
x          x - mean        (x - mean) squared
6          6 - 8 = -2      (-2)(-2) = 4
6          6 - 8 = -2      4
7          7 - 8 = -1      (-1)(-1) = 1
8          8 - 8 = 0       0
13         13 - 8 = 5      (5)(5) = 25
Sum = 40   Sum = 0         Sum = 34
Mean = 40/5 = 8
27
Standard Deviation

Total variation = 34
Sample variance = 34/4 = 8.5
Sample standard deviation = square root of 8.5 = 2.9
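
The worked table and the three results above can be reproduced step by step. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
import math

data = [6, 6, 7, 8, 13]
n = len(data)
mean = sum(data) / n                              # 40 / 5

deviations = [x - mean for x in data]             # each x minus the mean; these sum to 0
total_variation = sum(d * d for d in deviations)  # sum of squared deviations

sample_variance = total_variation / (n - 1)       # a sample divides by n - 1, not n
sample_sd = math.sqrt(sample_variance)

print(total_variation, sample_variance, round(sample_sd, 1))
```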

28
Measures of Variability - Example

The hourly wages earned by a sample of five students are
7, 5, 11, 8, and 6
Range = 11 - 5 = 6
Variance
Standard deviation
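
The transcript does not preserve the variance and standard deviation values for this example. As a sketch, they can be computed with Python's `statistics` module, assuming the sample (n - 1) formulas are intended since the wages are described as a sample.

```python
import statistics

wages = [7, 5, 11, 8, 6]

wage_range = max(wages) - min(wages)
variance = statistics.variance(wages)  # sample variance (n - 1 in the denominator)
std_dev = statistics.stdev(wages)      # square root of the sample variance

print(wage_range, variance, round(std_dev, 2))
```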

29
Graphical Tools

Line chart: trend over time
Scatter diagram: relationship between two variables
Bar chart: frequency for each category
Histogram: frequency for each class of measured data (graph of frequency distr.)
Box plot: graphical display based on quartiles, which divide data into 4 parts

30
Top Ten 8

Variation Creates Uncertainty

31
No Variation

Certainty, exact prediction
Standard deviation = 0
Variance = 0
All data exactly the same
Example: all workers in minimum wage job

32
High Variation

Uncertainty, unpredictable
High standard deviation
Ex 1: Workers in downtown L.A. have variation between CEOs and garment workers
Ex 2: New York temperatures in spring range from below freezing to very hot

33
Comparing Standard Deviations

Temperature Example
Beach city: small standard deviation (single temperature reading close to mean)
High Desert city: high standard deviation (hot days, cool nights in spring)

34
Standard Error of the Mean

Standard deviation of sample mean = standard deviation / square root of n
Ex: standard deviation = 10, n = 4, so standard error of the mean = 10/2 = 5
Note that 5 < 10, so standard error < standard deviation.
As n increases, standard error decreases.
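
A small sketch of the formula above. The function name `standard_error` and the extra sample sizes 25 and 100 are illustrative choices, not from the slides.

```python
import math

def standard_error(std_dev, n):
    """Standard deviation of the sample mean: std_dev / sqrt(n)."""
    return std_dev / math.sqrt(n)

# The slide's case (n = 4) plus two larger sample sizes to show the decrease.
for n in (4, 25, 100):
    print(n, standard_error(10, n))
```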

35
Sampling Distribution

Expected value of sample mean = population mean, but an individual sample mean could be smaller or larger than the population mean
Population mean is a constant parameter, but sample mean is a random variable
Sampling distribution is the distribution of sample means

36
Example

Mean age of all students in the building is
population mean
Each classroom has a sample mean
Distribution of sample means from all classrooms
is sampling distribution

37
Central Limit Theorem (CLT)

Sampling distribution of sample means is approximately normal if n > 30 (rule of thumb)
If the population standard deviation is also known, use the Z table for the sample mean
CLT applies even if original population is skewed
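
The CLT can be illustrated with a small simulation. The sketch below is not from the slides: it assumes an exponential population (mean 1, standard deviation 1, strongly right-skewed), a sample size of 40, and 2000 repeated samples, all arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 40              # sample size, above the n > 30 rule of thumb
num_samples = 2000  # number of repeated samples (arbitrary)

# Draw repeated samples from the skewed population and record each sample mean.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)  # should be close to the population mean, 1
sd_of_means = statistics.stdev(sample_means)   # should be close to 1 / sqrt(n)
expected_se = 1.0 / n ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is skewed.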

38
Top Ten 5

Expected Value

39
Expected Value

Expected Value: E(x) = Σ x P(x) = x1 P(x1) + x2 P(x2) + ...
Expected value is a weighted average, also a long-run average

40
Example

Find the expected age at high school graduation if 11 were 17 years old, 80 were 18 years old, and 5 were 19 years old
Step 1: 11 + 80 + 5 = 96

41
Step 2
x     P(x)            x × P(x)
17    11/96 = .115    17(.115) = 1.955
18    80/96 = .833    18(.833) = 14.994
19    5/96 = .052     19(.052) = .988
E(x) = 17.937
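
The two steps can be sketched in Python. The dictionary layout and names are illustrative. Because the code keeps the probabilities unrounded, it gives 17.9375 rather than the slide's 17.937, which comes from probabilities rounded to three decimals.

```python
# Number of graduates at each age, as given in the example.
counts = {17: 11, 18: 80, 19: 5}

total = sum(counts.values())                                    # Step 1
probabilities = {age: c / total for age, c in counts.items()}   # P(x) = count / total
expected_age = sum(age * p for age, p in probabilities.items()) # Step 2: sum of x * P(x)

print(total, expected_age)
```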
42
Top Ten 4

Linear Regression

43
Linear Regression

Regression equation: y-hat = b0 + b1 x
y-hat = dependent variable = predicted value
x = independent variable
b0 = y-intercept = predicted value of y if x = 0
b1 = slope = regression coefficient = change in y per unit change in x

44
Slope vs Correlation

Positive slope (b1 > 0): positive correlation between x and y (y increases if x increases)
Negative slope (b1 < 0): negative correlation (y decreases if x increases)
Zero slope (b1 = 0): no correlation (predicted value for y is mean of y), no linear relationship between x and y

45
Simple Linear Regression

Simple: one independent variable, one dependent variable
Linear: graph of regression equation is a straight line

46
Example

y = salary (female manager, in thousands of dollars)
x = number of children
n = number of observations

47
Given Data
x y
2 48
1 52
4 33

48
Totals
x          y
2          48
1          52
4          33          n = 3
Sum = 7    Sum = 133
49
Slope (b1) = -6.5

Method of Least Squares formulas not on BUS 302 exam
b1 = -6.5 given

Interpretation: If one female manager has 1 more child than another, salary is $6,500 lower; that is, salary of female managers is expected to decrease by 6.5 (in thousands of dollars) per child
50
Intercept (b0)

b0 = 44.33 - (-6.5)(2.33) = 59.5

If number of children is zero, expected salary is $59,500

51
Regression Equation

y-hat = 59.5 - 6.5x
52
Forecast Salary If 3 Children

59.5 - 6.5(3) = 40
$40,000 = expected salary
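
The slides give b1 = -6.5 without deriving it. As a sketch, the slope, intercept, and forecast can be reproduced from the three data points using the usual least-squares formulas (b1 = Sxy / Sxx and b0 = y-bar - b1 × x-bar). These formulas are standard but do not appear in the slides, and the function name `predict` is illustrative.

```python
x = [2, 1, 4]     # number of children
y = [48, 52, 33]  # salary in thousands of dollars
n = len(x)

x_bar = sum(x) / n  # 7/3, about 2.33
y_bar = sum(y) / n  # 133/3, about 44.33

# Least-squares slope: co-variation of x and y divided by variation of x.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

def predict(children):
    """Predicted salary (thousands of dollars) for a given number of children."""
    return b0 + b1 * children

print(round(b1, 2), round(b0, 2), round(predict(3), 2))
```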

53
Standard Error of Estimate
54
Standard Error of Estimate
(1) x   (2) y   (3) 59.5 - 6.5x   (4) = (2) - (3)   (4) squared
2       48      46.5              1.5               2.25
1       52      53                -1                1
4       33      33.5              -.5               .25
                                                    SSE = 3.5
55
Standard Error of Estimate
Actual salary is typically $1,900 away from expected salary
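
The formula slide for the standard error of estimate did not survive in this transcript. The sketch below assumes the standard formula for simple linear regression, the square root of SSE / (n - 2), which reproduces the "about 1.9 thousand dollars" figure.

```python
import math

x = [2, 1, 4]
y = [48, 52, 33]
b0, b1 = 59.5, -6.5  # intercept and slope from the example

# Residual = actual y minus predicted y for each observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(r * r for r in residuals)

# Assumed formula: divide by n - 2 because two coefficients (b0, b1) were estimated.
std_error_estimate = math.sqrt(sse / (len(x) - 2))

print(sse, round(std_error_estimate, 2))
```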
56
Coefficient of Determination

R² = % of total variation in y that can be explained by variation in x
Measure of how closely the linear regression line fits the points in a scatter diagram
R² = 1: max. possible value: perfect linear relationship between y and x (straight line)
R² = 0: min. value: no linear relationship

57
Sources of Variation (V)

Total V = Explained V + Unexplained V
SS = Sum of Squares = V
Total SS = Regression SS + Error SS
SST = SSR + SSE
SSR = Explained V, SSE = Unexplained V

58
Coefficient of Determination

R² = SSR / SST = 197 / 200.5 = .98
Interpretation: 98% of total variation in salary can be explained by variation in number of children
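
As a check, the sums of squares can be computed directly from the three data points. This is a sketch with illustrative variable names. The unrounded sums (SST about 200.67, SSR about 197.17) differ slightly from the rounded 200.5 and 197 shown on the slide, but R² still rounds to .98.

```python
x = [2, 1, 4]
y = [48, 52, 33]
b0, b1 = 59.5, -6.5  # regression line from the example

y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
ssr = sst - sse                                                # explained variation

r_squared = ssr / sst

print(round(sst, 2), round(ssr, 2), round(r_squared, 4))
```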

59
0 ≤ R² ≤ 1

0: No linear relationship since SSR = 0 (explained variation = 0)
1: Perfect relationship since SSR = SST (unexplained variation = SSE = 0), but does not prove cause and effect

60
R = Correlation Coefficient

Case 1: slope (b1) < 0
R < 0
R is the negative square root of the coefficient of determination

61
Our Example

Slope b1 = -6.5
R² = .98
R = -.99
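
The sign rule can be sketched as a small helper. The function name `correlation_from_r2` is hypothetical; it takes the square root of R² and copies the sign of the slope onto it.

```python
import math

def correlation_from_r2(r_squared, slope):
    """Square root of R squared, carrying the sign of the regression slope."""
    return math.copysign(math.sqrt(r_squared), slope)

print(round(correlation_from_r2(0.9826, -6.5), 4))  # the salary example, slope < 0
print(round(correlation_from_r2(0.49, 2.0), 2))     # a positive-slope case
```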

62
Case 2: Slope > 0

R is the positive square root of the coefficient of determination
Ex: R² = .49
R = .70
R has no interpretation
R overstates relationship

63
Caution

Nonlinear relationship (parabola, hyperbola, etc.) can NOT be measured by R²
In fact, you could get R² = 0 with a nonlinear graph on a scatter diagram
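
A minimal sketch of this caution, using made-up data that lie exactly on the parabola y = x². The points and the helper name `correlation` are assumptions for illustration. The relationship is perfect but not linear, so the correlation (and hence R²) is zero.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
ys = [v * v for v in xs]  # y is completely determined by x, but not linearly

r = correlation(xs, ys)
print(r, r * r)
```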

64
Summary Correlation Coefficient

Case 1: If b1 > 0, R is the positive square root of the coefficient of determination
Ex1: y = 4 + 3x, R² = .36, R = .60
Case 2: If b1 < 0, R is the negative square root of the coefficient of determination
Ex2: y = 80 - 10x, R² = .49, R = -.70
NOTE! Ex2 has the stronger relationship, as measured by coefficient of determination

65
Extreme Values

R = 1: perfect positive correlation
R = -1: perfect negative correlation
R = 0: zero correlation

66
MS Excel Output
Correlation Coefficient (-0.9912). Note that you need to change the sign of the value shown in the output because the sign of the slope (b1) is negative (-6.5)
Coefficient of Determination
Standard Error of Estimate
Regression Coefficient
67
Top Ten 6

What Distribution to Use?

68
Use Binomial Distribution If

Random variable (x) is number of successes in n
trials
Each trial is success or failure
Independent trials
Constant probability of success (p) on each trial
Sampling with replacement (in practice, people
may use binomial w/o replacement, but theory is
with replacement)

69
Success vs. Failure

The binomial experiment can result in only one of
two possible outcomes
Male vs. Female
Defective vs. Non-defective
Yes or No
Pass (8 or more right answers) vs. Fail (fewer
than 8)
Buy drink (21 or over) vs. Cannot buy drink

70
Binomial Is Discrete

Integer values
0, 1, 2, ..., n
Binomial is often skewed, but may be symmetric
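
A short sketch of a binomial calculation, loosely based on the pass/fail example (8 or more right answers). The quiz size of 10 questions and the guessing probability of 0.5 are hypothetical values chosen for illustration; they are not given in the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical: 10 true/false questions answered by guessing

# Pass = 8 or more right answers, so add up P(8), P(9), and P(10).
prob_pass = sum(binomial_pmf(k, n, p) for k in range(8, n + 1))

print(round(prob_pass, 4))
```

With p = 0.5 the distribution is symmetric; other values of p make it skewed.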

71
Normal Distribution

Continuous, bell-shaped, symmetric
Mean = median = mode
Measurement (dollars, inches, years)
Cumulative probability under normal curve: use Z table if you know population mean and population standard deviation
Sample mean: use Z table if you know population standard deviation and either normal population or n > 30

72
t Distribution

Continuous, mound-shaped, symmetric
Applications similar to normal
More spread out than normal
Use t if normal population but population
standard deviation not known
Degrees of freedom: df = n - 1 if estimating the mean of one population
t approaches z as df increases

73
Normal or t Distribution?

Use t table if normal population but population standard deviation (σ) is not known
If you are given the sample standard deviation (s), use t table, assuming normal population

74
Top Ten 3

Confidence Intervals: Mean and Proportion

75
Confidence Interval

A confidence interval is a range of values within
which the population parameter is expected to
occur.

76
Factors for Confidence Interval

The factors that determine the width of a
confidence interval are

The sample size, n
The variability in the population, usually
estimated by standard deviation.
The desired level of confidence.

77
Confidence Interval: Mean

Use normal distribution (Z table) if population standard deviation (sigma) is known and either (1) or (2):
(1) Normal population
(2) Sample size > 30

78
Confidence Interval: Mean

If normal table, then confidence interval = x-bar ± Z × σ/√n

79
Normal Table

Tail = .5(1 - confidence level)
NOTE! Different statistics texts have different normal tables
This review uses the tail of the bell curve
Ex: 95% confidence: tail = .5(1 - .95) = .025
Z.025 = 1.96
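
In place of a printed table, the critical Z value for a given confidence level can be looked up with the standard library's `statistics.NormalDist`. The function name `critical_z` is illustrative.

```python
from statistics import NormalDist

def critical_z(confidence):
    """Z value that leaves .5 * (1 - confidence) in the upper tail."""
    tail = 0.5 * (1 - confidence)
    return NormalDist().inv_cdf(1 - tail)  # standard normal: mean 0, SD 1

print(round(critical_z(0.95), 2))
```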

80
Example

n = 49, Σx = 490, σ = 2, 95% confidence
9.44 < µ < 10.56

81
Another Example

One of the SOM professors wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours. It is assumed that the population standard deviation is 4 hours. What is the population mean?

82
Another Example contd

95 percent confidence interval for the
population mean.

The confidence limits range from 22.88 to 25.12.
We estimate with 95 percent confidence that the
average number of hours worked per week by
students lies between these two values.
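
Both Z-interval examples (sample mean 10 with σ = 2, and sample mean 24 with σ = 4, each with n = 49) can be reproduced with a small helper. This is a sketch; the function name `z_interval` is illustrative, and it assumes the interval sample mean ± Z × σ/√n with σ known.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    z = NormalDist().inv_cdf(1 - 0.5 * (1 - confidence))  # upper-tail critical value
    margin = z * sigma / math.sqrt(n)                     # Z times the standard error
    return sample_mean - margin, sample_mean + margin

first_low, first_high = z_interval(490 / 49, 2, 49, 0.95)  # n = 49, sum of x = 490
low, high = z_interval(24, 4, 49, 0.95)                    # hours-worked example

print(round(first_low, 2), round(first_high, 2))
print(round(low, 2), round(high, 2))
```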
83
Confidence Interval: Mean, t distribution

Use if normal population but population standard deviation (σ) not known
If you are given the sample standard deviation (s), use t table, assuming normal population
If one population, n - 1 degrees of freedom

84
Confidence Interval Mean t distribution
85
Confidence Interval: Proportion

Use if success or failure
(ex: defective or not defective, satisfactory or unsatisfactory)
Normal approximation to binomial ok if (n)(p) > 5 and (n)(1 - p) > 5, where
n = sample size
p = population proportion
NOTE: NEVER use the t table if proportion!!

86
Confidence Interval Proportion

Ex: 8 defectives out of 100, so p = .08 and n = 100, 95% confidence

87
Confidence Interval Proportion

A sample of 500 people who own their house revealed that 175 planned to sell their homes within five years. Develop a 98% confidence interval for the proportion of people who plan to sell their house within five years.
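
The computed intervals for the two proportion examples are not preserved in this transcript. The sketch below assumes the usual normal-approximation interval, p ± Z × √(p(1 - p)/n), with the sample proportion standing in for p; the function name `proportion_interval` is illustrative.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - 0.5 * (1 - confidence))
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

defect_low, defect_high = proportion_interval(8, 100, 0.95)  # 8 defectives out of 100
sell_low, sell_high = proportion_interval(175, 500, 0.98)    # 175 of 500 homeowners

print(round(defect_low, 3), round(defect_high, 3))
print(round(sell_low, 3), round(sell_high, 3))
```

In both cases (n)(p) and (n)(1 - p) exceed 5, so the normal approximation condition stated earlier is met.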

88
Interpretation

If 95% confidence, then 95% of all confidence intervals constructed this way from repeated samples will include the true population parameter
NOTE! Never use the term probability when estimating a parameter!! (ex: Do NOT say "Probability that population mean is between 23 and 32 is .95", because a parameter is not a random variable. In fact, the population mean is a fixed but unknown quantity.)

89
Point vs Interval Estimate

Point estimate: statistic (single number)
Ex: sample mean, sample proportion
Each sample gives a different point estimate
Interval estimate: range of values
Ex: Population mean = sample mean ± error
Parameter = statistic ± error

90
Width of Interval

Ex: sample mean = 23, error = 3
Point estimate = 23
Interval estimate = 23 ± 3, or (20, 26)
Width of interval = 26 - 20 = 6
Wide interval: point estimate unreliable

91
Wide Confidence Interval If

(1) small sample size (n)
(2) large standard deviation
(3) high confidence level (ex: 99% confidence interval is wider than 95% confidence interval)
If you want a narrow interval, you need a large sample size or small standard deviation or low confidence level.

92
Top Ten 7

P-value

93
P-value

P-value = probability of getting a sample statistic as extreme as (or more extreme than) the sample statistic you got from your sample, given that the null hypothesis is true

94
P-value Example: one tail test

H0: µ = 40
HA: µ > 40
Sample mean = 43
P-value = P(sample mean > 43, given H0 true)
Meaning: probability of observing a sample mean as large as 43 when the population mean is 40
How to use it: Reject H0 if p-value < α (significance level)

95
Two Cases

Suppose α = .05
Case 1: suppose p-value = .02, then reject H0 (unlikely H0 is true; you believe population mean > 40)
Case 2: suppose p-value = .08, then do not reject H0 (H0 may be true; you have reason to believe that the population mean may be 40)

96
P-value Example: two tail test

H0: µ = 70
HA: µ ≠ 70
Sample mean = 72
If two tails, then P-value = 2 × P(sample mean > 72) = 2(.04) = .08
If α = .05, p-value > α, so do not reject H0
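
The slides give the tail probabilities without the population standard deviation or sample size behind them. The sketch below therefore uses hypothetical values: σ = 10 and n = 36 for the one-tail example, and σ = 8 and n = 49 for the two-tail example (chosen so the one-tail area comes out near the slide's .04). The function name `p_value` is also illustrative.

```python
import math
from statistics import NormalDist

def p_value(sample_mean, mu0, sigma, n, tails=1):
    """P-value for a Z test of a mean; tails=1 is a right-tail test, tails=2 is two-tailed."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    if tails == 2:
        return 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return 1 - NormalDist().cdf(z)                 # upper tail only

# One-tail example: H0 mean 40, sample mean 43 (sigma and n are hypothetical).
p_one = p_value(43, 40, sigma=10, n=36, tails=1)

# Two-tail example: H0 mean 70, sample mean 72 (sigma and n are hypothetical).
p_two = p_value(72, 70, sigma=8, n=49, tails=2)

alpha = 0.05
print(round(p_one, 4), p_one < alpha)  # compare with alpha to decide on H0
print(round(p_two, 4), p_two < alpha)
```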

97
Top Ten 2

Hypothesis Testing

98
H0 Null Hypothesis

Population mean = µ
Population proportion = p
A statement about the value of a population
parameter
Never include sample statistic (such as, x-bar)
in hypothesis

99
HA or H1 Alternative Hypothesis

ONE TAIL ALTERNATIVE
Right tail: µ > number (smog ck)
p > fraction (defectives)
Left tail: µ < number (weight in box of crackers)
p < fraction (unpopular President's approval low)

100
One-Tailed Tests

A test is one-tailed when the alternate hypothesis, H1 or HA, states a direction, such as

H1: The mean yearly salary earned by full-time employees is more than $45,000. (µ > 45,000)
H1: The average speed of cars traveling on the freeway is less than 75 miles per hour. (µ < 75)
H1: Less than 20 percent of the customers pay cash for their gasoline purchase. (p < 0.2)

101
Two-Tail Alternative

Population mean not equal to number (too hot or too cold)
Population proportion not equal to fraction (% alcohol too weak or too strong)

102
Two-Tailed Tests

A test is two-tailed when no direction is
specified in the alternate hypothesis

H1: The mean amount of time spent on the Internet is not equal to 5 hours. (µ ≠ 5)
H1: The mean price for a gallon of gasoline is not equal to $2.54. (µ ≠ 2.54)

103
Reject Null Hypothesis (H0) If

Absolute value of test statistic > critical value
Reject H0 if |Z value| > critical Z
Reject H0 if |t value| > critical t
Reject H0 if p-value < significance level (alpha)
Note that direction of inequality is reversed!
Reject H0 if very large difference between sample statistic and population parameter in H0

Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis.
Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.
104
Example: Smog Check

H0: µ = 80
HA: µ > 80
If test statistic = 2.2 and critical value = 1.96, reject H0, and conclude that the population mean is likely > 80
If test statistic = 1.6 and critical value = 1.96, do not reject H0, and reserve judgment about H0

105
Type I vs Type II Error

Alpha = α = P(type I error) = Significance level = probability that you reject a true null hypothesis
Beta = ß = P(type II error) = probability you do not reject a null hypothesis, given H0 false
Ex: H0: Defendant innocent
α = P(jury convicts innocent person)
ß = P(jury acquits guilty person)

106
Type I vs Type II Error
                    H0 true                        H0 false
Reject H0           Alpha = α = P(type I error)    1 - ß (Correct Decision)
Do not reject H0    1 - α (Correct Decision)       Beta = ß = P(type II error)
107
Example: Smog Check

H0: µ = 80
HA: µ > 80
If p-value = 0.01 and alpha = 0.05, reject H0, and conclude that the population mean is likely > 80
If p-value = 0.07 and alpha = 0.05, do not reject H0, and reserve judgment about H0

108
Test Statistic

When testing for the population mean from a large sample and the population standard deviation is known, the test statistic is given by
z = (x-bar - µ) / (σ / √n)

109
Example

The processors of Best Mayo indicate on the label that the bottle contains 16 ounces of mayo. The standard deviation of the process is 0.5 ounces. A sample of 36 bottles from last hour's production showed a mean weight of 16.12 ounces per bottle. At the .05 significance level, can we conclude that the mean amount per bottle is greater than 16 ounces?

110
Example contd

1. State the null and the alternative hypotheses:
H0: µ = 16, H1: µ > 16

2. Select the level of significance. In this case, we selected the .05 significance level.

3. Identify the test statistic. Because we know the population standard deviation, the test statistic is z.

4. State the decision rule.
Reject H0 if z > 1.645 (= z0.05)
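
The remaining steps of this example are not preserved in the transcript. As a sketch, the test statistic and the decision can be computed from the numbers given, assuming the standard z statistic for a mean with known population standard deviation, z = (x-bar - µ) / (σ/√n). Variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0 = 16.0     # mean stated in H0 (ounces)
sigma = 0.5    # population (process) standard deviation
n = 36         # sample size
x_bar = 16.12  # sample mean
alpha = 0.05   # significance level

z = (x_bar - mu0) / (sigma / math.sqrt(n))  # test statistic
critical = NormalDist().inv_cdf(1 - alpha)  # right-tail critical value, about 1.645
reject_h0 = z > critical                    # decision rule from step 4

print(round(z, 2), round(critical, 3), reject_h0)
```

With these inputs z is about 1.44, which is below 1.645, so under these assumptions H0 is not rejected at the .05 level.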

111
Example contd