Title: Review of Top 10 Concepts in Statistics (reordered slightly for review in the interactive session)
1Review of Top 10 Concepts in Statistics (reordered slightly for review in the interactive session)
- NOTE: This PowerPoint file is not an introduction, but rather a checklist of topics to review
2Top Ten #10
- Qualitative vs. Quantitative
3Qualitative
- Categorical data
- success vs. failure
- ethnicity
- marital status
- color
- zip code
- 4 star hotel in tour guide
4Qualitative
- If you need an "average," do not calculate the mean
- However, you can compute the mode ("average" person is married, buys a blue car made in America)
5Quantitative
- Two cases
- Case 1: discrete
- Case 2: continuous
6Discrete
- (1) integer values (0, 1, 2, ...)
- (2) example: binomial
- (3) finite or countable number of possible values
- (4) counting
- (5) number of brothers
- (6) number of cars arriving at gas station
7Continuous
- Real numbers, such as decimal values (22.22)
- Examples: Z, t
- Infinite number of possible values
- Measurement
- Miles per gallon, distance, duration of time
8Graphical Tools
- Pie chart or bar chart: qualitative
- Joint frequency table: qualitative (relate marital status vs. zip code)
- Scatter diagram: quantitative (distance from CSUN vs. duration of time to reach CSUN)
9Hypothesis Testing and Confidence Intervals
- Quantitative: Mean
- Qualitative: Proportion
10Top Ten #9
11Population
- Collection of all items (all light bulbs made at factory)
- Parameter: measure of population
- (1) population mean (average number of hours in life of all bulbs)
- (2) population proportion (% of all bulbs that are defective)
12Sample
- Part of population (bulbs tested by inspector)
- Statistic: measure of sample = estimate of parameter
- (1) sample mean (average number of hours in life of bulbs tested by inspector)
- (2) sample proportion (% of bulbs in sample that are defective)
13Top Ten #1
14Measures of Central Location
15Mean
- Population mean: µ = Σx/N = (5+1+6)/3 = 12/3 = 4
- Algebra: Σx = Nµ = 3 × 4 = 12
- Sample mean: x-bar = Σx/n
- Example: the number of hours spent on the Internet: 4, 8, and 9
- x-bar = (4+8+9)/3 = 7 hours
- Do NOT use if the number of observations is small or with extreme values
- Ex: Do NOT use if 3 houses were sold this week, and one was a mansion
16Median
- Median: middle value
- Example: 5, 1, 6
- Step 1: Sort data: 1, 5, 6
- Step 2: Middle value = 5
- When there is an even number of observations, the median is computed by averaging the two observations in the middle.
- OK even if there are extreme values
- Home sales: 100K, 200K, 900K, so
- mean = 400K, but median = 200K
17Mode
- Mode: most frequent value
- Ex: female, male, female
- Mode = female
- Ex: 1, 1, 2, 3, 5, 8
- Mode = 1
- It may not be a very good measure; see the following example
18Measures of Central Location - Example
- Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23
- Sample mean: x-bar = Σx/n = 100/10 = 10
- Median = (8+9)/2 = 8.5
- Mode = 0
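The three measures for this sample can be checked with a short Python sketch. It uses only the standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
import statistics

# Sample from the slide
sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

mean = statistics.mean(sample)      # sum of values divided by n
median = statistics.median(sample)  # averages the two middle values when n is even
mode = statistics.mode(sample)      # most frequent value

print("mean:", mean, "median:", median, "mode:", mode)
```

Note how the mode (0) sits far from both the mean and the median, which is why it may be a poor measure of central location here.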
19Relationship
- Case 1: if probability distribution symmetric (ex. bell-shaped, normal distribution),
- Mean = Median = Mode
- Case 2: if distribution positively skewed to right (ex. incomes of employees in a large firm: a large number of relatively low-paid workers and a small number of high-paid executives),
- Mode < Median < Mean
20Relationship contd
- Case 3: if distribution negatively skewed to left (ex. the time taken by students to write exams: few students hand in their exams early and the majority of students turn in their exams at the end of the exam),
- Mean < Median < Mode
21Dispersion Measures of Variability
- How much spread of data
- How much uncertainty
- Measures
- Range
- Variance
- Standard deviation
22Range
- Range = Max - Min ≥ 0
- But range affected by unusual values
- Ex: Santa Monica has a high of 105 degrees and a low of 30 once a century, but range would be 105 - 30 = 75
23Standard Deviation (SD)
- Better than range because all data used
- Population SD = square root of variance = sigma (σ)
- SD ≥ 0
24Empirical Rule
- Applies to mound or bell-shaped curves
- Ex normal distribution
- 68% of data within one SD of mean
- 95% of data within two SD of mean
- 99.7% of data within three SD of mean
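The three percentages can be reproduced from the standard normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `within` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, "SD:", round(100 * within(k), 1), "%")
```

The rule holds only for mound-shaped data; for a skewed distribution the percentages can differ noticeably.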
25Standard Deviation = Square Root of Variance
26Sample Standard Deviation
x          x - x-bar       (x - x-bar)^2
6          6 - 8 = -2      (-2)(-2) = 4
6          6 - 8 = -2      4
7          7 - 8 = -1      (-1)(-1) = 1
8          8 - 8 = 0       0
13         13 - 8 = 5      (5)(5) = 25
Sum = 40   Sum = 0         Sum = 34
Mean = 40/5 = 8
27Standard Deviation
- Total variation = 34
- Sample variance = 34/(5-1) = 34/4 = 8.5
- Sample standard deviation
- = square root of 8.5 = 2.9
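The same calculation, step by step, as a minimal Python sketch (the variable names are illustrative; the divisor n - 1 is the sample formula used on the slide):

```python
import math

data = [6, 6, 7, 8, 13]
n = len(data)

mean = sum(data) / n                                  # 40 / 5
total_variation = sum((x - mean) ** 2 for x in data)  # sum of squared deviations
sample_variance = total_variation / (n - 1)           # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(mean, total_variation, sample_variance, round(sample_sd, 1))
```

For a population rather than a sample, the divisor would be N instead of n - 1.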
28Measures of Variability - Example
- The hourly wages earned by a sample of five students are
- 7, 5, 11, 8, and 6
- Range = 11 - 5 = 6
- Variance = Σ(x - x-bar)^2/(n - 1) = 21.2/4 = 5.3 (with x-bar = 37/5 = 7.4)
- Standard deviation = square root of 5.3 = 2.30
29Graphical Tools
- Line chart: trend over time
- Scatter diagram: relationship between two variables
- Bar chart: frequency for each category
- Histogram: frequency for each class of measured data (graph of frequency distribution)
- Box plot: graphical display based on quartiles, which divide data into 4 parts
30Top Ten #8
- Variation Creates Uncertainty
31No Variation
- Certainty, exact prediction
- Standard deviation = 0
- Variance = 0
- All data exactly same
- Example: all workers in minimum wage job
32High Variation
- Uncertainty, unpredictable
- High standard deviation
- Ex 1: Workers in downtown L.A. have variation between CEOs and garment workers
- Ex 2: New York temperatures in spring range from below freezing to very hot
33Comparing Standard Deviations
- Temperature Example
- Beach city: small standard deviation (single temperature reading close to mean)
- High Desert city: high standard deviation (hot days, cool nights in spring)
34Standard Error of the Mean
- Standard deviation of sample mean
- = standard deviation/square root of n
- Ex: standard deviation = 10, n = 4, so standard error of the mean = 10/2 = 5
- Note that 5 < 10, so standard error < standard deviation.
- As n increases, standard error decreases.
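A small Python sketch of the formula above. The function name `standard_error` is hypothetical, and the sample sizes after n = 4 are extra illustrations that are not on the slide.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation / square root of n."""
    return sd / math.sqrt(n)

# Slide example: SD = 10, n = 4
print(standard_error(10, 4))

# Illustrative larger samples (not from the slide): the standard error shrinks
for n in (16, 100, 400):
    print(n, standard_error(10, n))
```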
35Sampling Distribution
- Expected value of sample mean = population mean, but an individual sample mean could be smaller or larger than the population mean
- Population mean is a constant parameter, but sample mean is a random variable
- Sampling distribution is distribution of sample means
36Example
- Mean age of all students in the building is population mean
- Each classroom has a sample mean
- Distribution of sample means from all classrooms is sampling distribution
37Central Limit Theorem (CLT)
- If population standard deviation is known, sampling distribution of sample means is approximately normal if n > 30
- CLT applies even if original population is skewed
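A simulation can make the CLT and the sampling distribution concrete. The sketch below is only an illustration under stated assumptions: the skewed population is taken to be exponential with mean 1 and SD 1, and the sample size (36), number of samples (2000), and seed are arbitrary choices that do not come from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 36              # sample size (> 30)
num_samples = 2000  # number of repeated samples

# Draw many samples from a skewed (exponential) population and keep each sample mean
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

center = statistics.mean(sample_means)   # should be near the population mean, 1
spread = statistics.stdev(sample_means)  # should be near SD / sqrt(n) = 1/6

print(round(center, 3), round(spread, 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed to the right.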
38Top Ten #5
39Expected Value
- Expected Value: E(x) = Σ x·P(x)
- = x1·P(x1) + x2·P(x2) + ...
- Expected value is a weighted average, also a long-run average
40Example
- Find the expected age at high school graduation if 11 were 17 years old, 80 were 18 years old, and 5 were 19 years old
- Step 1: 11 + 80 + 5 = 96
41Step 2
x     P(x)             x · P(x)
17    11/96 = .115     17(.115) = 1.955
18    80/96 = .833     18(.833) = 14.994
19    5/96 = .052      19(.052) = .988
E(x) = 17.937
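The weighted average can be computed directly from the counts. This is a minimal Python sketch; the dictionary layout is an illustrative choice. Because it does not round the probabilities to three decimals first, it gives 17.9375 rather than the table's 17.937.

```python
# Number of graduates at each age (from the slide)
counts = {17: 11, 18: 80, 19: 5}

total = sum(counts.values())  # Step 1: 11 + 80 + 5

# Step 2: E(x) = sum of x * P(x), with P(x) = count / total
expected_age = sum(age * count / total for age, count in counts.items())

print(total, round(expected_age, 4))
```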
42Top Ten #4
43Linear Regression
- Regression equation: y-hat = b0 + b1x
- y-hat = dependent variable = predicted value
- x = independent variable
- b0 = y-intercept = predicted value of y if x = 0
- b1 = slope = regression coefficient
- = change in y per unit change in x
44Slope vs Correlation
- Positive slope (b1 > 0): positive correlation between x and y (y increases if x increases)
- Negative slope (b1 < 0): negative correlation (y decreases if x increases)
- Zero slope (b1 = 0): no correlation (predicted value for y is mean of y), no linear relationship between x and y
45Simple Linear Regression
- Simple: one independent variable, one dependent variable
- Linear: graph of regression equation is straight line
46Example
- y = salary (female manager, in thousands of dollars)
- x = number of children
- n = number of observations
47Given Data
x y
2 48
1 52
4 33
48Totals
x          y
2          48
1          52
4          33
Sum = 7    Sum = 133
n = 3
49Slope (b1) = -6.5
- Method of Least Squares formulas not on BUS 302 exam
- b1 = -6.5 given
Interpretation: If one female manager has 1 more child than another, salary is $6,500 lower; that is, salary of female managers is expected to decrease by 6.5 (in thousands of dollars) per child
50Intercept (b0)
- b0 = y-bar - b1(x-bar) = 44.33 - (-6.5)(2.33) = 59.5
- If number of children is zero, expected salary is $59,500
51Regression Equation
- y-hat = 59.5 - 6.5x
52Forecast Salary If 3 Children
- 59.5 - 6.5(3) = 40
- $40,000 expected salary
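The slides take b1 = -6.5 as given. For readers who want to see where it comes from, the sketch below applies the standard least-squares formulas (which, as noted above, are not on the exam) to the three data points. The function and variable names are illustrative.

```python
# Data from the slides: x = number of children, y = salary in thousands of dollars
xs = [2, 1, 4]
ys = [48, 52, 33]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Standard least-squares formulas (assumed here; the slides only give the result)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

b1 = sxy / sxx           # slope
b0 = y_bar - b1 * x_bar  # intercept

def predict(x):
    """Predicted salary (in thousands of dollars) for x children."""
    return b0 + b1 * x

print(round(b1, 2), round(b0, 2), round(predict(3), 2))
```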
53Standard Error of Estimate
54Standard Error of Estimate
(1) x   (2) y   (3) y-hat = 59.5 - 6.5x   (4) = (2) - (3)   (5) = (4)^2
2       48      46.5                      1.5               2.25
1       52      53                        -1                1
4       33      33.5                      -.5               .25
                                                            SSE = 3.5
55Standard Error of Estimate
- Standard error of estimate = square root of [SSE/(n-2)] = square root of (3.5/1) = 1.87
Actual salary typically $1,900 away from expected salary
56Coefficient of Determination
- R² = % of total variation in y that can be explained by variation in x
- Measure of how close the linear regression line fits the points in a scatter diagram
- R² = 1: max. possible value: perfect linear relationship between y and x (straight line)
- R² = 0: min. value: no linear relationship
57Sources of Variation (V)
- Total V = Explained V + Unexplained V
- SS = Sum of Squares = V
- Total SS = Regression SS + Error SS
- SST = SSR + SSE
- SSR = Explained V, SSE = Unexplained V
58Coefficient of Determination
- R² = SSR/SST
- R² = 197/200.5 = .98
- Interpretation: 98% of total variation in salary can be explained by variation in number of children
59 0 ≤ R² ≤ 1
- 0: No linear relationship since SSR = 0 (explained variation = 0)
- 1: Perfect relationship since SSR = SST (unexplained variation = SSE = 0), but does not prove cause and effect
60R = Correlation Coefficient
- Case 1: slope (b1) < 0
- R < 0
- R is negative square root of coefficient of determination
61Our Example
- Slope b1 = -6.5
- R² = .98
- R = -.99
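The sums of squares, R², R, and the standard error of estimate for the salary example can be recomputed as follows. This is a sketch with illustrative names; it uses unrounded sums (SST is about 200.67 and SSR about 197.17), so the intermediate values differ slightly from the slide's 197 and 200.5, while R² still rounds to .98 and R to -.99.

```python
import math

xs = [2, 1, 4]
ys = [48, 52, 33]
b0, b1 = 59.5, -6.5  # regression equation from the slides

predicted = [b0 + b1 * x for x in xs]
y_bar = sum(ys) / len(ys)

sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ssr = sst - sse                                         # explained variation

r_squared = ssr / sst
r = math.copysign(math.sqrt(r_squared), b1)  # R takes the sign of the slope
std_error_estimate = math.sqrt(sse / (len(xs) - 2))

print(round(sse, 2), round(r_squared, 2), round(r, 4), round(std_error_estimate, 2))
```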
62Case 2: Slope > 0
- R is positive square root of coefficient of determination
- Ex: R² = .49
- R = .70
- R has no interpretation
- R overstates relationship
63Caution
- Nonlinear relationship (parabola, hyperbola, etc.) can NOT be measured by R²
- In fact, you could get R² = 0 with a nonlinear graph on a scatter diagram
64Summary Correlation Coefficient
- Case 1: If b1 > 0, R is the positive square root of the coefficient of determination
- Ex1: y = 4 + 3x, R² = .36, R = .60
- Case 2: If b1 < 0, R is the negative square root of the coefficient of determination
- Ex2: y = 80 - 10x, R² = .49, R = -.70
- NOTE! Ex2 has stronger relationship, as measured by coefficient of determination
65Extreme Values
- R = 1: perfect positive correlation
- R = -1: perfect negative correlation
- R = 0: zero correlation
66MS Excel Output
Correlation Coefficient (-0.9912): Note that you need to change the sign because the sign of the slope (b1) is negative (-6.5)
Coefficient of Determination
Standard Error of Estimate
Regression Coefficient
67Top Ten #6
- What Distribution to Use?
68Use Binomial Distribution If
- Random variable (x) is number of successes in n trials
- Each trial is success or failure
- Independent trials
- Constant probability of success (p) on each trial
- Sampling with replacement (in practice, people may use binomial w/o replacement, but theory is with replacement)
69Success vs. Failure
- The binomial experiment can result in only one of two possible outcomes
- Male vs. Female
- Defective vs. Non-defective
- Yes or No
- Pass (8 or more right answers) vs. Fail (fewer than 8)
- Buy drink (21 or over) vs. Cannot buy drink
70Binomial Is Discrete
- Integer values
- 0, 1, 2, ..., n
- Binomial is often skewed, but may be symmetric
71Normal Distribution
- Continuous, bell-shaped, symmetric
- Mean = median = mode
- Measurement (dollars, inches, years)
- Cumulative probability under normal curve: use Z table if you know population mean and population standard deviation
- Sample mean: use Z table if you know population standard deviation and either normal population or n > 30
72t Distribution
- Continuous, mound-shaped, symmetric
- Applications similar to normal
- More spread out than normal
- Use t if normal population but population standard deviation not known
- Degrees of freedom: df = n-1 if estimating the mean of one population
- t approaches z as df increases
73Normal or t Distribution?
- Use t table if normal population but population standard deviation (σ) is not known
- If you are given the sample standard deviation (s), use t table, assuming normal population
74Top Ten #3
- Confidence Intervals: Mean and Proportion
75Confidence Interval
- A confidence interval is a range of values within
which the population parameter is expected to
occur.
76Factors for Confidence Interval
- The factors that determine the width of a confidence interval are:
- The sample size, n
- The variability in the population, usually estimated by standard deviation.
- The desired level of confidence.
77Confidence Interval: Mean
- Use normal distribution (Z table) if
- population standard deviation (sigma) known and either (1) or (2):
- (1) Normal population
- (2) Sample size > 30
78Confidence Interval: Mean
- x-bar ± Z(sigma / square root of n)
79Normal Table
- Tail = .5(1 - confidence level)
- NOTE! Different statistics texts have different normal tables
- This review uses the tail of the bell curve
- Ex: 95% confidence: tail = .5(1-.95) = .025
- Z.025 = 1.96
80Example
- n = 49, Σx = 490, σ = 2, 95% confidence
- x-bar = 490/49 = 10, so 10 ± 1.96(2/7) = 10 ± .56
- 9.44 < µ < 10.56
81Another Example
- One of the SOM professors wants to estimate the mean
number of hours worked per week by students. A
sample of 49 students showed a mean of 24 hours.
It is assumed that the population standard
deviation is 4 hours. What is the population
mean?
82Another Example contd
- 95 percent confidence interval for the population mean:
- 24 ± 1.96(4/7) = 24 ± 1.12
The confidence limits range from 22.88 to 25.12. We estimate with 95 percent confidence that the average number of hours worked per week by students lies between these two values.
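Both Z-based intervals above (9.44 to 10.56 and 22.88 to 25.12) can be reproduced with a short Python sketch. The function name `z_interval` is hypothetical; `NormalDist().inv_cdf` from the standard library stands in for the Z table, using the same tail rule as the review.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    tail = 0.5 * (1 - confidence)       # e.g. .5(1 - .95) = .025
    z = NormalDist().inv_cdf(1 - tail)  # about 1.96 for 95% confidence
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# First example: n = 49, sum of x = 490, sigma = 2
low1, high1 = z_interval(490 / 49, 2, 49, 0.95)

# Second example: hours worked, x-bar = 24, sigma = 4, n = 49
low2, high2 = z_interval(24, 4, 49, 0.95)

print(round(low1, 2), round(high1, 2))
print(round(low2, 2), round(high2, 2))
```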
83Confidence Interval: Mean, t distribution
- Use if normal population but population standard deviation (σ) not known
- If you are given the sample standard deviation (s), use t table, assuming normal population
- If one population, n-1 degrees of freedom
84Confidence Interval: Mean, t distribution
85Confidence Interval: Proportion
- Use if success or failure
- (ex: defective or not-defective,
- satisfactory or unsatisfactory)
- Normal approximation to binomial ok if
- (n)(p) > 5 and (n)(1-p) > 5, where
- n = sample size
- p = population proportion
- NOTE: NEVER use the t table if proportion!!
86Confidence Interval: Proportion
- Ex: 8 defectives out of 100, so p = .08 and
- n = 100, 95% confidence
87Confidence Interval: Proportion
- A sample of 500 people who own their house revealed that 175 planned to sell their homes within five years. Develop a 98% confidence interval for the proportion of people who plan to sell their house within five years.
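The worked answers for the two proportion examples did not survive in this copy of the slides, so the sketch below computes them under the usual normal-approximation formula, p ± Z × square root of [p(1-p)/n]. That formula and the function name `proportion_interval` are assumptions, not text from the slides.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - 0.5 * (1 - confidence))
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 8 defectives out of 100, 95% confidence (n*p = 8 and n*(1-p) = 92, both > 5)
low_def, high_def = proportion_interval(8, 100, 0.95)

# 175 of 500 homeowners plan to sell, 98% confidence
low_sell, high_sell = proportion_interval(175, 500, 0.98)

print(round(low_def, 3), round(high_def, 3))
print(round(low_sell, 3), round(high_sell, 3))
```

Under these assumptions the intervals come out to roughly .027 to .133 for the defectives and .300 to .400 for the homeowners.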
88Interpretation
- If 95% confidence, then 95% of all confidence intervals will include the true population parameter
- NOTE! Never use the term "probability" when estimating a parameter!! (ex: Do NOT say "Probability that population mean is between 23 and 32 is .95" because parameter is not a random variable. In fact, the population mean is a fixed but unknown quantity.)
89Point vs Interval Estimate
- Point estimate: statistic (single number)
- Ex: sample mean, sample proportion
- Each sample gives different point estimate
- Interval estimate: range of values
- Ex: Population mean = sample mean ± error
- Parameter = statistic ± error
90Width of Interval
- Ex: sample mean = 23, error = 3
- Point estimate = 23
- Interval estimate = 23 ± 3, or (20, 26)
- Width of interval = 26 - 20 = 6
- Wide interval: point estimate unreliable
91Wide Confidence Interval If
- (1) small sample size (n)
- (2) large standard deviation
- (3) high confidence level (ex: 99% confidence interval wider than 95% confidence interval)
- If you want narrow interval, you need a large sample size or small standard deviation or low confidence level.
92Top Ten #7
93P-value
- P-value = probability of getting a sample statistic as extreme as (or more extreme than) the sample statistic you got from your sample, given that the null hypothesis is true
94P-value Example: one tail test
- H0: µ = 40
- HA: µ > 40
- Sample mean = 43
- P-value = P(sample mean > 43, given H0 true)
- Meaning: probability of observing a sample mean as large as 43 when the population mean is 40
- How to use it: Reject H0 if p-value < α (significance level)
95Two Cases
- Suppose α = .05
- Case 1: suppose p-value = .02, then reject H0 (unlikely H0 is true; you believe population mean > 40)
- Case 2: suppose p-value = .08, then do not reject H0 (H0 may be true; you have reason to believe that the population mean may be 40)
96P-value Example: two tail test
- H0: µ = 70
- HA: µ ≠ 70
- Sample mean = 72
- If two-tails, then P-value =
- 2 × P(sample mean > 72) = 2(.04) = .08
- If α = .05, p-value > α, so do not reject H0
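The slides give the tail probability .04 without showing the population standard deviation or the sample size. The sketch below therefore assumes sigma = 8 and n = 49, hypothetical values chosen only because they reproduce a one-tail probability of about .04; the function name `upper_tail_p` is also illustrative.

```python
import math
from statistics import NormalDist

def upper_tail_p(sample_mean, mu0, sigma, n):
    """P(sample mean >= observed value), assuming H0 (population mean = mu0) is true."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return 1 - NormalDist().cdf(z)

# sigma = 8 and n = 49 are ASSUMED values, not given on the slide
one_tail = upper_tail_p(72, 70, sigma=8, n=49)
two_tail = 2 * one_tail  # two-tail test doubles the tail probability

alpha = 0.05
reject_h0 = two_tail < alpha

print(round(one_tail, 3), round(two_tail, 3), reject_h0)
```

With these assumed inputs the same sample would lead to rejecting H0 in a one-tail test (.04 < .05) but not in the two-tail test (.08 > .05).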
97Top Ten #2
98H0: Null Hypothesis
- Population mean = µ
- Population proportion = p
- A statement about the value of a population parameter
- Never include sample statistic (such as x-bar) in hypothesis
99HA or H1: Alternative Hypothesis
- ONE TAIL ALTERNATIVE
- Right tail: µ > number (smog check)
- p > fraction (defectives)
- Left tail: µ < number (weight in box of crackers)
- p < fraction (unpopular President's approval low)
100One-Tailed Tests
- A test is one-tailed when the alternate hypothesis, H1 or HA, states a direction, such as
- H1: The mean yearly salary earned by full-time employees is more than $45,000. (µ > 45,000)
- H1: The average speed of cars traveling on freeway is less than 75 miles per hour. (µ < 75)
- H1: Less than 20 percent of the customers pay cash for their gasoline purchase. (p < 0.2)
101Two-Tail Alternative
- Population mean not equal to number (too hot or too cold)
- Population proportion not equal to fraction (% alcohol too weak or too strong)
102Two-Tailed Tests
- A test is two-tailed when no direction is specified in the alternate hypothesis
- H1: The mean amount of time spent on the Internet is not equal to 5 hours. (µ ≠ 5)
- H1: The mean price for a gallon of gasoline is not equal to $2.54. (µ ≠ 2.54)
103Reject Null Hypothesis (H0) If
- Absolute value of test statistic > critical value
- Reject H0 if |Z value| > critical Z
- Reject H0 if |t value| > critical t
- Reject H0 if p-value < significance level (alpha)
- Note that direction of inequality is reversed!
- Reject H0 if very large difference between sample statistic and population parameter in H0
Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis.
Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.
104Example: Smog Check
- H0: µ = 80
- HA: µ > 80
- If test statistic = 2.2 and critical value = 1.96, reject H0, and conclude that the population mean is likely > 80
- If test statistic = 1.6 and critical value = 1.96, do not reject H0, and reserve judgment about H0
105Type I vs Type II Error
- Alpha = α = P(type I error) = Significance level = probability that you reject true null hypothesis
- Beta = β = P(type II error) = probability you do not reject a null hypothesis, given H0 false
- Ex: H0: Defendant innocent
- α = P(jury convicts innocent person)
- β = P(jury acquits guilty person)
106Type I vs Type II Error
                   H0 true                        H0 false
Reject H0          Alpha = α = P(type I error)    1 - β (Correct Decision)
Do not reject H0   1 - α (Correct Decision)       Beta = β = P(type II error)
107Example: Smog Check
- H0: µ = 80
- HA: µ > 80
- If p-value = 0.01 and alpha = 0.05, reject H0, and conclude that the population mean is likely > 80
- If p-value = 0.07 and alpha = 0.05, do not reject H0, and reserve judgment about H0
108Test Statistic
- When testing for the population mean from a large sample and the population standard deviation is known, the test statistic is given by
- z = (x-bar - µ) / (sigma / square root of n)
109Example
- The processors of Best Mayo indicate on the
label that the bottle contains 16 ounces of mayo.
The standard deviation of the process is 0.5
ounces. A sample of 36 bottles from last hour's
production showed a mean weight of 16.12 ounces
per bottle. At the .05 significance level, can
we conclude that the mean amount per bottle is
greater than 16 ounces?
110Example contd
- 1. State the null and the alternative hypotheses:
- H0: µ = 16, H1: µ > 16
- 2. Select the level of significance. In this case, we selected the .05 significance level.
- 3. Identify the test statistic. Because we know the population standard deviation, the test statistic is z.
- 4. State the decision rule:
- Reject H0 if z > 1.645 (= z0.05)
111Example contd
- 5. Compute the value of the test statistic:
- z = (16.12 - 16) / (0.5 / square root of 36) = 0.12/0.0833 = 1.44
- 6. Conclusion: Do not reject the null hypothesis. We cannot conclude the mean is greater than 16 ounces.
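The six steps of the mayo example can be traced in a few lines of Python. This is a sketch with illustrative variable names; `NormalDist` from the standard library replaces the Z table, and the p-value line is an extra check that ties the result back to the p-value rule reviewed earlier.

```python
import math
from statistics import NormalDist

mu0 = 16.0       # H0: mean amount per bottle = 16 ounces
sigma = 0.5      # population standard deviation (ounces)
n = 36           # bottles sampled
x_bar = 16.12    # sample mean (ounces)
alpha = 0.05     # significance level

z = (x_bar - mu0) / (sigma / math.sqrt(n))    # step 5: test statistic
critical_z = NormalDist().inv_cdf(1 - alpha)  # step 4: one-tail critical value, about 1.645
p_value = 1 - NormalDist().cdf(z)             # right-tail p-value

reject_h0 = z > critical_z                    # step 6: decision

print(round(z, 2), round(critical_z, 3), round(p_value, 4), reject_h0)
```

Both rules agree: z = 1.44 is below 1.645, and the p-value (about .075) is above .05, so H0 is not rejected.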