Title: Probability Theory
1Probability Theory
- Review of essential concepts
2Probability
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- 0 ≤ P(A) ≤ 1
- P(Ω) = 1
3Problem 1
- Given that P(A) = 0.6 and P(B) = 0.7, which of the
following cannot be true? (∪ means "or", ∩ means "and")
- P(A ∪ B) = 0.5
- P(A ∩ B) = 0.9
- P(A ? B) 0.2
- P(A ? B) 0.4
- P(A ? B) 0.7
4Conditional Probability
- A and B are called independent if P(A ∩ B) = P(A) P(B)
- P(A | B) = P(A ∩ B)/P(B)
- P(A | B) is the share of A within B
- A and B are independent ⇔ P(A | B) = P(A)
5Complete Probability
- P(A) = P(A | H1)P(H1) + P(A | H2)P(H2) + … + P(A | Hn)P(Hn)
- H1, H2, …, Hn: a complete system of disjoint events
6Bayes Formula
- P(B | A): prior probability
- P(A | B): posterior probability
7Problem 2
- Suppose a certain drug test is 99% sensitive and
99% specific, that is, the test will correctly
identify a drug user as testing positive 99% of
the time, and will correctly identify a non-user
as testing negative 99% of the time. Let's assume
a corporation decides to test its employees for
opium use, and 0.5% of the employees use the
drug. What is the probability that, given a
positive drug test, an employee is actually a
drug user?
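The calculation behind Problem 2 can be sketched with Bayes' formula, using the complete probability formula for the denominator. The function name `posterior` and its parameter names are illustrative only, and the inputs assume 99% sensitivity, 99% specificity and 0.5% of employees using the drug.

```python
def posterior(prior, sensitivity, specificity):
    """P(user | positive test) by Bayes' formula."""
    # Complete probability of a positive test:
    # true positives among users plus false positives among non-users.
    p_positive = sensitivity * prior + (1.0 - specificity) * (1.0 - prior)
    return sensitivity * prior / p_positive

# Assumed inputs: 0.5% users, 99% sensitive, 99% specific.
p_user_given_positive = posterior(prior=0.005, sensitivity=0.99, specificity=0.99)
print(round(p_user_given_positive, 4))
```

Under these assumptions only about one positive test in three belongs to an actual user, because non-users vastly outnumber users.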
8Problem 3
- We are presented with three doors - red, green,
and blue - one of which has a prize. We choose
the red door, which is not opened until the
presenter performs an action. The presenter, who
knows what door the prize is behind and who must
open a door but is not permitted to open the
door we have picked or the door with the prize,
opens the blue door and reveals that there is no
prize behind it and subsequently asks if we wish
to change our mind about our initial selection of
red. What is the probability that the prize is
behind each of the green and red doors?
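One way to check Problem 3 is to enumerate the three possible prize locations and apply Bayes' formula. The sketch below assumes that, when the presenter has two doors available, he picks between them with equal probability; the variable names are illustrative.

```python
from fractions import Fraction

doors = ["red", "green", "blue"]
chosen, opened = "red", "blue"

# Joint probability of (prize location) and (presenter opens the blue door).
joint = {}
for prize in doors:
    # The presenter may not open our door or the door with the prize.
    allowed = [d for d in doors if d != chosen and d != prize]
    # Assumption: uniform choice among the allowed doors.
    p_open = Fraction(1, len(allowed)) if opened in allowed else Fraction(0)
    joint[prize] = Fraction(1, 3) * p_open

total = sum(joint.values())
post = {door: p / total for door, p in joint.items()}
print(post["red"], post["green"], post["blue"])
```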
9Random Variables
- Discrete (Uniform, Binomial, Poisson, Geometric,
Hypergeometric, Negative Binomial, …)
- Continuous (Uniform, Normal, Exponential, Gamma,
Chi-square, Student, Fisher, Dirichlet, …)
10Discrete Distributions
Poisson
11Continuous Distributions
Beta distribution
12Binomial Distribution
- Binomial random number: the number of successes
in n independent trials; p = probability of success
in one trial
[Figure: binomial distributions for p = 0.1, p = 0.3 and p = 0.5]
13Problem 4
The probability that a certain machine will
produce a defective item is 0.20. If a
random sample of 6 items is taken from the output
of this machine, what is the probability
that there will be 5 or more defectives in the
sample?
14Problem 5
There are 10 patients on the Neo-Natal Ward of a
local hospital who are monitored by 2 staff
members. If the probability (at any one time) of
a patient requiring emergency attention by a
staff member is 0.3, assuming the patients
behave independently, what is the probability at
any one time that there will not be sufficient
staff to attend all emergencies?
15Cumulative Probability
X: random variable; F(x) = P(X ≤ x)
Most of the data analysis tools have a built-in
function for the cumulative binomial probability
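Where no such built-in function is at hand, the cumulative binomial probability can be summed directly. The sketch below does this with the standard library only and applies it to Problems 4 and 5; the helper names `binom_pmf` and `binom_cdf` are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """F(k) = P(X <= k) for X ~ Binom(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Problem 4: 5 or more defectives among 6 items, p = 0.20.
p4 = 1 - binom_cdf(4, 6, 0.20)
# Problem 5: more than 2 of 10 patients need attention at once, p = 0.3.
p5 = 1 - binom_cdf(2, 10, 0.3)
print(round(p4, 4), round(p5, 4))
```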
16Poisson Distribution
- Poisson random number: the number of rare events
per unit of time or space
[Figure: Poisson distributions for λ = 1.5 and λ = 5]
17Problem 6
- The marketing manager of a company has noted that
she usually receives 10 complaint calls during a
week (consisting of five working days), and that
the calls occur at random. Find the probability
that she gets five such calls in one day.
18Problem 7
- The rate at which a particular defect occurs in
lengths of plastic film being produced by a
stable manufacturing process is 4.2 defects per
75 meter length. A random sample of the film is
selected and it was found that the length of the
film in the sample was 25 meters. What is the
probability that there will be at most 2 defects
found in the sample?
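Problems 6 and 7 both come down to rescaling the rate to the unit in question and evaluating the Poisson formula. A minimal sketch, with illustrative helper names, assuming the rate scales proportionally with time or length:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

# Problem 6: 10 calls per 5-day week gives 2 calls per day.
p6 = poisson_pmf(5, 10 / 5)
# Problem 7: 4.2 defects per 75 m gives 1.4 defects per 25 m.
p7 = poisson_cdf(2, 4.2 * 25 / 75)
print(round(p6, 4), round(p7, 4))
```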
19Normal Distribution
20Cumulative Probability
Standard Normal Distribution
21Other Normal Distributions
- Z ~ N(0, 1)
- Mean 0
- Variance 1
- X ~ N(µ, σ)
- Mean µ
- Variance σ²
- Z = (X − µ)/σ
22Problem 8
- The diameters of steel disks produced in a plant
are normally distributed with a mean of 2.5 cm
and standard deviation of 0.02 cm. What is the
probability that a disk picked at random has a
diameter greater than 2.54 cm?
23Problem 9
- The height of an adult male is known to be
normally distributed with a mean of 69 inches and
a standard deviation of 2.5 inches. What is the
height of the doorway such that 96 percent of the
adult males can pass through it without having to
bend?
24Problem 10
- The longevity of people living in a certain
locality has a standard deviation of 14 years.
What is the mean longevity if 30% of the people
live longer than 75 years? Assume a normal
distribution for life spans.
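Problems 8 to 10 use the normal distribution in three directions: from a value to a probability, from a probability to a value, and from a probability back to the mean. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8 or later) and assumes that Problem 10 means 30 percent of the people.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

# Problem 8: P(X > 2.54) for X ~ N(2.5, 0.02).
p8 = 1 - NormalDist(mu=2.5, sigma=0.02).cdf(2.54)

# Problem 9: height h with P(X <= h) = 0.96 for X ~ N(69, 2.5).
h9 = NormalDist(mu=69, sigma=2.5).inv_cdf(0.96)

# Problem 10: P(X > 75) = 0.30, so (75 - mu)/14 is the 0.70 quantile of Z.
mu10 = 75 - 14 * Z.inv_cdf(0.70)

print(round(p8, 4), round(h9, 2), round(mu10, 2))
```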
25Normal Approximation to Binomial
- X ~ Binom(n, p)
- n: number of trials
- p: probability of a single success
- X ≈ N(µ, σ)
- µ = np
- σ² = np(1 − p)
n > 40, np > 5, n(1 − p) > 5
26Problem 11
The unemployment rate in a certain city is 8.5%.
A random sample of 100 people from the labor
force is drawn. Find the approximate probability
that the sample contains at least ten unemployed
people.
27Continuity correction
Normal approximation is still an approximation
28Problem 12
Companies are interested in the demographics of
those who listen to the radio programs they
sponsor. A radio station has determined that only
20% of listeners phoning in to a morning talk
program are male. During a particular week, 200
calls are received by this program. What is the
approximate probability that at least 50 of the
callers are male?
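The continuity correction replaces P(X ≥ k) by the normal probability of exceeding k − 0.5. The sketch below applies it to Problems 11 and 12 and compares the result with the exact binomial sum; the function names are illustrative, and the inputs assume rates of 8.5% and 20%.

```python
from math import comb, sqrt
from statistics import NormalDist

def normal_at_least(k, n, p):
    """Normal approximation to P(X >= k) with continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return 1 - NormalDist(mu, sigma).cdf(k - 0.5)

def exact_at_least(k, n, p):
    """Exact binomial P(X >= k), for comparison."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Problem 11: at least 10 unemployed among 100, p = 0.085.
approx11, exact11 = normal_at_least(10, 100, 0.085), exact_at_least(10, 100, 0.085)
# Problem 12: at least 50 male callers among 200, p = 0.20.
approx12, exact12 = normal_at_least(50, 200, 0.20), exact_at_least(50, 200, 0.20)

print(round(approx11, 3), round(exact11, 3))
print(round(approx12, 3), round(exact12, 3))
```

The two columns differ slightly, which is the point of the slide: the normal approximation is still an approximation.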
29Poisson Approximation to Binomial
- X ~ Binom(n, p)
- n: number of trials
- p: probability of a single success
- X ≈ Poisson(λ)
- λ = np
n → ∞, p → 0, np → const
30Problem 13
A certain genetic characteristic will express
itself in 0.001 of the population. In a sample of
n = 3000 subjects, k = 7 are observed to display the
characteristic, whereas only three are expected
to display the characteristic. How likely is it
that a rate this great or greater could occur by
mere chance?
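For Problem 13 the Poisson approximation with λ = np = 3 gives the tail probability P(X ≥ 7). The sketch below computes it and, as a sanity check, the exact binomial tail; the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p, k = 3000, 0.001, 7
lam = n * p  # expected number of subjects with the characteristic

# P(X >= 7) = 1 - P(X <= 6) under Poisson(lam).
poisson_tail = 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))
# The same tail under Binom(n, p), without approximation.
binom_tail = 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

print(round(poisson_tail, 4), round(binom_tail, 4))
```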
31Expected Value
E(X) = Σ xi pi is not a random number
E(XY) = 1·1/2·2/3 = 1/3 = E(X)E(Y)
E(X) = 0·1/2 + 1·1/2 = 1/2
E(Y) = 0·1/3 + 1·2/3 = 2/3
X and Y are independent ⇔ X = a and Y = b are
independent events for all a and b
32Variance
Var(X) = E[(X − E(X))²] = E(X²) − (E(X))²
E(X) = 2/3
E(X − E(X)) = −2/9 + 2/9 = 0
Var(X) = 4/9·1/3 + 1/9·2/3 = 2/9
E(X²) = 2/3
Var(X) = E(X²) − E²(X) = 2/3 − 4/9 = 2/9
33Expected Value and Variance
- X random variable
- E(X + Y) = E(X) + E(Y)
- E(cX) = cE(X)
- E(c) = c
- If X and Y are independent then E(XY) = E(X)E(Y)
- Var(X) = E(X²) − E²(X)
- Var(cX) = c²Var(X)
- If X and Y are independent then Var(X + Y) =
Var(X) + Var(Y)
- For arbitrary X and Y, Var(X + Y) = Var(X) + Var(Y) +
2Cov(X,Y)
34Exercises
- Using properties of E(X) prove that
- Var(X) = E[(X − E(X))²] = E(X²) − (E(X))²
- Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)
- where
- Cov(X,Y) = E[(X − E(X))(Y − E(Y))]
- Cov(X,Y) = E(XY) − E(X)E(Y)
- Find X and Y such that X and Y are dependent but
Cov(X,Y) = 0
35Problem 14
- The Attila Barbell Company makes bars for weight
lifting. The weights of the bars are independent
and are normally distributed with a mean of 720
ounces (45 pounds) and a standard deviation of 4
ounces. The bars are shipped 10 in a box to the
retailers. The weights of the empty boxes are
normally distributed with a mean of 320 ounces
and a standard deviation of 8 ounces. The weights
of the boxes filled with 10 bars are expected to
be normally distributed with a mean of 7,520
ounces. What is the standard deviation?
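Problem 14 follows from the rule that variances of independent variables add. A short sketch of the arithmetic, assuming the ten bars and the box are mutually independent:

```python
from math import sqrt

bars, bar_mean, bar_sd = 10, 720, 4
box_mean, box_sd = 320, 8

# Means add; for independent weights the variances add as well.
mean_total = bars * bar_mean + box_mean
var_total = bars * bar_sd**2 + box_sd**2
sd_total = sqrt(var_total)

print(mean_total, var_total, round(sd_total, 2))
```

Note that the standard deviations themselves do not add: 10·4 + 8 = 48 would badly overstate the spread.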
36Statistics
- Part I Sampling distribution
37Sampling Distribution
- Sample X1, X2, …, Xn
- Xi are random numbers
Population: heights of adult males
- All Xi are
- from the same distribution
- are independent
38Sample Mean
- Sample mean: X̄ = (X1 + X2 + … + Xn)/n
- All Xi are
- from the same distribution, i.e.,
- E(Xi) = µ, Var(Xi) = σ²
- are independent random numbers
- E(X̄) = µ, Var(X̄) = σ²/n
39The Law of Large Numbers
40Illustrative example
Population {1, 2, 3}, sample size n = 2
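The sampling distribution in this example is small enough to enumerate. The sketch below lists every ordered sample of size 2 from the population 1, 2, 3, assuming independent draws (sampling with replacement), and tabulates the sample mean.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
n = 2

# Assumption: independent draws, so all 3**2 ordered samples are equally likely.
samples = list(product(population, repeat=n))
counts = Counter(Fraction(sum(s), n) for s in samples)
dist = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

mean_of_means = sum(m * p for m, p in dist.items())
var_of_means = sum((m - mean_of_means) ** 2 * p for m, p in dist.items())

for m, p in dist.items():
    print(m, p)
print(mean_of_means, var_of_means)
```

The mean of the sample means equals the population mean 2, and their variance 1/3 is the population variance 2/3 divided by n = 2.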
41Central Limit Theorem
- The sum of a sufficiently large number of
identically distributed independent random
variables is approximately normally distributed
regardless of the population distribution
42Normal Approximation to Binomial
X: number of successes in n trials; X = X1 + X2 + … + Xn
43Problem 18
- There are two games involving flipping a coin. In
the first game you win a prize if you can throw
between 45% and 55% heads. In the second game
you win if you can throw more than 80% heads. For
each game would you rather flip the coin 30 times
or 300 times?
44Sampling distribution
X̄ is approximately normal when n > 40
X̄ is approximately normal regardless of the
original distribution
45Problem 15
- The average outstanding bill for delinquent
customer accounts for a national department store
chain is $187.50 with a standard deviation of
$54.50. In a simple random sample of 50
delinquent accounts, what is the probability that
the mean outstanding bill is over $200?
46Problem 16
- The average number of daily emergency room
admissions at a hospital is 85 with standard
deviation of 37. In a simple random sample of 30
days, what is the probability that the mean
number of daily emergency admissions is between
75 and 95?
47Problem 17
- A summer resort rents rowboats to customers but
does not allow more than four people to a boat.
Each boat is designed to hold no more than 800
pounds. Suppose the distribution of adult males
who rent boats, including their clothes and gear,
is normal with a mean of 190 pounds and standard
deviation of 10 pounds. If the weights of
individual passengers are independent, what is
the probability that a group of four adult male
passengers will exceed the acceptable weight
limit of 800 pounds?
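Problems 15 to 17 differ only in which sampling distribution is used: the sample mean has standard deviation σ/√n, while the sum of n independent values has standard deviation σ√n. A sketch with `statistics.NormalDist`, where the helper name `sample_mean_dist` is illustrative:

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_dist(mu, sigma, n):
    """Approximate distribution of the mean of n independent observations."""
    return NormalDist(mu, sigma / sqrt(n))

# Problem 15: P(mean bill > 200), n = 50.
p15 = 1 - sample_mean_dist(187.50, 54.50, 50).cdf(200)

# Problem 16: P(75 < mean admissions < 95), n = 30.
d16 = sample_mean_dist(85, 37, 30)
p16 = d16.cdf(95) - d16.cdf(75)

# Problem 17: total weight of 4 passengers, mean 4*190, s.d. 10*sqrt(4).
total = NormalDist(4 * 190, 10 * sqrt(4))
p17 = 1 - total.cdf(800)

print(round(p15, 4), round(p16, 4), round(p17, 4))
```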
48Statistics
- Part II Hypothesis testing
49Hypothesis testing
- H0: null hypothesis
- HA: alternative hypothesis
In a court: H0: the person is not guilty; HA:
the person is guilty. Doctor's appointment: H0:
patient is sick; HA: patient is not sick
50Type I/II error
- Type I error (α)
- It is the error of rejecting a null hypothesis
when it is actually true.
- Type II error (β)
- It is the error of failing to reject a null
hypothesis when it is in fact false.
51Decision rule
- Assume we get many samples
- We set up a decision rule which rejects or
accepts the null hypothesis for each sample
- Sometimes we will commit Type I error
- Sometimes we will commit Type II error
- (Of course many times we will be correct!)
Decision rule comes separately from the set of
hypotheses
52Type I/II error
53Problem 19
- A patient claims that he consumes only 2000
calories per day, but a dietician suspects that
the actual figure is higher. The dietician plans
to check his food intake for 30 days and will
reject the patient's claim if the 30-day-mean is
more than 2100 calories. If the standard
deviation (in calories per day) is 350, what is
the probability that the dietician will
mistakenly reject a patient's true claim?
54Problem 20
- City planners wish to test the claim that
shoppers park for an average of only 47 minutes
in the downtown area. The planners have decided
to tabulate parking durations for 225 shoppers
and to reject the claim if the sample mean
exceeds 50 minutes. If the claim is wrong and the
true mean is 51 minutes, what is the probability
that the random sample will lead to a mistaken
failure to reject the claim? Assume that the
standard deviation in parking durations is 27
minutes.
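Problems 19 and 20 ask for the two error probabilities of a fixed decision rule. The sketch below computes the Type I error for the dietician's rule and the Type II error for the planners' rule, treating the sample mean as normal with standard deviation σ/√n; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Problem 19: Type I error, reject when the 30-day mean exceeds 2100
# although the true mean is 2000.
alpha = 1 - NormalDist(2000, 350 / sqrt(30)).cdf(2100)

# Problem 20: Type II error, fail to reject (sample mean <= 50)
# although the true mean is 51.
beta = NormalDist(51, 27 / sqrt(225)).cdf(50)

print(round(alpha, 4), round(beta, 4))
```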
55P-value
- P-value is the probability of obtaining a result
at least as extreme as the one that was actually
observed, given that the null hypothesis is true.
56Hypothesis testing
- P-value is a function of sample
- α is a function of decision rule
- Reject H0 if p-value < α
- Small p-value indicates that you see something
very unusual if H0 were true
57Problem 21
- A service station advertises that its mechanics
can change a muffler in only 15 minutes. A
consumers group doubts this claim and runs a
hypothesis test using 49 cars needing new
mufflers. In this sample the mean changing time
is 16.25 minutes with a standard deviation of 3.5
minutes. Is this strong evidence against the
15-minute claim?
58Estimators
- An estimator is a function of the observable
sample data that is used to estimate an unknown
population parameter
- X̄ is an estimator for µ
- s is an estimator for σ
- p̂ is an estimator for p
59Standard error
- Standard error: standard deviation of the
estimator
60Problem 22
- A local restaurant owner claims that only 15% of
visiting tourists stay for more than 2 days. A
chamber of commerce volunteer is sure that the
real percentage is higher. He plans to survey 100
tourists and intends to speak up if at least 18
of the tourists stay longer than 2 days. What is
the probability of mistakenly rejecting the
restaurant owner's claim if it is true?
61Two-sample mean
- Two independent samples, X1, …, Xn and Y1, …, Ym,
have independent sample means
62Two-sample proportion
- Two independent sample proportions
63Problem 23
- A historian believes that the average height of
soldiers in World War II was greater than that of
soldiers in World War I. She examines a random
sample of records of 100 men in each war and
notes standard deviations of 2.5 and 2.3 inches
in World War I and World War II, respectively. If
the average height from the sample of World War
II soldiers is 1 inch greater than that from the
sample of World War I soldiers, what conclusion
is justified from a two-sample hypothesis test
where H0: µ1 = µ2 vs. HA: µ1 < µ2?
64Confidence intervals
- Hypothesis testing: A coffee machine is supposed
to deliver 8 ounces of coffee in a cup, but in my
sample of 10 cups I get only 7.5 ounces. Is this
ok?
- Confidence intervals: My sample of 10 cups of
coffee contains on average 7.5 ounces of liquid.
What is the likely estimate for the mean amount
of coffee per cup?
- Hypothesis testing and construction of confidence
intervals are mutually inverse problems
65Confidence intervals
- Parameter = Estimate ± quantile × SE,
- SE: standard error
66Problem 19 revisited
- A patient claims that he consumes only 2000
calories per day, but a dietician suspects that
the actual figure is higher. The dietician
checked his food intake for 30 days and found
that the 30-day-mean is more than 2100 calories.
What is the 95% confidence interval for the
number of calories in the patient's diet?
- Assume a standard deviation of 350 calories per
day.
67Problem 24
- A chamber of commerce volunteer is interested in
the percentage of visiting tourists staying for
more than 2 days in a certain hotel. He surveyed
100 tourists and found that 18 of them stay
longer than 2 days. What is the 99% confidence
interval for the percentage of visiting tourists
who stay for more than 2 days?
68Problem 25
- In a random sample of 300 high school students,
225 said they managed time effectively, while in
a similar sample of 270 college students, only
108 felt they were effective time managers. What
is a 99% confidence interval estimate for the
difference between the proportions of high school
and college students who think they manage time
effectively?
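Problems 24 and 25 both fit the pattern estimate ± quantile × SE with a normal quantile. A sketch under that large-sample assumption; the function names `proportion_ci` and `diff_proportion_ci` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_quantile(level):
    """Two-sided normal quantile, e.g. about 2.576 for level 0.99."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def proportion_ci(successes, n, level):
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z = z_quantile(level)
    return p_hat - z * se, p_hat + z * se

def diff_proportion_ci(x1, n1, x2, n2, level):
    p1, p2 = x1 / n1, x2 / n2
    # Independent samples: the variances of the two proportions add.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = z_quantile(level)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

ci24 = proportion_ci(18, 100, 0.99)                 # Problem 24
ci25 = diff_proportion_ci(225, 300, 108, 270, 0.99)  # Problem 25
print([round(v, 3) for v in ci24], [round(v, 3) for v in ci25])
```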
69Problem 26
- A medical researcher believes that taking 1000
milligrams of vitamin C per day will result in
fewer colds than a daily intake of 500 milligrams
will. In a group of 50 volunteers taking 1000
milligrams per day, the numbers of colds per
individual during a winter season averaged 1.8
with a variance of 1.5. Similar data from a group
of 60 volunteers taking 500 milligrams per day
showed an average of 2.4 with a variance of 1.6.
What was the P-value of this test?
70How do we get σ?
- Population standard deviation is usually unknown
- If sample size is large (n > 40) then we can assume
that the sample standard deviation (s)
approximates the population standard deviation
(σ) well enough
- If sample size is small then this assumption is
no longer valid, i.e., sampling error in the
estimation of σ cannot be ignored
71Known vs. unknown σ
                σ known    σ unknown
Small sample    z          t
Large sample    z          z
72Student t-distribution
- Student t-distribution has one parameter called
degrees of freedom
- When the number of degrees of freedom is large,
the t-distribution is close to z-distribution
73t-distribution table
Degrees of freedom = sample size − 1
74Problem 28
- An article ("Undergraduate Marijuana use and
Anger" by Sue Stoner) in a 1988 issue of the
Journal of Psychology (Vol. 122, p. 33) reported
that in a sample of 17 marijuana users the mean
and standard deviation on an anger expression
scale were 42.72 and 6.05, respectively. Test
whether this result is significantly greater than
the established mean of 41.6 for nonusers. What
assumptions are necessary for the above test to
be valid?
75T-test assumptions
- Random sampling (like in z-test)
- Normal population (unlike z-test, where sample
mean is automatically normal regardless of the
population when sample size is large)
- Degrees of freedom = number of independent
observations (actually, residuals)
76Problem 29
- A hospital exercise laboratory technician notes
the resting pulse rates of five joggers to be 60,
58, 59, 61, and 67, respectively, while the
resting pulse rates of seven non-exercisers are
83, 60, 75, 71, 91, 82, and 84, respectively.
Establish a 99% confidence interval estimate for
the difference in pulse rates between joggers and
non-exercisers.
- (Means are 61 and 78; standard deviations are 3.54
and 10.23, respectively)
77Equal variances assumption
- Assume that both populations have the same
standard deviation (i.e., amount of exercise
affects mean of the population, not its standard
deviation)
d.f. = min{n, m} (without the equal variances assumption)
d.f. = n + m − 2 (with the equal variances assumption)
78Problem 29 revisited
- A hospital exercise laboratory technician notes
the resting pulse rates of five joggers to be 60,
58, 59, 61, and 67, respectively, while the
resting pulse rates of seven non-exercisers are
83, 60, 75, 71, 91, 82, and 84, respectively.
Establish a 99% confidence interval estimate for
the difference in pulse rates between joggers and
non-exercisers. Assume equal variances.
- (Means are 61 and 78; standard deviations are 3.54
and 10.23, respectively)
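With the equal variances assumption, the two sample variances are pooled and the interval uses n + m − 2 degrees of freedom. The sketch below works through the pulse-rate data; the t quantile 3.169 (99% two-sided, 10 degrees of freedom) is taken from a standard t-table rather than computed, since the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, variance

joggers = [60, 58, 59, 61, 67]
others = [83, 60, 75, 71, 91, 82, 84]
n, m = len(joggers), len(others)

df = n + m - 2
# Pooled variance: sample variances weighted by their degrees of freedom.
pooled_var = ((n - 1) * variance(joggers) + (m - 1) * variance(others)) / df
se = sqrt(pooled_var * (1 / n + 1 / m))
diff = mean(joggers) - mean(others)

t_quantile = 3.169  # assumption: table value for 99% two-sided, df = 10
ci = (diff - t_quantile * se, diff + t_quantile * se)
print(df, round(se, 3), [round(v, 2) for v in ci])
```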
79Problem 30
- Pepper plants watered lightly every day for a
month show an average growth of 27 cm with the
standard deviation of 8.3 cm, while pepper plants
watered heavily once a week for a month show an
average growth of 29 cm with the standard
deviation of 7.9 cm. In a sample of 60 plants,
half of which were given each of the water
treatments, what is the probability that the
difference in average growth between the two
halves is between -3 and 3 cm?
80Problem 31
- A researcher believes a new diet should improve
weight gain in laboratory mice. If ten control
mice on the old diet gain an average of 4 ounces
with a standard deviation of 0.3 ounces, while
the average gain for the ten mice on the new diet
is 4.8 ounces with a standard deviation of 0.2
ounces, what is the p-value?
81Dependent samples
- Trace metals in drinking water wells affect the
flavor of the water and unusually high
concentrations can pose a health hazard. In the
paper "Trace Metals of South Indian River
Region" (Environmental Studies, 1982, 62-6),
trace metal concentrations (mg/L) on zinc were
found from water drawn from the bottom and the
top of each of 6 wells.
82Dependent samples
One sample t-test
83FAQs
- Do I have to divide by square root of n?
- Yes, if you are looking for P(X̄ > 100)
- No, if you are looking for P(X > 100)
- Do I have to divide by square root of n in
one-proportion or two-proportion tests?
- No. If you use Standard Error, it already
contains the square root of n
- When I compute standard deviation from the
sample, do I have to divide it by square root of
n?
- Yes, if your calculations involve sample mean.
84Common misconception
- Sample standard deviation is an estimator for the
population standard deviation
- Standard deviation of the sampling distribution
is smaller than the population standard deviation
- Sample standard deviation is NOT an estimator for
the standard deviation of the sampling
distribution
85Statistics
- Part III Contingency tables
86Non-parametric hypotheses
- H0: features are independent
- HA: features are dependent
A restaurant owner surveys a random sample of 385
customers to determine whether customer
satisfaction is related to gender and age.
87Assumption of independence
If gender/age and satisfaction were independent,
then P(satisfied and young male) =
P(satisfied) P(young male)
P(satisfied) = 302/385
P(young male) = 33/385
P(satisfied and young male) = 302·33/385²
Expected number of satisfied young males = 302·33/385
88Observed and Expected
Observed
Expected
89Chi-square test for independence
d.f. = (n − 1) × (m − 1)
90Problem 32
- A sociologist tests whether there is a
relationship between cheating on exams and
socioeconomic status. A random sample of 750 high
school students yields the following results
- What is the conclusion about cheating and
socioeconomic status at the 5% significance level?
91Chi-square goodness of fit
- A grocery store manager wishes to determine
whether a certain product will sell equally well
in any of the five locations in the store. Five
displays are set up, one for each location, and
the resulting numbers of the product sold are
noted
- Is there enough evidence to claim a difference?
92Chi-square goodness of fit
Total = 43 + 29 + … + 48 = 206
We expect 206/5 = 41.2 units sold in each location
H0: The distribution is uniform
HA: The distribution is not uniform
d.f. = n − 1
93Problem 33
- A geneticist claims that four species of fruit
flies should appear in the ratio of 1:3:3:9.
Suppose that a sample of 4000 fruit flies
contained 226, 764, 733, and 2277 flies of each
species, respectively. At the 10% significance
level, is there sufficient evidence to reject the
geneticist's hypothesis?
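The goodness-of-fit statistic for Problem 33 is the sum of (observed − expected)²/expected over the four species. The sketch below assumes the claimed ratio is 1:3:3:9 and takes the 10% critical value 6.251 for 3 degrees of freedom from a standard chi-square table.

```python
observed = [226, 764, 733, 2277]
ratio = [1, 3, 3, 9]  # assumption: the geneticist's claimed ratio

total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

critical = 6.251  # assumption: table value, 10% level, df = 3
reject = chi2 > critical
print(expected, round(chi2, 3), df, reject)
```

Every expected count is far above 5, so the test is applicable here.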
94Chi-square test warning
- Chi-square test is applicable only if the
expected value in each cell is greater than 5
(Compare to Binomial Distribution)
- If this doesn't hold, you might find Fisher exact
test more useful
95Problem 34
- A sample of teenagers might be divided into male
and female on the one hand, and those that are
and are not currently dieting on the other. We
hypothesize, perhaps, that the proportion of
dieting individuals is higher among the women
than among the men, and we want to test whether
any difference of proportions that we observe is
significant.
Expected < 5
96Fisher exact test
Hypergeometric Distribution
97Statistics
- Part IV Regression and ANOVA
98The least squares line
- A simple data set consists of n points (data
pairs) (xi, yi), i = 1, …, n, where xi is an
independent variable and yi is a dependent
variable whose value is found by observation.
- The model function has the form y = f(x, β), where β
is the vector of parameters.
- We wish to find those parameter values for which
the model "best" fits the data.
99Residuals
- The least squares method defines "best" as when
the sum of squared residuals, S = Σ ri², is a minimum.
- A residual is defined as the difference between
the values of the dependent variable and the
predicted values from the estimated model
- An example of a model is that of the straight
line.
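For the straight-line model the least squares parameters have a closed form: the slope is Sxy/Sxx and the line passes through the point of means. The sketch below uses a small made-up data set (the numbers are hypothetical, not taken from any problem here) and checks that the residuals sum to zero.

```python
# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

def sse(a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(round(slope, 3), round(intercept, 3), round(sse(intercept, slope), 4))
```

Any other line, for example one with a slightly different slope, gives a larger sum of squared residuals.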
100Regression Line
- Residuals are the little blue lines
- They are parallel to the y-axis
- Sum of squares of the residuals is at minimum
- Residuals for the inverse regression are
horizontal
101Residual plot
- The sum of the residuals is always zero
- A pattern in the residual plot indicates that a
non-linear model should be used
102Influential scores and outliers
- In regression, an outlier is a data point with
a large residual
- An influential score is a data point which
significantly influences the regression line
- If an influential score is removed from the
sample, the regression line will change
significantly
103Problem 34
- Which of the five points is an outlier, and which
is an influential score?
104Correlation Coefficient
- The sample correlation coefficient is 1 in the
case of an increasing linear relationship, -1 in
the case of a decreasing linear relationship, and
some value in between in all other cases,
indicating the degree of linear dependence
105Coefficient of determination
- SST: total sum of squares
- SSX: sum of squares explained by X
- SSE: sum of squares of residuals
- SST = SSX + SSE
- The square of the sample correlation coefficient,
which is also known as the coefficient of
determination, is the fraction of the variance in
yi that is accounted for by a linear fit of xi
to yi
106Sums of squares
[Figure: sums of squares marked in red and blue]
107Solving the regression
108SE of the regression slope
- The regression line is a result of random
sampling
- Different samples produce different lines
- There is a family of lines for the given
population; you get just one
109SE of the regression slope
where se is the standard deviation of the
regression error
110Problem 35
- What is the equation of the fitted line?
- Find an approximate confidence interval for the
regression slope.
- Test the hypothesis that the slope is non-zero
111Problem 36
Find the regression line and a 95% confidence
interval for the regression slope.
112Confidence vs. prediction intervals
- Suppose I fuel my car 7 days a week, from Sunday
to Sunday, each day at a randomly chosen gas
station. I get a sample of gasoline prices for 7
days
- Confidence interval is for the average gasoline
price on Monday
- Prediction interval is for a gasoline price at a
randomly chosen gas station on Monday
113Confidence vs. prediction intervals
- Confidence interval
- Prediction interval
114Problem 36 revisited
Find a 95% prediction interval for the next
dive at 25 degrees Celsius
115ANOVA Analysis of Variance
- A collection of models, in which the variance of
the observed set is partitioned into components
due to explanatory variables
- Assumptions
- Independence of observations
- The distributions in each of the groups are
normal
- Variance homogeneity, called homoscedasticity:
the variance of data in groups should be the
same.
116One-way ANOVA
- A manager wishes to determine whether the mean
times required to complete a certain task differ
for the three levels of employee training. He
randomly selected 10 employees with each of the
three levels of training.
- Do the data provide sufficient evidence to
indicate that the mean times required to complete
a certain task differ for at least two of the
three levels of training?
117Steiner's Theorem
[Figure: data points xi and a reference point a]
118Problem 36
- Three different milling machines were being
considered for purchase by a manufacturer.
Potentially, the company would be purchasing
hundreds of these machines, so it wanted to make
sure it made the best decision. Initially, five
of each machine were borrowed, and each was
randomly assigned to one of 15 technicians (all
technicians were similar in skill). Each machine
was put through a series of tasks and rated using
a standardized test. The higher the score on the
test, the better the performance of the machine.
The data are
119Partition of sum of squares
- SST = SSA + SSE
- SST: total sum of squares
- SSA: sum of squares for factor A
- SSE: sum of squares of errors
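The partition can be verified numerically on any grouped data set. The sketch below uses three made-up groups of five scores (hypothetical numbers, not the milling machine data) and also forms the F ratio of the between-group and within-group mean squares.

```python
from statistics import mean

# Hypothetical scores for three groups, for illustration only.
groups = {
    "A": [24, 26, 25, 27, 23],
    "B": [30, 29, 31, 32, 28],
    "C": [22, 21, 23, 20, 24],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)
k, N = len(groups), len(all_values)

sst = sum((x - grand_mean) ** 2 for x in all_values)
# Between groups: group means around the grand mean, weighted by group size.
ssa = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Within groups: observations around their own group mean.
sse = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

f_stat = (ssa / (k - 1)) / (sse / (N - k))
print(round(sst, 3), round(ssa, 3), round(sse, 3), round(f_stat, 3))
```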
120Partition of sum of squares
121The ANOVA table
- SSA: Sum of squares Between
- SSE: Sum of squares Within
122Problem 36 solution
In Excel: Tools -> Data Analysis -> Single Factor
ANOVA
123THE END
124Extra Problems
- All bags entering a research facility are
screened. Ninety-seven percent of the bags that
contain forbidden material trigger an alarm.
Fifteen percent of the bags that do not contain
forbidden material also trigger the alarm. If 1
out of every 1,000 bags entering the building
contains forbidden material, what is the
probability that a bag that triggers the alarm
will actually contain forbidden material?