Title: Power analysis and hypothesis testing with multiple samples
1 Power analysis and hypothesis testing with multiple samples
2 Types of error
- Type II error: failing to reject the null hypothesis when it is really false
  - Desired level: β
  - Is associated with a given effect size
  - E.g., we want a probability of 0.1 of failing to reject when the true difference between means is 0.35
- Type I error: rejecting the null hypothesis when it is really true
  - Desired level: α
3 Setting error levels
- α is controlled by setting the critical P-value for rejecting the null hypothesis
- β is decreased by
  - increasing α
  - increasing sample size (n)
  - decreasing sample variance, var(x)
  - increasing effect size, Δ
- Tradeoff between α and β
  - Need to balance the costs associated with type I and type II errors
- Power is 1 − β
- POWER ANALYSIS
  - Take a few samples to get an estimate of var(x)
  - Assume that the population has mean μ0 + Δ and variance var(x)
  - If I take n samples, what is the probability of failing to reject the null hypothesis (getting P > α)? This probability is β.
  - Either through simulation or theory
  - Adjust n to get the desired error level
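The power-analysis recipe above can be sketched in Python with only the standard library. This is a minimal sketch, assuming a one-sided test of H0: mean = μ0 with the standard deviation treated as known (a z-test); a t-test with an estimated standard deviation has somewhat lower power at small n. The function names and the inputs (Δ = 0.35, σ = 1.0, target β = 0.1) are hypothetical. Power is computed both from theory and by simulation, and n is increased until the target power 1 − β is reached.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def power_theory(delta, sd, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test of H0: mean = mu0 when the
    true mean is mu0 + delta. Assumption: sd is treated as known."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(delta * math.sqrt(n) / sd - z_crit)


def power_simulation(delta, sd, n, alpha=0.05, n_sim=10_000, seed=1):
    """Estimate the same power by drawing many samples of size n from a
    normal population with mean mu0 + delta and counting rejections."""
    rng = random.Random(seed)
    mu0 = 0.0  # the value of mu0 does not affect power
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(n_sim):
        sample = [rng.gauss(mu0 + delta, sd) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sd / math.sqrt(n))
        if z > z_crit:
            rejections += 1
    return rejections / n_sim


def smallest_n(delta, sd, target_power, alpha=0.05, n_max=100_000):
    """Increase n until the theoretical power reaches target_power."""
    for n in range(2, n_max + 1):
        if power_theory(delta, sd, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached by n_max")


# Hypothetical inputs: effect size 0.35, sd 1.0, desired beta = 0.1 (power 0.9).
n = smallest_n(delta=0.35, sd=1.0, target_power=0.9)
print(n, round(power_theory(0.35, 1.0, n), 3))
print(power_simulation(0.35, 1.0, 50), power_theory(0.35, 1.0, 50))
```

Raising `alpha`, `n`, or `delta`, or lowering `sd`, increases the returned power, which mirrors the list of ways to decrease β.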
4 Power and the water temperature test
5 Effect of sample size on power
6 Statistics and decision making: EPA's Data Quality Objectives (DQOs)
- What are DQOs? DQOs are qualitative and quantitative statements, developed using the DQO Process, that clarify study objectives, define the appropriate type of data, and specify tolerable levels of potential decision errors that will be used as the basis for establishing the quality and quantity of data needed to support decisions. DQOs define the performance criteria that limit the probabilities of making decision errors by considering the purpose of collecting the data; defining the appropriate type of data needed; and specifying tolerable probabilities of making decision errors.
- See link on class website
7 The DQO process
- State the Problem
  - Define the problem; identify the planning team; examine budget, schedule.
- Identify the Decision
  - State decision; identify study question; define alternative actions.
- Identify the Inputs to the Decision
  - Identify information needed for the decision (information sources, basis for Action Level, sampling/analysis method).
- Define the Boundaries of the Study
  - Specify sample characteristics; define spatial/temporal limits, units of decision making.
- Develop a Decision Rule
  - Define statistical parameter (mean, median); specify Action Level; develop logic for action.
- Specify Tolerable Limits on Decision Errors
  - Set acceptable limits for decision errors relative to consequences (health effects, costs).
- Optimize the Design for Obtaining Data
  - Select a resource-effective sampling and analysis plan that meets the performance criteria.
8 Preliminary assessment of household lead dust
- State the Problem
  - Describing the problem. The owners wished to evaluate the potential hazards associated with lead in dust in a single-family residence because other residences in the Athington Park House neighborhood had shown levels of lead in dust that might pose potential hazards.
  - Establishing the planning team. The planning team included the property owners, a certified risk assessor (to collect and handle dust samples and serve as a liaison with the laboratory), and a quality assurance specialist. The decision makers were the property owners.
  - Describing the conceptual model of the potential hazard. The conceptual model described a single-family residence in a neighborhood where hazardous levels of lead had been detected in other residences. Interior sources of lead in dust were identified as lead-based paint on doors, walls, and trim, which deteriorated to form, or attach to, dust particles. Exterior sources included lead in exterior painted surfaces that had deteriorated and leached into the dripline soil, or lead deposited from gasoline combustion fumes that accumulated in soil. In these cases, soil could be tracked into the house and collect as dust on floors, window sills, toys, etc. Because this dust could easily be ingested through hand-to-mouth activities, dust was considered a significant exposure route. Levels of lead in floor dust were to be used as an indicator of the potential hazard.
  - Identifying the general intended use of collected data. The data collected in this study would be used to determine whether a health hazard was present at Athington Park House, using the criteria established under 40 CFR 745. This is a decision-making (hypothesis testing) application of the DQO Process.
  - Identifying available resources, constraints, and deadlines. The property owners were willing to commit up to $1,000 for the study. To minimize inconvenience to the family, all sampling would be conducted during one calendar day.
- Identify the Decision
  - Specifying the primary study question. The primary question was whether there were significant levels of lead in floor dust at the House.
  - Determining the range of possible outcomes from this study. If there were significant levels of lead in floor dust at the residence, the team planned follow-up testing to determine whether immediately dangerous contamination existed and where in the property it was located. If not, then there was no potential lead hazard, and testing would be discontinued.
9 Preliminary assessment of household lead dust
- Identify the Inputs to the Decision
  - Identifying the types of information needed to resolve the decision statement. A dust lead hazard would be assessed by measuring dust lead loadings through individual dust wipe sampling according to established protocol.
  - Identifying the source of information. The EPA proposed standard stated that if dust lead levels were above 50 µg/ft² on bare floors, a lead health hazard was possible and follow-up testing and/or intervention should be undertaken (40 CFR 745).
  - Identifying how the Action Level will be determined. The Action Level is the EPA standard specified in 40 CFR 745.
  - Identifying appropriate sampling and analysis methods. Wipe samples were collected according to ASTM standard practice E1728. These samples were digested in accordance with ASTM standard practice E1644, and the sample extracts were chemically analyzed by ASTM standard test method E1613. The results of these analyses provided information on lead loading (i.e., µg of lead per square foot of wipe area) for each dust sample. The detection limit was well below the Action Level.
- Define the Boundaries of the Study
  - Specifying the spatial and temporal boundaries for collecting data. The spatial boundaries of the study area were defined as all floor areas within the dwelling that were reasonably accessible to young children who lived at, or visited, the property. Dust contained in each 1-ft² area of each floor of the residence was sampled and sent to a laboratory for analysis.
  - Specifying other practical constraints for collecting data. Permission from the residents of Athington Park House was required before risk assessors could enter the residence to collect dust wipe samples. Sampling was completed within 1 calendar day to minimize the inconvenience to the residents.
  - Specifying the scale of estimates to be made. The test results were considered to appropriately characterize the current and future hazards. It was possible that lead contained in soil could be tracked into the residence and collect on surfaces, but no significant airborne sources of lead deposition were known in the region. The dust was not expected to be transported away from the property; therefore, provided the exterior paint was maintained in intact condition, lead concentrations measured in the dust were not expected to change significantly over time.
  - Specifying the scale of inference for decision making. The decision unit was the interior floor surface (approximately 1,700 ft²) of the residence at the time of sampling and in the near future.
10 Preliminary assessment of household lead dust
- Develop a Decision Rule
  - Specifying the Action Level. This was given in 40 CFR 745, which specified 50 µg/ft².
  - Developing the population of interest and the theoretical decision rule. From 40 CFR 745, the median was selected as the appropriate parameter to characterize the population under study. The median dust lead loading was defined as the level, measured in µg/ft², above and below which 50% of all possible dust lead loadings at the property were expected to fall. If the true median dust lead loading in the residence was greater than 50 µg/ft², the planning team required follow-up testing. Otherwise, they decided that a dust lead hazard was not present and discontinued testing.
11 Preliminary assessment of household lead dust
- Determining the impact of decision errors and setting tolerable decision error limits. The edge of the gray region was designated by considering that a false acceptance decision error would result in the unnecessary expenditure of scarce resources for follow-up testing and/or intervention associated with a presumed hazard that did not exist. The planning team decided that this decision error should be adequately controlled for true dust lead loadings of 40 µg/ft² and below. Since human exposure to lead dust hazards causes serious health effects, the planning team decided to limit the false rejection error rate to 5%. This meant that if this dwelling's true median dust lead loading was greater than 50 µg/ft², the baseline condition would be correctly retained (not rejected) 19 out of 20 times. The false acceptance decision, which would result in unnecessary use of testing and intervention resources, was allowed to occur more frequently (i.e., 20% of the time when the true dust lead loading is 40 µg/ft² or less).
- Specify Tolerable Limits on Decision Errors
  - Setting the baseline condition. The baseline condition adopted by the property owners was that the true median dust lead loading was above the EPA hazard level of 50 µg/ft², due to the seriousness of the potential hazard. The planning team decided that the most serious decision error would be to decide that the true median dust lead loading was below the EPA hazard level of 50 µg/ft² when in truth it was above the hazard level. This incorrect decision would result in significant exposure to dust lead and adverse health effects.
13 EXAMPLE: LEAD DUST
- Preliminary sampling suggests that the standard deviation of lead dust observations is 30 µg/ft²
- We want to know how many observations we need to take so that β is 0.2 if the true mean dust concentration is 10 µg/ft² below the contamination threshold and we are using an α of 0.05
- For a t-test, α for a 1-sided test is equivalent to 2α for a 2-sided test
[Figure: power, 1 − β]
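Under a normal approximation, this calculation has a closed form: n = ((z_{1−α} + z_{1−β}) · σ / Δ)². A minimal sketch for the lead dust numbers, assuming a one-sided test and treating σ = 30 µg/ft² as known; because it uses normal rather than t quantiles, it slightly underestimates the n a t-test would need. The function name is hypothetical. Note that z_{1−α} for a one-sided α of 0.05 is the same quantile as for a two-sided α of 0.10, which is the equivalence stated above.

```python
import math
from statistics import NormalDist


def n_one_sided(sd, delta, alpha=0.05, beta=0.2):
    """Normal-approximation sample size for a one-sided test on a mean:
    n = ((z_{1-alpha} + z_{1-beta}) * sd / delta) ** 2, rounded up.
    Assumption: sd is treated as known (z rather than t quantiles)."""
    z = NormalDist().inv_cdf
    return math.ceil(((z(1 - alpha) + z(1 - beta)) * sd / delta) ** 2)


# Lead dust example: sd = 30 ug/ft^2, true mean 10 ug/ft^2 below the threshold.
print(n_one_sided(sd=30, delta=10, alpha=0.05, beta=0.2))
```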
14 Interlude: the lognormal distribution
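The lognormal distribution matters here because the lead dust decision rule is stated in terms of the median. If log(X) is normal with mean μ and standard deviation σ, the median of X is exp(μ), while its mean is exp(μ + σ²/2), pulled upward by the long right tail; a test on the mean of log(X) is therefore a test on the median of X. A short simulation check, with hypothetical parameters (median 50, σ = 1.5 on the log scale) and illustrative function names:

```python
import math
import random
import statistics


def lognormal_median(mu, sigma):
    """Median of X when log(X) ~ Normal(mu, sigma**2): exp(mu)."""
    return math.exp(mu)


def lognormal_mean(mu, sigma):
    """Mean of X when log(X) ~ Normal(mu, sigma**2): exp(mu + sigma**2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


# Hypothetical parameters: median 50, sd of log(X) equal to 1.5.
mu, sigma = math.log(50), 1.5
rng = random.Random(42)
draws = [rng.lognormvariate(mu, sigma) for _ in range(20_001)]
print(lognormal_median(mu, sigma), statistics.median(draws))
print(lognormal_mean(mu, sigma), statistics.fmean(draws))
```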
15 LEAD DUST DONE RIGHT
- Effect size is log(50) − log(40) = 0.22
- Std. dev. of log(lead) estimated as 1.5
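On the log scale, the sample-size calculation uses the log-scale effect size and standard deviation. A minimal sketch, assuming natural logs, the same one-sided α = 0.05 and β = 0.2 as in the lead dust example, and a normal approximation with σ treated as known; the function name is hypothetical. The required n (about 280) is large because a log-scale standard deviation of 1.5 is large relative to a log-scale effect of 0.22.

```python
import math
from statistics import NormalDist


def n_one_sided(sd, delta, alpha=0.05, beta=0.2):
    """Normal-approximation sample size for a one-sided test on a mean.
    Assumption: sd is treated as known."""
    z = NormalDist().inv_cdf
    return math.ceil(((z(1 - alpha) + z(1 - beta)) * sd / delta) ** 2)


# Work on the log scale (natural logs assumed).
effect_size = math.log(50) - math.log(40)  # about 0.22
sd_log = 1.5                               # estimated sd of log(lead)
print(round(effect_size, 3), n_one_sided(sd_log, effect_size))
```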
16 Cleanup of a contaminated site
- THE PROBLEM
  - A site has suffered the release of a toxic chemical (TcCB) into the soil, and the company responsible has undertaken cleanup activities.
  - How should we decide whether the cleanup has been adequate?
- THE DATA
  - We have samples of TcCB concentration (measured in ppb) in the soils at the cleanup site, as well as samples of concentrations at an uncontaminated reference site with similar soil characteristics.
  - The concentrations of TcCB at the reference site are not zero, and we need to determine what the normal levels of this chemical are.
17 EPA standards for assessing site contamination
- If a site has not been declared to be contaminated, then the null hypothesis should be that it is clean, i.e., there is no difference from the control site. The alternative hypothesis is that the site is contaminated. A non-significant test result leads to the conclusion that there is no real evidence that the site is contaminated.
- If a site has been declared to be contaminated, then the null hypothesis should be that this is true, i.e., there is a difference (in an unacceptable direction) from the control site. The alternative hypothesis is that the site is clean. A non-significant test result leads to the conclusion that there is no real evidence that the site has been cleaned up.
USEPA (1989). Methods for Evaluating the Attainment of Cleanup Standards. Vol. 1: Soils and Solid Media. EPA Report 230/02-89-042, Office of Policy, Planning and Evaluation, Washington, DC.
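18 COMPARING TWO GROUPS
- Two-sample t-test
  - Tests for differences between the means of two groups
  - Null hypothesis: the two group means are equal (two-sided test), or their difference lies in a specified direction (one-sided test)
  - Under the null hypothesis, the difference in means, divided by its standard error (which combines the variances of both groups), should follow a t distribution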
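The unequal-variance (Welch) form of this statistic, the version used in the TcCB conclusion, can be computed with the standard library. A minimal sketch with hypothetical data (not the TcCB measurements) and an illustrative function name. It returns only t and the Welch-Satterthwaite degrees of freedom; the P-value comes from a t distribution with that df, for example via `scipy.stats.ttest_ind(x, y, equal_var=False)` (with `alternative="less"` for a one-sided test in recent SciPy versions), which is not used here.

```python
import math
from statistics import fmean, variance


def welch_t(x, y):
    """Welch two-sample t statistic for mean(x) - mean(y) and the
    Welch-Satterthwaite degrees of freedom (no equal-variance assumption)."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (fmean(x) - fmean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical concentrations (not the TcCB data).
cleanup = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = [2.0, 4.0, 6.0, 8.0, 10.0]
print(welch_t(cleanup, reference))
```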
19 TcCB cleanup conclusion
- Using the null hypothesis that the cleanup site is contaminated with respect to the control site, we fail to reject the hypothesis that the cleanup site is still contaminated (one-sided two-sample t-test with unequal variances, t = 1.45, df = 76.05, P = 0.925).
20 Comparing fuel efficiency of two gasoline blends
- THE PROBLEM
  - The owner of a taxi company is evaluating two gasoline blends and wants to use the one that produces greater fuel efficiency.
  - How should she decide which (if either) produces greater efficiency?
- THE DATA
  - On one day, all the taxis in the fleet were fueled with gas A, and at the end of the day the efficiency of each car (in mpg) was calculated.
  - On the next day, all the taxis in the fleet were fueled with gas B, and at the end of the day the efficiency of each car was calculated.
21 Two-sample t-test of gas data
- We want to control for variability among drivers
22 COMPARING TWO GROUPS
- Paired t-test
  - Each observation is a pair of measurements
    - Water quality upstream and downstream of a road crossing
    - Fuel mileage by a taxi driver using two brands of gasoline
  - Natural variability between sampling units might swamp differences between the means
    - Streams have different background water quality
    - Drivers have different driving styles
  - Instead, test for the mean of the differences
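A paired t-test is a one-sample t-test on the within-pair differences, with n − 1 degrees of freedom. The sketch below uses hypothetical mileage for five drivers (not the taxi fleet data) and contrasts the paired statistic with an unpaired one on the same numbers, to show how driver-to-driver variation inflates the unpaired standard error. Function names are illustrative.

```python
import math
from statistics import fmean, stdev


def paired_t(a, b):
    """Paired t statistic: one-sample t-test on the differences b - a (df = n - 1)."""
    diffs = [bi - ai for ai, bi in zip(a, b)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def unpaired_t(a, b):
    """Welch-type statistic that ignores the pairing, for comparison."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (fmean(b) - fmean(a)) / se


# Hypothetical mpg: position i in each list is the same driver.
gas_a = [20.0, 22.0, 19.0, 24.0, 25.0]
gas_b = [21.0, 23.0, 21.0, 24.0, 27.0]
print(paired_t(gas_a, gas_b))    # pairing removes driver-to-driver variation
print(unpaired_t(gas_a, gas_b))  # driver variation inflates the standard error
```

Here the paired t is about 3.2 while the unpaired statistic is below 1, even though the mean difference (1.2 mpg) is the same in both.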
23 CONCLUSION
- We find strong evidence that mileage differs between gas A and gas B (paired t-test, t = 3.12, df = 9, P = 0.012). On average, the fuel efficiency with gas B is 0.6 mpg greater than with gas A.
24 COMPARING MEANS OF 3 OR MORE GROUPS
- ANOVA (ANalysis Of VAriance)
  - Like a 2-sample t-test, but with multiple groups
  - H0: All groups have the same mean
  - HA: Not all groups have the same mean
  - Rejecting H0 doesn't tell you which groups differ
    - Can follow up with pairwise t-tests to find out
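The F statistic compares variability between group means with variability within groups. A minimal sketch of the classic (equal-variance) one-way ANOVA with hypothetical data; the function name is illustrative, and the P-value, which comes from an F distribution with (k − 1, N − k) degrees of freedom, is not computed here (`scipy.stats.f_oneway` performs the full test).

```python
from statistics import fmean


def one_way_anova(groups):
    """Classic one-way ANOVA F statistic and its (numerator, denominator) df."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, (df_between, df_within)


# Hypothetical highway mileage for three car types.
groups = [[20.0, 21.0, 22.0], [23.0, 24.0, 25.0], [26.0, 27.0, 28.0]]
print(one_way_anova(groups))
```

With this convention, a reported df of 5, 86 corresponds to 6 groups and 92 observations in total.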
25 CONCLUSION: Very strong evidence that highway mileage differs among car types (one-way ANOVA, F = 23.67, df = 5, 86, P < 0.0001)
26 HYPOTHESIS TESTING OVERVIEW
27 ASSUMPTIONS OF T-TEST AND ANOVA
- T-test
  - Distribution within each group is normal
- ANOVA
  - Distribution within each group is normal
  - Variances of all groups are the same
- Both tests are robust to moderate violations of these assumptions
  - Regard the P-value as an approximate value
- TcCB data
  - Assumption of normality is badly violated
  - Solution: do tests on transformed data
- Car mileage data
  - Assumption of equal variances is badly violated
  - Solution: perform a Welch ANOVA, which does not assume equal variances
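Welch's ANOVA replaces the pooled within-group variance with weights w_i = n_i / s_i², so groups with large variances count less. A minimal sketch of the standard Welch statistic and its approximate degrees of freedom, with hypothetical data and an illustrative function name; the P-value (from an F distribution with df1 and the non-integer df2) is not computed. In R, `oneway.test(..., var.equal = FALSE)` performs this test.

```python
from statistics import fmean, variance


def welch_anova(groups):
    """Welch's one-way ANOVA (unequal variances): F statistic and (df1, df2)."""
    k = len(groups)
    n = [len(g) for g in groups]
    means = [fmean(g) for g in groups]
    w = [ni / variance(g) for ni, g in zip(n, groups)]  # weights n_i / s_i^2
    w_sum = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, means)) / w_sum
    # Weighted between-group mean square.
    a = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, means)) / (k - 1)
    # Correction term that also determines the denominator df.
    lam = sum((1 - wi / w_sum) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    b = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    df2 = (k ** 2 - 1) / (3 * lam)
    return a / b, (k - 1, df2)


# Hypothetical mileage for three car types with very different spreads.
groups = [[20.0, 21.0, 22.0], [22.0, 25.0, 28.0], [18.0, 27.0, 36.0]]
print(welch_anova(groups))
```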