Title: Data description
1 Data description and visualization
2 The problem / The data
The problem:
- A site has suffered the release of a toxic chemical (1,2,3,4-Tetrachlorobenzene, or TcCB) into the soil, and the company responsible has undertaken cleanup activities.
- How should we decide whether the cleanup has been adequate?
The data:
- We have samples of TcCB concentration (measured in ppb) in the soils at the cleanup site, as well as samples of concentrations at an uncontaminated reference site with similar soil characteristics.
- The concentrations of TcCB at the reference site are not zero, and we need to determine what the normal levels of this chemical are.
3 Visualizing univariate data I: the histogram
4 Histogram shape is sensitive to bin size
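The bin-size effect can be sketched without any plotting library. This is a minimal illustration, assuming made-up concentrations (not the TcCB samples) that are all at or above `start`; `bin_counts` is a hypothetical helper, not part of any package.

```python
def bin_counts(values, width, start=0.0):
    """Count observations in consecutive bins [start + k*width, start + (k+1)*width)."""
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

# Hypothetical concentrations (ppb), invented for illustration only.
data = [0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.9, 1.1, 1.15, 1.2, 1.9]

# The same data drawn as a text histogram with three different bin widths.
for width in (0.25, 0.5, 1.0):
    print(f"bin width {width}:")
    for k, count in enumerate(bin_counts(data, width)):
        print(f"  [{k * width:.2f}, {(k + 1) * width:.2f})  {'#' * count}")
```

Narrow bins show gaps and a second cluster; wide bins merge everything into a single decreasing shape.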
5 Nonparametric density plot
6 Reference vs. Cleanup sites
7 Describing the data I: Measures of central tendency
Site       Mean (ppb)  Median (ppb)
Reference  0.599       0.54
Cleanup    3.915       0.43
- Arithmetic mean
  - The average observation
  - Strongly influenced by extreme values in the data
- Median
  - The typical observation
  - Half the observations are above it and half are below it
  - Robust to extreme values
- Mode
  - The most likely observation
  - Peak in the frequency distribution
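A small sketch of how one extreme value moves the mean but not the median or mode. The numbers are hypothetical (not the TcCB samples), and only the standard-library `statistics` module is used.

```python
import statistics

# Hypothetical concentrations (ppb): mostly low values, then one added "hot spot".
clean = [0.3, 0.4, 0.4, 0.5, 0.6, 0.7]
with_hot_spot = clean + [50.0]

for label, data in (("without hot spot", clean), ("with hot spot", with_hot_spot)):
    print(label,
          "| mean:", round(statistics.mean(data), 3),
          "| median:", statistics.median(data),
          "| mode:", statistics.mode(data))
```

The mean jumps by more than an order of magnitude when the hot spot is added, while the median and mode barely change. This is the same pattern as in the reference and cleanup summaries.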
8 Describing the data II: Measures of variability
Site       Variance  Std. deviation  CV (%)   Std. error  IQR
Cleanup    400.624   20.016          511.229  2.281       0.97
Reference  0.0805    0.284           47.391   0.0414      0.37
- Variance
  - Mean squared difference from the mean
- Standard deviation
  - Square root of the variance
- Coefficient of variation
  - Standard deviation scaled by the mean (here as a percentage)
- Standard error
  - Measure of precision of the estimated mean
- Interquartile range
  - Distance between the 25th percentile (top of the bottom quarter of the data) and the 75th percentile (bottom of the upper quarter)
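The relationships among these measures can be checked against the cleanup-site summary values. This sketch assumes that 400.624 is the variance, that 3.915 is the cleanup mean from the central tendency summary, and that n = 77 is the cleanup sample size quoted later in the slides.

```python
import math

# Cleanup-site summary values as quoted in the slides (column meanings are assumed).
mean, variance, n = 3.915, 400.624, 77

sd = math.sqrt(variance)        # standard deviation: square root of the variance
cv_percent = 100 * sd / mean    # coefficient of variation, as a percentage of the mean
se = sd / math.sqrt(n)          # standard error of the mean

print(f"SD = {sd:.3f} ppb, CV = {cv_percent:.1f}%, SE = {se:.3f} ppb")
```

To within rounding, these reproduce the remaining cleanup entries (20.016, about 511, and 2.281). The interquartile range cannot be recovered from summary values; it needs the raw observations.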
9 Describing the data III: Measures of shape
Site       Skew   Kurtosis
Reference  0.902  0.132
Cleanup    7.717  62.675
- Skew: measure of symmetry of the data around the mean
  - Perfect symmetry: skew = 0
- Kurtosis: measure of peakedness of the data
  - Kurtosis < 0 (platykurtic): points evenly distributed
  - Kurtosis > 0 (leptokurtic): most points either close to the mean or very far away
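A minimal sketch of moment-based skew and excess kurtosis (kurtosis minus 3, so a normal distribution scores 0, matching the sign convention above). The data are hypothetical. The software behind the tabulated values may apply small-sample corrections, so this simple estimator is not guaranteed to reproduce those numbers.

```python
def skew_and_excess_kurtosis(values):
    """Moment-based skew and excess kurtosis (both 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

evenly_spread = [1, 2, 3, 4, 5]                   # hypothetical, symmetric and flat
one_hot_spot = [1, 1, 1, 1, 1, 1, 1, 1, 1, 20]    # hypothetical, one extreme value

for label, data in (("evenly spread", evenly_spread), ("one hot spot", one_hot_spot)):
    skew, kurt = skew_and_excess_kurtosis(data)
    print(f"{label}: skew = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

The evenly spread sample is symmetric and platykurtic; the hot-spot sample is strongly right-skewed and leptokurtic.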
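10 Visualizing univariate data II: the box plot
[Figure: annotated box plot. Labels, from top to bottom:]
- Outlier
- Upper whisker: 1.5 x interquartile range above the 75th percentile, or the maximum of the data
- 95% CI of the mean
- 75th percentile
- Interquartile range
- Mean
- Median
- Most compact 50%
- 25th percentile
- Lower whisker: 1.5 x interquartile range below the 25th percentile, or the minimum of the data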
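The quantities in the box plot can be computed directly. This sketch follows the whisker rule given in the labels (1.5 x IQR beyond the quartile, or the data extreme if that is closer). It uses hypothetical data and the "inclusive" quartile method of Python's `statistics.quantiles`, which is one of several quartile conventions and an assumption here.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the fence or at the data extreme, whichever comes first.
        "whisker_low": max(min(values), low_fence),
        "whisker_high": min(max(values), high_fence),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical, with one extreme value
summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Many plotting packages instead end the whisker at the most extreme observation inside the fence, so whisker lengths can differ slightly between tools.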
11 Visualizing univariate data III: the cumulative distribution function (CDF)
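An empirical CDF is simple to build: sort the observations and record the fraction of the data at or below each one. A minimal sketch with hypothetical values; both function names are illustrative.

```python
def ecdf(values):
    """Return the sorted values and the fraction of observations <= each of them."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]

def fraction_at_or_below(values, threshold):
    """Read the empirical CDF at an arbitrary threshold."""
    return sum(v <= threshold for v in values) / len(values)

data = [0.43, 0.2, 1.1, 0.6, 15.0, 0.3, 0.5, 0.9]   # hypothetical concentrations (ppb)
xs, ps = ecdf(data)
for x, p in zip(xs, ps):
    print(f"{x:>6.2f}  {p:.3f}")
print("fraction at or below 1 ppb:", fraction_at_or_below(data, 1.0))
```

Plotting `ps` against `xs` as a step function gives the CDF. Unlike a histogram, it needs no bin size.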
12 Conclusions: TcCB
- On average, the cleanup site is still more contaminated than the reference site
- This is because of a few extremely contaminated hot spots; the medians are about the same
- However, parts of the cleanup site are cleaner than the reference site
- Cleanup technology is effective
- A few spots are still badly contaminated
- Need to focus on cleaning up hot spots
13 Presenting data summaries
- The mean soil concentrations of TcCB were [...] (n = 47) at the reference site and [...] (n = 77) at the cleanup site.
- Always give sample size
- Always explain what you are reporting when using +/- notation
14 OBSERVATIONS, POPULATIONS, AND PARAMETERS
- Observation: a single datum of information about the environment
  - Also called experiment or event or sample
- Sample space: set of all possible observations about the system
  - Might be infinite
  - Also called population
  - If we observe every sample in the sample space then we have completely characterized the system
- We are interested in some parameters that describe the sample space
- Goal of statistics: estimate parameters without taking all of the possible samples
15 RANDOM VARIABLES
- Each observation is a random variable
  - Prior to taking the observation, we cannot perfectly predict its value
  - The observation has a certain probability of taking on a given value
  - This probability is generally unknown
  - The probability distribution is determined by the properties of the sample space
- A parameter estimate is also a random variable
  - If we repeat the process of collecting n observations and estimating the parameter, we will get a different value
  - If we do this many times, we will see a probability distribution for the estimate
  - The width of this distribution (the variance) tells us about the uncertainty in the parameter estimate
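The idea that an estimate is itself a random variable can be sketched by simulation. The population here is hypothetical (normal, mean 10, standard deviation 2), and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from a hypothetical population (normal, mean 10, SD 2)."""
    return statistics.mean(random.gauss(10, 2) for _ in range(n))

n = 25
# Repeat the "collect n observations, estimate the mean" process many times.
estimates = [sample_mean(n) for _ in range(2000)]
spread = statistics.stdev(estimates)

print(f"first few estimates: {[round(e, 2) for e in estimates[:5]]}")
print(f"SD of the estimates: {spread:.3f} (theory for this population: {2 / n ** 0.5:.3f})")
```

Each repetition gives a different estimate. The spread of those estimates is what the standard error quantifies.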
16 CONFIDENCE INTERVALS
- If we know how the data were sampled:
  - We can construct a confidence interval (C.I.) for an unknown parameter, $\theta$.
  - A 95% C.I. gives a range such that the true $\theta$ is in the interval 95% of the time.
  - A $100(1-\alpha)\%$ C.I. captures the true $\theta$ a fraction $(1-\alpha)$ of the time.
  - Smaller $\alpha$: more sure that the true $\theta$ falls in the interval, but a wider interval.
- C.I. FOR MEAN OF NORMALLY DISTRIBUTED DATA
  - 95% C.I. for $\mu$: $\bar{x} \pm t_{97.5}\,\mathrm{SE}$, where $\bar{x}$ is the sample mean
  - SE is the standard error of the mean.
  - $t_{97.5}$ is the critical value of the t distribution
  - The critical t value depends on sample size (n)
  - If n > 20, then $t_{97.5} \approx 1.96 \approx 2$
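As a worked example, here is the interval for the reference site. It assumes that 0.599 and 0.0414 from the earlier summaries are the reference mean and standard error in ppb, with n = 47. Because n > 20, the sketch uses the normal critical value (about 1.96) in place of the exact t value, which for this sample size is slightly larger.

```python
from statistics import NormalDist

# Reference-site summary values quoted in the slides (assumed to be mean and SE, in ppb).
mean, se, n = 0.599, 0.0414, 47

# Large-sample stand-in for the t critical value: the 97.5th percentile of the standard normal.
critical = NormalDist().inv_cdf(0.975)

lower = mean - critical * se
upper = mean + critical * se
print(f"95% C.I. for the mean: {lower:.3f} to {upper:.3f} ppb (n = {n})")
```

With the exact t critical value the interval would be marginally wider. The difference shrinks as n grows.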
17 Accuracy
- Precision
  - Repeatability of the estimate
  - What would we get if we took a new sample and recalculated the estimate?
  - Quantified by the standard error
  - Also called efficiency
- Bias
  - Systematic over- or under-estimate
  - Can often be compensated for
  - E.g., dividing by n - 1 in the calculation of the sample variance is a bias correction
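The n - 1 correction can be verified exactly on a tiny example by enumerating every possible sample, so no simulation noise is involved. The population and sample size below are hypothetical, and sampling is assumed to be with replacement.

```python
from itertools import product

population = [1, 2, 3, 4]   # hypothetical population
n = 2                       # sample size

pop_mean = sum(population) / len(population)
true_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)

def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))
avg_div_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print("true variance:              ", true_variance)
print("average estimate, divide n:  ", avg_div_n)
print("average estimate, divide n-1:", avg_div_n_minus_1)
```

Averaged over all samples, dividing by n underestimates the true variance, while dividing by n - 1 recovers it.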
18 CENTRAL LIMIT THEOREM
- Normality requirement actually applies to the parameter estimate, not to the data
- Sample mean is based on the sum of a bunch of random variables
- Sum of normal random variables is itself normally distributed
- Central Limit Theorem says that the sum of enough independent random variables from any probability distribution with finite variance is approximately normally distributed
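A simulation sketch of the theorem, assuming a hypothetical, strongly right-skewed population (exponential with mean 1). Skew is used as a rough yardstick of non-normality; the seed is fixed for repeatability.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

def skew(values):
    """Moment-based skew (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Raw observations from the skewed population.
raw = [random.expovariate(1.0) for _ in range(5000)]

# Means of 50 observations each, repeated 2000 times.
means_of_50 = [statistics.fmean(random.expovariate(1.0) for _ in range(50))
               for _ in range(2000)]

print(f"skew of raw observations: {skew(raw):.2f}")
print(f"skew of means of 50:      {skew(means_of_50):.2f}")
```

The raw data are far from normal, but the distribution of the sample means is already much closer to symmetric.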
19 Comparing data to a Normal distribution: the QQ plot
[Figure: QQ plot of the quantiles of the data against the quantiles of the standard normal]
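The points of a QQ plot can be computed with the standard library alone. This sketch assumes plotting positions of (i + 0.5) / n, which is one common convention among several, and uses hypothetical data; `qq_points` is an illustrative name.

```python
from statistics import NormalDist

def qq_points(values):
    """Pair each sorted observation with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    # Plotting positions (i + 0.5) / n: an assumed convention, others exist.
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

data = [0.43, 0.2, 1.1, 0.6, 15.0, 0.3, 0.5, 0.9, 0.7]   # hypothetical, right-skewed
for normal_q, data_q in qq_points(data):
    print(f"{normal_q:>6.2f}  {data_q:>6.2f}")
```

If the data were normal, the pairs would fall close to a straight line. An extreme value like the one here bends the upper end sharply upward.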
20 Transforming data
- Why transform?
  - Make the data (or the residuals from the regression) more symmetric
  - Eliminate nonlinear relationships
  - Control other violations of regression assumptions
- But do transformed data mean anything?
  - Often several natural scales
  - T-test for means of log-transformed data is equivalent to comparing medians of untransformed data
- Controlling skew: power transformations
  - Powers < 1 reduce skew
    - Square root (1/2)
    - Log (0)
    - Inverse (-1)
  - Powers > 1 increase skew
    - Squared (2)
  - Also called Box-Cox transformation
- Logit transformation often used on proportion data
  - We will learn a better way
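The ladder of powers can be sketched with the usual Box-Cox form, (x^p - 1) / p, with the log as the p = 0 case. The data are hypothetical, positive, and right-skewed; `power_transform` and `skew` are illustrative helpers.

```python
import math

def power_transform(x, power):
    """Box-Cox form of a power transformation; requires x > 0."""
    if power == 0:
        return math.log(x)          # the limit of (x**p - 1) / p as p -> 0
    return (x ** power - 1) / power

def skew(values):
    """Moment-based skew (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

data = [0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 3.0, 9.0, 40.0]   # hypothetical, right-skewed

powers = (2, 1, 0.5, 0, -1)
skews = [skew([power_transform(x, p) for x in data]) for p in powers]
for p, s in zip(powers, skews):
    print(f"power {p:>4}: skew = {s:.2f}")
```

Stepping down the ladder from squared to inverse lowers the skew at every step. For these data the inverse overshoots and leaves the sample left-skewed, so the best power lies somewhere in between.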
21 Log-transformed TcCB data (reference)
22 Log-transformed TcCB (cleanup)
23 The problem / The data
The problem:
- We are trying to design a fuel-efficient car, and want to know what characteristics of current cars are associated with high fuel efficiency.
The data:
- We have data on 60 models of cars, with measurements of
  - Fuel efficiency (mpg)
  - Fuel consumption (gallons per 100 miles)
  - Weight (lbs)
  - Engine displacement (cubic inches)
  - Car class (small, sporty, compact, medium, large, and vans)
24 The Scatterplot Matrix
25 Compact cars / Sporty cars
26 Measures of association
- Covariance
  - Magnitude depends on strength of association and on variance of each variable
  - Covariance of something with itself is the variance
- Correlation
  - Measure of association, independent of variability in data
  - 1 = perfect positive association
  - -1 = perfect negative association
  - 0 = no (linear) association
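A minimal sketch of both measures, written out by hand so the formulas are visible. The five cars are hypothetical (not the 60-model data set), and the n - 1 divisor is assumed for the sample covariance.

```python
import math

def covariance(x, y):
    """Sample covariance (n - 1 divisor)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance rescaled by both standard deviations; always between -1 and 1."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical cars: weight (lbs) and fuel efficiency (mpg).
weight_lbs = [2200, 2600, 3000, 3400, 3800]
mpg = [34, 30, 25, 22, 19]
weight_kg = [w * 0.4536 for w in weight_lbs]   # same cars, different units

print("cov(lbs, mpg):", round(covariance(weight_lbs, mpg), 1))
print("cov(kg, mpg): ", round(covariance(weight_kg, mpg), 1))
print("cor(lbs, mpg):", round(correlation(weight_lbs, mpg), 3))
print("cor(kg, mpg): ", round(correlation(weight_kg, mpg), 3))
```

Changing the units changes the covariance but leaves the correlation untouched, which is why correlations are easier to compare across pairs of variables.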
27 Covariance matrix for fuel data
28 Correlation matrix for fuel data
29 Fuel data: conclusions
- Horsepower, car weight, and engine displacement are positively associated
- Weight has the strongest negative association with mileage
- Inverse of mileage (gallons per mile) might be a better variable to look at (gets rid of nonlinearity)
- To get more fuel-efficient cars, focus on making them lighter and giving them smaller engines
30 Some questions
- Are mean TcCB levels in the cleanup site significantly higher than in the reference site? How confident are we that there is a difference?
- Do car types differ in their fuel economy? How much of this is a function of their intrinsic design, vs. differences in size?
31 Further reading
- Online reader, Visualizing and Summarizing Data
- Online reader, Confidence Intervals