Data description - PowerPoint PPT Presentation

Provided by: brucek64

Title: Data description


1
Data description and visualization
  • ESM 206, 3/31/05

2
The problem and the data
  • A site has suffered the release of a toxic
    chemical (1,2,3,4-Tetrachlorobenzene -- TcCB)
    into the soil, and the company responsible has
    undertaken cleanup activities.
  • How should we decide whether the cleanup has been
    adequate?
  • We have samples of TcCB concentration (measured
    in ppb) in the soils at the cleanup site, as well
    as samples of concentrations at an uncontaminated
    reference site with similar soil
    characteristics.
  • The concentrations of TcCB at the reference site
    are not zero, and we need to determine what the
    normal levels of this chemical are.

3
Visualizing univariate data I: the histogram
4
Histogram shape is sensitive to bin size
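The effect of bin size can be seen even without a plot. The following is a minimal sketch, assuming made-up values (not the TcCB data) and a hypothetical helper named `bin_counts`; it counts the same observations with two different bin widths.

```python
from collections import Counter

def bin_counts(data, width):
    """Count observations per bin; bin k covers [k*width, (k+1)*width)."""
    counts = Counter(int(x // width) for x in data)
    return dict(sorted(counts.items()))

# Hypothetical values, invented for illustration only.
data = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 1.1, 1.2, 1.9]

narrow = bin_counts(data, 0.5)  # four bins: a peak near zero with a tail
wide = bin_counts(data, 1.0)    # two bins: the tail structure is hidden
print(narrow)
print(wide)
```

With the narrow bins the counts fall off gradually; with the wide bins the same data collapse into two blocks.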
5
Nonparametric density plot
6
Reference vs. Cleanup sites
7
Describing the data I: Measures of central tendency
             Mean (ppb)   Median (ppb)
Reference    0.599        0.54
Cleanup      3.915        0.43
  • Arithmetic mean
      • The average observation
      • Strongly influenced by extreme values in the data
  • Median
      • The typical observation
      • Half the observations are above it and half are
        below it
      • Robust to extreme values
  • Mode
      • The most likely observation
      • Peak in the frequency distribution
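A small sketch of the three measures, assuming a made-up sample with one extreme value (these are not the TcCB measurements). It uses Python's standard `statistics` module to show how the mean is pulled toward the extreme value while the median and mode are not.

```python
import statistics

# Hypothetical concentrations (ppb), invented for illustration only.
sample = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 25.0]

mean = statistics.mean(sample)      # pulled upward by the single extreme value
median = statistics.median(sample)  # middle observation, robust to the extreme
mode = statistics.mode(sample)      # most frequent value

print(mean, median, mode)
```

Here the one hot-spot-like value moves the mean far above the typical observation, which mirrors the pattern in the cleanup-site row of the table.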

8
Describing the data II: measures of variability
             Variance   Std. dev.   CV (%)     Std. error   IQR
Cleanup      400.624    20.016      511.229    2.281        0.97
Reference    0.0805     0.284       47.391     0.0414       0.37
  • Variance
      • Mean squared difference from mean
  • Standard deviation
      • Square root of variance
  • Coefficient of variation
      • Standard deviation scaled by the mean
  • Standard error
      • Measure of precision of the estimated mean
  • Interquartile range
      • Distance between the 25th and 75th percentiles
        (spans the middle half of the data)
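The cleanup-site values reported on these slides can be checked against the definitions. This sketch assumes the usual formulas (standard deviation as the square root of the variance, CV as 100 x SD / mean, standard error as SD / sqrt(n)); the variance, mean, and n are the slide values, and the variable names are illustrative.

```python
import math

# Cleanup-site summary values taken from the slides.
variance = 400.624  # from the variability table
mean = 3.915        # from the central tendency table
n = 77              # sample size given under "Presenting data summaries"

sd = math.sqrt(variance)     # standard deviation
cv_percent = 100 * sd / mean # coefficient of variation, in percent
se = sd / math.sqrt(n)       # standard error of the mean

print(round(sd, 3), round(cv_percent, 1), round(se, 3))
```

The results agree with the tabulated standard deviation and standard error; the CV differs only slightly, presumably because the mean shown on the slide is rounded.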

9
Describing the data III: measures of shape
             Skew     Kurtosis
Reference    0.902    0.132
Cleanup      7.717    62.675
  • Skew: measure of symmetry of data around mean
      • Perfect symmetry: skew = 0
  • Kurtosis: measure of peakedness of data
      • Kurtosis < 0 (platykurtic): points evenly
        distributed
      • Kurtosis > 0 (leptokurtic): most points
        either close to mean or very far away
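A minimal sketch of both shape measures, assuming the simple moment-based definitions (skew = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3, so a normal distribution scores 0). Statistical packages often apply small-sample corrections, so this is not necessarily how the tabulated values were computed. The function name and both samples are made up.

```python
def shape_measures(data):
    """Return (skew, excess kurtosis) from central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Hypothetical samples, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]          # evenly spread: skew 0, kurtosis < 0
right_skewed = [1, 1, 2, 2, 3, 20]   # one far-away value: skew > 0, kurtosis > 0

sym_skew, sym_kurt = shape_measures(symmetric)
rs_skew, rs_kurt = shape_measures(right_skewed)
print(sym_skew, sym_kurt, rs_skew, rs_kurt)
```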

10
Visualizing univariate data II: the box plot
[Figure: annotated box plot. Labels, from top to bottom: outlier;
upper whisker (1.5 x interquartile range above 75th percentile, or
maximum of data); 95% CI of mean; 75th percentile; mean; median;
25th percentile; lower whisker (1.5 x interquartile range below
25th percentile, or minimum of data). The box spans the
interquartile range (most compact 50% of the data).]
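The quantities drawn in a box plot can be computed directly. This is a sketch under two assumptions: quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles` (other software may use different rules), and whiskers end at the most extreme observation inside the 1.5 x IQR fences. The function name and data are made up.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # minimum of data if nothing is below the fence
        "whisker_high": max(inside),   # largest value within the upper fence
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

# Hypothetical values, invented for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(data)
print(summary)
```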
11
Visualizing univariate data III: the cumulative
distribution function (CDF)
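An empirical CDF is simply the fraction of observations at or below each value. A minimal sketch, with a hypothetical helper `ecdf` and made-up data:

```python
def ecdf(data, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for value in data if value <= x) / len(data)

# Hypothetical concentrations (ppb), invented for illustration only.
data = [0.2, 0.3, 0.3, 0.4, 0.5]

print(ecdf(data, 0.1))  # below every observation
print(ecdf(data, 0.3))  # three of five observations are <= 0.3
print(ecdf(data, 1.0))  # above every observation
```

Plotting `ecdf(data, x)` against x for two sites on the same axes lets the whole distributions be compared, not just their means.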
12
Conclusions: TcCB
  • On average, the cleanup site is still more
    contaminated than the reference site
  • This is because of a few extremely contaminated
    hot spots; the medians are about the same
  • However, parts of the cleanup site are cleaner
    than the reference site
  • Cleanup technology is effective
  • A few spots are still badly contaminated
  • Need to focus on cleaning up hot spots

13
Presenting data summaries
  • The mean soil concentrations of TcCB
    were [...] (n = 47) at the reference
    site and [...] (n = 77) at the
    cleanup site.
  • Always give sample size
  • Always explain what you are reporting when using
    ± notation

14
OBSERVATIONS, POPULATIONS, AND PARAMETERS
  • Observation: a single datum of information about
    the environment
      • Also called experiment or event or sample
  • Sample space: set of all possible observations
    about the system
      • Might be infinite
      • Also called population
      • If we observe every sample in the sample space
        then we have completely characterized the system
  • We are interested in some parameters that
    describe the sample space
  • Goal of statistics: estimate parameters without
    taking all of the possible samples

15
RANDOM VARIABLES
  • Each observation is a random variable
      • Prior to taking the observation, we cannot
        perfectly predict its value
      • The observation has a certain probability of
        taking on a given value
      • This probability is generally unknown
      • The probability distribution is determined by the
        properties of the sample space
  • A parameter estimate is also a random variable
      • If we repeat the process of collecting n
        observations and estimating the parameter, we
        will get a different value
      • If we do this a bunch of times, we will see a
        probability distribution for the estimate
      • The width of this distribution (the variance)
        tells us about the uncertainty in the parameter
        estimate
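The idea that an estimate has its own distribution can be illustrated by simulation. This sketch assumes an exponential population with mean 1 and standard deviation 1 (an arbitrary choice for illustration, not a model of the TcCB data), repeatedly draws n = 25 observations, and records the sample mean each time. The seed is arbitrary.

```python
import math
import random
import statistics

rng = random.Random(206)  # fixed seed so the run is repeatable
n = 25                    # observations per sample
repeats = 2000            # number of times the sampling is repeated

# Each entry is one parameter estimate (a sample mean) from a fresh sample.
estimates = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

spread = statistics.stdev(estimates)  # width of the estimate's distribution
theory = 1.0 / math.sqrt(n)           # population SD / sqrt(n)
print(spread, theory)
```

The spread of the repeated estimates comes out close to the population standard deviation divided by sqrt(n), which is the quantity the standard error estimates from a single sample.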

16
CONFIDENCE INTERVALS
  • If we know how the data were sampled:
      • We can construct a confidence interval (C.I.) for
        an unknown parameter, θ.
      • A 95% C.I. gives a range such that the true θ is
        in the interval 95% of the time.
      • A 100(1-α)% C.I. captures the true θ a fraction
        (1-α) of the time.
      • Smaller α: more sure the true θ falls in the
        interval, but a wider interval.
  • C.I. FOR MEAN OF NORMALLY DISTRIBUTED DATA
      • 95% C.I. for μ: $\bar{x} \pm t_{97.5} \cdot SE$
      • SE is the standard error of the mean.
      • t97.5 is the critical value of the t distribution
      • Critical t value depends on sample size (n)
      • If n > 20, then t97.5 ≈ 1.96 ≈ 2
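Applying the interval to the summary values reported on the earlier slides gives a feel for the numbers. This sketch assumes the rule of thumb above (critical t value of about 2, since both sample sizes exceed 20); the function name `approx_ci95` is made up, and the means and standard errors are the slide values.

```python
def approx_ci95(mean, se, t_crit=2.0):
    """Approximate 95% C.I. for a mean: mean +/- t_crit * SE."""
    half_width = t_crit * se
    return mean - half_width, mean + half_width

# Means and standard errors (ppb) as reported on the slides.
reference_ci = approx_ci95(0.599, 0.0414)  # n = 47
cleanup_ci = approx_ci95(3.915, 2.281)     # n = 77

print(reference_ci)
print(cleanup_ci)
```

The cleanup-site interval extends below zero, which is impossible for a concentration; that is a hint that the normality assumption deserves a closer look for such strongly skewed data.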

17
Accuracy
  • Precision
      • Repeatability of estimate
      • What would we get if we took a new sample and
        recalculated the estimate?
      • Quantified by standard error
      • Also called efficiency
  • Bias
      • Systematic over- or under-estimate
      • Can often be compensated for
      • E.g., dividing by n - 1 in calculation of sample
        variance is a bias correction
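The n - 1 correction can be verified exactly on a tiny example. This sketch assumes a made-up population of four values and enumerates every possible sample of size 2 drawn with replacement, then averages the two versions of the variance estimate over all samples.

```python
import itertools
import statistics

# Hypothetical population, invented for illustration only.
population = [1, 2, 3, 4]
true_variance = statistics.pvariance(population)

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(itertools.product(population, repeat=2))

# Average estimate when dividing by n (biased) and by n - 1 (corrected).
mean_div_n = statistics.mean([statistics.pvariance(s) for s in samples])
mean_div_n_minus_1 = statistics.mean([statistics.variance(s) for s in samples])

print(true_variance, mean_div_n, mean_div_n_minus_1)
```

Dividing by n underestimates the true variance on average (here by a factor of (n - 1)/n = 1/2), while dividing by n - 1 recovers it.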

18
CENTRAL LIMIT THEOREM
  • Normality requirement actually applies to
    parameter estimate, not to data
  • Sample mean is based on the sum of a bunch of
    random variables
  • Sum of normal random variables is itself normally
    distributed
  • Central Limit Theorem says that the sum of
    enough independent random variables from any
    probability distribution (with finite variance)
    is approximately normally distributed
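A simulation sketch of the theorem, assuming a strongly right-skewed exponential population (an arbitrary choice for illustration). It compares the skewness of individual observations with the skewness of means of 50 observations; the helper names and the seed are made up.

```python
import random

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def sample_means(n, repeats, rng):
    """Means of `repeats` independent samples of size n from Exp(1)."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(repeats)]

rng = random.Random(206)  # fixed seed so the run is repeatable
skew_n1 = skewness(sample_means(1, 5000, rng))    # raw data: strongly skewed
skew_n50 = skewness(sample_means(50, 5000, rng))  # means of 50: much closer to 0
print(skew_n1, skew_n50)
```

The means of larger samples are far more symmetric than the data they are built from, which is why the normality requirement is less restrictive than it first appears.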

19
Comparing data to a Normal distribution: the QQ plot
[Figure: QQ plot. Axes: quantiles of data vs. quantiles of
standard normal]
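The points of a QQ plot pair each sorted observation with the matching quantile of a standard normal. A minimal sketch, assuming plotting positions of (i + 0.5) / n (one common convention; software packages differ) and made-up data:

```python
from statistics import NormalDist

def qq_points(data):
    """Pair standard normal quantiles with the sorted data."""
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(data)))

# Hypothetical concentrations (ppb), invented for illustration only.
data = [0.62, 0.28, 0.54, 0.43, 1.33]
points = qq_points(data)
for normal_q, data_q in points:
    print(round(normal_q, 3), data_q)
```

If the data were normally distributed, these points would fall close to a straight line; curvature indicates skew or heavy tails.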
20
Transforming data
  • Why transform?
      • Make the data (or the residuals from the
        regression) more symmetric
      • Eliminate nonlinear relationships
      • Control other violations of regression
        assumptions
  • But do transformed data mean anything?
      • Often several natural scales
      • T-test for means of log-transformed data is
        equivalent to comparing medians of untransformed
        data
  • Controlling skew: power transformations
      • Powers < 1 reduce skew
          • Square root (1/2)
          • Log (0)
          • Inverse (-1)
      • Powers > 1 increase skew
          • Squared (2)
      • Also called Box-Cox transformation
  • Logit transformation often used on proportion
    data
      • We will learn a better way
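The effect of the power on skew can be checked numerically. This sketch assumes a made-up right-skewed sample and a plain power transform (x raised to the power, with the log standing in for power 0); the Box-Cox family proper uses a rescaled form, (x^p - 1) / p, which changes the units but not the shape. The helper names are illustrative.

```python
import math

def power_transform(x, power):
    """Plain power transform; power 0 is taken to mean the natural log."""
    if power == 0:
        return math.log(x)
    return x ** power

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample (each value doubles the previous one).
data = [1, 2, 4, 8, 16, 32, 64]

skew_by_power = {
    p: skewness([power_transform(x, p) for x in data])
    for p in (2, 1, 0.5, 0)
}
for p, s in skew_by_power.items():
    print(p, round(s, 3))
```

For this sample the skew falls as the power drops from 2 to 1 to 1/2, and the log transform makes the values evenly spaced, so the skew vanishes.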

21
Log-transformed TcCB data (reference)
22
Log-transformed TcCB (cleanup)
23
The problem and the data
  • We are trying to design a fuel-efficient car, and
    want to know what characteristics of current cars
    are associated with high fuel efficiency.
  • We have data on 60 models of cars, with
    measurements of:
      • Fuel efficiency (mpg)
      • Fuel consumption (gallons per 100 miles)
      • Weight (lbs)
      • Engine displacement (cubic inches)
      • Car class (small, sporty, compact, medium, large,
        and vans)

24
The Scatterplot Matrix
25
[Figure panels: Compact cars; Sporty cars]
26
Measures of association
  • Covariance
      • Magnitude depends on strength of association and
        on variance of each variable
      • Covariance of something with itself is the
        variance
  • Correlation
      • Measure of association, independent of
        variability in data
      • +1: perfect positive association
      • -1: perfect negative association
      • 0: no (linear) association
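Both measures are short to compute by hand. This sketch assumes the sample (n - 1) form of the covariance and uses four made-up cars (not the 60-model data set); the function names are illustrative. It also re-expresses weight in kilograms to show that the covariance depends on the units while the correlation does not.

```python
import math
import statistics

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance rescaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical cars, invented for illustration only.
weight_lb = [2000, 2500, 3000, 3500]
mpg = [35, 30, 24, 20]
weight_kg = [w * 0.4536 for w in weight_lb]

cov_lb = covariance(weight_lb, mpg)
cov_kg = covariance(weight_kg, mpg)   # changes with the units of weight
corr_lb = correlation(weight_lb, mpg)
corr_kg = correlation(weight_kg, mpg) # unchanged by the units
print(cov_lb, cov_kg, corr_lb, corr_kg)
```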

27
Covariance matrix for fuel data
28
Correlation matrix for fuel data
29
Fuel data conclusions
  • Horsepower, car weight, and engine displacement
    are positively associated
  • Weight has strongest negative association with
    mileage
  • Inverse of mileage (gallons per mile) might be
    better variable to look at (get rid of
    nonlinearity)
  • To get more fuel efficient cars, focus on making
    them lighter and giving them smaller engines
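The nonlinearity mentioned above comes from mileage being the reciprocal of fuel consumption. A minimal sketch, using a hypothetical helper and round numbers chosen for illustration:

```python
def gallons_per_100_miles(mpg):
    """Convert fuel efficiency (mpg) to fuel consumption."""
    return 100.0 / mpg

# Doubling mileage saves less fuel the higher the starting mileage.
saving_low = gallons_per_100_miles(10) - gallons_per_100_miles(20)
saving_high = gallons_per_100_miles(20) - gallons_per_100_miles(40)
print(saving_low, saving_high)
```

Equal steps in mpg therefore do not correspond to equal steps in fuel used, which is one reason consumption can relate more linearly to weight than mileage does.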

30
Some questions
  • Are mean TcCB levels in the cleanup site
    significantly higher than in the reference
    site? How confident are we that there is a
    difference?
  • Do car types differ in their fuel economy? How
    much of this is a function of their intrinsic
    design, vs. differences in size?

31
Further reading
  • Online reader, Visualizing and Summarizing Data
  • Online reader, Confidence Intervals