Title: Basic Statistics with Excel
1Basic Statistics with Excel
- ChE 408 Engineering Experimental Design
- Valerie L. Young
- Chemical Engineering, Ohio University
2Outline Excel as a Tool for . . .
- Summarizing data
- Fitting data (regression)
- Hypothesis testing
This is a tutorial. It consists of both
information and exercises to introduce Excel as a
tool for statistical analysis. Work through it at
your own pace. It assumes that you are familiar
with Excel and can do simple calculations and
make plots. If you cant, seek more help.
3Some Excel Tidbits
- Excel can display dates in many formats, but it
stores dates as numbers. - 1 January 1, 1900
- 37622 January 1, 2003
- A single quote forces a cell entry to be text
- To fix a cell, use in front of the column
letter and/or row number. Then the address wont
update when you copy the cell. - Excel can count the number of values in a range
- count(N2N35)
- countif(N2N35,-777)
- countif(N2N35,gt-777)
4Summarizing Data Histograms
- When your data set consists of a single variable
measured multiple times, a histogram shows how
the values are distributed.
About 26 of measurements are between 55 and 56.
Indicate total of values on graph or in caption.
This distribution is left-skewed.
This is ? 50 and lt 51. Next category is ? 51 and
lt 52. No overlap.
All categories the same size.
5Exercise Construct a Histogram
- Now that you have seen a histogram generated by
Excel, you should practice by making one
yourself. On the next slide, you will be
directed to load a data file and make a histogram
to display the data. Each bullet on the slide is
a hint for what you should do next to complete
the histogram. You should be able to figure out
how to make Excel do these things.
6Exercise Construct a Histogram
- Load calibration.xls from the ChE 408 homepage
- Count the total number of entries under cal
factor - count(b2b28)
- Set up a list of categories for the histogram
- 46-47
- Count the number of entries in each category
- countif(b2b28,lt47)-countif(b2b28,lt46)
- Calculate the fraction of entries in each
category - Insert a column chart on a new page, fraction vs.
category
7End of Histogram Exercise
- Continue to learn about more ways to summarize
data.
8More Ways to Summarize Data
- Plots
- Time series (use xy, not line, to see trends
over course of experiment) - y vs. x (use xy, not line, to investigate
correlations) - Column chart (to make histogram)
- Pie chart (to show proportions rarely used)
- Descriptive Statistics
- Measures of location
- Measures of variability
- Tools gt Data Analysis gt Descriptive Statistics
9Plots to Summarize Data
Dependent variable. Cal Factor depends on sample
size.
Look! No gray background.
Axes scaled so data covers graph.
Independent variable
Figure 2a. Example of a time-series plot, showing
that values from the first day appear to be lower
and more scattered than the rest.
Figure 2b. Example of a y vs. x plot, showing
that lower values of calibration factor are
obtained with smaller amounts of sample.
10Descriptive Statistics Location
- Mean (average)
- Strongly affected by unusual points
- average(b2b28)
- Median (50 of data higher, 50 of data lower)
- Seldom strongly affected by unusual points
- median(b2b28)
11Descriptive Statistics Variability
- Standard Deviation, stdev(b2b28)
- A measure of how widely the data is spread around
the mean - Strongly affected by unusual points
- Relative Standard Deviation
- stdev(b2b28)/average(b2b28)
- Usually given as percentage
- Format gt Cells gt Percent
- Variance
- (stdev(b2b28))2
12Descriptive Statistics Variability
- Interquartile range, IQR
- Width of the middle 50 of the data
- Seldom strongly affected by unusual points
- IQR q0.75 q0.25
- q0.75 75 of data is lower, percentile(b2b28,0
.75) - q0.25 25 of data is lower, percentile(b2b28,0
.25) - Range
- Values spanned by the data
- max(b2b28) min(b2b28)
- Strongly affected by unusual points
13Descriptive Statistics Variability
- Confidence Interval
- Mean /- Uncertainty covers the confidence
interval - 95 confidence limits means that if you
calculate a mean and confidence interval from a
set of replicate measurements (the sample), you
can be 95 sure that if you made an infinite
number of measurements (the population), their
mean would lie within the confidence interval
14Descriptive Statistics
The mode is the value that appears most often.
Each value appears just once in this data set.
The 95 confidence interval is 54.2 0.9.
Kurtosis and skewness describe the symmetry of
the distribution. We wont discuss these further
in this class.
This table is as-generated by Excel. Round to
appropriate sig figs before reporting these
numbers.
15Exercise Summarizing Data
- Now that you have seen graphical and statistical
methods for summarizing data, you should
practice. On the next slide, you will be
directed to load a data file and generate plots
and descriptive statistics. Each bullet on the
slide is a hint for what you should do next. You
should be able to figure out how to make Excel do
these things.
16Exercise Summarizing Data
- Load calibration.xls from the ChE 408 homepage
- Construct a time-series plot for the calibration
factor - Plot calibration factor vs. sample amount
- Use descriptive statistics to summarize the
location and variability of the calibration
factor - Calculate each statistic individually
- Use Tools gt Data Analysis gt Descriptive Statistics
17End of Summarizing Data Exercise
- Continue to learn about fitting equations to
data.
18Fitting Data
- For linear regression, use Tools gt Data Analysis
gt Regression - Dont use trendline no statistics
- X Variable ? slope
- R2 Coefficient of Determination
- A value near 1.0 means that the value of y
depends strongly on the value of x. Does NOT
mean the dependence is linear. - Use residual plots to show linearity
- Residuals should be random around zero
- Use p-values to show significance of linear fit
- p probability that points are arranged like
this by chance
19Linear Regression Example of Excel Output
Round to appropriate sig figs before reporting
these numbers.
99 of the variation in y is explained by
variation in x. The remainder may be random
error, or may be explained by some factor other
than x.
Ratio of variability explained by model to
leftover variability. High number means model
explains most variation in data.
Probability of getting that value of F by
randomly sampling from normally-distributed data.
y (0.580.02)x (0.0150.010)
Probability of getting a slope or intercept this
much different from zero by randomly sampling
from normally distributed data.
Confidence limits on slope and intercept.
20Residual Plot
This plot is as-generated by Excel. Fix the
formatting before including this plot in a report.
The random distribution of residuals around zero
suggests that the model accounts for all
predictable variation in y, and all that is left
is random uncertainty.
A residual for a given value of x is the
difference between the measured value of y, and
the value of y calculated using the regression
model.
21Exercise Linear Regression
- Now that you have seen the kind of information
Excel gives you when you fit a straight line to
data, you should practice. On the next slide, you
will be directed to load a data file, plot the
data, and fit a line to it. Each bullet on the
slide is a hint for what you should do next. You
should be able to figure out how to make Excel do
these things.
22Exercise Fitting Data
- Load Jul2627data.xls from the ChE 408 homepage
- Plot i-butane vs. n-butane. Does this plot
appear linear? - Perform linear regression on i-butane vs.
n-butane. - Are these values strongly correlated?
- Are these values linearly correlated?
- What is the linear equation relating these
values? - What uncertainty would you place on the values of
the slope and intercept for this linear
relationship?
23End of Summarizing Data Exercise
- Continue to learn about fitting equations to
data.
24Hypothesis Testing Can Answer Questions Like . .
.
- Is this value significantly different from the
one I expected? - Is the variability of this data significantly
different from what I expected? - Are these two sets of data significantly
different from each other? - When you say significant, back it up with
statistics.
25Hypothesis Testing
- H0 null hypothesis
- There is no significant difference
- H1 alternative hypothesis
- There is a significant difference (two-sided)
- This is significantly higher / lower (one-sided)
- You cannot prove the null hypothesis. You can at
best say the data offer no significant evidence
against it.
26Testing Locationp-value (Significance Level)
- Significance level of 5 (p lt 0.05) means at
most a 5 chance that this difference is due to
random uncertainty and not real. - Excel calculates this as part of a t-test
- p-value gt significance level accept H0
- Cannot prove that observations and expectations
differ
27Testing Location t-test
- t ? (difference between samples) / (variability)
- Excel will automatically calculate t-values to
compare - Means of two datasets with equal variances
- Means of two datasets with unequal variances
- Two sets of paired data
- abs(t-score) lt abs(t-critical) accept H0
- Cannot prove that observations and expectations
differ
28t-Test Independent Samples
Round to appropriate sig figs before reporting
these numbers.
Change this if you want to know whether the means
of the two samples differ by at least some
specified amount.
Probability of drawing two random samples from a
normally distributed population and getting the
mean of sample 1 this much larger than the mean
of sample 2. The mean of sample 1 is larger at
a significance level of ?0.02 (or at the 2
significance level), because p lt 0.02.
t gt tcritical(one-tail), so the mean of sample 1
is significantly larger than the mean of sample
2.
t gt tcritical(two-tail), so the mean of sample 1
is significantly different from the mean of
sample 2.
P lt 0.03, so the means of the two samples are
different at the 3 significance level.
29Exercise Hypothesis Testing
- Now that you have seen the results of some
t-tests and p-tests in Excel, you should
practice. On the next slide, you will be directed
to load a data file and make decisions about the
data. Each bullet on the slide is a hint for
what you should do next. You should be able to
figure out how to make Excel do these things.
30Exercise - Hypothesis Testing
- Load calibration.xls from the ChE 408 homepage
- Calculate the mean, std dev, and variance for
only the first day. - Calculate the mean, std dev, and variance
excluding the first day. - Use a t-test and a p-test to learn whether the
calibration factor measured the first day differs
significantly from the rest - Tools gt Data analysis gt t-Test Two sample . . .
- Can use equal variances if ratio of variances lt
3.
31End of Hypothesis Testing Exercise
- This concludes the Excel tutorial.