Basic Statistics with Excel - PowerPoint PPT Presentation

1 / 31

About This Presentation

Title:

Basic Statistics with Excel

Description:

It consists of both information and exercises to introduce Excel as a tool for ... To 'fix' a cell, use $ in front of the column letter and/or row number. ... – PowerPoint PPT presentation

Number of Views:243

Avg rating:3.0/5.0

Slides: 32

Provided by: webcheE

Category:

more less

Transcript and Presenter's Notes

Title: Basic Statistics with Excel

1
Basic Statistics with Excel

ChE 408 Engineering Experimental Design
Valerie L. Young
Chemical Engineering, Ohio University

2
Outline Excel as a Tool for . . .

Summarizing data
Fitting data (regression)
Hypothesis testing

This is a tutorial. It consists of both
information and exercises to introduce Excel as a
tool for statistical analysis. Work through it at
your own pace. It assumes that you are familiar
with Excel and can do simple calculations and
make plots. If you cant, seek more help.
3
Some Excel Tidbits

Excel can display dates in many formats, but it
stores dates as numbers.
1 January 1, 1900
37622 January 1, 2003
A single quote forces a cell entry to be text
To fix a cell, use in front of the column
letter and/or row number. Then the address wont
update when you copy the cell.
Excel can count the number of values in a range
count(N2N35)
countif(N2N35,-777)
countif(N2N35,gt-777)

4
Summarizing Data Histograms

When your data set consists of a single variable
measured multiple times, a histogram shows how
the values are distributed.

About 26 of measurements are between 55 and 56.
Indicate total of values on graph or in caption.
This distribution is left-skewed.
This is ? 50 and lt 51. Next category is ? 51 and
lt 52. No overlap.
All categories the same size.
5
Exercise Construct a Histogram

Now that you have seen a histogram generated by
Excel, you should practice by making one
yourself. On the next slide, you will be
directed to load a data file and make a histogram
to display the data. Each bullet on the slide is
a hint for what you should do next to complete
the histogram. You should be able to figure out
how to make Excel do these things.

6
Exercise Construct a Histogram

Load calibration.xls from the ChE 408 homepage
Count the total number of entries under cal
factor
count(b2b28)
Set up a list of categories for the histogram
46-47
Count the number of entries in each category
countif(b2b28,lt47)-countif(b2b28,lt46)
Calculate the fraction of entries in each
category
Insert a column chart on a new page, fraction vs.
category

7
End of Histogram Exercise

Continue to learn about more ways to summarize
data.

8
More Ways to Summarize Data

Plots
Time series (use xy, not line, to see trends
over course of experiment)
y vs. x (use xy, not line, to investigate
correlations)
Column chart (to make histogram)
Pie chart (to show proportions rarely used)
Descriptive Statistics
Measures of location
Measures of variability
Tools gt Data Analysis gt Descriptive Statistics

9
Plots to Summarize Data
Dependent variable. Cal Factor depends on sample
size.
Look! No gray background.
Axes scaled so data covers graph.
Independent variable
Figure 2a. Example of a time-series plot, showing
that values from the first day appear to be lower
and more scattered than the rest.
Figure 2b. Example of a y vs. x plot, showing
that lower values of calibration factor are
obtained with smaller amounts of sample.
10
Descriptive Statistics Location

Mean (average)
Strongly affected by unusual points
average(b2b28)
Median (50 of data higher, 50 of data lower)
Seldom strongly affected by unusual points
median(b2b28)

11
Descriptive Statistics Variability

Standard Deviation, stdev(b2b28)
A measure of how widely the data is spread around
the mean
Strongly affected by unusual points
Relative Standard Deviation
stdev(b2b28)/average(b2b28)
Usually given as percentage
Format gt Cells gt Percent
Variance
(stdev(b2b28))2

12
Descriptive Statistics Variability

Interquartile range, IQR
Width of the middle 50 of the data
Seldom strongly affected by unusual points
IQR q0.75 q0.25
q0.75 75 of data is lower, percentile(b2b28,0
.75)
q0.25 25 of data is lower, percentile(b2b28,0
.25)
Range
Values spanned by the data
max(b2b28) min(b2b28)
Strongly affected by unusual points

13
Descriptive Statistics Variability

Confidence Interval
Mean /- Uncertainty covers the confidence
interval
95 confidence limits means that if you
calculate a mean and confidence interval from a
set of replicate measurements (the sample), you
can be 95 sure that if you made an infinite
number of measurements (the population), their
mean would lie within the confidence interval

14
Descriptive Statistics
The mode is the value that appears most often.
Each value appears just once in this data set.
The 95 confidence interval is 54.2 0.9.
Kurtosis and skewness describe the symmetry of
the distribution. We wont discuss these further
in this class.
This table is as-generated by Excel. Round to
appropriate sig figs before reporting these
numbers.
15
Exercise Summarizing Data

Now that you have seen graphical and statistical
methods for summarizing data, you should
practice. On the next slide, you will be
directed to load a data file and generate plots
and descriptive statistics. Each bullet on the
slide is a hint for what you should do next. You
should be able to figure out how to make Excel do
these things.

16
Exercise Summarizing Data

Load calibration.xls from the ChE 408 homepage
Construct a time-series plot for the calibration
factor
Plot calibration factor vs. sample amount
Use descriptive statistics to summarize the
location and variability of the calibration
factor
Calculate each statistic individually
Use Tools gt Data Analysis gt Descriptive Statistics

17
End of Summarizing Data Exercise

Continue to learn about fitting equations to
data.

18
Fitting Data

For linear regression, use Tools gt Data Analysis
gt Regression
Dont use trendline no statistics
X Variable ? slope
R2 Coefficient of Determination
A value near 1.0 means that the value of y
depends strongly on the value of x. Does NOT
mean the dependence is linear.
Use residual plots to show linearity
Residuals should be random around zero
Use p-values to show significance of linear fit
p probability that points are arranged like
this by chance

19
Linear Regression Example of Excel Output
Round to appropriate sig figs before reporting
these numbers.
99 of the variation in y is explained by
variation in x. The remainder may be random
error, or may be explained by some factor other
than x.
Ratio of variability explained by model to
leftover variability. High number means model
explains most variation in data.
Probability of getting that value of F by
randomly sampling from normally-distributed data.
y (0.580.02)x (0.0150.010)
Probability of getting a slope or intercept this
much different from zero by randomly sampling
from normally distributed data.
Confidence limits on slope and intercept.
20
Residual Plot
This plot is as-generated by Excel. Fix the
formatting before including this plot in a report.
The random distribution of residuals around zero
suggests that the model accounts for all
predictable variation in y, and all that is left
is random uncertainty.
A residual for a given value of x is the
difference between the measured value of y, and
the value of y calculated using the regression
model.
21
Exercise Linear Regression

Now that you have seen the kind of information
Excel gives you when you fit a straight line to
data, you should practice. On the next slide, you
will be directed to load a data file, plot the
data, and fit a line to it. Each bullet on the
slide is a hint for what you should do next. You
should be able to figure out how to make Excel do
these things.

22
Exercise Fitting Data

Load Jul2627data.xls from the ChE 408 homepage
Plot i-butane vs. n-butane. Does this plot
appear linear?
Perform linear regression on i-butane vs.
n-butane.
Are these values strongly correlated?
Are these values linearly correlated?
What is the linear equation relating these
values?
What uncertainty would you place on the values of
the slope and intercept for this linear
relationship?

23
End of Summarizing Data Exercise

Continue to learn about fitting equations to
data.

24
Hypothesis Testing Can Answer Questions Like . .
.

Is this value significantly different from the
one I expected?
Is the variability of this data significantly
different from what I expected?
Are these two sets of data significantly
different from each other?
When you say significant, back it up with
statistics.

25
Hypothesis Testing

H0 null hypothesis
There is no significant difference
H1 alternative hypothesis
There is a significant difference (two-sided)
This is significantly higher / lower (one-sided)
You cannot prove the null hypothesis. You can at
best say the data offer no significant evidence
against it.

26
Testing Locationp-value (Significance Level)

Significance level of 5 (p lt 0.05) means at
most a 5 chance that this difference is due to
random uncertainty and not real.
Excel calculates this as part of a t-test
p-value gt significance level accept H0
Cannot prove that observations and expectations
differ

27
Testing Location t-test

t ? (difference between samples) / (variability)
Excel will automatically calculate t-values to
compare
Means of two datasets with equal variances
Means of two datasets with unequal variances
Two sets of paired data
abs(t-score) lt abs(t-critical) accept H0
Cannot prove that observations and expectations
differ

28
t-Test Independent Samples
Round to appropriate sig figs before reporting
these numbers.
Change this if you want to know whether the means
of the two samples differ by at least some
specified amount.
Probability of drawing two random samples from a
normally distributed population and getting the
mean of sample 1 this much larger than the mean
of sample 2. The mean of sample 1 is larger at
a significance level of ?0.02 (or at the 2
significance level), because p lt 0.02.
t gt tcritical(one-tail), so the mean of sample 1
is significantly larger than the mean of sample
2.
t gt tcritical(two-tail), so the mean of sample 1
is significantly different from the mean of
sample 2.
P lt 0.03, so the means of the two samples are
different at the 3 significance level.
29
Exercise Hypothesis Testing

Now that you have seen the results of some
t-tests and p-tests in Excel, you should
practice. On the next slide, you will be directed
to load a data file and make decisions about the
data. Each bullet on the slide is a hint for
what you should do next. You should be able to
figure out how to make Excel do these things.

30
Exercise - Hypothesis Testing

Load calibration.xls from the ChE 408 homepage
Calculate the mean, std dev, and variance for
only the first day.
Calculate the mean, std dev, and variance
excluding the first day.
Use a t-test and a p-test to learn whether the
calibration factor measured the first day differs
significantly from the rest
Tools gt Data analysis gt t-Test Two sample . . .
Can use equal variances if ratio of variances lt
3.

31
End of Hypothesis Testing Exercise