Title: Looking at Data -- Distributions
1Looking at Data -- Distributions
2Worker Salary Data Set
- Individuals are described by variables
- Data Types
- Categorical: groupings (M/F, etc.)
- Quantitative: numerical values (we can do arithmetic)
- Distributions
- Values taken by a variable and how often it takes them
5Graphing categorical variables
- Excel is good at this.
- Example: Firestone tires in 2000. 148 fatalities due to defective tires, 2969 reported accidents (from accident reports).
6Bar Chart and Pareto Chart (sorted bar chart)
7Graphing Quantitative Variables
- Stemplots and Histograms
- Graphical displays of the distribution of a quantitative variable.
- salaries,
- grades,
- stock prices,
- et cetera.
9Histograms
- Decide on the width of each class (called bins in Excel)
- Count number in each class
- Plot a bar of the appropriate height
10More bars on histogram
- Could also use a relative frequency histogram.
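The histogram steps above (pick a class width, count, draw bars) can be sketched in a few lines of Python. This is only an illustration: the grades, the starting value of 50 and the class width of 10 are hypothetical choices, not data from the course.

```python
def class_counts(values, start, width, n_classes):
    """Count how many values fall in each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical grades, used only to illustrate the steps.
grades = [52, 58, 61, 67, 68, 71, 73, 75, 78, 84, 88, 95]

counts = class_counts(grades, start=50, width=10, n_classes=5)
# Relative frequency = class count divided by the total number of observations.
rel_freq = [c / len(grades) for c in counts]

for k, (c, r) in enumerate(zip(counts, rel_freq)):
    low = 50 + 10 * k
    print(f"[{low}, {low + 10}): {'#' * c}  count={c}  rel. freq.={r:.2f}")
```

A relative frequency histogram has the same shape as the count histogram; only the vertical scale changes.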
11Stem Plots: quick and easy manual plots (note: printed handout is in the wrong order)
We arbitrarily decided not to include the outlier in these stem plots.
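A stem plot is easy to do by hand, but a minimal Python sketch shows the idea. It assumes non-negative whole numbers, with the tens as the stem and the units digit as the leaf; the function name and the grades are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stem plot lines: stem = tens, leaf = units digit (non-negative integers)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical grades, used only to illustrate the layout.
for line in stemplot([52, 58, 61, 67, 68, 71, 73, 75, 78, 84, 88, 95]):
    print(line)
```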
12Examining Distributions
- Look at overall pattern
- Shape
- Center
- About half values above, half below
- Spread
- Look for values outside the overall pattern
- Called outliers.
- Look for causes and decide what to do about the
outliers.
13Distribution of salaries for 30 major league
teams on opening day 2000
14Cincinnati Reds 2000 Salaries (relative
frequencies)
15Monthly returns on all US common stocks Jan 1951
to Dec 2000
16Terminology
17Time plots
- Plots observations through time
- E.g. NASDAQ
- Seasonal sales data
18Describing distributions with numbers
- Central values
- Measures of Spread
- Measures of Shape
19Central values: Mean and median
21Mean vs. Median
- Mean is more affected by outliers.
- Cincinnati Reds Salaries
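To see how much more the mean reacts to an outlier than the median, here is a small Python sketch. The salary figures are hypothetical (they are not the Cincinnati Reds data); one very large value is added to an otherwise tight group.

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars, for illustration only.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]  # add one very large salary

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"without outlier: mean={mean_before:.2f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.2f}, median={median_after}")
```

In this sketch the mean moves from 35 to almost 96, while the median only moves from 35 to 36.5.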
22Measures of spread
- Main measures for us
- Range,
- Interquartile Range
- Standard Deviation
- Range = largest - smallest
23Measuring Spread -- Percentiles
- pth percentile: p percent of the distribution falls below it.
- Quartiles
- First quartile = 25th percentile (Q1)
- Median = 50th percentile (M)
- Third quartile = 75th percentile (Q3)
- Interquartile range: IQR = Q3 - Q1
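The quartiles and IQR can be computed with a short Python sketch. It assumes one common hand rule: Q1 is the median of the observations below the position of the overall median, and Q3 is the median of those above it. Software, including Excel's QUARTILE, may use a different convention and give slightly different values. The data set 1, 1, 3, 4, 6 is the small example used later in these slides.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n the middle observation is left out of both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

smallest, q1, m, q3, largest = five_number_summary([1, 1, 3, 4, 6])
iqr = q3 - q1
print(smallest, q1, m, q3, largest, iqr)
```

The five values returned are also exactly what a basic boxplot draws.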
24Five-Number Summary and Basic Boxplot
25Boxplot of Workers Salaries
26Fancier Boxplots Percent Hispanic in U.S. States
27Software Issues
- Boxplots very useful but not built into Excel
- Excel Boxplot Macro is available on textbook
website.
28Variance and Standard Deviation
- Measure variability around the mean
29A simple example
- Data set
- 1, 1, 3, 4, 6.
- Use the formulas to calculate mean and standard
deviation for this data set.
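A worked version of this example in Python, as a sketch. The transcript does not show the formulas, so this assumes the usual sample versions: the mean is the sum divided by n, and the variance divides the sum of squared deviations by n - 1 (the same convention as Excel's STDEV).

```python
from math import sqrt

data = [1, 1, 3, 4, 6]
n = len(data)

x_bar = sum(data) / n                            # mean: 15 / 5
squared_devs = [(x - x_bar) ** 2 for x in data]  # 4, 4, 0, 1, 9
variance = sum(squared_devs) / (n - 1)           # assumed divisor n - 1: 18 / 4
s = sqrt(variance)                               # standard deviation

print(f"mean = {x_bar}, variance = {variance}, s = {s:.4f}")
```

This gives a mean of 3, a variance of 4.5 and a standard deviation of about 2.12.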
30Comments on Std. Dev.
- As a descriptive measure
- Best suited for symmetric distributions.
- Use only when the mean is chosen as the measure of center.
- s = 0 if and only if there is no spread.
- Not resistant to outliers.
- Other usage
- Has very nice mathematical properties that will be very useful throughout the course.
31Five-number summary vs. mean and std. dev.
- Five-number summary is usually better if the distribution is skewed or has strong outliers.
- Mean and std. dev. are suitable if the distribution is reasonably symmetric and free of outliers.
- Rules aren't cast in stone -- use your judgment.
32Summary: Describing Distributions
- Plot your data (histogram or stem plot).
- Look for patterns and outliers. Can they be explained?
- Calculate appropriate numerical summary measures (five-number summary or mean and std. dev.) to give a brief description of center and spread.
33Using Excel to Compute Measures
Average: AVERAGE(data)
Standard Deviation: STDEV(data)
Median: MEDIAN(data)
1st Quartile: QUARTILE(data, 1)
34th Percentile: PERCENTILE(data, 0.34)
Range: MAX(data) - MIN(data)
Sum: SUM(data)
- Money Spent in Store (Example 1.5)
34Charting in Excel
- Excel Manual gives instructions for Charts and Histograms.
- Be sure that you have installed the Analysis ToolPak.
35An extension of our basic approach for analyzing data sets
- Sometimes the pattern of a large number of observations is so strong that we can approximate it with a density curve.
- A density curve is a mathematical model of the distribution.
36City gas mileage (miles per gallon) of 856 model year 2001 vehicles. Note: bars are for a relative frequency histogram.
37Proportion with 20 miles per gallon or less
38Density Curve
39Mean and Median of Density Curves
- Symmetric curve: mean = median
- Skewed to the right: mean > median
- Skewed to the left: ???
40Normal curves
- The most important class of density curves.
- Symmetric, unimodal and bell shaped.
- Are completely described by the mean, µ, and the std. deviation, σ.
41The 68, 95, 99.7 rule (very important)
42The 68, 95, 99.7 rule: µ = 0, σ = 1 (called the Standard Normal Dist.)
43The 68, 95, 99.7 rule: µ = ?, σ = ?
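The three percentages in the rule can be checked numerically. The sketch below uses Python's standard-library NormalDist for the standard normal curve; the helper name within is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} std. dev.: {within(k):.4f}")
```

The same proportions hold for any normal curve, because they depend only on the number of standard deviations from the mean.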
44Generalization of 68, 95, 99.7 Rule
- Can consider any number of standard deviations from the mean, not just 1, 2, or 3.
- Need to measure how many standard deviations we are from the mean.
- Let x be an observation from a normal distribution with mean µ and std. dev. σ.
- Standardized value of x is
- z = (x - µ) / σ
- z tells us how many standard deviations x is above or below the mean.
45Standard Normal Distribution and Key Result for
Normal Distributions
46Normal Tables
- Table A (front cover of text) gives the proportion of observations that fall to the left of z standard deviations from the mean.
47Normal Tables (positive z)
48Normal Tables (negative z)
49Normal Table Example (cont.)
- What about the proportion of observations between z = -2.15 and z = 1.2?
- What proportion lies outside the above range?
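A table lookup can be cross-checked in Python. The sketch below uses the standard-library NormalDist in place of Table A; the helper name left_area is an illustrative choice, and the endpoints are read as z = -2.15 and z = 1.2.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
Z = NormalDist()

def left_area(z):
    """Proportion of observations to the left of z, which is what Table A tabulates."""
    return Z.cdf(z)

# Area between two z values = difference of the two left-hand areas.
between = left_area(1.2) - left_area(-2.15)
outside = 1 - between

print(f"between z = -2.15 and z = 1.2: {between:.4f}")
print(f"outside that range:            {outside:.4f}")
```

With the table the calculation is the same: look up both left-hand areas and subtract.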
50Standardizing example
- Example: Fills from a vending machine have a mean of 250 ml with a standard deviation of 15 ml.
- Consider fills of 240 ml and 275 ml.
- Get the corresponding z values. Interpret.
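The standardizing step for this example, z = (x - mean) / std. dev., can be sketched in Python. The function name standardize is an illustrative choice.

```python
def standardize(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

MU, SIGMA = 250, 15  # vending machine fills, in ml

z_240 = standardize(240, MU, SIGMA)
z_275 = standardize(275, MU, SIGMA)

print(f"240 ml -> z = {z_240:.2f}")
print(f"275 ml -> z = {z_275:.2f}")
```

A fill of 240 ml is about two thirds of a standard deviation below the mean; a fill of 275 ml is about one and two thirds standard deviations above it.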
51Calculations for general normal distributions
- Example: Fills from a vending machine have a mean of 250 ml with a standard deviation of 15 ml.
- What proportion of fills are
- Below 240 ml?
- Between 240 and 275 ml?
- Above 330 ml?
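These three proportions can be checked with a Python sketch that works directly with the normal distribution of fills (mean 250 ml, standard deviation 15 ml), using the standard-library NormalDist instead of standardizing and using the table.

```python
from statistics import NormalDist

# Fill volumes in ml, assumed normal as in the example.
fills = NormalDist(mu=250, sigma=15)

below_240 = fills.cdf(240)                        # area to the left of 240
between_240_275 = fills.cdf(275) - fills.cdf(240) # difference of two left areas
above_330 = 1 - fills.cdf(330)                    # area to the right of 330

print(f"below 240 ml:           {below_240:.4f}")
print(f"between 240 and 275 ml: {between_240_275:.4f}")
print(f"above 330 ml:           {above_330:.2e}")
```

A fill of 330 ml is more than five standard deviations above the mean, so the last proportion is essentially zero.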
52Working backwards
- 4% of values for the standard normal distribution are bigger than what value?
- The most central 90% of the standard normal distribution lies between what values?
53Working backwards (cont.)
- Vending machine example: mean 250, std. deviation 15.
- 6% of fills are above how many ml?
- ______________________
- (Note: Excel and some calculators can compute normal probabilities, as covered in a later slide, but you should also know how to use the tables for exams.)
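Working backwards means going from a proportion to a value, which is the inverse of a table lookup. The sketch below uses inv_cdf from Python's standard-library NormalDist and reads the "4", "90" and "6" in the questions above as percentages; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# 4% of values are bigger than the value with 96% of the area to its left.
top_4_cutoff = Z.inv_cdf(1 - 0.04)

# The central 90% leaves 5% in each tail.
central_90_upper = Z.inv_cdf(0.95)
central_90_lower = -central_90_upper  # by symmetry

# Vending machine: 6% of fills are above the 94th percentile.
fills = NormalDist(mu=250, sigma=15)
top_6_fill = fills.inv_cdf(1 - 0.06)

print(f"top 4% cutoff:  z = {top_4_cutoff:.2f}")
print(f"central 90%:    z from {central_90_lower:.3f} to {central_90_upper:.3f}")
print(f"top 6% of fills start at {top_6_fill:.1f} ml")
```

With the table, the same answers come from finding the area in the body of the table and reading off z, then converting with x = mean + z × std. dev.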
54Ex 1.98
- The yearly rate of return on stock indices is approximately normal. Between 1950 and 2000, U.S. common stocks had a mean yearly return of about 13% with a standard deviation of about 17%.
- In what range do the middle 95% of all yearly returns lie?
- In what percent of the years is the market down for the year (return < 0)?
- In what percent of the years does the index gain 25% or more?
- In 25% of the years the gain is at least how much?
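One way to check answers to this exercise is the Python sketch below. It reads the figures as percentages (mean 13, standard deviation 17), uses the 68, 95, 99.7 rule for the middle 95%, and uses the standard-library NormalDist for the rest.

```python
from statistics import NormalDist

MU, SIGMA = 13, 17  # yearly return in percent
returns = NormalDist(mu=MU, sigma=SIGMA)

# Middle 95%: within two standard deviations of the mean (68, 95, 99.7 rule).
low, high = MU - 2 * SIGMA, MU + 2 * SIGMA

p_down = returns.cdf(0)             # proportion of years with return < 0
p_gain_25 = 1 - returns.cdf(25)     # proportion of years with a gain of 25% or more
top_quarter = returns.inv_cdf(0.75) # gain exceeded in 25% of the years

print(f"middle 95% of returns: {low}% to {high}%")
print(f"market down:           {p_down:.1%} of years")
print(f"gain of 25% or more:   {p_gain_25:.1%} of years")
print(f"top 25% of years gain at least {top_quarter:.1f}%")
```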
55SAT Scores
- In 2000, male scores on the Math SAT were normal with a mean of 533 and a standard deviation of 115.
- What fraction scored 750 or better?
- What is the 99th percentile of male SAT scores?
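The SAT questions combine a forward and a backward calculation. A Python sketch using the standard-library NormalDist, with the mean of 533 and standard deviation of 115 given above:

```python
from statistics import NormalDist

sat_math = NormalDist(mu=533, sigma=115)

# Forward: fraction scoring 750 or better = area to the right of 750.
frac_750_plus = 1 - sat_math.cdf(750)

# Backward: the score with 99% of the distribution below it.
pct_99 = sat_math.inv_cdf(0.99)

print(f"fraction scoring 750 or better: {frac_750_plus:.4f}")
print(f"99th percentile: {pct_99:.0f}")
```

The normal model is only an approximation here: it puts the 99th percentile right at the top of the 800-point scale.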
56Using Excel for Normal Distributions
- See Excel manual for full details.
- Area under a normal curve to the left of x
- NORMDIST(x, µ, σ, 1)
- Find the value such that a proportion p of the observations lie to the left.
- NORMINV(p, µ, σ)
57How to tell if a distribution is normal?
- Look at histogram, symmetry, etc.
- A normal quantile plot is a more sensitive tool.
- A macro for this is available on the textbook
website.
58Basic idea of normal quantile plot
- If the distribution being considered is normal, its percentiles and those of the standard normal distribution will be linearly related.
- Get the percentile for each observation x.
- For each percentile so obtained, get the corresponding z value from the standard normal distribution.
- Plot x values against z values.
- Macro does all this with some extra technical details related to sampling.
- If the data follow a normal distribution (more or less), the plot will be approximately a straight line.
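The steps above can be sketched in Python. This is not the textbook macro: the percentile assigned to the i-th smallest of n observations is assumed here to be (i - 0.5) / n, one common convention that keeps every percentile strictly between 0 and 1. The function name is illustrative, and the data are the small example set from an earlier slide.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Return (z, x) pairs to plot: sorted observations against standard normal z values."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    points = []
    for i, x in enumerate(xs, start=1):
        percentile = (i - 0.5) / n           # assumed plotting position
        z = std_normal.inv_cdf(percentile)   # z value with that area to its left
        points.append((z, x))
    return points

points = normal_quantile_points([1, 1, 3, 4, 6])
for z, x in points:
    print(f"z = {z:6.3f}   x = {x}")
```

Plotting x against z and looking for a roughly straight line is then the visual check described above.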
59Quantile plot for unemployment data (excluding
Puerto Rico)
60Cincinnati Reds Salary Data