Looking at Data -- Distributions - PowerPoint PPT Presentation

Slides: 61
Provided by: bus45
Transcript and Presenter's Notes


1
Looking at Data -- Distributions
  • Tools for Exploring Data

2
Worker Salary Data Set
  • Individuals are described by variables
  • Data Types
  • Categorical: groupings (M/F etc.)
  • Quantitative: values we can do arithmetic on
  • Distributions
  • Values taken by a variable and how often it takes
    them

3
(No Transcript)
4
(No Transcript)
5
Graphing categorical variables
  • Excel is good at this.
  • Example: Firestone tires in the year 2000. 148
    fatalities due to defective tires, 2969 reported
    accidents (from accident reports).

6
Bar Chart and Pareto Chart (sorted bar chart)
7
Graphing Quantitative Variables
  • Stemplots and Histograms
  • graphical displays of the distribution of a
    quantitative variable.
  • salaries,
  • grades,
  • stock prices,
  • et cetera.

8
(No Transcript)
9
Histograms
  • Decide on width of each class
  • bins in Excel
  • Count number in each class
  • Plot a bar of the appropriate height
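
The three steps above can be sketched in Python. This is only an illustration: the salary values and the class width of 10 are hypothetical and do not come from the slides, and the bars are drawn as text rather than as an Excel chart.

```python
# Hypothetical salaries in thousands of dollars (not from the slides).
salaries = [22, 25, 31, 34, 35, 38, 41, 47, 52, 58]
class_width = 10  # Step 1: decide on the width of each class ("bin")


def class_counts(values, width):
    """Step 2: count the values in each class [k*width, (k+1)*width)."""
    counts = {}
    for v in values:
        lower = (v // width) * width  # lower edge of the class containing v
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


counts = class_counts(salaries, class_width)

# Step 3: plot a bar of the appropriate height (one '#' per observation).
for lower, n in counts.items():
    rel = n / len(salaries)  # relative frequency of this class
    print(f"{lower:>3}-{lower + class_width - 1:<3} {'#' * n}  ({rel:.0%})")
```

Printing the relative frequency next to each bar gives the same information a relative frequency histogram would show.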

10
More bars on histogram
  • Could also use a relative frequency histogram.

11
Stem Plots: quick and easy manual plots (note --
printed handout in wrong order)
We arbitrarily decided not to include the outlier
in these stem plots.
12
Examining Distributions
  • Look at overall pattern
  • Shape
  • Center
  • About half values above, half below
  • Spread
  • Look for values outside the overall pattern
  • Called outliers.
  • Look for causes and decide what to do about the
    outliers.

13
Distribution of salaries for 30 major league
teams on opening day 2000
14
Cincinnati Reds 2000 Salaries (relative
frequencies)
15
Monthly returns on all US common stocks Jan 1951
to Dec 2000
16
Terminology
17
Time plots
  • Plots observations through time
  • E.g. NASDAQ
  • Seasonal sales data

18
Describing distributions with numbers
  • Central values
  • Measures of Spread
  • Measures of Shape

19
Central values: Mean and median
20
(No Transcript)
21
Mean vs. Median
  • Mean is more affected by outliers.
  • Cincinnati Reds Salaries
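
A small sketch of why the mean is more affected by outliers. The salary figures are hypothetical (the actual Reds data are not in this transcript); the last value stands in for one very highly paid player.

```python
from statistics import mean, median

# Hypothetical salaries in $ millions (assumed values, not the Reds data).
salaries = [0.2, 0.3, 0.4, 0.5, 0.6]
with_star = salaries + [12.0]  # add one extreme salary

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_star), median(with_star)

print(f"without outlier: mean={mean_before:.2f}, median={median_before:.2f}")
print(f"with outlier:    mean={mean_after:.2f}, median={median_after:.2f}")
```

With these made-up numbers the mean moves from 0.4 to about 2.33, while the median only moves from 0.4 to 0.45.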

22
Measures of spread
  • Main measures for us
  • Range,
  • Interquartile Range
  • Standard Deviation
  • Range = largest - smallest

23
Measuring Spread -- Percentiles
  • pth percentile -- p percent of distribution
    falls below it.
  • Quartiles
  • First quartile = 25th percentile (Q1)
  • Median = 50th percentile (M)
  • Third quartile = 75th percentile (Q3)
  • Interquartile range: IQR = Q3 - Q1
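
A minimal sketch of computing the quartiles and the IQR. It assumes one common textbook convention (Q1 and Q3 are the medians of the observations below and above the position of the overall median); software such as Excel's QUARTILE interpolates and may give slightly different values. The function name and the data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using the median-of-halves convention."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]          # observations below the median's position
    upper = xs[half + n % 2:]  # observations above the median's position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


data = [1, 1, 3, 4, 6, 7, 9, 10]  # hypothetical data set
minimum, q1, m, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # interquartile range, Q3 minus Q1

print(minimum, q1, m, q3, maximum, iqr)
```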

24
5 Number Summary and Basic Boxplot
25
Boxplot of Workers Salaries
26
Fancier Boxplots Percent Hispanic in U.S. States
27
  • Software Issues
  • Boxplots very useful but not built into Excel
  • Excel Boxplot Macro is available on textbook
    website.

28
Variance and Standard Deviation
  • Measure variability around the mean

29
A simple example
  • Data set
  • 1, 1, 3, 4, 6.
  • Use the formulas to calculate mean and standard
    deviation for this data set.
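
One way to check the hand calculation. The formulas themselves are not in this transcript, so this sketch assumes the sample standard deviation with divisor n - 1 (the same convention Excel's STDEV uses).

```python
from math import sqrt

data = [1, 1, 3, 4, 6]
n = len(data)

mean = sum(data) / n                            # 15 / 5 = 3.0
squared_devs = [(x - mean) ** 2 for x in data]  # 4, 4, 0, 1, 9
variance = sum(squared_devs) / (n - 1)          # 18 / 4 = 4.5
std_dev = sqrt(variance)                        # about 2.12

print(mean, variance, round(std_dev, 2))
```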

30
Comments on Std. Dev.
  • As a descriptive measure
  • Best suited for symmetric distributions.
  • Use only when mean is chosen as measure of
    center.
  • s = 0 if and only if no spread
  • Not resistant to outliers
  • Other usage
  • Has very nice math properties that will be very
    useful throughout the course.

31
5 number summary vs. mean and std. dev.
  • 5 number summary usually better if distribution
    skewed or with strong outliers.
  • Mean, std.dev. suitable if distribution is
    reasonably symmetric and free of outliers.
  • Rules aren't cast in stone -- use your
    judgment.

32
Summary: Describing Distributions
  • Plot your data (histogram or stem plot).
  • Look for patterns and outliers. Can they be
    explained?
  • Calculate appropriate numerical summary measures
    (5 number summary or mean, std. dev.) to give
    brief description of centre and spread.

33
Using Excel to Compute Measures
Average: =AVERAGE(data)
Standard Deviation: =STDEV(data)
Median: =MEDIAN(data)
1st Quartile: =QUARTILE(data, 1)
34th Percentile: =PERCENTILE(data, 0.34)
Range: =MAX(data) - MIN(data)
Sum: =SUM(data)
  • Money Spent in Store (Example 1.5)

34
Charting in Excel
  • Excel Manual gives instructions for Charts and
    Histograms.
  • Be sure that you have installed the Analysis
    Tool Pak.

35
An extension of our basic approach for analyzing
data sets
  • Sometimes the pattern of a large number of
    observations is so strong that we can approximate
    it with a density curve.
  • Mathematical model of distribution.

36
City gas mileage (miles per gallon) of 856 vehicles
from 2001. Note: bars are for a relative
frequency histogram.
37
Proportion with 20 miles per gallon or less
38
Density Curve
39
Mean and Median of Density Curves
  • Symmetric curve: mean = median
  • Skewed to the right: mean > median
  • Skewed to the left: ???

40
Normal curves
  • The most important class of density curves.
  • Symmetric, unimodal and bell shaped.
  • Are completely described by mean, µ, and std.
    deviation, σ.

41
68, 95, 99.7 rule: Very important
42
68, 95, 99.7 rule: µ = 0, σ = 1 (called
Standard Normal Dist.)
43
68, 95, 99.7 rule: µ = ?, σ = ?
44
Generalization of 68, 95, 99.7 Rule
  • Can consider any number of standard deviations
    from mean, not just 1, 2, or 3.
  • Need to measure how many standard deviations we
    are from mean.
  • Let x be an observation from a normal dist with
    mean, mu and std. dev, sigma.
  • Standardized value of x is
  • z = (x - mu) / sigma
  • z tells us how many standard deviations x is
    above or below the mean
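
A short sketch of the standardization step. The function name is hypothetical; the numbers (mean 250 ml, std. dev. 15 ml, fills of 240 ml and 275 ml) are borrowed from the vending machine example later in the deck.

```python
def standardize(x, mu, sigma):
    """Return z = (x - mu) / sigma: how many std. deviations x is from the mean."""
    return (x - mu) / sigma


z_240 = standardize(240, mu=250, sigma=15)  # negative: below the mean
z_275 = standardize(275, mu=250, sigma=15)  # positive: above the mean

print(round(z_240, 2), round(z_275, 2))
```

A fill of 240 ml is about two thirds of a standard deviation below the mean, and a fill of 275 ml is about one and two thirds standard deviations above it.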

45
Standard Normal Distribution and Key Result for
Normal Distributions
46
Normal Tables
  • Table A (front cover of text) gives the proportion
    of observations that fall to the left of z
    standard deviations from the mean

47
Normal Tables (positive z)
48
Normal Tables (negative z)
49
Normal Table Example (cont.)
  • What about the proportion of observations between
    z = -2.15 and z = 1.2?
  • What proportion lie outside the above range?
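
The table lookups can be cross-checked in software. This sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) in place of Table A; it assumes the two cut-offs are z = -2.15 and z = 1.2.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

left_of_low = std_normal.cdf(-2.15)   # area to the left of z = -2.15
left_of_high = std_normal.cdf(1.2)    # area to the left of z = 1.2

between = left_of_high - left_of_low  # proportion between the two z values
outside = 1 - between                 # proportion outside that range

print(round(between, 4), round(outside, 4))
```

The same subtraction applies when reading the two left-hand areas from the table: area left of the larger z minus area left of the smaller z.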

50
Standardizing example
  • You do
  • Example: Fills from a vending machine have a mean of
    250 ml with a standard deviation of 15 ml.
  • Consider fills of 240 ml and 275 ml.
  • Get the corresponding z values. Interpret.

51
Calculations for general normal distributions
  • Example: Fills from a vending machine have a mean of
    250 ml with a standard deviation of 15 ml.
  • What proportion of fills are
  • Below 240 ml?
  • Between 240 and 275 ml?
  • Above 330 ml?
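
A sketch of these three calculations, assuming fills are normally distributed with mean 250 ml and std. dev. 15 ml. It uses Python's `statistics.NormalDist` rather than Table A, so the results may differ from table answers in the last digit because the table rounds z to two decimals.

```python
from statistics import NormalDist

fills = NormalDist(mu=250, sigma=15)  # assumed normal model for the fills

below_240 = fills.cdf(240)                         # proportion below 240 ml
between_240_275 = fills.cdf(275) - fills.cdf(240)  # proportion between 240 and 275 ml
above_330 = 1 - fills.cdf(330)                     # proportion above 330 ml

print(round(below_240, 4), round(between_240_275, 4), above_330)
```

Since 330 ml is more than five standard deviations above the mean, the last proportion is essentially zero.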

52
Working backwards
  • 4% of values for the standard normal distribution
    are bigger than what value?
  • The most central 90% of the standard normal
    distribution lies between what values?

53
Working backwards (cont.)
  • Vending machine example. Mean = 250, std.
    deviation = 15.
  • 6% of fills are above how many ml?
  • ______________________
  • (Note: Excel and some calculators can compute
    normal probabilities, as covered in a later slide,
    but you should also know how to use tables for
    exams.)
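
Working backwards means going from an area to a value. The sketch below reads the figures 4, 90 and 6 in the questions above as percentages and uses `NormalDist.inv_cdf` from Python's standard library as a stand-in for a reverse table lookup.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, std. dev. 1

# 4% of values are bigger than z, so 96% lie to the left of z.
z_top_4 = std_normal.inv_cdf(0.96)

# Central 90% leaves 5% in each tail.
z_low = std_normal.inv_cdf(0.05)
z_high = std_normal.inv_cdf(0.95)

# Vending machine: 6% of fills are above x, so 94% lie to the left of x.
fills = NormalDist(mu=250, sigma=15)
x_top_6 = fills.inv_cdf(0.94)

print(round(z_top_4, 2), round(z_low, 2), round(z_high, 2), round(x_top_6, 1))
```

By hand, the last step is the standardization formula run in reverse: x = mean + z * std. dev.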

54
Ex 1.98
  • The yearly rate of return on stock indices is
    approximately normal. Between 1950 and 2000
    U.S. common stocks had a mean yearly return of
    about 13% with a standard deviation of about 17%.
  • In what range do the middle 95% of all yearly
    returns lie?
  • In what percent of the years is the market down
    for the year (return < 0)?
  • In what percent of the years does the index gain
    25% or more?
  • In 25% of the years the gain is at least how much?

55
SAT Scores
  • In 2000, male scores on the Math SAT were normal
    with a mean of 533 and a standard deviation of
    115.
  • What fraction scored 750 or better?
  • What is the 99th percentile of male SAT scores?

56
Using Excel for Normal Distributions
  • See Excel manual for full details.
  • Area under a normal curve to the left of x
  • NORMDIST(x, µ, σ, 1)
  • Find value such that a proportion p of the
    observations lie to the left.
  • NORMINV(p, µ, σ).

57
How to tell if a distribution is normal?
  • Look at histogram, symmetry, etc.
  • A Normal quantile plot is a more sensitive tool.
  • A macro for this is available on the textbook
    website.

58
Basic idea of normal quantile plot
  • Percentiles of the distribution being considered
    and standard normal distn. will be linearly
    related.
  • Get percentile for each observation x.
  • For each percentile so obtained, get
    corresponding z value from standard normal
    distribution.
  • Plot x values against z values.
  • Macro does all this with some extra technical
    details related to sampling.
  • If data follows a normal distribution (more or
    less), plot will be approximately a straight
    line.
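
The steps above can be sketched as follows. This is not the textbook macro: the function name and data are hypothetical, and the percentile assigned to the i-th smallest of n observations is assumed to be (i - 0.5) / n, one common choice among several.

```python
from statistics import NormalDist


def normal_quantile_points(values):
    """Return (z, x) pairs for a normal quantile plot of the data."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()  # standard normal: mean 0, std. dev. 1
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n          # assumed percentile of the i-th observation
        z = std_normal.inv_cdf(p)  # z value with proportion p to its left
        points.append((z, x))
    return points


data = [48, 52, 55, 57, 60, 63, 68]  # hypothetical observations
points = normal_quantile_points(data)

for z, x in points:
    print(f"z = {z:6.2f}   x = {x}")
```

Plotting x against z from these pairs and checking for a roughly straight line is the visual test described above.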

59
Quantile plot for unemployment data (excluding
Puerto Rico)
60
Cincinnati Reds Salary Data