Title: Quantitative Methods in Policy Analysis
1Quantitative Methods in Policy Analysis
- Session 2 Summary Statistics
- Steve Fetter
- Tim Gulden
2Bad Graphs Revisited
3Statistics
- Collecting, analyzing, and presenting numerical
facts or data - Descriptive statistics organize, summarize, and
present data - Inferential statistics draw general conclusions
about a population based on a small sample drawn
from it
4Descriptive Statistics
- Accurately and honestly report, describe,
present, and summarize data, in numbers, tables,
and graphs - What is the central value?
- How widely are values spread from the center?
- Is this spreading symmetrical? Are there distinct
groups? - How common is a certain value?
- Are there data that are very atypical?
5Frequency Tables
- A frequency table indicates how many observations
fall in various categories - For continuous data, we must first choose
appropriate categories - enough categories so that we can see a meaningful
distribution, but - not so many that there are only a few
observations per category. - Intervals should be multiples of 1, 2, 2.5, or 5
- Rule of thumb divide the range of values into 8
to 15 equal bins more data ? more bins
6Histograms
- A histogram is the graphical analog of a
frequency table - We can create frequency tables and histograms in
Excel manually, with the data analysis toolpack,
with pivot tables, or with add-ins - For categorical variables, use a bar or pie chart
to display frequencies
7ACTOR.XLS
- Data for 66 actors and 4 variables
- Gender
- DomesticGross average domestic gross of last
few movies (in millions) - ForeignGross average foreign gross of last few
movies (in millions) - Salary current amount the actor asks for a movie
(in millions)
8Creating Histograms in Excel
- Create column containing upper limits for each
category (e.g., 2, 4, 620) bins - Manually
- FREQUENCY(salary,bins)
- Use chart wizard to create histogram
- With Data Analysis Toolpack
- Tools/Data Analysis/Histogram
- Enter bin and data ranges
9Try it ACTORS.XLS
10Histogram Shape
- symmetric
- skewed to the right (positively skewed)
- skewed to the left (negatively skewed)
- bimodal (or tri-modal, etc.)
11Symmetric
12Skewed to Right (Positive Skew)
13Skewed to Left (Negative Skew)
14Bimodal
15Scatterplots
- A useful way to picture the relationship between
two variable is to plot a point for each
observation a scatterplot - We can usually see whether there is any
relationship between the two variables, and, if
so, what type it is - Example relationship between actors salary and
movie gross
16Scatterplots with Excel
- Highlight column with Y-variable data
- Select Insert/Chart (or click Chart icon)
- Select type XY (Scatter) and click next
- Select series tab, enter range for X-variable
data - Enter axis labels, delete gridlines and legend
- Select location for graph
- Edit plot background, axis range, font sizes,
etc. - Save as custom type or copy and change data range
- If X-Y in adjacent columns, highlight both in
step 1
17The resulting scatterplot
Try it ACTORS.XLS
18Presidents approval rating v. number of House
seats lost by presidents party in mid-term
elections, 1946-98
19Pivot Tables
- Pivot tables are one of Excels most powerful
tools. They allow us to slice and dice the
data, breaking it into various subpopulations - Statisticians often refer to these as contingency
tables or crosstabs - Example actresses claim they are being underpaid
relative to male actors. A pivot table is one way
to investigate if the data support this claim
20Creating a Pivot Table
- Position the cursor in the data set
- Select Data/PivotTable
- Select Excel List and Pivot Table
- Verify that the range of the data set is correct
- Click Layout to specify variables
- to put a variable in any of the four areas (page,
row, column or data), click and drag it to area - drag salary to the row area, gender to the column
area, and gender to the data area
21Creating a Pivot Table (cont.)
- The table shows too much detail. To group the
salaries in wider bins - right-click on the salary column.
- select Group and Outline/Group
- in the dialogue box, enter 2 for starting at,
20 for ending at and 2 for by - The resulting table should look as follows after
expressing the counts as percentages (right-click
on count of gender, select field settings and
show data as of column) and rounding off to
two significant digits
22Modified Pivot Table
23Bad graphical presentations
24A good graphical presentation
25Analyzing the Table
- Over half the women are in the lowest salary
category, whereas only 19 of the men are in this
category. - No women are in the highest two salary
categories, whereas 23 of the men are in these
categories - Clearly, women are paid less. Is this evidence of
discrimination? What analysis would you do to
shed light on this question?
26More Pivot Table Analysis
- Perhaps women are paid less because movies with
female leads gross less - To analyze this, examine the average salary of
men and women for each domestic gross level - drag DomesticGross to the row area
- drag Gender to the column area
- drag Salary to data area (summarize by average)
- group DomesticGross in increments of 20
- round off!
27Average Female and Male Salaries by Domestic Gross
28Salary discrimination against actresses?
- Men average more than women in the two lowest
domestic gross categories, but not the third.
Beyond the third category, it is hard to tell
because no females were leads. - Thus, there appears to be discrimination against
women in low-grossing movies. To reach a more
definitive conclusion, we need to use more
sophisticated methods (e.g., chi-square or
regression analysis).
29Break10 minutes
30Summary Statistics
- Summary statistics describe data sets or
relationships between variables with one or a few
numbers - What is the central value?
- How widely are values spread from the center?
- Are there data that are very atypical?
- Are variables related?
31Measures of Central Location Mean
- The mean is the arithmetic average
- population mean (?) v. sample mean (x-bar)
- mean is representative or typical value if
the distribution is roughly symmetric - Excel command AVERAGE(data)
32Measures of Central Location Median
- The median is the middle observation, when data
are ordered from smallest to largest - If there is an odd number of observations, the
median is the middle observation - If n is even, the median is the average of the
two middle observations - Excel command MEDIAN(data)
33Mean or Median?
- median is representative, resistant, robust
statistic, even if distribution is skewed - Outliers affect mean more than the median
- 1, 3, 5, 7, 9 median 5, mean 5
- 1, 3, 5, 7, 90 median 5, mean 21
- Use the median when you want the typical value
(e.g., family income of a city) - Use mean when extreme values are more important
(e.g., deaths from reactor accidents)
34Measures of Variability
- The most common measure is the standard
deviation the square root of the average of the
squared deviations from the mean
Use ? for populations, s for samples. Why (n 1)
for samples? Because computing the sample mean
uses up a bit information, leaving only (n
1) independent bits to compute the standard
deviation.
35More on Standard Deviation
- We square the individual variations so that they
dont cancel out when added together - We take the square root to express the result in
the same units as the original data - Large deviations from the mean contribute more
heavily because they are squared - Standard deviation is sensitive to outliers it
is not a resistant statistic
36Variance
- The variance is the square of the standard
deviation - Less intuitive and more difficult to interpret,
because it is measured in squared units (e.g.,
squared dollars) rather than original units - We wont use variance much, but you should know
what it is.
37Rules of Thumb
- Most data sets obey these rules of thumb
- ? 2/3 of the observations are within one
- standard deviation of the mean (? ? ?)
- ? 95 are within 2? of the mean (? ? 2?)
- ?100 are within 3? of the mean (? ? 3?)
- Observations more than 3? from the mean are
generally called outliers
38Measures of Variability IQR
- The interquartile range (IQR) is a more resistant
measure than standard deviation - Use IQR with median std dev with mean
- Sort observations from smallest to largest,
divide data into four equal-sized sets at first
quartile (Q1), median, and third quartile (Q3) - IQR Q3 Q1
- QUARTILE(data,3) QUARTILE(data,1)
39Other Measures of Variability
- The range is the difference between the largest
and smallest values in the data set - A very crude measure of variability, but useful
for designing tables and graphs - MAX(data) MIN(data)
- Percentiles, or values below which a certain
percentage of the data lie) can also be used - PERCENTILE(data,0.95)
40Summary Statistics with Data Analysis
- Select
- Tools/Data Analysis/Descriptive Statistics
- Select data
- include labels and check Labels in First Row
- dont include non-numeric data
- Check Summary Statistics
- Select output option
- Reformat output
41Other Measures
- Kurtosis and skewness indicate the relative
peakedness and skewness of the distribution - Note if you use Data Analysis, summary
statistics wont change automatically if the data
change better to use formulae in this case
42Determining Linear Relationships
- Scatterplots provide insight into relationship
between variables, but making scatterplots for
each pair of a large number of variables is
tedious - To get a quick indication of possible linear
relationships we can use the correlation
statistic r for a large number of variables, a
correlation matrix
43Covariance and Correlation
- The relationship between two variables is
summarized by covariance and correlation - Each measures the strength (and direction) of a
linear relationship between two variables - Relationship is strong if data cluster tightly
around a straight line - If the values of one variable generally increase
with increasing values of the other variable, the
relationship is positive otherwise it is
negative
44Scatterplot Positive Relationship
45Scatterplot Negative Relationship
Correlation -0.413
46Correlation
- Excel command CORREL(xdata,ydata)
- More than two variables (correlation matrix)
- Select Tools/Data Analysis/Correlations
- Select data (with labels), output location
47Correlation Statistic
- correlation ranges between -1 (perfect negative
relationship) to 1 (perfect positive) 0 no
linear relationship at all - correlation between a variable and itself is 1
- correlation between X and Y is the same as the
correlation between Y and X thus, it is
sufficient to list the correlations below (or
above) the diagonal in the table.
48Boxplots
- A boxplot is a useful graphical method for
summarizing, comparing data - Analyse-it makes nice boxplots
- A sheet is added with the boxplot and the
statistics used to form it
49Boxplot Chart
50Understanding Boxplots
- The right and left of the box are Q1 and Q3 the
length of the box is the IQR the box contains
the middle 50 of the observations - The vertical line inside the box is the median
The point inside the box is the mean - The horizontal lines extend to most extreme
observations within 1.5 IQR from the box they
indicate variability and skewness - Observations more than 1.5 IQR from the box are
shown as points. If they are 1.5 to 3 IRQ from
the box, they are mild outliers otherwise, they
are extreme outliers.
51Students Heights by Gender
52Boxplots in Analyse-It
mean
Q3
median
Q3
53Manipulating Data in Excel
- Importing data fixed/delimited fields in text
files web queries - Cleaning data missing data, duplicate records,
invalid entries, Y2K errors, formatting errors,
coding - Selecting data sort, autofilter, advanced filter