Title: Lecture Unit 2 Graphical and Numerical Summaries of Data
1 Lecture Unit 2: Graphical and Numerical Summaries of Data
- UNIT OBJECTIVES
- At the conclusion of this unit you should be able to:
- 1) Construct graphs that appropriately describe data.
- 2) Calculate and interpret numerical summaries of a data set.
- 3) Combine numerical methods with graphical methods to analyze a data set.
- 4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.
- 5) Apply software and/or calculators to automate graphical and numerical summary procedures.
2 Displaying Qualitative Data
- Section 2.1
- Sometimes you can see a lot just by looking.
- Yogi Berra
- Hall of Fame Catcher, NY Yankees
3 The three rules of data analysis won't be difficult to remember
- 1. Make a picture: it reveals aspects not obvious in the raw data and enables you to think clearly about the patterns and relationships that may be hiding in your data.
- 2. Make a picture to show important features of and patterns in the data. You may also see things that you did not expect: the extraordinary (possibly wrong) data values or unexpected patterns.
- 3. Make a picture: the best way to tell others about your data is with a well-chosen picture.
4 Bar Charts: show counts or relative frequency for each category
- Example: Titanic passenger/crew distribution
5 Pie Charts: show proportions of the whole in each category
- Example: Titanic passenger/crew distribution
6 Example: Top 10 causes of death in the United States, 2001
For each individual who died in the United States
in 2001, we record what was the cause of death.
The table above is a summary of that information.
7 Top 10 causes of death: bar graph
Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.
Top 10 causes of deaths in the United States 2001
8 Top 10 causes of deaths in the United States 2001
Bar graph sorted by rank → Easy to analyze
Sorted alphabetically → Much less useful
9 Top 10 causes of death: pie chart
Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.
Percent of people dying from top 10 causes of
death in the United States in 2001
10 Make sure your labels match the data. Make sure all percents add up to 100.
Percent of deaths from top 10 causes
Percent of deaths from all causes
11 Child poverty before and after government intervention (UNICEF, 1996)
- What does this chart tell you?
- The United States has the highest rate of child poverty among developed nations (22% of under 18).
- Its government does the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).
- Could you transform this bar graph to fit in 1 pie chart? In two pie charts? Why?
The poverty line is defined as 50% of national median income.
12 Unnecessary dimension in a pie chart
13 Contingency Tables: Categories for Two Variables
- Example: Survival and class on the Titanic
Marginal distributions
14 Contingency Tables: Categories for Two Variables (cont.)
- Conditional distributions.
- Given the class of a passenger, what is the chance the passenger survived?
15 Contingency Tables: Categories for Two Variables (cont.)
- Questions
- What percent of survivors were in second class? 118/710
- What percent were in second class and survivors? 118/2201
- What percent of the second-class passengers survived? 118/285
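The three questions differ only in which total goes in the denominator. A minimal sketch, using just the four counts quoted on the slide (the variable and function names are illustrative, not from the course materials):

```python
# Counts quoted on the slide (Titanic survival by class).
second_class_survivors = 118
all_survivors = 710
second_class_total = 285
everyone_aboard = 2201


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Conditional on surviving: share of survivors who were in second class.
pct_of_survivors = percent(second_class_survivors, all_survivors)
# Joint: share of everyone aboard who was both second class and a survivor.
pct_joint = percent(second_class_survivors, everyone_aboard)
# Conditional on class: share of second-class passengers who survived.
pct_of_second_class = percent(second_class_survivors, second_class_total)

print(pct_of_survivors, pct_joint, pct_of_second_class)  # 16.6 5.4 41.4
```

The numerator is the same in all three cases; only the group being conditioned on changes.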
16 3-Way Tables
- Example: Georgia death-sentence data
17 UC Berkeley Lawsuit
18 LAWSUIT (cont.)
19 Simpson's Paradox
- The reversal of the direction of a comparison or
association when data from several groups are
combined to form a single group.
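The reversal can be reproduced with a small numerical sketch. The counts below are HYPOTHETICAL, invented only to illustrate the effect; they are not the airline or lawsuit data from the slides. Airline A has the better on-time rate at each airport, yet Airline B has the better rate once the airports are combined, because A flies mostly into the airport where delays are common.

```python
# HYPOTHETICAL counts, invented for illustration: (on_time, flights) per airport.
flights = {
    "Airline A": {"Airport X": (90, 100), "Airport Y": (600, 1000)},
    "Airline B": {"Airport X": (850, 1000), "Airport Y": (50, 100)},
}


def rate(on_time, total):
    """On-time proportion for one airline at one airport."""
    return on_time / total


def combined_rate(by_airport):
    """On-time proportion after pooling all airports into a single group."""
    on_time = sum(o for o, _ in by_airport.values())
    total = sum(t for _, t in by_airport.values())
    return on_time / total


a, b = flights["Airline A"], flights["Airline B"]

# Within each airport, A beats B (0.90 vs 0.85 and 0.60 vs 0.50).
a_wins_each_airport = all(rate(*a[k]) > rate(*b[k]) for k in a)
# After combining, the comparison reverses (690/1100 vs 900/1100).
b_wins_combined = combined_rate(b) > combined_rate(a)

print(a_wins_each_airport, b_wins_combined)  # True True
```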
20 Fly Alaska Airlines, the on-time airline!
21 America West Wins! You're a Hero!
22 Section 2.2: Displaying Quantitative Data
- Histograms
- Stem and Leaf Displays
23 Relative Frequency Histogram of Exam Grades
[Figure: histogram; x-axis: Grade, 40 to 100; y-axis: Relative frequency, 0 to .30]
24 Frequency Histograms
25 Frequency Histograms
- A histogram shows three general types of information:
- It provides a visual indication of where the approximate center of the data is.
- We can gain an understanding of the degree of spread, or variation, in the data.
- We can observe the shape of the distribution.
26 All 200 m Races: 20.2 secs or less
27 Histograms Showing Different Centers
28 Histograms: Same Center, Different Spread
29 Frequency and Relative Frequency Histograms
- identify smallest and largest values in data set
- divide interval between largest and smallest values into between 5 and 20 subintervals called classes
- each data value in one and only one class
- no data value is on a boundary
30 How Many Classes?
31 Histogram Construction (cont.)
- compute frequency or relative frequency of observations in each class
- x-axis: class boundaries
- y-axis: frequency or relative frequency scale
- over each class draw a rectangle with height corresponding to the frequency or relative frequency in that class
32 Ex. No. of daily employee absences from work
- 106 obs; approx. no. of classes:
- $(2 \cdot 106)^{1/3} = 212^{1/3} \approx 5.96$
- $1 + \log(106)/\log(2) = 1 + 6.73 = 7.73$
- There is no single "correct" answer for the number of classes.
- For example, you can choose 6, 7, 8, or 9 classes; don't choose 15 classes.
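The two rules of thumb on this slide can be checked with a short sketch. The function names are illustrative; the formulas are read off the slide as the cube root of 2n and 1 plus the base-2 logarithm of n.

```python
import math


def classes_cube_root(n):
    """Rule of thumb from the slide: cube root of 2n."""
    return (2 * n) ** (1 / 3)


def classes_log2(n):
    """Rule of thumb from the slide: 1 + log(n)/log(2)."""
    return 1 + math.log(n) / math.log(2)


n = 106  # number of observations in the absences example
cube_root_estimate = round(classes_cube_root(n), 2)
log2_estimate = round(classes_log2(n), 2)

print(cube_root_estimate)  # 5.96
print(log2_estimate)       # 7.73
```

Both values are only guides; any nearby whole number of classes (6 through 9 here) is a reasonable choice.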
33 EXCEL Histogram
34 Absences from Work (cont.)
- 6 classes
- class width: (158 - 121)/6 = 37/6 = 6.17, rounded up to 7
- 6 classes, each of width 7: classes span 6(7) = 42 units
- data spans 158 - 121 = 37 units
- classes overlap the span of the actual data values by 42 - 37 = 5
- lower boundary of 1st class: (1/2)(5) units below 121; 121 - 2.5 = 118.5
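The boundary arithmetic above can be sketched as a small function. This is an illustration under stated assumptions, not the procedure Excel uses: it assumes whole-number data, rounds the class width up to the next integer, and splits the leftover span equally on both sides of the data. The function name is hypothetical.

```python
import math


def class_boundaries(smallest, largest, num_classes):
    """Return the num_classes + 1 boundaries, centred on the data range."""
    data_span = largest - smallest
    width = math.ceil(data_span / num_classes)   # 37/6 = 6.17 -> 7
    overhang = num_classes * width - data_span   # 42 - 37 = 5
    lower = smallest - overhang / 2              # 121 - 2.5 = 118.5
    return [lower + i * width for i in range(num_classes + 1)]


boundaries = class_boundaries(121, 158, 6)
print(boundaries)
# [118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5]
```

Because the overhang is odd here, every boundary ends in .5 and no whole-number observation can fall on a boundary. When the overhang is even, the boundaries are whole numbers and would need a half-unit shift to keep that property.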
35 EXCEL histogram
36 Grades on a statistics exam
- Data
- 75 66 77 66 64 73 91 65 59 86 61 86 61
- 58 70 77 80 58 94 78 62 79 83 54 52 45
- 82 48 67 55
37 Frequency Distribution of Grades
Class Limits     Frequency
40 up to 50      2
50 up to 60      6
60 up to 70      8
70 up to 80      7
80 up to 90      5
90 up to 100     2
Total            30
38 Relative Frequency Distribution of Grades
Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
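The frequency and relative frequency tables can be reproduced from the 30 exam grades with a short sketch. It assumes that "40 up to 50" means 40 ≤ grade < 50, which is the reading that matches the tabulated counts (the grades 70 and 80 are counted in the class they begin). Variable names are illustrative.

```python
# The 30 exam grades listed on the "Grades on a statistics exam" slide.
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

class_width = 10
lower_limits = range(40, 100, class_width)  # 40, 50, ..., 90

# Assumption: "lo up to hi" includes lo and excludes hi.
frequency = {lo: sum(lo <= g < lo + class_width for g in grades)
             for lo in lower_limits}
relative = {lo: round(count / len(grades), 3)
            for lo, count in frequency.items()}

for lo in lower_limits:
    print(f"{lo} up to {lo + class_width}: {frequency[lo]:2d}  {relative[lo]:.3f}")
```

Each grade lands in exactly one class, so the frequencies sum to 30 and the relative frequencies sum to 1 up to rounding.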
39 Relative Frequency Histogram of Grades
[Figure: histogram; x-axis: Grade, 40 to 100; y-axis: Relative frequency, 0 to .30]
40 Stem and leaf displays
- Have the following general appearance:
- stem | leaf
- 1 | 8 9
- 2 | 1 2 8 9 9
- 3 | 2 3 8 9
- 4 | 0 1
- 5 | 6 7
- 6 | 4
41 Stem and Leaf Displays
- Partition each no. in data into a "stem" and "leaf"
- Constructing stem and leaf display:
- 1) determine stem and leaf partition (5-20 stems)
- 2) write stems in column with smallest stem at top; include all stems in range of data
- 3) only 1 digit in leaves; drop digits or round off
- 4) record leaf for each no. in corresponding stem row; ordering the leaves in each row helps
42 Example: employee ages at a small company
- 18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
- stem = 10's digit; leaf = 1's digit
- 18: stem = 1, leaf = 8; 18 = 1 | 8
- stem | leaf
- 1 | 8 9
- 2 | 1 2 8 9 9
- 3 | 2 3 8 9
- 4 | 0 1
- 5 | 6 7
- 6 | 4
43 Suppose a 95 yr. old is hired
- stem | leaf
- 1 | 8 9
- 2 | 1 2 8 9 9
- 3 | 2 3 8 9
- 4 | 0 1
- 5 | 6 7
- 6 | 4
- 7 |
- 8 |
- 9 | 5
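The construction steps can be sketched in a few lines. This is a minimal illustration for two-digit whole numbers, using the 10's digit as the stem and the 1's digit as the leaf. The function name and the `|` separator are choices made here, not part of the slides.

```python
def stem_and_leaf(values):
    """Return one display row per stem, smallest stem first.

    Every stem in the range of the data gets a row, even if it has no leaves.
    """
    rows = {}
    for v in sorted(values):          # sorting puts leaves in ascending order
        stem, leaf = divmod(v, 10)    # 18 -> stem 1, leaf 8
        rows.setdefault(stem, []).append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]

# Adding the 95-year-old forces empty rows for stems 7 and 8.
for line in stem_and_leaf(ages + [95]):
    print(line)
```

Keeping the empty stems is what makes the gap between 64 and 95 visible in the display.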
44 Number of TD passes by NFL teams, 2000 season (stems are 10's digit)
45 Pulse Rates, n = 138
46 Advantages/Disadvantages of Stem-and-Leaf Displays
- Advantages
- 1) each measurement displayed
- 2) ascending order in each stem row
- 3) relatively simple (data set not too large)
- Disadvantages
- display becomes unwieldy for large data sets
47 Population of 185 US cities with between 100,000 and 500,000
- Multiply stems by 100,000
48 Back-to-back stem-and-leaf displays. TD passes by NFL teams 1998, 2000; multiply stems by 10
49 Interpreting Graphical Displays: Shape
- A distribution is symmetric if the right and left
sides of the histogram are approximately mirror
images of each other.
50 Outliers
- An important kind of deviation is an outlier.
Outliers are observations that lie outside the
overall pattern of a distribution. Always look
for outliers and try to explain them.
The overall pattern is fairly symmetrical except
for 2 states clearly not belonging to the main
trend. Alaska and Florida have unusual
representation of the elderly in their
population. A large gap in the distribution is
typically a sign of an outlier.
[Figure labels: Alaska, Florida]
51 Other Graphical Methods for Economic Data
- Time plots
- plot observations in time order, with time on the horizontal axis and the variable on the vertical axis
- Time series
- measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)
52 Winning Times 100 M Dash
53 Annual Mean Temperature
54 End of Section 2.2