Strategy and tactics for graphic multiples in Stata

1
Strategy and tactics for graphic
multiples in Stata

Nicholas J. Cox
Department of Geography
Durham University, UK

2
Comparison

Many useful graphs compare two or more sets of
values, and so can be thought of as multiples.
Often there can be a fine line between richly
detailed graphics and busy, unintelligible
graphics that lead nowhere.
In this presentation I survey strategy and
tactics for developing good graphic multiples in
Stata.

3
Strategies: what to do

superimpose (on top) or juxtapose (alongside)?
plot different versions or reductions of the data
transform scales for easier comparison
linear reference patterns
backdrops of context

4
Tactics: details of what to do

over() and by() options and graph combine
kill the key or lose the legend if you can
annotations and self-explanatory markers
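
A minimal sketch of the first choice above, superimposing versus juxtaposing, using official commands only. The auto dataset shipped with Stata is a stand-in (it is not one of the datasets of this talk), and the graph names are arbitrary.

```stata
* Stand-in data: the auto dataset shipped with Stata (an assumption, not from the talk)
sysuse auto, clear

* Superimpose: both groups in one panel, told apart by open marker symbols
twoway (scatter mpg weight if foreign == 0, msymbol(Oh))    ///
       (scatter mpg weight if foreign == 1, msymbol(Th)),   ///
       legend(order(1 "Domestic" 2 "Foreign"))              ///
       ytitle("Mileage (mpg)") name(superimposed, replace)

* Juxtapose with by(): one panel per group, common axes by default
scatter mpg weight, by(foreign) msymbol(Oh) name(juxtaposed, replace)

* Juxtapose with graph combine: build each panel separately, then assemble
scatter mpg weight if foreign == 0, title("Domestic") name(g_dom, replace) nodraw
scatter mpg weight if foreign == 1, title("Foreign")  name(g_for, replace) nodraw
graph combine g_dom g_for, xcommon ycommon name(combined, replace)
```

by() is less work when the panels share one recipe; graph combine is the fallback when panels need different commands or options.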

5
Datasets visited

James Short's collation from the transit of Venus
Florence Nightingale's data on deaths in the
Crimean War
deaths from the Titanic sinking
Grunfeld panel data
admissions to Berkeley
hostility in response to insult or apology
fluctuations in Arctic sea ice

6
Original programs discussed

catplot (SSC)
devnplot (SSC)
qplot (Stata Journal)
sparkline (SSC)
spineplot (SJ)
stripplot (SSC)
tabplot (SSC)

7
Categorical comparisons

8
Berkeley admissions data

A classic dataset covers admissions to six
graduate majors by gender at UC Berkeley.
At first sight, females were discriminated
against.
But there is an underlying interaction: major by
major, females generally do well, yet their
acceptance rates are worse on more popular
majors.
This is an example of an amalgamation paradox
named for E.H. Simpson (born 1922) but known to K.
Pearson (1857–1936) and G.U. Yule (1871–1951).

9
Berkeley data references

The original reference was Bickel, P.J., E.A.
Hammel and J.W. O'Connell. 1975. Sex bias in
graduate admissions: Data from Berkeley. Science
187: 398–404.
The Berkeley data were discussed as an example
for Stata in Cox, N.J. 2008. Spineplots and
their kin. Stata Journal 8: 105–121.

10
A simple problem?

The structure of the data is already well known.
The challenge is how best to present it.
There are three categorical variables:
major (anonymously A, B, C, D, E, F)
gender (male, female)
decision (accept, reject)
so the data are just 24 frequencies.
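
A minimal sketch of the arithmetic behind the paradox. The frequencies are those widely circulated for the Bickel et al. data; they are not listed on the slides, so treat them as an assumption here. Rejections are applicants minus admissions, which gives the 24 frequencies. Variable and scalar names are my own.

```stata
* Frequencies as widely circulated for the Bickel et al. (1975) data
* (assumption: not given on the slides)
clear
input str1 major m_admit m_apply f_admit f_apply
A 512 825  89 108
B 353 560  17  25
C 120 325 202 593
D 138 417 131 375
E  53 191  94 393
F  22 373  24 341
end

* Admission rates (%) within each major
generate m_rate = 100 * m_admit / m_apply
generate f_rate = 100 * f_admit / f_apply
list major m_apply m_rate f_apply f_rate, noobs

* Aggregate admission rates (%) over all six majors
foreach v of varlist m_admit m_apply f_admit f_apply {
    quietly summarize `v'
    scalar T_`v' = r(sum)
}
display "male   " %4.1f 100 * T_m_admit / T_m_apply
display "female " %4.1f 100 * T_f_admit / T_f_apply

* In how many majors is the female rate the higher one?
count if f_rate > m_rate
```

With these frequencies the aggregate rates are about 44.5% for males and 30.4% for females, yet the female rate is higher in four of the six majors.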

11
Bar chart

Many researchers would reach first for a bar
chart.
Here is a slightly non-standard example, produced
by tabplot (SSC), which is for one-way, two-way
or three-way bar charts.
One feature here is showing numbers too in a
hybrid of graph and table.
A cosmetic detail is toning down the use of
colour. Large blocks with strong colours are
unsubtle.

13
Mosaic plot or spineplot

The previous bar chart omitted the frequencies.
We can show them using a mosaic plot or
spineplot.
The proportions of both variables are shown,
giving marginal and conditional distributions.
Areas of tiles are proportional to raw
frequencies. Departures from independence are
easily seen.
The program here is spineplot.

15
Drilling down

The bar chart and spineplot do a fair job of
showing the gross breakdown with four percents.
(Two are redundant.)
Predictably, both would be rejected as trivial by
many journal reviewers, but both could be useful
for presentations.
But clearly we need to drill down to see the
patterns for different majors.

16
More detailed bar chart

Stacking bars is a standard strategy, but the
result is immediately much more complicated.
Showing all the detail does not always help.
Focusing more sharply on the response of interest
is a way forward.
In general there is no need for alphabetical
order. Here majors A to F are already ordered by
admission rate.

18
Dot chart

Dot charts as advocated by W.S. Cleveland remain
under-used by comparison with bar charts.
In Stata that usually means graph dot.
By using marker position alone, rather than bar
length, they are less busy and thus ease more
detailed comparison.
Here it is easier to identify that female
admission rates are higher for four majors and
lower for the other two.

20
Details for dot charts

Open symbols (e.g. ○ not ●) tolerate overlap much
better than closed symbols. ○ can even be
combined with [symbol lost in transcript] whenever
nearly equal values are possible.
Legends (keys) are at best a necessary evil.
Self-explanatory or at least memorable
symbolisation is to be prized wherever it is
possible. Using blue for males and pink for
females is a simple example.
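
A minimal sketch of such a dot chart with graph dot, using open circles and the blue and pink colours just mentioned. The frequencies are those widely circulated for the Berkeley data and are an assumption here, as are all variable names; this is not necessarily how the graph in the talk was drawn.

```stata
* Widely circulated Berkeley frequencies (assumption: not given on the slides)
clear
input str1 major m_admit m_apply f_admit f_apply
A 512 825  89 108
B 353 560  17  25
C 120 325 202 593
D 138 417 131 375
E  53 191  94 393
F  22 373  24 341
end

* Admission rates (%) by major, one variable per gender
generate male   = 100 * m_admit / m_apply
generate female = 100 * f_admit / f_apply

* (asis) plots the values as given; both genders share one line per major
graph dot (asis) male female, over(major)        ///
    marker(1, msymbol(Oh) mcolor(blue))          ///
    marker(2, msymbol(Oh) mcolor(pink))          ///
    legend(order(1 "male" 2 "female"))           ///
    ytitle("% admitted")
```

Holding the two rates in separate variables is what puts both markers on the same line for each major; over(gender) over(major) on long data would give each gender its own line.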

21
A scatter plot?

Many statistically-minded people find the idea of
bar charts trivial, but their practice not very
helpful. Where is the scatter plot, they cry?
Plotting admission rate against number of
applicants re-introduces a crucial aspect, size
of major. This allows identification of positive
correlation for males and negative correlation
for females, hence the paradox.
This is currently my favourite plot for these
data.

23
Previously

In an earlier version of this plot I had
admissions versus applications, both raw
frequencies.
Reference lines here are lines through the origin
such as y = x and y = 0.5x for 100% and 50%
admission rates.
But it is simpler to plot admission rates. Then
the reference lines are horizontal.
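
A minimal sketch of the rate version of the scatter plot, with horizontal reference lines at 50% and 100%. The frequencies are those widely circulated for the Berkeley data and are an assumption here; the variable names and cosmetic choices are mine, not necessarily those behind the graph in the talk.

```stata
* Widely circulated Berkeley frequencies (assumption: not given on the slides)
clear
input str1 major m_admit m_apply f_admit f_apply
A 512 825  89 108
B 353 560  17  25
C 120 325 202 593
D 138 417 131 375
E  53 191  94 393
F  22 373  24 341
end

generate m_rate = 100 * m_admit / m_apply
generate f_rate = 100 * f_admit / f_apply

* Admission rate versus size of major, each point labelled by major
twoway (scatter m_rate m_apply, msymbol(Oh) mcolor(blue) mlabel(major))  ///
       (scatter f_rate f_apply, msymbol(Oh) mcolor(pink) mlabel(major)), ///
       yline(50 100, lcolor(gs12))                                       ///
       yscale(range(0 100)) ylabel(0(25)100, angle(horizontal))          ///
       ytitle("% admitted") xtitle("Number of applicants")               ///
       legend(order(1 "male" 2 "female"))

* Sign of the correlation for each gender
correlate m_rate m_apply
correlate f_rate f_apply
```

With raw admissions on the y axis the same reference levels would need sloping lines, for example through function y = 0.5 * x, which is one more thing for the reader to decode.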

24
Slogans: the banal in search of the profound

Focus as far as possible on the response or
outcome, the variable you most want to explain.
Linear reference patterns are good and horizontal
patterns better.
Omit what is unimportant and keep what is
important.
Even for a very simple problem, it is rare that a
single graph meets all needs.

25
Continuous comparisons

26
Hostility change

Results of an experiment reported by Atkinson, C.
and J. Polivy. 1976. Effects of delay, attack,
and retaliation on state depression and
hostility. Journal of Abnormal Psychology 85:
570–576.
Male and female subjects were made to wait and
then either were insulted or received an apology.
Half were given a chance to retaliate by
negatively evaluating the experimenter.
Hostility was measured before and after the
experiment.

27
Variables in hostility study

Response:
Change in hostility, a difference of scores
and so approximately continuous
Predictors (all binary):
Treatment: insult, apology
Gender: male, female
Retaliation allowed: yes, no

28
ANOVA-type problems: what to plot?

Change in hostility is adequately modelled by a
simple linear model, using analysis of variance.
What to plot for similar analyses is key here.
Box plots (with medians etc.) are surprisingly
common even when comparison of means is the
central question.
Plotting means with standard errors or confidence
intervals is also common, but what about the
detail omitted?

29
devnplot (SSC)

devnplot (SSC) is named for its emphasis on
plotting deviations. Deviations are measured from
any level you care to specify, but deviations
from means are the default.
devplot was too ugly and deviationplot too
long.
Quantile enthusiasts will see it as a way to plot
ordered quantiles side by side. Compare quantile
or qplot (SJ).

30
devnplot syntax

The syntax resembles standard modelling syntax,
response named first and any predictors
following.
With one variable named we get in essence a
quantile plot for that variable, a plot of the
ordered values versus an implicit cumulative
probability scale.
The scaffolding emphasising that each value can
be represented by a deviation from a level might
seem redundant, but bear with me.

32
Adding predictors to the syntax

You can specify either one or two predictors.
The result is a quantile plot for each subset,
namely a category or combination of categories.
An undocumented upper limit arising from a limit
in graph is 20 subsets, but more than 20 would
likely be too busy anyway.
A third binary predictor can be shown indirectly
by a separate() option.
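
A sketch of the calling patterns just described, assuming devnplot has been installed from SSC. The auto dataset is a stand-in for the hostility data, the indicator heavy is made up for illustration, and the argument form of separate() (a variable name) is my assumption; check the help for devnplot before relying on it.

```stata
* Assumes devnplot is installed, e.g. via: ssc install devnplot
* Stand-in data: auto dataset shipped with Stata (not the hostility data)
sysuse auto, clear

* Hypothetical second binary predictor, created only for this illustration
generate byte heavy = weight > 3000
label define heavy 0 "lighter" 1 "heavier"
label values heavy heavy

* One variable: in essence a quantile plot of mpg
devnplot mpg

* One predictor: a quantile plot for each category
devnplot mpg foreign

* Two predictors: a quantile plot for each combination of categories
devnplot mpg foreign heavy

* A further binary predictor shown indirectly (argument form assumed)
devnplot mpg foreign, separate(heavy)
```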

36
devnplot virtues

The display serves well in showing variation
within subsets as well as variation between.
Interactions can be seen.
The scaffolding (in subtle gray) helps to tie the
values of a group together visually.
The separate() option is best used to highlight a
few unusual or interesting cases.

37
Waterfall plots

Similar plots have been called waterfall plots,
especially in clinical oncology.
But watch out: waterfall plots (or charts) have
at least two quite different meanings elsewhere,
in business and physical science contexts.
Sometimes the jungle of plot names is just a
confounded nuisance.

38
James Short and the transit of Venus (1763)

Short collated and corrected observations made by
various astronomers during the transit of Venus
in 1761.
The parallax here is the angle subtended by the
earth's radius, as if viewed and measured from
the surface of the sun.
The data will be published and discussed in Stata
Journal 13(3).

39
Deviation plot

A deviation plot adjusts to the differing sample
sizes.
Here deviations are relative to 25% trimmed means
(otherwise known as midmeans or interquartile
means). Boxplot fans can think that they average
values within the box.
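
A rough sketch of the "average within the box" reading of a midmean, using official commands and the auto dataset as a stand-in for Short's data. An exact 25% trimmed mean may weight values at the trimming boundaries fractionally, so this is an approximation. The scalar names are my own.

```stata
* Stand-in data: auto dataset shipped with Stata (not Short's observations)
sysuse auto, clear

* Quartiles mark the ends of the box in a box plot
quietly summarize mpg, detail
scalar q25 = r(p25)
scalar q75 = r(p75)
scalar mean_all = r(mean)

* Average only the values lying within the box
quietly summarize mpg if inrange(mpg, q25, q75)
scalar midmean = r(mean)

display "mean " %5.2f mean_all "   midmean (approx.) " %5.2f midmean
```

The appeal as a reference level is resistance: an occasional wild value in the tails cannot move it, yet half the data still contribute.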
The context here of careful precise measurement
does not rule out the occasional mild or even
strong outlier.

41
Quantile plots

Deviation plots (waterfall plots, if you prefer)
are in essence quantile plots.
qplot from SJ can
superimpose through its over() option
or juxtapose through its by() option.
How well does that compare?
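
A sketch of the two calls, assuming qplot from the Stata Journal is already installed (search qplot should locate it). The auto dataset is a stand-in for the transit of Venus data, and the graph names are arbitrary.

```stata
* Assumes qplot (Stata Journal) is already installed
* Stand-in data: auto dataset shipped with Stata
sysuse auto, clear

* Superimpose: quantile traces for both groups in one panel
qplot mpg, over(foreign) name(q_over, replace)

* Juxtapose: one panel per group
qplot mpg, by(foreign) name(q_by, replace)
```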

44
devnplot or qplot?

I prefer devnplot here, although qplot has useful
options too, including flexibility over axis
scales.
For example, if we plot against standard normal
quantiles, normal (Gaussian) distributions will
follow straight lines.

46
Strip plot

An alternative display is a strip plot or dot
plot. (Many other names exist.)
Here it takes on the flavour of a histogram but
with markers or point symbols for each value.
Some binning allows stacking.
stripplot from SSC offers an alternative to
official Stata's dotplot.
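
A sketch comparing the two commands, assuming stripplot has been installed from SSC. The auto dataset is a stand-in, and the bin width of 1 is an arbitrary choice for these data.

```stata
* Assumes stripplot is installed, e.g. via: ssc install stripplot
* Stand-in data: auto dataset shipped with Stata
sysuse auto, clear

* Bin to width 1 and stack markers that fall in the same bin
stripplot mpg, over(foreign) stack width(1) name(strip, replace)

* Official Stata's counterpart
dotplot mpg, over(foreign) name(dots, replace)
```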

48
Histograms or box plots?

Many statistical people would start almost
automatically with histograms or box plots for
such data. How do they compare?
You can judge for yourself.
A specific problem with histograms is keeping the
amount of scaffolding down. It is easy to lose
valuable real estate in axis and title
information.

51
How did we do that?

The main trick here is moving the subtitles to
the left. It only works here because they are so
short, but accept good fortune, however it
comes.
The incantation is
subtitle(, ring(1) pos(9) nobox nobexpand)
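
A minimal sketch showing where that option sits in a full command. The auto dataset is a stand-in; the bin width and the other cosmetic choices are mine.

```stata
* Stand-in data: auto dataset shipped with Stata
sysuse auto, clear

* One histogram per group, stacked in a single column;
* subtitle() moves each panel's title to the left of its plot region
histogram mpg, frequency width(2)                  ///
    by(foreign, cols(1) note(""))                  ///
    subtitle(, ring(1) pos(9) nobox nobexpand)     ///
    ylabel(, angle(horizontal))
```

Note that subtitle() goes outside by(): inside a by() graph the subtitle is the per-panel label.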

52
Box plots

Box plots do work fairly well, but they just
leave out too much detail for my taste.
If the details are accessible, you can decide for
yourself whether they are trivial.

54
Timed comparisons

55
Time series

Comparisons of time series are an especially
rich, and especially challenging, area of
statistical graphics.
The widespread term spaghetti plot hints
immediately at the difficulties.
As always, we want to combine a grasp of general
patterns with access to individual details.

56
sparkline

The Grunfeld data (webuse grunfeld) are a classic
dataset in panel-based economics.
Ten companies were monitored for 1935–54.
This can be an example for sparkline (SSC).
The name sparkline was suggested by Edward Tufte
for intense text-like graphics. Time series are
the most obvious example.

57
Vertical and horizontal

By default sparkline stacks small graphs
vertically.
If several graphs are combined, it is typical to
cut down on axis labels and rely on differences
in shape to convey information.
Horizontal stacking is also supported, which can
be useful for archaeological or environmental
problems focused on variations with depth or
height.
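
A sketch of the contrast using official commands only: a spaghetti plot and a vertical stack of small graphs that leans on shape rather than axis labels. This imitates the idea of sparkline rather than calling it. With sparkline installed from SSC, a call along the lines of sparkline invest year, by(company) is what I would expect to try, but that syntax is an assumption to check against its help.

```stata
* Needs internet access to fetch the dataset
webuse grunfeld, clear
xtset company year

* Superimposed: the spaghetti plot
xtline invest, overlay legend(off) name(spaghetti, replace)

* Juxtaposed: one small graph per company, stacked vertically,
* each with its own y scale and no y axis labels
line invest year,                                   ///
    by(company, cols(1) yrescale note(""))          ///
    subtitle(, ring(1) pos(9) nobox nobexpand)      ///
    ytitle("") ylabel(none) name(stacked, replace)
```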

61
Nightingale's data

Florence Nightingale (1820-1910) is well
remembered for her nursing in the Crimean War and
less so as a pioneer in data analysis.
Her most celebrated dataset is often reproduced
using her polar diagram, but is easier to think
about as time series.
Zymotic (loosely, infectious) disease mortality
dominates other kinds, so much so that a square
root scale helps comparison. (A logarithmic scale
over-transforms here.)
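
A sketch of one way to get a square root scale in Stata: plot the root, but label the axis with values on the original scale. The sp500 dataset shipped with Stata is a stand-in for Nightingale's data, and the label values 5000(5000)20000 are chosen to suit that stand-in.

```stata
* Stand-in data: sp500 dataset shipped with Stata (not Nightingale's data)
sysuse sp500, clear

* Plot on the root scale
generate double root_volume = sqrt(volume)

* Build axis labels: position on root scale, text on original scale
local labels
foreach v of numlist 5000(5000)20000 {
    local labels `labels' `=sqrt(`v')' "`v'"
}

line root_volume date,                              ///
    ylabel(`labels', angle(horizontal))             ///
    ytitle("Volume (square root scale)")
```

The same loop serves for any transformation: change sqrt() in both places.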

64
Sparkline?

A sparkline display is useful to show relative
shape, such as times of peaks.
We see that seasonality is only part of what is
being seen. The harsh winter of 1854–5 coincided
with some of the hardest battles of the war.

66
Arctic sea ice

Another time series example concerns seasonal
variation in Arctic sea ice for 2002-13, just 12
annual series.
The usual spaghetti plot shows the similarity of
series well, but makes comparing them difficult.
Although some people try using a key or legend,
that rarely works well beyond a very few series.
Separating out the series runs into the opposite
problem.

69
Combine: backdrop as context

So, use both ideas:
Plot all data as a backdrop
(subdued, say using grayscale).
Plot each series within its context
(with stronger colour, thicker line).
See for discussion Cox, N. J. 2010. Graphing
subsets. Stata Journal 10: 670–681.
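
A minimal sketch of the backdrop idea with official commands, using the Grunfeld panel as a stand-in for the sea ice series. This loop with graph combine is one way to do it and not necessarily the method of the paper cited; graph and macro names are my own.

```stata
* Needs internet access; stand-in for the Arctic sea ice series
webuse grunfeld, clear
sort company year

local names
forvalues i = 1/10 {
    * Backdrop: every series in gray; connect(L) joins points only while
    * year is ascending, so the line breaks between companies.
    * Foreground: the one series of interest, stronger and thicker.
    twoway (line invest year, connect(L) lcolor(gs12))                       ///
           (line invest year if company == `i', lcolor(blue) lwidth(thick)), ///
           legend(off) ytitle("") xtitle("")                                 ///
           title("company `i'", size(medium))                                ///
           name(g`i', replace) nodraw
    local names `names' g`i'
}

* One panel per series, each seen against all the others
graph combine `names', cols(5)
```

The Grunfeld series differ greatly in level, so the smaller companies hug the axis here; series on a similar scale, as with the sea ice data, suit the idea better.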

71
Cross-fertilisation

72
Titanic data

The Titanic sank in 1912. Statistically, we want
to explain fraction survived in terms of age, sex
and class of those on board.
A standard graph is a stacked or divided bar
graph, but it lacks punch. The command used was
catplot (SSC).
So, we end with something rather different,
produced with devnplot.
