Title: STATA Lab: EP521
STATA Lab EP521: Learning by Doing, Session 1
Exploring Data
Ray Boston, boston_at_vet.upenn.edu
Room 604 Blockley, 610 925 6557
This six-session Stata Lab series will expose the
second-level functionality of Stata through
practical demonstrations and exercises relating
to your course, EP 521.
Course Schedule
Presenter: Ray Boston. Location: Room 604 Blockley.
Phone: 610 925 6557. Email: boston_at_vet.upenn.edu
Commands used in this lab
use: use a Stata dataset earlier stored on disk (note the clear option)
inspect: inspect specific variables (note: missing value info is available here)
describe: describe variables in a Stata dataset (note the detail option)
summarize: summarize a Stata dataset (note: works on individual variables)
codebook: report details of data coding for the indicated variable(s)
display: display a message, or a variable value (scalar or local)
label: label a categorical variable
encode: make a numeric variable from a string variable
generate: generate a new Stata variable
replace: replace the value of a variable
list: list specific variables (note: tables can also be generated)
table: tabulate an interval variable by a categorical variable
tabstat: tabulate some statistics for specific variables
tabulate: tabulate some information (note: this is the command for Fisher's test)
sort: sort the data
gsort: sort the dataset in a specified way
Secondary commands used in this lab
for: loop over a series of objects (note that this is an out-of-date command)
collapse: reduce your dataset to summary statistics
scatter: produce a Stata 8 scatter plot
gr7: produce a Stata 7 graph
#d ; and #d cr: change the end-of-line delimiter (; and cr)
preserve: preserve a copy of the current Stata dataset in computer memory
restore: retrieve the preserved copy of the Stata dataset (note: the stored copy is no longer available, and the dataset in memory is replaced by it)
scalar: generate a Stata scalar variable
local: generate a Stata local variable (note: also called a local macro)
cc: generate a case-control type of epi table
cs: generate a cohort-study type of epi table
logit: perform a logistic regression
poisson: perform a Poisson regression
We may return to these commands for
specific purposes in later labs.
Problem: Woodward presents the following table
(Table 2.9, p. 48) relating sex to smoking
status in the Scottish Heart Health Study. Adapt
the information in this table for analysis with
STATA.
Variables: sex, smoker, count
Coding:
smoker: 0 = non-smoker, 1 = smoker
sex: 0 = female, 1 = male
count: actual cell count
The data can be entered into STATA via the data
editor.
Label the values of sex and smoker so that our
table makes sense:
. label define smlabel 0 "Non smoker" 1 "Smoker"
. label define selabel 0 "Female" 1 "Male"
. label val sex selabel
. label val smoker smlabel
Note that cell counts, and NOT margins, are
entered into STATA.
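As an alternative to typing into the data editor, the same four cells could be entered with the input command. The sketch below is one possible way to do it, not the lab's prescribed route: the cell counts and label names are the ones given in these slides, and the storage types (byte, int) are an assumption that happens to fit values of this size.

```stata
* A minimal sketch: enter the four cell counts (not the margins) by command.
* Storage types byte/int are assumptions; the counts come from the table.
clear
input byte sex byte smoker int count
0 0 2259
0 1 1562
1 0 2241
1 1 2279
end

* Attach the value labels and the variable label used in the lab
label define selabel 0 "Female" 1 "Male"
label define smlabel 0 "Non smoker" 1 "Smoker"
label values sex selabel
label values smoker smlabel
label variable count "Cell count"

* Recall the coding schemes and look at the result
label list selabel smlabel
list, noobs
```

Keeping the entry step in a command sequence means the dataset can be rebuilt exactly, which is harder to guarantee after hand-editing.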
Label the variable count:
. label var count "Cell count"
For some preliminary limbering up, let's explore the
data as it stands:
. list sex smoker count
  1.   Female   Smoker         1562
  2.   Male     Non smoker     2241
  3.   Female   Non smoker     2259
  4.   Male     Smoker         2279
Why wasn't count labeled like sex and smoker?
We should now save the table as a file:
. save "table 2_9 Woodward.dta", replace
Where was the data saved? (Hint: cd)
Why did we include the replace option? (Hint: pre-existence)
Why do we refer to replace as an option? (Hint: the comma)
Why did we use quotes (" ") around the file name? (Hint: the space)
What format was the data saved in? (Hint: .dta)
Let's see a table of this data:
. table sex smoker [fw=count], row col
----------------------------------------------
          |             smoker
      sex | Non smoker     Smoker      Total
----------+-----------------------------------
   Female |      2,259      1,562      3,821
     Male |      2,241      2,279      4,520
          |
    Total |      4,500      3,841      8,341
----------------------------------------------
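The [fw=count] qualifier tells Stata that each row stands for count identical subjects. One way to convince yourself of this, sketched here under the assumption that the four-row dataset is entered as in this lab, is to expand the data to one row per subject and tabulate again without weights; the two tables should agree.

```stata
* A minimal sketch: frequency weights versus one row per subject.
clear
input byte sex byte smoker int count
0 0 2259
0 1 1562
1 0 2241
1 1 2279
end

* Weighted tabulation of the four-row dataset
tabulate sex smoker [fw=count]

* Replace each row by count copies of itself, then tabulate unweighted
expand count
tabulate sex smoker
```

The expanded dataset has one observation per study participant, so the unweighted table gives the same cell counts as the weighted one.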
Let's see how we recall the coding schemes. Why
would this be needed?
. codebook sex
sex                                                   (unlabeled)
------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  selabel
                 range:  [0,1]                  units:  1
         unique values:  2              coded missing:  0 / 4
            tabulation:  Freq.   Numeric  Label
                             2         0  Female
                             2         1  Male
We will explore this data using the Stata command
sequence which follows.
- First, some EXTREMELY important points.
- In practice you will ALWAYS build your
  statistical exploration of data using command sequences such as we now
  demonstrate. Why?
- The commands in the command
  sequence are ALWAYS retained on your computer in a disk file,
  usually close to the dataset (table 2_9 Woodward.dta) for which it was
  developed. Why?
- Commands are stored as ordinary text in files
  called do files. Why?
- Stata has a special editor, the do file editor,
  for the creation and editing of do files.
  Why?
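To make the points above concrete, here is a hedged sketch of what a small do file might look like. The file names session1.do and session1.log are hypothetical, and the sketch assumes the dataset saved earlier sits in the current working directory.

```stata
* session1.do  (hypothetical file name)
* Assumes "table 2_9 Woodward.dta" is in the current working directory.

* Close any log left open by an earlier run, then start a plain-text log
capture log close
log using "session1.log", text replace

* Load the data and repeat the screening steps
use "table 2_9 Woodward.dta", clear
list
table sex smoker [fw=count], row col

log close
```

Because the do file is ordinary text stored beside the dataset, the whole analysis can be rerun, checked, and shared without retyping anything.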
use "C:\Stata\EP521\Epi 521 04\Session 1\table 2_9 Woodward.dta", clear
* Information about the raw data correctness/screening
list
codebook
describe
summarize
summarize sex smoke [fw=count]
label define smlabel 0 "Non smoker" 1 "Smoker"
label define selabel 0 "Female" 1 "Male"
label val sex selabel
label val smoker smlabel
list
* If we want to copy the table to Excel: Select, then Edit, Copy Table, and
* Paste the following table
list, nolabel noobs clean
codebook
inspect
describe
* Some tables describing the data
tabulate sex [fw=count], su(smoke) mean
table sex [fw=count], c(mean smoke freq) format(%7.2f)
tabulate sex smoke [fw=count], chi
table sex smoke [fw=count], row col
tabstat smoke [fw=count], s(mean sd sem N) by(sex) long
* Present some simple graphs of this data
preserve
collapse smoke [fw=count], by(sex)
gen pos3(sex1)
Get the data into Stata
Screening the input using list,
describe, summarize, codebook, inspect,
and table variations
Preparing to graph
scatter smoke sex, c(l) ml(sex)
more
scatter smoke sex, c(l) ml(sex) mlabv(pos)
more
* Now for adjustments required by Stata 8 graphics syntax
#d ;
scatter smoke sex, c(l) ms(Sh) mlabv(pos)
    xlabel(0 1, valuelabel)
    title("Smoking Proportion By Sex")
    ytitle(" ")
    ylabel(,angle(0));
#d cr
more
* gr7 requests a Stata 7 type graph
* You establish Stata 7 graph preferences using 'oldgprefs'
gr7 smoke sex, c(l) s(sex) xlabel(0 1) ylabel l1("Smoking Proportion By Sex")
more
* Let's determine the male:female risk ratio for smoking
di "Risk ratio: " max(smoke[1],smoke[2])/min(smoke[1],smoke[2])
restore
* Two alternate ways of looking at the data - Risk perspective
cs smoke sex [fw=count]
poisson smoke sex [fw=count], irr nolog ro
* Using scalars let's calculate the male:female odds ratio for smoking
gsort sex -smoke
scalar prob_female=count[1]/(count[1]+count[2])
scalar odds_female=prob_female/(1-prob_female)
scalar prob_male=count[3]/(count[3]+count[4])
scalar odds_male=prob_male/(1-prob_male)
scalar odds_ratio=odds_male/odds_female
scalar list _all
* Two alternate ways of looking at the data - Odds perspective
cc smoke sex [fw=count]
logit smoke sex [fw=count], or nolog
Stata 8 Graphing commands
Stata 7 Graphing command
Manual rr calculation
Two other ways of determining risk ratio - rr
Manual or calculation
Two other ways of determining odds ratio - or
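The manual risk ratio and odds ratio calculations can also be checked straight from the four cell counts, without any dataset in memory. The sketch below is an illustration rather than part of the lab's do file: the scalar names are made up for this example, and csi and cci are the immediate forms of cs and cc, taking the counts in the order exposed cases, unexposed cases, exposed non-cases, unexposed non-cases (here "exposed" means male and "case" means smoker).

```stata
* Cell counts from the table: male smokers 2279, male non-smokers 2241,
* female smokers 1562, female non-smokers 2259.
* Scalar names below are hypothetical, chosen for this sketch only.
scalar p_male   = 2279/(2279 + 2241)
scalar p_female = 1562/(1562 + 2259)

* Risk ratio: ratio of the two smoking proportions
scalar rr_hand = p_male/p_female

* Odds ratio: ratio of the two smoking odds
scalar or_hand = (p_male/(1 - p_male))/(p_female/(1 - p_female))

display "Risk ratio (male:female) = " %6.3f rr_hand
display "Odds ratio (male:female) = " %6.3f or_hand

* Cross-check with the immediate epi-table commands
csi 2279 1562 2241 2259
cci 2279 1562 2241 2259
```

The hand-computed values should match the risk ratio reported by csi and the odds ratio reported by cci, which is a quick way to confirm that the cells were entered in the intended order.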
An exercise to get you started using Stata
productively on your own
The following table is from Kahn & Sempos (p. 81)
and reflects a distillation of some information
extracted from the Framingham study.
Ultimately we would like to use these numbers to
possibly tell us:
- to what degree blood pressure elevation disposes us to CHD
- what the overall risk for CHD is amongst study participants in the table
- how much the risk of CHD is elevated if we have high blood pressure
Getting the CHD data into STATA and naming the
variables. What do we mean by naming the
variables?
Perform the following tasks:
- Screen the data entered to confirm its correctness. How could you
  generate the margins to add confidence here? Do it.
- Label the variables appropriately. What constitutes appropriate labeling?
- Save the Stata data file. Where did you save it? What format was used?
- Verify that you have indeed saved the Stata data file.
- Perform tests to verify that you have correctly prepared your data.
Tables
- Reproduce the table in which the problem is first introduced.
- Tabulate the proportion of subjects with CHD by blood pressure grouping.
- Add standard error estimates to this table.
- Are the proportions with CHD different by blood pressure group?
Graphs
- Collapse the data into proportions with CHD, by blood pressure group.
- Produce a simple Stata 8 graph of CHD proportion against blood pressure.
- Add features to your graph to make it publication ready.
- Produce a Stata 7 graph of the same data. Which was easier?
The Excel file, cardatarb.xls, contains some
recent (New Yorker, Jan 05, 2004) accident
statistics relating to indirect and direct road
deaths when a range of different car types were
involved. The purpose of the investigation
underpinning the data was to see if large
vehicles are associated with different types of
accidents than small cars. You are asked to
perform the following tasks:
- Get the data from Excel directly into Stata.
- Describe and summarize the data.
- Generate a neat table of all types of deaths (these are actually death
  rates per million vehicles of the indicated type) by vehicle type. Is
  there a suggestion of an association here?
- Make a numeric variable out of the car type variable. Confirm that the
  new variable you have created is indeed of the type sought.
- Label the numeric variable appropriately. Hint: you'll need codebook's
  help here.
- The vehicles are essentially of two classes, large and small. Create a
  new numeric variable which is 1 for large vehicles, and 0 for others.
  Label this variable appropriately.
- See if your data breaks down equitably by your new numeric variable.
- Tabulate a breakdown of deaths of the different types by your new
  size-related vehicle group variable. How could you actually detect a
  statistically significant difference here? (See nptrend.)
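The two recoding tasks above (a numeric version of a string variable, then a 0/1 size indicator) might be approached along the following lines. This is only a sketch: the variable names and the four vehicle types typed in below are hypothetical stand-ins, not the contents of cardatarb.xls, and which types count as "large" is an assumption made for illustration.

```stata
* Hypothetical stand-in data; the real values live in cardatarb.xls.
clear
input str10 cartype
"Pickup"
"Compact"
"SUV"
"Compact"
end

* encode builds a labeled numeric variable from the string variable
encode cartype, generate(cartype_n)

* Confirm the new variable is numeric and see the codes encode assigned
describe cartype_n
codebook cartype_n

* 1 for large vehicles, 0 for others (the large/small split is assumed here)
generate byte large = inlist(cartype, "Pickup", "SUV")
label define sizelbl 0 "Small" 1 "Large"
label values large sizelbl

* Check how the data break down by the new variable
tabulate large
```

encode numbers the categories in alphabetical order of the strings, which is why codebook is useful for seeing which number landed on which vehicle type.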