Title: Generating new variables and manipulating data with Stata
1 Generating new variables and manipulating data with Stata
- Biostatistics 212
- Lecture 3
2 Housekeeping
- Questions re Log and Do files?
3 Today...
- What we did in Lab 1, and why it was unrealistic
- What does data cleaning mean?
- How to generate a variable
- How to manipulate the data in your new variable
- How to label variables and otherwise document your work
- Examples
7 Last time
- What was unrealistic?
- The dataset came as a Stata .dta file
- The variables were ready to analyze
- Most variables were labeled
9 How your data will arrive
- On paper forms
- In a text file (comma or tab delimited)
- In Excel
- In Access
- In another data format (SAS, etc.)
10 Importing into Stata
- Options
- Copy and Paste
- insheet, infile, fdause, other flexible Stata commands
- A convenience program like Stat/Transfer
11 Importing into Stata
- Make sure it worked
- Look at the data
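A minimal sketch of importing a comma-delimited text file and checking the result. The file name demo_raw.csv is hypothetical, and the stand-in file is built from Stata's shipped auto dataset only so the example runs on its own. In newer Stata versions, import delimited plays the role of insheet.

```stata
* Build a small stand-in "raw" file from Stata's shipped auto data.
* demo_raw.csv is a hypothetical file name used only for this sketch.
sysuse auto, clear
outsheet make price mpg using "demo_raw.csv", comma replace

* Import the comma-delimited text file, reading names from the first row.
insheet using "demo_raw.csv", comma clear names

* Make sure it worked: check variables, row count, and a few rows.
describe
count
list in 1/5
```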
12 Importing into Stata
- Demo: neonatal opiate withdrawal data
13 Exploring your data
- Figure out what all those variables mean
- Options
- browse, describe, summarize, list in Stata
- Refer to a data dictionary
- Refer to a data collection form
- Guess, or ask the person who gave it to you
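The exploration commands named above can be tried on any dataset. This sketch assumes Stata's shipped auto dataset as a stand-in for your own data. The codebook command is an addition not mentioned on the slide.

```stata
* Stand-in data: the auto dataset that ships with Stata.
sysuse auto, clear

* Overview of variable names, storage types, and labels.
describe

* Means, ranges, and the number of nonmissing observations.
summarize price mpg rep78

* Print the first five rows for two variables.
list make price in 1/5

* Detailed report on one variable, including missing values.
codebook rep78

* browse opens the data in a spreadsheet-style window (interactive use only).
* browse
```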
15 Exploring your data
- Demo: neonatal opiate withdrawal data
- Problems arise
- Sex is m/f, not 1/0
- Gestational age has nonsense values (0, 60)
- Breastfeeding has a bunch of weird text values
- Drug variables coded y or blank
- Many variable names are obscure
16 Cleaning your data
- You must clean your data so it is ready to analyze.
17 Cleaning your data
- Cleaning tasks
- Check for consistency and clean up nonsense data and outliers
- Deal with missing values
- Code all dichotomous variables 1/0
- Categorize variables as needed (for Table 1, etc.)
- Derive new variables
- Rename variables
- With common sense, or with a consistent scheme
- Label variables
- Label the VALUES of coded variables
18 Cleaning your data
- The importance of documentation
- Retracing your steps
- Document every step using a do file
20 Data cleaning, basic skill 1: Making a new variable
- Creating new variables
- generate newvar = expression
- An expression can be
- A number (constant): generate allzeros = 0
- A variable: generate ageclone = age
- A function: generate agesqrt = sqrt(age)
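The three kinds of expression can be tried on a toy dataset. The three age values below are invented purely for illustration.

```stata
* Toy data: three made-up ages.
clear
input age
25
40
64
end

* A constant, a copy of an existing variable, and a function of a variable.
generate allzeros = 0
generate ageclone = age
generate agesqrt = sqrt(age)

list
```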
21 Data cleaning, basic skill 2: Getting rid of variables/observations
- Getting rid of a variable
- drop var
- Getting rid of observations
- drop if boolean exp
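A short sketch of both forms of drop on invented toy data. The variable scratch and the age cutoff of 30 are arbitrary choices for illustration.

```stata
* Toy data: three made-up participants.
clear
input id age str1 sex
1 25 "m"
2 40 "f"
3 64 "f"
end

* Drop a variable that is no longer needed.
generate scratch = age * 2
drop scratch

* Drop observations for which the boolean expression is true.
drop if age < 30

* Two observations remain (ids 2 and 3).
count
list
```

Because drop cannot be undone, it is safest to run it from a do file against a copy of the raw data.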
22 Data cleaning, basic skill 3: Manipulating values of a variable
- Changing the values of a variable
- replace var = exp if boolean exp
- A boolean expression evaluates to true or false for each observation
23 Data cleaning, basic skill 3: Manipulating values of a variable
- Examples
- generate male = 0
- replace male = 1 if sex == "male"
- generate ageover50 = 0
- replace ageover50 = 1 if age > 50
- generate complexvar = age
- replace complexvar = (ln(age)*3) if (age > 30 & male == 1) | (othervar1 > othervar2)
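One caution when using the generate-then-replace pattern with a greater-than test: Stata treats a numeric missing value as larger than any number, so a condition like age > 50 is true when age is missing. The sketch below uses invented toy data. The variable ageover50_safe is a hypothetical name for a version that leaves missing ages missing.

```stata
* Toy data: two known ages and one missing age.
clear
input age
45
72
.
end

* Pattern from the examples: the row with missing age is coded 1.
generate ageover50 = 0
replace ageover50 = 1 if age > 50

* Safer variant: only assign 0/1 when age is actually known.
generate ageover50_safe = 0 if !missing(age)
replace ageover50_safe = 1 if age > 50 & !missing(age)

list
```

In the listing, the third row has ageover50 equal to 1 but ageover50_safe equal to missing.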
24 Data cleaning, basic skill 3: Manipulating values of a variable
- Logical operators for boolean expressions
- English: Stata
- Equal to: ==
- Not equal to: !=, ~=
- Greater than: >
- Greater than/equal to: >=
- Less than: <
- Less than/equal to: <=
- And: &
- Or: |
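The operators can be tried with count, which reports how many observations make a boolean expression true. This sketch assumes Stata's shipped auto dataset as stand-in data. The cutoff values are arbitrary.

```stata
* Stand-in data: the auto dataset that ships with Stata.
sysuse auto, clear

* Equal to (note the double equals sign in comparisons).
count if foreign == 1

* Not equal to.
count if foreign != 1

* Greater than/equal to combined with less than/equal to, using "and".
count if price >= 5000 & price <= 10000

* Less than combined with greater than, using "or".
count if mpg < 15 | mpg > 30
```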
25 Data cleaning, basic skill 3: Manipulating values of a variable
- Mathematical operators
- English: Stata
- Add: +
- Subtract: -
- Multiply: *
- Divide: /
- To the power of: ^
- Natural log of: ln(expression)
- Base 10 log of: log10(expression)
- Etcetera
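A brief sketch of the mathematical operators, first with display as a calculator and then inside generate. It assumes the shipped auto dataset as stand-in data. The new variable names are hypothetical.

```stata
* display evaluates an expression and prints the result.
display 2 + 3
display 9 / 2
display 2^3
display log10(1000)

* The same operators work on variables inside generate.
sysuse auto, clear
generate price_per_lb = price / weight
generate mpg_sq = mpg^2
generate lnprice = ln(price)
summarize price_per_lb mpg_sq lnprice
```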
26 Data cleaning, basic skill 3: Manipulating values of a variable
- Another way to manipulate data
- recode var oldvalue1=newvalue1 oldvalue2=newvalue2 if boolean expression
- More complicated, but more flexible command than replace
27 Data cleaning, basic skill 3: Manipulating values of a variable
- Examples
- generate male = 0
- recode male 0=1 if sex == "male"
- generate raceethnic = race
- recode raceethnic 1=6 if ethnic == "hispanic"
- (replace raceethnic = 6 if ethnic == "hispanic" & race == 1)
- generate tertilescac = cac
- recode tertilescac min/54=1 55/82=2 83/max=3
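The tertile example can be sketched on toy data. The cac values below are invented, and the cutpoints are the ones from the example. This version uses the generate() option of recode, which creates the new variable in one step instead of generate followed by recode.

```stata
* Toy data: made-up cac scores, including one missing value.
clear
input cac
0
30
54
55
82
83
400
.
end

* Ranges use a slash; min and max stand for the smallest and largest values.
recode cac (min/54 = 1) (55/82 = 2) (83/max = 3), generate(tertilescac)

* The missing option shows whether any observation was left uncategorized.
tabulate tertilescac, missing
```

Values that match none of the rules are carried over unchanged, so it is worth tabulating the result to confirm only the intended codes appear.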
28 Data cleaning, basic skill 4: Labeling things
- You can label
- A dataset: label data "label"
- A variable: label var varname "label"
- Values of a variable (2-step process)
- label define labelname value1 "label1" value2 "label2"
- label values varname labelname
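A minimal sketch of all three kinds of label on toy data. The label text and the value-label name malelbl are hypothetical.

```stata
* Toy data: a 1/0 indicator.
clear
input male
0
1
1
end

* Label the dataset and the variable.
label data "Toy dataset for labeling demo"
label variable male "Sex of participant"

* Two-step process for values: define the mapping, then attach it.
label define malelbl 0 "Female" 1 "Male"
label values male malelbl

* Tables and describe output now show the labels.
tabulate male
describe
```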
29 Data cleaning, basic skill 5: Dealing with missing values
- Missing values are important, easy to forget
- . for numbers
- "" for text
- tab var1 var2, missing
- Watch the total n for tab, summarize commands, regression analyses, etc.
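A short sketch of checking for missing values on invented toy data, with one missing number and one empty string. The missing() function is an addition not named on the slide. It works for both numeric and string variables.

```stata
* Toy data: one missing age and one blank sex.
clear
input age str1 sex
34 "m"
.  "f"
51 ""
end

* Count missing values directly.
count if missing(age)
count if missing(sex)

* Without the missing option, tabulate silently leaves out the blank row.
tabulate sex
tabulate sex, missing

* summarize reports Obs = 2 for age, not 3: watch the n.
summarize age
```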
30 Data cleaning
- Demo: neonatal opiate withdrawal data
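The sketch below is not the actual demo. It shows how the basic skills might be combined on a few invented rows that mimic the problems listed earlier (sex coded m/f, nonsense gestational ages, a drug variable coded y or blank). All variable names, values, and the plausibility cutoffs of 20 and 45 weeks are assumptions made for illustration.

```stata
* Invented rows that imitate the problems described earlier.
clear
input str1 sex gestage str1 methadone
"m" 38 "y"
"f" 0  ""
"f" 60 "y"
"m" 40 ""
end

* Sex is m/f: derive a 1/0 indicator, leaving blank sex as missing.
generate male = (sex == "m") if sex != ""

* Nonsense gestational ages become missing (cutoffs are assumptions).
replace gestage = . if gestage < 20 | gestage > 45

* Drug coded y or blank: convert to 1/0 (assumes blank means "no").
generate methadone01 = (methadone == "y")

* Rename with a consistent scheme, then label variables and values.
rename gestage gest_age_wk
label variable gest_age_wk "Gestational age (weeks)"
label variable male "Male sex"
label define yesno 0 "No" 1 "Yes"
label values male yesno
label values methadone01 yesno

list
```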
31 Cleaning your data
- Cleaning tasks
- Check for consistency and clean up nonsense data
- Deal with missing values
- Code all dichotomous variables 1/0
- Categorize variables meaningfully (for Table 1, etc.)
- Derive new variables
- Rename variables
- With common sense, or with a consistent scheme
- Label variables
- Label the VALUES of coded variables
32 Data cleaning
- At the end of the day you have
- 1 raw data file, original format
- 1 raw data file, Stata format
- 1 do file that cleans it up
- 1 log file that documents the cleaning
- 1 clean data file, Stata format
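A do file that produces these files might be organized as in the sketch below. All file names are hypothetical. The first two commands only create a stand-in raw file from Stata's shipped auto dataset so the sketch runs on its own. With real data that step is not needed.

```stata
* Stand-in for a raw file in its original format (hypothetical name).
sysuse auto, clear
outsheet using "rawdata.csv", comma replace nolabel

* Start the log that documents the cleaning.
capture log close
log using "cleaning.log", text replace

* 1. Read the raw file and save an untouched copy in Stata format.
insheet using "rawdata.csv", comma clear names
save "rawdata.dta", replace

* 2. Cleaning steps go here (generate, replace, recode, label, ...).
label data "Cleaned copy of the demo data"

* 3. Save under a NEW name so the raw files are never overwritten.
save "cleandata.dta", replace

log close
```

Rerunning the do file from the top regenerates the clean file from the raw file, which is what makes the steps retraceable.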
33 Summary
- Data cleaning
- ALWAYS necessary to some extent
- ALWAYS use a do file
- NEVER overwrite original data
- Check your work
- Watch out for missing values
- Label as much as you can
34 Lab this week
- It's long
- It's hard
- It's important
- Email the lab to your section leader's email address
- Due at the beginning of lecture next week
35 Preview of next week
- Using Excel
- What is it good for?
- Formulas
- Designing a good spreadsheet
- Formatting