Title: Generating new variables and manipulating data with STATA
1Generating new variables and manipulating data
with STATA
- Biostatistics 212
- Lecture 3
2Housekeeping
- Lab 1 handed back today
- Think of red ink as teaching points, not
penalties - Do and Log files
- Understand each command!
- Order them appropriately
3The .do file template
cd C\data\biostat212\ version 11 set more
off clear set memory 10m capture log
close log using name of your log.log,
replace / here are my comments / use name
of your dataset, clear summarize this browse
that tabulate this log close set more on
- Tell STATA where to look for things and where to
put things - Run correct version of Stata
- Stop STATA from prompting you to push a button to
continue - Tell STATA to clear any datasets in memory and
increase its mem capacity - Since your do file may not be perfect, tell STATA
to close any logs that are open when you try to
run your do file - Tell STATA to create a log of your output for you
and what youre going to call that log. Tell it
to overwrite it each time - Stick in some comments to remind you what this do
file is for - Tell STATA what dataset to work on
- Leave some SPACE for putting in analysis commands
you want to keep - Lastly, tell STATA to close the log and go back
to its usual more mode
4New issues
5Today...
- What we did in Lab 1, and why it was unrealistic
- What does data cleaning mean?
- Importing data into Stata
- How to generate a variable
- How to manipulate the data in your new variable
- How to label variables and otherwise document
your work - Examples
6Last time
7Last time
- What was unrealistic?
- The dataset came as a Stata .dta file
8Last time
- What was unrealistic?
- The dataset came as a Stata .dta file
- The variables were ready to analyze
9Last time
- What was unrealistic?
- The dataset came as a Stata .dta file
- The variables were ready to analyze
- Most variables were labeled
10Last time
11How your data will arrive
- On paper forms
- In a text file (comma or tab delimited)
- In Excel
- In Access
- In another data format (SAS, etc)
12Importing into Stata
- Options
- Copy and Paste
- insheet, infile, fdause, other flexible Stata
commands - A convenience program like Stat/Transfer
13Importing into Stata
- Make sure it worked
- Look at the data
14Importing into Stata
- Demo neonatal opiate withdrawal data
- Import with cut and paste from Excel
- Import with insheet (save as .csv file first)
15Exploring your data
- Figure out what all those variables mean
- Options
- Browse, describe, summarize, list in STATA
- Refer to a data dictionary
- Refer to a data collection form
- Guess, or ask the person who gave it to you
16Exploring your data
- Demo Neonatal opiate withdrawal data
17Exploring your data
- Demo Neonatal opiate withdrawal data
- Problems arise
- Sex is m/f, not 1/0
- Gestational age has nonsense values (0, 60)
- Breastfeeding has a bunch of weird text values
- Drug variables coded y or blank
- Many variable names are obscure
18Cleaning your data
- You must clean your data so it is ready to
analyze.
19Cleaning your data
- What does the variable measure?
- rename and/or label var so its clear
- Find nonsense values and outliers
- recode as missing or track down real value?
- Deal with missing values
- Too many values missing in some subjects? Coding
consistent? - drop variable or observation?
- Categorize as needed
- generate a new numeric variable
- recode (dichotomous variables coded as 1/0,
watch missing values) - label define and then label values
- Check
- tab oldvar newvar, missing
- bysort catvar sum contvar
20Cleaning your data
- The importance of documentation
- Retracing your steps
- Document every step using a do file
21Data cleaningBasic skill 1 Making a new
variable
- Creating new variables
- generate newvar expression
-
22Data cleaningBasic skill 1 Making a new
variable
- Creating new variables
- generate newvar expression
- An expression can be
- A number (constant) - generate allzeros 0
- A variable - generate ageclone age
- A function - generate agesqrt sqrt(age)
23Data cleaningBasic skill 2 Manipulating values
of a variable
- Changing the values of a variable
- replace var exp if boolean_expression
- A boolean expression evaluates to true or false
for each observation
24Data cleaningBasic skill 2 Manipulating values
of a variable
- Examples
- generate bmi weight/(height2)
- generate male 0
- replace male 1 if sexmale
- generate ageover50 0
- replace ageover 50 1 if agegt50
- generate complexvar age
- replace complexvar (ln(age)3)
- if (agegt30 male1) (othervar1gtothervar2)
25Data cleaningBasic skill 2 Manipulating values
of a variable
- Logical operators for boolean expressions
- English Stata
- Equal to
- Not equal to !,
- Greater than gt
- Greater than/equal to gt
- Less than lt
- Less than/equal to lt
- And
- Or
26Data cleaningBasic skill 2 Manipulating values
of a variable
- Mathematical operators
- English Stata
- Add
- Subtract -
- Multiply
- Divide /
- To the power of
- Natural log of ln(expression)
- Base 10 log of log10(expression)
- Etcetera
27Data cleaningBasic skill 2 Manipulating values
of a variable
- Another way to manipulate data
- recode var oldvalue1newvalue1 oldvalue2newvalue
2 if boolean_expression - More complicated, but more flexible command than
replace
28Data cleaningBasic skill 2 Manipulating values
of a variable
- Examples
- generate male 0
- recode male 01 if sexmale
- generate female male
- recode female 10 01
29Data cleaningBasic skill 2 Manipulating values
of a variable
- Examples
- generate raceethnic race
- recode raceethnic 16 if ethnichispanic
- (replace raceethnic 6 if ethnichispanic
race1) - generate tertilescac cac
- recode tertilescac min/541 55/822 83/max3
30Data cleaningBasic skill 3 Getting rid of
variables/observations
- Getting rid of a variable
- drop var
- Getting rid of observations
- drop if boolean_expression
31Data cleaningBasic skill 4 Labeling things
- You can label
- A dataset label data label
- A variable label var varname label
- Values of a variable (2-step process)
- label define labelname value1 label1 value2
value2 - label values varname labelname
- label define caccatlabel 0 0 1 1-100 2
101-400 3 gt400 - label values caccat caccatlabel
32Data cleaningBasic skill 5 Dealing with missing
values
- Missing values are important, easy to forget
- . for numbers
- for text
- tab var1 var2, missing
- Watch the total n for tab, summarize commands,
regression analyses, etc.
33Data cleaning
- Demo Neonatal opiate withdrawal data
34Cleaning your data
- What does the variable measure?
- rename or label var so its clear
- Find nonsense values and outliers
- recode as missing or track down real value?
- Deal with missing values
- Too many? Coding consistent?
- drop variable or observation?
- Categorize as needed
- generate a new numeric variable
- recode (dichotomous variables coded as 1/0,
watch missing values) - label define and then label values
- Check
- tab oldvar newvar, missing
- bysort catvar sum contvar
35Data cleaning
- At the end of the day you have
- 1 raw data file, original format
- 1 raw data file, Stata format
- 1 do file that cleans it up
- 1 log file that documents the cleaning
- 1 clean data file, Stata format
36Summary
- Data cleaning
- ALWAYS necessary to some extent
- ALWAYS use a do file
- NEVER overwrite original data
- Check your work
- Watch out for missing values
- Label as much as you can
37Lab this week
- Its long
- Its hard
- Its important
- Email lab to your section leaders email
- Due at the beginning of lecture next week
38Preview of next week
- Using Excel
- What is it good for?
- Formulas
- Designing a good spreadsheet
- Formatting