Title: SDA: a tool for teaching and research with microdata
1. SDA: a tool for teaching and research with microdata
- Laine Ruus <laine.ruus_at_utoronto.ca>
- University of Toronto. Data Library Service
- 2008-12-03, revised 2009-04-14
- http://www.chass.utoronto.ca/misc/mun09/sda_intro.ppt
2. What this session covers
- Introduction
- Demo of main SDA capabilities
- Some tips and tricks
- Advantages and disadvantages for teaching and research
- Common questions about SDA
3. SDA@UT is brought to you by
- University of California, Berkeley. Computer-assisted Survey Methods Program (CSM) writes and supports the server-side software
- University of Toronto. Centre for Computing in the Humanities and Social Sciences (CHASS) provides the hardware, buys the software, and provides system support wetware
- University of Toronto. Libraries provides the budget to purchase the data, and care, feeding and user support wetware
- And Memorial University Libraries, which subscribes to the service.
4. Our experience with SDA
- CHASS installed SDA in the fall of 2004
- At last count, we have 900 data files in SDA
- Some have only the metadata that was generated from the original syntax files (SAS/SPSS/Stata), but a number also have full question text.
- Most are microdata, but a few are aggregate statistics (census files)
- A number of voracious data users now expect to find the latest microdata released by Stat Can in SDA
7. Review of main SDA utilities
- Frequencies, weighted and unweighted
- Crosstabulations
- Comparison of means (ANOVA)
- Correlations
- Regressions
- Logit/probit regressions
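To illustrate the difference between weighted and unweighted frequencies, here is a minimal Python sketch. It is not SDA code and does not show how SDA computes anything; the function name, the data and the weights are all hypothetical.

```python
from collections import defaultdict

def frequencies(values, weights=None):
    """Return {category: percent}. Unweighted when no weights are given."""
    if weights is None:
        weights = [1.0] * len(values)  # every case counts once
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {value: 100.0 * total / grand_total for value, total in totals.items()}

# Hypothetical data: three respondents, the third stands for twice as many people.
answers = ["yes", "no", "yes"]
survey_weights = [1.0, 1.0, 2.0]

unweighted = frequencies(answers)
weighted = frequencies(answers, survey_weights)
```

With these made-up numbers the share of "yes" answers differs between the two distributions, which is the kind of comparison students can make by switching the weight on and off.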
8. Tips and tricks
- Have we not gotten around to coding the missing values?
- Want to include missing values in your cross-tabulation, or other analysis?
- Collapsing continuous variables into uniform categories on the fly
- Recoding variables on the fly
9. Problem: in this variable, we have not yet coded value 5 as missing data. Therefore it would be included in analyses.
10. Solution: specify, after the variable name, only those values you want to include
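The effect of listing only the wanted values can be sketched in Python. This only illustrates the filtering idea, not SDA syntax or internals; the variable contents and the function name are hypothetical.

```python
from collections import Counter

def frequency_of_selected(values, keep):
    """Count only the values that appear in the keep list."""
    keep = set(keep)
    return Counter(value for value in values if value in keep)

# Hypothetical responses: 5 should be missing data but has not been coded as such.
responses = [1, 2, 5, 3, 4, 5, 2]

# Keep values 1 through 4 only, so the uncoded 5s drop out of the table.
counts = frequency_of_selected(responses, range(1, 5))
```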
11. Problem: to include values coded as missing in descriptive statistics or analyses
This is a missing value. It will not be included in descriptive statistics or analyses.
12. Solution 1: specify, after the variable name, the lowest value thru .
13. Solution 2: use "Include missing data values" under Table options
14. Solution 3: list the values explicitly after the variable name
15. Problem: to generate frequencies or a cross-tabulation of a continuous variable
16. Solution 1: collapse to uniform categories, defining a starting point
c:30000,-30000 means:
- collapse to uniform categories
- each category should be 30000 in size
- begin with value -30000
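The arithmetic behind a uniform collapse with a width of 30000 and a starting point of -30000 can be sketched as follows. This is an illustration in Python, not SDA's implementation. It assumes integer values and inclusive category bounds, and it simply extends the same grid below the starting point, which may not match what SDA does with such values.

```python
import math

def collapse_uniform(value, width, start):
    """Return the (low, high) bounds of the uniform category containing value."""
    index = math.floor((value - start) / width)  # 0 for the first category
    low = start + index * width
    return (low, low + width - 1)  # inclusive upper bound, integer data assumed

# Hypothetical values placed into categories of size 30000 beginning at -30000.
first = collapse_uniform(-30000, 30000, -30000)
second = collapse_uniform(0, 30000, -30000)
third = collapse_uniform(45000, 30000, -30000)
```

Here -30000 lands in the category -30000 to -1, 0 in the category 0 to 29999, and 45000 in the category 30000 to 59999.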
17. Solution 2: recode to desired categories. Note use of * to denote both lowest and highest values.
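A recode into chosen categories, with open-ended lowest and highest bounds, might look like this in Python. The cut points and category codes are hypothetical, and the infinite bounds are only a stand-in for "lowest" and "highest"; this is not SDA syntax.

```python
import math

# Hypothetical recode table: (new code, lowest value, highest value).
# -math.inf and math.inf play the role of the open-ended bounds.
RECODE_TABLE = [
    (1, -math.inf, 19999),
    (2, 20000, 59999),
    (3, 60000, math.inf),
]

def recode(value, table=RECODE_TABLE):
    """Return the new category code for value, or None if no range covers it."""
    for code, low, high in table:
        if low <= value <= high:
            return code
    return None

recoded = [recode(value) for value in (-5000, 20000, 1000000)]
```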
18. Tips and tricks (contd)
- Computing percentages in aggregate data
- Dummy coding variables in regressions
- Defining an interaction on the fly
19. Problem: given a file of aggregate statistics, list percentages rather than counts. NB: use the Listcase program
These are all counts
20. Solution: define percentages in the Listcase program.
Defines a percentage with v4 in the numerator and v2 in the denominator.
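The percentage itself is simple arithmetic: 100 times v4 divided by v2, row by row. A minimal Python sketch follows; the rows are hypothetical counts, and returning None for a zero denominator is an assumption made here, not a description of Listcase behaviour.

```python
def percentage(numerator, denominator):
    """Return 100 * numerator / denominator, or None when the denominator is 0."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator

# Hypothetical aggregate rows: v2 is a total count, v4 is a count within it.
rows = [
    {"v2": 200, "v4": 50},
    {"v2": 80, "v4": 20},
    {"v2": 0, "v4": 0},
]

percents = [percentage(row["v4"], row["v2"]) for row in rows]
```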
21. Problem: to use a categorical variable in a regression analysis, it needs to be dummy-coded (ie 1 and 0).
22. Solution: dummy-code categorical variables on-the-fly. Interactions can also be coded on-the-fly, including interactions with dummy-coded variables.
Dummy coded: values 10-14 will be coded to 1, all others will be 0.
Interaction involving a dummy coded variable and a continuous variable.
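What the dummy coding and the interaction term amount to can be sketched in Python. The case values are hypothetical, and the interaction is assumed here to be the plain product of the dummy and the continuous variable; this is an illustration, not SDA syntax.

```python
def dummy(value, low=10, high=14):
    """Code values from low through high as 1, everything else as 0."""
    return 1 if low <= value <= high else 0

def interaction(category_value, continuous_value):
    """Product of the dummy-coded variable and a continuous variable."""
    return dummy(category_value) * continuous_value

# Hypothetical cases: (categorical code, continuous measure).
cases = [(12, 40.0), (3, 25.0), (14, 10.0)]

dummies = [dummy(category) for category, _ in cases]
terms = [interaction(category, measure) for category, measure in cases]
```

The interaction term equals the continuous value for cases in categories 10-14 and is zero for all others, so its regression coefficient describes how the slope differs for that group.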
23. Advantages for teaching
- Stable environment, 24x7 access
- Very easy to explain to novice users
- Reduces/eliminates need for computer labs with statistical software
- Allows you to teach statistics rather than software
- Students get their hands on data quickly
- Switch easily between weighted and unweighted distributions
24. Advantages for teaching (contd)
- Measures of association and tests of significance comparable to SAS
- Design effects, in files in which cluster and/or stratum variables are available
- Interactive demonstration of statistical concepts
- Share recoded variables
- Can quickly mount additional data to fulfill your teaching needs
25. Advantages for research
- Stable environment, 24x7 access
- Access to latest available version of the data
- Basic exploratory data analysis, eg are there enough cases for my subset?
- Design effects, where cluster/sample variables are available
- Download data and import to SAS/SPSS/Stata on own workstation
- Share recoded variables
- Integrated variable descriptions (selected data files)
26. Advantages for data management
- Creates metadata from SAS/SPSS/Stata syntax or DDI format xml files
- Very easy and fast to import files with good syntax files
- Control over what users can and cannot do
- Outputs include SAS/SPSS/Stata syntax or DDI format xml files
- Overhead size of uncompressed data about 50
27. Disadvantages of SDA
- Search for variables/values among data files not yet implemented at UT/CHASS
- Can't download created/recoded variables (coming in spring 2009)
- Graphics minimal, eg no stem-and-leaf, box-plots etc
- Doesn't output SAS/SPSS/Stata system/export files, only raw data files plus syntax files
- Little support for Study/File level metadata (DDI)
- No support for nCubes (DDI 2)
28. How SDA compares to the competition
- See table at
- http://www.chass.utoronto.ca/datalib/misc/accoleds/2008/sda_compare.htm
29. Common questions from researchers and students
- When to weight versus not to weight
- Does it only do cross-tabs?
- But I want the raw data, not a cross-tabulation!
- Differences between syntax, data, and system
files.
30. An application we wouldn't have tackled without SDA
- Q: I need the average expenditure on eye care in Canada by age group of household head for as long a time-period as possible.
- A: Once we explained SDA, the student had generated this statistic from each of the FAMEX/SHS files, 1969-2004, in under 30 mins. (He knew only Stata.)
31. Functions we know to be coming in SDA
- Among-file variable searching -- already available but not yet implemented on CHASS
- Downloading recoded variables
- Will allow users to load own data files (Archiver in SDA 3.1) -- already available but not yet implemented on CHASS
32. Exercises
- First time SDA user? Try these exercises using the Census 2001 microdata on individuals
- Experienced SDA user? Try these exercises using a variety of DLI data files
33. Questions
- Question 1: Where will I find the SDA server at University of Toronto?
- Answer 1: The URL is
- http://www.chass.utoronto.ca/datalib/
- Select Microdata analysis and extraction
34. Questions (contd)
- Question 2
- How are files chosen to be mounted on the SDA server at UT?
- Answer 2
- All significant Canadian microdata files, eg by Statistics Canada as released by DLI
- Other files based on your requests
35. Questions (contd)
- Question 3
- My research is being done collaboratively with a
colleague at another Canadian university. Can my
colleague get access to SDA?
- Answer 3
- SDA is available as a subscription service to
other Canadian DLI-member universities and
colleges. Current subscribers include U of
Victoria, Ryerson U, and Memorial U