Title: Creating Something from Nothing: Working with Synthetic Files
1 Creating Something from Nothing: Working with Synthetic Files
- Bo Wandschneider, University of Guelph
DLI Training April 2004, Kingston
2 Outline
- NLSCY background
- Types of microdata files
- Which microdata file to use
- Providing services for synthetic files
This presentation is a modification of a workshop
that Chuck Humphrey and I presented at the May
2003 National DLI Training and a presentation
Chuck gave at the ACCOLEDS DLI training in 2003.
3 NLSCY
- The National Longitudinal Survey of Children and
Youth (NLSCY) is a long-term study of Canadian
children that follows their development and
well-being from birth to early adulthood. The
NLSCY began in 1994 and is jointly conducted by
Statistics Canada and Human Resources Development
Canada.
4 NLSCY
- There are 4 cycles
- There are 8 different files
- 2 of these are available as a PUMF
- Primary
- Self-Reporting
- The rest include the secondary file and files
based on other people reporting about the child
(teacher, principal)
5 Types of Microdata Files
- Confidential Microdata Products
- Master Files
- Share Files
- Public Access Microdata Products
- Public use anonymized microdata (PUMFs)
- Synthetic Files
6 Microdata Products
- Microdata
- raw data organized in a file where the records or
lines in the file are observations of a specific
unit of analysis and the information on the lines
are the values of variables
- requires some form of processing or analysis to
be used
7 Microdata Products
8 Confidential Microdata
- Master Files
- These files contain the fullness of detail
captured about the unit of observation. The
information in these files could identify the
individual who provided the original information,
so the files are considered confidential.
9 Confidential Microdata
10 Confidential Microdata
- Master File - detailed identifiers
11 Confidential Microdata
12 Confidential Microdata
- Master File - fullness of data
13 Confidential Microdata
- Master File - fullness of data
14 Confidential Microdata
- Master File - fullness of data
15 Confidential Microdata
- Share Files
- these are confidential files in which the
respondents have signed a consent form permitting
Statistics Canada to allow access to their
information for approved research.
- Used with NPHS and NLSCY
16 Public Access Microdata
- Anonymized Microdata
- these microdata are specially prepared to
minimize the possibility of disclosing or
identifying any of the cases or observations
- the original data from the master file are
edited to create a public use microdata file
17 Public Access Microdata
- Steps in Anonymizing Microdata
- removal of all personal identifiers
- include only gross levels of geography
- collapse detailed information into fewer general
categories or cap values
- suppress the values of a variable
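
The anonymizing steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every field name, age band and cap value below is invented, and it does not describe how Statistics Canada actually prepares a PUMF.

```python
# All field names, bands and the cap are hypothetical, chosen for illustration.
IDENTIFIERS = {"name", "postal_code"}   # personal identifiers and fine geography
SUPPRESSED = {"occupation"}             # a variable whose values are withheld
INCOME_CAP = 100_000                    # top-code: larger values are capped
AGE_BANDS = [(0, 14, "0-14"), (15, 24, "15-24"), (25, 44, "25-44"), (45, 64, "45-64")]


def age_group(age):
    """Collapse a detailed age into one of a few general categories."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "65+"


def anonymize(record):
    """Apply the steps to one master-file style record (a dict)."""
    # Remove identifiers and suppressed variables; only province (gross
    # geography) survives because postal_code is dropped.
    out = {k: v for k, v in record.items() if k not in IDENTIFIERS | SUPPRESSED}
    # Collapse detailed information into fewer categories.
    out["age_group"] = age_group(out.pop("age"))
    # Cap extreme values.
    out["income"] = min(out["income"], INCOME_CAP)
    return out


example = {
    "name": "A. Example", "postal_code": "X0X 0X0", "province": "ON",
    "age": 34, "income": 250_000, "occupation": "farmer",
}
print(anonymize(example))
```

The output record keeps only province, a banded age and a capped income, which shows why analyses needing small-area geography or detailed categories cannot be done on a PUMF.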
18 Public Access Microdata
- Statistics Canada PUMFs
- only available for select social surveys that
undergo a review of the Data Release Committee,
an internal Statistics Canada committee
- no enterprise public use microdata
19 Public Access Microdata
- Statistics Canada PUMFs
- almost all are cross-sectional, that is,
represent data collected at one point in time
- longitudinal data are difficult to anonymize
while maintaining any useful information.
20 Public Access Microdata
- PUMFs - personal identifiers
21 Public Access Microdata
22 Public Access Microdata
23 Public Access Microdata
- PUMFs - suppressed variables
- Note: from the MASTER file, NOT the PUMF
24 Public Access Microdata
- Synthetic Files
- These microdata do not contain real cases; they
are pseudo-cases that, for some surveys, provide
aggregate results close to those from the real
cases
25 Public Access Microdata
- Synthetic Files
- They are prepared so that analysis runs for the
master file can be created without any risk of
disclosing or identifying any of the cases
26 Public Access Microdata
- Synthetic Files
- The results are not to be reported, but are
strictly to be used to prepare analyses of master
files
- Usually associated with longitudinal files.
27 Public Access Microdata
- Steps in creating Synthetic Files
- Observations are transformed
- No records actually exist
- Keep fullness of variable description
- How the files are made is kept confidential
28 Public Access Microdata
29 Public Access Microdata
- Synthetic Files: NPHS 1999 General File
                PUMF    Synthetic
  Observations  49046   49046
  Variables     176     400
30 Implications for Analysis
- What are the implications in doing analysis with
these different types of microdata files?
31 Implications for Analysis
- Master File
- All observations
- Has the most variables with the most detail
- Lots of geography and personal characteristics
- Little grouping or capping of categories
32 Implications for Analysis
- Master File
- Restricted access: only available to authorized
Statistics Canada employees, which includes
deemed employees
- Use of the analysis is controlled through a
contract
33 Implications for Analysis
- Master File
- Includes linkage variables across files within a
study, e.g., NLSCY linkage among the files for
different units of analysis (kids, parents,
teachers).
34 Implications for Analysis
- Public Use Microdata (PUMF)
- Valuable content for a tremendous amount of
research
- Suppressed observations
- Suppressed variables
- Suppressed content
- Gross geography
- Collapsed categories
- Capped variables
- Issues arise when smaller-area geography is
desired, rare subpopulations are being studied,
or the variables that are needed have been used
to anonymize respondents
35 Implications for Analysis
- Public Use Microdata (PUMF)
- Licensed product: users agree to certain terms
of use
- No linkage across multiple units of analysis,
with a few exceptions (e.g., GSS Time Use and
Family)
36 Implications for Analysis
- Synthetic Files
- Looks like a duck and quacks like a duck, but
it isn't a duck or any other type of fowl.
37 Implications for Analysis
- Synthetic Files
- Looks like master files
- Lots of observations (maybe)
- Lots of variables
- Little grouping or capping of categories
- Lots of geographic detail
38 Synthetic Files
- Precautions
- Results not authentic but may be close in the
aggregate for some synthetic files
- Use for testing analysis setups only
- Still need the master files for publishable
results.
39 Where do we get Access?
- Master File
- Restricted access governed under the Statistics
Act
- Remote Job Submission (a.k.a. RDA)
- Research Data Centres
- Apply to SSHRC for peer review of the proposal
and to STC for security clearance.
40 Where do we get Access?
- Public Use Microdata Files (PUMF)
- Get from DLI
- Analyze where it is convenient
- Can use a variety of analysis software, including
SAS, SPSS, Stata, HLM, LISREL, etc.
41 Where do we get Access?
- Synthetic Files
- Author Divisions may create them
- Most relevant when dealing with new panel data,
but not exclusively; e.g., the Census has
potential
- NLSCY, NPHS and CCHS synthetic files are on the
DLI FTP site
42 Where do we get Access?
- Synthetic files
- Work locally with the file
- Build SAS and SPSS setups
43 Which File is Appropriate?
- 1st stop is still the PUMF
- This file has the easiest access for us
- Probably meets the needs of most patrons
- Not as administratively burdensome as synthetic
or master file
- Perfect for clients just looking for data for
courses in quantitative analysis
44 Which File is Appropriate?
- If more detail is needed, refer to the Master
File Documentation
- Inform patrons that the cost of use is higher,
both in terms of accessibility and analytical
requirements
- Interest most likely to come from grad students
and experienced researchers
45 Which File is Appropriate?
- Download the Synthetic files from DLI
- Make them aware of the problems with synthetic
files: RESULTS ARE NOT PUBLISHABLE
- Encourage them to submit an application for RDC
access; there is a time lag
46 Which File is Appropriate?
47 Which File is Appropriate?
- Some of you may work with a patron using
synthetic files before passing her/him on to an
RDC.
48 Services for Synthetic Files
- DLI Contacts can provide four basic services with
synthetic files.
- Build SPSS and SAS system files from the raw
synthetic data files that are distributed through
DLI
- Provide information about the use of Remote Job
Submission and RDCs
49 Services for Synthetic Files
- Assist with finding variables in the synthetic
files
- Provide instruction about ways of capturing SPSS
or SAS code from dummy analysis runs with the
synthetic files. It is this code that is
submitted to STC through remote job submission.
50 Services for Synthetic Files
- 1. Building SPSS and SAS system files for
synthetic data
- The NLSCY synthetic data are distributed as a raw
ASCII file with accompanying command files for
SPSS and SAS
- Separate synthetic data files exist for each
component of the NLSCY; not all components have
PUMFs
51 Services for Synthetic Files
- 1. Building SPSS and SAS system files for
synthetic data
- The synthetic data for the NLSCY cycle 3
primary file has 948 variables and 6,393
fabricated cases. Creating the SPSS and SAS
system files from this file is not difficult, but
it does take time. DLI Contacts may wish to
create these products for their patrons.
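
For readers who have not built a system file before, the sketch below shows in Python what the accompanying command files essentially do: slice each line of the raw ASCII file into variables using a column layout. The layout, variable names and sample records are invented for illustration; the real column positions come from the SPSS and SAS command files distributed with the data, and this sketch is not a substitute for them.

```python
import csv
import io

# Hypothetical record layout: (variable name, first column, last column),
# with 1-based inclusive columns as a command file would list them.
LAYOUT = [("CASEID", 1, 6), ("AGE", 7, 8), ("SEX", 9, 9)]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width record into a {variable: value} dict."""
    return {name: line[first - 1:last].strip() for name, first, last in layout}


def read_fixed_width(lines, layout=LAYOUT):
    """Parse every non-blank line of a raw ASCII file."""
    return [parse_line(line.rstrip("\n"), layout) for line in lines if line.strip()]


def to_csv(records, layout=LAYOUT):
    """Write parsed records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[name for name, _, _ in layout], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Two invented records standing in for lines of the raw file.
sample = ["000001 71\n", "000002122\n"]
records = read_fixed_width(sample)
print(to_csv(records))
```

With a real file, an open file handle can be passed to `read_fixed_width` in place of `sample`. With 948 variables the layout list becomes long, which is why building the file once and sharing it with patrons saves time.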
52 Services for Synthetic Files
- 2. Information about Remote Job Submission (RJS)
- The author divisions supporting RJS have
established their own guidelines and have
different operating procedures. Not all
divisions supporting longitudinal surveys
currently support RJS (e.g., SLID).
- Therefore, there is a need to track down this
information for our patrons.
53 Services for Synthetic Files
- 2. Information about Remote Job Submission (RJS)
- For example, the sources for information about
RJS include the Centre for Education Statistics
- http://www.statcan.ca/english/edu/rda/index.htm
55 Services for Synthetic Files
- 2. Information about Remote Job Submission (RJS)
- Where do you find this information?
- Ask the DLI Team via the DLI List
- The EAC has asked for a description of RJS on the
DLI website, which should be on the DLI Team's
to-do list
- mailto:nlscy_at_statcan.ca
56 Services for Synthetic Files
- 2. Information about Research Data Centres
- The collection of master files available through
RDCs is listed on the STC website for RDCs
- Each RDC has its own website describing its
services
- http://www.statcan.ca/english/rdc/index.htm
58 Services for Synthetic Files
- 3. Data Reference for the content of the
synthetic files
- Helping researchers identify variables over
longitudinal files is an important service
- Need to keep the unit of analysis straight
- Need to understand the mnemonic naming convention
for variables over cycles
- Develop indexing aids for you and your patrons
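
One possible indexing aid is a lookup that groups variable names from each cycle by their common stem. The Python sketch below assumes, purely for illustration, that the first letter of a variable name identifies the cycle (A for cycle 1, B for cycle 2, and so on) and that the remaining characters identify the item; the variable names are invented. Check the survey documentation for the actual convention before relying on this.

```python
from collections import defaultdict


def build_index(names_by_file):
    """Map each variable stem to {cycle number: full variable name}.

    Assumes (hypothetically) that the first letter encodes the cycle:
    A = cycle 1, B = cycle 2, and so on.
    """
    index = defaultdict(dict)
    for names in names_by_file:
        for name in names:
            cycle = ord(name[0].upper()) - ord("A") + 1
            index[name[1:]][cycle] = name
    return dict(index)


def cycles_missing(index, stem, n_cycles):
    """List the cycles in which a given item does not appear."""
    return [c for c in range(1, n_cycles + 1) if c not in index.get(stem, {})]


# Invented variable lists, one per cycle, for a single unit of analysis.
cycle_files = [
    ["AHLCQ01", "AEDCQ02"],
    ["BHLCQ01"],
    ["CHLCQ01", "CEDCQ02"],
]
index = build_index(cycle_files)
print(index["HLCQ01"])
print(cycles_missing(index, "EDCQ02", 3))
```

Building one index per unit of analysis (child, parent, teacher) keeps the units straight and lets a patron see at a glance which items can be followed across all cycles.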
59 Services for Synthetic Files
- 4. Provide helpful tips for preserving the code
from dummy analysis runs in SPSS and SAS
- Researchers will run analyses on the synthetic
file to generate the code that they will
subsequently email for Remote Job Submission
- Providing information about how to do this easily
will be helpful to your patrons
60 Exercises