Creating Something from Nothing: Working with Synthetic Files

Slides: 61
Provided by: bowandsc

1
Creating Something from Nothing: Working with
Synthetic Files
  • Bo Wandschneider University of Guelph

DLI Training, April 2004, Kingston
2
Outline
  • NLSCY background
  • Types of microdata files
  • Which microdata file to use
  • Providing services for synthetic files

This presentation is a modification of a workshop
that Chuck Humphrey and I presented at the May
2003 National DLI Training and a presentation
Chuck gave at the ACCOLEDS DLI training, 2003.
3
NLSCY
  • The National Longitudinal Survey of Children and
    Youth (NLSCY) is a long-term study of Canadian
    children that follows their development and
    well-being from birth to early adulthood. The
    NLSCY began in 1994 and is jointly conducted by
    Statistics Canada and Human Resources Development
    Canada.

4
NLSCY
  • There are 4 cycles
  • There are 8 different files
  • 2 of these are available as a PUMF
      • Primary
      • Self-Reporting
  • The rest include secondary and those based on
    people reporting about the child (teacher,
    principal)

5
Types of Microdata Files
  • Confidential Microdata Products
      • Master Files
      • Share Files
  • Public Access Microdata Products
      • Public use anonymized microdata (PUMFs)
      • Synthetic Files

6
Microdata Products
  • Microdata
  • raw data organized in a file in which each record
    (line) is an observation of a specific unit of
    analysis and the information on each line is the
    values of variables
  • requires some form of processing or analysis to
    be used

7
Microdata Products
  • NLSCY cycle 3 - primary

8
Confidential Microdata
  • Master Files
  • These files contain the fullness of detail
    captured about the unit of observation. The
    information in these files could identify the
    individual who provided the original information,
    so the files are considered confidential.

9
Confidential Microdata
  • Master File Example

10
Confidential Microdata
  • Master File detailed identifiers

11
Confidential Microdata
  • Master File geography

12
Confidential Microdata
  • Master File - fullness of data

13
Confidential Microdata
  • Master File - fullness of data

14
Confidential Microdata
  • Master File - fullness of data

15
Confidential Microdata
  • Share Files
  • these are confidential files in which the
    respondents have signed a consent form permitting
    Statistics Canada to allow access to their
    information for approved research.
  • Used with NPHS and NLSCY

16
Public Access Microdata
  • Anonymized Microdata
  • these microdata are specially prepared to
    minimize the possibility of disclosing or
    identifying any of the cases or observations
  • the original data from the master file are
    edited to create a public use microdata file

17
Public Access Microdata
  • Steps in Anonymizing Microdata
  • removal of all personal identifiers
  • include only gross levels of geography
  • collapse detailed information into fewer general
    categories or cap values
  • suppress the values of a variable
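
The steps above can be sketched in a few lines of Python. This is only an illustration of the general idea: the field names, the city-to-province table, the age bands, and the income cap are all hypothetical and are not taken from any Statistics Canada file or procedure.

```python
# Illustrative only: field names, cut-offs, and the region table are
# hypothetical and do not describe any real master file or PUMF.

PERSONAL_IDENTIFIERS = {"name", "phone"}       # removed outright
SUPPRESSED_VARIABLES = {"occupation_detail"}   # variable dropped from the file
CITY_TO_PROVINCE = {"Guelph": "Ontario", "Kingston": "Ontario", "Edmonton": "Alberta"}
INCOME_CAP = 100_000                           # larger values are replaced by the cap


def age_group(age):
    """Collapse an exact age into a broad category."""
    if age < 5:
        return "0-4"
    if age < 12:
        return "5-11"
    return "12+"


def anonymize(record):
    """Return a public-use style version of one detailed record."""
    public = {}
    for field, value in record.items():
        if field in PERSONAL_IDENTIFIERS or field in SUPPRESSED_VARIABLES:
            continue                                       # remove identifiers / suppress
        if field == "city":
            public["province"] = CITY_TO_PROVINCE[value]   # gross geography only
        elif field == "age":
            public["age_group"] = age_group(value)         # collapse categories
        elif field == "income":
            public["income"] = min(value, INCOME_CAP)      # cap values
        else:
            public[field] = value                          # passed through unchanged
    return public


master_record = {
    "name": "A. Person", "phone": "555-0100", "city": "Guelph",
    "age": 7, "income": 250_000, "occupation_detail": "data librarian",
    "sex": "F",
}
public_record = anonymize(master_record)
print(public_record)
```

Each branch corresponds to one of the listed steps, which makes it easy to see what detail a PUMF user gives up compared with the master file.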

18
Public Access Microdata
  • Statistics Canada PUMFs
  • only available for select social surveys that
    undergo a review of the Data Release Committee,
    an internal Statistics Canada committee
  • no enterprise public use microdata

19
Public Access Microdata
  • Statistics Canada PUMFs
  • almost all are cross-sectional, that is,
    represent data collected at one point in time
  • longitudinal data are difficult to anonymize
    while maintaining any useful information.

20
Public Access Microdata
  • PUMFs personal identifiers

21
Public Access Microdata
  • PUMFs gross geography

22
Public Access Microdata
  • PUMFs collapsed data

23
Public Access Microdata
  • PUMFs suppressed variables
  • Note: from the MASTER file, NOT the PUMF

24
Public Access Microdata
  • Synthetic Files
  • These microdata do not contain real cases but
    pseudo-cases that, for some surveys, provide
    aggregate results close to those from the real
    cases

25
Public Access Microdata
  • Synthetic Files
  • They have been prepared so that analysis runs for
    the master file can be set up and tested without
    any risk of disclosing or identifying cases

26
Public Access Microdata
  • Synthetic Files
  • The results are not to be reported, but are
    strictly to be used to prepare analyses of master
    files
  • Usually associated with longitudinal files.

27
Public Access Microdata
  • Steps in creating Synthetic Files
  • Observations are transformed
  • No records actually exist
  • Keep fullness of variable description
  • How the files are made is kept confidential
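
Because the actual method is confidential, the following is a toy sketch only and is not how Statistics Canada builds its synthetic files. It uses one simple, well-known transformation (shuffling each variable independently across records) to show why pseudo-cases can give the same aggregate figures for a single variable while no longer describing real respondents. All variable names and values are made up.

```python
import random


def make_pseudo_cases(records, seed=0):
    """Shuffle each variable's column independently across records.

    Assumption for illustration only: this is NOT the confidential
    procedure used for real synthetic files.
    """
    rng = random.Random(seed)
    variables = list(records[0].keys())
    columns = {}
    for var in variables:
        column = [r[var] for r in records]
        rng.shuffle(column)          # breaks the link between variables
        columns[var] = column
    return [{var: columns[var][i] for var in variables}
            for i in range(len(records))]


def mean(rows, var):
    """Average of one numeric variable over all rows."""
    return sum(r[var] for r in rows) / len(rows)


# Made-up "real" cases
real = [
    {"age": 4, "score": 81, "province": "ON"},
    {"age": 6, "score": 74, "province": "AB"},
    {"age": 9, "score": 90, "province": "ON"},
    {"age": 11, "score": 65, "province": "QC"},
]
pseudo = make_pseudo_cases(real)

# Single-variable aggregates are unchanged by the shuffle...
print(mean(real, "score"), mean(pseudo, "score"))
# ...but relationships between variables (e.g. age vs. score) are not
# preserved, which is one reason results from such a file cannot be reported.
```

Note that a simple shuffle like this can reproduce a real record by chance, so it would not by itself guarantee that no real records exist in the output.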

28
Public Access Microdata
  • Synthetic Files NLSCY


29
Public Access Microdata
  • Synthetic Files NPHS 1999 General File

30
Implications for Analysis
  • What are the implications in doing analysis with
    these different types of microdata files?

31
Implications for Analysis
  • Master File
  • All observations
  • Has the most variables with the most detail
  • Lots of geography and personal characteristics
  • Little grouping or capping of categories

32
Implications for Analysis
  • Master File
  • Restricted access: only available to authorized
    Statistics Canada employees, including deemed
    employees
  • Use of the analysis is controlled through a
    contract

33
Implications for Analysis
  • Master File
  • Includes linkage variables across files within a
    study, e.g., NLSCY linkage among the files for
    different units of analysis (kids, parents,
    teachers).

34
Implications for Analysis
  • Public Use Microdata (PUMF)
  • Valuable content for a tremendous amount of
    research
  • Suppressed observations
  • Suppressed variables
  • Suppressed content
  • Gross geography
  • Collapsed categories
  • Capped variables
  • Issues arise when smaller area geography is
    desired, rare subpopulations are being studied,
    or the variables that are needed have been used
    to anonymize respondents

35
Implications for Analysis
  • Public Use Microdata (PUMF)
  • Licensed product: users agree to certain terms of use
  • No linkage to multiple units of analysis, except
    for a few exceptions (e.g., GSS Time Use and
    Family)

36
Implications for Analysis
  • Synthetic Files
  • "Looks like a duck and quacks like a duck", but
    it isn't a duck or any other type of fowl.

37
Implications for Analysis
  • Synthetic Files
  • Looks like master files
  • Lots of observations (maybe)
  • Lots of variables
  • Little grouping or capping of categories
  • Lots of geographic detail

38
Synthetic Files
  • Precautions
  • Results not authentic but may be close in the
    aggregate for some synthetic files
  • Use for testing analysis setups only
  • Still need the master files for publishable
    results.

39
Where do we get Access?
  • Master File
  • Restricted access governed under the Statistics
    Act
  • Remote Job Submission (a.k.a. RDA)
  • Research Data Centres
  • Apply to SSHRC for peer review of the proposal
    and to STC for security clearance.

40
Where do we get Access?
  • Public Use Microdata Files (PUMF)
  • Get from DLI
  • Analyze where it is convenient
  • Can use a variety of analysis software, including
    SAS, SPSS, Stata, HLM, LISREL, etc.

41
Where do we get Access?
  • Synthetic Files
  • Author Divisions may create it
  • Most relevant when dealing with new Panel Data,
    but not necessarily, e.g., the Census has
    potential
  • NLSCY, NPHS and CCHS synthetic files are on the DLI FTP site

42
Where do we get Access?
  • Synthetic files
  • Work locally with the file
  • Build SAS and SPSS setups

43
Which File is Appropriate?
  • 1st stop is still the PUMF
  • This file has the easiest access for us
  • Probably meets the needs of most patrons
  • Not as administratively burdensome as synthetic
    or master file
  • Perfect for clients just looking for data, and for
    courses in quantitative analysis

44
Which File is Appropriate?
  • If more detail is needed, refer to the Master
    File Documentation
  • Inform patrons that the cost of use is higher,
    both in terms of accessibility and analytical
    requirements
  • Interest most likely to come from grad students
    and experienced researchers

45
Which File is Appropriate?
  • Download the Synthetic files from DLI
  • Make them aware of problems with synthetic files:
    RESULTS ARE NOT PUBLISHABLE
  • Encourage them to submit an application for RDC
    access; there is a time lag

46
Which File is Appropriate?
  • RDC

47
Which File is Appropriate?
  • Some of you may work with a patron using synthetic
    files before passing her/him on to an RDC.

48
Services for Synthetic Files
  • DLI Contacts can provide four basic services with
    synthetic files.
  • Build SPSS and SAS system files from the raw
    synthetic data files that are distributed through
    DLI
  • Provide information about the use of Remote Job
    Submission and RDCs

49
Services for Synthetic Files
  • Assist with finding variables in the synthetic
    files
  • Provide instruction about ways of capturing SPSS
    or SAS code from dummy analysis runs with the
    synthetic files. It is this code that is
    submitted to STC through remote job submission.

50
Services for Synthetic Files
  • 1. Building SPSS and SAS system files for
    synthetic data
  • The NLSCY synthetic data are distributed as a raw
    ASCII file with accompanying command files for
    SPSS and SAS
  • Separate synthetic data files exist for each
    component of the NLSCY; not all components have
    PUMFs
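
To show what the accompanying command files do conceptually, here is a minimal Python sketch of reading a fixed-column raw ASCII record using a column layout. The layout, variable names, and data lines are hypothetical; the real layouts come from the SPSS and SAS command files distributed with the data, which remain the appropriate tools for building system files.

```python
# Hypothetical record layout: (variable name, first column, last column),
# with columns counted from 1 and both ends inclusive. A real layout would
# come from the command file supplied with the data.
LAYOUT = [
    ("CASEID", 1, 6),
    ("AGE", 7, 8),
    ("PROV", 9, 10),
]


def parse_line(line, layout=LAYOUT):
    """Turn one fixed-width line into a dict of variable values."""
    record = {}
    for name, first, last in layout:
        text = line[first - 1:last].strip()
        record[name] = int(text) if text else None   # blank field = missing
    return record


# Two made-up raw records, 10 characters each
raw_lines = [
    "000001 735",
    "000002 3  ",
]
records = [parse_line(line) for line in raw_lines]
for record in records:
    print(record)
```

Each line is one observation and each column range is one variable, which is the same information an SPSS or SAS command file encodes when it builds a system file.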

51
Services for Synthetic Files
  • 1. Building SPSS and SAS system files for
    synthetic data
  • The synthetic data for the NLSCY cycle 3
    primary file has 948 variables and 6,393
    fabricated cases. Creating the SPSS and SAS
    system files from this file is not difficult, but
    it does take time. DLI Contacts may wish to
    create these products for their patrons.

52
Services for Synthetic Files
  • 2. Information about Remote Job Submission (RJS)
  • The author divisions supporting RJS have
    established their own guidelines and have
    different operating procedures. Not all
    divisions supporting longitudinal surveys
    currently support RJS (e.g., SLID).
  • Therefore, there is a need to track down this
    information for our patrons.

53
Services for Synthetic Files
  • 2. Information about Remote Job Submission (RJS)
  • For example, the sources for information about
    RJS include the Centre for Education Statistics
  • http://www.statcan.ca/english/edu/rda/index.htm

55
Services for Synthetic Files
  • 2. Information about Remote Job Submission (RJS)
  • Where do you find this information?
  • Ask the DLI Team via the DLI List
  • The EAC has asked for a description of RJS on the
    DLI website, which should be on the DLI Team's
    to-do list
  • mailto:nlscy@statcan.ca

56
Services for Synthetic Files
  • 2. Information about Research Data Centres
  • The collection of master files available through
    RDCs is listed on the STC website for RDCs
  • Each RDC has its own website describing its
    services
  • http://www.statcan.ca/english/rdc/index.htm

58
Services for Synthetic Files
  • 3. Data Reference for the content of the
    synthetic files
  • Helping researchers identify variables over
    longitudinal files is an important service
  • Need to keep the unit of analysis straight
  • Need to understand the mnemonic naming convention
    for variables over cycles
  • Develop indexing aids for you and your patrons
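
One possible indexing aid is a small lookup that groups the same question across cycles. The sketch below assumes, purely for illustration, that each variable name starts with a one-letter cycle code (A for cycle 1, B for cycle 2, and so on) followed by a stem that stays the same across cycles. The variable names are made up, and the actual convention should be checked against the survey documentation.

```python
from collections import defaultdict


def build_index(variable_names):
    """Group variable names by stem, keyed by cycle number.

    Assumed (hypothetical) convention: first letter = cycle code,
    remaining characters = stem shared across cycles.
    """
    index = defaultdict(dict)
    for name in variable_names:
        cycle = ord(name[0].upper()) - ord("A") + 1   # "A" -> 1, "B" -> 2, ...
        stem = name[1:]
        index[stem][cycle] = name
    # Return plain dicts, sorted by stem and then by cycle, for stable display
    return {stem: dict(sorted(by_cycle.items()))
            for stem, by_cycle in sorted(index.items())}


# Made-up variable names
names = ["AHLTQ01", "BHLTQ01", "CHLTQ01", "BEDUQ02", "CEDUQ02"]
index = build_index(names)
for stem, by_cycle in index.items():
    print(stem, by_cycle)
```

A table like this shows at a glance which cycles carry a given item, which helps when a patron needs the same measure at several points in time.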

59
Services for Synthetic Files
  • 4. Provide helpful tips for preserving the code
    from dummy analysis runs in SPSS and SAS
  • Researchers will run analyses on the synthetic
    file to generate the code that they will
    subsequently email for Remote Job Submission
  • Providing information about how to do this easily
    will be helpful to your patrons
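
The underlying pattern is to keep the whole analysis in one saved script whose only outside dependency is the data file, test it on the synthetic file, and then submit that same script unchanged. Remote Job Submission itself expects SPSS or SAS code; the Python below is only a stand-in to illustrate the pattern, and the variable names and data are hypothetical.

```python
import csv
import io


def frequency_table(rows, variable):
    """Count how often each value of a variable occurs."""
    counts = {}
    for row in rows:
        counts[row[variable]] = counts.get(row[variable], 0) + 1
    return dict(sorted(counts.items()))


def run_analysis(data_file):
    """The complete analysis: only the data source changes between
    the dummy run (synthetic file) and the real run (master file)."""
    rows = list(csv.DictReader(data_file))
    return frequency_table(rows, "PROV")


# Dummy run against a made-up stand-in for a synthetic file.
# CSV values are read as text, so the table keys are strings.
synthetic_file = io.StringIO("CASEID,PROV\n1,35\n2,35\n3,48\n")
result = run_analysis(synthetic_file)
print(result)
```

Because nothing in the script refers to the synthetic values themselves, a run that completes without errors on the synthetic file is a reasonable check that the same code will run against the master file.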

60
Exercises