Return from Anarchy


1
Migrating from SPSS to SIR
  • Return from Anarchy

Jon Johnson 11 May 2005
2
Introduction
  • CLS runs 3 of the 4 British Birth Cohort Studies
  • Multi-disciplinary study of the life-course of
    three generations born in 1958, 1970 and 2000
  • Data collected in various ways: paper, CAPI,
    administrative data
  • Complex data, 100,000 variables, 18,000
    participants per study

3
History
  • Punch cards, different data centres, SIR, SPSS
  • The data has been through the range of data
    storage fashions
  • Social science versus Medical data access models
  • Goal of increased accessibility and understanding
    of relationships within data
  • Development of social science meta-data standards

4
Current Data Collection
  • Data collection methods such as CAPI have both
    positive and negative sides
  • Data is pre-punched
  • Data is pre-checked
  • Data is less understandable
  • Data is more complicated
  • Recent data supplied for one sweep was > 100,000
    variables

5
Taming data
  • Datasets are routinely supplied in SPSS format
  • SPSS is not an ideal environment to manage such
    data
  • SIR is an ideal environment to manage this data

6
Data Migration with minimum information loss
  • SPSS Data List
  • Rarely used, high level of manual intervention
  • Visual Basic (a.k.a. SaxBasic)
  • Platform dependent
  • Limited functionality, multi-step process
  • ODBC
  • Flaky at best
  • Reverse engineer SPSS file
  • SPSS Portable format - stable, if poorly
    documented
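
As a small illustration of what reverse engineering the portable format involves, the sketch below decodes a single integer field. It assumes the publicly described layout (as in the PSPP developer documentation), in which numbers are written in base 30 using the digits 0-9 and A-T and each field ends with a forward slash. The function name is hypothetical and is not taken from the presentation; fractions, exponents and the character-set translation table are deliberately left out.

```python
def decode_base30_int(field):
    """Decode one portable-file integer field such as '1A/' into an int.

    Assumption: the field is plain ASCII, base 30 (0-9, A-T), optionally
    signed, and terminated by '/'. Real files may wrap lines at 80 columns
    and may need a character-set translation first; neither is handled here.
    """
    text = field.strip()
    if not text.endswith("/"):
        raise ValueError("integer field must end with '/'")
    digits = text[:-1]
    # Python's int() accepts any base from 2 to 36, so base 30 works directly.
    return int(digits, 30)
```

For example, `decode_base30_int("1A/")` gives 40, since 1 x 30 + 10 = 40.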

7
Implementation
  • PQL, Perl, Python?
  • Stable across OSs
  • Good text manipulation
  • Good XML support
  • Case based databases

8
How it works
  • parse spss file
  • grabs variable name, value labels, data values
    etc
  • looks up a configuration file for BDI settings
  • check if also setting up database or just adding
    a new record
  • do some conversions: time, date, scaled vars
  • do some analysis of the data to grab range of
    values
  • write out warning if > 3 missing values or a
    range of missing values
  • write out schema
  • python spss_parser.py -f <input filename> -s <sir
    config file> -d <ddi config file>
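
Some of the steps above can be sketched in Python. This is a minimal illustration, not the actual spss_parser.py: only the -f, -s and -d flags come from the slide, while every function name, argument name and data layout below is an assumption made for the example. The date conversion relies on SPSS storing date-times as seconds since midnight on 14 October 1582.

```python
import argparse
from datetime import datetime, timedelta

# SPSS date-times count seconds from the start of the Gregorian calendar.
SPSS_EPOCH = datetime(1582, 10, 14)


def build_arg_parser():
    """Command line mirroring the -f / -s / -d flags shown on the slide."""
    parser = argparse.ArgumentParser(prog="spss_parser.py")
    parser.add_argument("-f", dest="input_file", required=True,
                        help="SPSS input filename")
    parser.add_argument("-s", dest="sir_config", required=True,
                        help="SIR config file")
    parser.add_argument("-d", dest="ddi_config", required=True,
                        help="DDI config file")
    return parser


def spss_to_datetime(seconds):
    """Convert an SPSS date-time value (seconds) to a Python datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)


def analyse_variable(name, values, missing_values=(), missing_range=None):
    """Return (minimum, maximum, warnings) for one variable.

    values         -- data values; None stands for system-missing
    missing_values -- discrete missing codes declared for the variable
    missing_range  -- optional (low, high) tuple declared as missing
    """
    warnings = []
    if len(missing_values) > 3:
        warnings.append(f"{name}: more than 3 missing values declared")
    if missing_range is not None:
        warnings.append(f"{name}: range of missing values {missing_range}")

    def is_missing(value):
        if value is None or value in missing_values:
            return True
        if missing_range is not None:
            low, high = missing_range
            return low <= value <= high
        return False

    # The reported range covers valid (non-missing) values only.
    valid = [v for v in values if not is_missing(v)]
    if not valid:
        return None, None, warnings
    return min(valid), max(valid), warnings
```

A call such as `analyse_variable("age", [1, 5, 99, None, 3], missing_values=(97, 98, 99, -1))` returns a range of 1 to 5 together with one warning, because four discrete missing codes were declared.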

9
Use
  • Once into SIR the data can be restructured
  • Extends to datasets held in other statistical
    packages such as Stata or SAS: convert via
    StatTransfer -> SPSS portable format and go
    from there
  • Also creates XML to add to a data store -
    superseded !!!