New Investigators Network - PowerPoint PPT Presentation

Slides: 68
Provided by: arac4
Transcript and Presenter's Notes


1
New Investigators Network
  • The Longitudinal Study of Australian Children

2
Overview
  • Introduction to the LSAC Datasets
  • Data analysis issues
  • Variable naming
  • Data user resources

3
Introduction to the LSAC Datasets
4
Datasets
  • At last count 23 datasets
  • Don't panic!
  • 4 main datasets
  • 1 for each age at which we've interviewed
    children
  • lsacconf0, lsacconf2, lsacconf4, lsacconf6

5
Main LSAC datasets
  • Contain everything not excluded for good reason
  • Identifiers
  • Question variables
  • Derived items
  • Linked data
  • Between waves questionnaire data

6
Supplementary Datasets
  • Time Use Diary (TUD)
  • Medicare Australia linked data
  • Household composition data
  • Wave 1 original format datasets

7
TUD datasets
  • Cleaned datasets: imputations made to improve data
    quality; poor quality cases deleted
  • Poor quality cases: imputations made to improve
    quality
  • Uncleaned datasets: no improvements or deletions

8
Household Composition
  • Detailed information on every person living with
    study child at time of interview
  • Less detailed information on those who lived with
    study child between waves
  • Parent 1, Parent 2, Mother, Father variables
  • Derived items
  • Subset of this information also on the main
    dataset (P1, P2, SC, Mother, Father, derived
    items)

9
Medicare Linked Data
  • Medicare Benefits Schedule (MBS): each record
    represents a claim
  • Pharmaceutical Benefits Scheme (PBS): each record
    represents a claim
  • Australian Childhood Immunisation Register (ACIR):
    each record represents an immunisation
  • Note: data are for the study child only; there are
    no data for other members of the household

10
Wave 1 formatted data
  • Changes made between Wave 1 and Wave 2 data
    releases
  • Renamed variables
  • Merging of centre and home-based carer
    questionnaire data
  • Included to help those with old code
  • Will be phased out by Wave 3

11
General Data Analysis Issues
12
Data analysis general issues
  • Confidentialisation
  • Weighting
  • Clustering

13
Confidentialisation
  • In order to protect the anonymity of respondents
    some changes have been made to the dataset
  • Removal of items
  • Transformation of items
  • Top-coding (i.e. recoding outliers)
  • Aggregation

14
Removal of items
  • Qualitative data provided by respondents:
    "Other (specify)" responses and other open-ended
    text questions
  • Linked census data for childcare/education
    arrangements

15
Transformation of items
  • Postcode -> postcode ID numbers
  • Date left hospital after birth -> time in
    hospital after birth

16
Topcoding
  • Income
  • Housing costs
  • Child support
  • Physical measurements
  • Hours in childcare

17
Aggregation
  • Occupation ASCO4 -> ASCO2
  • SEIFA rounded to the nearest 10
  • Country of birth coded 0 if fewer than 5 cases
  • LOTE coded 0 if fewer than 5 cases
  • Date of birth changed to month of birth
  • (Also religion, but hasn't been necessary so far)
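
The top-coding and aggregation rules above can be sketched in a few lines of Python. This is only an illustration: the function names, the cap of 5000 and the example codes are hypothetical, and the actual thresholds used for LSAC are not stated in these slides.

```python
from collections import Counter

def top_code(values, cap):
    """Recode any value above `cap` down to `cap` (cap is a hypothetical threshold)."""
    return [min(v, cap) for v in values]

def suppress_small_cells(codes, min_cases=5, other_code=0):
    """Recode categories with fewer than `min_cases` cases to `other_code`."""
    counts = Counter(codes)
    return [c if counts[c] >= min_cases else other_code for c in codes]

def round_to_nearest_10(value):
    """Round a score to the nearest 10 (values ending in 5 round up)."""
    return int((value + 5) // 10 * 10)

# Made-up weekly incomes; only the last one exceeds the cap.
print(top_code([350, 900, 1200, 15000], cap=5000))  # [350, 900, 1200, 5000]

# Made-up country codes: the second code has only 2 cases, so it becomes 0.
print(suppress_small_cells([1101] * 6 + [7105] * 2))

print(round_to_nearest_10(1004))  # 1000
```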

18
Weighting
19
Weighting
  • Particularly acknowledge the contribution of
    David Lawrence of Institute for Child Health
    Research
  • See also
  • LSAC Technical Paper 3 Wave 1 Weighting and
    Non-response
  • LSAC Technical Paper 5 Wave 2 Weighting and
    Non-response

20
Why weight?
  • To cut down on error due to:
  • Sampling: not using a simple random sample
  • Non-response: unlikely to be random

21
Steps
  • Design weights
  • Wave 1 Non-response analysis
  • Final wave 1 weights
  • Wave 2 non-response analysis
  • Final Wave 2 weights

22
Design Weights
  • Aim for sample design to introduce as little bias
    as possible
  • Differential probability of selection is a
    known source of bias
  • Can be adjusted for perfectly (or close to)

23
Large postcodes - probability
  • For most purposes design weight is calculated as
    the inverse of the probability of selection
  • P_c = P_si × P_pi
  • Where
  • P_c is the probability a child will be selected
  • P_si is the probability a child's postcode will be
    selected from all postcodes in the stratum
  • P_pi is the probability a child will be selected
    from all the children in their postcode
24
Implications
  • Weights become extreme if
  • Large discrepancy between postcode population at
    time of postcode selection and child selection
  • Large discrepancy between number of B and
    K-cohort children

25
Small postcodes - why not probability?
  • Extra children selected from postcodes with >20
    kids to represent postcodes with <20 kids
  • Using the probability would lead to
    underweighting
  • Set number of postcodes selected based on average
    size
  • Probability doesn't account for differences
    between final number of selections and target
    number

26
What did we use instead?
  • D2 = Nr / ns
  • Where
  • D2 is the design weight for children in size 2
    strata
  • Nr is the total number of children in the stratum
    minus those in remote postcodes
  • ns is the number of children selected in the
    stratum
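
A small worked example of the ratio on this slide, with made-up counts (the function name and figures are hypothetical): the weight is the number of non-remote children in the stratum divided by the number of children selected.

```python
def stratum_design_weight(n_children_non_remote, n_selected):
    """Design weight = non-remote children in the stratum / children selected."""
    if n_selected <= 0:
        raise ValueError("at least one child must be selected")
    return n_children_non_remote / n_selected

# Hypothetical stratum: 12,000 non-remote children, 60 of them selected.
print(stratum_design_weight(12000, 60))  # 200.0
```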

27
Implications
  • Assumes kids from ineligible postcodes are
    similar to those from eligible ones
  • Weights become extreme when more of the larger or
    smaller postcodes within the stratum are selected
    (basically random)

28
Wave 1 Non-response
  • What do we know about non-respondents?
  • We know where they live
  • We know how many of their neighbours responded to
    the study
  • We have census data on their neighbourhoods

29
Poisson regression
  • Models the chance that something will happen a
    given number of times
  • For our purposes response rate within a postcode
  • Identifies a minimum set of predictors
  • Variables must be measured by the census and LSAC
    in a consistent way
  • Must have variation by postcode (e.g. gender of
    child inappropriate)

30
B-cohort response rate regression
31
K-cohort response rate regression
32
Population Benchmarks
  • ABS ERP 31 March 2004 - State and gender
  • ABS ERP 30 June 2003 - Part of State
  • Medicare database - Adjusted for exclusion of
    some remote postcodes
  • 2001 Census - Proportion of mothers who had
    completed Year 12 and spoke a LOTE (at state X
    part of state level)

33
Standard weighting
  • Classify cases into cells based on all variables
    weighted for
  • Obtain frequencies for each cell and compare with
    population figures
  • Weight = Population / Sample for each cell
  • Requires cross classified benchmarks
  • Unreliable when sample sizes get small
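
The cell-based approach on this slide can be sketched as follows. The cell labels and population counts are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter

def cell_weights(sample_cells, population_counts):
    """Return weight = population count / sample count for each cell in the sample."""
    sample_counts = Counter(sample_cells)
    return {cell: population_counts[cell] / n for cell, n in sample_counts.items()}

# Hypothetical cells defined by state x sex, with invented population totals.
sample = ["NSW-F", "NSW-F", "NSW-M", "VIC-F"]
population = {"NSW-F": 1000, "NSW-M": 900, "VIC-F": 800}
print(cell_weights(sample, population))
```

With only one or two cases in a cell, a single respondent carries a very large weight, which is the small-sample unreliability the slide notes.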

34
Calibration of weights
  • Generalised raking procedure (Deville and
    Särndal, 1992)
  • Adjusts design weights
  • Searches for a solution that
  • Weights correctly to each population benchmark
  • Has smallest distance between original weights
    and final weights
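
To give a feel for how calibration to separate benchmarks works, here is a minimal sketch of classic raking (iterative proportional fitting). It is a simpler relative of the generalised raking procedure named above, not the procedure itself: it does not constrain the range of the weights, and the records, variable names and benchmark totals are all made up.

```python
def rake(records, weights, benchmarks, max_iter=100, tol=1e-9):
    """Scale weights until weighted totals match each benchmark margin.

    records:    list of dicts, one per case, e.g. {"state": "NSW", "sex": "F"}
    weights:    starting (design) weights, one per case
    benchmarks: {variable: {category: population total}}
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in benchmarks.items():
            # Current weighted total for each category of this variable.
            totals = {}
            for rec, wi in zip(records, w):
                totals[rec[var]] = totals.get(rec[var], 0.0) + wi
            # Scale every case so its category hits the benchmark.
            for i, rec in enumerate(records):
                factor = targets[rec[var]] / totals[rec[var]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return w

# Hypothetical sample of four cases with equal starting weights.
records = [
    {"state": "NSW", "sex": "F"},
    {"state": "NSW", "sex": "M"},
    {"state": "VIC", "sex": "F"},
    {"state": "VIC", "sex": "M"},
]
benchmarks = {"state": {"NSW": 60, "VIC": 40}, "sex": {"F": 55, "M": 45}}
final = rake(records, [1.0, 1.0, 1.0, 1.0], benchmarks)
print([round(x, 2) for x in final])
```

Note that only the separate state and sex totals are needed, not the state-by-sex cross-classification.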

35
Advantages
  • Only requires benchmarks for each weighting
    variable separately
  • Range of weights can be constrained

36
Wave 2 non-response
  • What do we know about Wave 2 non-respondents?
  • Quite a bit
  • 2,000-odd variables in the main LSAC Wave 1
    dataset
  • However, information is 2 years old

37
Wave 2 adjustment
  • Adjust weights based on probability of response
    in Wave 2
  • Probability based on logistic regression
  • Over 50 variables used as predictors, selected
    based on:
  • Little or no missing data
  • Likelihood of predicting non-response
  • Coverage of topics included in the survey
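
One common way to adjust a weight by a modelled response probability is to divide by it, so that respondents who resemble non-respondents count for more. The sketch below assumes that form of adjustment; the slides do not give the exact formula, and the coefficients, predictor values and function names are hypothetical.

```python
import math

def response_probability(intercept, coefs, x):
    """Predicted probability of response from a fitted logistic regression."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def adjust_weight(wave1_weight, p_response):
    """Inverse-probability adjustment (assumed form, not confirmed by the slides)."""
    if not 0 < p_response <= 1:
        raise ValueError("response probability must be in (0, 1]")
    return wave1_weight / p_response

# Hypothetical model with two predictors and made-up coefficients.
p = response_probability(1.5, [0.4, -0.8], [1, 1])
print(round(p, 3), round(adjust_weight(100.0, p), 1))
```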

38
Weights provided
  • LSAC provides 3 weights per wave per cohort
  • Sample weight: weights add to sample size; useful
    when using N to calculate tests of significance
  • Population weight: weights add to population
    totals; useful when frequencies are needed to
    produce population estimates
  • Day weight: sample weight adjusted for
    day-of-the-week of diary completion
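
The difference between the first two weights is only one of scale. As a sketch, assuming a sample weight is a population weight rescaled to sum to the number of cases (the slides state only what each weight adds to), the conversion looks like this, with made-up weights:

```python
def to_sample_weights(population_weights):
    """Rescale weights so they sum to the number of cases (assumed relationship)."""
    n = len(population_weights)
    total = sum(population_weights)
    return [w * n / total for w in population_weights]

# Hypothetical population weights for three cases.
sample_w = to_sample_weights([100.0, 200.0, 300.0])
print(sample_w, sum(sample_w))  # [0.5, 1.0, 1.5] 3.0
```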

39
Clustering
40
Clustering
  • Usual tests of significance assume independence
    of observations
  • Not the case for LSAC sample design as children
    selected from postcodes
  • Therefore standard errors need to be adjusted

41
Design effect
  • Is the standard error if the design is accounted
    for divided by the standard error if random
    sampling is assumed
  • Will be small if the variable is unrelated to
    postcode (e.g. study child gender)
  • Will be larger if the variable is related to
    postcode (e.g. neighbourhood characteristics,
    housing costs, income)
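
The ratio described on this slide can be illustrated with a simplified calculation for a mean: an unweighted, single-stratum sketch that treats each postcode as a cluster. It is not the estimator used by any particular survey package, and the data are invented.

```python
import math
from collections import defaultdict

def srs_se(values):
    """Standard error of the mean assuming simple random sampling."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(var / n)

def clustered_se(values, clusters):
    """Standard error of the mean from between-cluster variation (unweighted)."""
    n = len(values)
    mean = sum(values) / n
    resid = defaultdict(float)
    for v, c in zip(values, clusters):
        resid[c] += v - mean  # sum of deviations within each cluster
    k = len(resid)
    var = k / (k - 1) * sum(r ** 2 for r in resid.values()) / n ** 2
    return math.sqrt(var)

def design_effect_ratio(values, clusters):
    """Clustered SE divided by the SRS SE, as defined on the slide."""
    return clustered_se(values, clusters) / srs_se(values)

# Made-up variable that is identical within each postcode (strongly clustered).
values = [1, 1, 1, 5, 5, 5, 9, 9, 9]
postcodes = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
print(design_effect_ratio(values, postcodes))
```

When every case is its own cluster the two standard errors coincide and the ratio is 1; the more a variable varies between postcodes rather than within them, the larger the ratio.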

42
SAS
  • Use survey procedures (e.g. proc surveymeans,
    proc surveyreg)
  • proc surveyreg data=<filename> total=<library>.stratum;
  • stratum stratum;
  • cluster pcodes;
  • model <standard SAS model details>;
  • weight <a/b/c/dweights>;
  • run;

43
SPSS
  • Need complex samples add-on
  • Set up an analysis plan with an analysis plan
    wizard
  • Strata = stratum
  • Cluster = pcodes
  • Weight = a/b/c/dweights
  • Use plan file to run analyses

44
Variable Naming
45
Variable Naming
  • Standard input variables
  • Derived items
  • Age invariant indicators
  • Household composition items

46
Standard input variables
  • The variable names follow the standard format in
    most cases.
  • A tt xxxxx
  • A = child age indicator
  • tt = topic indicator
  • xxxxx = specific question identifier

47
Child Age Indicator
  • a indicates the child is aged 0-1 years
  • b indicates the child is aged 2-3 years
  • c indicates the child is aged 4-5 years
  • d indicates the child is aged 6-7 years
  • z indicates that the data item is permanent and
    will not change (e.g. date of birth)

48
Some examples
  • zhs03a is the birthweight of the study child for
    both B and K cohorts
  • apa01a = Parent 1 rating of parent self-efficacy
    at Wave 1 for the B-cohort
  • bpa01a = the same thing at Wave 2 for the B-cohort
  • cpa01a = the same thing at Wave 1 for the K-cohort

49
Topic Indicator
  • 2-character abbreviation
  • E.g. fd = Family Demographics
  • Links in with topic field in the Data Dictionary
  • Complete list of topics available in the Data
    User Guide

50
Specific question identifier
  • The last (up to) 5 characters of a variable name
  • Contain whatever information is necessary to
    uniquely identify each item.
  • Generally has an arbitrary two-digit question
    number
  • Items of related content have been grouped
    together as much as possible.

51
Subject/informant - general
  • The 6th character in the variable name can also
    indicate the subject/informant
  • a = Parent 1
  • b = Parent 2
  • c = Study Child
  • p = PLE (parent living elsewhere)
  • m = Mother
  • f = Father
  • t = Teacher/carer
  • i = Between waves mail-out respondent

52
Some examples
  • bhs13a is Parent 1's rating of their own overall
    health status
  • bhs13b is Parent 2's rating of their own overall
    health status
  • bhs13c is Parent 1's rating of the Study Child's
    overall health status
  • bhs13p is the PLE's rating of their own overall
    health status
  • bhs13m is the Mother's rating of their own
    overall health status
  • bhs13f is the Father's rating of their own overall
    health status
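
The naming rules for standard input variables can be turned into a small lookup. This is a sketch only: the function and dictionary names are hypothetical, it does not handle derived items or household composition items, and the 6th character is returned as a possible subject because that letter does not always denote a subject/informant.

```python
AGE_INDICATOR = {
    "a": "0-1 years",
    "b": "2-3 years",
    "c": "4-5 years",
    "d": "6-7 years",
    "z": "age-invariant",
}
SUBJECT = {
    "a": "Parent 1", "b": "Parent 2", "c": "Study Child", "p": "PLE",
    "m": "Mother", "f": "Father", "t": "Teacher/carer",
    "i": "Between waves mail-out respondent",
}

def parse_variable(name):
    """Split a standard input variable name into age, topic and question parts."""
    if not 4 <= len(name) <= 8 or name[0] not in AGE_INDICATOR:
        raise ValueError(f"not a standard input variable name: {name!r}")
    return {
        "age": AGE_INDICATOR[name[0]],
        "topic": name[1:3],
        "question": name[3:],
        # Only a hint: the 6th character can also be an ordinary item letter.
        "possible_subject": SUBJECT.get(name[5]) if len(name) >= 6 else None,
    }

print(parse_variable("bhs13c"))
print(parse_variable("zhs03a"))
```

The topic code (for example hs) can then be looked up against the topic list in the Data User Guide.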

53
Subject/Informant Education/Child Care
54
Example
  • Grouping of like items
  • bhs12a is whether Parent 1 is concerned about the
    child's weight
  • bhs12b is whether Parent 1 considers the child to
    be underweight, normal weight, somewhat
    overweight or very overweight

55
Item grouping
  • Grouping of items (with subject indicator)
  • bhs23c1 is the study child's height
  • bhs23c2 is the study child's weight
  • bhs23c3 is the study child's waist measurement

56
Item grouping
  • Scales and subscales grouped together
  • SDQ Conduct reported by Parent 1 at Wave 1:
    cse03a4a, cse03a4b, cse03a4c, cse03a4d,
    cse03a4e
  • SDQ Conduct reported by Parent 1 at Wave 2:
    dse03a4a, dse03a4b, dse03a4c, dse03a4f,
    dse03a4g
  • Note that items d and e at Wave 1 have been
    replaced by items f and g at Wave 2

57
Item grouping
  • Recoded items
  • Changes between cohorts and waves

58
Derived items
  • A(S)m
  • A = age indicator
  • (S) = optional subject/informant indicator
  • m = up to 6 character mnemonic

59
Examples
  • bvocab = MCDI Vocabulary measure (B-cohort Wave
    2)
  • aaemp = Parent 1 employment status (B-cohort Wave
    1)
  • abemp = Parent 2 employment status (B-cohort Wave
    1)
  • bbemp = Parent 2 employment status (B-cohort Wave 2)

60
Age invariant indicators
  • A small number of age invariant indicators also
    on the file
  • hicid = unique identifier assigned when child was
    selected by Medicare Australia
  • cohort
  • wave
  • stratum = stratum from selection process
  • pcodes = postcode at the time of selection

61
Household Composition Items
  • A f [question number] x mmm
  • Where
  • A = child age indicator
  • f = f (for family)
  • Question number (numeric)
  • x = sub-question indicator (optional)
  • mmm = person identifier

62
User Resources
63
User Resources
  • Labelled Questionnaires
  • Data Dictionary
  • Frequencies
  • Are located on the data CD and on the website
  • www.aifs.gov.au/growingup/pubs/instruments.html

64
Labelled Questionnaires
  • The interview for wave 1 was conducted using
    paper and pencil.
  • The interview for wave 2 was conducted using
    Computer Assisted Interview (CAI) and paper and
    pencil.
  • Therefore in wave 2 a CAI guide had to be created
  • This guide does not include the Household
    variables. For these variables refer to the data
    dictionary

65
Labelled Questionnaires
  • Survey question and its associated variable
  • E.g. Question - Was child ever breastfed?
    Variable - zhb05a

66
Data Dictionary
  • Information about every variable on the main
    datasets can be found here.
  • There are 2 versions of the data dictionary: one
    in Excel and one online.
  • The Excel version allows users to use Excel
    functions when navigating the data dictionary.

67
Frequencies
  • Frequencies have been created for each of the
    datasets
  • These frequencies can be found on the data CD
  • Frequencies on all variables have been prepared
  • For continuous variables, the mean and standard
    deviation are given instead, in order to reduce
    the number of pages.