Looking for a way

Provided by: kittyma
1
Estimation of Disclosure Risk for Sample Microdata Using Probabilistic Modelling
Natalie Shlomo
University of Southampton and the Office for National Statistics

2
Topics for Discussion
  • Introduction
  • Disclosure risk assessment for microdata
  • Probabilistic models for estimating quantitative
    disclosure risk measures
  • Log-linear models and their implementation
  • Results of simulations
  • Discussion and future research

3
Introduction
Protect Confidentiality vs. High Quality Data
Objective and Quantitative Measures for Disclosure Risk and Data Utility
  • Microdata containing samples where sample design
    is known and each unit assigned a sampling weight
  • Population is an unknown parameter, though
    marginal distributions may be known and
    incorporated in the sampling weights


4
Introduction
  • Disclosure risk depends on records that are
    unique in the sample and in the population
  • Several methods for estimating disclosure risk in
    sample microdata
  • Need for a unified framework which is robust,
    accurate, easily implemented and explained to
    users

  • Individual Risk Measures: per-record probability of re-identification
  • Global Risk Measures: file-level aggregation of individual risk measures
5
Disclosure Risk Assessment
  • Disclosure Risk Scenario
      • Assumptions about prior knowledge, public use files and software
        tools available for potential attacks by an intruder
  • Key variables
      • Indirectly identifying variables (assumed discrete and free of
        measurement error)
      • Use the most essential and visible key variables
  • Sensitive variables
      • Confidential and not to be disclosed

6
Assessing Disclosure Risk
  • Key: the compounded key variables
  • Cells of the key are denoted by $k = 1, \ldots, K$ and defined by the
    contingency table spanned by the key variables
  • Sample unique: a combination (cell) of the key that is unique in the
    sample
  • Example key: Locality × Sex × Age × Marital Status × Ethnicity ×
    Number of Children
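The idea of a key and its sample uniques can be sketched in a few lines of Python. The records below are hypothetical and only illustrate the cross-classification; they are not taken from the presentation.

```python
from collections import Counter

# Hypothetical sample records, one tuple of key-variable values per unit:
# (locality, sex, age, marital status, ethnicity, number of children)
sample = [
    ("North", "F", 34, "married", "A", 2),
    ("North", "F", 34, "married", "A", 2),
    ("South", "M", 51, "single", "B", 0),
    ("South", "F", 27, "single", "A", 1),
]

# Each distinct combination of key-variable values is one cell k of the key.
cell_counts = Counter(sample)

# A sample unique is a cell whose sample count is exactly 1.
sample_uniques = [cell for cell, count in cell_counts.items() if count == 1]
```

Here two of the three occupied cells are sample uniques; the first two records share a cell and so are not unique.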
7
Disclosure Risk Measures
  • $F_k$: population size in cell k of the key
  • $f_k$: sample size in cell k of the key
  • Global Risk Measures
      • Number of sample uniques that are population uniques:
        $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$
      • Expected number of correct matches for sample uniques to the
        population: $\tau_2 = \sum_k I(f_k = 1) \, \frac{1}{F_k}$
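When the population counts are known, as in a simulation drawn from a census, the two global measures can be computed directly. The sketch below writes F_k for the population count and f_k for the sample count in cell k (symbol names assumed here); the function name and the toy counts are hypothetical.

```python
def global_risk_measures(pop_counts, sample_counts):
    """Return (tau1, tau2) for cell counts given as dicts keyed by cell.

    tau1: number of sample uniques that are also population uniques.
    tau2: expected number of correct matches, the sum of 1/F_k over sample uniques.
    """
    tau1 = 0
    tau2 = 0.0
    for cell, f_k in sample_counts.items():
        if f_k == 1:  # only sample uniques contribute
            F_k = pop_counts[cell]
            if F_k == 1:
                tau1 += 1
            tau2 += 1.0 / F_k
    return tau1, tau2


# Toy example: cells "a" and "b" are sample uniques; only "a" is a population unique.
pop = {"a": 1, "b": 4, "c": 2, "d": 10}
samp = {"a": 1, "b": 1, "c": 2, "d": 0}
tau1, tau2 = global_risk_measures(pop, samp)
```

In practice the population counts are unknown, which is why the models on the following slides estimate these quantities from the sample.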

8
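Probabilistic Models
  • Natural Assumptions
      • Assume for each cell k: $F_k \sim \text{Poisson}(\lambda_k)$,
        independently for $k = 1, \ldots, K$
      • Poisson or Bernoulli sampling:
        $f_k \mid F_k \sim \text{Bin}(F_k, \pi_k)$
      • Therefore, $f_k \sim \text{Poisson}(\pi_k \lambda_k)$ and
        $F_k \mid f_k \sim f_k + \text{Poisson}(\lambda_k (1 - \pi_k))$
  • Probabilistic disclosure risk measures are estimated using the
    conditional distribution of $F_k \mid f_k$ and calculating
    $P(F_k = 1 \mid f_k = 1)$ and $E(1/F_k \mid f_k = 1)$
  • Assume srs: $\pi_k = \pi = n/N$ for all k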

9
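Probabilistic Models
  • The Poisson Model (Skinner and Holmes (1998), Skinner and Elamir (2004))
  • Let $F_k \mid f_k = 1 \sim 1 + \text{Poisson}(\lambda_k (1 - \pi_k))$,
    where $\lambda_k$ is the Poisson mean of the population count and
    $\pi_k$ the sampling fraction in cell k; then
    $P(F_k = 1 \mid f_k = 1) = \exp(-\lambda_k (1 - \pi_k))$ and
    $E(1/F_k \mid f_k = 1) = \frac{1 - \exp(-\lambda_k (1 - \pi_k))}{\lambda_k (1 - \pi_k)}$
  • Parameters (expected cell means) are estimated by log-linear modelling
    on the sample counts in the table spanned by the key variables
  • Model selection techniques: forward, backward or stepwise selection,
    TABU, Bayesian or other forms of averaging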
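Under the Poisson model, the individual risk measures for a sample unique have closed forms. The sketch below assumes that, given f_k = 1, the population count is distributed as 1 plus a Poisson variable with mean lam * (1 - pi), where lam is the cell's Poisson mean and pi the sampling fraction; the function name and arguments are hypothetical.

```python
import math


def poisson_risk(lam, pi):
    """Risk measures for one sample unique under the Poisson model.

    Assumes F | f = 1 ~ 1 + Poisson(lam * (1 - pi)).
    Returns (P(F = 1 | f = 1), E(1/F | f = 1)).
    """
    m = lam * (1.0 - pi)  # mean number of unsampled units in the cell
    if m == 0.0:
        return 1.0, 1.0  # no unsampled units expected: the record is certainly unique
    p_unique = math.exp(-m)
    expected_match = (1.0 - math.exp(-m)) / m
    return p_unique, expected_match


# Global measures are sums of the per-cell values over the sample uniques.
cells = [(2.0, 0.5), (0.4, 0.5)]  # hypothetical (lam, pi) pairs for two sample uniques
tau1_hat = sum(poisson_risk(lam, pi)[0] for lam, pi in cells)
tau2_hat = sum(poisson_risk(lam, pi)[1] for lam, pi in cells)
```

The second value follows from E[1/(1 + X)] = (1 - exp(-m)) / m for X distributed as Poisson(m).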

10
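Probabilistic Models
  • Since $f_k \sim \text{Poisson}(\mu_k)$ with $\mu_k = \pi_k \lambda_k$,
    fit the log-linear model $\log \mu_k = x_k' \beta$ on the sample
    counts $f_k$
  • where $x_k$ is a vector containing main effects and interactions for
    the key variables
  • Estimate the vector $\beta$ using IPF (iterative proportional fitting)
  • Cell mean parameter is
    $\hat{\lambda}_k = \exp(x_k' \hat{\beta}) / \pi_k$, and replace
    $\lambda_k$ by $\hat{\lambda}_k$ in the risk measures
  • Two models: independence and all two-way interactions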
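As an illustration of IPF, the following minimal sketch fits the independence model to a two-way table of sample counts by alternately rescaling rows and columns to the observed margins. It is a toy version under stated assumptions: real keys span many variables, and the function name, table and sampling fraction are hypothetical.

```python
def ipf_two_way(table, n_iter=50, tol=1e-10):
    """Fit the independence log-linear model to a two-way count table by IPF.

    Returns the fitted cell means (same shape as `table`).
    """
    n_rows, n_cols = len(table), len(table[0])
    row_targets = [sum(row) for row in table]
    col_targets = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_targets)
    # Start from a uniform table with the right grand total.
    fitted = [[total / (n_rows * n_cols)] * n_cols for _ in range(n_rows)]
    for _ in range(n_iter):
        previous = [row[:] for row in fitted]
        # Rescale each row to match the observed row margin.
        for i in range(n_rows):
            s = sum(fitted[i])
            if s > 0:
                fitted[i] = [v * row_targets[i] / s for v in fitted[i]]
        # Rescale each column to match the observed column margin.
        for j in range(n_cols):
            s = sum(fitted[i][j] for i in range(n_rows))
            if s > 0:
                for i in range(n_rows):
                    fitted[i][j] *= col_targets[j] / s
        change = max(
            abs(fitted[i][j] - previous[i][j])
            for i in range(n_rows)
            for j in range(n_cols)
        )
        if change < tol:
            break
    return fitted


sample_table = [[3, 1], [0, 2]]  # hypothetical sample counts f_k
mu_hat = ipf_two_way(sample_table)
pi = 0.1  # hypothetical sampling fraction
lam_hat = [[m / pi for m in row] for row in mu_hat]  # estimated population cell means
```

For the independence model the fitted means equal row total times column total divided by the grand total, so IPF settles after a single cycle; models with interactions need several cycles.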

11
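Probabilistic Models
  • The Argus Model (Benedetti, Capobianchi, Franconi (1998))
  • Conditional distribution:
    $F_k \mid f_k \sim \text{Negative Binomial}(f_k, \pi_k)$
  • Estimate $\pi_k$ by $\hat{\pi}_k = f_k / \sum_{i \in k} w_i$, where
    $w_i$ is the sampling weight for unit $i$ in cell k
  • Sampling weights are calibrated to known population marginal
    distributions, such as age by sex by geography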
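A minimal sketch of the Argus-style individual risk for a sample unique follows. It assumes that, for f_k = 1, the negative binomial conditional distribution reduces to a geometric distribution on the population count, P(F = j | f = 1) = pi * (1 - pi)^(j - 1), with pi estimated as one over the unit's sampling weight. The function name is hypothetical.

```python
import math


def argus_unique_risk(weight):
    """E(1/F | f = 1) for a sample unique whose sampling weight is `weight`.

    Assumes F | f = 1 is geometric with success probability pi_hat = 1 / weight,
    which gives E(1/F) = pi / (1 - pi) * log(1 / pi).
    """
    pi_hat = 1.0 / weight
    if pi_hat >= 1.0:
        return 1.0  # the unit represents only itself
    return pi_hat / (1.0 - pi_hat) * math.log(1.0 / pi_hat)


risk_small_weight = argus_unique_risk(2.0)
risk_large_weight = argus_unique_risk(1300.0)
```

The risk shrinks quickly as the weight grows, so a sample unique carrying a very large weight receives a very small risk estimate.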

12
Probabilistic Models
  • Severe underestimation in Argus
  • Sampling weights $w_i$ give the population estimate in cell 1:
    $\hat{F}_1 = \sum_{i \in 1} w_i = 1{,}300$
  • The probability estimate is $\hat{\pi}_1 = f_1 / \hat{F}_1$
  • $\hat{\pi}_1$ is now too small and the individual risk measure is
    underestimated

13
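Probabilistic Models
  • 3. The Generalized Negative Binomial Model (Rinott (2003), Skinner
    and Elamir (2004))
  • Prior on the Poisson cell mean:
    $\lambda_k \sim \text{Gamma}(\alpha_k, \beta_k)$, so that the sample
    counts $f_k$ follow a negative binomial distribution
  • Note: for $\alpha_k \to 0$ and equal weights we obtain the Argus
    Model; for $\alpha_k \to \infty$ we obtain the Poisson Model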
14
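Probabilistic Models
  • Assume a log-linear model fit with vector of parameters $\hat{\beta}$
  • Conditional mean: $E(f_k \mid x_k) = \mu_k = \exp(x_k' \beta)$
  • Conditional variance, with dispersion parameter $a$:
      • Poisson variance: $\mu_k$
      • GLM variance: $(1 + a) \mu_k$
      • Negative Binomial variance: $\mu_k + a \mu_k^2$
  • Assume $a$ equal for all k (each cell has constant CV)
  • Estimate $a$ from the CV $s / m$, where $m$ is the average of the
    expected cell means on sample uniques and $s$ their standard
    deviation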
15
Implementation
  • Goodness of Fit Criteria
      • Standard goodness of fit criteria for log-linear models (Pearson
        and likelihood ratio tests) fail asymptotic assumptions for large
        and sparse tables
      • Average cell size for LFS: 0.012
      • Use more robust conditional moment tests; null hypothesis:
        equality of conditional mean and conditional variance, i.e. equal
        dispersion in the Poisson model
      • Testing for the correct management of structural and random zeros
        in the model

16
Implementation
  • Estimation of parameters in log-linear models depends on whether
    there are zeros in the margins of key variables and their
    interactions. The most accurate model has true structural zeros on
    the margins and not random zeros
  • Higher interaction models
      • Many zeros on margins
      • Expected cell means too high on non-zero cells
      • Models overfitted
      • Disclosure risk measures underestimated
  • Lower interaction models
      • Few zeros on margins
      • Expected cell means too low on non-zero cells
      • Models underfitted
      • Disclosure risk measures overestimated

17
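Implementation
  • Test 1
      • Since $X^2 = \sum_k (f_k - \hat{\mu}_k)^2 / \hat{\mu}_k$ is
        approximately $K - p$ under the model (p = number of parameters):
      • $X^2 / (K - p) < 1$: model under dispersed (overfitted) and risk
        measures underestimated
      • $X^2 / (K - p) > 1$: model over dispersed (underfitted) and risk
        measures overestimated
      • Problems: a large $X^2$ also means lack of fit; degrees of
        freedom need to be adjusted for structural zeros
  • Test 2
      • Define $z_k = \frac{(f_k - \hat{\mu}_k)^2 - f_k}{\hat{\mu}_k}$
      • GLM framework: OLS regression $z_k = a + \varepsilon_k$
      • NB framework: OLS regression $z_k = a \hat{\mu}_k + \varepsilon_k$
      • T-statistic for $a$ is normally distributed under the null
        hypothesis $a = 0$
      • $a < 0$: under dispersed; $a > 0$: over dispersed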
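The two dispersion checks can be sketched as follows. This is an illustration under assumptions, not the presenter's exact statistics: Test 1 is taken to be the Pearson statistic divided by its degrees of freedom, and Test 2 is written in the style of a regression-based overdispersion test (an auxiliary OLS regression without intercept). Function names and the toy counts are hypothetical, and no adjustment for structural zeros is made.

```python
import math


def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square divided by (number of cells - number of parameters)."""
    cells = [(f, m) for f, m in zip(counts, fitted) if m > 0]
    chi2 = sum((f - m) ** 2 / m for f, m in cells)
    return chi2 / (len(cells) - n_params)


def overdispersion_t(counts, fitted, quadratic=False):
    """Auxiliary OLS regression (no intercept) of z_k on 1 or on the fitted mean.

    z_k = ((f_k - m_k)^2 - f_k) / m_k. With quadratic=False the regressor is 1
    (variance linear in the mean); with quadratic=True it is m_k (variance
    quadratic in the mean). Returns (a_hat, t_statistic).
    """
    cells = [(f, m) for f, m in zip(counts, fitted) if m > 0]
    z = [((f - m) ** 2 - f) / m for f, m in cells]
    x = [m if quadratic else 1.0 for _, m in cells]
    sxx = sum(v * v for v in x)
    a_hat = sum(xi * zi for xi, zi in zip(x, z)) / sxx
    resid = [zi - a_hat * xi for xi, zi in zip(x, z)]
    sigma2 = sum(r * r for r in resid) / (len(z) - 1)  # one estimated coefficient
    return a_hat, a_hat / math.sqrt(sigma2 / sxx)


counts = [0, 1, 2, 5]            # hypothetical sample counts
fitted = [1.0, 1.0, 2.0, 4.0]    # hypothetical fitted cell means
ratio = pearson_dispersion(counts, fitted, n_params=2)
a_hat, t_stat = overdispersion_t(counts, fitted)
```

A ratio below 1 or a negative coefficient points to under dispersion (overfitting), and the opposite signs to over dispersion (underfitting).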
18
Examples
  • Samples drawn from Census: N = 52,557,060, K = 3,015,936
    Key: sex (2), age (96), marital status (6), ethnicity (17),
    region (11), economic status (14)
  • LFS Individual Sample: n = 127,200, K = 10,540,000,
    K(non-zero) = 35,627, SU (sample uniques) = 21,408
    Key: sex (2), age (100), marital status (5), ethnicity (17),
    region (20), economic status (31)
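The number of cells K is the product of the category counts of the key variables, and the average cell size is n / K. The short check below uses only the figures quoted on the slides; the variable names are hypothetical.

```python
from math import prod

# Category counts per key variable, as listed for each data set.
census_key = {"sex": 2, "age": 96, "marital status": 6,
              "ethnicity": 17, "region": 11, "economic status": 14}
lfs_key = {"sex": 2, "age": 100, "marital status": 5,
           "ethnicity": 17, "region": 20, "economic status": 31}

K_census = prod(census_key.values())  # number of cells in the Census key
K_lfs = prod(lfs_key.values())        # number of cells in the LFS key

n_lfs = 127_200
avg_cell_size_lfs = n_lfs / K_lfs     # average sample count per cell
```

Both products reproduce the K values stated above, and the LFS average cell size rounds to the 0.012 quoted earlier, which shows how sparse the table is.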
19
Examples: Sample A, n = 52,558, K(non-zero) = 20,136, SU = 12,908
20
Sample A, n = 52,558, K(non-zero) = 20,136, SU = 12,908
21
Examples: Sample C, n = 157,672, K(non-zero) = 39,308, SU = 22,960
22
Sample C, n = 157,672, K(non-zero) = 39,308, SU = 22,960
23
Examples - LFS
  • Poisson: 2.0 correct matches
  • NB: 1.1 correct matches
  • Argus: 0.2 correct matches
24
Future Research
  • Model selection techniques for log-linear models,
    especially focused on large contingency tables
  • Develop new and more robust goodness of fit
    criteria for log-linear models, keeping in mind
    the end goal of estimating the best disclosure
    risk measures
  • Practical implementation of the models such as
    partitioning the large tables and model selection
    in sub-tables
  • More complex survey designs, in particular
    hierarchical data
  • Confidence intervals for global risk measures
