Looking for a way - PowerPoint PPT Presentation

About This Presentation

Title:

Looking for a way

Description:

Poisson or Bernoulli sampling: Therefore, ... The Poisson Model (Skinner, Holmes (1998), Skinner, ... For we obtain the Poisson Model. Probabilistic Models ... – PowerPoint PPT presentation

Slides: 26

Provided by: kittyma

Transcript and Presenter's Notes

Title: Looking for a way

1
Estimation of Disclosure Risk for Sample
Microdata Using Probabilistic Modelling
Natalie Shlomo
University of Southampton and the Office
for National Statistics

2
Topics for Discussion

Introduction
Disclosure risk assessment for microdata
Probabilistic models for estimating quantitative
disclosure risk measures
Log-linear models and their implementation
Results of simulations
Discussion and future research

3
Introduction
Protect Confidentiality
High Quality Data
Objective and Quantitative Measures
for Disclosure Risk and Data Utility

Microdata containing samples where sample design
is known and each unit assigned a sampling weight
Population is an unknown parameter, though
marginal distributions may be known and
incorporated in the sampling weights

Introduction
Disclosure risk depends on records that are
unique in the sample and in the population
Several methods for estimating disclosure risk in
sample microdata
Need for a unified framework which is robust,
accurate, easily implemented and explained to
users

Individual Risk Measures: per-record probability
of re-identification
Global Risk Measures: file-level aggregation of
individual risk measures
5

Disclosure Risk Assessment
Disclosure Risk Scenario
Assumptions about prior knowledge, public
use files, software tools available for potential
attacks by an intruder
Key variables
Indirectly identifying variables (assumed discrete
and free of measurement error)
Use most essential and visible key variables
Sensitive variables
Confidential and not to be disclosed

Assessing Disclosure Risk
Key: compounded key variables
Cells denoted by $k = 1, \ldots, K$
but defined as a contingency table
spanned by key variables.
Sample unique: a combination (cell) of the
key that is unique in the sample.
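
The definitions above can be illustrated with a minimal Python sketch. The records, variable values and function names are hypothetical and chosen only for illustration; each tuple stands for one sample record's values on the key variables.

```python
from collections import Counter

# Hypothetical sample records (illustrative data, not from the presentation).
# Each tuple holds one record's values on the key variables:
# (sex, age band, marital status).
sample = [
    ("F", "30-39", "married"),
    ("F", "30-39", "married"),
    ("M", "30-39", "single"),
    ("M", "60-69", "widowed"),
    ("F", "20-29", "single"),
    ("F", "20-29", "single"),
]


def cell_counts(records):
    """Count how many records fall in each non-empty cell of the key."""
    return Counter(records)


def sample_uniques(records):
    """Return the cells of the key that contain exactly one sample record."""
    return [cell for cell, count in cell_counts(records).items() if count == 1]


uniques = sample_uniques(sample)
```

Only non-empty cells are stored, which matters when the full table spanned by the key variables is very large and sparse.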

[Figure: a key formed from the variables Locality, Sex, Age, Marital Status, Ethnicity and Number of Children]
7
Disclosure Risk Measures

$F_k$: population size in cell $k$ of the key
$f_k$: sample size in cell $k$ of the key
Global Risk Measures
$\tau_1 = \sum_k I(f_k = 1, F_k = 1)$: number of sample uniques that are population
uniques
$\tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k}$: expected number of correct matches for sample
uniques to the population
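
When the population counts are known, as in a simulation from census data, the two global measures can be computed directly. The sketch below assumes the sample and population cell counts are given as parallel lists. The function name and the toy counts are hypothetical.

```python
def global_risk_measures(sample_counts, population_counts):
    """Compute the two global risk measures from per-cell counts.

    tau1: number of sample uniques that are also population uniques.
    tau2: expected number of correct matches for sample uniques,
          i.e. the sum of 1 / (population count) over sample-unique cells.
    """
    tau1 = 0
    tau2 = 0.0
    for f, F in zip(sample_counts, population_counts):
        if f == 1:  # only sample-unique cells contribute
            tau1 += int(F == 1)
            tau2 += 1.0 / F
    return tau1, tau2


# Toy counts for five cells (illustrative only).
f = [1, 1, 2, 0, 1]
F = [1, 4, 10, 3, 2]
tau1, tau2 = global_risk_measures(f, F)
```

In practice the population counts are unknown, which is why the measures have to be estimated from a model.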

Probabilistic Models
Natural Assumptions
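Assume for each cell $k$: $F_k \sim \text{Poisson}(\lambda_k)$, independently for $k = 1, \ldots, K$
Poisson or Bernoulli sampling: $f_k \mid F_k \sim \text{Binomial}(F_k, \pi_k)$
Therefore, $f_k \sim \text{Poisson}(\pi_k \lambda_k)$ and $F_k \mid f_k \sim f_k + \text{Poisson}(\lambda_k (1 - \pi_k))$
Probabilistic disclosure risk measures estimated
using conditional distribution of $F_k \mid f_k$
and calculating $P(F_k = 1 \mid f_k = 1)$
and $E(1/F_k \mid f_k = 1)$
Assume srs: $\pi_k = \pi$ for
all $k$ and $\pi = n/N$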
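
The step from the model assumptions to the conditional distribution can be written out as follows. This is a sketch based on the standard Poisson thinning result; the notation is assumed here: $F_k$ is the population count in cell $k$, $f_k$ the sample count, $\lambda_k$ the Poisson cell mean and $\pi_k$ the sampling fraction.

```latex
\begin{align*}
F_k &\sim \text{Poisson}(\lambda_k), \qquad
f_k \mid F_k \sim \text{Binomial}(F_k, \pi_k) \\
\Rightarrow\quad f_k &\sim \text{Poisson}(\pi_k \lambda_k), \qquad
F_k - f_k \sim \text{Poisson}\big((1 - \pi_k)\lambda_k\big)
\ \text{independently of } f_k \\
\Rightarrow\quad F_k \mid f_k &\sim f_k + \text{Poisson}\big((1 - \pi_k)\lambda_k\big)
\end{align*}
```

Because the unsampled part of each cell is independent of the sampled part, the risk for a sample unique depends only on $\lambda_k (1 - \pi_k)$.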

Probabilistic Models
The Poisson Model (Skinner, Holmes (1998),
Skinner, Elamir (2004))
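Let $\pi_k = \pi$; then $P(F_k = 1 \mid f_k = 1) = e^{-\lambda_k (1 - \pi)}$
and $E(1/F_k \mid f_k = 1) = \frac{1 - e^{-\lambda_k (1 - \pi)}}{\lambda_k (1 - \pi)}$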
Parameters (expected cell means) estimated by
log linear modelling on sample counts in table
spanned by key variables
Model selection techniques: forward, backward or
stepwise selection, TABU, Bayesian or other forms
of averaging
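
A minimal Python sketch of the individual risk measures under the Poisson model, for a cell that is unique in the sample. It assumes the population count in such a cell is 1 plus a Poisson variable with mean lambda * (1 - pi). The function names and the example cell means are hypothetical.

```python
import math


def poisson_individual_risks(lam, pi):
    """Risk measures for a sample-unique cell under the Poisson model.

    lam: Poisson mean of the population count in the cell (assumed known
         or estimated, for example from a log-linear model).
    pi:  sampling fraction.
    Returns (P(F = 1 | f = 1), E(1 / F | f = 1)).
    """
    rate = lam * (1.0 - pi)  # mean number of unsampled units in the cell
    p_population_unique = math.exp(-rate)
    if rate == 0.0:
        expected_match = 1.0  # no unsampled units are possible
    else:
        expected_match = (1.0 - math.exp(-rate)) / rate
    return p_population_unique, expected_match


def poisson_global_risks(lams_of_sample_uniques, pi):
    """Sum the individual measures over all sample-unique cells."""
    risks = [poisson_individual_risks(lam, pi) for lam in lams_of_sample_uniques]
    tau1_hat = sum(p for p, _ in risks)
    tau2_hat = sum(m for _, m in risks)
    return tau1_hat, tau2_hat


# Illustrative cell means for three sample-unique cells, sampling fraction 0.1.
tau1_hat, tau2_hat = poisson_global_risks([0.5, 2.0, 10.0], pi=0.1)
```

Cells with small expected means dominate both sums, so the quality of the fitted cell means on sample uniques drives the estimates.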

Probabilistic Models
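Since $f_k \sim \text{Poisson}(\pi \lambda_k)$, fit log-linear
model on the sample counts $f_k$: $\log \mu_k = x_k' \beta$, with $\mu_k = \pi \lambda_k$,
where $x_k$ is a vector containing main
effects and interactions for
key variables.
Estimate vector $\beta$ using IPF.
Cell mean parameter is $\hat{\lambda}_k = \hat{\mu}_k / \pi$
and replace $\lambda_k$ by $\hat{\lambda}_k$ in the risk measures
Two models: independence and all 2-way
interactions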
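
As an illustration of fitting by IPF, the sketch below fits the independence model to a two-way table of sample counts by alternately scaling to the row and column margins. It is a toy version under stated assumptions: only two key variables, a hypothetical table, and a hypothetical sampling fraction of 0.1 used to scale the fitted sample means up to population cell means.

```python
def ipf_independence(table, n_iter=20):
    """Fit the independence model to a two-way table of counts by IPF.

    The fitted table matches the observed row and column margins.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_targets = [sum(row) for row in table]
    col_targets = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_targets)
    # Start from a flat table with the right grand total.
    fitted = [[total / (n_rows * n_cols)] * n_cols for _ in range(n_rows)]
    for _ in range(n_iter):
        # Scale each row to its observed margin.
        for i in range(n_rows):
            s = sum(fitted[i])
            scale = row_targets[i] / s if s > 0 else 0.0
            fitted[i] = [v * scale for v in fitted[i]]
        # Scale each column to its observed margin.
        for j in range(n_cols):
            s = sum(fitted[i][j] for i in range(n_rows))
            scale = col_targets[j] / s if s > 0 else 0.0
            for i in range(n_rows):
                fitted[i][j] *= scale
    return fitted


# Hypothetical sample counts for two key variables with two categories each.
sample_table = [[3, 1], [0, 2]]
mu_hat = ipf_independence(sample_table)

# Scale fitted sample means to population cell means (pi is an assumed value).
pi = 0.1
lambda_hat = [[mu / pi for mu in row] for row in mu_hat]
```

Note that the empty sample cell receives a positive fitted mean, because its row and column margins are non-zero; a zero margin would force the fitted cell to zero.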

11
Probabilistic Models

The Argus Model (Benedetti, Capobianchi,
Franconi (1998))
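Conditional Distribution: $F_k \mid f_k \sim \text{Negative Binomial}(f_k, \pi_k)$
Estimate $\pi_k$ by sampling weights: $\hat{\pi}_k = f_k / \sum_{i \in k} w_i$, where $w_i$ is the sampling weight
for unit $i$ in cell $k$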
Sampling weights are calibrated to known
population marginal distributions, such as age
by sex by geography

12
Probabilistic Models

Severe underestimation in Argus
Sampling weight is
Population estimate in cell 1 is 1,300
The probability estimate is
The estimate of $\pi_k$ is now too small and the individual
risk measure
is underestimated
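
The effect can be sketched numerically. The code below assumes the negative binomial form commonly given for the Argus approach, under which the population count for a sample unique is geometric with success probability pi and the individual risk is E(1/F | f = 1) = pi / (1 - pi) * log(1 / pi). It also assumes, purely for illustration, that the cell contains a single sample record whose weight equals the population estimate of 1,300; the transcript does not preserve the actual weight.

```python
import math


def argus_pi_hat(sample_count, weights):
    """Estimate the cell sampling probability as f_k / (sum of weights in cell)."""
    return sample_count / sum(weights)


def argus_risk_sample_unique(pi_hat):
    """E(1 / F | f = 1) when F | f = 1 is geometric with probability pi_hat."""
    if not 0.0 < pi_hat <= 1.0:
        raise ValueError("pi_hat must lie in (0, 1]")
    if pi_hat == 1.0:
        return 1.0  # the cell is fully enumerated
    return pi_hat / (1.0 - pi_hat) * math.log(1.0 / pi_hat)


# Assumed illustration: one sample record carrying a weight of 1,300.
pi_hat = argus_pi_hat(1, [1300.0])
risk = argus_risk_sample_unique(pi_hat)
```

Under these assumptions the estimated risk is well below one percent, whatever the true population count in the cell, because the estimate uses only the weights of the records in that cell.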

13
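Probabilistic Models
3. The Generalized Negative Binomial Model (Rinott (2003), Skinner, Elamir (2004))
Prior on Poisson cell mean: $\lambda_k \sim \text{Gamma}(\alpha_k, \beta_k)$,
and $F_k \mid f_k$ then follows a negative binomial distribution
Note: For $\alpha_k \to 0$ and equal weights we
obtain the Argus Model. For $\alpha_k \to \infty$ we obtain
the Poisson Model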
14
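Probabilistic Models
Assume log-linear model fit with vector of parameters $\hat{\beta}$
Conditional mean: $E(f_k \mid x_k) = \mu_k = \exp(x_k' \beta)$
Conditional variance:
Poisson variance: $\text{Var}(f_k \mid x_k) = \mu_k$
GLM variance: $\text{Var}(f_k \mid x_k) = \phi\, \mu_k$
Negative Binomial variance: $\text{Var}(f_k \mid x_k) = \mu_k + \mu_k^2 / \alpha$
Assume $\alpha_k$ equal for all $k$ (each cell has constant CV)
Estimate $\alpha$ from the average of the expected cell means on sample
uniques and their standard deviation.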
15

Implementation
Goodness of Fit Criteria
Standard goodness of fit criteria for log-linear
models (Pearson and likelihood ratio tests) rely on
asymptotic assumptions that fail for large and sparse
tables.
Average cell size for LFS: 0.012
Use more robust conditional moment tests. Null
hypothesis: equality of conditional mean and
conditional variance, i.e. equal dispersion in
the Poisson model
Test for the correct management of structural
and random zeros in the model.

16
Implementation

Estimation of parameters in log-linear models
depends on whether there are zeros in the margins
of key variables and their interactions. The
most accurate model has true structural zeros on
the margins and not random zeros

Higher interaction models:
Many zeros on margins
Expected cell means too high on non-zero cells
Models overfitted
Disclosure risk measures underestimated

Lower interaction models:
Few zeros on margins
Expected cell means too low on non-zero cells
Models underfitted
Disclosure risk measures overestimated

17
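Implementation
Test 1: Since the Pearson statistic $X^2 = \sum_k (f_k - \hat{\mu}_k)^2 / \hat{\mu}_k$ has expectation of about $K - p$
($p$ = number of parameters):
$X^2 / (K - p) < 1$: model underdispersed (overfitted) and risk measures underestimated
$X^2 / (K - p) > 1$: model overdispersed (underfitted) and risk measures overestimated
Problems: a large $X^2$ also means
lack of fit; degrees of freedom need to be
adjusted for structural zeros
Test 2: Define $\text{Var}(f_k \mid x_k) = \mu_k + \delta\, g(\mu_k)$
GLM framework: $g(\mu_k) = \mu_k$, tested by OLS regression
NB framework: $g(\mu_k) = \mu_k^2$, tested by OLS regression
T-statistic for $\delta$ under null
hypothesis $\delta = 0$ is normally distributed
$\delta < 0$: underdispersed; $\delta > 0$: overdispersed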
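
The two tests can be sketched in Python as below. Test 1 is taken to be the Pearson statistic divided by its degrees of freedom. Test 2 is taken to be a regression-based dispersion test in the style of Cameron and Trivedi, which is an assumption about what the slide intends: the auxiliary variable ((f - mu)^2 - f) / mu is regressed, without intercept, on g(mu) / mu. The function names, the `variance` argument and the toy data are hypothetical.

```python
import math


def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square divided by (number of cells - number of parameters)."""
    chi2 = sum((f - m) ** 2 / m for f, m in zip(counts, fitted))
    return chi2 / (len(counts) - n_params)


def dispersion_t_test(counts, fitted, variance="glm"):
    """Regression-based dispersion test (assumed form, see lead-in).

    variance="glm": Var = mu + delta * mu     (regressor is 1)
    variance="nb":  Var = mu + delta * mu**2  (regressor is mu)
    Returns (delta_hat, t_statistic); delta_hat > 0 suggests overdispersion,
    delta_hat < 0 suggests underdispersion.
    """
    z = [((f - m) ** 2 - f) / m for f, m in zip(counts, fitted)]
    x = [1.0 if variance == "glm" else m for m in fitted]
    sxx = sum(v * v for v in x)
    delta_hat = sum(a * b for a, b in zip(x, z)) / sxx  # OLS without intercept
    residuals = [b - delta_hat * a for a, b in zip(x, z)]
    sigma2 = sum(r * r for r in residuals) / (len(z) - 1)
    t_statistic = delta_hat / math.sqrt(sigma2 / sxx)
    return delta_hat, t_statistic


# Toy data: eight cells with fitted mean 1 each, all counts in one cell.
counts = [0, 0, 0, 0, 0, 0, 0, 8]
fitted = [1.0] * 8
ratio = pearson_dispersion(counts, fitted, n_params=1)
delta_hat, t_statistic = dispersion_t_test(counts, fitted, variance="glm")
```

The sketch ignores the adjustment of degrees of freedom for structural zeros that the slide lists as a problem.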
18
Examples
Samples drawn from Census: N = 52,557,060, K = 3,015,936
Key: sex (2), age (96), marital status (6), ethnicity (17), region (11), economic
status (14)
LFS Individual Sample: n = 127,200, K = 10,540,000, K(Non-zero) = 35,627,
SU = 21,408
Key: sex (2), age (100), marital status (5), ethnicity (17), region (20), economic status
(31)
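
The table sizes quoted above follow from the numbers of categories of the key variables, and the average cell size follows from the sample size. A short Python check, using only figures from the slide (the dictionary and function names are illustrative):

```python
from math import prod

# Number of categories per key variable, as listed on the slide.
census_key = {"sex": 2, "age": 96, "marital status": 6,
              "ethnicity": 17, "region": 11, "economic status": 14}
lfs_key = {"sex": 2, "age": 100, "marital status": 5,
           "ethnicity": 17, "region": 20, "economic status": 31}


def number_of_cells(key):
    """Number of cells K in the table spanned by the key variables."""
    return prod(key.values())


census_cells = number_of_cells(census_key)
lfs_cells = number_of_cells(lfs_key)

# Average cell size for the LFS sample of n = 127,200.
lfs_average_cell_size = 127_200 / lfs_cells
```

Only 35,627 of the LFS cells are non-empty, so the table is extremely sparse.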
19
Examples
Sample A: n = 52,558, K(Non-zero) = 20,136, SU = 12,908
20
Sample A: n = 52,558, K(Non-zero) = 20,136,
SU = 12,908
21
Examples
Sample C: n = 157,672, K(Non-zero) = 39,308, SU = 22,960
22
Sample C: n = 157,672, K(Non-zero) = 39,308,
SU = 22,960
23
Examples - LFS
Poisson: 2.0 correct matches
NB: 1.1 correct matches
Argus: 0.2 correct matches
24
Future Research

Model selection techniques for log-linear models,
especially for large contingency tables
Develop new and more robust goodness of fit
criteria for log-linear models, keeping in mind
the end goal of estimating the best disclosure
risk measures
Practical implementation of the models, such as
partitioning the large tables and model selection
in sub-tables
More complex survey designs, in particular
hierarchical data
Confidence intervals for global risk measures

