1. Estimation of Disclosure Risk for Sample Microdata Using Probabilistic Modelling
Natalie Shlomo, University of Southampton and the Office for National Statistics
2. Topics for Discussion
- Introduction
- Disclosure risk assessment for microdata
- Probabilistic models for estimating quantitative disclosure risk measures
- Log-linear models and their implementation
- Results of simulations
- Discussion and future research
3. Introduction
Protecting confidentiality vs. releasing high-quality data: the aim is objective and quantitative measures for both disclosure risk and data utility.
- Microdata consists of samples where the sample design is known and each unit is assigned a sampling weight
- The population is an unknown parameter, though marginal distributions may be known and incorporated in the sampling weights
4. Introduction (cont.)
- Disclosure risk depends on records that are unique in the sample and in the population
- Several methods exist for estimating disclosure risk in sample microdata
- Need for a unified framework which is robust, accurate, easily implemented and explained to users

Individual risk measures: per-record probability of re-identification
Global risk measures: file-level aggregation of individual risk measures
5. Disclosure Risk Assessment
- Disclosure risk scenario
  - Assumptions about prior knowledge, public-use files, and software tools available for potential attacks by an intruder
- Key variables
  - Indirectly identifying variables (assumed discrete and free of measurement error)
  - Use the most essential and visible key variables
- Sensitive variables
  - Confidential and not to be disclosed
6. Assessing Disclosure Risk
- Key: the cross-classification (compounding) of the key variables
- Cells, denoted $k = 1, \dots, K$, are defined by the contingency table spanned by the key variables
- Sample unique: a combination (cell) of the key that is unique in the sample
- Example key: Locality, Sex, Age, Marital Status, Ethnicity, Number of Children
7. Disclosure Risk Measures
- $F_k$: population size in cell $k$ of the key
- $f_k$: sample size in cell $k$ of the key
- Global risk measures (a small code sketch follows):
  - $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$: number of sample uniques that are population uniques
  - $\tau_2 = \sum_k I(f_k = 1)\, / \, F_k$: expected number of correct matches for sample uniques to the population
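As a concrete illustration, a minimal sketch (toy counts and the helper name are my own) computing the two global measures in the idealized case where both $F_k$ and $f_k$ are known:

```python
import numpy as np

def global_risk_measures(F, f):
    """Global disclosure risk from population counts F_k and sample
    counts f_k over the K cells of the key (illustrative helper)."""
    F = np.asarray(F, dtype=float)
    f = np.asarray(f, dtype=float)
    su = f == 1                    # sample-unique cells
    tau1 = np.sum(su & (F == 1))   # sample uniques that are population unique
    tau2 = np.sum(1.0 / F[su])     # expected correct matches for sample uniques
    return tau1, tau2

# Toy example with 5 cells of the key:
F = [1, 3, 10, 2, 1]   # population counts (unknown in practice)
f = [1, 1, 2, 0, 1]    # sample counts
print(global_risk_measures(F, f))   # (2, 1 + 1/3 + 1 = 2.333...)
```

In practice $F_k$ is unknown, which is exactly why the probabilistic models on the following slides are needed.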
8. Probabilistic Models
- Natural assumptions: for each cell $k$, assume $F_k \sim \text{Poisson}(\lambda_k)$ for $k = 1, \dots, K$, and Poisson or Bernoulli sampling, $f_k \mid F_k \sim \text{Binomial}(F_k, \pi_k)$
- Therefore $f_k \sim \text{Poisson}(\pi_k \lambda_k)$
- Probabilistic disclosure risk measures are estimated using the conditional distribution of $F_k \mid f_k$ and calculating $P(F_k = 1 \mid f_k = 1)$ and $E(1/F_k \mid f_k = 1)$
- Assume srs: $\pi_k = \pi = n/N$ for all $k$
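The step connecting these assumptions to the risk measures is Poisson thinning; a short sketch of the standard derivation, writing $t_k = (1-\pi_k)\lambda_k$:

```latex
% If F_k ~ Poisson(lambda_k) and f_k | F_k ~ Binomial(F_k, pi_k), then
% f_k ~ Poisson(pi_k lambda_k) and, independently of f_k,
%   F_k - f_k | f_k ~ Poisson((1 - pi_k) lambda_k).
\begin{align}
P(F_k = 1 \mid f_k = 1) &= P(F_k - f_k = 0) = e^{-t_k}, \\
E\!\left(\frac{1}{F_k} \,\middle|\, f_k = 1\right)
  &= \sum_{y=0}^{\infty} \frac{1}{1+y} \cdot \frac{e^{-t_k}\, t_k^{\,y}}{y!}
   = \frac{1 - e^{-t_k}}{t_k}.
\end{align}
```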
9. Probabilistic Models
1. The Poisson model (Skinner and Holmes (1998); Skinner and Elamir (2004))
- Let $\mu_k = \pi \lambda_k$; then $\hat{\tau}_1 = \sum_{SU} e^{-(1-\pi)\hat{\lambda}_k}$ and $\hat{\tau}_2 = \sum_{SU} \big(1 - e^{-(1-\pi)\hat{\lambda}_k}\big) \big/ \big((1-\pi)\hat{\lambda}_k\big)$
- Parameters (expected cell means) estimated by log-linear modelling on the sample counts in the table spanned by the key variables
- Model selection techniques: forward, backward or stepwise selection, TABU search, Bayesian or other forms of model averaging
10. Probabilistic Models
- Since $f_k \sim \text{Poisson}(\mu_k)$ with $\mu_k = \pi \lambda_k$, fit a log-linear model on the sample counts: $\log \mu_k = x_k' \beta$, where $x_k$ is a vector containing main effects and interactions of the key variables
- Estimate the vector $\beta$ using IPF (iterative proportional fitting)
- The cell mean parameter is $\hat{\mu}_k = \exp(x_k' \hat{\beta})$, and $\hat{\lambda}_k = \hat{\mu}_k / \pi$ replaces $\lambda_k$ in the risk measures
- Two models: independence and all two-way interactions (a code sketch of the pipeline follows)
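A minimal sketch of this pipeline on made-up data, assuming a known sampling fraction. The slides fit $\beta$ by IPF; for hierarchical log-linear models the Poisson GLM below yields the same maximum-likelihood fit, and statsmodels is used here purely for convenience:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Made-up sample contingency table over three hypothetical key variables.
rng = np.random.default_rng(0)
cells = pd.MultiIndex.from_product(
    [range(2), range(6), range(11)], names=["sex", "marital", "region"]
).to_frame(index=False)
cells["f"] = rng.poisson(0.5, size=len(cells))   # sample counts f_k
pi = 0.01                                        # assumed known sampling fraction

# Independence log-linear model: log(mu_k) = x_k' beta (main effects only).
fit = smf.glm("f ~ C(sex) + C(marital) + C(region)",
              data=cells, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues        # hat{mu}_k = pi * hat{lambda}_k
lam_hat = mu_hat / pi            # hat{lambda}_k

# Poisson-model risk estimates over the sample uniques:
su = cells["f"] == 1
t = (1 - pi) * lam_hat[su]
tau1_hat = np.sum(np.exp(-t))             # sum of P(F_k = 1 | f_k = 1)
tau2_hat = np.sum((1 - np.exp(-t)) / t)   # sum of E(1/F_k | f_k = 1)
print(tau1_hat, tau2_hat)
```

The all-two-way-interactions variant only changes the model formula (e.g. `f ~ (C(sex) + C(marital) + C(region))**2` in this sketch).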
11. Probabilistic Models
2. The Argus model (Benedetti, Capobianchi and Franconi (1998))
- Conditional distribution: $F_k \mid f_k \sim \text{NegBin}(f_k, \hat{\pi}_k)$
- Estimate $\pi_k$ by the sampling weights $w_i$ of the units $i$ in cell $k$: $\hat{\pi}_k = f_k / \sum_{i \in k} w_i$
- Sampling weights are calibrated to known population marginal distributions, such as age by sex by geography (a code sketch follows)
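For a sample unique ($f_k = 1$) this negative binomial conditional gives a closed-form individual risk, $E(1/F_k \mid f_k = 1) = -\hat{\pi}_k \log \hat{\pi}_k / (1 - \hat{\pi}_k)$. A sketch under that reading (the function name and interface are illustrative):

```python
import numpy as np

def argus_risk_sample_unique(weights_in_cell):
    """Individual risk for a sample-unique record under the Argus
    conditional NegBin, with pi_k estimated from the weights."""
    F_hat = float(np.sum(weights_in_cell))   # weighted population estimate in the cell
    p = 1.0 / F_hat                          # pi_hat_k = f_k / F_hat, with f_k = 1
    # E(1/F_k | f_k = 1) when F_k | f_k ~ NegBin(1, p):
    return -p * np.log(p) / (1.0 - p)

print(argus_risk_sample_unique([20.0]))      # one record with weight 20 -> ~0.158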
12. Probabilistic Models
- Severe underestimation in Argus
- Example: a sample unique carries a large calibrated sampling weight, so the weighted population estimate in cell 1 is 1,300 and the probability estimate is $\hat{\pi}_1 = f_1 / \hat{F}_1 = 1/1300$
- $\hat{\pi}_1$ is now too small and the individual risk measure $E(1/F_1 \mid f_1 = 1)$ is underestimated
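Plugging the slide's number into the closed form above shows the scale of the problem: even if the record were truly population unique (true risk 1), the Argus estimate comes out below 0.6%:

```python
import math

# Slide's example: weighted population estimate of 1,300 in the cell.
p = 1 / 1300                     # pi_hat for a sample unique
r = -p * math.log(p) / (1 - p)   # Argus individual risk
print(round(r, 4))               # ~0.0055: near zero even if F_k = 1 in truth
```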
13. Probabilistic Models
3. The generalized negative binomial model (Rinott (2003), Skinner, Elamir (2004))
- Gamma prior on the Poisson cell mean: $\lambda_k \sim \text{Gamma}(\alpha_k, \beta_k)$, so that $F_k \mid f_k$ is negative binomial
- Note: for $\alpha_k \to 0$ (a diffuse prior) and equal weights we obtain the Argus model; for $\alpha_k \to \infty$ with the prior mean fixed we obtain the Poisson model
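A sketch of the gamma-Poisson conjugacy behind this model (a standard computation; the shape/rate parameterization is my assumption):

```latex
% With lambda_k ~ Gamma(alpha_k, beta_k) (shape, rate) and
% f_k | lambda_k ~ Poisson(pi lambda_k), the posterior is
%   lambda_k | f_k ~ Gamma(alpha_k + f_k, beta_k + pi),
% and mixing F_k - f_k | lambda_k ~ Poisson((1 - pi) lambda_k) over it gives
\begin{equation}
F_k - f_k \mid f_k \;\sim\; \mathrm{NegBin}\!\left(\alpha_k + f_k,\;
  p_k = \frac{\beta_k + \pi}{\beta_k + 1}\right).
\end{equation}
% alpha_k, beta_k -> 0 recovers the Argus conditional NegBin(f_k, pi);
% alpha_k -> infinity with alpha_k / beta_k fixed recovers the Poisson model.
```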
14. Probabilistic Models
- Assume a log-linear model fit with vector of parameters $\beta$, giving expected cell means $\hat{\mu}_k$
- Compare the conditional mean and conditional variance of the sample counts:
  - Poisson variance: $\text{Var}(f_k) = \mu_k$
  - GLM variance: $\text{Var}(f_k) = \phi\, \mu_k$
  - Negative binomial variance: $\text{Var}(f_k) = \mu_k + \tau^2 \mu_k^2$
- Assume $\tau_k = \tau$ equal for all $k$ (each cell has a constant coefficient of variation)
- Estimate $\hat{\tau} = s / \bar{\mu}$, where $\bar{\mu}$ is the average of the expected cell means on sample uniques and $s$ their standard deviation (sketch below)
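A two-line sketch of that constant-CV estimate (the reading $\hat{\tau} = s/\bar{\mu}$ and the toy numbers are my assumptions):

```python
import numpy as np

# Hypothetical fitted cell means on the sample-unique cells.
mu_su = np.array([0.3, 0.8, 1.4, 0.5, 0.9])

# Constant-CV estimate: sd / mean of expected cell means over sample uniques.
tau_hat = mu_su.std(ddof=1) / mu_su.mean()
print(tau_hat)
```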
15. Implementation
- Goodness-of-fit criteria
- Standard goodness-of-fit criteria for log-linear models (Pearson and likelihood-ratio tests) fail their asymptotic assumptions for large and sparse tables; the average cell size for the LFS is 0.012
- Use more robust conditional moment tests; the null hypothesis is equality of the conditional mean and conditional variance, i.e. equal dispersion as in the Poisson model
- Test for the correct management of structural and random zeros in the model
16. Implementation
- Estimation of parameters in log-linear models depends on whether there are zeros in the margins of key variables and their interactions. The most accurate model has true structural zeros on the margins and no random zeros
- Higher-interaction models:
  - Many zeros on margins
  - Expected cell means too high on non-zero cells
  - Models over-fitted
  - Disclosure risk measures underestimated
- Lower-interaction models:
  - Few zeros on margins
  - Expected cell means too low on non-zero cells
  - Models under-fitted
  - Disclosure risk measures overestimated
17. Implementation
Test 1: Since $E(X^2) \approx K - p$ under the Poisson model, where $X^2 = \sum_k (f_k - \hat{\mu}_k)^2 / \hat{\mu}_k$ and $p$ is the number of parameters:
- $X^2 < K - p$: model under-dispersed (over-fitted) and risk measures underestimated
- $X^2 > K - p$: model over-dispersed (under-fitted) and risk measures overestimated
- Problems: a large $X^2$ also means lack of fit, and the degrees of freedom need to be adjusted for structural zeros

Test 2: Define $z_k = \big((f_k - \hat{\mu}_k)^2 - f_k\big) / \hat{\mu}_k$
- GLM framework ($\text{Var}(f_k) = \phi \mu_k$): OLS regression of $z_k$ on a constant
- NB framework ($\text{Var}(f_k) = \mu_k + \tau^2 \mu_k^2$): OLS regression of $z_k$ on $\hat{\mu}_k$
- The t-statistic for the coefficient is normally distributed under the null hypothesis: significantly negative means under-dispersed, significantly positive means over-dispersed (see the sketch below)
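A sketch of both checks in code, under my reading of the slide (the auxiliary regression follows the Cameron-Trivedi style; only the NB-framework regressor is shown):

```python
import numpy as np
import statsmodels.api as sm

def dispersion_checks(f, mu_hat, p):
    """Test 1: Pearson X^2 vs its approximate Poisson expectation K - p.
    Test 2: OLS of z_k on mu_k (no intercept); t ~ N(0,1) under the null."""
    f = np.asarray(f, dtype=float)
    mu = np.asarray(mu_hat, dtype=float)
    X2 = np.sum((f - mu) ** 2 / mu)          # Test 1 statistic
    dof = len(f) - p                         # K - p reference value
    z = ((f - mu) ** 2 - f) / mu             # conditional moment residual
    t_nb = sm.OLS(z, mu).fit().tvalues[0]    # Test 2 t-statistic (NB framework)
    return X2, dof, t_nb
```

Reading the output: $X^2$ well below $K - p$ or a significantly negative $t$ points to under-dispersion (over-fitting, risk underestimated); the reverse points to over-dispersion (under-fitting, risk overestimated).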
18. Examples
- Samples drawn from a Census: N = 52,557,060, K = 3,015,936
  Key: sex (2), age (96), marital status (6), ethnicity (17), region (11), economic status (14)
- LFS individual sample: n = 127,200, K = 10,540,000, K (non-zero) = 35,627, SU = 21,408
  Key: sex (2), age (100), marital status (5), ethnicity (17), region (20), economic status (31)
19. Examples: Sample A, n = 52,558, K (non-zero) = 20,136, SU = 12,908 (results chart not transcribed)
20. Sample A, n = 52,558, K (non-zero) = 20,136, SU = 12,908 (results chart not transcribed)
21. Examples: Sample C, n = 157,672, K (non-zero) = 39,308, SU = 22,960 (results chart not transcribed)
22. Sample C, n = 157,672, K (non-zero) = 39,308, SU = 22,960 (results chart not transcribed)
23. Examples: LFS
- Poisson model: 2.0 correct matches
- Negative binomial model: 1.1 correct matches
- Argus model: 0.2 correct matches
24. Future Research
- Model selection techniques for log-linear models, especially focused on large contingency tables
- Develop new and more robust goodness-of-fit criteria for log-linear models, keeping in mind the end goal of estimating the best disclosure risk measures
- Practical implementation of the models, such as partitioning the large tables and carrying out model selection in sub-tables
- More complex survey designs, in particular hierarchical data
- Confidence intervals for global risk measures