Title: Quality Measures for Disclosure Controlled Statistical Data
1Quality Measures for Disclosure Controlled Statistical Data
- Natalie Shlomo and Caroline Young
- ONS and University of Southampton
2Topics of Discussion
- Introduction and Motivation
- Quality Measures for Assessing the Impact of Statistical Disclosure Control (SDC) methods
- Demonstration of Software Application
- Example
- Conclusions and Future Research
3Introduction
- Data suppliers assess disclosure risk before releasing statistical data
- Attribute disclosure: small counts are used to identify a statistical unit, and confidential information is revealed
- Data suppliers need to make informed decisions on an appropriate SDC method that manages disclosure risk
- SDC methods reduce disclosure risk by perturbing, modifying, or summarizing the data, depending on the format of the statistical outputs
4Introduction
- The most common forms of statistical outputs are tables (containing counts or aggregates) and microdata. Data can be collected from surveys, censuses, or registers
- Choosing an appropriate SDC method is an iterative process
- Assess the trade-off between managing disclosure risk and obtaining high quality outputs
5Introduction
- Examples of common SDC methods
- Pre-tabular methods (implemented on microdata): recoding, coarsening and eliminating variables, sub-sampling, record swapping, or a probabilistic perturbation process
- Post-tabular methods (implemented on tables): table redesign (coarsening and recoding), suppression, and rounding (controlled, full random rounding, small cell rounding)
- For cell suppression, users may want to impute suppressed cells: zeros, the average of the total of the suppressed cells, or a weighted average
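As a rough illustration of two of the post-tabular ideas above, the sketch below shows unbiased random rounding and a simple average-based imputation of suppressed cells. The rounding base of 3, the use of `None` to mark a suppressed cell, and the function names `random_round` and `impute_suppressed` are all assumptions made for this example; the slides do not specify them.

```python
import random

def random_round(count, base=3, rng=random):
    """Randomly round a non-negative integer count to a multiple of `base`.

    The count is rounded up with probability remainder/base, so the
    expected value of the rounded count equals the original count.
    The default base of 3 is an assumption for this sketch.
    """
    remainder = count % base
    if remainder == 0:
        return count  # already a multiple of the base, left unchanged
    lower = count - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower

def impute_suppressed(row, row_total):
    """Replace suppressed cells (marked None) by the average of the
    suppressed total, i.e. (published row total - known cells) / number
    of suppressed cells."""
    known = sum(c for c in row if c is not None)
    n_missing = sum(1 for c in row if c is None)
    if n_missing == 0:
        return list(row)
    fill = (row_total - known) / n_missing
    return [fill if c is None else c for c in row]
```

For instance, `impute_suppressed([5, None, None, 10], 21)` spreads the suppressed total of 6 evenly over the two suppressed cells.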
6Quality Measures
- Basic Statistics: number of cells and the total information in the table; number of zeros, ones, and twos; average cell size in the table; maximum and minimum average cell size for each row and column, and the standard error of these averages
- For suppressed tables: number and percent of suppressed cells and total information lost; choice of imputation method
- For random rounded tables: Binomial hypothesis test to check for bias in the rounding scheme, i.e. whether the expected numbers of cells were rounded up and down
- For all other SDC methods: paired signed-rank test to check for no change in location
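The binomial check for rounding bias could be sketched as follows. This assumes unbiased random rounding to base 3, where a cell with remainder r is rounded up with probability r/3, and it tests each remainder class separately with an exact two-sided binomial test. The base, the grouping by remainder, and the function names are assumptions for illustration, not a description of the software's actual procedure.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed one."""
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance so that ties are counted despite rounding error.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

def rounding_bias_test(original, rounded, base=3):
    """For each non-zero remainder r, count the cells rounded up and test
    the count against the expected round-up probability r/base.

    Returns {r: (n_rounded_up, n_cells, p_value)}.
    """
    results = {}
    for r in range(1, base):
        pairs = [(o, p) for o, p in zip(original, rounded) if o % base == r]
        if not pairs:
            continue
        n_up = sum(1 for o, p in pairs if p > o)
        results[r] = (n_up, len(pairs),
                      binomial_two_sided_p(n_up, len(pairs), r / base))
    return results
```

A small p-value for some remainder class would suggest that cells were rounded up more or less often than the scheme intends.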
7Quality Measures
- Distance metrics: distortions to distributions on internal cells, computed by row. The notation covers the table for row k, the number of cells in the row, the number of rows, and the cell frequency for cell c
- Hellinger's Distance (HD)
- Relative Absolute Distance (RAD)
- Average Absolute Distance per Cell (AAD)
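A minimal sketch of the three distance metrics, using their usual definitions and averaging over rows. Several choices here are assumptions rather than details taken from the slides: the Hellinger distance is applied to raw cell frequencies, cells with an original count of zero are skipped in the relative distance, and the per-row values are combined by a simple mean over rows.

```python
from math import sqrt

def hellinger(orig_row, pert_row):
    """Hellinger distance on raw cell frequencies:
    sqrt( 1/2 * sum_c (sqrt(pert_c) - sqrt(orig_c))^2 )."""
    return sqrt(0.5 * sum((sqrt(p) - sqrt(o)) ** 2
                          for o, p in zip(orig_row, pert_row)))

def relative_absolute_distance(orig_row, pert_row):
    """sum_c |pert_c - orig_c| / orig_c.
    Cells with orig_c == 0 are skipped to avoid division by zero
    (an assumption of this sketch)."""
    return sum(abs(p - o) / o for o, p in zip(orig_row, pert_row) if o != 0)

def average_absolute_distance(orig_row, pert_row):
    """sum_c |pert_c - orig_c| divided by the number of cells in the row."""
    return sum(abs(p - o) for o, p in zip(orig_row, pert_row)) / len(orig_row)

def table_metric(metric, orig, pert):
    """Average a per-row metric over all rows of the table."""
    return sum(metric(o, p) for o, p in zip(orig, pert)) / len(orig)
```

Usage: `table_metric(average_absolute_distance, original_table, perturbed_table)`, where both tables are lists of rows with the same shape.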
8Quality Measures
- Distance metrics: distortions to distributions on marginal sub-totals and totals
- The notation covers a sub-total or total of cells and the number of totals on a row k
9Quality Measures
- Impact on Tests for Independence: Cramér's V measure of association, computed from the Pearson chi-square statistic
- Also, the same measure for entropy and the Pearson statistic
- Variance of Cell Counts: computed for each row
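To make the independence measure concrete, the sketch below computes Cramér's V with its standard definition, V = sqrt(chi-square / (n * min(r - 1, c - 1))), for a two-way table of counts, and then compares the perturbed table with the original as a percent relative change. The percent relative change is an assumed way of expressing the impact; the function names are hypothetical.

```python
from math import sqrt

def pearson_chi_square(table):
    """Pearson chi-square statistic for independence in a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(pearson_chi_square(table) / (n * k))

def relative_change_percent(orig, pert):
    """Percent change in Cramér's V from the original to the perturbed table.
    Undefined when the original table shows no association (V == 0)."""
    v_orig = cramers_v(orig)
    return 100.0 * (cramers_v(pert) - v_orig) / v_orig
```

A negative value indicates that the SDC method has weakened the measured association between the row and column variables.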
10Quality Measures
- Between variance of target variables for proportions: defined from the proportion in each row k and the overall proportion, and compared between the perturbed and original tables
- For continuous variables: impact on correlations and [...]
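One way the between variance of a target proportion might be computed is sketched below. It takes the proportion of a target column within each row, measures the spread of those row proportions around the overall proportion, and reports the ratio of the perturbed value to the original value. The unweighted sum, the divisor of (number of rows minus 1), and the ratio form are assumptions for this sketch.

```python
def between_variance(table, target_col):
    """Variance of the row proportions of `target_col` around the
    overall proportion, with divisor (number of non-empty rows - 1)."""
    rows = [row for row in table if sum(row) > 0]  # ignore empty rows
    row_props = [row[target_col] / sum(row) for row in rows]
    overall = (sum(row[target_col] for row in rows)
               / sum(sum(row) for row in rows))
    return sum((p - overall) ** 2 for p in row_props) / (len(rows) - 1)

def between_variance_ratio(orig, pert, target_col):
    """Ratio of the perturbed to the original between variance.
    Values below 1 mean rows look more alike after perturbation."""
    return between_variance(pert, target_col) / between_variance(orig, target_col)
```

For example, with `orig = [[1, 3], [3, 1]]` the row proportions for column 0 are 0.25 and 0.75 around an overall proportion of 0.5, giving a between variance of 0.125.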
11Quality Measures
- Impact on Rank Correlations: sort the original cell counts and define deciles; repeat on the perturbed cell counts. The measure counts the cells whose decile group changes, where I is the indicator function and the divisor is the number of rows
- Log Linear Analysis: ratio of the deviance (likelihood ratio test statistic) between the perturbed table and the original table for a given model
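Both measures on this slide can be sketched in a few lines. The first function assigns decile groups by rank within a column and reports how many cells moved to a different group; the example output later in the deck reports 25 cells moved out of 70 wards, which is 35.71 percent. The second computes the deviance for the independence model of a two-way table. Breaking ties by position and choosing the independence model are assumptions; the slides only say "a given model".

```python
from math import log

def decile_groups(values):
    """Assign each value a decile group 0..9 by rank.
    Ties are broken by position (Python's sort is stable)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, idx in enumerate(order):
        groups[idx] = rank * 10 // len(values)
    return groups

def percent_moved(orig_col, pert_col):
    """Return (cells moved, percent moved): cells whose decile group
    differs between the original and perturbed column."""
    g_orig = decile_groups(orig_col)
    g_pert = decile_groups(pert_col)
    moved = sum(1 for a, b in zip(g_orig, g_pert) if a != b)
    return moved, 100.0 * moved / len(orig_col)

def independence_deviance(table):
    """Likelihood ratio statistic G^2 = 2 * sum O * ln(O / E) for the
    independence model of a two-way table (zero cells contribute 0)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                g2 += 2 * observed * log(observed * n / (row_totals[i] * col_totals[j]))
    return g2

def deviance_ratio(orig, pert):
    """Deviance of the perturbed table divided by that of the original.
    Undefined when the original deviance is zero."""
    return independence_deviance(pert) / independence_deviance(orig)
```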
12Risk Measures
- For Census Data
- Proportion of small cells (ones and twos) that were changed
- For Sample Data
- Probability that a one in the table/microdata is a population unique
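The census risk measure is simple enough to sketch directly: among cells that were ones or twos in the original table, what share has a different value after disclosure control? The function name and the flat-list input are assumptions for illustration. A suppressed cell marked `None` counts as changed.

```python
def percent_small_cells_changed(orig_cells, pert_cells, small=(1, 2)):
    """Percent of original small cells (ones and twos by default)
    whose value differs in the disclosure controlled table.
    Returns None when the original table has no small cells."""
    small_pairs = [(o, p) for o, p in zip(orig_cells, pert_cells) if o in small]
    if not small_pairs:
        return None
    changed = sum(1 for o, p in small_pairs if p != o)
    return 100.0 * changed / len(small_pairs)
```

Under rounding to base 3, every one and two must become 0 or 3, so this measure is 100 percent by construction.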
13- Part II - Software Application
14Software Application in SAS
- Compares original outputs to disclosure controlled outputs
15Windows of the program
16Software Application
20Example
- Census table at ward level
- Sex (2)
- Long-term illness (2)
- Economic status (9)
- Wards (70)
21Example Output in html format
22Example Output in html format
Number of small cells: 226 (8.97%)
23Example Output in html format
Number of suppressed cells: 254
24Basic Measures of Distortion
25Basic Measures of Distortion
Column   Cells Moved   Percent Moved
a4       25            35.71
26Basic Measures of Distortion
Absolute Average Distance: 0.1358
27More Complex Measures of Distortion
30Disclosure Risk Assessment
31Disclosure Risk Assessment
Percent of 1s and 2s changed: 100.00
32Additional Features
- Error messages (specifying cause of error)
- Easy to use (click on icon and it runs)
- Handout explaining measures in simple terms
33- Part III - Conclusions and Future Work
34Risk-Utility Confidentiality Map
35Conclusions and Future Work
- Emergence of some guidelines
- - Skewed tables (one or two large columns and the rest small columns): prefer rounding to cell suppression
- - Uniform tables: less information loss due to SDC methods, so choose the method with the least changes to the table
- - Sparse tables: need to have benchmarked totals, so use controlled rounding (if possible) or semi-controlled random rounding
- Quality measures for users, and guidance on how to allow for statistical analysis with disclosure controlled statistical data
36Contact Details
- Natalie Shlomo
- n.shlomo_at_soton.ac.uk
- Caroline Young
- cjy_at_soton.ac.uk