Title: Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata
1 Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata
- Paul Massell and Jeremy Funk
- Statistical Research Division
- U.S. Census Bureau
- Washington, DC 20233
- Paul.B.Massell_at_census.gov
2 Talk Outline
- Overview of EZS Noise
- Measuring Effectiveness of Perturbative Protection
- Noise Applied to Weighted Data
- Noise Applied to Unweighted Data: Random vs. Balanced Noise
- Conclusions and Future Research
3 The EZS Noise Method (Evans, Zayatz, Slanta)
- Developed by Tim Evans, Laura Zayatz, and John Slanta in the 1990s
- Multiplicative noise is added to the underlying microdata, before table creation
- A noise factor or multiplier is randomly generated for each record
4 The EZS Noise Method (Evans, Zayatz, Slanta)
- The distribution of the multipliers should produce unbiased estimates, and ensure that no multipliers are too close to 1
- Weights, both known and unknown to users, are combined with the noise factors to obtain noisy values for all records
- When tabulated, in general, sensitive cells are changed quite a bit and non-sensitive cells are changed only by a small amount
5 Attractive Features of EZS
- Tables with noisy data are created in the same way as the original tables: simply replace var X with var X-noisy
- Tables are automatically additive
- An approximate value could be released for every cell (depends on agency policy)
- No Complementary Suppressions
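The additivity claim can be illustrated with a small sketch. The records, category labels, and the placeholder noise draw below are hypothetical and are not taken from the talk. The point is only that tabulating the noisy variable with the same code as the original variable yields interior cells that sum to their marginal totals.

```python
import random
from collections import defaultdict

# Hypothetical microdata: (row category, column category, value X).
records = [("A", "u", 120.0), ("A", "v", 80.0), ("B", "u", 300.0), ("B", "v", 45.0)]

rng = random.Random(0)
# Placeholder noise (an assumption): random direction, random size per record.
noisy = [(r, c, x * (1 + rng.choice([-1, 1]) * rng.uniform(0.05, 0.15)))
         for r, c, x in records]

def tabulate(rows):
    """Sum values into interior cells and row totals with one pass."""
    interior = defaultdict(float)
    row_totals = defaultdict(float)
    for r, c, v in rows:
        interior[(r, c)] += v
        row_totals[r] += v
    return interior, row_totals

# The same tabulation code is used; only the input variable changes.
interior, row_totals = tabulate(noisy)
```

Because every cell is a sum of the same noisy record values, no separate adjustment step is needed to restore additivity.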
6 Attractive Features of EZS
- Linked tables and special tabs are automatically protected consistently
- EZS allows for protection at the company level (Census requirement)
- Ease of implementation compared to methods such as cell suppression
7 Measuring Effectiveness of the EZS Method
- Step 1: Determine which cells in a table are sensitive, e.g., using the p% Sensitivity Rule
- Step 2: Measure level of protection to sensitive cells (using protection multipliers)
- Step 3: Measure amount of perturbation to non-sensitive cells (via change graph)
8 The p% Sensitivity Rule
- Unweighted Data
- Let T = cell total; x1, x2 = top 2 contributions
- Let rem denote remainder
- Set rem = T - (x1 + x2)
- Let prot denote suggested protection
- Set prot = (p/100) x1 - rem
- If prot > 0, when Contributor 2 tries to estimate x1, rem does NOT provide enough uncertainty; additional protection is needed; noise may provide this uncertainty
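A minimal sketch of the unweighted rule for a single cell follows. The function name and the example contributions are hypothetical. Treating a one-contributor cell as having x2 = 0 is an assumption made here for convenience, not something stated on the slide.

```python
def p_percent_prot(contributions, p):
    """Return (rem, prot) for one unweighted cell under the p% rule."""
    xs = sorted(contributions, reverse=True)
    x1 = xs[0]
    x2 = xs[1] if len(xs) > 1 else 0.0  # assumption: a lone contributor has no x2
    total = sum(xs)                     # T, the cell total
    rem = total - (x1 + x2)             # remainder after the top 2 contributions
    prot = p * x1 / 100 - rem           # suggested protection
    return rem, prot

# Hypothetical cell with four contributors and p = 10.
rem, prot = p_percent_prot([100, 50, 3, 2], p=10)
is_sensitive = prot > 0  # rem = 5 and prot = 10 - 5 = 5, so the cell is sensitive
```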
9 p% Sensitivity Rule
- Weighted Data
- TA = Fully Weighted Cell Estimate
- X1 = Largest Cell Respondent Contribution
- X2 = 2nd Largest Cell Contribution
- wkn = Known Weights
- wun = Unknown Weights
10 Extended p% Rule with Weights and Rounding
- rem = TA - (X1 wkn1 + X2 wkn2)
- prot = ((p/100) X1 wkn1) - rem
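The two formulas can be written as a short function, in the same spirit as the unweighted rule. This is only a sketch: it reads the juxtaposed terms as products, takes TA as a given input rather than computing it, and does not model the rounding mentioned in the slide title. The function name and the numbers are hypothetical.

```python
def extended_prot(ta, x1, wkn1, x2, wkn2, p):
    """Return (rem, prot) for one weighted cell from the extended p% rule."""
    rem = ta - (x1 * wkn1 + x2 * wkn2)   # remainder after the top 2 known-weighted values
    prot = p * x1 * wkn1 / 100 - rem     # suggested protection
    return rem, prot

# Hypothetical cell: a large remainder means no extra protection is needed.
rem_safe, prot_safe = extended_prot(ta=1000.0, x1=200.0, wkn1=2.0, x2=100.0, wkn2=2.0, p=10)

# Hypothetical cell: a small remainder leaves prot > 0, so the cell is sensitive.
rem_sens, prot_sens = extended_prot(ta=620.0, x1=200.0, wkn1=2.0, x2=100.0, wkn2=2.0, p=10)
```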
11 Measuring the Effectiveness of a Perturbative Protection Method
- Protection of Sensitive Cells
- Define Protection Multiplier (PM)
- PM = abs(perturbation) / prot
- Find how many (or what %) have PM < 1
- Data Quality
- Important: change for non-sensitive cells
- Less important: over-perturbation for sensitive cells
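The protection measure can be sketched as follows, taking "perturbation" to mean the difference between the noisy and original cell totals (an interpretation, not spelled out on the slide). The cell values are hypothetical, and prot is assumed to be positive because only sensitive cells are assessed.

```python
def protection_multiplier(original_total, noisy_total, prot):
    """PM = abs(perturbation) / prot for one sensitive cell (prot > 0)."""
    return abs(noisy_total - original_total) / prot

# Hypothetical sensitive cells: (original total, noisy total, prot).
cells = [(1000.0, 1080.0, 50.0), (400.0, 390.0, 25.0), (250.0, 270.0, 20.0)]

pms = [protection_multiplier(o, n, prot) for o, n, prot in cells]
under_protected = sum(pm < 1 for pm in pms)          # cells with PM < 1
percent_under = 100 * under_protected / len(pms)     # share of such cells
```

Here the three PM values are 1.6, 0.4, and 1.0, so one cell of three falls below 1.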
12 EZS Noise Factors for Unweighted Data
- Let X = original microdata value
- Let Y = perturbed value
- Let M = noise multiplier, i.e. a draw from a specified noise distribution of EZS type
- Y = X * M
13 Noise Distribution used for all examples (a = 1.05, b = 1.15): 5% to 15% noise
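A minimal sketch of drawing multipliers with these parameters is given below. The talk does not state the shape of the distribution here, so the sketch assumes a uniform draw on [a, b] that is reflected about 1 with probability one half. That choice is an assumption made for illustration. It keeps every multiplier at least 5% away from 1 and gives the multipliers a mean of 1. The function name and the sample values are hypothetical.

```python
import random

A, B = 1.05, 1.15  # the a and b values quoted for the examples

def draw_multiplier(rng, a=A, b=B):
    """Draw M from [2-b, 2-a] or [a, b] (assumed uniform, symmetric about 1)."""
    magnitude = rng.uniform(a, b)      # a value between a and b
    if rng.random() < 0.5:
        return magnitude               # noise direction: up
    return 2.0 - magnitude             # noise direction: down, reflected about 1

rng = random.Random(12345)
x_values = [250.0, 1200.0, 40.0]       # hypothetical microdata values X
noisy_values = [x * draw_multiplier(rng) for x in x_values]   # Y = X * M

# A larger sample to check the two properties asked of the distribution.
sample = [draw_multiplier(rng) for _ in range(10000)]
mean_m = sum(sample) / len(sample)     # should be close to 1
```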
14 Noise Applied to Weighted Data
- Key idea: weights (e.g., sample weights) provide protection to microdata, since users typically know weights only roughly (except when close to 1)
- Not necessary to apply full M factor to X unless w = 1
15 EZS Noise Factor for Weighted Data
- Weighted Data
- For a simple weight w with associated uncertainty interval at least as wide as 2bw, the noise factor S can be combined with w to form the Joint Noise-Weight Factor
16 Noise Formula for Known and Unknown Weights
- Calculation of Perturbed Values
- wkn is the known weight
- wun is the unknown weight.
17 Noise for Weighted Data: Commodity Flow Survey (CFS)
- Measures flow of goods via transport system in U.S.
- Estimates volume and value of each commodity shipped by origin, destination, modes of transport
- Used for transport modeling, planning, ...
- Some users have objected to disclosure suppressions
18 Effect of Noise on High Level Aggregate Cells
- CFS Table: National 2-Digit Commodity
- Data Quality Measure: 43 cells, 0 are sensitive
- 41 cells change by 0% to 1%
- 2 cells change by 1% to 2%
19 CFS Test Table
- (Origin State by Destination State by 2-digit Commodity)
- 61,174 cells, of which 230 are sensitive
- Data Quality and Protection Assessments (following slides)
20 CFS Noise Results: Data Quality Assessment
- While some cells may receive large doses of noise, the vast majority get less than 1% or 2%
21 CFS Random Noise: Protection Assessment
- Most sensitive cells receive significant noise, i.e. 5% to 11%
- Only 2 out of 230 sensitive cells do not receive full protection from noise, as measured by Protection Multipliers (PM)
22 Noise for Unweighted Data: Non-Employers Statistics
- Special Features of Microdata
- Unweighted administrative data
- Only 1 variable to protect: receipts
- Many small integers (after rounding to 1000)
- Special Features of Key Table
- Many cells have a small number of contributors; these include many safe cells
- Many sensitive cells with only 1 or 2 contributors
23 NE Noise Results: Data Quality Assessment
- Lack of weights results in much more distortion to non-sensitive cells than occurs for CFS
24 NE Noise Results: Protection Assessment
- Resembles noise factor distribution, due to prevalence of 1-respondent cells in NE test table and no weights
25 Noise Balancing
- Is there a way to improve data quality in this situation?
- Yes, if one can focus on one key table T
- Idea: balance noise at each cell in balancing sub-table B of T (defn: every micro value is in at most one cell of B)
- Choose noise directions to maximize noise cancellation for each cell of B
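One way to picture the choice of noise directions within a single cell of B is the greedy sketch below. It is not the authors' procedure. The greedy ordering, the uniform draw of noise sizes, and the function name are all assumptions made for illustration, and a greedy pass only approximates maximal cancellation. Values are assumed to be non-negative.

```python
import random

def balance_cell(values, rng, a=1.05, b=1.15):
    """Return one multiplier per record, choosing directions to offset net noise."""
    # Noise size per record, between a - 1 and b - 1 (assumed uniform).
    sizes = [rng.uniform(a, b) - 1.0 for _ in values]
    # Handle records with the largest noise amounts first.
    order = sorted(range(len(values)), key=lambda i: values[i] * sizes[i], reverse=True)
    multipliers = [1.0] * len(values)
    net = 0.0  # running net noise added to the cell total
    for i in order:
        amount = values[i] * sizes[i]
        if net == 0:
            direction = rng.choice([-1.0, 1.0])   # nothing to offset yet
        else:
            direction = -1.0 if net > 0 else 1.0  # push against the current net noise
        net += direction * amount
        multipliers[i] = 1.0 + direction * sizes[i]
    return multipliers

rng = random.Random(7)
cell = [100.0, 90.0, 50.0, 40.0, 10.0]   # hypothetical records in one cell of B
multipliers = balance_cell(cell, rng)
net_noise = sum(x * (m - 1.0) for x, m in zip(cell, multipliers))
```

Each record still receives a multiplier at least 5% away from 1, so only the directions are coordinated. In a cell with a single record there is nothing to cancel, and the record keeps its full noise.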
26 Noise Balancing: Supportive NE Characteristics
- Balancing works especially well for NE because a high % of microdata is single unit
- After balancing interior cells, need to check noise effect on aggregate cells in same table
- Also need to check noise effect in higher and lower tables; these we call trickle up and trickle down effects
- For NE, there are few of these other tables; this makes balancing decision easier
27 NE Balanced Noise: Data Quality Assessment
- Vast improvement in data quality
- Resembles that of weighted data in CFS
28 NE Balanced Noise: Protection Assessment
- Very similar to Random Noise application
- 91.7% of sensitive cells fully protected
29 Random Noise vs. Balanced Noise: Non Employer Test Data
Method      Percent Fully Protected (PM > 1)
Random      92.14
Balanced    91.70
- Data Quality is greatly improved
- Protection Level is not significantly reduced
- Thus Balanced Noise is a Good Choice Here
- PM density curves on [0,1] are nearly identical for the 2 methods
30 Conclusions
- EZS Noise is a useful method for protecting tables from a variety of economic programs
- There are now several variations of the basic EZS method; which is best for a survey depends on both microdata and table characteristics
31 Future Research
- 1. Should some sensitive cells be suppressed? Should high-noise cells be flagged?
- 2. How to handle multiple variables?
- 3. What is the most that users can be told about the noise process without compromising data protection?
- 4. How to handle company dynamics (births, deaths, mergers, ...)?
- 5. How to coordinate survey protection?