Title: Combinations of SDC methods for continuous microdata
1 Combinations of SDC methods for continuous microdata
Anna Oganian
National Institute of Statistical Sciences
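2 Introduction
- Statistical disclosure control (SDC) methods have two goals:
  - protect confidentiality (keep disclosure risk low)
  - preserve the utility of the released data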
Methods for continuous microdata
- Rankswapping
- Additive noise
- Resampling
- Microaggregation
  - microaggregation based on one variable
  - microaggregation based on several variables
3 Why combinations?
Methods have very different properties, so by combining them we can improve utility.
Example: microaggregation with z-scores projection
[Figure: red points are microaggregated data; green points are original data]
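For data close to normal, we can add normal noise to the microaggregated data:
Mic(O) + N(0, Cov(O) - Cov(Mic(O)))
The noise covariance is chosen so that the covariance of the released data matches Cov(O).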
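The idea of adding compensating noise can be sketched in a few lines. This is a minimal illustration, not the author's implementation: it assumes that "z-scores projection" means sorting records by the sum of their z-scores and averaging groups of k consecutive records, and the function names `microaggregate_zscore` and `add_compensating_noise` are hypothetical.

```python
import numpy as np

def microaggregate_zscore(data, k=3):
    """Microaggregate using a z-scores projection (assumed: sum of z-scores).

    Records are sorted by the projection, cut into groups of k consecutive
    records (the last group absorbs any remainder), and each record is
    replaced by its group mean. Columns are assumed to be non-constant.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    order = np.argsort(z.sum(axis=1))
    masked = np.empty_like(data)
    n_groups = max(1, n // k)
    for g in range(n_groups):
        start = g * k
        stop = (g + 1) * k if g < n_groups - 1 else n
        idx = order[start:stop]
        masked[idx] = data[idx].mean(axis=0)
    return masked

def add_compensating_noise(original, masked, rng):
    """Return Mic(O) + N(0, Cov(O) - Cov(Mic(O)))."""
    diff = np.cov(original, rowvar=False) - np.cov(masked, rowvar=False)
    noise = rng.multivariate_normal(np.zeros(diff.shape[0]), diff, size=len(masked))
    return masked + noise

# Illustrative data only: 5000 records from a correlated bivariate normal.
rng = np.random.default_rng(0)
original = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
masked = microaggregate_zscore(original, k=3)
released = add_compensating_noise(original, masked, rng)
```

Because microaggregation replaces records by group means, Cov(O) - Cov(Mic(O)) is the within-group covariance and is therefore positive semidefinite, which is what makes it usable as a noise covariance.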
4 Performance measures
- Propensity score utility measure (Mi-Ja work)
- Two kinds of disclosure risk (DR):
  - identification disclosure
  - attribute disclosure
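The slides name the propensity score utility measure without stating it. For orientation, one commonly used form is given here as an assumption rather than as the exact definition behind the reported numbers. The original and masked files are merged, a model estimates for each record i the probability (propensity) that it comes from the masked file, and the estimates are compared with the share c of masked records in the merged file:

```latex
U_p = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{p}_i - c \right)^2
```

Here N is the number of records in the merged file and \hat{p}_i is the estimated propensity of record i. Values near 0 mean that masked and original records are hard to tell apart, which indicates high utility. Any scaling applied to the values reported later in the slides is not stated.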
5 DR
- Identification disclosure
  - Disclosure is considered to occur when the intruder can correctly identify a record in the released data file, that is, link it to a particular individual.
- Attribute disclosure
  - The intruder's target is the original value of a particular attribute, for example the salary of a particular individual.
  - So attribute disclosure measures the gain in information about some attribute that the intruder achieves once the masked data are released. More precisely: how tight are the bounds that can be found for the original values given the masked data?
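Identification disclosure is often quantified by attempted record linkage. The sketch below shows one simple distance-based variant. It is an assumption for illustration, not necessarily the measure behind the figures reported later, and the function name `linkage_rate` is hypothetical.

```python
def linkage_rate(original, masked):
    """Fraction of masked records whose nearest original record is the true source.

    Record i of `masked` is assumed to be the masked version of record i of
    `original`. Nearest means smallest squared Euclidean distance; ties are
    resolved in favour of the first original record found.
    """
    if not masked:
        return 0.0
    hits = 0
    for i, m in enumerate(masked):
        dists = [sum((a - b) ** 2 for a, b in zip(m, o)) for o in original]
        if dists.index(min(dists)) == i:
            hits += 1
    return hits / len(masked)

# Toy data: light masking keeps every record closest to its own source.
original = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
lightly_masked = [(1.0, 0.0), (9.0, 11.0), (19.0, 1.0)]
rate = linkage_rate(original, lightly_masked)
```

In practice the variables would usually be standardized first so that no single attribute dominates the distance.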
6 Examples of attribute disclosure for several methods
- Assumption: the SDC method and its parameters are released together with the data set
Rankswapping
Upper and lower bounds can be derived for every value in the masked data.
If the rankswapping algorithm is known, the distribution of the values within these intervals can be found by the intruder by running rankswapping a large number of times on a vector of length N.
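The simulation described above can be sketched as follows. The swapping rule is an assumption, since the slides do not spell it out: each not-yet-swapped rank i is exchanged with a random not-yet-swapped rank at most N*p/100 positions above it. The function names are hypothetical and ranks are 0-based.

```python
import random

def rank_swap_permutation(n, p, rng):
    """One run of a simple rank-swapping scheme on ranks 0..n-1.

    Returns `partner`, where partner[r] is the rank exchanged with rank r
    (or r itself if rank r was left unswapped).
    """
    window = max(1, int(n * p / 100))
    partner = list(range(n))
    done = [False] * n
    for i in range(n):
        if done[i]:
            continue
        candidates = [j for j in range(i + 1, min(n, i + window + 1)) if not done[j]]
        if candidates:
            j = rng.choice(candidates)
            partner[i], partner[j] = j, i
            done[i] = done[j] = True
    return partner

def original_rank_interval(n, p, masked_rank, runs=1000, level=0.95, seed=0):
    """Simulated interval for the original rank behind a given masked rank."""
    rng = random.Random(seed)
    draws = sorted(rank_swap_permutation(n, p, rng)[masked_rank] for _ in range(runs))
    # Crude empirical quantiles of the simulated original ranks.
    lo = draws[int((1 - level) / 2 * (runs - 1))]
    hi = draws[int((1 + level) / 2 * (runs - 1))]
    return lo, hi

# Interval for the record holding the largest masked value (small sizes for speed).
lo, hi = original_rank_interval(n=200, p=5, masked_rank=199, runs=200)
```

Since rankswapping only permutes values within a variable, the intruder can sort the released values and read off the values at ranks `lo` and `hi` to turn the rank interval into an interval for the original value.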
7
Suppose data set X has 1000 records and variable j in data set X is lognormal. Rankswapping with p = 5 was applied to this data set. The range of the variable is [0.04, 25.57].
Choose a value in the dense area of the masked data, x_j = 0.50: the lower and upper bounds for the corresponding original value are [0.42, 0.58].
Consider the largest value in the masked data, x = 25.57: using the distribution for the highest rank, we can find the 95% confidence interval [5.20, 6.92].
Consider the smallest value in the masked data, x = 0.04: the 95% confidence interval for the corresponding original value is [0.07, 0.019].
8
- Noise addition
  - The variance of the added noise is known and its mean is 0, so 100(1 - α)% confidence regions around masked records x_m can be computed from the multivariate normal distribution.
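A minimal sketch of such a confidence region, assuming the noise covariance is known to the intruder and restricting to two variables so that the chi-square quantile has a closed form. The function name `in_confidence_region` is hypothetical.

```python
import math
import numpy as np

def in_confidence_region(x_candidate, x_masked, noise_cov, alpha=0.05):
    """Check whether x_candidate lies in the 100(1 - alpha)% region around x_masked.

    Assumes x_masked = x_original + e with e ~ N(0, noise_cov), so the squared
    Mahalanobis distance between them follows a chi-square distribution.
    This sketch handles exactly two variables.
    """
    diff = np.asarray(x_masked, dtype=float) - np.asarray(x_candidate, dtype=float)
    if diff.shape != (2,):
        raise ValueError("this sketch handles exactly two variables")
    d2 = float(diff @ np.linalg.solve(np.asarray(noise_cov, dtype=float), diff))
    # Chi-square quantile with 2 degrees of freedom: -2 * ln(alpha).
    threshold = -2.0 * math.log(alpha)
    return d2 <= threshold

# Illustrative values only.
noise_cov = np.array([[0.2, 0.05], [0.05, 0.1]])
inside = in_confidence_region([3.1, -0.9], [3.0, -1.0], noise_cov)
```

The region is an ellipse centred on the masked record. The smaller the noise covariance, the tighter the bounds the intruder obtains for the original values.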
Example
9 Several stages of masking
- Ideally, the security of an SDC method should be guaranteed by the masking algorithm itself and should not depend on keeping the parameters or details of the algorithm secret.
In cryptography: Data Encryption Standard (DES)
10 Combinations of the methods
Original data → [Masking M1] → M1(Original) → [Masking M2] → M2(M1(Original))
- Decreases the risk
- We can even increase the utility of the resulting data if we combine the methods properly!
- For example:
  - Combine microaggregation with noise
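Staged masking of the form M2(M1(Original)) is ordinary function composition. The sketch below illustrates it with two toy univariate masking steps. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random
from functools import reduce

def compose_masking(*methods):
    """Return a pipeline that applies the masking methods left to right (M1, then M2, ...)."""
    def pipeline(data):
        return reduce(lambda current, method: method(current), methods, data)
    return pipeline

def microaggregate_1d(values, k=3):
    """Toy univariate microaggregation: sort, group k consecutive values, use group means."""
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = list(values)
    n_groups = max(1, len(values) // k)
    for g in range(n_groups):
        start = g * k
        stop = (g + 1) * k if g < n_groups - 1 else len(values)
        idx = order[start:stop]
        mean = sum(values[i] for i in idx) / len(idx)
        for i in idx:
            masked[i] = mean
    return masked

def add_noise_1d(values, sd=0.1, seed=0):
    """Toy additive noise: independent N(0, sd^2) noise on every value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sd) for v in values]

# M2(M1(Original)) with M1 = microaggregation and M2 = noise addition.
two_stage = compose_masking(microaggregate_1d, add_noise_1d)
released = two_stage([5.0, 1.0, 3.0, 2.0, 4.0, 6.0])
```

Keeping each method as a plain function from data to data makes it easy to try different orders and numbers of stages.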
11 Several stages of masking
Or, in the general case,
where
12 Combinations of methods
- Microaggregation using z-scores projection, p = 3 → Microaggregation using z-scores projection, p = 3 (Micz03_Micz03)
- Microaggregation using z-scores projection, p = 3 → Microaggregation using principal component projection, p = 3 (Micz03_Micpcp03)
- Microaggregation using z-scores projection, p = 3 → Multivariate microaggregation, p = 10 (Micz03_Micmul10)
- Microaggregation using z-scores projection, p = 3 → Rankswapping, p = 1 (Micz03_Rank1)
- Single microaggregation using z-scores projection, p = 3 (Micz03)
13 Propensity score utility
                  sym       sym       sym       sym       nonsym    nonsym    nonsym    nonsym
                  high cor  high cor  low cor   low cor   high cor  high cor  low cor   low cor
                  pos       neg       pos       neg       pos       neg       pos       neg
micz03__micz03    23.5      28.3      37.3      33.9      45.1      180.1     40        94.1
micz03__micpcp03  12.8      9.3       7.9       9.3       5         3.4       8.6       5.8
micz03__micmul10  18.4      16.5      14.8      13        5.5       9.2       11.2      8.7
micz03__rank1     28.6      34.8      27.3      29.4      14.8      42        39.5      26.5
micz03            128.1     281.5     132.1     233.4     592.1     639.4     463.8     639
14 Identification DR
                  sym       sym       sym       sym       nonsym    nonsym    nonsym    nonsym
                  high cor  high cor  low cor   low cor   high cor  high cor  low cor   low cor
                  pos       neg       pos       neg       pos       neg       pos       neg
micz03__micz03    0.0275    0.0015    0.0035    0.0023    0.0033    0.0004    0.0033    0.0009
micz03__micpcp03  0.0198    0.2516    0.0133    0.0029    0.0806    0.2926    0.0477    0.18
micz03__micmul10  0.0077    0.0046    0.0203    0.0025    0.1122    0.0947    0.0071    0.1265
micz03__rank1     0.0092    0.0119    0.0087    0.0067    0.034     0.0079    0.0096    0.0091
micz03            0.0036    0.0025    0.0024    0.0019    0.0043    0.0011    0.0012    0.0011