Title: Disclosure control of analytical outputs
1. Disclosure control of analytical outputs
- Felix Ritchie, Office for National Statistics
2. SDC in a research environment
- Almost all SDC research concerned with
- Preparing tables for publication
- Anonymising datasets for release
- Little work on characteristics of the research environment
- Practically no work on disclosiveness of analytical results
- Does this matter?
- Data custodians may want to apply inappropriate rules
- Analysts assume analytical results are too complex to be disclosive
- No clear agreement or strategy for dealing with analytical results
3. Paper aims to show that, for regressions,
- The analysts are fundamentally correct
- There are a small number of identified problems
- A simple rule is available to assess/quantify residual risk
- Concern over the nature of variables and validity of analysis is misplaced
4. Exact identification in a linear regression: 1. Direct disclosure
- Consider estimated result
- In theory, can observe K unknowns by knowing K values of coefficient vector and all other values in the data set (e.g. all Y, all X bar one observation)
- Is this a practical concern?
- Consider more realistic cases where you don't have every other variable.
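The direct-disclosure case above can be sketched numerically. This is a minimal illustration, assuming the K unknowns are K values of the dependent variable and that the intruder holds the full coefficient vector, every X and every other Y. The function name `recover_unknown_y` and the simulated data are hypothetical and do not come from the paper.

```python
import numpy as np

def recover_unknown_y(X, y_partial, unknown_idx, beta_hat):
    """Solve the OLS normal equations X'X b = X'y for the hidden entries of y.

    y_partial holds the known responses; entries at unknown_idx are ignored.
    Needs len(unknown_idx) == X.shape[1] so that the system is square.
    """
    X = np.asarray(X, dtype=float)
    y_partial = np.asarray(y_partial, dtype=float)
    known = np.ones(X.shape[0], dtype=bool)
    known[unknown_idx] = False
    # Rearranged normal equations: X_unknown' y_unknown = X'X b - X_known' y_known
    rhs = X.T @ X @ beta_hat - X[known].T @ y_partial[known]
    return np.linalg.solve(X[unknown_idx].T, rhs)

# Hypothetical data: n = 20 observations, K = 3 coefficients (incl. intercept).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=20)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # the "published" estimate

hidden = [4, 9, 15]                # K responses the intruder does not know
y_partial = y.copy()
y_partial[hidden] = np.nan         # placeholder; never read by the solver
recovered = recover_unknown_y(X, y_partial, hidden, beta_hat)
print(np.allclose(recovered, y[hidden]))
```

The sketch only works because everything else in the data set is known, which is why the slide asks whether this is a practical concern.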
5. Exact identification, cont.: 2. Disclosure by differencing, 2 variables
- Two estimates, one with an additional observation
- Solving normal equations and differencing
- 2 equations, 2 unknowns, but in general insoluble without full knowledge of variables because of inverse term
- so are the analysts correct?
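Disclosure by differencing can be illustrated with a small sketch. It assumes a model with an intercept and one regressor, y = a + b*x, and an intruder who knows X'X for the smaller sample, which amounts to full knowledge of the explanatory variable. The names `ols` and `difference_attack` and the simulated data are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def difference_attack(XtX, beta_without, beta_with):
    """Recover (x0, y0) for y = a + b*x, given X'X of the smaller sample.

    Differencing the two sets of normal equations gives
        X'X (beta_with - beta_without) = [1, x0]' * e0,
    where e0 = y0 - a_with - b_with * x0 is the new point's residual.
    """
    d = XtX @ (beta_with - beta_without)
    e0 = d[0]                  # the constant's row isolates the residual
    x0 = d[1] / e0             # breaks down if the new point lies on the line
    y0 = e0 + beta_with[0] + beta_with[1] * x0
    return x0, y0

# Hypothetical data: 30 observations, then one additional observation.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
X = np.column_stack([np.ones(30), x])
y = 0.5 + 1.5 * x + rng.normal(scale=0.4, size=30)

x_new, y_new = 2.5, 7.0
X_plus = np.vstack([X, [1.0, x_new]])
y_plus = np.append(y, y_new)

x0, y0 = difference_attack(X.T @ X, ols(X, y), ols(X_plus, y_plus))
print(x0, y0)
```

The attack needs `XtX`; without it the two differenced equations cannot be solved, which is the obstacle the slide attributes to the inverse term.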
6. Exact identification, cont.: 2. 2 variable case - exceptions
- y0 can be identified if means of variables are known
- But this only works for the mean of the additional observations
- Binary explanatory variables
- Can determine both y0 and X0
- Works because this is effectively a table
- But key point is that variable counts are summary statistics for aggregate values, i.e. can be dealt with in same framework
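The binary-regressor exception can be sketched as follows. A regression of y on a constant and a dummy reproduces the two cell means, so two such regressions differing by one observation behave like two versions of a table. The sketch assumes the intruder knows the cell counts before the extra observation was added; `recover_from_dummy` and the data are hypothetical.

```python
import math
import numpy as np

def ols(X, y):
    """OLS coefficients from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def recover_from_dummy(beta_before, beta_after, n0, n1, tol=1e-9):
    """Identify (x0, y0) from two regressions y = a + b*d with binary d.

    a is the mean of the d=0 cell and a+b the mean of the d=1 cell.
    n0, n1 are the cell counts before the extra observation was added.
    """
    means_before = (beta_before[0], beta_before[0] + beta_before[1])
    means_after = (beta_after[0], beta_after[0] + beta_after[1])
    counts = (n0, n1)
    # The cell whose mean moved is the one the new observation joined.
    moved = not math.isclose(means_before[1], means_after[1], abs_tol=tol)
    x0 = 1 if moved else 0
    # Undo the update of the cell mean to get the added value.
    y0 = (counts[x0] + 1) * means_after[x0] - counts[x0] * means_before[x0]
    return x0, y0

# Hypothetical data: 6 observations with d=0, 4 with d=1, then one more.
rng = np.random.default_rng(2)
d = np.array([0] * 6 + [1] * 4, dtype=float)
y = 3.0 + 2.0 * d + rng.normal(size=10)
X = np.column_stack([np.ones(10), d])

X_plus = np.vstack([X, [1.0, 1.0]])      # new observation has d = 1
y_plus = np.append(y, 12.0)              # and y = 12

x0, y0 = recover_from_dummy(ols(X, y), ols(X_plus, y_plus), n0=6, n1=4)
print(x0, y0)
```

If the new value happens to equal its cell mean exactly, neither mean moves and the sketch cannot tell which cell was joined.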
7. Exact identification, cont.: 2. 2 variable case - exceptions
- Binary dependent variable
- Works for linear and non-linear regressions
8. Exact identification, cont.: 2. 2 variable case - summary
- So analysts correct in general
- Specific cases with known information requirement can be identified
- Even non-linear regressions can be differenced to identify categorical values
- Results extend to K-variable case, except that
- orthogonality of regressors is not a sufficient condition for identification
- an incomplete knowledge of the matrix of explanatory variables is a sufficient condition for non-disclosiveness, unless
- a sufficient statistic for Σx_ik exists, in which case an intruder can at best only determine Σy_0i
9. Exact identification, cont.: 3. Prevention
- In general, the exact values of variables underlying a regression cannot plausibly be determined unless
- the regression consists entirely of categorical variables, or
- has a dependent binary variable
- and disclosure by differencing is only possible route for identification.
- A linear regression is completely non-disclosive if
- one or more coefficients is effectively suppressed (that is, the coefficient could not reasonably be determined from published information), and
- the relevant variable is not orthogonal to all other variables
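The orthogonality condition in the rule above can be checked mechanically. The sketch below reads "not orthogonal to all other variables" as "correlated with at least one other regressor", which is an interpretation rather than something the slides spell out. It also shows why the condition matters: when the suppressed variable is orthogonal to the rest, the published coefficients coincide with a complete regression on the remaining variables. The function names and data are hypothetical.

```python
import numpy as np

def is_orthogonal_to_rest(X, col, tol=1e-10):
    """True if column `col` has zero cross-product with every other column."""
    X = np.asarray(X, dtype=float)
    rest = np.delete(X, col, axis=1)
    target = X[:, col]
    # Cosine of the angle between the chosen column and each other column.
    cos = (rest.T @ target) / (np.linalg.norm(rest, axis=0) * np.linalg.norm(target))
    return bool(np.all(np.abs(cos) < tol))

def suppression_protects(X, suppressed_col):
    """Second condition of the rule: the suppressed variable is not orthogonal."""
    return not is_orthogonal_to_rest(X, suppressed_col)

y = np.array([2.0, 0.0, 3.0, 1.0])

# Orthogonal design: the second column sums to zero against the constant.
X_orth = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
full = np.linalg.lstsq(X_orth, y, rcond=None)[0]
reduced = np.linalg.lstsq(X_orth[:, :1], y, rcond=None)[0]
# Hiding full[1] changes nothing: full[0] is already a complete regression.
same_intercept = bool(np.isclose(full[0], reduced[0]))

# Correlated design: the second column is a trend, not orthogonal to the constant.
X_corr = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])

print(suppression_protects(X_orth, 1), suppression_protects(X_corr, 1), same_intercept)
```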
10. Approximate disclosure: 4. Calculating the prediction error
- Calculate the variance for an individual data point
- Can't be determined without exact value of X? Can if you have x1 because
- So all you need is R²
11. Approximate disclosure: 4. Prediction errors, continued
- Can calculate the minimum predictive error for any data point included in regression by substituting largest value
- Can calculate the minimum predictive error for new data points by using amended formula
- But to do this without full knowledge of explanatory variables requires full set of coefficients
- Suppressing coefficients prevents prediction error being assessed
- Even if coefficients are insignificant
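The formulas on these two slides did not survive in this text, so the following sketch uses the standard textbook OLS expressions instead, which may differ from the ones the author presented. For a point included in the regression the residual variance is σ²(1 - h), and for a new point the prediction-error variance is σ²(1 + h), with h = x0'(X'X)⁻¹x0. The function name and data are hypothetical.

```python
import numpy as np

def prediction_error_variance(X, y, x0, in_sample):
    """Textbook OLS prediction-error variance at the regressor vector x0.

    in_sample=True  : sigma^2 * (1 - h), the residual variance of a fitted point
    in_sample=False : sigma^2 * (1 + h), the forecast-error variance of a new point
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    n, k = X.shape
    XtX = X.T @ X
    beta = np.linalg.solve(XtX, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # unbiased error-variance estimate
    h = x0 @ np.linalg.solve(XtX, x0)         # leverage x0'(X'X)^{-1} x0
    return sigma2 * (1 - h) if in_sample else sigma2 * (1 + h)

# Hypothetical data: 40 observations, intercept plus two regressors.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([1.0, 0.8, -1.2]) + rng.normal(scale=0.5, size=40)

v_in = prediction_error_variance(X, y, X[0], in_sample=True)
v_new = prediction_error_variance(X, y, X[0], in_sample=False)
print(v_in, v_new)
```

Both quantities need either the explanatory variables or the full coefficient vector, in line with the slide's point that suppressing coefficients prevents the prediction error from being assessed.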
12. Non-linear regressions
- Above analysis requires
- Not the case for non-linear models
- Differencing not an issue
- May be other issues not identified yet
13. Does statistical validity matter?
- Above analysis carried out without reference to types of variables
- Multicollinearity, measurement error, influential points, outliers, public variables etc. not necessary to prove regressions disclosive
- Bad regressions don't make for safe regressions
- Good regressions don't make safe regressions
14.
- Felix Ritchie
- Microeconomic Analysis
- Office for National Statistics
- felix.ritchie_at_ons.gov.uk