Title: Statistical Databases
1Statistical Databases Query Auditing
- Li Xiong
- CS573 Data Privacy and Anonymity
Partial slides credit Vitaly Shmatikov, Univ
Texas at Austin
2Query Audit Problem
- Maintaining privacy of data
Auditor
Database
Does answer to Q combined with answers to Q1,,Qn
reveal something?
3Variations of the Problem
Specifies subset of the variables
Auditor
Database
List of real, integer, or Boolean values
Wants to learn value of some variable
Min, max, median, sum, average, or count of
specified subset
4Offline vs. Online
- Offline auditing
- Given a collection of queries and answers to
them, check whether anything forbidden was
revealed - Detects privacy breaches after the fact
- Online auditing
- Queries are presented to auditor one at a time
auditor checks if answering the current query (in
combination with past answers) reveals
forbidden information - Prevents privacy breaches on-the-fly
5Disclosure measures
- Full disclosure (exact-value disclosure) the
exact value of a protected attribute is disclosed - Partial disclosure (interval-based disclosure)
the disclosed range (difference of the lower and
upper bounds) of the protected attribute is
smaller than a predefined threshold - Probability-based disclosure the posterior
distribution of the data after answering queries
is significantly different from its prior
distribution
6Offline Auditors for Full Disclosure
- Sum queries
- Max and min queries
- Median and average queries
7Audit Expert (Chin 1982)
- Query auditing method for SUM queries
- A SUM query can be considered as a linear
equation - where is whether record i belongs to the
query set, xi is the sensitive value, and q is
the query result - A set of SUM queries can be thought of as a
system of linear equations - Maintains the binary matrix representing linearly
independent queries and update it when a new
query is issued - A row with all 0s except for ith column indicates
disclosure
8Offline Auditing for Full Disclosure
- Arbitrary combinations of aggregate queries
- It is unlikely there will be efficient on-line
application algorithms for SUM MAX queries, SUM
MIN queries, or SUM MAX MIN queries. - Example hardness results
- Theorem. There is no polynomial time
full-disclosure auditing algorithm for sum and
max queries unless PNP.
9Auditing Sum Queries on Booleans
- Database collection of secret Boolean variables
- Query specifies subset S of variables
- Answer sum of variables in S
- Privacy breach after asking several queries,
user learns the value of some secret variable(s) - Auditing problem given a set of Boolean
equations, is there a variable that has the same
value in all solutions? - Auditing Boolean attributes, Kleinberg, 2000
10Why Is This Interesting?
- Linear Diaphantine equations
- Query can be safe on real-valued, unbounded data,
but reveal information when the data are
discrete, with known bounds - Hardness results the auditing problem for
Boolean values is coNP-complete.
x y w 1 y z 1 x z 1
Real multiple solutions, secure Boolean unique
solution, insecure (why?)
11Offline Auditor for Partial Disclosure
- Partial disclosure the disclosed range of the
protected attribute is smaller than a predefined
threshold - Sum queries
- Interval-based disclosure monitoring upper and
lower bounds for all confidential attributes - Auditing interval-based inference. Li et
al. 2002
12Offline Auditor for Partial Disclosure
- New query
- A series of linear programming problems
13Incremental evaluation of the LP
- Treats the auditing problem as a series of
updation problems and updates the bounds with
certain rules - Horizontal updation given the same set of
queries, the bounds of one variable, how can the
prior result be modified to get the bounds of
other variables - Vertical updatation given the same set of
variables, and bounds under the previous queries,
how can the prior result modified to get the
bounds when a new query arrives
An efficient online auditing approach to limit
private data disclosure, Lu, 2009
14Online Auditing
Auditor
Qi1
Database
A or Denied
Previous queries Q1 Qi
Denied if answering Qi1 would cause a privacy
breach
15Online Auditing
- Given a sequence of queries and corresponding
answers, and a new query, determine if the new
query should be answered or denied in order to
prevent a privacy breach. - Earliest approaches
- Query set size control
- Query set overlap control
- Limited data utility
16Offline to Online?
- Can an offline auditor directly solve the online
auditing problem?
- Denials leak information!
17Sounds Familiar?
slide stolen from Kobbi Nissim
Colonel Oliver North, on the Iran-Contra arms deal
On the advice of my counsel I respectfully and
regretfully decline to answer the question based
on my constitutional rights.
David Duncan, former auditor for Enron and
partner in Arthur Andersen
Mr. Chairman, I would like to answer the
committee's questions, but on the advice of my
counsel I respectfully decline to answer the
question based on the protection afforded me
under the Constitution of the United States.
18Example Sum/Max
- Variables di are real, privacy breached if
adversary learns some di
Auditor
Database
Denied
Oh well
Wait there must be a reason why second query was
denied
The only possible reason for denial is if
d1d2d35
19Online Audit
- Denials reduce the search space
20One workaround
- Deny whenever the offline algorithm does, in
addition, randomly deny queries. - Issues
- Leakage is not prevented
- Have to remember which queries were randomly
denied - Semantically determine whether two queries are
equivalent
21Simulatable Auditing
- Observation denials have the potential to leak
information if the auditor uses information that
is unavailable to the attacker (the answer to the
new query) - Simlatable auditing the attacker should be able
to simulate or mimic the auditors decisions to
answer or deny a query - Denials provably do not leak information
Simulatable Auditing, Kenthapadi, 2005
22Simulatable Auditing
- An auditor is a function of Q, A and X
- An auditor is simulatable if there exists a
simulator that is a function of only Q and A-ai1
and whose output is always equal to the auditor.
qi1
Auditor
Deny or answer
23Simulatable Auditing
24Simulatable Auditing
- Query-set-size control
- Query-set-overlap control
- Audit expert for sum queries
25Constructing Simulatable Auditor
- General sufficient condition the auditor should
determine if there is any possible dataset,
consistent with all past responses, in which the
answer to the current query would cause some
element to be fully disclosed
26Example revisited Sum/Max
Auditor
Database
Denied
- Simulatable auditor would always deny the max
query following a sum query - Lose some utility due to the requirement of
simulatability
27Challenges
- Privacy definition
- Privacy of groups/families
- Algorithmic limitations
- Simulatable algorithms computationally
prohibitive - Most work on sum queries, some on max, min,
median, hardness results on mixed queries - Collusion
- Reduced utility for legitimate users
- Large audit trail
- Utility
- Percentage of denials may not be the best measure