Title: Towards Privacy in Public Databases
Slide 1: Towards Privacy in Public Databases
- Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Hoeteck Wee
Slide 2: Database Privacy
- A Census problem
- Individuals provide information
- Census Bureau publishes sanitized records
- Privacy is legally mandated; what utility can we achieve?
- Inherent Privacy vs. Utility trade-off
- Our goal
- Find a middle path
- preserve macroscopic properties
- disguise individual records (containing private info)
- Establish a framework for meaningful comparison of techniques
Slide 3: What about Secure Function Evaluation?
- Secure Function Evaluation [Yao, GMW]
- Allows parties to collaboratively compute a function f of their private inputs
- Compute the output f(a,b,c,...) (e.g., sum(a,b,c,...))
- Each player learns only what can be deduced from the output and her own input to f
- SFE and privacy are complementary problems; one does not imply the other
- SFE: given what must be preserved, protect everything else
- Privacy: given what must be protected, preserve as much as you can
Slide 4: This talk
- A formalism for privacy
- What we mean by privacy
- A good sanitization procedure
- Results
- Histograms and Perturbations
- Subsequent work; open problems
Slide 5: What do we mean by Privacy?
- Ruth Gavison: protection from being brought to the attention of others
- inherently valuable
- attention invites further privacy loss
- Privacy is assured to the extent that one blends in with the crowd
- An appealing definition that can be converted into a precise mathematical statement
Slide 6: The basic model, a geometric approach
- Database consists of points in high-dimensional space ℝ^d
- Samples from some underlying distribution
- Points are unlabeled; you are your collection of attributes
- (Relative) distance is everything: points that are closer are more similar, and vice versa
- A real database RDB, controlled by a central authority
- n points in d-dimensional space
- Think of d as the number of sensitive attributes
- A sanitized database SDB, released to the world
- Information about fake individuals, a summary of the real data, or a combination of both
Slide 7: The adversary, or isolator
- On input SDB and auxiliary information, the adversary outputs a point q ∈ ℝ^d
- q isolates a real point x if it is much closer to x than to x's neighbors
- i.e., if B(q, cδ) contains fewer than T other points from RDB, where δ = ||q - x||
- c, T are privacy parameters; e.g., c = 4, T = 100 (see the sketch below)
[Figure: a point at distance δ from q is isolated when B(q, cδ) contains fewer than T other RDB points; otherwise it is not isolated.]
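A minimal sketch of the isolation check, assuming the database is a NumPy array of points; the function name `isolates` and the strict-inequality convention are illustrative choices, not from the paper.

```python
import numpy as np

def isolates(q, x, rdb, c=4.0, T=100):
    """True if q (c,T)-isolates x: the ball B(q, c*delta), with
    delta = ||q - x||, holds fewer than T points of rdb besides x."""
    delta = np.linalg.norm(q - x)
    dists = np.linalg.norm(rdb - q, axis=1)   # distances from q to every real point
    inside = dists < c * delta                # points strictly inside B(q, c*delta)
    # x itself lies inside the ball whenever delta > 0 (since c > 1);
    # exclude it from the neighbor count.
    others = int(inside.sum()) - (1 if delta > 0 else 0)
    return others < T
```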
Slide 8: Requirement for the sanitizer
- No way of obtaining privacy if AUX already reveals too much!
- Sanitization compromises privacy if giving the adversary access to the SDB considerably increases its probability of success
- The definition of "considerably" can be forgiving, e.g., 1/1000
- Rigorously (two formulations below):
- Provides a framework for describing the power of a sanitization method, and hence for comparisons
- Aux is going to cause trouble; ignore it for now
- ε = 2^{-d} in our results
∀D ∀I ∃I′: w.h.p. over RDB ∼ D, ∀aux, ∀S ⊆ RDB:
  Pr[∃x ∈ S: I(SDB, aux) isolates x] − Pr[∃x ∈ S: I′(aux) isolates x] ≤ ε

∀D ∀I ∃I′: w.h.p. over RDB ∼ D, ∀aux, ∀x ∈ RDB:
  Pr[I(SDB, aux) isolates x] − Pr[I′(aux) isolates x] ≤ ε
Slide 9: A bad sanitizer that passes
[Abhinandan Das, Cornell]
- Disguise one attribute extremely well
- Leave the others in the clear
- Without info about the special attribute, the adversary cannot isolate any point
- However, he knows all other attributes exactly!
- What goes wrong?
- The assumption that distance is everything
- No isolation ⇒ no privacy breach, even if the adversary knows a lot of information
Slide 10: Utility goals
- Desirable results
- Macroscopic properties (e.g., means) should be preserved
- Running statistical tests / data-analysis algorithms should return results similar to those obtained from real data
- We show
- Concrete point-wise results on histograms and clustering algorithms
Slide 11: This talk
- A formalism for privacy
- What we mean by privacy
- A good sanitization procedure
- Results
- Histograms and Perturbations
- Subsequent work; open problems
Slide 12: Two techniques for sanitization
- Recursive histograms
- Assume the universe is the d-dimensional hypercube [-1,1]^d
- As long as a cell contains ≥ T points:
- Subdivide it into 2^d hypercubes by splitting each side evenly
- Recurse until all cells have < T points
- Output a list of cells and counts (see the sketch below)
[Figure: recursive histogram example with d = 2, T = 3]
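A minimal sketch of the recursive-histogram sanitizer in Python, under our reading of the splitting rule (subdivide while a cell holds at least T points); cells are treated as half-open boxes and points are assumed to be in general position so the recursion terminates.

```python
import numpy as np

def recursive_histogram(points, low, high, T):
    """Subdivide the cell [low, high) until every cell has < T points;
    return a list of (low, high, count) triples: the cells and counts."""
    n, d = points.shape
    if n < T:
        return [(low, high, n)]
    cells = []
    mid = (low + high) / 2.0
    # Enumerate the 2^d subcubes: per axis, keep the lower or upper half.
    for mask in range(2 ** d):
        lo, hi = low.copy(), high.copy()
        for axis in range(d):
            if (mask >> axis) & 1:
                lo[axis] = mid[axis]
            else:
                hi[axis] = mid[axis]
        member = np.all((points >= lo) & (points < hi), axis=1)
        cells += recursive_histogram(points[member], lo, hi, T)
    return cells

# Example matching the figure: uniform points in [-1,1]^2, T = 3.
pts = np.random.uniform(-1, 1, size=(500, 2))
cells = recursive_histogram(pts, np.array([-1.0, -1.0]), np.array([1.0, 1.0]), T=3)
```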
Slide 13: Two techniques for sanitization
- Recursive histograms
- Perturbation
- For every point x, compute its T-radius t_x: the ball B(x, t_x) contains T points of RDB
- Add a random vector to x of length proportional to t_x
- Doesn't work by itself (a sketch of the perturbation step follows below)
[Figure: perturbation example with T = 1]
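A minimal sketch of T-radius perturbation, assuming NumPy input with n > T; the Gaussian-direction sampling and the noise length being exactly t_x (rather than merely proportional to it) are illustrative assumptions.

```python
import numpy as np

def perturb(points, T):
    """Move each point x by a random vector whose length is its
    T-radius t_x, the distance from x to its T-th nearest other point."""
    n, d = points.shape
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dists.sort(axis=1)
    t = dists[:, T]   # column 0 is x itself, so column T is the T-th neighbor
    # Uniformly random unit directions via normalized Gaussians.
    dirs = np.random.normal(size=(n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return points + dirs * t[:, None]
```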
Slide 14: Two techniques for sanitization
- Recursive histograms
- Perturbation combined with histograms
- Results on privacy
- Rely on randomness in the distribution and the sanitization
- Do not use any computational assumptions
- When D is uniform over a hypercube, c = O(1), T arbitrary:
- probability of success for the adversary ≤ 2^{-Ω(d)}
- Better results for special cases
Slide 15: Key results on utility
- Perturbation-based sanitization
- Allows various clustering algorithms to perform nearly as well as on real data
- Spectral techniques
- Diameter-based clusterings
- Histograms: a popular summarization technique in statistics
- Recursive histograms have the benefit of providing more detail where required
- Provide density information even without the counts
- No randomness involved!
Slide 16: A brief proof of privacy
- Recall recursive histograms
- Simplifying assumption
- Input distribution is uniform over the hypercube
- Intuition
- The adversary's view is a product of uniform distributions over histogram cells
- The uniform distribution is well spread out; the adversary cannot conclusively single out a point in it
Slide 17: A brief proof of privacy
- Case 1: Sparse cell
- Expected distance ||q - x|| is proportional to the diameter of the cell
- c times this distance is larger than the diameter of the parent cell
- Therefore, B(q, cδ) contains at least T points
- Case 2: Dense cell
- Consider the balls B(q, r) and B(q, cr) for some radius r
- The adversary wins if
- Pr[∃ x ∈ B(q, r)] is large, and
- Pr[≥ T points in B(q, cr)] is small
- However, we show Pr[x ∈ B(q, cr)] >> Pr[x ∈ B(q, r)]
Slide 18: A brief proof of privacy
- Lemma: Let c be a large enough constant. For any cell and any r < diam(cell)/c:
  Pr[x ∈ B(q, cr) ∩ cell] ≥ 2^d · Pr[x ∈ B(q, r) ∩ cell]
- Proof idea (sketched below)
- Pr[x ∈ B(q, r) ∩ cell] ∝ Vol(B(q, r) ∩ cell)
- Vol(B(q, cr) ∩ cell) > 2^d · Vol(B(q, r) ∩ cell)
- Uses arguments about Normal and Uniform random variables
- Corollary: probability of success for the adversary < 2^{-d}
[Figure: nested balls B(q, r) ⊂ B(q, cr) intersecting a cell]
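A sketch of the volume step in LaTeX, under the slide's assumption that x is uniform within its cell; the chain of implications is our paraphrase of the proof idea, with the constants suppressed.

```latex
\[
\Pr\!\left[x \in B(q,r) \cap \mathrm{cell}\right]
  = \frac{\mathrm{Vol}\!\left(B(q,r) \cap \mathrm{cell}\right)}{\mathrm{Vol}(\mathrm{cell})},
\qquad
\mathrm{Vol}\!\left(B(q,cr)\right) = c^{d}\,\mathrm{Vol}\!\left(B(q,r)\right).
\]
\[
r < \frac{\mathrm{diam}(\mathrm{cell})}{c}
\;\Longrightarrow\;
\mathrm{Vol}\!\left(B(q,cr)\cap \mathrm{cell}\right) > 2^{d}\,\mathrm{Vol}\!\left(B(q,r)\cap \mathrm{cell}\right)
\;\Longrightarrow\;
\Pr\!\left[x \in B(q,cr)\cap \mathrm{cell}\right] \ge 2^{d}\,\Pr\!\left[x \in B(q,r)\cap \mathrm{cell}\right].
\]
```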
Slide 19: This talk
- A formalism for privacy
- What we mean by privacy
- A good sanitization procedure
- Results
- Histograms and Perturbations
- Subsequent work; open problems
Slide 20: Follow-up work
- Isolation in few dimensions
- The adversary must be more and more accurate in fewer dimensions
- Randomized recursive histograms [Chawla, Dwork, McSherry, Talwar]
- Similar privacy guarantees for nearly-uniform distributions over well-rounded universes
- Preserve distances between pairs of points to reasonable accuracy (additive error depending on T)
- General-case impossibility
- Cannot allow arbitrary AUX: for every notion of utility and every definition of privacy, there exists AUX that prevents privacy-preserving sanitization
Slide 21: What about the real world?
- Lessons from the abstract model
- High dimensionality is our friend
- Histograms are powerful; spherical perturbations are promising
- Need to scale different attributes appropriately, so that data is well-rounded
- Moving towards real data
- Outliers
- Our notion of c-isolation deals with them; their existence may be disclosed
- Discrete attributes
- Possible solution: convert them into real-valued attributes by adding noise? (a toy sketch follows below)
- The low-dimensional case
- Is it inherently impossible?
- Dinur and Nissim show impossibility for 1-dimensional data
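A toy sketch of the "add noise to discrete attributes" suggestion, assuming integer-coded categories; the encoding and the jitter width are our illustrative choices, not a vetted sanitization method.

```python
import numpy as np

def embed_discrete(codes, jitter=0.25):
    """Map integer-coded categorical values onto the real line, adding
    uniform jitter so records no longer share exact coordinates."""
    codes = np.asarray(codes, dtype=float)
    return codes + np.random.uniform(-jitter, jitter, size=codes.shape)

# e.g., ten categories, one hundred records
noisy = embed_discrete(np.random.randint(0, 10, size=100))
```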
Slide 22: Questions?
Slide 23: Prior work on privacy
- Long-standing problem in statistics and database theory; a wide variety of definitions and techniques
- Statistical approaches
- Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much
- Query-based approaches
- Disallow queries that reveal too much
- Output perturbation (add noise to the true answer); see the sketch below
- No unified definition of privacy; incomplete analysis
- e.g., erasure or refusal to answer can disclose information
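A minimal sketch of the output-perturbation idea from the bullet above: answer the query truthfully, then add random noise. The Gaussian noise and its scale are illustrative assumptions, not the calibration any cited work prescribes.

```python
import numpy as np

def perturbed_answer(values, noise_scale=10.0):
    """Output perturbation: the true sum of the selected values
    plus random noise."""
    return float(np.sum(values)) + np.random.normal(scale=noise_scale)

# Noisy count of records with salary above 100k.
salaries = np.random.uniform(20_000, 200_000, size=1_000)
noisy_count = perturbed_answer(salaries > 100_000)
```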
Slide 24: Key results on privacy
| Aux info                            | Prob. of isolation | Sanitization                                       | No. of points | Distribution                                               |
|-------------------------------------|--------------------|----------------------------------------------------|---------------|------------------------------------------------------------|
| Distribution                        | 2^{-Ω(d)}          | One perturbed point; all other points real         | n             | Uniform over a bounding box, or on the surface of a sphere |
| Distribution + subset of the points | 2^{-Ω(d)}          | Histogram over all points                          | n             | Uniform over a hypercube                                   |
| Distribution + subset of the points | O(n · 2^{-Ω(d)})   | Histogram over n/2 points, and n/2 perturbed points | 2^{o(d)}      | Uniform over a hypercube                                   |
Results rely on the randomness in the input distribution and the sanitization, not on any computational hardness assumption.
Slide 25: Everybody's first suggestion
- Learn the distribution, then output
- A description of the distribution, or
- Samples from the learned distribution
- Unsatisfactory
- Want to reflect facts on the ground
- Statistically insignificant clusters can be important for allocating resources
- Don't know in advance which aspects of the data will be useful
- Work in statistics on synthetic data has no compelling privacy arguments