Title: Towards Privacy in Public Databases
1Towards Privacy in Public Databases
- Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee
- Work done at Microsoft Research
2Database Privacy
- Think Census
- Individuals provide information
- Census Bureau publishes sanitized records
- Privacy is legally mandated; what utility can we achieve?
- Inherent Privacy vs Utility tension
- One extreme: complete privacy, no information
- Other extreme: complete information, no privacy
- Goals
- Find a middle path
- preserve macroscopic properties
- disguise individual identifying information
- Change the nature of discourse
- Establish framework for meaningful comparison of
techniques
3Outline
- Definitions
- privacy, defined in the breach
- sanitization requirements
- utility goals
- Example: Recursive Histogram Sanitizations
- description of technique
- a robust proof of privacy
- Example: Round Sanitizations
- nice learning properties
- privacy via cross-training
- Setting the Real World Context
- dealing with auxiliary information
5What do WE mean by privacy?
- Ruth Gavison: Protection from being brought to the attention of others
- inherently valuable
- attention invites further privacy loss
- Privacy is assured to the extent that one blends in with the crowd
- Appealing definition; can be converted into a precise mathematical statement
6A geometric view
- Abstraction
- Database consists of points in high dimensional space $\mathbb{R}^d$
- Points are unlabeled
- you are your collection of attributes
- Distance is everything
- points are more similar if and only if they are closer
- Real Database (RDB), private
- n unlabeled points in d-dimensional space; think of d as number of sensitive attributes
- Sanitized Database (SDB), public
- n new points, possibly in a different space
7The adversary or Isolator
- Intuition
- On input SDB and auxiliary information, adversary outputs a point $q \in \mathbb{R}^d$
- q isolates a real DB point x if it is much closer to x than to x's near neighbors
- q fails to isolate x if q looks roughly as much like everyone in x's neighborhood as it looks like x itself
- Tightly clustered points have a smaller radius of isolation
8(c,T)-Isolation: the definition
- $I(\mathrm{SDB}, \mathrm{aux}) = q$
- x is (c,T)-isolated if $B(q, c\delta)$ contains fewer than T other points from RDB, where $\delta = \|q - x\|$
- c: privacy parameter, e.g., 4
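The isolation test above can be sketched in a few lines. This is only an illustration: the function name `isolates`, the use of a closed ball, and the assumption that x appears as a row of the RDB array are choices made here, not part of the slides.

```python
import numpy as np

def isolates(q, x, rdb, c, T):
    """Return True if q (c,T)-isolates x with respect to the points in rdb.

    Assumptions of this sketch: Euclidean distance, a closed ball
    B(q, c*delta), and x itself is one of the rows of rdb.
    """
    q = np.asarray(q, dtype=float)
    x = np.asarray(x, dtype=float)
    rdb = np.asarray(rdb, dtype=float)

    delta = np.linalg.norm(q - x)              # distance from q to x
    dists = np.linalg.norm(rdb - q, axis=1)    # distance from q to every RDB point
    in_ball = dists <= c * delta               # membership in B(q, c*delta)
    is_x = np.all(rdb == x, axis=1)            # do not count x itself
    others = int(np.count_nonzero(in_ball & ~is_x))
    return others < T

rdb = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
# q is very close to x = (0, 0) and far from everything else.
print(isolates([0.1, 0.0], [0.0, 0.0], rdb, c=4, T=1))
# q sits halfway between two points, so the enlarged ball catches a neighbor.
print(isolates([0.5, 0.0], [0.0, 0.0], rdb, c=4, T=1))
```

A larger T makes isolation easier to declare, since more neighbors are tolerated inside the enlarged ball.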
9Requirements for the sanitizer
- No way of obtaining privacy if AUX already reveals too much!
- Sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success
- Definition of "considerably" can be forgiving
- Formally, quantify over distributions, adversaries, choice of database, auxiliary information
- $\forall D \;\forall I \;\exists I'$ such that, w.h.p. over the database drawn from $D$, $\forall\, \mathrm{aux}$
- $\forall x$: $\Pr[I(\mathrm{SDB}, \mathrm{aux}) \text{ isolates } x] - \Pr[I'(\mathrm{aux}) \text{ isolates } x]$ is small
- probabilities over choices made by sanitizer and $I$, $I'$
- Provides a framework for describing the power of a sanitization method, and hence for comparisons
- Aux is going to cause trouble. Ignore it for now.
10Utility Goals
- Pointwise proofs of specific utilities
- averages, medians, clusters, regressions, …
- Prove there is a large class of interesting
utilities for which there are good approximation
procedures using sanitized data
12Recursive Histogram Sanitization
- U = d-dim cube, side = 2
- Cut into $2^d$ subcubes
- split along each axis
- subcube has side = 1
- For each subcube
- if number of RDB points > 2T
- then recurse
- Output list of cells and counts
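The recursive procedure above can be sketched as follows. The names `recursive_histogram` and `max_depth` are illustrative assumptions; the depth cap is added here only so that duplicate points cannot cause unbounded recursion. To avoid enumerating all $2^d$ subcubes, this sketch lists only the non-empty ones; every unlisted subcube implicitly has count 0.

```python
import numpy as np

def recursive_histogram(points, lower, side, T, max_depth=32):
    """Return a list of (lower_corner, side_length, count) cells.

    points: array of shape (n, d) inside the cube with corner `lower`
    and side `side`. A subcube is split again while it holds more
    than 2*T points (and the hypothetical depth cap is not reached).
    """
    points = np.asarray(points, dtype=float)
    lower = np.asarray(lower, dtype=float)
    half = side / 2.0
    cells = []

    # For each point and axis: 0 if it falls in the low half, 1 if in the high half.
    bits = (points >= lower + half).astype(int)

    # Each distinct bit pattern identifies one non-empty subcube.
    for key in np.unique(bits, axis=0):
        mask = np.all(bits == key, axis=1)
        sub_points = points[mask]
        sub_lower = lower + key * half
        if len(sub_points) > 2 * T and max_depth > 1:
            cells.extend(
                recursive_histogram(sub_points, sub_lower, half, T, max_depth - 1)
            )
        else:
            cells.append((tuple(sub_lower), half, len(sub_points)))
    return cells

pts = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.6, 0.6], [0.7, 0.7], [-0.5, -0.5]]
for cell in recursive_histogram(pts, lower=[-1.0, -1.0], side=2.0, T=1):
    print(cell)
```

Dense regions end up described by small cells and sparse regions by large ones, which is what lets the published counts preserve macroscopic structure.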
13Recursive Histogram Sanitization
- Theorem: $\exists c$ s.t. if n points are drawn uniformly from U, then recursive histogram sanitizations are safe with respect to c-isolation: $\Pr[I(\mathrm{SDB}) \text{ succeeds}] \le \exp(-d)$.
14Safety of Recursive Histogram Sanitization
- Rough Intuition
- Expected distance $\|q - x\|$ is approximately the diameter of the cell.
- Distances tightly concentrated around mean.
- Multiplying radius by c captures almost all the parent cell, which contains at least 2T points.
15For Very Large Values of n
- Wlog can switch to ball adversaries (q,r)
- I wins if B(q,r) contains at least one RDB point and B(q,cr) contains fewer than T RDB points
- Define a probability density f(x) that captures adversary's view of the RDB
- To win with probability $\varepsilon$, I needs
- $\Pr_f[B(q,r)] \ge \varepsilon/n$
- $\Pr_f[B(q,cr)] \le (2T + O(\log \varepsilon^{-1}))/n$
- $\Pr_f[B(q,r)] / \Pr_f[B(q,cr)] \ge \varepsilon / (2T + O(\log \varepsilon^{-1}))$
- Bound $\varepsilon$ by bounding the ratio by $2^{-\gamma d}$, $\gamma < 1$
16Pr_f[B(q,r)] / Pr_f[B(q,cr)]
- $f(x) = (n_C / n) \cdot (1 / \mathrm{Vol}(C))$ for x in cell C
- fraction of RDB points landing in cell C, spread uniformly within C
- If r is sufficiently small, the bigger ball captures a factor of exp(d) more mass in each subcube than does the smaller ball; this yields $\varepsilon < 2^{-\Omega(d)}$
- If r is large, the small ball captures nothing or the bigger ball captures parent cube
- Either way isolation cannot occur (c = 16)
18Proof is Very Robust
- Extends to many interesting cases
- non-uniform but bounded-ratio density fns
- isolator knows constant fraction of attribute vals
- isolator knows lots of RDB points
- isolation in few attributes
- very weak bounds
- Can be adapted to "round" distributions
- balls, spheres, mixtures of Gaussians, …
- with effort; work in progress w/ K. Talwar
- More General Distributions
- good islands in a sea of zero probability
20Round Sanitizations
- The privacy of x is linked to its T-radius
- $\Rightarrow$ Randomly perturb it in proportion to its T-radius
- $x' = \mathrm{San}(x) \in_R B(x, \text{T-rad}(x))$
- alternatively $S(x, \text{T-rad}(x))$ or d-dim Gaussian
- Intuition
- We are blending x in with its crowd
- We are adding to x random noise with mean zero, so several macroscopic properties should be preserved.
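A minimal sketch of the ball variant of this perturbation follows. Two things are assumptions of the sketch rather than statements from the slides: the T-radius is taken to be the distance from x to its T-th nearest other RDB point, and the helper names `t_radius` and `round_sanitize` are invented for illustration.

```python
import numpy as np

def t_radius(i, rdb, T):
    """Distance from rdb[i] to its T-th nearest other point (assumed definition)."""
    dists = np.linalg.norm(rdb - rdb[i], axis=1)
    dists = np.delete(dists, i)          # drop the zero distance to itself
    return np.sort(dists)[T - 1]

def round_sanitize(rdb, T, rng):
    """Replace each point by a uniform random point in the ball B(x, T-rad(x))."""
    rdb = np.asarray(rdb, dtype=float)
    n, d = rdb.shape
    out = np.empty_like(rdb)
    for i in range(n):
        r = t_radius(i, rdb, T)
        # Uniform direction: a normalized standard Gaussian vector.
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)
        # Uniform in the ball: radius distributed as r * U^(1/d).
        radius = r * rng.random() ** (1.0 / d)
        out[i] = rdb[i] + radius * direction
    return out

rng = np.random.default_rng(0)
rdb = rng.random((50, 5))
sdb = round_sanitize(rdb, T=3, rng=rng)
print(sdb.shape)
```

Points in dense regions have a small T-radius and move very little, while outliers receive proportionally more noise.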
21Nice Learning Properties
- Known algorithm for learning mixtures of Gaussians works for clustering sanitized Gaussian data
- Original distribution (mixture of Gaussians) is recovered
- Technical issue: added noise is a function of the data
- Subject of another talk
- Diameter increases by at most a factor of 3 when finding k clusters minimizing the largest diameter
22Privacy for n Sanitized Points?
- Given n-1 points in the clear, the probability of isolating the nth is O(exp(-d))
- Intuition for extension to n points is wrong!
- Privacy of $x_n$ given $x_n'$ and all the other points in the clear does not imply privacy of $x_n$ given $x_n'$ and sanitizations of others!
- Sanitization of other points reveals information about $x_n$
- Worry is for safety of the reference point (the neighbor defining the T-radius), not the principal
23Combining the Two Sanitizations
- Partition RDB into two sets A and B
- Cross-training
- Compute histogram sanitization for B
- $\forall v \in A$: $\rho_v = f(\text{side length of } C \text{ containing } v)$
- Output $\mathrm{GSan}(v, \rho_v)$
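The cross-training steps above might look like the sketch below. Several choices are assumptions made only for illustration: f is taken to be the identity, GSan is modeled as spherical Gaussian noise with per-coordinate standard deviation $\rho_v / \sqrt{d}$, the split into A and B is a random half/half partition, and the function names are hypothetical. Only the side of A is shown; B would be published through its recursive histogram.

```python
import numpy as np

def cell_side(v, b_points, lower, side, T, max_depth=32):
    """Side length of the cell, in B's recursive histogram, that contains v."""
    half = side / 2.0
    key = v >= lower + half                         # which half v falls in, per axis
    inside = np.all((b_points >= lower + half) == key, axis=1)
    sub_points = b_points[inside]                   # B points sharing v's subcube
    if len(sub_points) > 2 * T and max_depth > 1:
        return cell_side(v, sub_points, lower + key * half, half, T, max_depth - 1)
    return half

def cross_train_sanitize(rdb, lower, side, T, rng):
    """Perturb the points of A using only histogram information about B."""
    rdb = np.asarray(rdb, dtype=float)
    lower = np.asarray(lower, dtype=float)
    n, d = rdb.shape
    perm = rng.permutation(n)
    A, B = rdb[perm[: n // 2]], rdb[perm[n // 2 :]]

    sanitized, rhos = [], []
    for v in A:
        rho = cell_side(v, B, lower, side, T)       # assumed f = identity
        noise = rng.normal(0.0, rho / np.sqrt(d), size=d)
        sanitized.append(v + noise)
        rhos.append(rho)
    return np.array(sanitized), rhos

rng = np.random.default_rng(1)
rdb = rng.uniform(-1.0, 1.0, size=(40, 3))
san_a, rhos = cross_train_sanitize(rdb, lower=[-1.0] * 3, side=2.0, T=2, rng=rng)
print(san_a.shape, min(rhos), max(rhos))
```

The point of the split is that the noise scale for a point in A never depends on A itself, only on the histogram of B.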
24Cross-Training Privacy
- Privacy for B: only histogram information about B is used
- Privacy for A: enough variance for enough coordinates of v, even given C containing v and sanitization $v'$ of v.
- current proof works only for $|A| = 2^{o(d)}$
25Additional Results
- Impossibility Results
- $\exists$ interesting utilities that have no sanitization protecting against isolation (cf. SFE)
- Impossibility of all-purpose sanitizers
- There is always a choice of aux that defeats a certain natural version of privacy
- Contrived, but places a limit on what can be proved
- Poly-time bounded adversary? Connection to obfuscation.
- Utility
- Exploit literature on power of randomized histograms for algorithms for data streams (e.g., Indyk)
- with assorted collaborators, e.g., N, N, S, T
27A Standard Technique Cell Suppression
- Gestalt: tabular data (many, possibly linked, tables)
- entries are cells
- frequency (count) data
- magnitude data (income, sales, etc.)
- Disclosure: small counts
- Provides key for population unique, or almost-unique
- Can be used as a key into a different database
- Enormous literature on suppressing safely
Example table of counts (last column and last row are totals):
16   8   5   2 | 31
 1   5  20   3 | 29
17  13  25   5 | 60
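To illustrate why suppressing small counts "safely" is hard, the sketch below applies a naive rule to the interior counts of the example table and then undoes it using the published row totals. The threshold of 3 and both function names are assumptions chosen for illustration; real disclosure-control methods add complementary suppressions to block exactly this kind of recovery.

```python
import numpy as np

# Interior cells and row totals of the example table.
counts = np.array([[16, 8, 5, 2],
                   [1, 5, 20, 3]])
row_totals = counts.sum(axis=1)

def primary_suppress(counts, threshold):
    """Blank out (NaN) every cell whose count is below the threshold."""
    as_float = counts.astype(float)
    return np.where(counts < threshold, np.nan, as_float)

def recover_from_row_totals(suppressed, row_totals):
    """Fill in any row that has exactly one suppressed cell using its total."""
    recovered = suppressed.copy()
    for i, row in enumerate(recovered):        # row is a view into recovered
        missing = np.isnan(row)
        if missing.sum() == 1:
            row[missing] = row_totals[i] - np.nansum(row)
    return recovered

suppressed = primary_suppress(counts, threshold=3)
print(suppressed)
print(recover_from_row_totals(suppressed, row_totals))
```

Here each row loses exactly one cell, so subtraction from the margin reveals both small counts again.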
28Connection to Our Definitions
- Protection against isolation yields protection against learning a key for a population unique
- isolation on a subspace does not imply isolation in the full-dimensional space
- but aux may contain other DBs that can be queried to learn remaining attributes
- definition mandates protection against all possible aux
- satisfy def $\Rightarrow$ can't learn key
29Connection to Our Definitions
- Seems very hard to provide good sanitization in the presence of arbitrary aux
- Provably impossible in general
- Anyway, can probably already isolate people based solely on aux
- Suggests we need to control aux
- How should we redesign the world?
30Two Tools
- Secure Function Evaluation [Yao, GMW]
- Technique permitting Alice, Bob, Carol, and their friends to collaboratively compute a function f of their private inputs: $\xi = f(a, b, c, \ldots)$.
- e.g., $\xi = \mathrm{sum}(a, b, c, \ldots)$
- Each player learns only what can be deduced from $\xi$ and her own input to f
- SuLQ databases [Dwork, Nissim]
- Provably preserves privacy of attributes when the rows of the database are mutually independent
- Powerful [DwNi; Blum, Dwork, McSherry, Nissim]
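For the sum example in the secure function evaluation bullet, one simple way to get the flavor is additive secret sharing, sketched below in a single process. This is not the Yao or GMW construction named on the slide; it is a much simpler technique that handles only sums and assumes honest-but-curious players. The modulus `P` and the function names are illustrative assumptions.

```python
import secrets

P = 2**61 - 1  # public modulus; inputs are assumed to be integers in [0, P)

def share(value, n_parties):
    """Split value into n_parties random shares that add up to value mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def secure_sum(inputs):
    """Simulate the players: each shares her input, then partial sums are combined."""
    n = len(inputs)
    all_shares = [share(v, n) for v in inputs]   # all_shares[i][j]: share sent by i to j
    # Player j adds up the shares she received and publishes only that partial sum.
    partials = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]
    return sum(partials) % P

print(secure_sum([3, 5, 9]))
```

Each individual share is a uniformly random value, so a single player's view reveals nothing beyond the final sum and her own input.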
31Statistical Database
Query (S, f): $S \subseteq [n]$, $f : \{0,1\}^d \to \{0,1\}$
Exact Answer: $\sum_{r \in S} f(\text{row } r)$
Database DB
Row distribution $D = (D_1, D_2, \ldots, D_n)$
32Sub-Linear Query (SuLQ) Databases
If the number of queries is $\ll n$, then privacy can be protected with little noise (per query): E(noise) = 0, standard dev $\ll \sqrt{n}$. Much less than sampling error!
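A noisy sum query of this kind can be sketched as below. Gaussian noise and the parameter name `noise_std` are assumptions of this sketch; the slides specify only that the noise has mean zero and standard deviation well below $\sqrt{n}$.

```python
import numpy as np

def sulq_query(db, S, f, noise_std, rng):
    """Answer the query (S, f): the sum of f(row r) over r in S, plus zero-mean noise."""
    exact = sum(f(db[r]) for r in S)
    return exact + rng.normal(0.0, noise_std)

rng = np.random.default_rng(0)
db = np.array([[1, 0], [1, 1], [0, 1], [1, 1]])   # n = 4 rows, d = 2 binary attributes

def both_set(row):
    """Predicate f: {0,1}^d -> {0,1}; here, 1 when both attributes are set."""
    return int(row[0] == 1 and row[1] == 1)

print(sulq_query(db, S=[0, 1, 2, 3], f=both_set, noise_std=1.0, rng=rng))
```

A full mechanism would also count queries and refuse to answer once the sub-linear budget is spent.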
34Our Data, Ourselves
- Individuals maintain their own data records
- join a DB by setting an appropriate attribute
- Statistical queries via a SFE(SuLQ)
- privacy of SuLQ query $\Rightarrow$ this SFE is safe
- Individuals ensure
- data take part in sufficiently few queries
- sufficient random noise is added
35Summary
- Definitions
- defined isolation and sanitization
- Recursive Histogram Sanitizations
- described approach and sketched a robust proof of privacy for a special distribution
- proof exploits high dimensionality (number of columns)
- Sanitization via perturbations
- utility and privacy via cross-training
- Setting the Real World Context
- discussed a radical view of how data might be organized to prevent a powerful class of attacks based on auxiliary data
- SuLQ tool exploits large membership (number of rows)
36Larry Joseph Stockmeyer November 13, 1948 -
July 31, 2004
37Larry Stockmeyer Commemoration
- May 21-22, 2005
- Baltimore, Maryland
- (in conjunction with STOC 2005)
- May 21
- Tutorial by Nick Pippenger (Princeton) on some of Stockmeyer's fundamental results in complexity theory
- Lectures by Miki Ajtai (IBM), Anne Condon (UBC), Cynthia Dwork (Microsoft), Richard Karp (UC Berkeley), Albert Meyer (MIT), and Chris Umans (CalTech).
- Some time will be reserved for personal remarks. Contact Cynthia Dwork if you want to participate in this part of the commemoration.
- May 22: Lance Fortnow gives first keynote address to STOC.
38Larry Stockmeyer
- Larry Stockmeyer, theoretical computer scientist and a founder of the field of complexity theory (that part of computer science exploring the inherent difficulty of solving computational problems), died Saturday, July 31, 2004, of pancreatic cancer.
- Born in Evansville, Indiana, in 1948, Stockmeyer was educated at MIT, where he received a bachelor of science in mathematics and a master of science in electrical engineering in 1972, followed by a doctorate in computer science in 1974. Stockmeyer is famous for his groundbreaking work proving the extreme difficulty of solving naturally occurring computational problems. His pioneering contributions were soon incorporated into textbooks on computational complexity.
- Stockmeyer joined IBM Research in 1974, working first at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. A founding member of the Theory Group at the IBM Almaden Research Center in the early 1980s, Stockmeyer was elevated to Fellow of the Association for Computing Machinery in 1996. He remained at Almaden until he took a bridge to retirement from IBM in November 2002. After this, Stockmeyer enjoyed a brief affiliation with the University of California at Santa Cruz until his death, at age 55.
- Stockmeyer is survived by his father Robert Stockmeyer, his sister Mary Karen Walker, and his former wife, dear friend, and colleague Cynthia Dwork.