Towards Privacy in Public Databases

Provided by: dwo7

Learn more at: https://crypto.stanford.edu

Transcript and Presenter's Notes

Title: Towards Privacy in Public Databases

1
Towards Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork,
Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
Work Done at Microsoft Research

2
Database Privacy

Think Census
Individuals provide information
Census Bureau publishes sanitized records
Privacy is legally mandated; what utility can we
achieve?
Inherent Privacy vs. Utility tension
One extreme: complete privacy, no information
Other extreme: complete information, no privacy
Goals
Find a middle path
preserve macroscopic properties
disguise individual identifying information
Change the nature of discourse
Establish framework for meaningful comparison of
techniques

3
Outline

Definitions
privacy, defined in the breach
sanitization requirements
utility goals
Example: Recursive Histogram Sanitizations
description of technique
a robust proof of privacy
Example: Round Sanitizations
nice learning properties
privacy via cross-training
Setting the Real World Context
dealing with auxiliary information

5
What do WE mean by privacy?

Ruth Gavison: protection from being brought to
the attention of others
inherently valuable
attention invites further privacy loss
Privacy is assured to the extent that one blends
in with the crowd
Appealing definition; can be converted into a
precise mathematical statement

6
A geometric view

Abstraction
Database consists of points in high dimensional
space $\mathbb{R}^d$
Points are unlabeled
you are your collection of attributes
Distance is everything
points are more similar if and only if they are
closer
Real Database (RDB), private
n unlabeled points in d-dimensional space
think of d as number of sensitive attributes
Sanitized Database (SDB), public
n new points, possibly in a different space

7
The adversary or Isolator - Intuition

On input SDB and auxiliary information, adversary
outputs a point $q \in \mathbb{R}^d$
q isolates a real DB point x, if it is much
closer to x than to x's near neighbors
q fails to isolate x if q looks roughly as much
like everyone in x's neighborhood as it looks
like x itself
Tightly clustered points have a smaller radius of
isolation

8
(c,T)-Isolation: the definition

$I(\mathrm{SDB}, \mathrm{aux}) = q$
x is (c,T)-isolated if $B(q, c\delta)$ contains fewer
than T other points from RDB, where $\delta = \|q - x\|$

c: privacy parameter, e.g., 4
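
The isolation test can be sketched in a few lines. This is only an illustration under stated assumptions: points are plain tuples, distance is Euclidean, and the function and variable names (isolates, rdb, x_index) are hypothetical, not from the talk. Here delta is the distance from q to x.

```python
import math

def isolates(q, x_index, rdb, c, T):
    """Return True if q (c,T)-isolates rdb[x_index].

    delta is the distance from q to x; x counts as isolated when the
    ball B(q, c*delta) holds fewer than T RDB points other than x.
    """
    x = rdb[x_index]
    delta = math.dist(q, x)
    others_in_ball = sum(
        1 for i, y in enumerate(rdb)
        if i != x_index and math.dist(q, y) <= c * delta
    )
    return others_in_ball < T

# Toy RDB: one lonely point and a tight cluster of three.
rdb = [(0.0, 0.0), (10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
lonely = isolates((0.1, 0.0), 0, rdb, c=4, T=2)      # q sits next to the lonely point
clustered = isolates((10.5, 0.5), 1, rdb, c=4, T=2)  # q sits inside the cluster
```

In this toy data the lonely point is isolated while the clustered point is not, which matches the intuition that tightly clustered points blend in with their crowd.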
9
Requirements for the sanitizer

No way of obtaining privacy if AUX already
reveals too much!
Sanitization procedure compromises privacy if
giving the adversary access to the SDB
considerably increases its probability of success
Definition of "considerably" can be forgiving
Formally, quantify over distributions,
adversaries, choice of database, auxiliary
information
$\forall D\ \forall I\ \exists I'$ w.h.p. over $D$, $\forall\, \mathrm{aux}$,
$\forall x$: $\Pr[I(\mathrm{SDB}, \mathrm{aux}) \text{ isolates } x] - \Pr[I'(\mathrm{aux}) \text{ isolates } x]$
is small
probabilities over choices made by sanitizer and
$I$, $I'$
Provides a framework for describing the power of
a sanitization method, and hence for comparisons
Aux is going to cause trouble. Ignore it for now.

10
Utility Goals

Pointwise proofs of specific utilities
averages, medians, clusters, regressions,
Prove there is a large class of interesting
utilities for which there are good approximation
procedures using sanitized data

12
Recursive Histogram Sanitization

$U$ = d-dim cube, side = 2
Cut into $2^d$ subcubes
split along each axis
subcube has side = 1
For each subcube
if number of RDB points > 2T
then recurse
Output list of cells and counts
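
A minimal sketch of the recursion above, under a few assumptions that are not from the slide: points are tuples, empty subcubes are omitted from the output (listing all 2^d of them is infeasible for large d), and a hypothetical max_depth guard stops the recursion if many points coincide.

```python
def recursive_histogram(points, lo, side, T, max_depth=32):
    """Return a list of (cell_lower_corner, cell_side, count) triples.

    The cube with lower corner `lo` and side `side` is cut into subcubes
    by halving every axis; a subcube is cut again only while it holds
    more than 2T points. Empty subcubes are not listed (assumption).
    """
    d = len(lo)
    half = side / 2.0
    buckets = {}
    for p in points:
        # One bit per axis: 0 = lower half, 1 = upper half.
        key = tuple(1 if p[i] >= lo[i] + half else 0 for i in range(d))
        buckets.setdefault(key, []).append(p)
    cells = []
    for key, pts in sorted(buckets.items()):
        sub_lo = tuple(lo[i] + key[i] * half for i in range(d))
        if len(pts) > 2 * T and max_depth > 1:
            cells.extend(recursive_histogram(pts, sub_lo, half, T, max_depth - 1))
        else:
            cells.append((sub_lo, half, len(pts)))
    return cells

# Toy run in d = 2: the cube U = [-1, 1]^2 has side 2.
pts = [(-0.5, -0.5), (0.5, 0.5), (0.6, 0.6), (0.9, 0.9), (0.1, 0.2)]
cells = recursive_histogram(pts, lo=(-1.0, -1.0), side=2.0, T=1)
```

Only the cell list is released; the points themselves are not. Dense regions end up described by small cells and sparse regions by large ones.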

13
Recursive Histogram Sanitization

Theorem: $\exists c$ s.t. if n points are drawn uniformly
from U, then recursive histogram sanitizations
are safe with respect to c-isolation:
$\Pr[I(\mathrm{SDB}) \text{ succeeds}] \le \exp(-d)$.

14
Safety of Recursive Histogram Sanitization

Rough Intuition
Expected distance $\|q - x\|$ is $\approx$ diameter of cell.
Distances tightly concentrated around mean.
Multiplying radius by c captures almost all the
parent cell, which contains at least 2T points.
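
The concentration claim can be checked empirically. The sketch below is an illustration only, not part of the proof: it samples pairs of uniform points in the cube [-1, 1]^d and reports the standard deviation of their distance divided by its mean. The function name, sample count, and seed are arbitrary choices.

```python
import math
import random

def relative_spread(d, samples=500, seed=0):
    """Std-dev / mean of the distance between two uniform points in [-1, 1]^d."""
    rng = random.Random(seed)
    dists = []
    for _ in range(samples):
        p = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        q = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        dists.append(math.dist(p, q))
    mean = sum(dists) / samples
    var = sum((x - mean) ** 2 for x in dists) / samples
    return math.sqrt(var) / mean

low_dim = relative_spread(2)
high_dim = relative_spread(200)
```

The ratio should shrink as d grows, which is the sense in which distances are tightly concentrated in high dimension.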

15
For Very Large Values of n

Wlog can switch to ball adversaries (q,r)
I wins if B(q,r) contains at least one RDB point
and B(q,cr) contains fewer than T RDB points
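Define a probability density f(x) that captures
adversary's view of the RDB
To win with probability $\varepsilon$, I needs
$\Pr_f[B(q,r)] \ge \varepsilon/n$
$\Pr_f[B(q,cr)] \le (2T + O(\log \varepsilon^{-1}))/n$
$\Pr_f[B(q,r)] / \Pr_f[B(q,cr)] \ge \varepsilon/(2T + O(\log \varepsilon^{-1}))$
Bound $\varepsilon$ by bounding ratio, $2^{-\gamma d}$, $\gamma < 1$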
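
The step from the two conditions to the ratio is a single division. Writing $\varepsilon$ for the adversary's success probability, and $\rho$ for any upper bound on the ratio ($\rho$ is a symbol introduced here for illustration), it can be spelled out as:

```latex
\[
\frac{\Pr_f[B(q,r)]}{\Pr_f[B(q,cr)]}
\;\ge\;
\frac{\varepsilon/n}{\bigl(2T + O(\log \varepsilon^{-1})\bigr)/n}
\;=\;
\frac{\varepsilon}{2T + O(\log \varepsilon^{-1})}
\]
\[
\text{so if }\;
\frac{\Pr_f[B(q,r)]}{\Pr_f[B(q,cr)]} \le \rho
\;\text{ for all } (q,r),
\;\text{ then }\;
\varepsilon \le \rho \,\bigl(2T + O(\log \varepsilon^{-1})\bigr).
\]
```

An exponentially small bound on the ratio therefore forces an exponentially small success probability, up to the factor involving T.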

16
$\Pr_f[B(q,r)] / \Pr_f[B(q,cr)]$

$f(x) = (n_C/n) \cdot (1/\mathrm{Vol}(C))$
fraction of RDB points landing in cell C, spread
uniformly within C
If r is sufficiently small, the bigger ball
captures exp(d) more mass in each subcube than
does the smaller ball

yields $\varepsilon < 2^{-\Omega(d)}$
17
$\Pr_f[B(q,r)] / \Pr_f[B(q,cr)]$

$f(x) = (n_C/n) \cdot (1/\mathrm{Vol}(C))$
fraction of RDB points landing in cell C, spread
uniformly within C
If r is sufficiently small, the bigger ball
captures exp(d) more mass in each subcube than
does the smaller ball
If r is large, the small ball captures nothing or
the bigger ball captures parent cube
Either way isolation cannot occur (c 16)

18
Proof is Very Robust

Extends to many interesting cases
non-uniform but bounded-ratio density fns
isolator knows constant fraction of attribute
vals
isolator knows lots of RDB points
isolation in few attributes
very weak bounds
Can be adapted to round distributions
balls, spheres, mixtures of Gaussians,
with effort work in progress w/ K. Talwar
More General Distributions
good islands in a sea of zero probability

20
Round Sanitizations

The privacy of x is linked to its T-radius
$\Rightarrow$ Randomly perturb it in proportion to its
T-radius
$x' = \mathrm{San}(x) \in_R B(x, T\text{-rad}(x))$
alternatively $S(x, T\text{-rad}(x))$ or d-dim Gaussian
Intuition
We are blending x in with its crowd
We are adding to x random noise with mean zero,
so several macroscopic properties should be
preserved.
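
A minimal sketch of the ball variant of this perturbation. Two assumptions are made here and are not taken from the slide: the T-radius of x is taken to be the distance from x to its T-th nearest other RDB point, and a uniform point in a ball is drawn by picking a random Gaussian direction and a radius proportional to U^(1/d). All function names are hypothetical.

```python
import math
import random

def t_radius(i, rdb, T):
    """Distance from rdb[i] to its T-th nearest other RDB point (assumed definition)."""
    x = rdb[i]
    dists = sorted(math.dist(x, y) for j, y in enumerate(rdb) if j != i)
    return dists[T - 1]

def uniform_in_ball(center, radius, rng):
    """Sample a point uniformly at random from the ball B(center, radius)."""
    d = len(center)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]   # random direction
    norm = math.sqrt(sum(v * v for v in g))
    r = radius * rng.random() ** (1.0 / d)        # radius density proportional to r^(d-1)
    return tuple(c + r * v / norm for c, v in zip(center, g))

def round_sanitize(rdb, T, seed=0):
    """Replace every point by a random point within its own T-radius."""
    rng = random.Random(seed)  # seeded for repeatability in this demo only
    return [uniform_in_ball(x, t_radius(i, rdb, T), rng) for i, x in enumerate(rdb)]

rdb = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0),
       (5.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
sdb = round_sanitize(rdb, T=2)
```

Points in dense regions have a small T-radius and move little; outliers have a large T-radius and move a lot, which is the blending-in effect described above.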

21
Nice Learning Properties

Known algorithm for learning mixtures of
Gaussians works for clustering sanitized Gaussian
data
Original distribution (mixture of Gaussians) is
recovered
Technical issue: added noise is a function of
the data
Subject of another talk
Diameter increases by at most a factor of 3 when finding k
clusters minimizing the largest diameter

22
Privacy for n Sanitized Points?

Given n-1 points in the clear, the probability of
isolating the nth is O(exp(-d))
Intuition for extension to n points is wrong!
Privacy of $x_n$ given $x_n'$ (the sanitization of $x_n$) and all the other points
in the clear does not imply privacy of $x_n$ given
$x_n'$ and sanitizations of others!
Sanitization of other points reveals information
about $x_n$
Worry is for safety of the reference point (the
neighbor defining the T-radius), not the
principal

23
Combining the Two Sanitizations

Partition RDB into two sets A and B
Cross-training
Compute histogram sanitization for B
$\forall v \in A$: $\rho_v = f(\text{side length of } C \text{ containing } v)$
Output $\mathrm{GSan}(v, \rho_v)$
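
The cross-training step can be sketched as follows. Assumptions, none of which come from the slide: f is the identity, GSan adds independent zero-mean Gaussian noise with standard deviation equal to that scale in every coordinate, the split into A and B is a random halving, and the cell containing v is found by following v down the recursive histogram of B (cut until the cell holds at most 2T points of B). Names such as cell_side and cross_train are hypothetical.

```python
import random

def cell_side(v, b_points, lo, side, T, max_depth=32):
    """Side length of the cell of B's recursive histogram that contains v."""
    d = len(v)
    lo = list(lo)
    pts = b_points
    for _ in range(max_depth):
        half = side / 2.0
        bits = [1 if v[i] >= lo[i] + half else 0 for i in range(d)]
        # Keep only the B points that fall in the same subcube as v.
        pts = [p for p in pts
               if all((1 if p[i] >= lo[i] + half else 0) == bits[i]
                      for i in range(d))]
        lo = [lo[i] + bits[i] * half for i in range(d)]
        side = half
        if len(pts) <= 2 * T:
            break
    return side

def cross_train(rdb, lo, side, T, f=lambda s: s, seed=0):
    """Split RDB into A and B; perturb each v in A at the scale set by B's histogram."""
    rng = random.Random(seed)  # seeded for repeatability in this demo only
    shuffled = list(rdb)
    rng.shuffle(shuffled)
    A, B = shuffled[: len(shuffled) // 2], shuffled[len(shuffled) // 2:]
    sanitized_A = []
    for v in A:
        scale = f(cell_side(v, B, lo, side, T))
        sanitized_A.append(tuple(vi + rng.gauss(0.0, scale) for vi in v))
    return sanitized_A, B

B_demo = [(0.6, 0.6), (0.9, 0.9), (0.1, 0.2), (0.55, 0.7)]
dense_side = cell_side((0.7, 0.7), B_demo, lo=(-1.0, -1.0), side=2.0, T=1)
sparse_side = cell_side((-0.5, -0.5), B_demo, lo=(-1.0, -1.0), side=2.0, T=1)
```

In a full pipeline B would be released only through its histogram, and A only through the perturbed points. A point of A in a dense region of B gets a small cell and hence little noise.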

24
Cross-Training Privacy

Privacy for B: only histogram information about B
is used
Privacy for A: enough variance for enough
coordinates of v, even given C containing v and
sanitization $v'$ of v.
current proof works only for $|A| = 2^{o(d)}$

25
Additional Results

Impossibility Results
$\exists$ interesting utilities that have no sanitization
protecting against isolation (cf. SFE)
Impossibility of all-purpose sanitizers
There is always a choice of aux that defeats a
certain natural version of privacy
Contrived, but places a limit on what can be
proved
Poly-time bounded adversary? Connection to
obfuscation.
Utility
Exploit literature on power of randomized
histograms for algorithms for data streams (eg,
Indyk)
with assorted collaborators, eg, N, N, S, T

27
A Standard Technique: Cell Suppression

Gestalt: Tabular Data (many, possibly linked,
tables)
entries are cells
frequency (count) data
magnitude data (income, sales, etc.)
Disclosure: small counts
Provides key for population unique, or
almost-unique
Can be used as a key into a different database
Enormous literature on suppressing safely

16   8   5   2 | 31
 1   5  20   3 | 29
---------------+---
17  13  25   5 | 60
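
The example table reads as two rows of four counts with row and column totals (the totals add up). A small sketch shows why suppressing small counts alone is not enough when totals are published. The threshold of 3 and the function names are hypothetical choices for illustration.

```python
table = [
    [16, 8, 5, 2],
    [1, 5, 20, 3],
]
row_totals = [31, 29]
THRESHOLD = 3  # hypothetical disclosure threshold: smaller counts are suppressed

def suppress_small(table, threshold):
    """Blank out (None) every cell whose count is below the threshold."""
    return [[None if c < threshold else c for c in row] for row in table]

def recover_from_row_total(row, total):
    """If exactly one cell in a row is suppressed, the published total reveals it."""
    missing = [i for i, c in enumerate(row) if c is None]
    if len(missing) != 1:
        return list(row)
    fixed = list(row)
    fixed[missing[0]] = total - sum(c for c in row if c is not None)
    return fixed

published = suppress_small(table, THRESHOLD)
recovered = [recover_from_row_total(r, t) for r, t in zip(published, row_totals)]
```

Each row here has a single suppressed cell, so subtraction from the row total restores it exactly. Avoiding this requires suppressing additional cells, which is what the literature on suppressing safely addresses.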
28
Connection to Our Definitions

Protection against isolation yields protection
against learning a key for a population unique
isolation on a subspace does not imply isolation
in the full-dimensional space
but aux may contain other DBs that can be
queried to learn remaining attributes
definition mandates protection against all
possible aux
satisfy def $\Rightarrow$ can't learn key

29
Connection to Our Definitions

Seems very hard to provide good sanitization in
the presence of arbitrary aux
Provably impossible in general
Anyway, can probably already isolate people based
solely on aux
Suggests we need to control aux
How should we redesign the world?

30
Two Tools

Secure Function Evaluation [Yao, GMW]
Technique permitting Alice, Bob, Carol, and their
friends to collaboratively compute a function f
of their private inputs: $\xi = f(a, b, c, \ldots)$.
e.g., $\xi = \mathrm{sum}(a, b, c, \ldots)$
Each player learns only what can be deduced from
$\xi$ and her own input to f
SuLQ databases [Dwork, Nissim]
Provably preserves privacy of attributes when the
rows of the database are mutually independent
Powerful [DwNi], [Blum, Dwork, McSherry, Nissim]
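
For the sum example, the flavor of secure function evaluation can be illustrated with additive secret sharing. This is a much simpler technique than the general Yao or GMW constructions, is shown here only for the sum function, and assumes players who follow the protocol honestly. The modulus, the function names, and the seeded random generator are all demo assumptions; a real protocol needs cryptographic randomness and private channels.

```python
import random

MODULUS = 2 ** 61 - 1  # hypothetical public modulus, larger than any possible sum

def share(value, n_players, rng):
    """Split value into n_players additive shares that sum to value mod MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_players - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(inputs, seed=0):
    """Simulate all players in one process and return the sum of their inputs."""
    rng = random.Random(seed)  # demo only, not cryptographically secure
    n = len(inputs)
    # dealt[i][j] is the share that player i sends to player j.
    dealt = [share(x, n, rng) for x in inputs]
    # Player j publishes only the sum of the shares it received.
    partial = [sum(dealt[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partial) % MODULUS

total = secure_sum([3, 5, 9])  # Alice, Bob, Carol
```

Each individual share is a uniformly random value, so a player sees nothing about another player's input beyond what the final sum and her own input imply.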

31
Statistical Database
Query (S, f): $S \subseteq [n]$, $f: \{0,1\}^d \to \{0,1\}$
Exact Answer: $\sum_{r \in S} f(\text{row } r)$
Database DB
Row distribution $D = (D_1, D_2, \ldots, D_n)$
32
Sub-Linear Query (SuLQ) Databases
If the number of queries is $\ll n$, then privacy
can be protected with little noise (per
query): $E(\text{noise}) = 0$, standard dev $\ll \sqrt{n}$. Much
less than sampling error!
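
A noisy statistical query of this kind can be sketched as the exact sum over the chosen rows plus zero-mean noise. The Gaussian noise, the noise scale n^(1/4), and the names below are illustrative assumptions only; the talk does not fix them, and how the noise must grow with the number of queries is not modeled here.

```python
import random

def sulq_query(db, S, f, noise_std, rng):
    """Answer the query (S, f): exact sum over rows in S plus zero-mean noise."""
    exact = sum(f(db[r]) for r in S)
    return exact + rng.gauss(0.0, noise_std)

rng = random.Random(0)  # seeded for repeatability in this demo only
n, d = 1000, 4
db = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(n)]

S = range(0, n, 2)                # subset of row indices
f = lambda row: row[0] & row[1]   # predicate from {0,1}^d to {0,1}
noise_std = n ** 0.25             # hypothetical choice, well below sqrt(n)
exact = sum(f(db[r]) for r in S)
noisy = sulq_query(db, S, f, noise_std, rng)
```

With n = 1000 the noise scale here is under 6, while the square root of n is about 31.6, so the perturbation stays below the sampling error mentioned above.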
33
Our Data, Ourselves
34
Our Data, Ourselves

Individuals maintain their own data records
join a DB by setting an appropriate attribute
Statistical queries via a SFE(SuLQ)
privacy of SuLQ query $\Rightarrow$ this SFE is safe
Individuals ensure
data take part in sufficiently few queries
sufficient random noise is added

35
Summary

Definitions
defined isolation and sanitization
Recursive Histogram Sanitizations
described approach and sketched a robust proof of
privacy for a special distribution
proof exploits high dimensionality (# columns)
Sanitization via perturbations
utility and privacy via cross-training
Setting the Real World Context
discussed a radical view of how data might be
organized to prevent a powerful class of attacks
based on auxiliary data
SuLQ tool exploits large membership (# rows)

36
Larry Joseph Stockmeyer November 13, 1948 -
July 31, 2004
37
Larry Stockmeyer Commemoration

May 21-22, 2005
Baltimore, Maryland
(in conjunction with STOC 2005)
May 21:
Tutorial by Nick Pippenger (Princeton) on some
of Stockmeyer's fundamental results in complexity
theory
Lectures by Miki Ajtai (IBM), Anne Condon (UBC),
Cynthia Dwork (Microsoft), Richard Karp
(UC Berkeley), Albert Meyer (MIT), and Chris
Umans (CalTech).
Some time will be reserved for personal remarks.
Contact Cynthia Dwork if you want to participate
in this part of the commemoration.
May 22: Lance Fortnow gives first keynote
address to STOC.

38
Larry Stockmeyer

Larry Stockmeyer, theoretical computer scientist
and a founder of the field of complexity theory
-- that part of computer science exploring the
inherent difficulty of solving computational
problems -- died Saturday, July 31, 2004, of
pancreatic cancer.
Born in Evansville, Indiana, in 1948, Stockmeyer
was educated at MIT, where he received a
bachelor of science in mathematics and a
master of science in electrical engineering in
1972, followed by a doctorate in computer science
in 1974. Stockmeyer is famous for his
groundbreaking work proving the extreme
difficulty of solving naturally occurring
computational problems. His pioneering
contributions were soon incorporated into
textbooks on computational complexity.
Stockmeyer joined IBM Research in 1974, working
first at the IBM Thomas J. Watson Research Center
in Yorktown Heights, New York. A founding member
of the Theory Group at the IBM Almaden Research
Center in the early 1980s, Stockmeyer was
elevated to Fellow of the Association for
Computing Machinery in 1996. He remained at
Almaden until he took a bridge to retirement from
IBM in November 2002. After this, Stockmeyer
enjoyed a brief affiliation with the University
of California at Santa Cruz until his death, at
age 55.
Stockmeyer is survived by his father Robert
Stockmeyer, his sister Mary Karen Walker, and his
former wife, dear friend, and colleague Cynthia
Dwork.