Transcript and Presenter's Notes

Title: Towards Privacy in Public Databases


1
Towards Privacy in Public Databases
  • Shuchi Chawla, Cynthia Dwork,
    Frank McSherry, Adam Smith,
    Larry Stockmeyer, Hoeteck Wee
  • Work Done at Microsoft Research

2
Database Privacy
  • Think Census
  • Individuals provide information
  • Census Bureau publishes sanitized records
  • Privacy is legally mandated; what utility can we
    achieve?
  • Inherent privacy vs. utility tension
  • One extreme: complete privacy, no information
  • Other extreme: complete information, no privacy
  • Goals
  • Find a middle path
  • preserve macroscopic properties
  • disguise individual identifying information
  • Change the nature of discourse
  • Establish framework for meaningful comparison of
    techniques

3
Outline
  • Definitions
  • privacy, defined in the breach
  • sanitization requirements
  • utility goals
  • Example: Recursive Histogram Sanitizations
  • description of technique
  • a robust proof of privacy
  • Example: Round Sanitizations
  • nice learning properties
  • privacy via cross-training
  • Setting the Real World Context
  • dealing with auxiliary information

4
Outline
  • Definitions
  • privacy, defined in the breach
  • sanitization requirements
  • utility goals
  • Example: Recursive Histogram Sanitizations
  • description of technique
  • a robust proof of privacy
  • Example: Round Sanitizations
  • nice learning properties
  • privacy via cross-training
  • Setting the Real World Context
  • dealing with auxiliary information

5
What do WE mean by privacy?
  • Ruth Gavison: Protection from being brought to
    the attention of others
  • inherently valuable
  • attention invites further privacy loss
  • Privacy is assured to the extent that one blends
    in with the crowd
  • Appealing definition can be converted into a
    precise mathematical statement

6
A geometric view
  • Abstraction
  • Database consists of points in high-dimensional
    space R^d
  • Points are unlabeled
  • you are your collection of attributes
  • Distance is everything
  • points are more similar if and only if they are
    closer
  • Real Database (RDB), private
  • n unlabeled points in d-dimensional space;
    think of d as the number of sensitive attributes
  • Sanitized Database (SDB), public
  • n new points, possibly in a different space

7
The adversary or Isolator - Intuition
  • On input SDB and auxiliary information, the adversary
    outputs a point q ∈ R^d
  • q isolates a real DB point x if it is much
    closer to x than to x's near neighbors
  • q fails to isolate x if q looks roughly as much
    like everyone in x's neighborhood as it looks
    like x itself
  • Tightly clustered points have a smaller radius of
    isolation

8
(c,T)-Isolation: the definition
  • I(SDB, aux) = q
  • x is (c,T)-isolated if B(q, cδ), where δ = ||q − x||,
    contains fewer than T other points from RDB

c: privacy parameter, e.g., 4
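
A minimal sketch of the isolation check, assuming Euclidean distance and
NumPy; the function name and the defaults (c = 4, T = 10) are illustrative,
not taken from the paper:

    import numpy as np

    def isolates(q, rdb, x_idx, c=4.0, T=10):
        """Does query point q (c,T)-isolate the RDB point rdb[x_idx]?

        q isolates x if B(q, c*delta), with delta = ||q - x||, contains
        fewer than T other RDB points."""
        x = rdb[x_idx]
        delta = np.linalg.norm(q - x)
        dists = np.linalg.norm(rdb - q, axis=1)   # distance from q to every RDB point
        inside = np.sum(dists <= c * delta) - 1   # points in B(q, c*delta), excluding x
        return inside < T
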
9
Requirements for the sanitizer
  • No way of obtaining privacy if AUX already
    reveals too much!
  • Sanitization procedure compromises privacy if
    giving the adversary access to the SDB
    considerably increases its probability of success
  • Definition of "considerably" can be forgiving
  • Formally, quantify over distributions,
    adversaries, choice of database, auxiliary
    information
  • ∀D ∀I ∃I′ such that w.h.p. over D and aux:
  • ∀x, |Pr[I(SDB, aux) isolates x] − Pr[I′(aux)
    isolates x]| is small
  • probabilities over choices made by the sanitizer
    and by I, I′
  • Provides a framework for describing the power of
    a sanitization method, and hence for comparisons
  • Aux is going to cause trouble. Ignore it for now.

10
Utility Goals
  • Pointwise proofs of specific utilities
  • averages, medians, clusters, regressions, …
  • Prove there is a large class of interesting
    utilities for which there are good approximation
    procedures using sanitized data

11
Outline
  • Definitions
  • privacy, defined in the breach
  • sanitization requirements
  • utility goals
  • Example: Recursive Histogram Sanitizations
  • description of technique
  • a robust proof of privacy
  • Example: Round Sanitizations
  • nice learning properties
  • privacy via cross-training
  • Setting the Real World Context
  • dealing with auxiliary information

12
Recursive Histogram Sanitization
  • U = d-dimensional cube of side 2
  • Cut into 2^d subcubes
  • split along each axis
  • each subcube has side 1
  • For each subcube
  • if number of RDB points > 2T
  • then recurse
  • Output list of cells and counts (sketch below)
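
A minimal sketch of the recursive subdivision, assuming points lie in the
cube [-1, 1]^d and NumPy; only nonempty subcubes are materialized, and the
depth cap is an illustrative guard, not part of the construction:

    import numpy as np

    def recursive_histogram(points, lo, hi, T, depth=0, max_depth=20, out=None):
        """Split a cell into 2^d subcubes while it holds more than 2T points;
        output the list of (cell, count) pairs for the leaves."""
        if out is None:
            out = []
        if len(points) <= 2 * T or depth >= max_depth:
            out.append((lo.copy(), hi.copy(), len(points)))
            return out
        mid = (lo + hi) / 2.0
        codes = points > mid                           # which half of the cell, per axis
        for pattern in {tuple(row) for row in codes}:  # nonempty subcubes only
            mask = np.all(codes == np.array(pattern), axis=1)
            sub_lo = np.where(pattern, mid, lo)
            sub_hi = np.where(pattern, hi, mid)
            recursive_histogram(points[mask], sub_lo, sub_hi, T,
                                depth + 1, max_depth, out)
        return out

For example, recursive_histogram(rdb, lo=-np.ones(d), hi=np.ones(d), T=10)
returns the published list of cells and counts.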

13
Recursive Histogram Sanitization
  • Theorem: ∃c s.t. if n points are drawn uniformly
    from U, then recursive histogram sanitizations
    are safe with respect to c-isolation:
    Pr[I(SDB) succeeds] ≤ exp(−Ω(d)).

14
Safety of Recursive Histogram Sanitization
  • Rough Intuition
  • Expected distance ||q − x|| is roughly the diameter
    of the cell.
  • Distances tightly concentrated around mean.
  • Multiplying the radius by c captures almost all of
    the parent cell, which contains at least 2T points.

15
For Very Large Values of n
  • Wlog can switch to ball adversaries (q, r)
  • I wins if B(q, r) contains at least one RDB point
    and B(q, cr) contains fewer than T RDB points
  • Define a probability density f(x) that captures
    the adversary's view of the RDB
  • To win with probability γ, I needs
  • Pr_f[B(q, r)] ≥ γ/n
  • Pr_f[B(q, cr)] ≤ (2T + O(log(1/γ)))/n
  • ⇒ Pr_f[B(q, r)] / Pr_f[B(q, cr)] ≥ γ/(2T + O(log(1/γ)))
  • Bound γ by bounding the ratio by 2^(-εd), ε < 1

16
Pr_f[B(q, r)] / Pr_f[B(q, cr)]
  • f(x) = (n_C / n) · (1 / Vol(C))
  • fraction of RDB points landing in cell C, spread
    uniformly within C
  • If r is sufficiently small, the bigger ball
    captures exp(d) more mass in each subcube than
    does the smaller ball

yields γ < 2^(-Ω(d))
17
Pr_f[B(q, r)] / Pr_f[B(q, cr)]
  • f(x) = (n_C / n) · (1 / Vol(C))
  • fraction of RDB points landing in cell C, spread
    uniformly within C
  • If r is sufficiently small, the bigger ball
    captures exp(d) more mass in each subcube than
    does the smaller ball
  • If r is large, the small ball captures nothing or
    the bigger ball captures the parent cube
  • Either way, isolation cannot occur (c = 16)

18
Proof is Very Robust
  • Extends to many interesting cases
  • non-uniform but bounded-ratio density fns
  • isolator knows a constant fraction of attribute
    values
  • isolator knows lots of RDB points
  • isolation in few attributes
  • very weak bounds
  • Can be adapted to round distributions
  • balls, spheres, mixtures of Gaussians, …
  • with effort; work in progress with K. Talwar
  • More General Distributions
  • good islands in a sea of zero probability

19
Outline
  • Definitions
  • privacy, defined in the breach
  • sanitization requirements
  • utility goals
  • Example: Recursive Histogram Sanitizations
  • description of technique
  • a robust proof of privacy
  • Example: Round Sanitizations
  • nice learning properties
  • privacy via cross-training
  • Setting the Real World Context
  • dealing with auxiliary information

20
Round Sanitizations
  • The privacy of x is linked to its T-radius
  • ⇒ Randomly perturb it in proportion to its
    T-radius
  • x′ = San(x) ∈_R B(x, T-rad(x))
  • alternatively S(x, T-rad(x)) or a d-dim Gaussian
  • Intuition
  • We are blending x in with its crowd
  • We are adding to x random noise with mean zero,
    so several macroscopic properties should be
    preserved.
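
A minimal sketch of the perturbation, assuming Euclidean T-radii, uniform
sampling from the ball, and NumPy; the helper names are illustrative:

    import numpy as np

    def t_radius(rdb, i, T):
        """Distance from rdb[i] to its T-th nearest neighbor (its 'crowd')."""
        dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
        return dists[T]                          # dists[0] == 0 is the point itself

    def sample_ball(center, radius, rng):
        """Uniform sample from the ball B(center, radius)."""
        d = len(center)
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)
        r = radius * rng.random() ** (1.0 / d)   # radial law for uniform-in-ball
        return center + r * direction

    def round_sanitize(rdb, T, rng=np.random.default_rng(0)):
        """Perturb each point within a ball of radius equal to its T-radius."""
        return np.array([sample_ball(x, t_radius(rdb, i, T), rng)
                         for i, x in enumerate(rdb)])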

21
Nice Learning Properties
  • Known algorithm for learning mixtures of
    Gaussians works for clustering sanitized Gaussian
    data
  • Original distribution (mixture of Gaussians) is
    recovered
  • Technical issue: the added noise is a function of
    the data
  • Subject of another talk
  • Diameter increases by a factor of at most 3 when
    finding k clusters minimizing the largest diameter
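
A small illustration of the clustering claim, assuming the round_sanitize
sketch above and scikit-learn's GaussianMixture; this is not the specific
learning algorithm the slide refers to, just a sanity check on sanitized data:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    d, n = 20, 2000
    # Two well-separated spherical Gaussians in R^d.
    rdb = np.vstack([rng.normal(+3.0, 1.0, (n // 2, d)),
                     rng.normal(-3.0, 1.0, (n // 2, d))])

    sdb = round_sanitize(rdb, T=10, rng=rng)        # perturbed, public data
    gm = GaussianMixture(n_components=2, random_state=0).fit(sdb)
    print(gm.means_.round(1))                       # centers recovered from sanitized data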

22
Privacy for n Sanitized Points?
  • Given n-1 points in the clear, the probability of
    isolating the nth is O(exp(-d))
  • Intuition for extension to n points is wrong!
  • Privacy of x_n given x_n′ and all the other points
    in the clear does not imply privacy of x_n given
    x_n′ and the sanitizations of the others!
  • Sanitization of the other points reveals information
    about x_n
  • Worry is for safety of the reference point (the
    neighbor defining the T-radius), not the
    principal

23
Combining the Two Sanitizations
  • Partition RDB into two sets, A and B
  • Cross-training
  • Compute the histogram sanitization for B
  • ∀v ∈ A: σ_v = f(side length of the cell C containing v)
  • Output GSan(v, σ_v)
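
A minimal sketch of the cross-training step, assuming the recursive_histogram
helper sketched earlier and a hypothetical choice f(side) = side / 2 for the
Gaussian scale (the slide does not pin down f):

    import numpy as np

    def cross_train_sanitize(rdb, T, rng=np.random.default_rng(0)):
        """Split RDB into A and B; publish B's histogram, and Gaussian-perturb
        each v in A with scale tied to the side of B's cell containing v."""
        n, d = rdb.shape
        perm = rng.permutation(n)
        A, B = rdb[perm[: n // 2]], rdb[perm[n // 2:]]

        cells = recursive_histogram(B, lo=-np.ones(d), hi=np.ones(d), T=T)

        def cell_side(v):
            for lo, hi, _count in cells:
                if np.all(v >= lo) and np.all(v <= hi):
                    return (hi - lo)[0]        # cells are cubes, so any side works
            return 2.0                          # fall back to the full cube

        sanitized_A = np.array([v + (cell_side(v) / 2.0) * rng.standard_normal(d)
                                for v in A])
        return sanitized_A, cells               # Gaussian-perturbed A, histogram of B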

24
Cross-Training Privacy
  • Privacy for B: only histogram information about B
    is used
  • Privacy for A: enough variance in enough
    coordinates of v, even given the cell C containing v
    and the sanitization v′ of v
  • current proof works only for |A| ≤ 2^(o(d))

25
Additional Results
  • Impossibility Results
  • ∃ interesting utilities that have no sanitization
    protecting against isolation (cf. SFE)
  • Impossibility of all-purpose sanitizers
  • There is always a choice of aux that defeats a
    certain natural version of privacy
  • Contrived, but places a limit on what can be
    proved
  • Poly-time bounded adversary? Connection to
    obfuscation.
  • Utility
  • Exploit literature on power of randomized
    histograms for algorithms for data streams (e.g.,
    Indyk)
  • with assorted collaborators, e.g., N, N, S, T

26
Outline
  • Definitions
  • privacy, defined in the breach
  • sanitization requirements
  • utility goals
  • Example: Recursive Histogram Sanitizations
  • description of technique
  • a robust proof of privacy
  • Example: Round Sanitizations
  • nice learning properties
  • privacy via cross-training
  • Setting the Real World Context
  • dealing with auxiliary information

27
A Standard Technique: Cell Suppression
  • Gestalt: tabular data (many, possibly linked,
    tables)
  • entries are cells
  • frequency (count) data
  • magnitude data (income, sales, etc.)
  • Disclosure: small counts
  • Provides a key for a population unique, or
    almost-unique
  • Can be used as a key into a different database
  • Enormous literature on suppressing safely

Example frequency table (cell counts with row and column totals):
    16   8   5   2 | 31
     1   5  20   3 | 29
    -------------------
    17  13  25   5 | 60
28
Connection to Our Definitions
  • Protection against isolation yields protection
    against learning a key for a population unique
  • isolation on a subspace does not imply isolation
    in the full-dimensional space
  • but aux may contain other DBs that can be
    queried to learn remaining attributes
  • definition mandates protection against all
    possible aux
  • satisfy def ⇒ can't learn key

29
Connection to Our Definitions
  • Seems very hard to provide good sanitization in
    the presence of arbitrary aux
  • Provably impossible in general
  • Anyway, can probably already isolate people based
    solely on aux
  • Suggests we need to control aux
  • How should we redesign the world?

30
Two Tools
  • Secure Function Evaluation [Yao, GMW]
  • Technique permitting Alice, Bob, Carol, and their
    friends to collaboratively compute a function f
    of their private inputs: σ = f(a, b, c, …)
  • e.g., σ = sum(a, b, c, …)
  • Each player learns only what can be deduced from
    σ and her own input to f
  • SuLQ databases [Dwork, Nissim]
  • Provably preserve the privacy of attributes when the
    rows of the database are mutually independent
  • Powerful [DwNi]; [Blum, Dwork, McSherry, Nissim]
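
A toy additive secret-sharing sketch of the sum example, in the semi-honest
model; this is a simplification for illustration, not Yao's or the GMW
protocol:

    import secrets

    P = 2**61 - 1   # work modulo a large prime

    def share(value, n_players):
        """Split value into n_players additive shares modulo P."""
        shares = [secrets.randbelow(P) for _ in range(n_players - 1)]
        shares.append((value - sum(shares)) % P)
        return shares

    def secure_sum(inputs):
        """Each player shares her input; player j adds the j-th shares she
        holds and publishes only that partial sum. Only sigma = sum(inputs)
        (and one's own input) is revealed."""
        n = len(inputs)
        all_shares = [share(v, n) for v in inputs]
        partials = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]
        return sum(partials) % P

    print(secure_sum([5, 7, 11]))   # 23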

31
Statistical Database
Query (S, f): S ⊆ [n], f : {0,1}^d → {0,1}
Exact answer: Σ_{r ∈ S} f(row r)
Database DB
Row distribution D = (D_1, D_2, …, D_n)
32
Sub-Linear Query (SuLQ) Databases
If the number of queries is << n, then privacy
can be protected with little noise (per query):
E[noise] = 0, standard deviation << √n. Much
less than the sampling error!
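
A minimal sketch of answering one SuLQ-style query, assuming Gaussian noise
with standard deviation well below √n; the noise shape and constants are
illustrative, not the calibration from [Dwork, Nissim]:

    import numpy as np

    def sulq_answer(db, S, f, sigma, rng=np.random.default_rng(0)):
        """Return sum_{r in S} f(row r) plus zero-mean noise of std dev sigma << sqrt(n)."""
        exact = sum(f(db[r]) for r in S)
        return exact + rng.normal(0.0, sigma)

    # Example: noisy count of rows in S whose first attribute is 1.
    # answer = sulq_answer(db, S=range(1000), f=lambda row: int(row[0] == 1), sigma=5.0)
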
33
Our Data, Ourselves
34
Our Data, Ourselves
  • Individuals maintain their own data records
  • join a DB by setting an appropriate attribute
  • Statistical queries via an SFE of the SuLQ computation
  • privacy of the SuLQ query ⇒ this SFE is safe
  • Individuals ensure
  • their data take part in sufficiently few queries
  • sufficient random noise is added

35
Summary
  • Definitions
  • defined isolation and sanitization
  • Recursive Histogram Sanitizations
  • described approach and sketched a robust proof of
    privacy for a special distribution
  • proof exploits high dimensionality (# of columns)
  • Sanitization via perturbations
  • utility and privacy via cross-training
  • Setting the Real World Context
  • discussed a radical view of how data might be
    organized to prevent a powerful class of attacks
    based on auxiliary data
  • SuLQ tool exploits large membership (# of rows)

36
Larry Joseph Stockmeyer November 13, 1948 -
July 31, 2004
37
Larry Stockmeyer Commemoration
  • May 21-22, 2005
  • Baltimore, Maryland
  • (in conjunction with STOC 2005)
  • May 21:
  • Tutorial by Nick Pippenger (Princeton) on some
    of Stockmeyer's fundamental results in complexity
    theory
  • Lectures by Miki Ajtai (IBM), Anne Condon (UBC),
    Cynthia Dwork (Microsoft), Richard Karp
    (UC Berkeley), Albert Meyer (MIT), and Chris
    Umans (CalTech).
  • Some time will be reserved for personal remarks.
    Contact Cynthia Dwork if you want to participate
    in this part of the commemoration.
  • May 22: Lance Fortnow gives the first keynote
    address to STOC.

38
Larry Stockmeyer
  • Larry Stockmeyer, theoretical computer scientist
    and a founder of the field of complexity theory
    -- that part of computer science exploring the
    inherent difficulty of solving computational
    problems -- died Saturday, July 31, 2004, of
    pancreatic cancer.
  • Born in Evansville, Indiana, in 1948, Stockmeyer
    was educated at MIT, where he received a
    bachelor of science in mathematics and a
    master of science in electrical engineering in
    1972, followed by a doctorate in computer science
    in 1974. Stockmeyer is famous for his
    groundbreaking work proving the extreme
    difficulty of solving naturally occurring
    computational problems. His pioneering
    contributions were soon incorporated into
    textbooks on computational complexity.
  • Stockmeyer joined IBM Research in 1974, working
    first at the IBM Thomas J. Watson Research Center
    in Yorktown Heights, New York. A founding member
    of the Theory Group at the IBM Almaden Research
    Center in the early 1980s, Stockmeyer was
    elevated to Fellow of the Association for
    Computing Machinery in 1996. He remained at
    Almaden until he took a bridge to retirement from
    IBM in November 2002. After this, Stockmeyer
    enjoyed a brief affiliation with the University
    of California at Santa Cruz until his death, at
    age 55.
  • Stockmeyer is survived by his father Robert
    Stockmeyer, his sister Mary Karen Walker, and his
    former wife, dear friend, and colleague Cynthia
    Dwork.

39
Larry Joseph Stockmeyer November 13, 1948 -
July 31, 2004