1
Towards Privacy in Public Databases
  • Shuchi Chawla, Cynthia Dwork,
    Frank McSherry, Adam Smith, Hoeteck Wee

2
Database Privacy
  • A Census problem
  • Individuals provide information
  • Census Bureau publishes sanitized records
  • Privacy is legally mandated; what utility can we
    achieve?
  • There is an inherent privacy vs. utility trade-off
  • Our goal
  • Find a middle path
  • preserve macroscopic properties
  • disguise individual records (containing private
    info)
  • Establish a framework for meaningful comparison
    of techniques

3
What about Secure Function Evaluation?
  • Secure Function Evaluation [Yao, GMW]
  • Allows parties to collaboratively compute a
    function f of their private inputs
  • π = f(a, b, c, …)  (e.g., π = sum(a, b, c, …))
  • Each player learns only what can be deduced from
    π and her own input to f
  • SFE and privacy are complementary problems;
    one does not imply the other
  • SFE: Given what must be preserved, protect
    everything else
  • Privacy: Given what must be protected, preserve
    as much as you can

4
This talk
  • A formalism for privacy
  • What we mean by privacy
  • A good sanitization procedure
  • Results
  • Histograms and Perturbations
  • Subsequent work; open problems

5
What do we mean by Privacy?
  • Ruth Gavison: "Protection from being brought to
    the attention of others"
  • inherently valuable
  • attention invites further privacy loss
  • Privacy is assured to the extent that one blends
    in with the crowd
  • An appealing definition that can be converted into
    a precise mathematical statement

6
The basic model a geometric approach
  • Database consists of points in high-dimensional
    space R^d
  • Samples from some underlying distribution
  • Points are unlabeled; you are your collection of
    attributes
  • (Relative) distance is everything: points that
    are closer are more similar, and vice versa
  • A real database RDB controlled by a central
    authority
  • n points in d-dimensional space
  • Think of d as the number of sensitive attributes
  • A sanitized database SDB released to the
    world
  • Information about fake individuals, a summary of
    the real data, or a combination of both

7
The adversary or Isolator
  • On input SDB and auxiliary information, the
    adversary outputs a point q ∈ R^d
  • q isolates a real point x if it is much closer
    to x than to x's neighbors
  • i.e., if B(q, cδ), where δ = ‖q − x‖, contains
    fewer than T other points from RDB
  • c, T: privacy parameters, e.g., c = 4, T = 100

(Figure: a point x of RDB at distance δ from q is isolated
when B(q, cδ) contains few other points of RDB; otherwise
it is not isolated.)
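A minimal sketch of the isolation test above, assuming RDB is
an (n, d) NumPy array; the function name and parameter defaults
are illustrative, not from the paper.

    import numpy as np

    def isolates_some_point(q, rdb, c=4.0, T=100):
        """Return True if q c-isolates its nearest real point.

        delta is the distance from q to its nearest point x;
        x is isolated if B(q, c*delta) holds fewer than T
        points of RDB other than x itself.
        """
        dists = np.linalg.norm(rdb - q, axis=1)  # distance from q to each point
        delta = dists.min()                      # distance to nearest point x
        others = np.sum(dists <= c * delta) - 1  # B(q, c*delta), excluding x
        return others < T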
8
Requirement for the sanitizer
  • No way of obtaining privacy if AUX already
    reveals too much!
  • Sanitization compromises privacy if giving the
    adversary access to the SDB considerably
    increases its probability of success
  • The definition of "considerably" can be forgiving,
    e.g., 1/1000
  • Rigorously:
  • Provides a framework for describing the power of
    a sanitization method, and hence for comparisons
  • Aux is going to cause trouble. Ignore it for now.

(ε ≈ 2^{-d} in our results)
∀ D, ∀ I, ∃ I′ such that, w.h.p. over RDB ∼ D, ∀ aux,
∀ S ⊆ RDB:
  Pr[∃ x ∈ S : I(SDB, aux) isolates x]
    − Pr[∃ x ∈ S : I′(aux) isolates x] ≤ ε
∀ D, ∀ I, ∃ I′ such that, w.h.p. over RDB ∼ D, ∀ aux,
∀ x ∈ RDB:
  Pr[I(SDB, aux) isolates x] − Pr[I′(aux) isolates x] ≤ ε
9
A bad sanitizer that passes
[due to Abhinandan Das, Cornell]
  • Disguise one attribute extremely well
  • Leave the others in the clear
  • Without info about the special attribute, the
    adversary cannot isolate any point
  • However, he knows all other attributes exactly!
  • What goes wrong?
  • The assumption that distance is everything
  • No isolation ⇒ no privacy breach, even if the
    adversary knows a lot of information

10
Utility goals
  • Desirable results
  • Macroscopic properties (e.g. means) should be
    preserved
  • Running statistical tests / data-analysis
    algorithms should return results similar to those
    obtained from real data
  • We show
  • Concrete point-wise results on histograms and
    clustering algorithms

11
This talk
  • A formalism for privacy
  • What we mean by privacy
  • A good sanitization procedure
  • Results
  • Histograms and Perturbations
  • Subsequent work; open problems

12
Two techniques for sanitization
  • Recursive histograms
  • Assume the universe is the d-dimensional hypercube
    [-1,1]^d
  • As long as a cell contains ≥ T points
  • Subdivide it into 2^d hypercubes by splitting each
    side evenly
  • Recurse until all cells have < T points
  • Output a list of cells and counts
  • Output a list of cells and counts

(Figure: example with d = 2, T = 3)
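A minimal sketch of this recursive subdivision, assuming the
universe [-1,1]^d and continuous-valued data; the function name
and the max_depth guard are illustrative additions.

    import numpy as np

    def recursive_histogram(points, lo, hi, T, depth=0, max_depth=20):
        """Return a list of (lo, hi, count) cells: any cell still
        holding at least T points is split into 2^d equal subcells."""
        n, d = points.shape
        if n < T or depth >= max_depth:
            return [(lo, hi, n)]          # leaf: publish bounds and count only
        mid = (lo + hi) / 2.0
        cells = []
        for corner in range(2 ** d):      # enumerate the 2^d subcells
            upper = np.array([(corner >> i) & 1 for i in range(d)], dtype=bool)
            sub_lo = np.where(upper, mid, lo)
            sub_hi = np.where(upper, hi, mid)
            # boundary handling is simplified (half-open cells); a
            # measure-zero issue for continuous data
            mask = np.all((points >= sub_lo) & (points < sub_hi), axis=1)
            cells += recursive_histogram(points[mask], sub_lo, sub_hi,
                                         T, depth + 1, max_depth)
        return cells

    # Example: 1000 uniform points in [-1,1]^3, T = 10
    rng = np.random.default_rng(0)
    data = rng.uniform(-1.0, 1.0, size=(1000, 3))
    cells = recursive_histogram(data, np.full(3, -1.0), np.full(3, 1.0), T=10)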
13
Two techniques for sanitization
  • Recursive histograms
  • Perturbation
  • For every point x, compute its T-radius t_x:
    the radius with |B(x, t_x) ∩ RDB| = T
  • Add a random vector to x of length proportional to
    t_x

(Figure: perturbation doesn't work by itself; the example
shown has T = 1.)
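A minimal sketch of the T-radius perturbation, again assuming
an (n, d) NumPy array; the O(n^2) pairwise-distance step and
the function name are illustrative choices.

    import numpy as np

    def perturb_with_T_radius(rdb, T, rng=None):
        """Push each point x in a uniformly random direction by its
        T-radius t_x, the radius of the smallest ball around x that
        holds T points of RDB (counting x itself)."""
        if rng is None:
            rng = np.random.default_rng(0)
        n, d = rdb.shape
        # all pairwise distances; sorted row i is 0, 1st-NN, 2nd-NN, ...
        dists = np.linalg.norm(rdb[:, None, :] - rdb[None, :, :], axis=2)
        t = np.sort(dists, axis=1)[:, T - 1]   # T-radius (index 0 is x itself)
        dirs = rng.standard_normal((n, d))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
        return rdb + t[:, None] * dirs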
14
Two techniques for sanitization
  • Recursive histograms
  • Perturbation combined with histograms
  • Results on privacy
  • Rely on randomness in distribution and
    sanitization
  • Do not use any computational assumptions
  • When D is uniform over a hypercube, c = O(1), and
    T is arbitrary:
  • the adversary's probability of success is ε ≈ 2^{-d}
  • Better results for special cases

15
Key results on utility
  • Perturbation-based sanitization
  • Allows for various clustering algorithms to
    perform nearly as well as on real data
  • Spectral techniques
  • Diameter-based clusterings
  • Histograms: a popular summarization technique in
    statistics
  • Recursive histograms have the benefit of providing
    more detail where required
  • Provide density information even without the
    counts
  • No randomness involved!

16
A brief proof of privacy
  • Recall recursive histograms
  • Simplifying assumption
  • Input distribution is uniform over the hypercube
  • Intuition
  • The adversary's view: a product of uniform
    distributions over histogram cells
  • The uniform distribution is well spread out;
    the adversary cannot conclusively single out a
    point in it

17
A brief proof of privacy
  • Case 1: Sparse cell
  • Expected distance ‖q − x‖ is proportional to the
    diameter of the cell
  • c times this distance is larger than the diameter
    of the parent cell
  • Therefore, B(q, c‖q − x‖) contains at least T
    points
  • Case 2: Dense cell
  • Consider the balls B(q,r) and B(q,cr) for some
    radius r
  • The adversary wins if
  • Pr[∃ x ∈ B(q,r)] is large, and
  • Pr[≥ T points in B(q,cr)] is small
  • However, we show Pr[x ∈ B(q,cr)] ≫ Pr[x ∈ B(q,r)]

18
A brief proof of privacy
  • Lemma: Let c be a large enough constant. For any
    cell and any r < diam(cell)/c,
    Pr[x ∈ B(q,cr) ∩ cell] ≥ 2^d · Pr[x ∈ B(q,r) ∩ cell]
  • Proof idea:
  • Pr[x ∈ B(q,r) ∩ cell] ∝ Vol(B(q,r) ∩ cell)
  • Vol(B(q,cr) ∩ cell) > 2^d · Vol(B(q,r) ∩ cell)

(Uses arguments about normal and uniform random variables.)
  • Corollary: The adversary's probability of success
    is < 2^{-d}

(Figure: B(q,r) nested inside B(q,cr), both intersecting
the cell.)
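A back-of-the-envelope version of the volume step, ignoring
the intersection with the cell (the lemma's real work is
showing the gap survives that intersection): in R^d the
volume of a ball scales as the d-th power of its radius, so

    \[
      \frac{\mathrm{Vol}(B(q,cr))}{\mathrm{Vol}(B(q,r))}
        = \frac{(cr)^d}{r^d} = c^d \;\ge\; 2^d
      \qquad (c \ge 2).
    \]

Since, under the uniform distribution, Pr[x ∈ B ∩ cell] is
proportional to Vol(B ∩ cell), this 2^d volume gap is exactly
the 2^d probability gap claimed by the lemma.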
19
This talk
  • A formalism for privacy
  • What we mean by privacy
  • A good sanitization procedure
  • Results
  • Histograms and Perturbations
  • Subsequent work; open problems

20
Follow-up work
  • Isolation in few dimensions
  • Adversary must be more and more accurate in fewer
    dimensions
  • Randomized recursive histograms [Chawla, Dwork,
    McSherry, Talwar]
  • Similar privacy guarantees for nearly-uniform
    distributions over well-rounded universes
  • Preserve distances between pairs of points to a
    reasonable accuracy (additive error depending on
    T)
  • General-case impossibility
  • Cannot allow arbitrary AUX: ∀ utility goals and
    ∀ definitions of privacy, ∃ AUX that prevents
    privacy-preserving sanitization

21
What about the real world?
  • Lessons from the abstract model
  • High dimensionality is our friend
  • Histograms are powerful; spherical perturbations
    are promising
  • Need to scale different attributes appropriately,
    so that data is well-rounded
  • Moving towards real data
  • Outliers
  • Our notion of c-isolation deals with them; their
    existence may be disclosed
  • Discrete attributes
  • Possible solution: convert them into real-valued
    attributes by adding noise?
  • The low-dimensional case
  • Is it inherently impossible?
  • Dinur and Nissim show impossibility for
    1-dimensional data

22
Questions?
23
Prior work on privacy
  • Long-standing problem in statistics and database
    theory; a wide variety of definitions and
    techniques
  • Statistical approaches
  • Alter the frequency (PRAN/DS/PERT) of particular
    features, while preserving means.
  • Additionally, erase values that reveal too much
  • Query-based approaches
  • Disallow queries that reveal too much
  • Output perturbation (add noise to true answer)
  • No unified definition of privacy; incomplete
    analyses
  • e.g., erasure or refusal to answer can disclose
    information

24
Key results on privacy
  • Sanitization: one perturbed point, all other
    points real
    Aux info: distribution
    No. of points: n
    Distribution: uniform over a bounding box, or on
    the surface of a sphere
    Prob. of isolation: 2^{-Ω(d)}
  • Sanitization: histogram over all points
    Aux info: distribution + a subset of the points
    No. of points: n
    Distribution: uniform over a hypercube
    Prob. of isolation: 2^{-Ω(d)}
  • Sanitization: histogram over n/2 points, and n/2
    perturbed points
    Aux info: distribution + a subset of the points
    No. of points: 2^{o(d)}
    Distribution: uniform over a hypercube
    Prob. of isolation: O(n · 2^{-Ω(d)})
Results rely on the randomness in the input
distribution and sanitization, not on any
computational hardness assumption
25
Everybody's first suggestion
  • Learn the distribution, then output
  • A description of the distribution, or
  • Samples from the learned distribution
  • Unsatisfactory
  • Want to reflect facts on the ground
  • Statistically insignificant clusters can be
    important for allocating resources
  • Don't know in advance which aspects of the data
    will be useful
  • Work in statistics on synthetic data has no
    compelling privacy arguments