1
L-Diversity: Privacy Beyond K-Anonymity
  • Ashwin Machanavajjhala, Johannes Gehrke, Daniel
    Kifer, Muthuramakrishnan Venkitasubramaniam

2
Overview
  • k-Anonymity
  • Model and Notation
  • Bayes Optimal Privacy
  • l-Diversity Principle
  • l-Diversity Instantiations
  • Multiple Sensitive Attributes
  • Monotonicity Property
  • Utility
  • Conclusion

3
Introduction
  • Quasi-identifiers: attributes that can be used
    to link data in different data sets together.
  • K-anonymous tables are tables in which every
    tuple or record is indistinguishable from at
    least k-1 other records with respect to the
    quasi-identifier attributes.
  • Sensitive attributes are attributes whose values
    must not be disclosed or linked to any
    individual.

4
Weaknesses in k-Anonymous Tables
  • Homogeneity Attacks
  • Background Knowledge Attacks

5
Homogeneity Attacks
  • k-Anonymity focuses on generalizing the
    quasi-identifiers but places no constraints on
    the sensitive attributes, which can therefore
    reveal information to an attacker.

6
Homogeneity Attacks
  • Since Alice is Bob's neighbor, she knows that Bob
    is a 31-year-old American male who lives in the
    zip code 13053. Therefore, Alice knows that Bob's
    record is number 9, 10, 11, or 12. Because all
    four of those records list the same condition,
    she can see from the data that Bob has cancer.

7
Background Knowledge Attacks
  • Depending on other information available to an
    attacker, the attacker may have an increased
    probability of determining sensitive
    information.

8
Background Knowledge Attacks
  • Alice knows that Umeko is a 21-year-old Japanese
    female who currently lives in zip code 13068.
    Based on this information, Alice learns that
    Umeko's information is contained in record number
    1, 2, 3, or 4. Adding the background knowledge
    that Japanese have an extremely low incidence of
    heart disease, Alice can conclude with near
    certainty that Umeko has a viral infection.

9
Weaknesses in k-anonymous tables
  • Given these two weaknesses, a stronger method is
    needed to ensure privacy.
  • Based on this, the authors begin to build their
    solution.

10
Model and Notation
  • Basic Notation
  • Let T = {t1, t2, ..., tn} be a table with
    attributes A1, ..., Am. We assume that T is a
    subset of some larger population O where each
    tuple represents an individual from the
    population. For example, if T is a medical
    dataset then O could be the population of the
    United States.
  • Let A denote the set of all attributes
    {A1, A2, ..., Am} and t[Ai] denote the value of
    attribute Ai for tuple t. If C = {C1, C2, ..., Cp}
    ⊆ A then we use the notation t[C] to denote the
    tuple (t[C1], ..., t[Cp]), which is the
    projection of t onto the attributes in C.

11
Basic Definitions
  • All actual identifiers, such as name, SSN, and
    address, are removed from the data, leaving
    sensitive attributes and non-sensitive
    attributes.

12
Definitions (cont.)
  • A q*-block is the set of tuples in the published
    table whose quasi-identifier values generalize to
    the same value q*; a table is k-anonymous if
    every q*-block contains at least k tuples.
13
Adversary's Background Knowledge
  • The adversary has access to the published table
    T* and knows it was derived from the original
    table T. The domain of each attribute is also
    known.
  • The adversary may know that some individuals are
    in the table and may also know some of the
    sensitive information for specific individuals
    in the table.

14
Adversary's Background Knowledge (cont.)
  • The adversary may also know demographic
    background data such as the probability of a
    condition given an age.

15
Bayes-Optimal Privacy
  • Models background knowledge as a probability
    distribution over the attributes and uses
    Bayesian inference techniques to reason about
    privacy.
  • However, Bayes-optimal privacy is only used as a
    starting point for a definition of privacy, so
    two simplifying assumptions are made:
  • T is a simple random sample of the larger
    population O.
  • There is a single sensitive attribute.

16
Changes in Belief
  • In the attack, Alice has partial knowledge of the
    distribution of the sensitive and non-sensitive
    attributes.
  • Alice's goal is to use her background knowledge
    to determine Bob's sensitive information.
  • To gauge her success, prior belief and posterior
    belief are used.

17
Prior and Posterior Belief
  • Prior belief is the adversary's belief about an
    individual's sensitive attribute before seeing
    the published table.
  • Posterior belief is the adversary's belief after
    observing the published table T*.
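  • In the paper's notation, with f the distribution
    the adversary believes the data is drawn from,
    these are (restated from the paper):

    \alpha_{(q,s)} = P_f\big(\, t[S] = s \mid t[Q] = q \,\big)

    \beta_{(q,s,T^*)} = P_f\big(\, t[S] = s \mid t[Q] = q
        \wedge \exists t^* \in T^* : t \to t^* \,\big)

    where t -> t* means that t* is the generalized
    image of tuple t in the published table.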

18
Calculating the posterior belief
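  • Theorem 3.1 in the paper expresses the observed
    posterior belief in terms of block counts and the
    distribution f (restated here):

    \beta_{(q,s,T^*)} =
        \frac{ n_{(q^*,s)} \, \frac{f(s \mid q)}{f(s \mid q^*)} }
             { \sum_{s' \in S} n_{(q^*,s')} \, \frac{f(s' \mid q)}{f(s' \mid q^*)} }

    where q* is the generalized value of q in T*,
    n_{(q*,s')} is the number of tuples in the
    q*-block with sensitive value s', and f(s'|q) is
    the probability of s' given nonsensitive value q.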
19
Privacy Principles
  • Positive disclosure: the adversary can correctly
    identify the value of an individual's sensitive
    attribute with high probability (posterior belief
    close to 1).
  • Negative disclosure: the adversary can correctly
    eliminate some possible values of the sensitive
    attribute (posterior belief close to 0).
20
  • Positive and negative disclosures are not always
    harmful; depending on the sensitive attribute and
    background knowledge, they may or may not provide
    useful information to the attacker.

21
Drawbacks to Bayes-Optimal Privacy
  • Insufficient knowledge: the publisher is unlikely
    to know the full distribution of sensitive and
    non-sensitive attributes over the full
    population.
  • The data publisher does not know the knowledge of
    a would-be attacker.

22
Drawbacks to Bayes-Optimal Privacy (cont.)
  • Instance-level knowledge cannot be modeled.
  • There are likely to be many adversaries with
    varying levels of knowledge.

23
L-Diversity Principle
  • Theorem 3.1 defines a method of calculating the
    observed belief of the adversary
  • In the case of positive disclosures, Alice wants
    to determine Bobs sensitive attribute with a
    very high probability. Given Theorem 3.1 this
    can only happen when
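  • Roughly, in the paper's equation (2), for every
    sensitive value s' other than s:

    n_{(q^*,s')} \, \frac{f(s' \mid q)}{f(s' \mid q^*)} \approx 0

    so every competing value contributes almost
    nothing to the denominator of Theorem 3.1 and the
    posterior belief in s approaches 1.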

24
The condition of equation (2) can be satisfied by a
lack of diversity in the sensitive attribute(s)
and/or strong background knowledge. Lack of
diversity in the sensitive attribute can be
described as follows:
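  • In the paper's notation, lack of diversity is
    (roughly) the condition

    \forall s' \neq s : \; n_{(q^*,s')} \ll n_{(q^*,s)}

    i.e., the count of every other sensitive value in
    the q*-block is negligible next to the count of s.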
  • Indicating that almost all tuples have the same
    sensitive value, and therefore the posterior
    belief is almost 1.

25
Strong background knowledge can also allow
information leakage, even if the sensitive values
are well represented. An attacker may still be
able to use background knowledge when the
following is true:
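  • Per Theorem 3.1, a sensitive value s' is
    effectively eliminated from a q*-block whenever

    f(s' \mid q) \approx 0

    since its term n_{(q^*,s')} f(s' \mid q) / f(s' \mid q^*)
    then vanishes no matter how often s' appears in
    the block.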
  • This means that, based on the attacker's
    knowledge, the probability of certain sensitive
    values is near zero, so they can be discarded,
    as with Japanese and heart disease.

26
  • In an l-diverse q*-block, an attacker would need
    l-1 pieces of background knowledge to eliminate
    l-1 sensitive values and infer a positive
    disclosure.

27
L-Diversity Principle
  • Given the previous discussion we arrive at the
    l-diversity principle: a q*-block is l-diverse if
    it contains at least l well-represented values
    for the sensitive attribute S; a table is
    l-diverse if every q*-block is l-diverse.

28
Revisiting the example
  • Using a 3-diverse table, we are no longer able to
    tell whether Bob has heart disease or cancer. We
    also cannot tell whether Umeko has a viral
    infection or cancer.

29
L-Diversity Principle
  • The l-diversity principle advocates ensuring
    well-represented values for sensitive attributes
    but does not define what "well represented"
    means.

30
L-Diversity Instantiations
  • Entropy l-Diversity
  • Recursive (c, l) Diversity
  • Positive Disclosure-Recursive (c, l)-Diversity
  • Negative/Positive Disclosure-Recursive (c1, c2,
    l)-Diversity

31
Entropy l-Diversity
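  • As defined in the paper, a table is entropy
    l-diverse if for every q*-block

    -\sum_{s \in S} p_{(q^*,s)} \log p_{(q^*,s)} \;\ge\; \log(l)

    where p_{(q^*,s)} = n_{(q^*,s)} / \sum_{s' \in S} n_{(q^*,s')}
    is the fraction of tuples in the block with
    sensitive value s.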
  • This implies that for a table to be entropy
    l-Diverse, the entropy of the entire table must
    be at least log(l). Therefore, entropy
    l-Diversity may be too restrictive to be
    practical.

32
Recursive (c, l) Diversity
  • Less restrictive than entropy l-diversity
  • Let s1, , sm be the possible values of sensitive
    attribute S in a q-block
  • Assume we sort the counts n(q,s1), ..., n(q,sm)
    in descending order with the resulting sequence
    r1, , rm.
  • We can say a q-block is recursive (c,l)-diverse
    if r1 lt c(r2 . rm) for a specified constant c.
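  • As an illustration, a minimal Python sketch (not
    from the paper; function names are our own) that
    checks a single q*-block for entropy l-diversity
    and recursive (c,l)-diversity from the counts of
    its sensitive values:

import math
from collections import Counter

def is_entropy_l_diverse(sensitive_values, l):
    # Entropy l-diversity: the entropy of the sensitive-value
    # distribution in the q*-block must be at least log(l).
    counts = Counter(sensitive_values)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total)
                   for n in counts.values())
    return entropy >= math.log(l)

def is_recursive_cl_diverse(sensitive_values, c, l):
    # Recursive (c,l)-diversity: with counts sorted descending as
    # r1 >= ... >= rm, require r1 < c * (rl + ... + rm).
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:  # fewer than l distinct values can never qualify
        return False
    return r[0] < c * sum(r[l - 1:])

block = ["cancer"] * 3 + ["flu"] * 2 + ["ulcer"]
print(is_entropy_l_diverse(block, 2))        # True: entropy ~1.01 >= log 2
print(is_recursive_cl_diverse(block, 2, 2))  # True: 3 < 2 * (2 + 1)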

33
Recursive (c, l) Diversity (cont.)
34
Positive Disclosure-Recursive (c, l)-Diversity
  • As mentioned earlier, some cases of positive
    disclosure may be acceptable, such as when the
    medical condition is healthy.
  • To allow these values, the authors define
    pd-recursive (c, l)-diversity.

35
pd-recursive (c, l)-diversity
  • Let Y be the set of sensitive values for which
    positive disclosure is allowed; the recursive
    (c, l) condition is then applied to the most
    frequent sensitive value not in Y.

36
Negative/Positive Disclosure-Recursive (c1, c2,
l)-Diversity
  • Npd-recursive (c1, c2, l)-diversity prevents
    negative disclosure by requiring every sensitive
    value for which negative disclosure is not
    allowed to occur in at least c2 percent of the
    tuples in every q*-block.

37
Npd-recursive (c1, c2, l)-diversity (cont.)
  • Allows a user-defined percentage to be used to
    set a minimum number of tuples.

38
Multiple Sensitive Attributes
  • Previous discussions only addressed single
    sensitive attributes
  • Suppose S and V are two sensitive attributes, and
    consider the q*-block with the following tuples:
  • (q, s1, v1), (q, s1, v2), (q, s2, v3), (q, s3, v3).
    This q*-block is 3-diverse (actually recursive
    (2,3)-diverse) with respect to S (ignoring V) and
    3-diverse with respect to V (ignoring S).
    However, if we know that Bob is in this block and
    his value for S is not s1, then his value for
    attribute V cannot be v1 or v2, and therefore
    must be v3.

39
Multiple Sensitive Attributes (cont.)
  • We can then see that multiple sensitive
    attributes can each be l-diverse alone but may
    not be l-diverse in combination with other
    sensitive attributes.
  • To address this problem, we can add the
    additional sensitive attributes to the
    quasi-identifier.

40
Multiple Sensitive Attributes (cont.)
  • Make additional sensitive attributes part of the
    quasi-identifier.

41
Implementing Privacy Preserving Data Publishing
  • Domain generalization is used to define a
    generalization lattice.
  • For discussion, all non-sensitive attributes are
    combined into a multi-dimensional attribute (Q)
    where the bottom element on the lattice is the
    domain of Q and the top of the lattice is the
    domain where each dimension of Q is generalized
    to a single value.

42
Implementing Privacy Preserving Data Publishing (cont.)
  • The algorithm for publishing should find the
    point on the lattice where the table T* preserves
    privacy and is as useful as possible.
  • The usefulness (utility) of the table diminishes
    as the data becomes more generalized, so the most
    utility is at the bottom of the lattice.

43
Implementing Privacy Preserving Data Publishing (cont.)
  • The monotonicity property provides a stopping
    point in the lattice search: once privacy is
    protected, further generalization does not
    increase privacy.
  • For example, if zip code 13065 can be generalized
    to 1306* while preserving privacy, generalizing
    it to 130** also preserves privacy. However, the
    additional generalization reduces utility.

44
Monotonicity Property
  • k-Anonymity satisfies the monotonicity property,
    which guarantees the correctness of all efficient
    lattice search algorithms, so if l-diversity also
    satisfies the monotonicity property these
    algorithms can be reused for l-diversity.

45
Monotonicity Property
46
Monotonicity Property
  • Therefore, to create an algorithm for
    l-diversity, a k-anonymity routine can be used,
    substituting the l-diversity check for the
    k-anonymity check (see the sketch below).
  • Since l-diversity is local to each q*-block and
    the l-diversity test is based on the counts of
    sensitive attribute values, the testing is quite
    efficient.
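  • A minimal sketch of this idea (illustrative, not
    the authors' implementation): with a single
    quasi-identifier such as zip code, the
    generalization lattice is a chain where level g
    replaces the last g digits with '*'. Monotonicity
    lets us scan from the least generalized level
    upward and stop at the first level that passes
    the diversity check, which is the point of
    maximum utility. The distinct-values check below
    stands in for any of the instantiations above.

from collections import defaultdict

def generalize(zip_code, level):
    # Replace the last `level` digits of the zip code with '*'.
    return zip_code if level == 0 else zip_code[:-level] + "*" * level

def is_l_diverse(table, level, l):
    # Distinct l-diversity: every q*-block must contain at least
    # l distinct sensitive values.
    blocks = defaultdict(set)
    for zip_code, disease in table:
        blocks[generalize(zip_code, level)].add(disease)
    return all(len(values) >= l for values in blocks.values())

def least_generalization(table, l, max_level=5):
    # Scan the chain lattice bottom-up; by monotonicity the first
    # level that passes is the most useful safe publication point.
    for level in range(max_level + 1):
        if is_l_diverse(table, level, l):
            return level
    return None

table = [("13053", "cancer"), ("13053", "flu"),
         ("13068", "flu"), ("13068", "cancer"),
         ("14853", "cancer"), ("14850", "flu")]
print(least_generalization(table, l=2))  # -> 1: publish 1305*, 1306*, 1485*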

47
Comparison Testing
  • In tests comparing k-anonymization with
    l-diversity, entropy l-diversity displayed
    similar if not better run times. As the size of
    the quasi-identifier grows, l-diversity performs
    better.

48
Utility
  • Three metrics are used for utility:
  • Generalization height of the anonymized table
  • Minimum average group size of the q*-blocks
  • Discernibility metric: the number of tuples
    indistinguishable from each other

49
Conclusions
  • The paper presents l-diversity as a means of
    anonymizing data, and the authors show that their
    algorithms provide a stronger level of privacy
    than k-anonymity routines.
  • They also present data supporting the claim that
    the differences in performance and utility are
    minor.