Incognito: Efficient Full-Domain K-Anonymity

Slides: 54
Provided by: cmpeBo

1
Incognito: Efficient Full-Domain K-Anonymity
  • Presenter
  • Melih Çelik

2
Agenda
  • Introduction
  • Full-Domain Generalization
  • Incognito
  • Basic Incognito
  • Algorithm Optimizations
  • Performance Analysis
  • Taxonomy of k-anonymization models
  • Related Work & Conclusions

4
Introduction
  • Data published in public environments generally
    have some removed attributes
  • e.g. Name, Social Security Number
  • Remaining attributes can sometimes be used to
    re-identify individuals in the published data
  • Joining multiple public data sets can yield
    unambiguous information about an individual

5
Paper Overview
  • This paper proposes a framework for one model of
    k-anonymization called full-domain
    generalization
  • What to expect from this paper
  • A set of algorithms for producing minimal
    full-domain generalizations
  • Taxonomy of k-anonymization models

6
K-Anonymization
  • K-anonymization is a technique that prevents
    joining attacks
  • It generalizes or suppresses portions of the
    released microdata so that no individual can be
    uniquely distinguished within a group of size k.
    That is to say, each record is indistinguishable
    from at least k-1 other records

7
Sample Database
  • Hospital Patient Database

8
Quasi-Identifier Attribute Set
  • Quasi-Identifier Attribute Set: a
    quasi-identifier attribute set Q is a minimal set
    of attributes in table T that can be joined with
    external information to re-identify individual
    records (with sufficiently high probability)
  • In other words, the minimum set of columns in a
    table that can be used to nearly uniquely
    identify records
  • The elements of a quasi-identifier attribute set
    are assumed to be known based on specific
    knowledge of the domain

9
Frequency Set
  • Frequency Set: given a relation T and a set of
    attributes Q of size n
  • A mapping from each unique combination of values
    <q0, ..., qn> of Q to the total number of
    tuples in T with those values
  • Can be seen as the counts returned by a
    COUNT(*) query with the attributes of Q in the
    GROUP BY clause
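
The frequency set definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the `PATIENTS` rows, the `COLUMNS` tuple and the function name `frequency_set` are assumptions made up for the example.

```python
from collections import Counter

# Illustrative rows (Birthdate, Sex, Zipcode); hypothetical data, not from the slides.
COLUMNS = ("Birthdate", "Sex", "Zipcode")
PATIENTS = [
    ("1/21/76", "Male", "53715"),
    ("4/13/86", "Female", "53715"),
    ("2/28/76", "Male", "53703"),
    ("1/21/76", "Male", "53703"),
    ("4/13/86", "Female", "53706"),
    ("2/28/76", "Female", "53706"),
]

def frequency_set(rows, attrs):
    """Map each combination of values of `attrs` to its tuple count,
    like SELECT COUNT(*) ... GROUP BY attrs."""
    idx = [COLUMNS.index(a) for a in attrs]
    return Counter(tuple(row[i] for i in idx) for row in rows)

freq = frequency_set(PATIENTS, ("Sex", "Zipcode"))
for values, count in sorted(freq.items()):
    print(values, count)
```

The keys of `freq` play the role of the value combinations of Q, and the values are the counts that the k-anonymity property is later tested against.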

10
K-Anonymity Property
  • K-Anonymity Property: a relation T is said to be
    k-anonymous with respect to attribute set Q if
    every count in the frequency set of T with
    respect to Q is greater than or equal to k
  • In SQL terms, a GROUP BY clause is formed from
    the attributes in Q
  • Each and every count produced by this grouping
    has to be greater than or equal to k

11
K-Anonymity Property (Cont'd)
  • Sample k-anonymity check on the Patients database
  • k = 2 (2-anonymity)
  • Q = {Zipcode, Sex}
  • SELECT COUNT(*) FROM Patients GROUP BY Zipcode,
    Sex
  • The results show that Patients is not
    2-anonymous with respect to Q, since they
    contain counts smaller than 2
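
The same check can be sketched outside SQL. The rows below are hypothetical stand-ins for the Patients table (the slide's table did not survive in this transcript), and `is_k_anonymous` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical (Birthdate, Sex, Zipcode) rows standing in for Patients.
PATIENTS = [
    ("1/21/76", "Male", "53715"),
    ("4/13/86", "Female", "53715"),
    ("2/28/76", "Male", "53703"),
    ("1/21/76", "Male", "53703"),
    ("4/13/86", "Female", "53706"),
    ("2/28/76", "Female", "53706"),
]
SEX, ZIPCODE = 1, 2  # column positions

def is_k_anonymous(rows, attr_indexes, k):
    """True if every group formed by the given columns has at least k rows."""
    counts = Counter(tuple(row[i] for i in attr_indexes) for row in rows)
    return all(count >= k for count in counts.values())

not_anonymous = is_k_anonymous(PATIENTS, (ZIPCODE, SEX), 2)  # some groups have 1 row
sex_only = is_k_anonymous(PATIENTS, (SEX,), 2)               # 3 Male, 3 Female
print(not_anonymous, sex_only)
```

With these rows the pair (53715, Male) occurs only once, so the table is not 2-anonymous with respect to {Zipcode, Sex}, which mirrors the conclusion on the slide.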

12
K-Anonymization
  • K-Anonymization: obtaining a view V of a relation
    T such that the view modifies, distorts, or
    suppresses the data of T according to some
    mechanism, and V satisfies the k-anonymity
    property with respect to the set of
    quasi-identifier attributes
  • T and V should consist of multiple attributes
  • Relation T is k-anonymized to view V in order to
    prevent the re-identification of records by
    joining on the quasi-identifier attributes of T

14
Domain Generalization
  • In a relational database, there is a domain
    (integer, date, ..) related to each attribute
  • Constructing a more general domain from an
    existing domain is called domain generalization
  • e.g. generalizing the Zipcode domain by dropping
    the least significant digit
  • Domain generalization can be achieved in several
    ways

15
Domain Generalization Relationship
  • A domain generalization relationship is written
    Di ≤D Dj to denote that domain Dj is either
    identical to, or a domain generalization of, Di
  • Values in domain Dj are thus generalizations of
    values in domain Di. This property results in a
    many-to-one relationship between original domain
    values and derived domain values
  • γ: Di → Dj is the function that captures this
    many-to-one relationship; it is called the value
    generalization function
16
Domain Generalization Relationship (Cont'd)
  • If there is an edge from Di to Dj, Dj is called
    the direct generalization of Di
  • The domain generalization relationship is
    transitive
  • If Di ≤D Dj and Dj ≤D Dk then Di ≤D Dk
  • Transitivity leads to another definition: the
    Domain Generalization Hierarchy

17
Domain Generalization Hierarchy
  • A Domain Generalization Hierarchy is defined to be
    a set of domains that is totally ordered by the
    domain generalization relationship ≤D
  • It can be thought of as the nodes in a chain of
    direct generalizations
  • Edges: direct generalizations
  • Paths: implied generalizations
  • The domain generalization hierarchies of the
    Patients relation are shown on the following slide

18
Domain Generalization Hierarchy (Cont'd)
  • Figure 2. Domain and value generalization
    hierarchies (panels a to f)

19
Domain Generalization Hierarchy (Cont'd)
  • Recall that γ: Di → Dj is called the value
    generalization function
  • γ+ is used as a shorthand for the composition of
    one or more value generalization functions,
    producing direct and implied value
    generalizations
  • These functions form a value-level tree in which
  • Edges are defined by γ
  • Paths are defined by γ+
  • For example, in Figure 2-b
  • 5371* = γ(53715) and 537** ∈ γ+(53715)
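
As a rough illustration of γ and γ+ for the Zipcode hierarchy, the sketch below masks one more trailing digit per step. The function names `gamma` and `gamma_plus` are invented for the example, and unlike the hierarchy on the slides (which stops at Z2) this sketch would keep masking until no digits remain.

```python
def gamma(zipcode):
    """One direct generalization step: mask the rightmost unmasked digit."""
    digits = zipcode.rstrip("*")
    if not digits:
        raise ValueError("value is already fully generalized")
    masked = len(zipcode) - len(digits) + 1
    return digits[:-1] + "*" * masked

def gamma_plus(zipcode, steps):
    """Composition of `steps` >= 1 direct generalization steps."""
    if steps < 1:
        raise ValueError("gamma_plus composes one or more steps")
    for _ in range(steps):
        zipcode = gamma(zipcode)
    return zipcode

z1 = gamma("53715")          # direct generalization
z2 = gamma_plus("53715", 2)  # implied generalization
print(z1, z2)
```

The many-to-one nature of γ shows up directly: `gamma("53715")` and `gamma("53710")` return the same value.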

20
Domain Generalization (Cont'd)
  • The domain generalization hierarchies of multiple
    attributes, each with its own domain, combine
    into a multi-attribute generalization lattice
    (Fig. 3)
  • n single-attribute hierarchies form a complete
    lattice of n-vectors of domains with the
    following properties
  • Each edge is a direct multi-attribute domain
    generalization relationship
  • The bottom element is the source of hierarchy
    chain and has the most specific domains
  • The top element is the sink of the chain and has
    the most general domains

21
Domain Generalization (Cont'd)
Figure 3. Generalization lattice for the Zipcode
and Sex attributes and corresponding lattice of
distance vectors
  • The height of a multi-attribute generalization is
    defined as the sum of the values in the
    corresponding distance vector
  • Height value will be used for finding the minimal
    full-domain generalization
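
A small sketch of the lattice of distance vectors for Sex and Zipcode may help. It assumes, as in Figure 3, that Sex has levels S0 to S1 and Zipcode has levels Z0 to Z2; the names `DEPTHS`, `lattice_nodes`, `height` and `direct_generalizations` are illustrative.

```python
from itertools import product

# Assumed hierarchy depths: Sex has S0..S1, Zipcode has Z0..Z2.
DEPTHS = {"Sex": 1, "Zipcode": 2}

def lattice_nodes(depths):
    """All distance vectors, one level per attribute."""
    return list(product(*(range(d + 1) for d in depths.values())))

def height(vector):
    """Height of a multi-attribute generalization: sum of its levels."""
    return sum(vector)

def direct_generalizations(vector, depths):
    """Vectors reachable by raising exactly one attribute by one level."""
    limits = list(depths.values())
    result = []
    for i, level in enumerate(vector):
        if level < limits[i]:
            result.append(vector[:i] + (level + 1,) + vector[i + 1:])
    return result

nodes = lattice_nodes(DEPTHS)
bottom, top = min(nodes, key=height), max(nodes, key=height)
print(sorted(nodes, key=height))
print(direct_generalizations((0, 0), DEPTHS))
```

Here `(0, 0)` corresponds to <S0, Z0> at the bottom of the lattice and `(1, 2)` to <S1, Z2> at the top, with height 3.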

22
Full-Domain Generalization
  • Full-domain Generalization maps the entire domain
    of each quasi-identifier attribute in T to a more
    general domain in its domain generalization
    hierarchy
  • This scheme guarantees that all values of a
    particular attribute in V belong to the same
    domain
  • To prevent a k-anonymization from generalizing
    domains more than necessary, minimality is
    defined as follows
  • V should be k-anonymous
  • The height of the resulting generalization is less
    than or equal to that of any other k-anonymous
    full-domain generalization (a search mechanism
    is needed to find it)

23
Full-Domain Generalization Algorithms
  • To guarantee the minimality of the
    anonymization, a search technique should be used
  • Binary search (Samarati) over the generalization
    lattice
  • Bottom-up breadth-first search: start with the
    least general domains at the root and check
    whether each generalization satisfies k-anonymity
  • This paper refines the BFS using a bottom-up
    aggregation (rollup) along the domain hierarchies
  • The frequency sets that determine the k-anonymity
    of a node's generalizations are computed from the
    frequency set of that node
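
The bottom-up idea can be sketched as a naive level-by-level scan of the Sex/Zipcode lattice that stops at the first height where a k-anonymous generalization appears. This is a simplified illustration without rollup, not the paper's algorithm; the rows, the hierarchy (S1 = "Person", one masked digit per Zipcode level) and all names are assumptions.

```python
from collections import Counter
from itertools import product

# Hypothetical (Sex, Zipcode) rows.
ROWS = [
    ("Male", "53715"), ("Female", "53715"), ("Male", "53703"),
    ("Male", "53703"), ("Female", "53706"), ("Female", "53706"),
]
MAX_LEVELS = (1, 2)  # assumed: S0..S1 and Z0..Z2

def generalize(row, node):
    """Apply the generalization levels in `node` to one row."""
    sex, zipcode = row
    sex_level, zip_level = node
    if sex_level > 0:
        sex = "Person"
    zipcode = zipcode[:len(zipcode) - zip_level] + "*" * zip_level
    return (sex, zipcode)

def is_k_anonymous(rows, node, k):
    counts = Counter(generalize(row, node) for row in rows)
    return all(count >= k for count in counts.values())

def minimal_generalizations(rows, k):
    """Visit nodes in order of height; keep the k-anonymous ones of least height."""
    nodes = sorted(product(*(range(m + 1) for m in MAX_LEVELS)), key=sum)
    best_height, found = None, []
    for node in nodes:
        if best_height is not None and sum(node) > best_height:
            break  # every remaining node is strictly higher
        if is_k_anonymous(rows, node, k):
            best_height = sum(node)
            found.append(node)
    return found

minimal = minimal_generalizations(ROWS, 2)
print(minimal)
```

For these rows the only minimal 2-anonymous generalization is `(1, 0)`, that is <S1, Z0>: generalizing Sex alone is enough, while generalizing Zipcode by one level is not.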

25
Incognito
  • Three important techniques are combined
  • Generalization framework of Samarati and Sweeney
  • Managing multi-dimensional data
  • Mining association rules
  • Since the COUNT measure is central, each
    domain generalization hierarchy can be thought of
    as a dimension
  • With the dimension idea, a relation T and the
    generalization dimensions of its
    quasi-identifier attributes form a relational
    star schema

26
Incognito (Cont'd)
  • Figure 4. Star schema including generalization
    dimensions for quasi-identifier attributes

27
Incognito (Cont'd)
  • A full-domain k-anonymization is obtained in two
    steps
  • Joining the relation T with its dimension tables
  • Projecting the appropriate domain attributes
  • Two key properties of dimension generalization
    are
  • Generalization Property
  • Rollup Property

28
Generalization Property
  • Assume two sets of attributes P and Q in a
    relation T such that DP ≤D DQ
  • That is, the domain of Q is generalized from the
    domain of P
  • If T is k-anonymous with respect to P, then it is
    also k-anonymous with respect to Q
  • This follows because generalizing P to Q can only
    merge groups, so each group under Q contains at
    least as many tuples as the groups under P it was
    formed from
  • Each group under P contains k or more tuples
  • Each group under Q contains the same number or
    more
29
Rollup Property
  • Assume two sets of attributes P and Q in a
    relation T such that DP ≤D DQ
  • If f1, the frequency set of T with respect to P,
    is known, then each count in f2, the frequency
    set of T with respect to Q, can be generated by
    summing the counts in f1 that the value
    generalization function maps to the corresponding
    value set of f2

30
Rollup Property (Cont'd)
  • Assume P is <B, S, Z0> and Q is <B, S, Z1>
  • The frequency set of P is calculated by a
    COUNT(*) query with the Birthdate, Sex and
    Zipcode attributes in the GROUP BY clause
  • The frequency set of Q is calculated by summing
    the counts of P's groups, regrouped with a
    GROUP BY clause on Birthdate, Sex and Z1
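
The rollup from <B, S, Z0> to <B, S, Z1> can be sketched as summing counts under the value generalization function, without touching the base table. The counts in `f1` are hypothetical, and `rollup` and `zip_to_z1` are names chosen for the example.

```python
from collections import Counter

# Hypothetical frequency set f1 of T with respect to P = <B, S, Z0>.
f1 = Counter({
    ("1/21/76", "Male", "53715"): 1,
    ("1/21/76", "Male", "53710"): 2,
    ("4/13/86", "Female", "53706"): 1,
    ("4/13/86", "Female", "53703"): 3,
})

def zip_to_z1(key):
    """Value generalization Z0 -> Z1: mask the last Zipcode digit."""
    birthdate, sex, zipcode = key
    return (birthdate, sex, zipcode[:-1] + "*")

def rollup(freq, generalize):
    """Build the more general frequency set by summing counts of merged groups."""
    rolled = Counter()
    for key, count in freq.items():
        rolled[generalize(key)] += count
    return rolled

f2 = rollup(f1, zip_to_z1)  # frequency set with respect to Q = <B, S, Z1>
print(dict(f2))
```

The total number of tuples is preserved by the rollup, and only the grouping becomes coarser.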

31
Incognito (Cont'd)
  • One more property is derived from a
    dynamic-programming approach used for mining
    frequent itemsets
  • Subset Property: let Q be a set of attributes in
    relation T. If T is k-anonymous with respect to
    Q, then it is k-anonymous with respect to any set
    of attributes P such that P is a subset of Q
  • Set P contains the same or fewer attributes to
    group by
  • Fewer grouping attributes means the groups either
    remain the same or merge with other groups, which
    increases their sizes, so k-anonymity is
    satisfied in each case
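
A quick way to see the subset property is to check every non-empty subset of a quasi-identifier on a small table. The rows below are hypothetical and were chosen to be 2-anonymous with respect to (Sex, Zipcode); the helper names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (Sex, Zipcode) rows, 2-anonymous with respect to both columns.
ROWS = [
    ("Male", "53715"), ("Male", "53715"),
    ("Female", "53706"), ("Female", "53706"),
    ("Male", "53706"), ("Male", "53706"),
]

def is_k_anonymous(rows, attr_indexes, k):
    counts = Counter(tuple(row[i] for i in attr_indexes) for row in rows)
    return all(count >= k for count in counts.values())

def subsets_are_k_anonymous(rows, attr_indexes, k):
    """Check k-anonymity for every non-empty subset of the given attributes."""
    return all(
        is_k_anonymous(rows, subset, k)
        for size in range(1, len(attr_indexes) + 1)
        for subset in combinations(attr_indexes, size)
    )

full = is_k_anonymous(ROWS, (0, 1), 2)
all_subsets = subsets_are_k_anonymous(ROWS, (0, 1), 2)
print(full, all_subsets)
```

Read in the other direction, this is what makes pruning possible: if T is not k-anonymous with respect to some subset P, no superset of P needs to be considered at that generalization level.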

32
Basic Incognito Algorithm
  • Generates set of all possible k-anonymous
    full-domain generalizations of T
  • A graph of candidate multi-attribute
    generalizations is constructed from the subsets
    of the quasi-identifier of size i. Its set of
    nodes is called Ci.
  • The set of direct multi-attribute generalization
    relationships (edges) connecting these nodes is
    denoted Ei.
  • Each iteration consists of two main parts
  • A modified breadth-first search over the graph
    produces the set Si, which contains the
    k-anonymous generalizations of size i.
  • After obtaining Si, the algorithm constructs the
    set of candidate nodes of size i+1 (Ci+1) and the
    edges connecting them (Ei+1)

33
Breadth-First Search
  • At the ith iteration, a search determines the
    k-anonymity of table T with respect to each
    candidate in Ci
  • The search starts with the roots: nodes that are
    not direct generalizations of any other node
  • The rollup property provides an optimization in
    the bottom-up aggregation
  • The generalization property is used to mark the
    generalizations of a node as k-anonymous, without
    checking them, once that node is found to be
    k-anonymous
  • Fewer computations are needed in later iterations
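
The search with marking can be sketched on the small Sex/Zipcode lattice. This is a simplified illustration and not the paper's pseudocode: it recomputes each frequency set from the rows instead of using rollup, it marks generalizations transitively, and the data, hierarchy and names are assumptions.

```python
import heapq
from collections import Counter

ROWS = [  # hypothetical (Sex, Zipcode) rows
    ("Male", "53715"), ("Female", "53715"), ("Male", "53703"),
    ("Male", "53703"), ("Female", "53706"), ("Female", "53706"),
]
MAX_LEVELS = (1, 2)  # assumed: S0..S1 and Z0..Z2

def generalize(row, node):
    sex, zipcode = row
    sex = sex if node[0] == 0 else "Person"
    return (sex, zipcode[:len(zipcode) - node[1]] + "*" * node[1])

def is_k_anonymous(rows, node, k):
    counts = Counter(generalize(row, node) for row in rows)
    return all(count >= k for count in counts.values())

def direct_generalizations(node):
    return [node[:i] + (lvl + 1,) + node[i + 1:]
            for i, lvl in enumerate(node) if lvl < MAX_LEVELS[i]]

def mark_upwards(node, marked):
    """Generalization property: everything above a k-anonymous node is k-anonymous."""
    for parent in direct_generalizations(node):
        if parent not in marked:
            marked.add(parent)
            mark_upwards(parent, marked)

def search(rows, k, roots):
    queue = [(sum(r), r) for r in roots]  # ordered by height
    heapq.heapify(queue)
    seen, marked, anonymous, checked = set(roots), set(), set(), []
    while queue:
        _, node = heapq.heappop(queue)
        if node in marked:
            continue  # known to be k-anonymous, no check needed
        checked.append(node)
        if is_k_anonymous(rows, node, k):
            anonymous.add(node)
            mark_upwards(node, marked)
        else:
            for parent in direct_generalizations(node):
                if parent not in seen:
                    seen.add(parent)
                    heapq.heappush(queue, (sum(parent), parent))
    return anonymous | marked, checked

k_anonymous_nodes, checked_nodes = search(ROWS, 2, [(0, 0)])
print(sorted(k_anonymous_nodes), checked_nodes)
```

In this run only four of the six lattice nodes are actually tested; the other two are accepted through marking.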

34
Breadth-First Search
35
Graph Generation
  • Graphs are implemented as two relational tables
    one for nodes, one for edges
  • Figure 7 Graph Representation

36
Graph Generation
  • Graph generation is done in three phases
  • Join phase: creates a superset of Ci based on
    Si-1
  • Prune phase: uses a hash tree structure to remove
    nodes with subsets not in Si-1
  • Edge generation phase: the direct multi-attribute
    generalization relationships among candidate
    nodes are selected
  • Figure 8. Graph generation
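
The join and prune phases resemble Apriori candidate generation, and can be sketched as below. This is an assumption-laden illustration: nodes are tuples of (attribute, level) pairs sorted by attribute name, a plain Python set replaces the hash tree mentioned on the slide, the edge generation phase is omitted, and the contents of `S2` are hypothetical.

```python
from itertools import combinations

def join_phase(s_prev):
    """Combine two nodes that agree on all but their last attribute."""
    nodes = sorted(s_prev)
    candidates = set()
    for p in nodes:
        for q in nodes:
            # Same prefix, last attributes differ and are in name order.
            if p[:-1] == q[:-1] and p[-1][0] < q[-1][0]:
                candidates.add(p + (q[-1],))
    return candidates

def prune_phase(candidates, s_prev):
    """Keep candidates whose every subset of size i-1 was k-anonymous."""
    return {
        c for c in candidates
        if all(sub in s_prev for sub in combinations(c, len(c) - 1))
    }

# Hypothetical S2: k-anonymous generalizations over pairs of B, S and Z.
S2 = {
    (("B", 1), ("S", 0)),
    (("B", 1), ("Z", 2)),
    (("S", 0), ("Z", 2)),
    (("B", 0), ("S", 1)),
    (("B", 0), ("Z", 2)),
}

joined = join_phase(S2)
C3 = prune_phase(joined, S2)
print(sorted(joined))
print(sorted(C3))
```

The candidate <B0, S1, Z2> is produced by the join but pruned, because its subset <S1, Z2> is not in `S2`.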

37
Algorithm Optimizations
  • Two different techniques were applied
  • Super-roots
  • Bottom-up Pre-computation

38
Super-roots
  • A candidate node n in Ci is a root if there is
    no generalization edge in Ei directed from
    another node in Ci to n
  • Because of pruning, a candidate graph can have
    several roots, and some of them come from the
    same family; the redundant database scans for
    such roots should be eliminated
  • Roots of the same family are generalizations of
    the same quasi-identifier subset
  • Super-roots Incognito scans the database once per
    family: it computes the frequency set of the
    family's common parent (the super-root), then
    derives the frequency set of each root from it

39
Super-roots example
  • In the following figure
  • <B1, S1, Z0>, <B1, S0, Z2>, <B0, S1, Z2>
  • are roots, but all of them come from the same
    family and share the parent <B0, S0, Z0>
  • This approach first calculates the frequency
    set of Patients with respect to the parent, then
    uses it to calculate the frequency set for
    each root
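
The super-root idea can be sketched as one scan for the common parent followed by a rollup per root. The rows and the hierarchies (B1 = "*", S1 = "Person", one masked Zipcode digit per level) are assumptions for illustration, as are the function names.

```python
from collections import Counter

ROWS = [  # hypothetical (Birthdate, Sex, Zipcode) rows
    ("1/21/76", "Male", "53715"), ("4/13/86", "Female", "53715"),
    ("2/28/76", "Male", "53703"), ("1/21/76", "Male", "53703"),
    ("4/13/86", "Female", "53706"), ("2/28/76", "Female", "53706"),
]

def generalize(values, levels):
    """Generalize a (possibly already generalized) value tuple to `levels`."""
    birthdate, sex, zipcode = values
    b_level, s_level, z_level = levels
    if b_level > 0:
        birthdate = "*"
    if s_level > 0:
        sex = "Person"
    zipcode = zipcode[:len(zipcode) - z_level] + "*" * z_level
    return (birthdate, sex, zipcode)

def scan(rows, levels):
    """Frequency set computed directly from the table (one full scan)."""
    return Counter(generalize(row, levels) for row in rows)

def rollup(freq, levels):
    """Frequency set derived from a less general frequency set."""
    rolled = Counter()
    for key, count in freq.items():
        rolled[generalize(key, levels)] += count
    return rolled

def super_root(roots):
    """Component-wise minimum of the roots' generalization levels."""
    return tuple(min(levels) for levels in zip(*roots))

ROOTS = [(1, 1, 0), (1, 0, 2), (0, 1, 2)]    # <B1,S1,Z0>, <B1,S0,Z2>, <B0,S1,Z2>
parent = super_root(ROOTS)                   # <B0,S0,Z0>
parent_freq = scan(ROWS, parent)             # the only scan of the table
root_freqs = {root: rollup(parent_freq, root) for root in ROOTS}
print(parent, dict(root_freqs[(1, 1, 0)]))
```

Three table scans (one per root) are replaced by a single scan plus three rollups over the much smaller frequency set.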

40
Bottom-up Pre-computation
  • The aim is to eliminate the need to scan T once
    for each subset of the quasi-identifier in order
    to generate the necessary frequency sets
  • Without it, the frequency set of T with respect
    to <Sex, Zipcode> has to be computed from T even
    though the frequency set of T with respect to
    <Zipcode> is known
  • Strategy
  • First generate the frequency sets of T with
    respect to all subsets of the quasi-identifier at
    the lowest level of generalization
  • Then use computed frequency sets in a bottom-up
    aggregation manner to calculate more generalized
    frequency sets
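
The first step of this strategy can be sketched as a single scan for the full quasi-identifier followed by projections, in the spirit of a data cube. The rows and the names `project` and `precompute` are assumptions for illustration; the real implementation described in the slides runs inside the database.

```python
from collections import Counter
from itertools import combinations

ATTRS = ("Birthdate", "Sex", "Zipcode")
ROWS = [  # hypothetical rows at the lowest level of generalization
    ("1/21/76", "Male", "53715"), ("4/13/86", "Female", "53715"),
    ("2/28/76", "Male", "53703"), ("1/21/76", "Male", "53703"),
    ("4/13/86", "Female", "53706"), ("2/28/76", "Female", "53706"),
]

def project(freq, attrs_from, attrs_to):
    """Drop attributes from a frequency set by summing the merged counts."""
    idx = [attrs_from.index(a) for a in attrs_to]
    projected = Counter()
    for key, count in freq.items():
        projected[tuple(key[i] for i in idx)] += count
    return projected

def precompute(rows, attrs):
    """Frequency sets for all non-empty attribute subsets from one table scan."""
    cube = {attrs: Counter(rows)}  # the only pass over the rows
    for size in range(len(attrs) - 1, 0, -1):
        for subset in combinations(attrs, size):
            # Derive from any already computed superset one attribute larger.
            parent = next(p for p in list(cube)
                          if len(p) == size + 1 and set(subset) <= set(p))
            cube[subset] = project(cube[parent], parent, subset)
    return cube

cube = precompute(ROWS, ATTRS)
print(dict(cube[("Sex", "Zipcode")]))
```

More generalized frequency sets would then be obtained from these by rolling up along each hierarchy, as in the second step of the strategy.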

42
Performance Analysis
  • Real-world data is used
  • Basic Incognito, Super-roots Incognito, and
    Cube Incognito are implemented
  • Results are compared with Samarati's binary
    search, bottom-up search (with rollup), and
    bottom-up search (without rollup)
  • The Incognito algorithms uniformly outperformed
    the previous algorithms

43
Experimental Data
  • Previous algorithms were tested on small
    databases
  • Full-domain k-anonymity was tested on 256
    records
  • Binary search was not evaluated experimentally
  • A genetic algorithm ran on larger databases but
    did not guarantee minimality
  • Two databases are used in the experiments
  • Nine attributes, all of which are elements of
    the quasi-identifier set
  • The first database table contained 45,222 records
  • The second was even larger, with 4,591,581
    records and 268 MB
  • Implementations are written in Java on IBM DB2
  • An AMD Athlon 1.5 GHz machine with 2 GB of
    physical memory is used

44
Experimental Results
  • Even Incognito is an exponential algorithm (in
    the size of the quasi-identifier)
  • The rollup and a priori pruning optimizations
    provide linear speedups
  • Test results show that Incognito outperforms the
    bottom-up approach by a linear factor
  • Incognito finds all possible k-anonymous
    generalizations
  • Bottom-up search finds only one

45
Effects of Rollup
  • Bottom-up search is re-implemented, this time
    considering the rollup property

46
Effects of Pruning
  • Pruning substantially decreased the number of
    nodes to be inspected
  • For small quasi-identifier sets, the node counts
    with and without pruning are close
  • For large sets, however, a substantial number of
    nodes are eliminated

47
Effects of Super-Roots
  • Since many frequency sets are calculated from
    other frequency sets, access to the original data
    is substantially reduced. This reduced the
    runtime of the entire algorithm.
  • E.g. by creating a single super-root frequency
    set, 4-5 scans of the entire data are eliminated

49
Taxonomy of K-Anonymization Models
  • Existing k-anonymization techniques can be
    categorized according to 3 main criteria
  • Generalization vs. Suppression
  • Generalization replaces values with less specific
    ones through intermediate levels; suppression
    removes them outright
  • Global vs. Local Recoding
  • Global recoding maps the values in the domains to
    modified values
  • Local recoding modifies individual instances of
    data items
  • Hierarchy-Based vs. Partition-Based
  • Hierarchy-based generalization uses fixed
    generalization hierarchies
  • Partition-based generalization partitions an
    ordered domain into disjoint ranges

51
Related Work & Conclusions
  • µ-Argus system
  • Considered attribute combinations of a limited
    size
  • Results were not always guaranteed to be
    k-anonymous
  • Binary search algorithm
  • Discovers a single minimal full-domain
    generalization
  • Datafly: full-domain generalization
  • Results are k-anonymous, but minimality is not
    guaranteed

52
Conclusions
  • The multi-dimensional data model is a simple and
    clear way to describe full-domain generalization
  • Two key ideas for k-anonymization provided good
    results, namely
  • Bottom-up aggregation
  • A priori computation
  • Together they make full-domain generalization on
    large databases feasible in reasonable time

53
  • Thanks for your attention

QUESTIONS??