Incognito: Efficient Full-Domain K-Anonymity

Provided by: cmpeBo

1
Incognito: Efficient Full-Domain K-Anonymity

Presenter
Melih Çelik

2
Agenda

Introduction
Full-Domain Generalization
Incognito
Basic Incognito
Algorithm Optimizations
Performance Analysis
Taxonomy of k-anonymization models
Related Work and Conclusions

4
Introduction

Data published in public environments generally
has identifying attributes removed,
e.g. Name, Social Security Number
The remaining attributes can sometimes be used to
re-identify individuals in the published data
by joining multiple public data sets to obtain
unambiguous information

5
Paper Overview

This paper provides a framework for one
k-anonymization model, called full-domain
generalization
What to expect from this paper
A set of algorithms for producing minimal
full-domain generalizations
A taxonomy of k-anonymization models

6
K-Anonymization

K-anonymization is a technique that prevents
joining attacks by
Generalizing
or
Suppressing
portions of released microdata so that no
individual can be uniquely distinguished within a
group of size k. That is to say, each record is
indistinguishable from at least k-1 other records

7
Sample Database

Hospital Patient Database

8
Quasi-Identifier Attribute Set

Quasi-Identifier Attribute Set: A
quasi-identifier attribute set Q is a minimal set
of attributes in table T that can be joined with
external information to re-identify individual
records (with sufficiently high probability)
That is, the smallest set of columns in a table that
can be used to nearly uniquely identify records
The elements of a quasi-identifier attribute set are
assumed to be known based on specific knowledge
of the domain

9
Frequency Set

Frequency Set
Given a relation T and
a set of attributes Q of size n,
the frequency set of T with respect to Q is a
mapping from each unique combination of values
<q0, ..., qn> of Q to the total number of
tuples in T with those values
It can be seen as the counts returned by a
COUNT(*) query with the attributes of Q
in the GROUP BY clause
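
The frequency set definition above can be sketched in a few lines of Python. This is only an illustration: the `PATIENTS` rows and the helper name `frequency_set` are made up for the example, since the transcript does not reproduce the Patients table.

```python
from collections import Counter

# Hypothetical rows standing in for the Patients table (not from the slides).
PATIENTS = [
    {"Sex": "Male", "Zipcode": "53715"},
    {"Sex": "Female", "Zipcode": "53715"},
    {"Sex": "Male", "Zipcode": "53703"},
    {"Sex": "Male", "Zipcode": "53703"},
    {"Sex": "Female", "Zipcode": "53706"},
    {"Sex": "Female", "Zipcode": "53706"},
]


def frequency_set(rows, attributes):
    """Map each distinct combination of values of `attributes` to the
    number of rows carrying that combination (GROUP BY with COUNT(*))."""
    return Counter(tuple(row[a] for a in attributes) for row in rows)


freq = frequency_set(PATIENTS, ["Sex", "Zipcode"])
for values, count in sorted(freq.items()):
    print(values, count)
```

Each key of `freq` plays the role of one group of the GROUP BY query, and its value is that group's count.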

10
K-Anonymity Property

K-Anonymity Property: A relation T is said to be
k-anonymous with respect to attribute set Q if
every count in the frequency set of T with respect
to Q is greater than or equal to k
In SQL terms, a COUNT(*) query is formed with the
attributes of Q in its GROUP BY clause
Each and every count returned by this
query has to be greater than or equal to k

11
K-Anonymity Property (Cont'd)

Sample k-anonymity check on the Patients database
k = 2 (2-anonymity)
Q = {Zipcode, Sex}
SELECT COUNT(*) FROM Patients GROUP BY Zipcode,
Sex
Results show that Patients
is not 2-anonymous with
respect to set Q since the results
contain values smaller than 2
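
A minimal sketch of the same check in Python, assuming made-up (Zipcode, Sex) pairs in place of the real Patients rows. The helper name `is_k_anonymous` is hypothetical.

```python
from collections import Counter

# Hypothetical (Zipcode, Sex) pairs; the real Patients rows are not in the transcript.
PATIENTS = [
    ("53715", "Male"), ("53715", "Female"), ("53703", "Male"),
    ("53703", "Male"), ("53706", "Female"), ("53706", "Female"),
]


def is_k_anonymous(groups, k):
    """True when every group in the frequency set has at least k rows."""
    counts = Counter(groups)
    return all(count >= k for count in counts.values())


# Q = {Zipcode, Sex}: some groups contain a single row.
raw_ok = is_k_anonymous(PATIENTS, k=2)

# Same rows with Zipcode generalized by masking its last two digits.
generalized = [(zipcode[:3] + "**", sex) for zipcode, sex in PATIENTS]
generalized_ok = is_k_anonymous(generalized, k=2)

print(raw_ok, generalized_ok)
```

With these sample rows the raw table fails the check, while the version with generalized Zipcode passes it.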

12
K-Anonymization

K-Anonymization: Obtaining a view V of a relation
T such that the view modifies, distorts, or
suppresses the data of T according to some
mechanism so that V satisfies the k-anonymity
property with respect to the set of
quasi-identifier attributes.
T and V should consist of multiple attributes
Relation T is k-anonymized to view V in order to
prevent the re-identification of records through
the quasi-identifier attributes of T

14
Domain Generalization

In a relational database, there is a domain
(integer, date, ..) related to each attribute
Constructing a more general domain from
existing domains is called
Domain Generalization
e.g. Generalizing Zipcode domain by ignoring
the least significant digit
Domain generalization can be achieved in several
ways

15
Domain Generalization Relationship

A domain generalization relationship is defined
as Di ≤D Dj to denote that domain Dj is either
identical to, or a domain generalization of, Di
Values in domain Dj are thus generalizations of
values in domain Di. This property results in a
many-to-one relationship between original domain
values and derived domain values
γ: Di → Dj is the function that captures this
many-to-one relationship; it is called the value
generalization function

16
Domain Generalization Relationship (Cont'd)

If there is an edge from Di to Dj, Dj is called
the direct generalization of Di
Domain generalization relationship is transitive:
if Di ≤D Dj and Dj ≤D Dk then Di ≤D Dk.
The transitivity property leads to another
definition: Domain Generalization Hierarchy

17
Domain Generalization Hierarchy

Domain Generalization Hierarchy is defined to be
a set of domains that is totally ordered by the
domain generalization relationship ≤D
A Domain Generalization Hierarchy can be thought
of as the nodes in a chain of direct generalizations
Edges: direct generalizations
Paths: implied generalizations
The domain generalization hierarchies of the Patients
relation are shown in the following slide

18
Domain Generalization Hierarchy (Cont'd)
Figure 2: Domain and value generalization
hierarchies (panels a to f)
19
Domain Generalization Hierarchy (Cont'd)

Recall that γ: Di → Dj is called the value
generalization function
γ+ is used as a shorthand for the composition of
one or more value generalization functions,
producing direct and implied value
generalizations
These functions form a value-level tree in which
Edges are defined by γ
Paths are defined by γ+
For example, in Figure 2-b
5371* = γ(53715) and 537** ∈ γ+(53715)
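
The Zipcode example can be sketched as code. This is an illustration under stated assumptions: the function names `gamma` and `gamma_plus` are hypothetical, and the hierarchy is assumed to have three levels (full Zipcode, last digit masked, last two digits masked).

```python
def gamma(zipcode):
    """Direct value generalization: mask the rightmost visible digit.

    Assumes a three-level hierarchy Z0 -> Z1 -> Z2, so values that already
    show only three digits (e.g. "537**") are at the top and stay unchanged.
    """
    visible = zipcode.rstrip("*")
    if len(visible) <= 3:
        return zipcode
    return visible[:-1] + "*" * (len(zipcode) - len(visible) + 1)


def gamma_plus(zipcode, steps):
    """Compose gamma with itself `steps` times (steps >= 1)."""
    for _ in range(steps):
        zipcode = gamma(zipcode)
    return zipcode


print(gamma("53715"), gamma_plus("53715", 2))
```

One application of `gamma` follows an edge of the value-level tree; repeated application via `gamma_plus` follows a path.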

20
Domain Generalization (Cont'd)

The domain generalization hierarchies of multiple
attributes, each with its own domain, combine into a
multi-attribute generalization lattice (Fig. 3)
n single-attribute hierarchies form a complete lattice of
n-vectors of domains with the following properties
Each edge is a direct multi-attribute domain
generalization relationship
The bottom element is the source of the hierarchy
chain and has the most specific domains
The top element is the sink of the chain and has
the most general domains

21
Domain Generalization (Cont'd)
Figure 3. Generalization lattice for the Zipcode
and Sex attributes and corresponding lattice of
distance vectors

The height of a multi-attribute generalization is
defined as the sum of the values in the corresponding
distance vector
The height value will be used for finding the minimal
full-domain generalization
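
The lattice of distance vectors and the height measure can be sketched as follows. The attribute order (Sex, Zipcode) and the hierarchy heights (S0..S1, Z0..Z2) are assumptions made for the example, and all function names are hypothetical.

```python
from itertools import product

# Assumed top levels: Sex has S0..S1, Zipcode has Z0..Z2.
MAX_LEVELS = (1, 2)


def lattice_nodes(max_levels):
    """All distance vectors; max_levels[i] is the top level of attribute i."""
    return list(product(*(range(m + 1) for m in max_levels)))


def height(node):
    """Height of a generalization: the sum of its distance vector."""
    return sum(node)


def direct_generalizations(node, max_levels):
    """Nodes reached by raising exactly one attribute by one level."""
    return [node[:i] + (level + 1,) + node[i + 1:]
            for i, level in enumerate(node) if level < max_levels[i]]


nodes = lattice_nodes(MAX_LEVELS)
print(len(nodes), height((1, 2)), direct_generalizations((0, 0), MAX_LEVELS))
```

The bottom element is the all-zero vector and the top element is `MAX_LEVELS` itself, whose height here is 3.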

22
Full-Domain Generalization

Full-domain Generalization maps the entire domain
of each quasi-identifier attribute in T to a more
general domain in its domain generalization
hierarchy
This scheme guarantees that all values of a
particular attribute in V belong to the same
domain
In order to prevent a k-anonymization from
generalizing domains more than necessary, a notion
of minimality is introduced as follows
V should be k-anonymous
The height of the resulting generalization is less
than or equal to that of any other k-anonymous
full-domain generalization (a search mechanism
is needed)

23
Full-Domain Generalization Algorithms

In order to guarantee the minimality of the
anonymization, a search technique should be used
Binary search
Breadth-first search: start with the least
general domain at the root and check whether each
generalization satisfies k-anonymity
This paper refines the BFS using a bottom-up
aggregation (rollup) along the domain hierarchies
The frequency set that determines the
k-anonymity of a node is computed from the
frequency set of a node that it generalizes
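
As a rough illustration of the plain bottom-up search (without the rollup refinement), the sketch below walks the lattice level by level and stops at the first height that contains a k-anonymous node. The rows, the hierarchy heights, the "Person" value for generalized Sex, and all function names are assumptions for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical (Sex, Zipcode) rows; not taken from the slides.
ROWS = [("Male", "53715"), ("Female", "53715"), ("Male", "53703"),
        ("Male", "53703"), ("Female", "53706"), ("Female", "53706")]
MAX_LEVELS = (1, 2)  # assumed hierarchy heights: S0..S1, Z0..Z2


def generalize(row, node):
    """Apply the generalization levels in `node` to one row."""
    sex, zipcode = row
    sex_level, zip_level = node
    if sex_level == 1:
        sex = "Person"
    if zip_level:
        zipcode = zipcode[:-zip_level] + "*" * zip_level
    return sex, zipcode


def is_k_anonymous(node, k):
    counts = Counter(generalize(row, node) for row in ROWS)
    return min(counts.values()) >= k


def minimal_generalizations(k):
    """Return the k-anonymous nodes of the lowest height that has any."""
    by_height = {}
    for node in product(*(range(m + 1) for m in MAX_LEVELS)):
        by_height.setdefault(sum(node), []).append(node)
    for h in sorted(by_height):
        hits = [node for node in by_height[h] if is_k_anonymous(node, k)]
        if hits:
            return hits
    return []


print(minimal_generalizations(2), minimal_generalizations(3))
```

This version rescans `ROWS` for every node it visits, which is exactly the cost that the rollup refinement is meant to avoid.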

25
Incognito

Three important techniques are combined
Generalization framework of Samarati and Sweeney
Managing multi-dimensional data
Mining association rules
Since the COUNT measure is an important aspect, each
domain generalization hierarchy can be thought of as
a dimension
With the dimension idea introduced, a relation T and
the generalization dimensions of its
quasi-identifier attributes form a relational
star-schema

26
Incognito (Cont'd)

Figure 4. Star schema including generalization
dimensions for quasi-identifier attributes

27
Incognito (Cont'd)

A full-domain k-anonymization is obtained in two
steps
Joining the relation T with its dimension tables
Projecting the appropriate domain attributes
Two key properties of dimension generalization
are
Generalization Property
Rollup Property

28
Generalization Property

Assume two sets of attributes P and Q in a
relation T such that DP ≤D DQ
(the domain of Q is generalized from the domain of P)
If T is k-anonymous with respect to P, then it is
also k-anonymous with respect to Q
This results from the fact that if Q is
generalized from P, then each group of Q is a
union of one or more groups of P
Each group of P contains k or more tuples
Each group of Q contains the same number or more
29
Rollup Property

Assume two sets of attributes P and Q in a
relation T such that DP ≤D DQ
If f1, the frequency set of T with respect to P,
is known, then each count in f2, the frequency
set of T with respect to Q, can be generated by
summing the counts in f1 that the
generalization function maps to that value set of f2

30
Rollup Property (Cont'd)

Assume P is <B, S, Z0> and Q is
<B, S, Z1>
Frequency set of P is calculated
by a COUNT(*) query with
Birthdate, Sex and Zipcode attributes
in the GROUP BY clause
Frequency set of Q is calculated
by summing the counts in the frequency set of P,
with Birthdate, Sex and Z1
in the GROUP BY clause
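
The rollup step can be sketched as a sum over groups that generalize to the same key. The counts in `f1` are invented for the example, and `rollup` and `zip_one_level` are hypothetical names; only the idea of summing counts instead of rescanning the table comes from the slides.

```python
from collections import Counter


def rollup(freq, generalize):
    """Derive a more general frequency set by summing the counts of all
    groups that map to the same generalized key."""
    rolled = Counter()
    for key, count in freq.items():
        rolled[generalize(key)] += count
    return rolled


def zip_one_level(key):
    """Generalize Zipcode from Z0 to Z1 by masking its last digit."""
    birthdate, sex, zipcode = key
    return birthdate, sex, zipcode[:-1] + "*"


# f1: made-up frequency set with respect to <Birthdate, Sex, Z0>.
f1 = Counter({
    ("1/21/76", "Male", "53715"): 2,
    ("1/21/76", "Male", "53710"): 1,
    ("4/13/86", "Female", "53706"): 3,
    ("4/13/86", "Female", "53703"): 1,
})

# f2: frequency set with respect to <Birthdate, Sex, Z1>, without touching T.
f2 = rollup(f1, zip_one_level)
print(dict(f2))
```

The total number of tuples is preserved by the rollup; only the grouping becomes coarser.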

31
Incognito (Cont'd)

One more property is derived from the observation
of a dynamic-programming approach for mining
frequent itemsets
Subset Property: Let Q be a set of attributes in
relation T. If T is k-anonymous with respect to
Q, then it is k-anonymous with respect to any set of
attributes P such that P is a subset of Q
Set P contains either the same or fewer
attributes to be used for grouping
Fewer attributes for grouping means that the
groups either stay the same or merge with other
groups, which increases their sizes
k-anonymity is satisfied in each case

32
Basic Incognito Algorithm

Generates the set of all possible k-anonymous
full-domain generalizations of T
In iteration i, a graph of candidate multi-attribute
generalizations is constructed from subsets of the
quasi-identifier of size i. The set of nodes is called
Ci.
The set of direct multi-attribute generalization
relationships connecting these nodes is denoted
Ei.
Each iteration consists of two main parts
A modified breadth-first search over the graph
produces set Si. This set contains the k-anonymous
generalizations of size i.
After obtaining Si, the algorithm constructs the
set of candidate nodes of size i+1 (Ci+1) and the
edges connecting them (Ei+1)

33
Breadth-First Search

At the ith iteration, a search determines the
k-anonymity of table T with respect to each
candidate in Ci
Search starts with the roots, i.e. nodes that are
not direct generalizations of any other node
Rollup property provides optimization in the
bottom-up aggregation
Generalization property is used to mark
nodes as k-anonymous if a node they directly
generalize is found to be k-anonymous
Fewer calculations are needed in later iterations
34
Breadth-First Search
35
Graph Generation

Graphs are implemented as two relational tables:
one for nodes, one for edges
Figure 7: Graph Representation

36
Graph Generation

Graph generation is done in three phases
Join phase: creates a superset of Ci based on
Si-1
Prune phase: uses a hash tree structure to remove
nodes with subsets not in Si-1
Edge generation phase: direct multi-attribute
generalization relationships among candidate
nodes are selected
Figure 8: Graph Generation
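
The join and prune phases resemble a priori candidate generation, and can be sketched as below. This is an assumption-laden illustration: nodes are represented as tuples of (attribute, level) pairs sorted by attribute name, a plain Python set replaces the hash tree, and edge generation is left out. The function name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(survivors):
    """Build size-(i+1) candidate nodes from the size-i survivors.

    Each node is a tuple of (attribute, level) pairs sorted by attribute.
    """
    # Join phase: combine nodes that agree on all but their last pair.
    joined = {p + (q[-1],)
              for p in survivors for q in survivors
              if p[:-1] == q[:-1] and p[-1][0] < q[-1][0]}
    # Prune phase: every size-i subset of a candidate must be a survivor.
    return {c for c in joined
            if all(sub in survivors for sub in combinations(c, len(c) - 1))}


# Made-up set of 2-attribute survivors.
s2 = {
    (("B", 1), ("S", 0)),
    (("B", 1), ("S", 1)),
    (("B", 1), ("Z", 2)),
    (("S", 0), ("Z", 2)),
}
c3 = generate_candidates(s2)
print(c3)
```

Here the join produces two candidates, and the prune phase drops the one whose subset `(("S", 1), ("Z", 2))` is not among the survivors.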

37
Algorithm Optimizations

Two different techniques were applied
Super-roots
Bottom-up Pre-computation

38
Super-roots

A candidate node n in Ci is a root if there is
no generalization edge in Ei directed from
another node in Ci to n
Although the pruning phase eliminates some nodes,
several roots may come from the same family,
and the redundant scans for these have to be eliminated
The term "same family" means that the roots are
generalizations of the same quasi-identifier subset
Super-roots Incognito scans the database and
calculates the frequency sets of the roots that
come from the same family by first computing the
frequency set of their common parent

39
Super-roots example

In the following figure
<B1, S1, Z0>, <B1, S0, Z2>, <B0, S1, Z2>
are roots but all of them come from the same
family <B0, S0, Z0>
This approach will first calculate the frequency
set of Patients with respect to the parent, then
use this value to calculate the frequency set for
each root
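
In the example above, the common parent is the component-wise minimum of the roots' generalization levels. A tiny sketch under that reading (the function name `super_root` and the tuple encoding of nodes are assumptions):

```python
def super_root(roots):
    """Component-wise minimum of the roots' generalization levels."""
    return tuple(min(levels) for levels in zip(*roots))


# <B1, S1, Z0>, <B1, S0, Z2>, <B0, S1, Z2> encoded as (B, S, Z) levels.
roots = [(1, 1, 0), (1, 0, 2), (0, 1, 2)]
parent = super_root(roots)
print(parent)
```

Only the frequency set of `parent` would need a scan of the table; each root's frequency set could then be derived from it by summing counts.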

40
Bottom-up Pre-computation

The aim is to eliminate the necessity of scanning T once
for each subset of the quasi-identifier in order to
generate the necessary frequency sets
Frequency set of T with respect to <Sex, Zipcode>
has to be recalculated even though frequency set
of T with respect to <Zipcode> is known
Strategy
First generate the frequency sets of T with
respect to all subsets of the quasi-identifier at
the lowest level of generalization
Then use the computed frequency sets in a bottom-up
aggregation manner to calculate more generalized
frequency sets
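
The first step of the strategy can be sketched as follows: scan the table once for the full quasi-identifier, then derive the frequency set of every smaller attribute subset by summing counts. This is a simplification, not necessarily how the paper organizes the computation; the rows and the name `precompute` are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows standing in for the Patients table.
PATIENTS = [
    {"Birthdate": "1/21/76", "Sex": "Male", "Zipcode": "53715"},
    {"Birthdate": "4/13/86", "Sex": "Female", "Zipcode": "53715"},
    {"Birthdate": "2/28/76", "Sex": "Male", "Zipcode": "53703"},
    {"Birthdate": "1/21/76", "Sex": "Male", "Zipcode": "53703"},
    {"Birthdate": "4/13/86", "Sex": "Female", "Zipcode": "53706"},
    {"Birthdate": "2/28/76", "Sex": "Female", "Zipcode": "53706"},
]


def precompute(rows, attributes):
    """Frequency sets for every non-empty attribute subset, one table scan."""
    # The only pass over the rows: the full quasi-identifier.
    full = Counter(tuple(row[a] for a in attributes) for row in rows)
    cube = {tuple(attributes): full}
    for size in range(len(attributes) - 1, 0, -1):
        for subset in combinations(range(len(attributes)), size):
            counts = Counter()
            for key, count in full.items():
                counts[tuple(key[i] for i in subset)] += count
            cube[tuple(attributes[i] for i in subset)] = counts
    return cube


cube = precompute(PATIENTS, ["Birthdate", "Sex", "Zipcode"])
print(cube[("Sex", "Zipcode")])
```

Each entry of `cube` is a lowest-level frequency set that later rollups could start from instead of going back to the table.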

42
Performance Analysis

Real-world data is used
Basic Incognito, Super-roots Incognito, and
Cube-Incognito are implemented
Results are compared with Samarati's Binary
Search, Bottom-up search (with rollup), and
Bottom-up search (without rollup)
Incognito algorithms uniformly outperformed the
previous algorithms

43
Experimental Data

Previous algorithms were tested on small
databases
Full-domain k-anonymity was tested on 256
records
Binary search was not evaluated experimentally
Genetic algorithm ran on larger databases but did
not guarantee minimality
Two databases are used in the experiments
Nine attributes, all of which are elements of the
quasi-identifier set
The first database table contained 45,222 records
The second was even larger, with 4,591,581 records and
268 MB
Implementations are made using Java and IBM DB2
An AMD Athlon 1.5 GHz machine with 2 GB physical memory is used

44
Experimental Results

Even Incognito is an exponential algorithm
Rollup and a priori pruning optimizations provide
linear speedups
Test results show that Incognito linearly
outperforms the bottom-up approach
Incognito finds all possible k-anonymous
generalizations
Bottom-up search finds only one!

45
Effects of Rollup

Bottom-up search is re-implemented, this time
considering the rollup property

46
Effects of Pruning

Pruning substantially decreased the number of
nodes to be inspected
For small quasi-identifier sizes, the node counts
with and without pruning are close
For large sizes, however, a significant number of
nodes are eliminated

47
Effects of Super-Roots

Since many frequency sets are calculated from
other frequency sets, access to the original data
has substantially decreased. This reduced the
runtime of the entire algorithm.
E.g. by creating a single super-root frequency
set, 4-5 scans of the entire data are eliminated

49
Taxonomy of K-Anonymization Models

Existing k-anonymization techniques can be
categorized according to 3 main criteria
Generalization vs. Suppression
Generalizing values to intermediate levels vs.
suppressing them altogether
Global vs. Local Recoding
Global recoding maps the values in the domains to
modified values
Local recoding modifies individual instances of
data items
Hierarchy-Based vs. Partition-Based
Hierarchy-based uses fixed generalization
hierarchies
Partition-based generalizes by partitioning the
domain into disjoint ranges

51
Related Work and Conclusions

µ-Argus system
Considered attribute combinations of a limited
size
Results were not always guaranteed to be k-anonymous
Binary Search Algorithm
Discovers a single minimal full-domain
generalization
Datafly: full-domain generalization
Results are k-anonymous, but minimality is not
guaranteed

52
Conclusions

The multi-dimensional data model is a simple and
clear way to describe full-domain generalization
Two key ideas for k-anonymization provided good
results, namely
Bottom-up aggregation
A priori computation
These made full-domain generalization on
large databases possible in feasible times