Title: Incognito: Efficient Full-Domain K-Anonymity
1Incognito: Efficient Full-Domain K-Anonymity
2Agenda
- Introduction
- Full-Domain Generalization
- Incognito
- Basic Incognito
- Algorithm Optimizations
- Performance Analysis
- Taxonomy of k-anonymization models
- Related Work & Conclusions
4Introduction
- Data published in public environments generally has some attributes removed
  - e.g. Name, Social Security Number
- The remaining attributes can sometimes be used to remove the anonymity of the published data
  - Joining multiple public data sets to obtain unambiguous information
5Paper Overview
- This paper provides a framework for one model of k-anonymization, called full-domain generalization
- What to expect from this paper
  - A set of algorithms for producing minimal full-domain generalizations
  - A taxonomy of k-anonymization models
6K-Anonymization
- K-anonymization is a technique that prevents joining attacks by generalizing or suppressing portions of the released microdata, so that no individual can be uniquely distinguished within a group of size k
- That is to say, each released record is indistinguishable from at least k-1 other records
7Sample Database
- Hospital Patient Database
8Quasi-Identifier Attribute Set
- Quasi-Identifier Attribute Set: a quasi-identifier attribute set Q is a minimal set of attributes in table T that can be joined with external information to re-identify individual records (with sufficiently high probability)
- In other words, the minimum set of columns in a table that can be used to nearly uniquely identify records
- The elements of a quasi-identifier attribute set are assumed to be known, based on specific knowledge of the domain
9Frequency Set
- Frequency Set: given a relation T and a set of attributes Q of T, the frequency set of T with respect to Q is a mapping from each unique combination of values <q0, ..., qn> of Q to the total number of tuples in T with those values
- It can be seen as the result of a COUNT(*) query over T with the attributes of Q in the GROUP BY clause: one count per value group
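The frequency-set definition can be sketched in a few lines of Python. This is a minimal illustration only: the `frequency_set` helper and the toy `patients` rows are assumptions made for the example, not the table shown in the slides.

```python
from collections import Counter

def frequency_set(rows, attrs):
    """Map each combination of values of `attrs` to the number of rows holding it."""
    return Counter(tuple(row[a] for a in attrs) for row in rows)

# Hypothetical toy rows, not the Patients table from the slides.
patients = [
    {"Sex": "Male", "Zipcode": "53715"},
    {"Sex": "Female", "Zipcode": "53715"},
    {"Sex": "Male", "Zipcode": "53703"},
    {"Sex": "Male", "Zipcode": "53703"},
    {"Sex": "Female", "Zipcode": "53706"},
    {"Sex": "Female", "Zipcode": "53706"},
]

freq = frequency_set(patients, ["Sex", "Zipcode"])
for value_group, count in sorted(freq.items()):
    print(value_group, count)
```

Each key plays the role of one GROUP BY group and each value the role of its COUNT(*).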
10K-Anonymity Property
- K-Anonymity Property: a relation T is said to be k-anonymous with respect to attribute set Q if every count in the frequency set of T with respect to Q is greater than or equal to k
- The check is a COUNT(*) query over T with the attributes of Q in the GROUP BY clause
- Each and every count returned by this query has to be greater than or equal to k
11K-Anonymity Property (Contd)
- Sample k-anonymity check on the patient database
- k = 2 (2-anonymity)
- Q = {Zipcode, Sex}
- SELECT COUNT(*) FROM Patients GROUP BY Zipcode, Sex
- Results show that Patients is not 2-anonymous with respect to Q, since the results contain counts smaller than 2
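The same check can be sketched outside SQL. The snippet below is an illustration under stated assumptions: `is_k_anonymous` is a hypothetical helper, and the rows are toy data chosen so that one (Zipcode, Sex) group has a single member.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """True if every value group of the quasi-identifier holds at least k rows."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())

# Hypothetical toy rows, not the Patients table from the slides.
patients = [
    {"Zipcode": "53715", "Sex": "Male"},
    {"Zipcode": "53715", "Sex": "Female"},
    {"Zipcode": "53703", "Sex": "Male"},
    {"Zipcode": "53703", "Sex": "Male"},
    {"Zipcode": "53706", "Sex": "Female"},
    {"Zipcode": "53706", "Sex": "Female"},
]

print(is_k_anonymous(patients, ["Zipcode", "Sex"], 2))  # False: (53715, Male) occurs once
print(is_k_anonymous(patients, ["Sex"], 2))             # True: three rows per value
```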
12K-Anonymization
- K-Anonymization: a view V of a relation T is a k-anonymization of T if the view modifies, distorts, or suppresses the data of T according to some mechanism such that V satisfies the k-anonymity property with respect to the set of quasi-identifier attributes
- T and V are assumed to be multisets of tuples
- Relation T is k-anonymized to view V in order to prevent the re-identification of records through the quasi-identifier attributes of T
14Domain Generalization
- In a relational database, there is a domain (integer, date, ...) associated with each attribute
- Constructing a more general domain from an existing domain is called domain generalization
  - e.g. generalizing the Zipcode domain by ignoring the least significant digit
- Domain generalization can be achieved in several ways
15Domain Generalization Relationship
- A domain generalization relationship is written Di ≤D Dj to denote that domain Dj is either identical to, or a domain generalization of, Di
- Values in domain Dj are thus generalizations of values in domain Di. This results in a many-to-one relationship between original domain values and derived domain values
- γ: Di → Dj is the function that captures this many-to-one relationship; it is called the value generalization function
16Domain Generalization Relationship (Contd)
- If there is an edge from Di to Dj, Dj is called the direct generalization of Di
- The domain generalization relationship is transitive
  - If Di ≤D Dj and Dj ≤D Dk, then Di ≤D Dk
- Transitivity leads to another definition: the Domain Generalization Hierarchy
17Domain Generalization Hierarchy
- A Domain Generalization Hierarchy is defined to be a set of domains that is totally ordered by the domain generalization relationship ≤D
- It can be thought of as a chain of nodes connected by direct generalizations
  - Edges: direct generalizations
  - Paths: implied generalizations
- The domain generalization hierarchies of the Patients relation are shown on the following slide
18Domain Generalization Hierarchy (Contd)
Figure 2. Domain and value generalization hierarchies, panels (a) to (f)
19Domain Generalization Hierarchy (Contd)
- Recall that γ: Di → Dj is called the value generalization function
- γ+ is used as shorthand for the composition of one or more value generalization functions, producing direct and implied value generalizations
- These functions form a value-level tree in which
  - Edges are defined by γ
  - Paths are defined by γ+
- For example, in Figure 2(b)
  - 5371* = γ(53715) and 537** ∈ γ+(53715)
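As a rough sketch of γ and γ+ for the Zipcode example, the functions below mask one digit per generalization step. The names `gamma` and `gamma_plus`, the `*` masking convention, and the two-step depth of the hierarchy are assumptions made for illustration.

```python
def gamma(zipcode):
    """Direct value generalization: mask the rightmost unmasked digit."""
    digits = zipcode.rstrip("*")
    if not digits:
        return zipcode  # already fully masked
    masked = len(zipcode) - len(digits) + 1
    return digits[:-1] + "*" * masked

def gamma_plus(zipcode, max_steps=2):
    """Direct and implied generalizations: gamma composed 1..max_steps times."""
    results = []
    current = zipcode
    for _ in range(max_steps):
        current = gamma(current)
        results.append(current)
    return results

print(gamma("53715"))       # 5371*
print(gamma_plus("53715"))  # ['5371*', '537**']
```

Many base values map to the same generalized value, which is the many-to-one relationship described earlier: `gamma("53715")` and `gamma("53710")` both give `5371*`.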
20Domain Generalization (Contd)
- Domain generalization hierarchies of multiple attributes, each with a different domain, combine into a multi-attribute generalization lattice (Fig. 3)
- n single-attribute hierarchies form a complete lattice of n-vectors of domains with the following properties
  - Each edge is a direct multi-attribute domain generalization relationship
  - The bottom element is the source of the hierarchy chains and has the most specific domains
  - The top element is the sink of the chains and has the most general domains
21Domain Generalization (Contd)
Figure 3. Generalization lattice for the Zipcode and Sex attributes and corresponding lattice of distance vectors
- The height of a multi-attribute generalization is defined as the sum of the values in the corresponding distance vector
- The height is used for finding the minimal full-domain generalization
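The lattice of distance vectors and the height measure can be sketched as follows. The function names and the attribute order (Sex with levels S0..S1, Zipcode with levels Z0..Z2) are assumptions chosen for the example.

```python
from itertools import product

def build_lattice(max_levels):
    """Nodes are distance vectors; each edge raises exactly one attribute by one level."""
    nodes = list(product(*(range(m + 1) for m in max_levels)))
    edges = [
        (n, n[:i] + (n[i] + 1,) + n[i + 1:])
        for n in nodes
        for i in range(len(n))
        if n[i] < max_levels[i]
    ]
    return nodes, edges

def height(node):
    """Height of a generalization: the sum of its distance vector."""
    return sum(node)

# Assumed hierarchy depths: Sex has levels 0..1, Zipcode has levels 0..2.
nodes, edges = build_lattice((1, 2))
print(len(nodes), len(edges))          # 6 7
print(min(nodes, key=height))          # (0, 0), the bottom element
print(max(nodes, key=height))          # (1, 2), the top element
```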
22Full-Domain Generalization
- Full-domain generalization maps the entire domain of each quasi-identifier attribute in T to a more general domain in its domain generalization hierarchy
- This scheme guarantees that all values of a particular attribute in V belong to the same domain
- To prevent a k-anonymization from generalizing domains more than necessary, a notion of minimality is introduced
  - V should be k-anonymous
  - The height of the resulting generalization is less than or equal to that of any other k-anonymous full-domain generalization (a search mechanism is needed)
23Full-Domain Generalization Algorithms
- To guarantee the minimality of the anonymization, a search technique should be used
  - Binary search (Samarati)
  - Breadth-first search: start with the least general domains at the root and check whether each generalization satisfies k-anonymity
- This paper refines the breadth-first search using bottom-up aggregation (rollup) along the domain generalization hierarchies
- The frequency set of each generalization, which determines its k-anonymity, is computed from the frequency set of the node it generalizes
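A plain bottom-up search by height, without rollup or any Incognito optimization, might look like the sketch below. The `HIERARCHIES` table, the helper names, and the toy rows are all hypothetical; the point is only to show how the minimal-height k-anonymous generalizations are found.

```python
from collections import Counter
from itertools import product

# Hypothetical hierarchies: index = generalization level, value = mapping function.
HIERARCHIES = {
    "Sex": [lambda v: v, lambda v: "Person"],
    "Zipcode": [lambda v: v, lambda v: v[:4] + "*", lambda v: v[:3] + "**"],
}

def generalize(rows, attrs, levels):
    """Apply one full-domain generalization: every value of an attribute uses the same level."""
    return [tuple(HIERARCHIES[a][l](row[a]) for a, l in zip(attrs, levels)) for row in rows]

def is_k_anonymous(values, k):
    return all(count >= k for count in Counter(values).values())

def minimal_generalizations(rows, attrs, k):
    """Visit lattice nodes in order of height; keep the k-anonymous nodes of least height."""
    tops = [len(HIERARCHIES[a]) - 1 for a in attrs]
    nodes = sorted(product(*(range(t + 1) for t in tops)), key=sum)
    best_height, result = None, []
    for node in nodes:
        if best_height is not None and sum(node) > best_height:
            break  # every remaining node is higher than the minimum found
        if is_k_anonymous(generalize(rows, attrs, node), k):
            best_height = sum(node)
            result.append(node)
    return result

# Hypothetical toy rows.
patients = [
    {"Sex": "Male", "Zipcode": "53715"}, {"Sex": "Female", "Zipcode": "53715"},
    {"Sex": "Male", "Zipcode": "53703"}, {"Sex": "Male", "Zipcode": "53703"},
    {"Sex": "Female", "Zipcode": "53706"}, {"Sex": "Female", "Zipcode": "53706"},
]

print(minimal_generalizations(patients, ["Sex", "Zipcode"], 2))  # [(1, 0)]
```

On this toy data the only minimal 2-anonymous generalization is (1, 0): Sex generalized one level, Zipcode left as is.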
25Incognito
- Three important techniques are combined
  - The generalization framework of Samarati and Sweeney
  - Managing multi-dimensional data
  - Mining association rules
- Since the COUNT measure is central, each domain generalization hierarchy can be thought of as a dimension
- With this dimension idea, a relation T and the generalization dimensions of its quasi-identifier attributes form a relational star schema
26Incognito (Contd)
- Figure 4. Star schema including generalization dimensions for quasi-identifier attributes
27Incognito (Contd)
- A full-domain k-anonymization is obtained in two steps
  - Joining the relation T with its dimension tables
  - Projecting the appropriate domain attributes
- Two key properties of dimension generalization are
  - Generalization Property
  - Rollup Property
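The join-then-project step can be sketched with in-memory dimension tables. Everything named here (`ZIPCODE_DIM`, `SEX_DIM`, `anonymized_view`, the column names Z0..Z2 and S0..S1, the sample rows) is a hypothetical stand-in for the star schema in the figure.

```python
# Hypothetical dimension tables: one entry per base value, one column per level.
ZIPCODE_DIM = {
    "53715": {"Z0": "53715", "Z1": "5371*", "Z2": "537**"},
    "53703": {"Z0": "53703", "Z1": "5370*", "Z2": "537**"},
}
SEX_DIM = {
    "Male": {"S0": "Male", "S1": "Person"},
    "Female": {"S0": "Female", "S1": "Person"},
}

def anonymized_view(rows, zip_level, sex_level):
    """Join each row with its dimension rows, then project the chosen level columns."""
    view = []
    for row in rows:
        zip_row = ZIPCODE_DIM[row["Zipcode"]]  # join on Zipcode
        sex_row = SEX_DIM[row["Sex"]]          # join on Sex
        view.append({
            "Sex": sex_row[sex_level],
            "Zipcode": zip_row[zip_level],
            "Disease": row["Disease"],         # non-identifying attribute kept as is
        })
    return view

# Hypothetical toy rows.
patients = [
    {"Sex": "Male", "Zipcode": "53715", "Disease": "Flu"},
    {"Sex": "Female", "Zipcode": "53703", "Disease": "Hepatitis"},
]

for row in anonymized_view(patients, zip_level="Z2", sex_level="S1"):
    print(row)
```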
28Generalization Property
- Assume two sets of attributes P and Q in a relation T such that DP ≤D DQ
  - The domain of Q is generalized from the domain of P
- If T is k-anonymous with respect to P, then it is also k-anonymous with respect to Q
- This follows from the fact that, if Q is generalized from P, each group of Q is a union of one or more groups of P
  - Every group of P contains k or more tuples
  - Every group of Q therefore contains the same number of tuples or more
29Rollup Property
- Assume two sets of attributes P and Q in a relation T such that DP ≤D DQ
- If f1, the frequency set of T with respect to P, is known, then each count in f2, the frequency set of T with respect to Q, can be generated by summing the counts in f1 whose value groups are mapped by the generalization function to the corresponding value group of f2
30Rollup Property (Contd)
- Assume P is <B, S, Z0> and Q is <B, S, Z1>
- The frequency set of P is calculated by a COUNT(*) query with the Birthdate, Sex and Zipcode attributes in the GROUP BY clause
- The frequency set of Q is calculated by summing the counts of P's frequency set, using a GROUP BY clause with Birthdate, Sex and Z1
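The rollup from <B, S, Z0> to <B, S, Z1> can be sketched as a re-aggregation of an existing frequency set, with no access to the base table. The `rollup` helper and the counts in `f1` are assumptions made for the example.

```python
from collections import Counter

def rollup(freq, position, value_map):
    """Derive a more general frequency set by generalizing one attribute and summing counts."""
    rolled = Counter()
    for key, count in freq.items():
        new_key = key[:position] + (value_map(key[position]),) + key[position + 1:]
        rolled[new_key] += count
    return rolled

# Hypothetical frequency set of T with respect to <Birthdate, Sex, Z0>.
f1 = Counter({
    ("1/21/76", "Male", "53715"): 1,
    ("1/21/76", "Male", "53710"): 2,
    ("4/13/86", "Female", "53706"): 1,
})

# Frequency set with respect to <Birthdate, Sex, Z1>, computed from f1 alone.
f2 = rollup(f1, position=2, value_map=lambda z: z[:4] + "*")
print(f2[("1/21/76", "Male", "5371*")])   # 3
print(f2[("4/13/86", "Female", "5370*")]) # 1
```

The example also illustrates the generalization property: counts in `f2` are sums of counts in `f1`, so the smallest count can never shrink when moving to a more general node.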
31Incognito (Contd)
- One more property is derived from the dynamic-programming approach used for mining frequent itemsets
- Subset Property: let Q be a set of attributes in relation T. If T is k-anonymous with respect to Q, then it is k-anonymous with respect to any set of attributes P such that P is a subset of Q
- P contains the same or fewer attributes to group by
- Fewer grouping attributes means the groups either stay the same or merge with other groups, which increases their sizes; k-anonymity is satisfied in each case
32Basic Incognito Algorithm
- Generates the set of all possible k-anonymous full-domain generalizations of T
- At iteration i, a graph of candidate multi-attribute generalizations is constructed over subsets of the quasi-identifier of size i. The set of candidate nodes is called Ci
- The set of direct multi-attribute generalization relationships (edges) connecting these nodes is denoted Ei
- Each iteration consists of two main parts
  - A modified breadth-first search over the graph produces the set Si, which contains the k-anonymous generalizations of size i
  - After obtaining Si, the algorithm constructs the set of candidate nodes of size i+1 (Ci+1) and the edges connecting them (Ei+1)
33Breadth-First Search
- At the ith iteration, the search determines the k-anonymity of table T with respect to each candidate in Ci
- The search starts with the roots: nodes that are not direct generalizations of any other node
- The rollup property optimizes the bottom-up aggregation
- The generalization property is used to mark the direct generalizations of a node as k-anonymous, without checking them, once the node itself is found to be k-anonymous
- This results in fewer calculations at later iterations
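One way to sketch the modified breadth-first search is with a priority queue ordered by height. This is a simplified illustration, not the paper's exact procedure: the `check` callback stands in for the frequency-set computation (rollup is omitted), and the set of k-anonymous nodes in the example is hypothetical.

```python
import heapq

def k_anonymous_nodes(nodes, edges, check):
    """Return the surviving (k-anonymous) nodes and the number of explicit checks made."""
    direct = {n: [] for n in nodes}
    has_incoming = set()
    for specific, general in edges:
        direct[specific].append(general)
        has_incoming.add(general)
    survivors = set(nodes)  # starts as a copy of the candidates; failures are removed
    marked = set()          # known k-anonymous via the generalization property
    queue = [(sum(n), n) for n in nodes if n not in has_incoming]  # roots
    heapq.heapify(queue)
    queued = {n for _, n in queue}
    checks = 0
    while queue:
        _, node = heapq.heappop(queue)
        if node in marked:
            continue  # no check needed
        checks += 1
        if check(node):
            marked.update(direct[node])
        else:
            survivors.discard(node)
            for general in direct[node]:
                if general not in queued:
                    queued.add(general)
                    heapq.heappush(queue, (sum(general), general))
    return survivors, checks

# Lattice over <Sex, Zipcode> with assumed levels 0..1 and 0..2.
nodes = [(s, z) for s in range(2) for z in range(3)]
edges = [((s, z), (s + 1, z)) for s, z in nodes if s < 1]
edges += [((s, z), (s, z + 1)) for s, z in nodes if z < 2]

# Hypothetical outcome of the k-anonymity test for each node.
anonymous = {(0, 2), (1, 0), (1, 1), (1, 2)}
survivors, checks = k_anonymous_nodes(nodes, edges, lambda n: n in anonymous)
print(sorted(survivors), checks)  # [(0, 2), (1, 0), (1, 1), (1, 2)] 4
```

Only four of the six nodes are checked explicitly; (1, 1) is skipped because it was marked, and (1, 2) is never queued.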
34Breadth-First Search
35Graph Generation
- Graphs are implemented as two relational tables: one for nodes, one for edges
- Figure 7. Graph representation
36Graph Generation
- Graph generation is done in three phases
  - Join phase: creates a superset of Ci based on Si-1
  - Prune phase: uses a hash tree structure to remove nodes that have a subset not in Si-1
  - Edge generation phase: selects the direct multi-attribute generalization relationships among the candidate nodes
- Figure 8. Graph generation
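The three phases can be sketched in an a priori style. This is an illustrative simplification: nodes are tuples of (attribute, level) pairs, a plain Python set replaces the hash tree, and the edge rule (same attributes, exactly one level raised by one) is an assumption rather than the paper's edge-generation query.

```python
from itertools import combinations

def generate_candidates(prev_survivors, attr_order):
    """Build candidate nodes of size i and their edges from the survivors of size i-1."""
    rank = {a: i for i, a in enumerate(attr_order)}
    prev = set(prev_survivors)
    # Join phase: combine nodes that share all but their last (attribute, level) pair.
    joined = {
        p + (q[-1],)
        for p in prev for q in prev
        if p[:-1] == q[:-1] and rank[p[-1][0]] < rank[q[-1][0]]
    }
    # Prune phase: every subset of size i-1 must itself be a survivor.
    size = len(next(iter(prev))) if prev else 0
    candidates = {
        c for c in joined
        if all(sub in prev for sub in combinations(c, size))
    }
    # Edge generation phase: same attributes, one level raised by exactly one.
    edges = {
        (a, b) for a in candidates for b in candidates
        if [x for x, _ in a] == [x for x, _ in b]
        and all(lb >= la for (_, la), (_, lb) in zip(a, b))
        and sum(lb - la for (_, la), (_, lb) in zip(a, b)) == 1
    }
    return candidates, edges

# Hypothetical survivors of size 1.
s1 = {(("Sex", 1),), (("Zipcode", 1),), (("Zipcode", 2),)}
c2, e2 = generate_candidates(s1, ["Birthdate", "Sex", "Zipcode"])
print(sorted(c2))
print(sorted(e2))
```

Here the join yields the two candidates <S1, Z1> and <S1, Z2>, connected by one edge. The prune phase only has an effect from size 3 onward, where a candidate is dropped if any of its size-2 subsets is missing from the previous survivors.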
37Algorithm Optimizations
- Two different techniques were applied
- Super-roots
- Bottom-up Pre-computation
38Super-roots
- A candidate node n in Ci is a root if there is no generalization edge in Ei directed from another node in Ci to n
- Because the pruning phase eliminates some nodes, the graph may contain several roots that come from the same family, and scanning the database once per root is redundant
- Roots of the same family are generalizations of the same quasi-identifier subset
- Super-roots Incognito scans the database once per family to compute the frequency set of a common, more specific node (the super-root), then derives the frequency set of each root in that family from it
39Super-roots example
- In the following figure
  - <B1, S1, Z0>, <B1, S0, Z2>, <B0, S1, Z2>
  - are roots, but all of them come from the same family, with super-root <B0, S0, Z0>
- This approach first calculates the frequency set of Patients with respect to the super-root, then uses it to calculate the frequency set of each root
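For the example roots, the super-root can be computed as the component-wise minimum of the roots' generalization levels. Treating the super-root this way is an interpretation of the example above, and `super_root` is a hypothetical helper.

```python
def super_root(roots):
    """Most general node from which every root can still be derived by rollup."""
    return tuple(min(levels) for levels in zip(*roots))

def generalizes(node, base):
    """True if `node` is equal to or more general than `base` in every attribute."""
    return all(n >= b for n, b in zip(node, base))

# Level vectors for <B1, S1, Z0>, <B1, S0, Z2>, <B0, S1, Z2>.
roots = [(1, 1, 0), (1, 0, 2), (0, 1, 2)]
sr = super_root(roots)
print(sr)                                        # (0, 0, 0), i.e. <B0, S0, Z0>
print(all(generalizes(r, sr) for r in roots))    # True
```

One scan of the table for the super-root then replaces one scan per root, since each root's frequency set can be rolled up from it.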
40Bottom-up Pre-computation
- The aim is to eliminate the need to scan T once per subset of the quasi-identifier in order to generate the necessary frequency sets
  - e.g. the frequency set of T with respect to <Sex, Zipcode> is computed from T again, even though the frequency set of T with respect to <Zipcode> is already known
- Strategy
  - First generate the frequency sets of T with respect to all subsets of the quasi-identifier at the lowest level of generalization
  - Then use these frequency sets in a bottom-up aggregation manner to calculate the more generalized frequency sets
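The first step of the strategy can be sketched in data-cube fashion: scan the table once for the full quasi-identifier, then derive each smaller subset's frequency set by aggregating a larger one. The derivation order (subsets from supersets) and the helper name `zero_generalization_cube` are assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations

def zero_generalization_cube(rows, quasi_identifier):
    """Lowest-level frequency sets for every non-empty subset, using one scan of the rows."""
    full = tuple(quasi_identifier)
    cube = {full: Counter(tuple(r[a] for a in full) for r in rows)}  # the only scan
    for size in range(len(full) - 1, 0, -1):
        for subset in combinations(full, size):
            # Pick any already computed frequency set with one more attribute.
            parent = next(
                p for p in list(cube)
                if len(p) == size + 1 and set(subset) <= set(p)
            )
            idx = [parent.index(a) for a in subset]
            agg = Counter()
            for key, count in cube[parent].items():
                agg[tuple(key[i] for i in idx)] += count
            cube[subset] = agg
    return cube

# Hypothetical toy rows.
patients = [
    {"Sex": "Male", "Zipcode": "53715"}, {"Sex": "Female", "Zipcode": "53715"},
    {"Sex": "Male", "Zipcode": "53703"}, {"Sex": "Male", "Zipcode": "53703"},
    {"Sex": "Female", "Zipcode": "53706"}, {"Sex": "Female", "Zipcode": "53706"},
]

cube = zero_generalization_cube(patients, ["Sex", "Zipcode"])
print(cube[("Sex",)])
print(cube[("Zipcode",)])
```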
42Performance Analysis
- Real-world data is used
- Basic Incognito, Super-roots Incognito, and Cube Incognito (bottom-up pre-computation) are implemented
- Results are compared with Samarati's binary search, bottom-up search (with rollup), and bottom-up search (without rollup)
- The Incognito algorithms uniformly outperformed the previous algorithms
43Experimental Data
- Previous algorithms were tested on small databases
  - Full-domain k-anonymity was tested on 256 records
  - Binary search was not evaluated experimentally
  - A genetic algorithm ran on larger databases but did not guarantee minimality
- Two databases are used in the experiments
  - Nine attributes, all of which are elements of the quasi-identifier set
  - The first database table contained 45,222 records
  - The second was much larger, with 4,591,581 records and 268 MB
- Implementations are written in Java on IBM DB2
- An AMD Athlon 1.5 GHz machine with 2 GB of physical memory is used
44Experimental Results
- Even Incognito is an exponential algorithm (in the size of the quasi-identifier)
- The rollup and a priori pruning optimizations provide linear speedups
- Test results show that Incognito outperforms the bottom-up approach by a linear factor
- Incognito finds all possible k-anonymous generalizations
  - Bottom-up search finds only one
45Effects of Rollup
- Bottom-up search is re-implemented, this time
considering the rollup property
46Effects of Pruning
- Pruning substantially decreased the number of nodes to be inspected
- For small quasi-identifier sizes, the numbers of nodes with and without pruning are close
- For large quasi-identifier sizes, however, a significant number of nodes are eliminated
47Effects of Super-Roots
- Since many frequency sets are calculated from other frequency sets, access to the original data is substantially decreased. This reduced the runtime of the entire algorithm
- E.g. by creating a single super-root frequency set, 4-5 scans of the entire data are eliminated
49Taxonomy of K-Anonymization Models
- Existing k-anonymization techniques can be categorized according to 3 main criteria
- Generalization vs. Suppression
  - Generalization considers the intermediate levels of a hierarchy; suppression removes the value entirely, with no intermediate levels
- Global vs. Local Recoding
  - Global recoding maps the values in a domain to modified values
  - Local recoding modifies individual instances of data items
- Hierarchy-Based vs. Partition-Based
  - Hierarchy-based generalization uses fixed generalization hierarchies
  - Partition-based generalization partitions a domain into disjoint ranges
51Related Work & Conclusions
- µ-Argus system
  - Considered attribute combinations of a limited size
  - Results were not always guaranteed to be k-anonymous
- Binary Search Algorithm
  - Discovers a single minimal full-domain generalization
- Datafly: full-domain generalization
  - Results are k-anonymous, but minimality is not guaranteed
52Conclusions
- The multi-dimensional data model is a simple and clear way to describe full-domain generalization
- Two key ideas for k-anonymization provided good results, namely
  - Bottom-up aggregation
  - A priori computation
- Together they make full-domain generalization on large databases feasible in reasonable time
53- Thanks for your attention
QUESTIONS??