Title: Anonymizing Tables for Privacy Protection
1. Anonymizing Tables for Privacy Protection
Gagan Aggarwal, Tomás Feder, Krishnaram
Kenthapadi, Rajeev Motwani, Rina Panigrahy,
Dilys Thomas, An Zhu
2. An example: Medical Records
Identifying  Identifying                       Sensitive
SSN          Name         Age  Race   Zipcode  Disease
614          Sara         31   Cauc   94305    Flu
615          Joan         34   Cauc   94307    Cold
629          Kelly        27   Cauc   94301    Diabetes
710          Mike         41   Afr-A  94305    Flu
840          Carl         41   Afr-A  94059    Arthritis
780          Joe          65   Hisp   94042    Heart problem
614          Rob          46   Hisp   94042    Arthritis
3. Medical Records: De-identify and Release
                     Sensitive
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis
4. Not sufficient! [Swe02, SS98]
                     Sensitive
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis
Linked with a public database, the remaining attributes can uniquely identify you!
5. k-anonymity: Problem Definition
- Input: Database consisting of n rows, each with m
  attributes drawn from a finite alphabet.
- Goal: Suppress some entries in the table such
  that each modified row becomes identical to at
  least k-1 other rows.
- The more entries are suppressed, the lower the utility
  of the modified table.
- Objective: Minimize the number of suppressed
  entries.
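The definition above can be checked mechanically. The following is a minimal sketch, assuming rows are tuples of strings and a suppressed entry is written as "*" (the marker, the function names, and the tiny table are illustrative assumptions, not part of the slides):

```python
from collections import Counter

SUPPRESSED = "*"  # assumed marker for a suppressed entry


def is_k_anonymous(rows, k):
    """Return True if every row is identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in rows)
    return all(count >= k for count in counts.values())


def suppression_cost(rows):
    """Count the suppressed entries, i.e. the objective to minimize."""
    return sum(cell == SUPPRESSED for row in rows for cell in row)


# Hypothetical 4-row, 2-attribute table after suppression.
table = [
    ("*", "Cauc"),
    ("*", "Cauc"),
    ("41", "Afr-A"),
    ("41", "Afr-A"),
]

print(is_k_anonymous(table, 2))  # True
print(suppression_cost(table))   # 2
```

Here two identical rows appear twice each, so the table is 2-anonymous but not 3-anonymous, at a cost of two suppressed entries.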
6. Medical Records: 2-anonymized table
Age  Race   Zipcode  Disease
*    Cauc   *        Flu
*    Cauc   *        Cold
*    Cauc   *        Diabetes
41   Afr-A  *        Flu
41   Afr-A  *        Arthritis
*    Hisp   94042    Heart problem
*    Hisp   94042    Arthritis
Suppressed entries are shown as *.
Cost = 10
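One way to reproduce the cost of 10 is to group the rows and suppress every attribute on which a group disagrees. This is a sketch under stated assumptions: the grouping (first three rows, next two, last two) is read off the table above, the Disease column is left out because it is released unchanged, and `anonymize` and the "*" marker are hypothetical names introduced for illustration.

```python
SUPPRESSED = "*"  # assumed marker for a suppressed entry


def anonymize(rows, groups):
    """Suppress, within each group, every column whose values differ."""
    released = [None] * len(rows)
    for group in groups:
        members = [rows[i] for i in group]
        # A column is kept only if all rows in the group agree on it.
        keep = [len(set(column)) == 1 for column in zip(*members)]
        for i in group:
            released[i] = tuple(
                value if kept else SUPPRESSED
                for value, kept in zip(rows[i], keep)
            )
    return released


# (Age, Race, Zipcode) from the medical records example.
records = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("27", "Cauc", "94301"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
    ("65", "Hisp", "94042"),
    ("46", "Hisp", "94042"),
]
groups = [[0, 1, 2], [3, 4], [5, 6]]

released = anonymize(records, groups)
cost = sum(cell == SUPPRESSED for row in released for cell in row)
print(cost)  # 10
```

The first group loses Age and Zipcode in three rows (6 entries), the second loses Zipcode in two rows (2), and the third loses Age in two rows (2).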
7. k-anonymity: Results
- [MW04]
  - NP-hardness for a linear size alphabet
  - O(k log k)-approximation algorithm
- Our results
  - NP-hardness (even for ternary alphabet)
  - O(k)-approximation for k-anonymity
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity
8. O(k)-approximation algorithm (for k = 3)
- Create a complete graph s.t.
- Each row vector in the table is a vertex.
- Weight of an edge is the number of attributes on
which the two rows differ (Hamming distance).
Age Race Zipcode
31 Cauc 94305
34 Cauc 94307
41 Afr-A 94305
41 Afr-A 94059
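The edge weights for the four rows above can be computed directly. A minimal sketch, assuming each row is a tuple of attribute values (the function and variable names are illustrative):

```python
def hamming(a, b):
    """Number of attributes on which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


rows = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
]

# Weight matrix of the complete graph: weights[i][j] is the edge weight
# between row i and row j.
weights = [[hamming(a, b) for b in rows] for a in rows]
for line in weights:
    print(line)
# [0, 2, 2, 3]
# [2, 0, 3, 3]
# [2, 3, 0, 1]
# [3, 3, 1, 0]
```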
9. O(k)-approximation algorithm (for k = 3)
- We create a forest as follows:
- Each node picks its nearest neighbor and connects
  to it.
- If the resulting graph has a component with only
  two nodes, connect this component to the second
  nearest neighbor of one of the two nodes.
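The two steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: ties are broken by row index, and for a two-node component the cheaper of the two candidate edges is chosen (the slide only says "one of the two nodes"). `build_forest` and its helpers are hypothetical names.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def build_forest(rows):
    """Return the forest edges as sorted (i, j) index pairs with i < j."""
    n = len(rows)
    if n < 3:
        raise ValueError("need at least 3 rows")
    parent = list(range(n))  # union-find over components
    edges = set()

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def link(u, v):
        edges.add((min(u, v), max(u, v)))
        parent[find(u)] = find(v)

    def nearest(u, exclude):
        # Closest row to u outside `exclude`; ties go to the lowest index.
        return min((v for v in range(n) if v not in exclude),
                   key=lambda v: (hamming(rows[u], rows[v]), v))

    # Step 1: every node connects to its nearest neighbor.
    for u in range(n):
        link(u, nearest(u, {u}))

    # Step 2: a two-node component connects to the second nearest
    # neighbor of one of its nodes (here: the cheaper option).
    for u in range(n):
        comp = [v for v in range(n) if find(v) == find(u)]
        if len(comp) == 2:
            a, b = min(((x, nearest(x, set(comp))) for x in comp),
                       key=lambda e: (hamming(rows[e[0]], rows[e[1]]), e))
            link(a, b)
    return sorted(edges)


rows = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
]
print(build_forest(rows))  # [(0, 1), (0, 2), (2, 3)]
```

On the four example rows, step 1 yields the pairs {0, 1} and {2, 3}; step 2 then joins them through the edge (0, 2), giving a single component of size 4.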
10. An example graph
[Figure: example graph with weighted edges. Legend: nearest-neighbor edges; other edges.]
11. The forest obtained
[Figure: the forest obtained from the example graph, with edge weights.]
12. O(k)-approximation algorithm (for k = 3)
- The forest has
- Components of size at least 3.
- The total cost of edges in the forest is no more
  than the cost of the optimal solution.
- In the optimal solution, each node has at least as
  many suppressed entries as its Hamming distance to
  its second nearest neighbor.
- Each node has at most as many suppressed entries as
  the cost of the tree containing the node.
- If there is any component with size greater than
  5, break it into components of size at least 3
  (resp. k).
13. The final partition
[Figure: the final partition of the example graph, with edge weights.]
14. Analysis of the algorithm
- Cluster the row vectors according to this
  partition.
- Cost incurred ≤ OPT × (size of largest
  partition) ≤ 5 × OPT.
- For general k, the cost of this solution is
  within a factor of max(3k-5, 2k-1) of the cost of
  the optimal solution.
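The bound can be read as a chain of inequalities built from the facts on the earlier slides. The notation is an assumption introduced here: $\mathcal{F}$ is the final set of trees, $|T|$ the number of rows in tree $T$, $w(T)$ its total edge weight, and $s(v)$ the number of suppressed entries in row $v$.

```latex
\begin{align*}
\text{cost}
  &= \sum_{T \in \mathcal{F}} \sum_{v \in T} s(v)
     && \\
  &\le \sum_{T \in \mathcal{F}} |T| \, w(T)
     && \text{each node has at most } w(T) \text{ suppressed entries} \\
  &\le \Bigl(\max_{T \in \mathcal{F}} |T|\Bigr) \sum_{T \in \mathcal{F}} w(T)
     && \\
  &\le \Bigl(\max_{T \in \mathcal{F}} |T|\Bigr) \cdot \mathrm{OPT}
     && \text{forest weight is at most } \mathrm{OPT} \\
  &\le 5 \cdot \mathrm{OPT}
     && \text{components have size at most 5 for } k = 3
\end{align*}
```

For k = 3 the general factor gives max(3·3-5, 2·3-1) = max(4, 5) = 5, which matches the last line.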
15. Better than O(k)-approximation?
- Not possible, using only the graph representation
- Lose information about the structure of the
  problem
- There exist two instances with
- Same underlying graph
- k-anonymity costs differing by a factor of O(k)
16. Open problems
- Lower bounds on the approximation factor (without
  assuming the graph representation)
- Extend the k-anonymity model to account for
  changes in the database
- Handle inserts, deletes and updates