Title: A Privacy Preserving Index for Range Queries
1. A Privacy Preserving Index for Range Queries
- Bijit Hore, Sharad Mehrotra, Gene Tsudik
2. Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]
- A client wants to store data on a remote server and run queries on it, but does not trust the server
- Solution: encrypt the data and store it
- How do you query the encrypted data?
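[Figure: DAS architecture. Trusted side (client): the user issues the original query, a query translator rewrites it into a query over encrypted data, and a query post-processor turns the encrypted results into the true results. Untrusted side (service provider): the server stores the encrypted, indexed client data.]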
3. Data storage in DAS
Client-side storage: metadata (bucket boundaries for sal)
bucket tag   Z0      Z1        Z2        Z3        Z4
boundaries   0-200   200-450   450-600   600-650   650-700

Original table (plain text) R
eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   423K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   632K

Server-side table (encrypted + indexed) RA; sharesA, ageA, and salA hold bucket tags
etuple   sharesA  ageA  salA
X_at_FJ  X1       Y2    Z1
CH(G!    X2       Y1    Z1
DL       X3       Y2    Z2
GH)      X3       Y3    Z3
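As a rough illustration of how the client could derive the salA bucket tags from its metadata, here is a minimal Python sketch. The boundaries are the ones on the slide; the function name `bucket_tag` and the convention that bucket i covers (lower, upper] are assumptions made for the example, not something the slides specify.

```python
from bisect import bisect_left

# Salary bucket boundaries (in K), taken from the slide's client-side metadata.
BOUNDS = [0, 200, 450, 600, 650, 700]

def bucket_tag(sal_k, bounds=BOUNDS, prefix="Z"):
    """Return the tag of the bucket that contains sal_k.

    Assumption: bucket i covers (bounds[i], bounds[i + 1]]; the value 0
    is placed in the first bucket.
    """
    if not bounds[0] <= sal_k <= bounds[-1]:
        raise ValueError(f"{sal_k} is outside the bucketized domain")
    i = max(bisect_left(bounds, sal_k) - 1, 0)
    return f"{prefix}{i}"

# Salaries (in K) of the four tuples in the plain-text table R.
salaries = {"Tom": 390, "Mary": 423, "John": 598, "Jerry": 632}
tags = {name: bucket_tag(s) for name, s in salaries.items()}
print(tags)
```

Only the tag, never the salary itself, would be stored next to the encrypted tuple on the server.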
4. Querying in DAS
Client-side query: SELECT * FROM R WHERE R.sal ∈ [400K, 600K]
Server-side query: SELECT etuple FROM RA WHERE RA.salA = Z1 ∨ RA.salA = Z2

Client-side table (plain text) R
eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   426K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   634K

Server-side table (encrypted + indexed) RA; sharesA, ageA, and salA hold bucket tags
etuple   sharesA  ageA  salA
X_at_FJ  X1       Y2    Z1
CH(G!    X2       Y1    Z1
DL       X3       Y2    Z2
GH)      X3       Y3    Z3
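The translation step can be sketched in Python as follows. This is an illustration only: the function names, the dictionary standing in for decryption, and the assumption that bucket i covers (lower, upper] with positive salaries are all hypothetical; the boundaries and table contents come from the slides.

```python
BOUNDS = [0, 200, 450, 600, 650, 700]  # salary bucket boundaries (in K)

# Server-side table RA reduced to (etuple, salA); etuple strings stand in for ciphertext.
RA = [("X_at_FJ", "Z1"), ("CH(G!", "Z1"), ("DL", "Z2"), ("GH)", "Z3")]

# Hypothetical stand-in for client-side decryption: etuple -> plaintext salary (in K).
PLAINTEXT_SAL = {"X_at_FJ": 390, "CH(G!": 426, "DL": 598, "GH)": 634}

def translate_range(lo, hi, bounds=BOUNDS, prefix="Z"):
    """Client side: tags of all buckets that overlap lo <= sal <= hi.

    Assumption: bucket i covers (bounds[i], bounds[i + 1]].
    """
    tags = []
    for i in range(len(bounds) - 1):
        b_lo, b_hi = bounds[i], bounds[i + 1]
        if b_hi >= lo and b_lo < hi:  # (b_lo, b_hi] intersects [lo, hi]
            tags.append(f"{prefix}{i}")
    return tags

def server_query(tags):
    """Server side: return every encrypted tuple whose salA tag was requested."""
    return [etuple for etuple, sal_tag in RA if sal_tag in tags]

def post_process(etuples, lo, hi):
    """Client side: 'decrypt' and drop false positives."""
    return [e for e in etuples if lo <= PLAINTEXT_SAL[e] <= hi]

tags = translate_range(400, 600)
candidates = server_query(tags)
answers = post_process(candidates, 400, 600)
print(tags, candidates, answers)
```

The first tuple (salary 390K) is a false positive: it shares bucket Z1 with a qualifying tuple, so the server returns it and the client discards it after decryption.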
5. Issues in partitioning
- How many buckets should one use?
- How to partition the data?
6. Data Privacy in DAS
- Adversary (A)
- Access to server-side data + malicious intentions
- Privacy issue in partitioned data
- Small range of a bucket B + 1 sample value from B → almost total disclosure of all elements in B
- Privacy goal of client
- To hide all useful information from A
- Put all values of an attribute in a single bucket!
7. Research challenges and our contributions
- Precision: how to partition data
- Definition
- Optimal partitioning to maximize precision
- Privacy: quantifying disclosure
- Adversary's goals
- Measures of information disclosure
- Privacy-precision trade-off
- Controlled diffusion algorithm
- Experiments and conclusion
[Figure: privacy vs. precision trade-off]
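8. Precision of range queries
- Given a partition of data into M parts
- Precision(q) = 1 - (# false positives / # tuples returned for q); Recall = 1
- Workload: all O(N^2) range queries are equiprobable (uniform)
- Total false positives ∝ Σ_B N_B·F_B, where N_B is the number of distinct domain values in bucket B and F_B is the number of tuples in B
- Example with M = 2: Σ_B N_B·F_B = 5·32 + 5·18 = 250
- Example query q: Precision = 1 - 20/50 = 0.6
[Figure: histogram of Frequency vs. Salary (100Ks), domain size N = 10, split into M = 2 buckets; the right bucket has N_B = 5, F_B = 18; query q is marked]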
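The arithmetic on this slide can be reproduced with a short Python sketch. The per-value frequencies below are an assumption: they are reconstructed so that they match the bucket totals quoted on the slides (32 and 18), since the histogram itself did not survive. The query range [4, 7] is likewise a hypothetical choice that yields 20 false positives out of 50 returned tuples.

```python
# Assumed frequencies of salary values 1..10 (reconstructed, not given explicitly).
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]
# M = 2 buckets as inclusive (low, high) value ranges.
BUCKETS = [(1, 5), (6, 10)]

def freq_sum(lo, hi):
    """Number of tuples whose value lies in [lo, hi] (1-based, inclusive)."""
    return sum(FREQ[lo - 1:hi])

def precision(q_lo, q_hi, buckets=BUCKETS):
    """Precision of range query [q_lo, q_hi]: the server returns whole buckets."""
    returned = sum(freq_sum(b_lo, b_hi)
                   for b_lo, b_hi in buckets
                   if b_lo <= q_hi and q_lo <= b_hi)
    false_positives = returned - freq_sum(q_lo, q_hi)
    return 1 - false_positives / returned

def total_cost(buckets=BUCKETS):
    """Sum over buckets of N_B * F_B (distinct values times tuples)."""
    return sum((b_hi - b_lo + 1) * freq_sum(b_lo, b_hi)
               for b_lo, b_hi in buckets)

print(total_cost())      # 5*32 + 5*18
print(precision(4, 7))   # 30 true answers out of 50 returned tuples
```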
9. Query optimal buckets (QOB)
- Optimization problem
- For the uniform workload, find a partition of the data into M buckets that minimizes total false positives, i.e., minimize Σ_{B=1}^{M} N_B·F_B (M = 4 in the example)
- Dynamic programming: QOB(1,10,4) = optimal solution to a sub-problem, QOB(1,7,3), + cost of the rightmost bucket, Cost(8,10)
- Cost(8,10) = N_B·F_B = 24
[Figure: histogram of Frequency vs. Salary (100Ks), domain size N = 10, with the rightmost bucket covering values 8 to 10]
10. QOB (cont.)
- Optimal cost: Σ_{B=1}^{4} N_B·F_B = 12·3 + 20·2 + 10·2 + 8·3 = 120
- Time complexity O(n^2 M), space O(nM), where n = number of distinct values in the dataset and M = number of buckets
[Figure: histogram of Frequency vs. Salary (100Ks), partitioned into the optimal buckets B1, B2, B3, B4]
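A dynamic program of this shape can be sketched in Python as below. It is a minimal illustration of the recurrence (best partition of a prefix into m - 1 buckets plus the cost of the rightmost bucket), not the authors' implementation. The frequencies are an assumption, reconstructed to agree with the bucket totals on the slides (12, 20, 10, 8), and the name `qob` is only borrowed from the slide title.

```python
def qob(freq, M):
    """Split values 1..n into exactly M contiguous buckets minimizing sum N_B * F_B.

    Returns (optimal cost, list of inclusive (low, high) buckets).
    Runs in O(n^2 * M) time and O(n * M) space.
    """
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    def cost(i, j):
        # Bucket covering values i..j: distinct values times tuples.
        return (j - i + 1) * (prefix[j] - prefix[i - 1])

    INF = float("inf")
    best = [[INF] * (M + 1) for _ in range(n + 1)]  # best[j][m]: values 1..j, m buckets
    cut = [[0] * (M + 1) for _ in range(n + 1)]     # start of the rightmost bucket
    best[0][0] = 0
    for m in range(1, M + 1):
        for j in range(1, n + 1):
            for i in range(m, j + 1):               # rightmost bucket is i..j
                c = best[i - 1][m - 1] + cost(i, j)
                if c < best[j][m]:
                    best[j][m], cut[j][m] = c, i

    # Walk the cut table backwards to recover the bucket boundaries.
    buckets, j = [], n
    for m in range(M, 0, -1):
        i = cut[j][m]
        buckets.append((i, j))
        j = i - 1
    return best[n][M], buckets[::-1]

# Assumed frequencies of salary values 1..10 (reconstructed from the slide totals).
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]
print(qob(FREQ, 4))
```

Under these assumed frequencies the rightmost bucket of the four-bucket solution covers values 8 to 10, matching the Cost(8,10) decomposition shown on the previous slide.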
11. Outline
- Optimal data partitioning for range queries
- Adversarial goals and privacy measures
- Balancing privacy and precision
- Experiments and conclusion
12. Adversary's learning model
- Need to learn bucket properties to estimate sensitive values
- Model: A's domain knowledge + sample values from buckets → A learns the distribution of values in buckets
- Worst-case assumption for privacy analysis
- A knows the exact value distribution for every bucket
13. Adversarial Goal (I)
- Individual-centric information
- E.g., what is the salary of an individual I?
- Value Estimation Power (VEP) of A
- Variance of the bucket distribution is an inverse measure of VEP
[Figure: two distributions over a bucket range. Large variance (preferred): large average error of value estimation for the adversary. Small variance: small average error.]
14. Adversarial Goal (II)
- Query-centric information
- E.g., which individuals have salary ∈ [100K, 150K]?
- Set Estimation Power (SEP) of A
- Entropy of the bucket distribution is an inverse measure of SEP
- H(X) = - Σ_i p_i log p_i
[Figure: two distributions over a bucket range, with the query range 100K to 150K marked. High entropy, large variance (best case): large average error of query-set estimation for the adversary. Low entropy, large variance: small average error.]
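To see why variance alone is not enough, the two measures can be computed side by side. The Python sketch below is illustrative only: the two sample buckets are made up, and log base 2 is an assumption, since the slide does not fix the base of the logarithm.

```python
import math

def bucket_distribution(values):
    """Empirical distribution of a bucket: value -> probability."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return {v: c / len(values) for v, c in counts.items()}

def variance(dist):
    """Variance of the bucket distribution (inverse measure of VEP)."""
    mean = sum(v * p for v, p in dist.items())
    return sum(p * (v - mean) ** 2 for v, p in dist.items())

def entropy(dist):
    """Entropy in bits, H(X) = -sum p_i log2 p_i (inverse measure of SEP)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical buckets of salaries in K.
spread = bucket_distribution([100, 125, 150, 175])     # four distinct values
two_point = bucket_distribution([100, 100, 175, 175])  # mass at the two ends only

print(variance(spread), entropy(spread))
print(variance(two_point), entropy(two_point))
```

The second bucket has the larger variance but the smaller entropy, which is the "low entropy, large variance" case from the figure: the adversary is wrong by a lot on any single value, yet has only two candidate values to choose from when guessing which tuples answer a query.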
16. Privacy-Precision Trade-off
- Optimal buckets might offer less privacy than desired
- Small variance → partial disclosure of numeric value
- Small entropy → total disclosure with high probability (e.g., categorical data)
- Partial detection of query-sets (for all cases)
- Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy
17. The controlled diffusion algorithm
- Let a query Q overlap only with B0
- If the elements of B0 are distributed into CB1, CB2, and CB3 randomly
- Now Q overlaps with CB1, CB2, and CB3
- With the new buckets, the precision for Q drops by a factor of (|CB1| + |CB2| + |CB3|) / |B0|
- Any re-distribution scheme where, ∀ Bi, this ratio ≤ K ⇒ precision degradation is bounded above by K
[Figure: query Q over bucket B0, whose elements are diffused into composite buckets CB1, CB2, CB3]
18. Controlled diffusion algorithm
- Compute optimal buckets on data set D → {B1, ..., BM}
- Fix the max degradation factor K
- Initialize M empty composite buckets → {CB1, ..., CBM}
- Set the target size of each CB to f_CB = |D|/M (equidepth)
- ∀ Bi: select d_i CBs at random, where d_i = K·|Bi| / f_CB
- Diffuse the elements of Bi into these CBs uniformly at random
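The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: d_i is rounded down and clamped to the range 1..M (the slide does not say how a fractional d_i is handled), and the equidepth target f_CB is used only to compute d_i, so no composite bucket capacity is enforced. The function name and the `seed` parameter are hypothetical.

```python
import math
import random

def controlled_diffusion(buckets, K, seed=0):
    """Diffuse optimal buckets into M composite buckets (CBs).

    buckets: list of lists of values (the optimal buckets B1..BM).
    Returns (composite buckets, chosen CB indices for each Bi).
    """
    rng = random.Random(seed)
    M = len(buckets)
    total = sum(len(b) for b in buckets)
    f_cb = total / M                       # equidepth target size |D| / M
    composite = [[] for _ in range(M)]
    chosen_per_bucket = []
    for b in buckets:
        # Assumption: round d_i down and clamp it to 1..M.
        d = max(1, min(M, math.floor(K * len(b) / f_cb)))
        chosen = rng.sample(range(M), d)   # d distinct CBs picked at random
        for x in b:
            composite[rng.choice(chosen)].append(x)
        chosen_per_bucket.append(sorted(chosen))
    return composite, chosen_per_bucket

# Hypothetical input: four optimal buckets of sizes 12, 20, 10, and 8.
optimal = [
    [1] * 4 + [2] * 4 + [3] * 4,
    [4] * 10 + [5] * 10,
    [6] * 6 + [7] * 4,
    [8] * 4 + [9] * 2 + [10] * 2,
]
cbs, chosen = controlled_diffusion(optimal, K=2, seed=1)
print([len(cb) for cb in cbs], chosen)
```

Because capacity is not enforced here, the sketch does not by itself guarantee the bound K on precision degradation; it only shows the selection and random diffusion steps.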
19. Controlled Diffusion (Example)
- Degradation factor K = 2
- Metadata size increases from O(M) to O(KM)
[Figure: top, the query optimal buckets B1 to B4 over the histogram of Freq vs. Values (1 to 10); bottom, the final set of composite buckets CB1 to CB4 stored on the server, each holding values drawn from several optimal buckets]
20. Some features of the diffusion algorithm
- Many consecutive optimal buckets might get diffused into a common set of CBs → observed precision degradation < K
- Elements with the same value can go to multiple buckets → giving it an extra degree of freedom compared to hashing
- Not best for point queries
- Random choice in the algorithm → each bucket distribution approaches the data distribution as K increases → reducing the information gained by the adversary by learning buckets
22. Experiments
- Data sets
- Synthetic data: 10^5 integers in [0, 999], chosen uniformly at random
- Real data: 10^4 real values in [-0.8, 8.0], Corel Image dataset (UCI KDD archive)
- Query workloads (2 of size 10^4 each)
- End points chosen uniformly at random from the respective ranges
23.
- Relative decrease in precision of composite buckets
- Relative increase in standard deviation in composite buckets
- Relative increase in entropy in composite buckets
24. Composite buckets (sample)
[Figures: sample composite buckets for K = 6, M = 350 and for K = 10, M = 250]
25.
- Visualizing trade-offs for various bucketization parameters
- E.g., the marked points show the average entropy and precision we get for 100 buckets and a degradation factor of 2
- The same point in the precision vs. standard deviation trade-off space
- Provides an easy way to visualize the design space and choose parameters of interest
26. Summary
- An optimal algorithm for partitioning data for range queries
- Statistical measures of data privacy
- Variance
- Entropy
- Fast, simple algorithm for re-bucketizing data
- Bounded amount of precision degradation
- Substantial increase in privacy level
27. Related work
- Hacigumus et al., SIGMOD 2002, Executing SQL over Encrypted Data in the Database-Service-Provider Model.
- Damiani et al., ACM CCS 2003, Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs.
- Bouganim et al., VLDB 2002, Chip-Secured Data Access: Confidential Data on Untrusted Servers.
28. Thank you!
Questions?
29. Privacy in DAS
- Here the goal of data privacy is not just ensuring non-disclosure of identity. It is more general!
- DAS
- Privacy criteria: hide as much information as possible (even at the aggregate level)
- Utility criteria: maintain only the necessary information required for server-side query evaluation (at the desired degree of accuracy)
- Privacy-preserving data mining and statistical databases
- Privacy criteria: protect against disclosure of identity
- Utility criteria: minimize information loss, i.e., maximize utility for data miners and retain as much aggregate-level information as possible
30. Individual Privacy Measure
- Average Squared Error of Estimation (ASEE)
- Error in approximating the true value of a r.v. X_B by another r.v. X'_B (learned by A)
- ASEE(X_B, X'_B) = Var(X_B) + Var(X'_B) + (E(X_B) - E(X'_B))^2
- Variance of the bucket distribution, Var(X_B), is our measure of individual privacy (a lower bound on ASEE)
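The lower bound can be checked with a short derivation. It is a sketch under two assumptions that the slide does not spell out: ASEE is taken to be the expected squared difference between the two random variables, and the adversary's estimate X'_B is independent of the true value X_B.

```latex
\begin{align*}
\mathrm{ASEE}(X_B, X'_B)
  &= E\big[(X_B - X'_B)^2\big] \\
  &= \mathrm{Var}(X_B - X'_B) + \big(E[X_B - X'_B]\big)^2 \\
  &= \mathrm{Var}(X_B) + \mathrm{Var}(X'_B) + \big(E(X_B) - E(X'_B)\big)^2
     && \text{(independence)} \\
  &\ge \mathrm{Var}(X_B).
\end{align*}
```

The last two terms are non-negative, so whatever estimate the adversary forms, its average squared error cannot fall below the variance of the bucket distribution.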
31. Set-oriented Privacy Measure
- Entropy of the bucket distribution is our measure for query-centric privacy
- Measures the uncertainty associated with a r.v. (e.g., the true class of an element for categorical data)
- An inverse measure of the quality of partial solution sets that A can derive for a query
- H(X) = - Σ_i p_i log p_i
32. Metadata size increase in diffusion
- The metadata increases from O(M) to
- K|B1|/f_CB + K|B2|/f_CB + ... + K|BM|/f_CB
- = (K/f_CB)·(|B1| + |B2| + ... + |BM|)
- = (KM/|D|)·|D| = O(KM), since f_CB = |D|/M