Title: Privacy-Preserving Data Mining
1 Privacy-Preserving Data Mining
- Jaideep Vaidya (jsvaidya_at_rbs.rutgers.edu)
- Joint work with
- Chris Clifton (Purdue University)
2 Outline
- Introduction
- Privacy-Preserving Data Mining
- Horizontal / Vertical Partitioning of Data
- Secure Multi-party Computation
- Privacy-Preserving Outlier Detection
- Privacy-Preserving Association Rule Mining
- Conclusion
3 Back in the good ol' days
[Figure: separate grocery chains — Dominick's, Safeway, Jewel — and their data, now vs. the future]
4 A real example
- Ford / Firestone
- Individual databases
- Possible to join both databases (find corresponding transactions)
- Commercial reasons not to share data
- Valuable corporate information: cost structures / business structures
- Ford Explorers with Firestone tires ⇒ tread separation problems (accidents!)
- Might have been able to figure this out a bit earlier (tires from the Decatur, Ill. plant, in certain situations)
5 Public (mis)Perception of Data Mining: Attack on Privacy
- Fears of loss of privacy constrain data mining
- Protests over a National Registry in Japan
- Data Mining Moratorium Act
- Would stop all data mining R&D by DoD
- Terrorism Information Awareness ended
- Data Mining could be a key technology
6 Is Data Mining a Threat?
- Data Mining summarizes data
- (Possible?) exception: anomaly / outlier detection
- Summaries aren't private
- Or are they?
- Does generating them raise issues?
- Data mining can be a privacy solution
- Data mining enables safe use of private data
7 Privacy Problems with Data Mining
- The problem isn't Data Mining, it is the infrastructure to support it!
- Japanese registry data already held by prefectures
- Protests arose over moving to a National registry
- Total Information Awareness program doesn't generate new data
- Goal is to enable use of data from multiple agencies
- Loss of Separation of Control
- Increases potential for misuse
- Find patterns while seeing only your own data!
8 Privacy-Preserving Data Mining
- How can we mine data if we cannot see it?
- Perturbation
- Agrawal & Srikant, Evfimievski et al.
- Extremely scalable, approximate results
- Debate about security properties
- Cryptographic
- Lindell & Pinkas, Vaidya & Clifton
- Completely accurate, completely secure (tight bound on disclosure), appropriate for a small number of parties
- Condensation / Hybrid
9 Assumptions
- Data distributed
- Each data set held by a source authorized to see it
- Nobody is allowed to see the aggregate data
- Knowing all data about an individual violates privacy
- Data holders don't want to disclose data
- Won't collude to violate privacy
10 Gold Standard: Trusted Third Party
11 Horizontal Partitioning of Data
[Figure: same schema, different rows — e.g. Bank of America and Chase Manhattan each hold their own customers' records]
12 Vertical Partitioning of Data
[Figure: a global database view split by columns — e.g. cell phone data and medical records held by different parties for the same individuals]
13 Secure Multi-Party Computation (SMC)
- Given a function f and n inputs distributed at n sites, compute the result
- while revealing nothing to any site except its own input(s) and the result
14 Secure Multi-Party Computation: It can be done!
- Yao's Millionaires' problem (Yao '86)
- Secure computation possible if the function can be represented as a circuit
- Idea: securely compute each gate
- Continue to evaluate the circuit
- Extended to multiple parties (BGW/GMW '87)
- Biggest problem: efficiency
- Will not work for many parties / large quantities of data
15 SMC Models of Computation
- Semi-honest Model
- Parties follow the protocol faithfully
- Malicious Model
- Anything goes!
- Provably Secure
- In either case, input can always be modified
16 Incentive compatibility
- From a higher-level perspective (an economic notion)
- If a party cheats
- Either the party is caught
- Or the party suffers an economic loss
- Possible for many useful collaboration problems
- If the protocol is incentive compatible, the semi-honest model is sufficient for security
17 What is an Outlier?
- An object O in a dataset T is a DB(p, dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O
- Centralized solution from Knorr and Ng
- Nested loop comparison
- Maintain a count of objects inside the threshold
- If the count exceeds the threshold, declare non-outlier and move to the next object
- Clever processing order minimizes I/O cost
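As a point of reference, the centralized Knorr-Ng nested-loop algorithm sketched above can be written in a few lines of plain (non-private) Python; the early exit mirrors the "count exceeds threshold ⇒ non-outlier" step. Names and parameter values here are illustrative:

```python
import math

def db_outliers(points, p, dt):
    """Flag DB(p, dt)-outliers: points for which at least fraction p of the
    dataset lies at distance greater than dt (Knorr & Ng's definition)."""
    n = len(points)
    # A point with more than (1 - p) * n neighbours within dt is not an outlier.
    max_close = (1 - p) * n
    outliers = []
    for o in points:
        close = 0
        is_outlier = True
        for t in points:
            if math.dist(o, t) <= dt:
                close += 1
                if close > max_close:  # early exit, as in the nested-loop algorithm
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(o)
    return outliers

# Three clustered points and one far-away point:
print(db_outliers([(0, 0), (0.1, 0), (0, 0.1), (10, 10)], 0.5, 1.0))
```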
18 Privacy-Preserving Solution
- Key idea: share splitting
- Computations leave results (randomly) split between parties
- Only outcome is whether the count of points within the distance threshold exceeds the outlier threshold
- Requires pairwise comparison of all points
- But failure to compare all points reveals information about non-outliers
- This alone makes it possible to cluster points
- This is a privacy violation
- Asymptotically equivalent to Knorr & Ng
19 Solution: Horizontal Partition
- Compare locally with your own points
- For remote points, get a random share of the distance
- Calculate a random share of "exceeds threshold" or "doesn't"
- Sum shares and test whether there are enough close points
[Figure: example point sets, with each pairwise distance and threshold test split into random additive shares between the two parties]
20 Random share of distance
- Squared distance: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y
- ||x||^2 and ||y||^2 are computed locally; the cross term x·y is a scalar product
- Several protocols for share-splitting the scalar product (Du & Atallah '01, Vaidya & Clifton '02, Ioannidis, Grama & Atallah '02)
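A minimal sketch of this decomposition, with an insecure stand-in for the secure scalar-product protocols cited above (a real protocol would produce the shares without either party seeing the other's vector):

```python
import random

def share_scalar_product(x, y):
    # Stand-in for a secure scalar-product protocol (e.g. Du & Atallah '01):
    # returns random additive shares s1 + s2 = x . y. A real protocol
    # computes these without revealing either vector.
    dot = sum(a * b for a, b in zip(x, y))
    s1 = random.uniform(-1000, 1000)
    return s1, dot - s1

def distance_shares(x, y):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y; the norms are local,
    # only the cross term needs the secure protocol.
    s1, s2 = share_scalar_product(x, y)
    share_a = sum(v * v for v in x) - 2 * s1  # party A's random share
    share_b = sum(v * v for v in y) - 2 * s2  # party B's random share
    return share_a, share_b

# The two shares look random individually but sum to the squared distance:
a, b = distance_shares([1.0, 2.0], [3.0, 4.0])  # true distance^2 is 8
```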
21 Shares of "Within Threshold"
- Goal: is x + y ≤ dt, where x and y are the parties' shares of the distance?
- Essentially Yao's Millionaires' problem (Yao '86)
- Represent the function to be computed as a circuit
- Cryptographic protocol gives random shares of each wire
- Solves "sum of shares of 'within dt' exceeds minimum" as well
22 Vertically Partitioned Data
- Each party computes its part of the distance
- Squared distance decomposes over attributes: d(X, Y)^2 = Σ_i (x_i - y_i)^2
- Each party sums the terms for the attributes it holds, locally
- Secure comparison (circuit evaluation) gives each party shares of 1/0 (close / not close)
- Sum and compare as with horizontal partitioning
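Since each party's attributes contribute an independent term to the squared distance, the local computation is trivial; a toy example with a hypothetical two-party attribute split:

```python
def local_distance_part(mine_x, mine_y):
    # Contribution of this party's attributes to the squared distance.
    return sum((a - b) ** 2 for a, b in zip(mine_x, mine_y))

# Hypothetical split of points X = (1, 2, 0, 1) and Y = (2, 4, 3, 1):
# party 1 holds the first two attributes, party 2 the last two.
part1 = local_distance_part([1, 2], [2, 4])  # (1-2)^2 + (2-4)^2 = 5
part2 = local_distance_part([0, 1], [3, 1])  # (0-3)^2 + (1-1)^2 = 9
total = part1 + part2                        # full squared distance = 14
```

The parts themselves act as the additive shares that feed the secure comparison step.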
23 Why is this Secure?
- Random shares are indistinguishable from random values
- Contain no knowledge in isolation
- Assuming no collusion, so shares are viewed in isolation
- Number of values (= number of shares) is known
- Nothing new revealed
- "Too few close points" is the outlier definition
- This is the desired result
- No knowledge that can't be discovered from one's own input and the result!
24 Conclusion (Outlier Detection)
- Outlier detection feasible without revealing anything but the outliers
- Possibly expensive (quadratic)
- But a more efficient solution for this definition of outlier inherently reveals potentially privacy-violating information
- Key: privacy of non-outliers preserved
- The reason why outliers are outliers is also hidden
- Allows search for unusual entities without disclosing private information about them
25 Association Rules
- Association rules: a common data mining task
- Find A, B, C such that AB ⇒ C holds frequently (e.g. Diapers ⇒ Beer)
- Fast algorithms for centralized and distributed computation
- Basic idea: for AB ⇒ C to be frequent, AB, AC, and BC must all be frequent
- Requires sharing data
- Secure Multiparty Computation too expensive
26 Association Rule Mining
- Find out if itemset {A, B} is frequent (i.e. if support of {A, B} ≥ k)
- Support of an itemset is the number of transactions in which all attributes of the itemset are present
- For binary data, support = Σ_i Ai · Bi
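For binary (market-basket) data, this support computation is just a dot product of the items' bit-vectors; the values below are made up for illustration:

```python
# Bit-vectors over five transactions: 1 means the item appears.
A = [1, 0, 1, 1, 0]
B = [1, 0, 0, 1, 1]
support = sum(a * b for a, b in zip(A, B))  # transactions containing both A and B
```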
27 Association Rule Mining
- Idea based on the TID-list representation of data
- Represent attribute A as the TID-list Atid
- Support of ABC is |Atid ∩ Btid ∩ Ctid|
- Use a secure protocol for the size of set intersection to find candidate sets
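The TID-list formulation maps directly onto set intersection; a small made-up example:

```python
A_tid = {1, 3, 4, 7}     # transaction IDs containing A
B_tid = {1, 2, 4, 7, 9}  # transaction IDs containing B
C_tid = {1, 4, 8}        # transaction IDs containing C
support_ABC = len(A_tid & B_tid & C_tid)  # |{1, 4}| = 2
```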
28 Cardinality of Set Intersection
- Use a secure commutative hash function
- Pohlig-Hellman encryption
- Each party generates its own encryption key
- All parties encrypt all the input sets
29 Cardinality of Set Intersection
- Hashing
- All parties hash all sets with their keys
- Initial intersection
- Each party finds the intersection of all sets (except its own)
- Final intersection
- Parties exchange the final intersection set and compute the intersection of all sets
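The commutativity that makes this work comes from exponentiation modulo a prime, Pohlig-Hellman style: E_k(x) = x^k mod p, so applying two parties' keys in either order yields the same value. A toy demonstration — the parameters here are illustrative only, not production choices (real deployments need carefully chosen primes and keys coprime to p - 1):

```python
p = 2**521 - 1       # a Mersenne prime, used here only for illustration
k1, k2 = 65537, 257  # each party's private exponent, assumed coprime to p - 1

def enc(x, k):
    # Pohlig-Hellman-style "commutative hash": x^k mod p.
    return pow(x, k, p)

x = 123456789
# Order of encryption doesn't matter: E1(E2(x)) == E2(E1(x)) == x^(k1*k2) mod p.
assert enc(enc(x, k1), k2) == enc(enc(x, k2), k1)
```

Because doubly encrypted values are equal exactly when the underlying items are equal, parties can intersect fully encrypted sets without learning the items themselves.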
30 Computing Size of Intersection
[Figure: three parties with sets X, Y, Z; each set is passed around and encrypted under every party's key (e.g. E1(E2(E3(Z)))), so fully encrypted values can be compared and intersected without revealing the items; the final result is X ∩ Y ∩ Z]
31 Why is an intermediate intersection step needed?
- Probing
- One party is only interested in a particular item
- Its input set is composed of the interesting item plus junk
- The output reveals information about the presence / absence of the item
- Solution
- Intermediate step: every party receives the encrypted sets of all other parties (but not its own)
- If the intersection size is lower than a threshold, probing is possible ⇒ abort the protocol
32 Proof of Security
- Proof by simulation
- What is known
- The size of the intersection set
- What site i learns
- How it can be simulated
- The protocol is symmetric, so simulating the view of one party is sufficient
33 Proof of Security
- Hashing
- Party i receives an encrypted set from party i-1
- Can use random numbers to simulate this
- Intersection
- Party i receives the fully hashed sets of all parties
34 Simulating Fully Encrypted Sets
- Given the intersection sizes |ABC| = 2, |AB| = 3, |AC| = 4, |BC| = 2, |A| = 6, |B| = 7, |C| = 8
- The exclusive Venn-diagram regions follow by inclusion-exclusion:
- ABC only: 2
- AB only: 3 - 2 = 1
- AC only: 4 - 2 = 2
- BC only: 2 - 2 = 0
- A only: 6 - 2 - 1 - 2 = 1
- B only: 7 - 2 - 1 - 0 = 4
- C only: 8 - 2 - 2 - 0 = 4
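The slide's arithmetic can be checked directly: given all subset intersection sizes, the exclusive Venn regions (and hence a simulatable set layout) follow by inclusion-exclusion:

```python
# Intersection sizes from the slide's example.
sizes = {"ABC": 2, "AB": 3, "AC": 4, "BC": 2, "A": 6, "B": 7, "C": 8}

only_ABC = sizes["ABC"]                             # 2
only_AB = sizes["AB"] - sizes["ABC"]                # 3 - 2 = 1
only_AC = sizes["AC"] - sizes["ABC"]                # 4 - 2 = 2
only_BC = sizes["BC"] - sizes["ABC"]                # 2 - 2 = 0
only_A = sizes["A"] - only_ABC - only_AB - only_AC  # 6 - 2 - 1 - 2 = 1
only_B = sizes["B"] - only_ABC - only_AB - only_BC  # 7 - 2 - 1 - 0 = 4
only_C = sizes["C"] - only_ABC - only_AC - only_BC  # 8 - 2 - 2 - 0 = 4
```

Filling each exclusive region with fresh random values yields sets whose pairwise and triple intersections match the known sizes, which is all a simulator needs.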
35 [Figure: Venn diagram of sets A, B, C]
36 Optimized version
37 Association Rule Mining (Revisited)
- Naïve algorithm: simply use APRIORI; a single set intersection determines the frequency of a single candidate itemset
- But there are thousands of itemsets
- Key intuition
- The set intersection algorithm developed also allows computation of intermediate sets
- All parties get fully encrypted sets for all attributes
- Local computation allows efficient discovery of all association rules
38 Communication Cost
- k parties, m set size, p frequent attributes
- k(2k - 2) = O(k^2) messages
- p(2p - 2) · m · (encrypted message size) = O(p^2 m) bits
- k rounds
- Independent of the number of itemsets found
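Plugging small illustrative values into these formulas (the numbers are made up, not from the slides):

```python
k, p, m = 3, 5, 1000  # parties, frequent attributes, set size

messages = k * (2 * k - 2)       # k(2k - 2) = 12 messages, O(k^2)
traffic_units = p * (2 * p - 2) * m  # p(2p - 2)m = 40000 encrypted items;
# multiply by the encrypted message size to get the O(p^2 m) bit count.
```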
39 Other Results
- ID3 Decision Tree learning
- Horizontal partitioning: Lindell & Pinkas '00
- Also vertical partitioning (Du, Vaidya)
- Association Rules
- Horizontal partitioning: Kantarcioglu
- K-Means / EM Clustering
- K-Nearest Neighbor
- Naïve Bayes, Bayes network structure
- And many more
40 Challenges
- What do the results reveal?
- A general approach (instead of one per data mining technique)
- Experimental results
- Incentive compatibility
- Note: upcoming book in the Advances in Information Security series from Springer-Verlag
41 Questions