Title: Privacy-Preserving Data Mining
1 Privacy-Preserving Data Mining
- Jaideep Vaidya (jsvaidya_at_rbs.rutgers.edu)
- Joint work with
- Chris Clifton (Purdue University)
2 Outline
- Introduction
- Privacy-Preserving Data Mining
- Horizontal / Vertical Partitioning of Data
- Secure Multi-party Computation
- Privacy-Preserving Outlier Detection
- Privacy-Preserving Association Rule Mining
- Conclusion
3 Back in the good ol' days
[Figure: separate grocery chains — Dominick's, Safeway, Jewel — and their data, now vs. the future]
4 A real example
- Ford / Firestone
- Individual databases
- Possible to join both databases (find corresponding transactions)
- Commercial reasons not to share data
- Valuable corporate information: cost structures / business structures
- Ford Explorers with Firestone tires ⇒ tread separation problems (accidents!)
- Might have been able to figure this out a bit earlier (tires from the Decatur, Ill. plant, in certain situations)
5 Public (mis)Perception of Data Mining: Attack on Privacy
- Fears of loss of privacy constrain data mining
- Protests over a National Registry in Japan
- Data Mining Moratorium Act
- Would stop all data mining R&D by DoD
- Terrorism Information Awareness ended
- Data Mining could be a key technology
6 Is Data Mining a Threat?
- Data Mining summarizes data
- (Possible?) exception: anomaly / outlier detection
- Summaries aren't private
- Or are they?
- Does generating them raise issues?
- Data mining can be a privacy solution
- Data mining enables safe use of private data
7 Privacy Problems with Data Mining
- The problem isn't Data Mining, it is the infrastructure to support it!
- Japanese registry data already held by prefectures
- Protests arose over moving to a National registry
- Total Information Awareness program doesn't generate new data
- Goal is to enable use of data from multiple agencies
- Loss of Separation of Control
- Increases potential for misuse
- Find patterns while seeing only your own data!
8 Privacy-Preserving Data Mining
- How can we mine data if we cannot see it?
- Perturbation
- Agrawal & Srikant, Evfimievski et al.
- Extremely scalable, approximate results
- Debate about security properties
- Cryptographic
- Lindell & Pinkas, Vaidya & Clifton
- Completely accurate, completely secure (tight bound on disclosure), appropriate for a small number of parties
- Condensation / Hybrid
9 Assumptions
- Data distributed
- Each data set held by a source authorized to see it
- Nobody is allowed to see the aggregate data
- Knowing all data about an individual violates privacy
- Data holders don't want to disclose data
- Won't collude to violate privacy
10 Gold Standard: Trusted Third Party
11 Horizontal Partitioning of Data
[Figure: same schema, different rows — e.g. Bank of America and Chase Manhattan each hold their own customers' records]
12 Vertical Partitioning of Data
[Figure: a global database view split by columns — e.g. cell phone data and medical records held by different parties for the same individuals]
13 Secure Multi-Party Computation (SMC)
- Given a function f and n inputs distributed at n sites, compute the result
- while revealing nothing to any site except its own input(s) and the result
14 Secure Multi-Party Computation: It can be done!
- Yao's Millionaires' problem (Yao '86)
- Secure computation possible if the function can be represented as a circuit
- Idea: securely compute each gate
- Continue to evaluate the circuit
- Extended to multiple parties (BGW/GMW '87)
- Biggest problem: efficiency
- Will not work for many parties / large quantities of data
15 SMC Models of Computation
- Semi-honest Model
- Parties follow the protocol faithfully
- Malicious Model
- Anything goes!
- Provably Secure
- In either case, input can always be modified
16 Incentive compatibility
- From a higher-level perspective (an economic notion)
- If a party cheats
- Either the party is caught
- Or the party suffers an economic loss
- Possible for many useful collaboration problems
- If the protocol is incentive compatible, the semi-honest model is sufficient for security
17 What is an Outlier?
- An object O in a dataset T is a DB(p, dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O
- Centralized solution from Knorr and Ng
- Nested loop comparison
- Maintain a count of objects inside the threshold
- If the count exceeds the threshold, declare non-outlier and move to the next object
- Clever processing order minimizes I/O cost
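As a point of reference, the centralized Knorr-Ng nested-loop algorithm sketched above can be written in a few lines of plain (non-private) Python; the early exit mirrors the "count exceeds threshold ⇒ non-outlier" step. Names and parameter values here are illustrative:

```python
import math

def db_outliers(points, p, dt):
    """Flag DB(p, dt)-outliers: points for which at least fraction p of the
    dataset lies at distance greater than dt (Knorr & Ng's definition)."""
    n = len(points)
    # A point with more than (1 - p) * n neighbours within dt is not an outlier.
    max_close = (1 - p) * n
    outliers = []
    for o in points:
        close = 0
        is_outlier = True
        for t in points:
            if math.dist(o, t) <= dt:
                close += 1
                if close > max_close:  # early exit, as in the nested-loop algorithm
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(o)
    return outliers

# Three clustered points and one far-away point:
print(db_outliers([(0, 0), (0.1, 0), (0, 0.1), (10, 10)], 0.5, 1.0))
```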
18 Privacy-Preserving Solution
- Key idea: share splitting
- Computations leave results (randomly) split between parties
- Only outcome is whether the count of points within the distance threshold exceeds the outlier threshold
- Requires pairwise comparison of all points
- But failure to compare all points reveals information about non-outliers
- This alone makes it possible to cluster points
- This is a privacy violation
- Asymptotically equivalent to Knorr & Ng
19 Solution: Horizontal Partition
- Compare locally with your own points
- For remote points, get a random share of the distance
- Calculate a random share of "exceeds threshold" or "doesn't"
- Sum shares and test whether there are enough close points
[Figure: example point sets, with each pairwise distance and threshold test split into random additive shares between the two parties]
20 Random share of distance
- Squared distance: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y
- ||x||^2 and ||y||^2 are computed locally; the cross term x·y is a scalar product
- Several protocols for share-splitting the scalar product (Du & Atallah '01, Vaidya & Clifton '02, Ioannidis, Grama & Atallah '02)
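A minimal sketch of this decomposition, with an insecure stand-in for the secure scalar-product protocols cited above (a real protocol would produce the shares without either party seeing the other's vector):

```python
import random

def share_scalar_product(x, y):
    # Stand-in for a secure scalar-product protocol (e.g. Du & Atallah '01):
    # returns random additive shares s1 + s2 = x . y. A real protocol
    # computes these without revealing either vector.
    dot = sum(a * b for a, b in zip(x, y))
    s1 = random.uniform(-1000, 1000)
    return s1, dot - s1

def distance_shares(x, y):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y; the norms are local,
    # only the cross term needs the secure protocol.
    s1, s2 = share_scalar_product(x, y)
    share_a = sum(v * v for v in x) - 2 * s1  # party A's random share
    share_b = sum(v * v for v in y) - 2 * s2  # party B's random share
    return share_a, share_b

# The two shares look random individually but sum to the squared distance:
a, b = distance_shares([1.0, 2.0], [3.0, 4.0])  # true distance^2 is 8
```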
21 Shares of "Within Threshold"
- Goal: is x + y ≤ dt, where x and y are the parties' shares of the distance?
- Essentially Yao's Millionaires' problem (Yao '86)
- Represent the function to be computed as a circuit
- Cryptographic protocol gives random shares of each wire
- Solves "sum of shares of 'within dt' exceeds minimum" as well
22 Vertically Partitioned Data
- Each party computes its part of the distance
- Squared distance decomposes over attributes: d(X, Y)^2 = Σ_i (x_i - y_i)^2
- Each party sums the terms for the attributes it holds, locally
- Secure comparison (circuit evaluation) gives each party shares of 1/0 (close / not close)
- Sum and compare as with horizontal partitioning
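Since each party's attributes contribute an independent term to the squared distance, the local computation is trivial; a toy example with a hypothetical two-party attribute split:

```python
def local_distance_part(mine_x, mine_y):
    # Contribution of this party's attributes to the squared distance.
    return sum((a - b) ** 2 for a, b in zip(mine_x, mine_y))

# Hypothetical split of points X = (1, 2, 0, 1) and Y = (2, 4, 3, 1):
# party 1 holds the first two attributes, party 2 the last two.
part1 = local_distance_part([1, 2], [2, 4])  # (1-2)^2 + (2-4)^2 = 5
part2 = local_distance_part([0, 1], [3, 1])  # (0-3)^2 + (1-1)^2 = 9
total = part1 + part2                        # full squared distance = 14
```

The parts themselves act as the additive shares that feed the secure comparison step.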
23 Why is this Secure?
- Random shares are indistinguishable from random values
- Contain no knowledge in isolation
- Assuming no collusion, so shares are viewed in isolation
- Number of values (= number of shares) is known
- Nothing new revealed
- "Too few close points" is the outlier definition
- This is the desired result
- No knowledge that can't be discovered from one's own input and the result!
24 Conclusion (Outlier Detection)
- Outlier detection feasible without revealing anything but the outliers
- Possibly expensive (quadratic)
- But a more efficient solution for this definition of outlier inherently reveals potentially privacy-violating information
- Key: privacy of non-outliers preserved
- The reason why outliers are outliers is also hidden
- Allows search for unusual entities without disclosing private information about them
25 Association Rules
- Association rules: a common data mining task
- Find A, B, C such that AB ⇒ C holds frequently (e.g. Diapers ⇒ Beer)
- Fast algorithms for centralized and distributed computation
- Basic idea: for AB ⇒ C to be frequent, AB, AC, and BC must all be frequent
- Requires sharing data
- Secure Multiparty Computation too expensive
26 Association Rule Mining
- Find out if itemset {A, B} is frequent (i.e. if support of {A, B} ≥ k)
- Support of an itemset is the number of transactions in which all attributes of the itemset are present
- For binary data, support = Σ_i Ai · Bi
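For binary (market-basket) data, this support computation is just a dot product of the items' bit-vectors; the values below are made up for illustration:

```python
# Bit-vectors over five transactions: 1 means the item appears.
A = [1, 0, 1, 1, 0]
B = [1, 0, 0, 1, 1]
support = sum(a * b for a, b in zip(A, B))  # transactions containing both A and B
```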
27 Association Rule Mining
- Idea based on the TID-list representation of data
- Represent attribute A as the TID-list Atid
- Support of ABC is |Atid ∩ Btid ∩ Ctid|
- Use a secure protocol for the size of set intersection to find candidate sets
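The TID-list formulation maps directly onto set intersection; a small made-up example:

```python
A_tid = {1, 3, 4, 7}     # transaction IDs containing A
B_tid = {1, 2, 4, 7, 9}  # transaction IDs containing B
C_tid = {1, 4, 8}        # transaction IDs containing C
support_ABC = len(A_tid & B_tid & C_tid)  # |{1, 4}| = 2
```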
28 Cardinality of Set Intersection
- Use a secure commutative hash function
- Pohlig-Hellman encryption
- Each party generates its own encryption key
- All parties encrypt all the input sets
29 Cardinality of Set Intersection
- Hashing
- All parties hash all sets with their keys
- Initial intersection
- Each party finds the intersection of all sets (except its own)
- Final intersection
- Parties exchange the final intersection set and compute the intersection of all sets
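The commutativity that makes this work comes from exponentiation modulo a prime, Pohlig-Hellman style: E_k(x) = x^k mod p, so applying two parties' keys in either order yields the same value. A toy demonstration — the parameters here are illustrative only, not production choices (real deployments need carefully chosen primes and keys coprime to p - 1):

```python
p = 2**521 - 1       # a Mersenne prime, used here only for illustration
k1, k2 = 65537, 257  # each party's private exponent, assumed coprime to p - 1

def enc(x, k):
    # Pohlig-Hellman-style "commutative hash": x^k mod p.
    return pow(x, k, p)

x = 123456789
# Order of encryption doesn't matter: E1(E2(x)) == E2(E1(x)) == x^(k1*k2) mod p.
assert enc(enc(x, k1), k2) == enc(enc(x, k2), k1)
```

Because doubly encrypted values are equal exactly when the underlying items are equal, parties can intersect fully encrypted sets without learning the items themselves.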
30 Computing Size of Intersection
[Figure: three parties with sets X, Y, Z; each set is passed around and encrypted under every party's key (e.g. E1(E2(E3(Z)))), so fully encrypted values can be compared and intersected without revealing the items; the final result is X ∩ Y ∩ Z]
31 Why is an intermediate intersection step needed?
- Probing
- One party is only interested in a particular item
- Its input set is composed of the interesting item plus junk
- The output reveals information about the presence / absence of the item
- Solution
- Intermediate step: every party receives the encrypted sets of all other parties (but not its own)
- If the intersection size is lower than a threshold, probing is possible ⇒ abort the protocol
32 Proof of Security
- Proof by simulation
- What is known
- The size of the intersection set
- What site i learns
- How it can be simulated
- The protocol is symmetric, so simulating the view of one party is sufficient
33 Proof of Security
- Hashing
- Party i receives an encrypted set from party i-1
- Can use random numbers to simulate this
- Intersection
- Party i receives the fully hashed sets of all parties
34 Simulating Fully Encrypted Sets
- Given the intersection sizes |ABC| = 2, |AB| = 3, |AC| = 4, |BC| = 2, |A| = 6, |B| = 7, |C| = 8
- The exclusive Venn-diagram regions follow by inclusion-exclusion:
- ABC only: 2
- AB only: 3 - 2 = 1
- AC only: 4 - 2 = 2
- BC only: 2 - 2 = 0
- A only: 6 - 2 - 1 - 2 = 1
- B only: 7 - 2 - 1 - 0 = 4
- C only: 8 - 2 - 2 - 0 = 4
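The slide's arithmetic can be checked directly: given all subset intersection sizes, the exclusive Venn regions (and hence a simulatable set layout) follow by inclusion-exclusion:

```python
# Intersection sizes from the slide's example.
sizes = {"ABC": 2, "AB": 3, "AC": 4, "BC": 2, "A": 6, "B": 7, "C": 8}

only_ABC = sizes["ABC"]                             # 2
only_AB = sizes["AB"] - sizes["ABC"]                # 3 - 2 = 1
only_AC = sizes["AC"] - sizes["ABC"]                # 4 - 2 = 2
only_BC = sizes["BC"] - sizes["ABC"]                # 2 - 2 = 0
only_A = sizes["A"] - only_ABC - only_AB - only_AC  # 6 - 2 - 1 - 2 = 1
only_B = sizes["B"] - only_ABC - only_AB - only_BC  # 7 - 2 - 1 - 0 = 4
only_C = sizes["C"] - only_ABC - only_AC - only_BC  # 8 - 2 - 2 - 0 = 4
```

Filling each exclusive region with fresh random values yields sets whose pairwise and triple intersections match the known sizes, which is all a simulator needs.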
35 [Figure: Venn diagram of sets A, B, C]
36 Optimized version
37 Association Rule Mining (Revisited)
- Naïve algorithm: simply use APRIORI; a single set intersection determines the frequency of a single candidate itemset
- But there are thousands of itemsets
- Key intuition
- The set intersection algorithm developed also allows computation of intermediate sets
- All parties get fully encrypted sets for all attributes
- Local computation allows efficient discovery of all association rules
38 Communication Cost
- k parties, m set size, p frequent attributes
- k(2k - 2) = O(k^2) messages
- p(2p - 2) · m · (encrypted message size) = O(p^2 m) bits
- k rounds
- Independent of the number of itemsets found
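Plugging small illustrative values into these formulas (the numbers are made up, not from the slides):

```python
k, p, m = 3, 5, 1000  # parties, frequent attributes, set size

messages = k * (2 * k - 2)       # k(2k - 2) = 12 messages, O(k^2)
traffic_units = p * (2 * p - 2) * m  # p(2p - 2)m = 40000 encrypted items;
# multiply by the encrypted message size to get the O(p^2 m) bit count.
```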
39 Other Results
- ID3 Decision Tree learning
- Horizontal partitioning: Lindell & Pinkas '00
- Also vertical partitioning (Du, Vaidya)
- Association Rules
- Horizontal partitioning: Kantarcioglu
- K-Means / EM Clustering
- K-Nearest Neighbor
- Naïve Bayes, Bayes network structure
- And many more
40 Challenges
- What do the results reveal?
- A general approach (instead of one per data mining technique)
- Experimental results
- Incentive compatibility
- Note: upcoming book in the Advances in Information Security series from Springer-Verlag
41 Questions