Privacy-Preserving Data Mining

1
Privacy-Preserving Data Mining
  • Jaideep Vaidya (jsvaidya_at_rbs.rutgers.edu)
  • Joint work with
  • Chris Clifton (Purdue University)

2
Outline
  • Introduction
  • Privacy-Preserving Data Mining
  • Horizontal / Vertical Partitioning of Data
  • Secure Multi-party Computation
  • Privacy-Preserving Outlier Detection
  • Privacy-Preserving Association Rule Mining
  • Conclusion

3
Back in the good ol' days
Future
Now
Dominick's
Safeway
Jewel
4
A real example
  • Ford / Firestone
  • Individual databases
  • Possible to join both databases (find
    corresponding transactions)
  • Commercial reasons to not share data
  • Valuable corporate information - Cost structures
    / business structures
  • Ford Explorers with Firestone tires → Tread
    Separation Problems (Accidents!)
  • Might have been able to figure out a bit earlier
    (Tires from Decatur, Ill. plant, certain
    situations)

5
Public (mis)Perception of Data Mining: Attack on
Privacy
  • Fears of loss of privacy constrain data mining
  • Protests over a National Registry
  • In Japan
  • Data Mining Moratorium Act
  • Would stop all data mining R&D by DoD
  • Terrorism Information Awareness ended
  • Data Mining could be key technology

6
Is Data Mining a Threat?
  • Data Mining summarizes data
  • (Possible?) exception: Anomaly / Outlier
    detection
  • Summaries aren't private
  • Or are they?
  • Does generating them raise issues?
  • Data mining can be a privacy solution
  • Data mining enables safe use of private data

7
Privacy Problems with Data Mining
  • The problem isn't Data Mining, it is the
    infrastructure to support it!
  • Japanese registry data already held by
    prefectures
  • Protests arose over moving to a National registry
  • Total Information Awareness program doesn't
    generate new data
  • Goal is to enable use of data from multiple
    agencies
  • Loss of Separation of Control
  • Increases potential for misuse
  • Find patterns while seeing only your own data!

8
Privacy-Preserving Data Mining
  • How can we mine data if we cannot see it?
  • Perturbation
  • Agrawal & Srikant, Evfimievski et al.
  • Extremely scalable, approximate results
  • Debate about security properties
  • Cryptographic
  • Lindell & Pinkas, Vaidya & Clifton
  • Completely accurate, completely secure (tight
    bound on disclosure), appropriate for small
    number of parties
  • Condensation/Hybrid

9
Assumptions
  • Data distributed
  • Each data set held by source authorized to see it
  • Nobody is allowed to see aggregate data
  • Knowing all data about an individual violates
    privacy
  • Data holders don't want to disclose data
  • Won't collude to violate privacy

10
Gold Standard: Trusted Third Party
11
Horizontal Partitioning of Data
Bank of America
Chase Manhattan
12
Vertical Partitioning of Data
Global Database View
Cell Phone Data
Medical Records
13
Secure Multi-Party Computation (SMC)
  • Given a function f and n inputs, distributed
    at n sites, compute the result
  • while revealing nothing to any site except
    its own input(s) and the result.

14
Secure Multi-Party Computation: It can be done!
  • Yao's Millionaires' problem (Yao 86)
  • Secure computation possible if function can be
    represented as a circuit
  • Idea: Securely compute gate
  • Continue to evaluate circuit
  • Extended to multiple parties (BGW/GMW 87)
  • Biggest Problem - Efficiency
  • Will not work for lots of parties / large
    quantities of data

15
SMC Models of Computation
  • Semi-honest Model
  • Parties follow the protocol faithfully
  • Malicious Model
  • Anything goes!
  • Provably Secure
  • In either case, input can always be modified

16
Incentive compatibility
  • From a higher level perspective (economic notion)
  • If a party cheats
  • Either party is caught
  • Or party suffers an economic loss
  • Possible for many useful collaboration problems
  • If protocol is incentive compatible, semi-honest
    model sufficient for security

17
What is an Outlier?
  • An object O in a dataset T is a DB(p,dt)-outlier
    if at least fraction p of the objects in T lie at
    distance greater than dt from O
  • Centralized solution from Knorr and Ng
  • Nested loop comparison
  • Maintain count of objects inside threshold
  • If count exceeds threshold, declare non-outlier
    and move to next
  • Clever processing order minimizes I/O cost
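
The DB(p,dt) definition and the nested-loop count above can be sketched in a few lines. This is a minimal centralized illustration, not Knorr and Ng's I/O-optimized algorithm; the function names and the sample points are assumptions made for the example.

```python
import math

def is_db_outlier(o, dataset, p, dt):
    """True if at least fraction p of dataset lies farther than dt from o."""
    n = len(dataset)
    # At most n - p*n objects (o itself included) may lie within dt.
    max_close = n - p * n
    close = 0
    for other in dataset:
        if math.dist(o, other) <= dt:
            close += 1
            if close > max_close:
                return False  # count exceeds threshold: non-outlier, stop early
    return True

def db_outliers(dataset, p, dt):
    """Nested-loop scan: test every object against every other object."""
    return [o for o in dataset if is_db_outlier(o, dataset, p, dt)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # hypothetical data
outliers = db_outliers(points, p=0.75, dt=2.0)
```

With these sample points only (10, 10) has at least 75% of the data farther than 2.0 away.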

18
Privacy-Preserving Solution
  • Key idea: share splitting
  • Computations leave results (randomly) split
    between parties
  • Only outcome is if the count of points within
    distance threshold exceeds outlier threshold
  • Requires pairwise comparison of all points
  • But failure to compare all points reveals
    information about non-outliers
  • This alone makes it possible to cluster points
  • This is a privacy violation
  • Asymptotically equivalent to Knorr & Ng
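
As a rough illustration of share splitting, the sketch below uses additive shares modulo a fixed number. The modulus, the function names, and the sample indicator values are assumptions for the example; the slides do not specify a particular sharing scheme.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus, not taken from the slides

def split(value):
    """Split value into two shares; each share alone looks uniformly random."""
    share_a = secrets.randbelow(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b

def reconstruct(share_a, share_b):
    """Combine both shares to recover the value."""
    return (share_a + share_b) % MODULUS

# Shares of 1/0 "within threshold" results can be added locally by each party.
indicators = [1, 0, 1, 1, 0]  # hypothetical comparison results
shares = [split(bit) for bit in indicators]
sum_a = sum(a for a, _ in shares) % MODULUS
sum_b = sum(b for _, b in shares) % MODULUS
count = reconstruct(sum_a, sum_b)
```

Here the count is reconstructed in the clear only to show that the shares add up. In the protocol described on the slide, the only revealed outcome is whether the count exceeds the outlier threshold.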

19
Solution: Horizontal Partition
  • Compare locally with your own points
  • For remote points, get random share of distance
  • Calculate random share of "exceeds threshold" or
    "doesn't"
  • Sum shares and test if enough close points
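
The four steps above can be simulated end to end. In this sketch `shared_close_bit` is an insecure stand-in for the secure distance and comparison protocols (it computes the bit in the clear before splitting it), and the final comparison is also done in the clear; the modulus, names, and data are assumptions for illustration.

```python
import math
import secrets

MODULUS = 2**61 - 1  # illustrative modulus, not taken from the slides

def shared_close_bit(x, y, dt):
    """Insecure stand-in: additive shares of 1 if dist(x, y) <= dt, else 0."""
    bit = 1 if math.dist(x, y) <= dt else 0
    r = secrets.randbelow(MODULUS)
    return r, (bit - r) % MODULUS

def is_outlier_horizontal(o, own_points, remote_points, p, dt):
    """Test point o (held by the local site) against local and remote points."""
    n = len(own_points) + len(remote_points)
    # Step 1: compare locally with your own points (o itself is included).
    share_owner = sum(1 for q in own_points if math.dist(o, q) <= dt)
    share_remote = 0
    # Steps 2-3: for each remote point, both sites get a share of the 1/0 result.
    for q in remote_points:
        a, b = shared_close_bit(o, q, dt)
        share_owner = (share_owner + a) % MODULUS
        share_remote = (share_remote + b) % MODULUS
    # Step 4: sum shares and test. A real protocol does this comparison securely.
    close = (share_owner + share_remote) % MODULUS
    return close <= n - p * n

site_a = [(0, 0), (0, 1), (10, 10)]  # hypothetical points at the local site
site_b = [(1, 0), (1, 1)]            # hypothetical points at the remote site
far_point = is_outlier_horizontal((10, 10), site_a, site_b, p=0.75, dt=2.0)
near_point = is_outlier_horizontal((0, 0), site_a, site_b, p=0.75, dt=2.0)
```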

[Figure: example of random shares exchanged between the two sites]

20
Random share of distance
  • (x - y)^2 = x^2 - 2xy + y^2: x^2 and y^2 are
    computed locally; the sum of the xy terms is a
    scalar product
  • Several protocols for share-splitting scalar
    product (Du & Atallah 01; Vaidya & Clifton 02;
    Ioannidis, Grama & Atallah 02)
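
To make the decomposition concrete: the squared distance between x and y is sum(x_i^2) + sum(y_i^2) - 2(x · y), so only the scalar product needs a joint protocol. The sketch below uses an insecure stand-in for that protocol (it computes the product in the clear and then splits it); the modulus, function names, and vectors are assumptions, and integer coordinates are assumed so the modular arithmetic is exact.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus, not taken from the slides

def scalar_product_shares(x, y):
    """Insecure stand-in for a share-splitting scalar product protocol."""
    dot = sum(a * b for a, b in zip(x, y))
    s_a = secrets.randbelow(MODULUS)
    s_b = (dot - s_a) % MODULUS
    return s_a, s_b

def squared_distance_shares(x, y):
    """Each party combines its local square sum with its scalar product share."""
    s_a, s_b = scalar_product_shares(x, y)
    share_a = (sum(a * a for a in x) - 2 * s_a) % MODULUS  # computed by x's owner
    share_b = (sum(b * b for b in y) - 2 * s_b) % MODULUS  # computed by y's owner
    return share_a, share_b

x = (1, 2)  # hypothetical point held by one party
y = (4, 6)  # hypothetical point held by the other party
share_a, share_b = squared_distance_shares(x, y)
squared_distance = (share_a + share_b) % MODULUS
```

The two shares sum to (1-4)^2 + (2-6)^2 = 25, while each share on its own is a random-looking value.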

21
Shares of Within Threshold
  • Goal: is x + y ≤ dt?
  • Essentially Yao's Millionaires' problem (Yao 86)
  • Represent function to be computed as circuit
  • Cryptographic protocol gives random shares of
    each wire
  • Also solves "sum of shares from within dt exceeds
    minimum"

22
Vertically Partitioned Data
  • Each party computes its part of distance
  • Secure comparison (circuit evaluation) gives each
    party shares of 1/0 (close/not)
  • Sum and compare as with horizontal partitioning

23
Why is this Secure?
  • Random shares indistinguishable from random
    values
  • Contain no knowledge in isolation
  • Assuming no collusion so shares viewed in
    isolation
  • Number of values (= number of shares) known
  • Nothing new revealed
  • Too few close points is outlier definition
  • This is the desired result
  • No knowledge that can't be discovered from one's
    own input and the result!

24
Conclusion (Outlier Detection)
  • Outlier detection feasible without revealing
    anything but the outliers
  • Possibly expensive (quadratic)
  • But more efficient solution for this definition
    of outlier inherently reveals potential
    privacy-violating information
  • Key: Privacy of non-outliers preserved
  • Reason why outliers are outliers also hidden
  • Allows search for unusual entities without
    disclosing private information about entities

25
Association Rules
  • Association rules: a common data mining task
  • Find A, B, C such that AB → C holds frequently
    (e.g. Diapers → Beer)
  • Fast algorithms for centralized and distributed
    computation
  • Basic idea: For AB → C to be frequent, AB, AC,
    and BC must all be frequent
  • Require sharing data
  • Secure Multiparty Computation too expensive

26
Association Rule Mining
  • Find out if itemset {A1, B1} is frequent (i.e., if
    support of {A1, B1} ≥ k)
  • A B
  • Support of itemset is defined as number of
    transactions in which all attributes of the
    itemset are present
  • For binary data, support = |Ai ∧ Bi| (the number
    of transactions in which both are 1).

27
Association Rule Mining
  • Idea based on TID-list representation of data
  • Represent attribute A as TID-list Atid
  • Support of ABC is |Atid ∩ Btid ∩ Ctid|
  • Use a secure protocol to find size of set
    intersection to find candidate sets
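
A small centralized sketch of the TID-list view, without any privacy protection: the support of an itemset is the size of the intersection of its attributes' TID-lists. The transaction table and function names are made up for the example. With vertically partitioned data each TID-list would sit at a different party, which is why a secure intersection-size protocol is needed.

```python
def tid_list(transactions, attribute):
    """Set of transaction IDs in which the attribute is present."""
    return {tid for tid, items in transactions.items() if attribute in items}

def support(transactions, itemset):
    """Support = size of the intersection of the attributes' TID-lists."""
    lists = [tid_list(transactions, a) for a in itemset]
    return len(set.intersection(*lists))

transactions = {  # hypothetical data: TID -> attributes present
    1: {"A", "B", "C"},
    2: {"A", "B"},
    3: {"A", "C"},
    4: {"B", "C"},
    5: {"A", "B", "C"},
}
support_abc = support(transactions, ["A", "B", "C"])
support_ab = support(transactions, ["A", "B"])
```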

28
Cardinality of Set Intersection
  • Use a secure commutative hash function
  • Pohlig-Hellman Encryption
  • Each party generates own encryption key
  • All parties encrypt all the input sets

29
Cardinality of Set Intersection
  • Hashing
  • All parties hash all sets with their key
  • Initial intersection
  • Each party finds intersection of all sets (except
    its own)
  • Final intersection
  • Parties exchange the final intersection set, and
    compute the intersection of all sets
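
The sketch below illustrates why commutativity matters, using Pohlig-Hellman style exponentiation E_k(x) = x^k mod p. It runs all parties in one process and skips the intermediate intersection step, so it shows only the arithmetic, not the message flow. The prime, the seeded key generator, the hashing of items into the group, and all names are assumptions; the 61-bit prime is far too small for real use.

```python
import hashlib
import math
import random

P = 2**61 - 1  # a Mersenne prime, chosen only to keep the example small

def make_key(rng):
    """Pick an exponent coprime to P - 1 so that encryption is one-to-one."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def to_group(item):
    """Map an item to an integer in [1, P - 1] via SHA-256."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def encrypt(value, key):
    """Commutative encryption: E_a(E_b(x)) == E_b(E_a(x))."""
    return pow(value, key, P)

def fully_encrypt(items, key_order):
    """Encrypt every item under each key in turn."""
    result = set()
    for item in items:
        value = to_group(item)
        for key in key_order:
            value = encrypt(value, key)
        result.add(value)
    return result

def intersection_size(sets, keys):
    """Each set is encrypted in a different key order, starting with its owner."""
    encrypted = [fully_encrypt(items, keys[i:] + keys[:i])
                 for i, items in enumerate(sets)]
    return len(set.intersection(*encrypted))

rng = random.Random(2024)  # fixed seed for reproducibility; not cryptographically secure
keys = [make_key(rng) for _ in range(3)]
X = {"a", "b", "c", "d"}  # hypothetical input sets
Y = {"b", "c", "e"}
Z = {"b", "c", "f", "g"}
size = intersection_size([X, Y, Z], keys)
```

Because the order of encryption does not change the result, equal items end up as equal ciphertexts, and the parties can count matches without seeing the underlying items.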

30
Computing Size of Intersection
[Figure: parties 1, 2, and 3 hold sets X, Y, and Z. Each set is passed around the ring and encrypted by every party in turn, giving E2(E3(E1(X))), E3(E1(E2(Y))), and E1(E2(E3(Z))). The parties then intersect the fully encrypted sets (first X ∩ Y, X ∩ Z, and Y ∩ Z, then X ∩ Y ∩ Z) to obtain the size of the intersection.]

31
Why need an intermediate intersection step?
  • Probing
  • 1 party only interested in a particular item
  • Input set composed of interesting item and junk
  • Output reveals information about the presence /
    absence of item
  • Solution
  • Intermediate step, every party receives encrypted
    sets of all other parties (but not its own)
  • If intersection size is lower than a threshold,
    possibility of probing → abort protocol

32
Proof of Security
  • Proof by Simulation
  • What is known
  • The size of the intersection set
  • Site i learns
  • How it can be simulated
  • Protocol is symmetric, simulating view of one
    party is sufficient

33
Proof of Security
  • Hashing
  • Party i receives encrypted set from party i-1
  • Can use random numbers to simulate this
  • Intersection
  • Party i receives fully hashed sets of all parties

34
Simulating Fully Encrypted Sets
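Known intersection sizes: ABC = 2, AB = 3, AC = 4, BC = 2, A = 6, B = 7, C = 8
  • ABC: 2
  • AB only: 3 - 2 = 1
  • AC only: 4 - 2 = 2
  • BC only: 2 - 2 = 0
  • A only: 6 - 2 - 1 - 2 = 1
  • B only: 7 - 2 - 1 - 0 = 4
  • C only: 8 - 2 - 2 - 0 = 4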
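
The subtractions on this slide can be reproduced mechanically: starting from the largest combination, each region's exclusive size is its intersection size minus the exclusive sizes of all its supersets. A simulator can then fill each region with random values. This is a sketch of that bookkeeping only; the function names and the use of 64-bit random values are assumptions.

```python
import secrets

def exclusive_region_sizes(intersection_sizes):
    """Map each attribute combination to the number of elements in exactly that region."""
    exclusive = {}
    # Largest combinations first, so every superset is already resolved.
    for combo in sorted(intersection_sizes, key=len, reverse=True):
        in_supersets = sum(n for other, n in exclusive.items() if combo < other)
        exclusive[combo] = intersection_sizes[combo] - in_supersets
    return exclusive

def simulate_sets(intersection_sizes):
    """Build random sets whose intersection sizes match the given ones."""
    attributes = sorted(set().union(*intersection_sizes))
    sets = {a: set() for a in attributes}
    for combo, n in exclusive_region_sizes(intersection_sizes).items():
        for _ in range(n):
            r = secrets.randbits(64)  # stands in for a fully encrypted element
            for a in combo:
                sets[a].add(r)
    return sets

sizes = {frozenset(k): v for k, v in
         {"ABC": 2, "AB": 3, "AC": 4, "BC": 2, "A": 6, "B": 7, "C": 8}.items()}
regions = exclusive_region_sizes(sizes)
simulated = simulate_sets(sizes)
```

The simulated sets carry no information beyond the intersection sizes, which is the point of the simulation argument.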
35
[Figure: sets A, B, and C]
36
Optimized version
37
Association Rule Mining (Revisited)
  • Naïve algorithm → Simply use APRIORI. A single
    set intersection determines the frequency of a
    single candidate itemset
  • Thousands of itemsets
  • Key intuition
  • Set Intersection algorithm developed also allows
    computation of intermediate sets
  • All parties get fully encrypted sets for all
    attributes
  • Local computation allows efficient discovery of
    all association rules
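
Once every party holds the fully encrypted set for each attribute, finding frequent itemsets is ordinary local set arithmetic on opaque values. The sketch below is a plain Apriori-style level-wise search under that assumption; the string tokens stand in for encrypted TIDs, and the function name and data are made up.

```python
from itertools import combinations

def frequent_itemsets(encrypted_tid_lists, min_support):
    """encrypted_tid_lists maps attribute -> set of opaque (encrypted) TIDs."""
    frequent = {}
    current = {frozenset([a]): tids
               for a, tids in encrypted_tid_lists.items()
               if len(tids) >= min_support}
    while current:
        frequent.update({itemset: len(tids) for itemset, tids in current.items()})
        next_level = {}
        for (s1, t1), (s2, t2) in combinations(current.items(), 2):
            candidate = s1 | s2
            if len(candidate) != len(s1) + 1 or candidate in next_level:
                continue
            # Apriori pruning: every subset one item smaller must be frequent.
            if any(candidate - {a} not in current for a in candidate):
                continue
            tids = t1 & t2  # local intersection, no further communication
            if len(tids) >= min_support:
                next_level[candidate] = tids
        current = next_level
    return frequent

encrypted = {  # hypothetical opaque tokens
    "A": {"t1", "t2", "t3", "t5"},
    "B": {"t1", "t2", "t4", "t5"},
    "C": {"t1", "t3", "t4", "t5"},
}
result = frequent_itemsets(encrypted, min_support=3)
```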

38
Communication Cost
  • k parties, m set size, p frequent attributes
  • k(2k-2) = O(k^2) messages
  • p(2p-2) · m · encrypted message size = O(p^2 m) bits
  • k rounds
  • Independent of number of itemsets found
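
Taking the slide's expressions at face value, the costs can be tabulated with a couple of one-line functions. The function names and the sample parameter values (3 parties, 10 attributes, sets of 1000 elements, 1024-bit ciphertexts) are assumptions chosen only to show the arithmetic.

```python
def message_count(k):
    """Number of messages for k parties: k(2k - 2)."""
    return k * (2 * k - 2)

def total_bits(p, m, encrypted_size_bits):
    """Total traffic for p frequent attributes and set size m: p(2p - 2) * m * size."""
    return p * (2 * p - 2) * m * encrypted_size_bits

messages = message_count(3)
bits = total_bits(p=10, m=1000, encrypted_size_bits=1024)
```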

39
Other Results
  • ID3 Decision Tree learning
  • Horizontal Partitioning (Lindell & Pinkas 00)
  • Also vertical partitioning (Du, Vaidya)
  • Association Rules
  • Horizontal Partitioning (Kantarcioglu)
  • K-Means / EM Clustering
  • K-Nearest Neighbor
  • Naïve Bayes, Bayes network structure
  • And many more

40
Challenges
  • What do the results reveal?
  • A general approach (instead of per data mining
    technique)
  • Experimental results
  • Incentive Compatibility
  • Note: Upcoming book in the Advances in
    Information Security series by Springer-Verlag

41
Questions