Title: Private Matching
1. Privacy Preserving Data Mining, Lecture 2: Cryptographic Solutions
Benny Pinkas, HP Labs, Israel
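2. Secure two-party computation - definition
[Figure: one party holds input x and the other holds input y. The output of the protocol is F(x,y) and nothing else, as if a trusted party had received x and y and returned F(x,y).]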
3. Secure Function Evaluation
- A major topic of cryptographic research.
- How to let n parties P1,..,Pn compute a function F(x1,..,xn)
  - where input xi is known to party Pi.
  - Parties learn the final output and nothing else.
- Caveat: cryptographic definitions of secure computation are both too strong and too weak.
  - Too strong: they do not allow leakage of harmless information. The price of this extra security is paid in efficiency.
  - Too weak: they do not address leakage or misuse caused by the function itself (e.g., information implied by the outputs, or misbehavior in choosing an input).
4. Leak no other information
- A protocol is secure if it emulates the ideal solution.
  - Alice learns F(x,y), and therefore can compute everything that is implied by x, her prior knowledge of y, and F(x,y).
  - Alice must not be able to compute anything else.
- Simulation
  - A protocol is considered secure if
    - for every adversary in the real world
    - there exists a simulator in the ideal world which outputs an indistinguishable transcript, given access to the information that the adversary is allowed to learn in the ideal model.
5. Secure Function Evaluation
- Major result [Yao]: any function that can be evaluated using polynomial resources can be securely evaluated using polynomial resources (under some cryptographic assumption).
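6. SFE building block: 1-out-of-2 Oblivious Transfer
[Figure: Bob (sender) holds two strings Y0, Y1. Alice (receiver) holds a bit j ∈ {0,1}. Alice learns Yj and nothing about the other string; Bob learns nothing about j.]
- 1-out-of-2 OT can be based on most public key systems.
- There are implementations with two communication rounds.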
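As an illustration of how OT can be built from a public key system, here is a minimal sketch of an Even-Goldreich-Lempel style 1-out-of-2 OT over textbook RSA. This construction is not taken from the slides; the toy primes, the function name `ot_1_of_2`, and running both parties inside one function are assumptions made for the example. It is only meaningful against semi-honest parties and the parameters are far too small for real use.

```python
import secrets
from math import gcd

# Toy RSA key for Bob, built from two known Mersenne primes (illustration only).
P, Q = 2**61 - 1, 2**31 - 1
N = P * Q
PHI = (P - 1) * (Q - 1)
E = 65537
assert gcd(E, PHI) == 1
D = pow(E, -1, PHI)  # Bob's private exponent (needs Python 3.8+)


def ot_1_of_2(y0: int, y1: int, j: int) -> int:
    """Simulate both parties; return the value Alice ends up with (Y_j)."""
    assert 0 <= y0 < N and 0 <= y1 < N and j in (0, 1)

    # Bob -> Alice: two random values x0, x1.
    x = [secrets.randbelow(N), secrets.randbelow(N)]

    # Alice -> Bob: x_j blinded with a random k encrypted under Bob's public key.
    k = secrets.randbelow(N)
    v = (x[j] + pow(k, E, N)) % N          # looks random to Bob, so it hides j

    # Bob -> Alice: Bob unblinds both ways; only one of k0, k1 equals k.
    k0 = pow((v - x[0]) % N, D, N)
    k1 = pow((v - x[1]) % N, D, N)
    c = [(y0 + k0) % N, (y1 + k1) % N]

    # Alice can remove the mask only from the entry she chose.
    return (c[j] - k) % N
```

For example, `ot_1_of_2(111, 222, 1)` returns 222, while the mask on the other entry is a value Alice cannot compute without Bob's private exponent.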
7. General two-party computation
- Two-party protocol
- Input
  - Sender: function F (in some representation)
    - The sender's input Y is already embedded in F
  - Receiver: x ∈ {0,1}^n
- Output
  - Receiver: F(x) and nothing else about F
  - Sender: nothing about x
8. Representations of F
- Boolean circuits [Yao, GMW, ...]
- Algebraic circuits [BGW, ...]
- Low-degree polynomials [BFKR]
- Matrix product over a large field [FKN, IK]
- Randomizing polynomials [IK]
- Communication complexity protocol [NN]
9. Secure two-party computation of general functions [Yao]
- First, represent the function F as a Boolean circuit C.
  - It's always possible.
  - Sometimes it's easy (additions, comparisons).
  - Sometimes the result is inefficient (e.g. for indirect addressing, e.g. A[x]).
- Then, garble the circuit.
- Finally, evaluate the garbled circuit.
10. Garbling the circuit
- Bob constructs the circuit, and then garbles it.
- Each wire k gets two strings (the W values), which will serve as cryptographic keys: Wk0 ≡ 0 on wire k, Wk1 ≡ 1 on wire k.
- (Alice will learn one string per wire, but not which bit it corresponds to.)
11. Gate tables
- For every gate, every combination of input values is used as a key for encrypting the corresponding output.
- Assume G = AND, with input wires i, j and output wire k. Bob constructs a table:
  - Encryption of wk0 using keys wi0, wj0 (AND(0,0) = 0)
  - Encryption of wk0 using keys wi0, wj1 (AND(0,1) = 0)
  - Encryption of wk0 using keys wi1, wj0 (AND(1,0) = 0)
  - Encryption of wk1 using keys wi1, wj1 (AND(1,1) = 1)
- Result: given wix, wjy, one can compute wkG(x,y), the string for bit G(x,y) on wire k.
12. Secure computation
- Bob sends the table of gate G to Alice.
- Given, e.g., wi0, wj1, Alice computes wk0 by decrypting the corresponding entry in the table, but she does not know the actual values of the wires.
- The table entries are sent in permuted order, e.g.:
  - Encryption of wk0 using keys wi0, wj0
  - Encryption of wk0 using keys wi0, wj1
  - Encryption of wk1 using keys wi1, wj1
  - Encryption of wk0 using keys wi1, wj0
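The gate table can be made concrete with a small sketch of garbling and evaluating a single gate. The slides do not say which encryption is used; here the "encryption" is an XOR with a SHA-256 derived pad, and a block of zero bytes is appended so that Alice can recognise the entry that decrypts correctly. Both choices, and all function names, are assumptions of this example.

```python
import hashlib
import secrets

KEY_LEN = 16            # bytes per wire string
TAG = b"\x00" * 8       # zero block that marks a correct decryption


def _pad(key_i: bytes, key_j: bytes) -> bytes:
    # One-time pad derived from the two input-wire strings.
    return hashlib.sha256(key_i + key_j).digest()[:KEY_LEN + len(TAG)]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def new_wire():
    # (string standing for bit 0, string standing for bit 1)
    return secrets.token_bytes(KEY_LEN), secrets.token_bytes(KEY_LEN)


def garble_gate(gate, wi, wj, wk):
    """Bob: encrypt wk[gate(x, y)] under (wi[x], wj[y]) for all x, y, then permute."""
    table = [_xor(_pad(wi[x], wj[y]), wk[gate(x, y)] + TAG)
             for x in (0, 1) for y in (0, 1)]
    secrets.SystemRandom().shuffle(table)
    return table


def eval_gate(table, key_i, key_j):
    """Alice: try every entry; a wrong key pair yields the TAG only with
    negligible probability, so the first match is taken as the output string."""
    pad = _pad(key_i, key_j)
    for entry in table:
        plain = _xor(pad, entry)
        if plain.endswith(TAG):
            return plain[:KEY_LEN]
    raise ValueError("no entry decrypted correctly")
```

With `wi, wj, wk = new_wire(), new_wire(), new_wire()` and `table = garble_gate(lambda x, y: x & y, wi, wj, wk)`, calling `eval_gate(table, wi[0], wj[1])` returns `wk[0]`. Alice obtains a valid output string without learning which bit it stands for.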
13. Secure computation
- Bob sends to Alice
  - Tables encoding each circuit gate.
  - Garbled values (w's) of his input values.
  - Translation from garbled values of output wires to actual 0/1 values.
- If Alice gets garbled values (w's) of her input values, she can compute the output of the circuit, and nothing else.
14. Alice's input
- For every wire i of Alice's input
  - The parties run an OT protocol.
  - Alice's input is her input bit (s).
  - Bob's input is wi0, wi1.
  - Alice learns wis.
- The OTs for all input wires can be run in parallel.
- Afterwards Alice can compute the circuit by herself.
15. Secure computation: the big picture
- Represent the function as a circuit C.
- Bob sends to Alice 4|C| encryptions (e.g. 64|C| bytes), 4 encryptions for every gate.
- Alice performs an OT for every input bit. (Can do, e.g., 100-1000 OTs per sec.)
- One round of communication.
- Efficient for medium size circuits!
16. Example
- The Millionaires' problem: comparing two N-bit numbers.
- What's the overhead?
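One way to get a feel for the overhead is to count the gates of a simple comparison circuit and apply the figures quoted earlier (4 encryptions per gate, 64 bytes per gate, one OT per bit of Alice's input). The bit-serial comparator below is an assumed circuit chosen for illustration, not necessarily the one used in the lecture, and it is evaluated in the clear.

```python
def greater_than_circuit(n_bits):
    """Return (evaluate, gate_count) for the predicate a > b on n_bits-bit integers.

    Every step below is a single 2-input gate, so each one would cost one
    4-entry table in the garbled circuit.
    """
    gate_count = 1 + 4 * (n_bits - 1)

    def evaluate(a, b):
        gates_used = 0
        gt = 0
        for i in range(n_bits):             # least significant bit first
            x, y = (a >> i) & 1, (b >> i) & 1
            here = x & (1 - y)              # gate: x AND (NOT y)
            gates_used += 1
            if i == 0:
                gt = here
            else:
                same = 1 - (x ^ y)          # gate: XNOR
                keep = same & gt            # gate: AND
                gt = here | keep            # gate: OR
                gates_used += 3
        assert gates_used == gate_count
        return gt

    return evaluate, gate_count


def yao_overhead(n_bits, bytes_per_entry=16):
    """Table size and OT count for the comparator, using 4 entries per gate."""
    _, gates = greater_than_circuit(n_bits)
    return {
        "gates": gates,
        "table_bytes": 4 * bytes_per_entry * gates,
        "oblivious_transfers": n_bits,      # one per bit of Alice's input
    }
```

Under these assumptions the cost grows linearly in N: for 20-bit inputs, `yao_overhead(20)` reports 77 gates, 4928 bytes of tables and 20 OTs.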
17. Applications
- Two parties. Two large data sets.
- Max?
- Mean?
- Median?
- Intersection?
- Decision Tree learning? ID3?
18. Fairplay: a secure two-party computation system [Malkhi, Nisan, P., Sella]
- A full-fledged secure two-party computation system, implementing Yao's garbled circuit protocol.
- Goals
  - Investigate whether two-party SFE is practical
  - Actual measurements of overall computation
  - Breakdown of computation into parts
  - Computation versus communication?
  - Test-bed for various optimizations
19. Fairplay
- The compilation paradigm
  - Programs are written in SFDL, a high-level programming language.
    - Allows clear, formal, easily understandable definition and requirements by humans.
  - SHDL: a low-level language describing Boolean circuits.
  - SFDL → SHDL compiler and optimizer
  - SHDL → Java programs implementing Yao's protocol
20. Fairplay SFDL example
program Millionaires {
  type int = Int<20>; // 20-bit integer
  type AliceInput = int;
  type BobInput = int;
  type AliceOutput = Boolean;
  type BobOutput = Boolean;
  type Output = struct {AliceOutput alice, BobOutput bob};
  type Input = struct {AliceInput alice, BobInput bob};
  function Output output(Input input) {
    output.alice = (input.alice > input.bob);
    output.bob = (input.bob > input.alice);
  }
}
21. SFDL properties
- Conventional syntax (C/Pascal-like)
- Type system: Boolean, integer, enumerated
- Program structure
  - Declarations: global constants, types
  - Sequence of functions (no nesting, as in C; no recursion)
  - Function name is its return value (as in Pascal)
- Conditional execution and loops
  - if-then, if-then-else statements, for-loop (loop boundaries should be known at compile time)
- Assignments and expressions
  - constants, variables, array entries, structure items, function calls, operators (+, -, logical, comparison), parentheses
22. SHDL example
0 input //output$input.bob$0
1 input //output$input.bob$1
2 input //output$input.bob$2
3 input //output$input.bob$3
4 input //output$input.alice$0
5 input //output$input.alice$1
6 input //output$input.alice$2
7 input //output$input.alice$3
8 gate arity 2 table [1 0 0 0] inputs [4 5]
9 gate arity 2 table [0 1 1 0] inputs [4 5]
23. kth-ranked element (e.g. median)
- Inputs
  - Alice: SA; Bob: SB
  - Large sets of unique items (∈ D).
- Output
  - x ∈ SA ∪ SB s.t. x has k-1 elements smaller than it.
- The rank k
  - Could depend on the size of the input datasets.
  - Median: k = (|SA| + |SB|) / 2
- Motivation
  - Basic statistical analysis of distributed data.
  - E.g. histogram of salaries in CS departments
- The problem: generic constructions using circuits [Yao] yield an overhead which is at least linear in k.
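24. An (insecure) two-party median protocol
[Figure: SA is split at its median mA into a lower part LA and an upper part RA; SB is split at its median mB into LB and RB. Case shown: mA < mB.]
- LA lies below the median, RB lies above the median.
- New median is same as original median.
- Recursion → need log n rounds (assume each set contains n = 2^i items).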
25. A secure two-party median protocol
- A finds its median mA; B finds its median mB.
- Secure comparison (e.g. a small circuit): is mA < mB?
  - YES: A deletes elements ≤ mA. B deletes elements > mB.
  - NO: A deletes elements > mA. B deletes elements ≤ mB.
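The halving step can be sketched in a few lines of Python. The sketch runs both parties in one process and replaces the secure comparison with a plain `<`, so it shows only the logic and gives no privacy. It assumes both sets have the same power-of-two size, that all items are unique, that "median" means the element of rank n among the 2n items, and that the last two remaining items are resolved by one more comparison. The function names are illustrative.

```python
def lower_median(sorted_items):
    # Element of rank len/2 (1-based) in an already sorted list.
    return sorted_items[len(sorted_items) // 2 - 1]


def secure_less_than(m_a, m_b):
    # Placeholder for the secure comparison (e.g. a small garbled circuit).
    return m_a < m_b


def two_party_median(s_a, s_b):
    """Return (median of the union, list of comparison outcomes)."""
    a, b = sorted(s_a), sorted(s_b)
    n = len(a)
    assert n == len(b) and n > 0 and n & (n - 1) == 0, "need equal power-of-2 sizes"
    assert len(set(a) | set(b)) == 2 * n, "items must be unique"

    outcomes = []
    while len(a) > 1:
        half = len(a) // 2
        a_smaller = secure_less_than(lower_median(a), lower_median(b))
        outcomes.append(a_smaller)
        if a_smaller:
            a, b = a[half:], b[:half]   # A drops elements <= mA, B drops elements > mB
        else:
            a, b = a[:half], b[half:]   # A drops elements > mA, B drops elements <= mB

    # One item is left on each side; the smaller one has the median's rank.
    return min(a[0], b[0]), outcomes
```

Each round removes the same number of items from below and from above the median, so its rank among the remaining items stays in the middle, and log n comparison outcomes are all that the parties exchange before the final step.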
26. An example
[Figure: example run of the protocol on sets A and B, showing a sequence of comparison outcomes (mA > mB or mA < mB) until the median is found.]
27. Proof of security
[Figure: sets A and B with the median marked, annotated with the comparison outcomes (mA > mB or mA < mB) of the protocol rounds.]
28. Arbitrary input size, arbitrary k
[Figure: SA and SB are each reduced to a set of size k.]
- Now, compute the median of two sets of size k.
- Size should be a power of 2.
- Median of new inputs = kth element of original inputs.
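A possible way to carry out this reduction is sketched below. The slide does not spell out the padding rule; the sketch assumes that each party keeps its k smallest items, tops up with +∞ values to size k, and that A then pads with -∞ and B with +∞ up to the next power of two. Sentinels are encoded as tuples so that all items stay unique. The median step is run in the clear and all names are illustrative.

```python
NEG_INF, REAL, POS_INF = -1, 0, 1   # first tuple component orders the sentinels


def _median_of_equal_sets(a, b):
    # Halving loop of the median protocol; a and b are sorted lists of the
    # same power-of-2 length. The plain "<" stands in for the secure comparison.
    while len(a) > 1:
        half = len(a) // 2
        if a[half - 1] < b[half - 1]:
            a, b = a[half:], b[:half]
        else:
            a, b = a[:half], b[half:]
    return min(a[0], b[0])


def kth_ranked(s_a, s_b, k):
    """Return the element of rank k (1-based) in the union of s_a and s_b."""
    assert 1 <= k <= len(s_a) + len(s_b)
    size = 1
    while size < k:
        size *= 2                           # smallest power of 2 that is >= k

    def prepare(items, party, fill):
        kept = [(REAL, x, party) for x in sorted(items)[:k]]          # k smallest
        kept += [(POS_INF, i, party) for i in range(k - len(kept))]   # top up to k
        kept += [(fill, i + k, party) for i in range(size - k)]       # pad to 2^j
        return sorted(kept)

    a = prepare(s_a, "A", NEG_INF)
    b = prepare(s_b, "B", POS_INF)
    return _median_of_equal_sets(a, b)[1]
```

The padded sets contain `size - k` copies of -∞ in total, so the element of rank `size` among the `2 * size` padded items is exactly the kth element of the original inputs.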
29. Hiding size of inputs
- Can search for the kth element without revealing the size of the input sets.
- However, k = n/2 (median) reveals the input size.
- Solution: let S = 2^i be a bound on the input size.
- The median of the new datasets is the same as the median of the original datasets.
[Figure: SA and SB]
30. Privacy preserving data mining
[Figure: party P1 holds confidential database D1; party P2 holds confidential database D2.]
- Wish to mine D1 ∪ D2 without revealing more info.
- Examples
  - Medical databases protected by law
  - Competing businesses
  - Government agencies (privacy, need to know)
31. The classification problem
Goal: based on available data, design an algorithm to classify new data.
32. Classification using Decision Trees
33. Privacy Preserving ID3
- Scenario: the inputs are private information of P1 and P2.
- Main technical problem: comparing entropies while preserving privacy (entropy ~ Σ x log x).
- Efficiency
  - Most computation is done independently by the parties.
  - The overhead of cryptographic operations depends only on the size of the decision tree (not on the input size).
- Basic task: compute x log x.
  - x = x1 + x2, e.g., the total number of customers with (age > 30) and (fraud = yes).
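To see why x log x is the basic task, the following sketch checks numerically that the conditional entropy of the class given an attribute can be written using only x ln x terms, where every x is a count of the form x1 + x2. The setting (each party holds some of the transactions, so counts add up) and the example counts are assumptions of this sketch; everything is computed in the clear.

```python
from math import isclose, log


def xlnx(x):
    return x * log(x) if x > 0 else 0.0


def conditional_entropy_direct(counts):
    """counts[j][i] = number of transactions with attribute value j and class i."""
    total = sum(map(sum, counts))
    h = 0.0
    for row in counts:
        t_j = sum(row)
        for t_ji in row:
            if t_ji:
                h -= (t_j / total) * (t_ji / t_j) * log(t_ji / t_j)
    return h


def conditional_entropy_from_xlnx(counts_p1, counts_p2):
    """Same quantity, using only x ln x terms with x = x1 + x2."""
    total = 0
    acc = 0.0
    for row1, row2 in zip(counts_p1, counts_p2):
        row = [x1 + x2 for x1, x2 in zip(row1, row2)]
        total += sum(row)
        acc += xlnx(sum(row)) - sum(xlnx(x) for x in row)
    return acc / total
```

Since the total number of transactions is the same for every candidate attribute, comparing attributes only requires comparing sums of x ln x terms.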
34. Privacy Preserving ID3
- Computing x log x
  - x = x1 + x2, where x1 and x2 are known to P1 and P2 respectively (independently computed from their databases).
  - Might as well compute x ln x, or ln x.
  - First run a protocol to compute random shares y1 + y2 = ln x.
  - ln x is real. Crypto works over finite fields. Must do numerical analysis.
35. Cryptographic Tools
- Secure Function Evaluation (SFE) [Yao]
- Oblivious Polynomial Evaluation [NP]
  - Input: the sender holds a polynomial Q(); the receiver holds x.
  - Output: the receiver learns Q(x) and nothing else; the sender learns nothing.
  - Implementation: two passes, O(degree) (or O(log |F|)) exponentiations.
36. Computing random shares of ln x = ln(x1 + x2)
- Use Taylor approximation for ln x
  - $x = x_1 + x_2 = 2^n (1+\varepsilon)$, with $-\tfrac{1}{2} < \varepsilon < \tfrac{1}{2}$
  - $\ln x = \ln(2^n (1+\varepsilon)) = \ln 2^n + \ln(1+\varepsilon)$
  - $\approx \ln 2^n + \sum_{i=1}^{k} (-1)^{i-1} \varepsilon^i / i = \ln 2^n + T(\varepsilon)$
  - $T(\varepsilon)$ is a polynomial of degree k. The error is exponentially small in k.
- We only know how to work over finite fields
  - Compute $c \cdot \ln x$, where c compensates for fractions.
  - Work in a field F, where |F| is sufficiently large.
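The approximation can be checked numerically with ordinary floating point, before any finite-field encoding. The sketch below assumes x is a positive integer and that "closest" power of two means closest in absolute difference; the function names are illustrative.

```python
from math import log


def split_power_of_two(x):
    """Return (n, eps) with x = 2**n * (1 + eps), 2**n being the power of two closest to x."""
    n = x.bit_length() - 1                 # 2**n <= x < 2**(n+1)
    if x - 2**n > 2**(n + 1) - x:
        n += 1
    return n, (x - 2**n) / 2**n


def taylor_ln1p(eps, k):
    # T(eps) = sum over i = 1..k of (-1)**(i-1) * eps**i / i
    return sum((-1) ** (i - 1) * eps**i / i for i in range(1, k + 1))


def approx_ln(x, k):
    n, eps = split_power_of_two(x)
    return n * log(2) + taylor_ln1p(eps, k)
```

Because |ε| is at most 1/2, each additional term shrinks the error by at least a factor of two, which is the sense in which the error is exponentially small in k.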
37. ln(x1 + x2) Protocol
- Step 1 of the protocol: find n, ε
  - Apply Yao's protocol to the following small circuit
  - Input: x1 and x2
  - Output (random shares):
    - random a1 and a2 s.t. $a_1 + a_2 = x - 2^n = \varepsilon 2^n$
    - random b1 and b2 s.t. $b_1 + b_2 = \ln 2^n$
  - Operation: the protocol finds the $2^n$ closest to $x_1 + x_2$ and computes $\varepsilon 2^n = x_1 + x_2 - 2^n$.
- $x = x_1 + x_2 = 2^n + \varepsilon 2^n$
- $\ln x = \ln(2^n (1+\varepsilon)) = \ln 2^n + \ln(1+\varepsilon)$
38. ln(x1 + x2) Protocol (Cont.)
- Step 2 of the protocol
  - Compute random shares of $T(\varepsilon)$ (Taylor approx.)
  - P1 chooses a random $w_1 \in F$ and defines a polynomial Q(x), s.t. $w_1 + Q(a_2) = T(\varepsilon)$ (recall $a_1 + a_2 = \varepsilon 2^n$).
  - Namely, $Q(x) = T((a_1 + x)/2^n) - w_1$.
  - Run an oblivious poly evaluation in which P2 computes $w_2 = Q(a_2) = T(\varepsilon) - w_1$.
- Now the parties have random w1 and w2 s.t.
  - $w_1 + w_2 = T(\varepsilon) \approx \ln(1+\varepsilon)$
  - $(b_1 + w_1) + (b_2 + w_2) \approx \ln 2^n + \ln(1+\varepsilon) = \ln x$
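The two steps can be simulated over a prime field to see how the shares of the Taylor polynomial add up. This is a sketch under several assumptions that are not in the slides: the field is the integers modulo the Mersenne prime 2^127 - 1, the scaling constant is c = lcm(1..k) · 2^(nk), step 1 is computed in the clear instead of inside a Yao circuit, n is treated as known when building Q, the OPE is replaced by a direct evaluation of Q, and the ln 2^n term is added in the clear instead of being shared as b1, b2.

```python
import secrets
from math import gcd, log

P = 2**127 - 1   # prime field modulus; must exceed 2 * |c * T(eps)|


def lcm_upto(k):
    out = 1
    for i in range(2, k + 1):
        out = out * i // gcd(out, i)
    return out


def share(value):
    """Split an integer into two additive shares mod P."""
    s1 = secrets.randbelow(P)
    return s1, (value - s1) % P


def taylor_shares(x1, x2, k):
    """Return (n, c, w1, w2) with w1 + w2 = c * T(eps) (mod P)."""
    x = x1 + x2
    # Step 1: find the power of two closest to x and share eps * 2**n = x - 2**n.
    n = x.bit_length() - 1
    if x - 2**n > 2**(n + 1) - x:
        n += 1
    a1, a2 = share(x - 2**n)

    # Step 2: P1 picks w1 and defines Q(z) = c * T((a1 + z) / 2**n) - w1.
    lcm = lcm_upto(k)
    c = lcm * 2 ** (n * k)                  # clears every denominator in T
    w1 = secrets.randbelow(P)

    def Q(z):
        u = (a1 + z) % P
        total = 0
        for i in range(1, k + 1):
            coeff = (lcm // i) * 2 ** (n * (k - i))
            total += (-1) ** (i - 1) * coeff * pow(u, i, P)
        return (total - w1) % P

    w2 = Q(a2)                              # stands in for the OPE; P2 learns only w2
    return n, c, w1, w2


def reconstruct_ln(n, c, w1, w2):
    v = (w1 + w2) % P
    if v > P // 2:
        v -= P                              # map back to a signed integer
    return n * log(2) + v / c
```

For instance, with x1 = 700, x2 = 500 and k = 8, `reconstruct_ln(*taylor_shares(700, 500, 8))` agrees with ln 1200 to within the Taylor error, while w1 and w2 are each uniformly distributed on their own.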
39. Computing x ln x
- Tool: Multiply(c1, c2)
  - Input: c1 (held by P1), c2 (held by P2)
  - Output: d1, d2 s.t. $d_1 + d_2 = c_1 \cdot c_2$
  - How? OPE of $Q(z) = c_1 z - d_1$
- Actual task: x ln x
  - Input: $x_1 + x_2 = x$, $c_1 + c_2 = \ln x$
  - Output: shares of $x \ln x = (x_1 + x_2)(c_1 + c_2)$
  - Run Multiply(x1, c2), Multiply(c1, x2)
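A minimal simulation of the Multiply tool and of the share computation follows. It assumes arithmetic modulo a prime (here 2^61 - 1, chosen only for the example), treats the shares of ln x as already scaled to field elements, and replaces the OPE by a direct evaluation of Q, so it shows only how the shares add up.

```python
import secrets

P = 2**61 - 1   # prime modulus of the field (size chosen only for this sketch)


def multiply(c1, c2):
    """Return (d1, d2) with d1 + d2 = c1 * c2 (mod P)."""
    d1 = secrets.randbelow(P)               # chosen at random by P1

    def Q(z):
        return (c1 * z - d1) % P            # P1's degree-1 polynomial

    d2 = Q(c2)                              # stands in for the OPE; P2 learns only Q(c2)
    return d1, d2


def xlnx_shares(x1, c1, x2, c2):
    """P1 holds (x1, c1), P2 holds (x2, c2); returns shares of (x1 + x2) * (c1 + c2)."""
    d1, d2 = multiply(x1, c2)               # shares of x1 * c2
    e1, e2 = multiply(c1, x2)               # shares of c1 * x2
    s1 = (x1 * c1 + d1 + e1) % P            # computed locally by P1
    s2 = (x2 * c2 + d2 + e2) % P            # computed locally by P2
    return s1, s2
```

The products x1·c1 and x2·c2 need no interaction, which is why only the two cross terms are passed through Multiply.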
40. The rest of the work..
- The parties compute shares of ln x.
- Then they compute shares of x ln x.
- Each party computes a share of the entropy by summing shares of x ln x (H(X) ~ Σ x ln x).
- A small circuit finds the attribute giving the minimal conditional entropy.
- The attribute is assigned to the node.
- The databases are divided according to the value of this attribute.
41. Efficiency
- ln x protocol
  - secure computation of a small circuit
  - one oblivious polynomial evaluation
- ID3 for a database with
  - 1,000,000 transactions
  - 15 attributes
  - 10 values per attribute
  - 4 class values
- Communication per node takes seconds (T1)
- Computation per node takes minutes (P3)
42. Contributions
- Cryptographic protocols where the bulk of the operations is done independently.
- Data mining
  - Rigorous model for secure data mining.
  - Efficient, secure protocols for specific problems (median, ID3).
- Cryptography
  - Sub-linear complexity: secure computation for large data sets.
  - Efficient protocols for complex known algorithms.
  - Secure computation of logarithms (real function, numerical analysis).
- Drawbacks
  - Privacy preserving solutions are less efficient.
  - It's hard to find efficient private solutions for all interesting functions.
  - Security against malicious parties
43. References
- Lecture notes and overview papers
  - B. Pinkas, Cryptographic Techniques for Privacy-Preserving Data Mining, SIGKDD Explorations, January 2003. http://www.pinkas.net/PAPERS/sigkdd.pdf
  - R. Cramer, Introduction to Secure Computation, 2000. http://homepages.cwi.nl/cramer/papers/CRAMER_revised.ps
  - Ivan Damgård, Theory and practice of multiparty computation, 8th EWSCS. http://www.cs.ioc.ee/yik/schools/win2003/damgard.php
- Research papers
  - G. Aggarwal, N. Mishra and B. Pinkas, Secure Computation of the K'th-ranked Element, Eurocrypt 2004. http://www.pinkas.net/PAPERS/ANP04.pdf
  - Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, Journal of Cryptology, Vol. 15 No. 3, 2002. http://www.pinkas.net/PAPERS/id3-final.pdf