Title: Information Sharing across Private Databases
1Information Sharing across Private Databases
- Rakesh Agrawal
- Alexandre Evfimievski
- Ramakrishnan Srikant
- IBM Almaden Research Center
2Today's Information Sharing Systems
[Figure: two architectures, Centralized and Federated; labels: Mediator, query Q, result R]
- Assumption: Information in each database can be freely shared.
3Selective Document Sharing
- R is shopping for technology.
- S has intellectual property it may want to license.
- First find the specific technologies where there is a match, and then reveal further information about those.
[Figure: R's Shopping List matched against S's Technology List]
Example 2: Govt. agencies sharing information on a need-to-know basis.
4Medical Research
- Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.
- Researchers should not learn anything beyond 4 counts.
[Figure labels: DNA Sequences, Mayo Clinic, Drug Reactions]
5Minimal Necessary Information Sharing
- Compute queries across databases so that no more information than necessary is revealed.
- Need is driven by several trends:
- End-to-end integration of information systems across companies.
- Simultaneously compete and cooperate.
- Security: need-to-know information sharing
- Privacy: legislation, stated privacy policies
6Talk Outline
- Motivation
- Problem Definition
- Protocols
- Cost Analysis
- Conclusions
7Current Techniques
- Trusted Third Party
- Has to be completely trusted, both with respect to intent and competence against security breaches.
- Secure Multi-Party Computation
- Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
- Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].
- Cost makes them impractical for database-size problems.
8Our Security Model
- No third party.
- Main parties directly execute a protocol, which
is designed to guarantee that they do not learn
any more than they would have learnt had they
given the data to a trusted third party and got
back the answer.
- Honest-but-curious behavior: Parties follow the protocol properly, except that they can record all computation and received messages, and analyze them to learn additional information.
9Problem Statement (Ideal)
- Given
- Two parties R (receiver) and S (sender)
- Databases DR and DS
- Query Q spanning the tables in DR and DS
- Compute the answer to Q and return it to R
without revealing any additional information to
either party.
Anything R can learn from the answer to the query is fair game! Example: If Q = VR ∩ VS, then for all v ∈ (VR - VS), R knows v ∉ VS.
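The "fair game" example can be made concrete with a small sketch. The sets below are made-up sample data, and the answer is formed in the clear only to show what R may legitimately deduce from it.

```python
def inferred_non_members(v_r: set, answer: set) -> set:
    """Values of R's own set that R can conclude are absent from S's set."""
    return v_r - answer

v_r = {"a", "b", "c"}
v_s = {"b", "c", "d"}      # hidden from R; used here only to form the answer
answer = v_r & v_s          # what an ideal protocol returns to R

print(sorted(inferred_non_members(v_r, answer)))  # ['a']
```

R learns nothing about "d", because "d" is not in VR and therefore never shows up in the answer.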
10Problem Statement (Minimal Sharing)
- Given
- Two parties R (receiver) and S (sender)
- Databases DR and DS
- Query Q spanning the tables in DR and DS
- Additional (pre-specified) categories of information I
- Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I.
11Protocols
- Protocols for four key operations:
- Intersection, Equijoin, Intersection Size, and Equijoin Size
- Notation:
- TR, TS: tables in DR and DS respectively.
- VR, VS: sets of distinct values in TR and TS respectively.
- Additional Information I:
- For intersection, intersection size, and equijoin,
- I = |VS|, |VR|
- For equijoin size, I also includes the distribution of duplicates and some subset of information in VS ∩ VR.
12Related Work
- [NP99]: Protocols for list intersection problem
- Oblivious evaluation of n polynomials of degree n each.
- Oblivious evaluation of n^2 polynomials.
- [HFH99]: find people with common preferences, without revealing the preferences.
- Intersection protocols are similar to ours, but do not provide proofs of security.
- Private Information Retrieval
- Privacy Preserving Data Mining
13Talk Outline
- Motivation
- Problem Definition
- Protocols
- Intersection
- Intersection Size and Equijoin Size
- Joins
- Proof Methodology
- Cost Analysis
- Conclusions
14A Simple, but Incorrect, Intersection Protocol
- R and S agree to use encryption function fe (with key e).
- Shorthand: fe(VS) = { fe(x) | x ∈ VS }
- R holds VR; S holds VS and sends fe(VS) to R.
- R computes VR ∩ VS = { v ∈ VR | fe(v) ∈ fe(VS) }
- Problem: For any element x, R can check whether fe(x) is in fe(VS).
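The flaw can be seen in a few lines. This is a minimal sketch that assumes a keyed SHA-256 as a stand-in for the shared function fe; the key, the sample sets, and the guess list are all hypothetical.

```python
import hashlib

def f(key: bytes, x: str) -> str:
    """Stand-in for the shared function fe (assumption: keyed SHA-256)."""
    return hashlib.sha256(key + x.encode()).hexdigest()

shared_key = b"agreed-by-R-and-S"          # both parties know e
v_s = {"laser", "battery", "sensor"}       # S's private set
v_r = {"battery", "antenna"}               # R's private set

# S sends fe(VS) to R.
enc_v_s = {f(shared_key, x) for x in v_s}

# Intended use: R intersects using only its own values.
intersection = {v for v in v_r if f(shared_key, v) in enc_v_s}

# The flaw: R knows e, so it can test any guess, not only values in VR.
guesses = ["laser", "engine", "sensor"]
leaked = [g for g in guesses if f(shared_key, g) in enc_v_s]

print(intersection)  # {'battery'}
print(leaked)        # ['laser', 'sensor']
```

Because R can evaluate fe on its own, the encrypted set works as a membership oracle for all of VS.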
15Intersection Protocol Intuition
- Still want to encrypt the values in VR and VS and compare the encrypted values.
- However, want an encryption function such that it can only be jointly computed by R and S, not separately.
16Commutative Encryption
- Pair of encryption functions f and g such that
- f(g(v)) = g(f(v))
- Assuming the Decisional Diffie-Hellman (DDH) hypothesis,
- fe(x) = x^e mod p
- where
- p is a safe prime number, i.e., both p and q = (p-1)/2 are primes,
- Dom f is the set of all quadratic residues modulo p, and
- encryption key e ∈ {1, 2, ..., q-1},
- is a commutative encryption.
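The definition can be checked numerically. The sketch below assumes a toy safe prime (1,000,000,007, with q = 500,000,003) purely for illustration; the slides' cost figures assume 1024-bit values, and the helper names are not from the source.

```python
import secrets

# Toy parameters (assumption): P is a small safe prime, Q = (P - 1) / 2 is also prime.
P = 1_000_000_007
Q = (P - 1) // 2

def keygen() -> int:
    """Pick an encryption key e from {1, ..., Q - 1}."""
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    """fe(x) = x^e mod P."""
    return pow(x, e, P)

def random_quadratic_residue() -> int:
    """Squaring any nonzero element gives a quadratic residue modulo P."""
    r = secrets.randbelow(P - 1) + 1
    return pow(r, 2, P)

x = random_quadratic_residue()
e, d = keygen(), keygen()
assert f(e, f(d, x)) == f(d, f(e, x))   # the two encryptions commute
assert pow(f(e, x), Q, P) == 1          # result stays in Dom f (Euler's criterion)
```

Restricting the domain to quadratic residues keeps every value inside a group of prime order q, so exponentiation by any key in {1, ..., q-1} is a bijection on that domain.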
17Commutative Encryption (2)
- The powers commute:
- (x^d mod p)^e mod p = x^(de) mod p = (x^e mod p)^d mod p
- DDH hypothesis: The distribution of <g^a, g^b, g^(ab)> is computationally indistinguishable from the distribution of <g^a, g^b, g^c>, where a, b, c ∈r Dom f (∈r denotes a uniformly random choice).
- Implication: <x, x^e, y, y^e> is also indistinguishable from
- <x, x^e, y, z>, where x, y, z ∈r Dom f.
- Note: DDH does not hold if the adversary can select a, b, c.
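One way to read the implication, sketched under the assumption that x ≠ 1 (so that x generates the prime-order group of quadratic residues) and with b and c standing for random exponents:

```latex
\begin{aligned}
&\text{Let } g = x,\quad a = e,\quad y = g^{b},\quad z = g^{c}. \\
&\langle x,\ x^{e},\ y,\ y^{e} \rangle
   = \langle g,\ g^{a},\ g^{b},\ g^{ab} \rangle, \\
&\langle x,\ x^{e},\ y,\ z \rangle
   = \langle g,\ g^{a},\ g^{b},\ g^{c} \rangle .
\end{aligned}
```

Under this substitution, telling the two four-tuples apart would amount to telling a DDH triple apart from a random triple.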
18Intersection Protocol
- R holds secret key eR and set VR; S holds secret key eS and set VS.
- S computes feS(VS).
- To satisfy DDH, we apply feS on h(VS), where h is a hash function, not directly on VS.
19Intersection Protocol
- S sends feS(VS) to R.
- R computes feR(feS(VS)).
- Commutative property: feR(feS(VS)) = feS(feR(VS)).
20Intersection Protocol
- R sends feR(VR) to S.
- S returns <y, feS(y)> for y ∈ feR(VR).
- Since R knows <x, feR(x)>, R obtains <x, feS(feR(x))> for x ∈ VR.
- R compares these values against feS(feR(VS)): x is in the intersection exactly when feS(feR(x)) ∈ feS(feR(VS)).
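The three intersection slides can be traced end to end in one process. This is a sketch only: both parties are simulated in a single function, the toy safe prime and the SHA-256-based hash h are assumptions, and the function names are hypothetical.

```python
import hashlib
import secrets

P = 1_000_000_007        # toy safe prime (assumption); real use needs about 1024 bits
Q = (P - 1) // 2

def h(value: str) -> int:
    """Hash a value to a quadratic residue modulo P (assumed construction)."""
    d = int.from_bytes(hashlib.sha256(value.encode()).digest(), "big")
    return pow(d % (P - 1) + 1, 2, P)

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    return pow(x, e, P)

def intersection(v_r: set, v_s: set) -> set:
    e_r, e_s = keygen(), keygen()          # each key stays with its owner
    # S -> R: feS(h(VS)), sorted so the order does not track S's input order.
    y_s = sorted(f(e_s, h(v)) for v in v_s)
    # R: feR(feS(h(VS))), equal to feS(feR(h(VS))) by commutativity.
    z_s = {f(e_r, y) for y in y_s}
    # R -> S: feR(h(VR)); R remembers which x produced which y.
    y_r = {x: f(e_r, h(x)) for x in v_r}
    # S -> R: pairs <y, feS(y)>.
    pairs = {y: f(e_s, y) for y in sorted(y_r.values())}
    # R: keep every x whose doubly encrypted value appears in z_s.
    return {x for x, y in y_r.items() if pairs[y] in z_s}

print(sorted(intersection({"a", "b", "c"}, {"b", "c", "d"})))  # ['b', 'c']
```

With a modulus this small, hash collisions are unlikely but not impossible, which is one more reason the parameters here are for illustration only.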
21Intersection Size Protocol
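- R holds secret key eR and set VR; S holds secret key eS and set VS.
- R sends feR(VR) to S; S sends feS(VS) to R.
- R computes feR(feS(VS)).
- S returns feS(feR(VR)) = feR(feS(VR)) as a set, without pairing each value with its input.
- R cannot map z ∈ feR(feS(VR)) back to x ∈ VR, so R learns only how many values also appear in feR(feS(VS)).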
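A sketch of the size-only variant follows, under the same assumptions as before (toy safe prime, SHA-256-based h, both parties simulated in one function, hypothetical names). The only change from the intersection protocol is that S returns the doubly encrypted values without their inputs.

```python
import hashlib
import secrets

P = 1_000_000_007        # toy safe prime (assumption)
Q = (P - 1) // 2

def h(value: str) -> int:
    d = int.from_bytes(hashlib.sha256(value.encode()).digest(), "big")
    return pow(d % (P - 1) + 1, 2, P)

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    return pow(x, e, P)

def intersection_size(v_r: set, v_s: set) -> int:
    e_r, e_s = keygen(), keygen()
    y_r = [f(e_r, h(v)) for v in v_r]            # R -> S: feR(h(VR))
    y_s = sorted(f(e_s, h(v)) for v in v_s)      # S -> R: feS(h(VS))
    z_s = {f(e_r, y) for y in y_s}               # R: feR(feS(h(VS)))
    # S -> R: feS(feR(h(VR))) sorted and unpaired, so R cannot link z to x.
    z_r = sorted(f(e_s, y) for y in y_r)
    return sum(1 for z in z_r if z in z_s)

print(intersection_size({"a", "b", "c"}, {"b", "c", "d"}))  # 2
```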
22Equijoin Size Protocol
- Same as intersection size protocol, but allows duplicates.
- Can reveal some subset of information in VR ∩ VS based on distribution of duplicates.
- If each element in VR ∩ VS has the same number of duplicates in VR, does not reveal any additional information beyond the join size and the distribution of duplicates in VS.
- If each element in VR ∩ VS has a unique number of duplicates in VR, reveals VR ∩ VS and the number of duplicates in VS for elements in VR ∩ VS.
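The role of duplicates can be illustrated by counting encrypted values as multisets. This is a sketch with the same assumed helpers as the earlier examples (toy safe prime, SHA-256-based h, single-process simulation); the column contents are made up.

```python
import hashlib
import secrets
from collections import Counter

P = 1_000_000_007        # toy safe prime (assumption)
Q = (P - 1) // 2

def h(value: str) -> int:
    d = int.from_bytes(hashlib.sha256(value.encode()).digest(), "big")
    return pow(d % (P - 1) + 1, 2, P)

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    return pow(x, e, P)

def equijoin_size(col_r: list, col_s: list) -> int:
    """col_r and col_s are join-attribute columns and may contain duplicates."""
    e_r, e_s = keygen(), keygen()
    y_r = [f(e_r, h(v)) for v in col_r]          # R -> S, duplicates preserved
    y_s = sorted(f(e_s, h(v)) for v in col_s)    # S -> R, duplicates preserved
    z_s = Counter(f(e_r, y) for y in y_s)        # R sees S's duplicate counts
    z_r = Counter(f(e_s, y) for y in y_r)        # returned by S, unpaired
    # Each matching value contributes (count in R) * (count in S) joined rows.
    return sum(n * z_s[z] for z, n in z_r.items())

print(equijoin_size(["a", "a", "b", "c"], ["a", "b", "b", "b", "d"]))  # 5
```

The counters are exactly the "distribution of duplicates" the slide refers to: when the counts on R's side are all distinct, matching them up identifies which values joined.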
23Equijoin Protocol Intuition
- R needs some extra information ext(v) for values v ∈ VR ∩ VS.
- ext(v): information about the other attributes in TS for those records where TS.A = v
- S has a second secret key e'S.
- For each value v ∈ VS,
- S generates an encryption key κ = fe'S(v), and
- encrypts ext(v) using encryption function K with key κ.
- S allows R to learn fe'S(v) only for v ∈ VR.
- K need not be a commutative encryption.
24Join Protocol
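- R holds secret key eR and set VR; S holds secret keys eS and e'S.
- R sends feR(VR) to S.
- S returns <y, feS(y), fe'S(y)> for y ∈ feR(VR).
- R now has <x, feS(feR(x)), fe'S(feR(x))> for x ∈ VR.
- R removes its own key: feR^-1(feS(feR(x))) = feR^-1(feR(feS(x))) = feS(x), and likewise for fe'S.
- R obtains <x, feS(x), fe'S(x)> for x ∈ VR.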
25Join Protocol
- S holds secret keys eS and e'S, set VS, and ext(VS); R holds <x, feS(x), fe'S(x)> for x ∈ VR.
- S sends <feS(v), K(fe'S(v), ext(v))> for v ∈ VS to R.
- K is an encryption function: it encrypts ext(v) using fe'S(v) as the encryption key.
- R obtains <x, feS(x), fe'S(x), K(fe'S(x), ext(x))> for x ∈ VR ∩ VS, and can decrypt ext(x) with the key fe'S(x).
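The two join slides can be combined into one runnable sketch. Several pieces are assumptions made for illustration: the toy safe prime, the SHA-256-based hash h, the XOR keystream used as K (the slides only require that K be some encryption function), and the single-process simulation of both parties.

```python
import hashlib
import secrets

P = 1_000_000_007        # toy safe prime (assumption)
Q = (P - 1) // 2

def h(value: str) -> int:
    d = int.from_bytes(hashlib.sha256(value.encode()).digest(), "big")
    return pow(d % (P - 1) + 1, 2, P)

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    return pow(x, e, P)

def f_inv(e: int, y: int) -> int:
    """Undo fe by raising to e^-1 mod Q (the group of residues has order Q)."""
    return pow(y, pow(e, -1, Q), P)

def K(key: int, data: bytes) -> bytes:
    """Toy cipher (assumption): XOR with a SHA-256 keystream; self-inverse."""
    stream, counter = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(f"{key}:{counter}".encode()).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def equijoin(v_r: set, t_s: dict) -> dict:
    """t_s maps each v in VS to ext(v) as bytes; returns ext(x) for x in VR ∩ VS."""
    e_r, e_s, e_s2 = keygen(), keygen(), keygen()     # e_s2 plays the role of e'S
    y_r = {x: f(e_r, h(x)) for x in v_r}              # R -> S: feR(h(VR))
    # S -> R: <y, feS(y), fe'S(y)>
    triples = {y: (f(e_s, y), f(e_s2, y)) for y in sorted(y_r.values())}
    # R strips its own key to get <x, feS(h(x)), fe'S(h(x))>.
    r_view = {x: tuple(f_inv(e_r, t) for t in triples[y]) for x, y in y_r.items()}
    # S -> R: <feS(h(v)), K(fe'S(h(v)), ext(v))> for v in VS.
    s_msg = {f(e_s, h(v)): K(f(e_s2, h(v)), ext) for v, ext in t_s.items()}
    # R decrypts only the entries whose tag it could compute.
    return {x: K(kappa, s_msg[tag]) for x, (tag, kappa) in r_view.items() if tag in s_msg}

t_s = {"b": b"ext-b", "c": b"ext-c", "d": b"ext-d"}
print(equijoin({"a", "b", "c"}, t_s))
```

R never obtains the key for "d", since S only applies e'S to the values R submitted, so the ciphertext for ext("d") stays opaque to R.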
26Proof Methodology
- Consider two distributions:
- S's view of the protocol.
- A simulation of S's view that only uses what S is supposed to have at the end of the protocol.
- e.g., VS, VS ∩ VR, and |VR| for intersection.
- If for any VS and VR, these two distributions are computationally indistinguishable, then the protocol is secure.
- i.e., S cannot learn anything else from the protocol.
27Proof Methodology (2)
- Simulation only uses the knowledge S is supposed to have at the end of the protocol.
- Distinguisher can also use the inputs of R, i.e., VR, but not R's secret keys.
- Implication: S doesn't learn anything from the protocol even if S (correctly) guesses some of R's inputs.
28Proofs
- We prove (for each protocol) that if the two distributions can be distinguished, the DDH hypothesis is false.
- Easy to come up with protocols that look okay, but are flawed.
- Proof of security is important for real-world acceptance and use.
- The proofs are also fun!
29Talk Outline
- Motivation
- Problem Statement
- Protocols
- Cost Analysis
- Conclusions
30Cost Analysis: Operations
- Cost is dominated by exponentiations.
- Let Ce = cost of x^e mod p
- x, e, p are all 1024-bit integers
- Roughly 0.02 seconds on a Pentium 3 (in 2001) [NP01], or 2 × 10^5 per hour
- Intersection: 2 (|VR| + |VS|) Ce
- Join: (2 |VR| + 5 |VS|) Ce
- Algorithms are trivially parallelizable.
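The intersection figure can be recovered by counting one exponentiation per encrypted value at each step of the protocol. The sketch below does that count and converts it to time using the 0.02-second figure from the slide; the function names and the step labels are illustrative.

```python
CE_SECONDS = 0.02   # cost of one 1024-bit exponentiation, from the slide

def intersection_exponentiations(n_r: int, n_s: int) -> int:
    """Count exponentiations step by step; n_r = |VR|, n_s = |VS|."""
    steps = {
        "S encrypts VS": n_s,
        "R re-encrypts feS(VS)": n_s,
        "R encrypts VR": n_r,
        "S re-encrypts feR(VR)": n_r,
    }
    return sum(steps.values())          # equals 2 * (n_r + n_s)

def hours(exponentiations: int, processors: int = 1) -> float:
    """Wall-clock hours, assuming the work divides evenly across processors."""
    return exponentiations * CE_SECONDS / 3600 / processors

count = intersection_exponentiations(1000, 1000)
print(count, round(hours(count) * 3600), "seconds on one processor")
```

At 0.02 seconds each, one processor performs 180,000 exponentiations per hour, which the slide rounds to 2 × 10^5.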
31Selective Document Sharing Implementation
- For each pair of documents dR ∈ DR and dS ∈ DS:
- R and S execute the intersection protocol to get |dR|, |dS|, and |dR ∩ dS|.
- Then compute similarity function f between the documents.
- Note: This protocol also reveals to R, for each document dR ∈ DR, the size of dR ∩ dS for each dS ∈ DS.
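The slide leaves the similarity function f unspecified. As one possible choice, the sketch below assumes Jaccard similarity computed from the three sizes; the threshold, the document names, and the use of plain set operations in place of a protocol run are all assumptions for illustration.

```python
def jaccard(size_r: int, size_s: int, size_both: int) -> float:
    """Similarity from |dR|, |dS| and |dR ∩ dS| only (assumed choice of f)."""
    union = size_r + size_s - size_both
    return size_both / union if union else 0.0

def matching_pairs(docs_r: dict, docs_s: dict, threshold: float) -> list:
    matches = []
    for name_r, words_r in docs_r.items():
        for name_s, words_s in docs_s.items():
            # Stand-in for one protocol run: only the three sizes are used.
            sizes = (len(words_r), len(words_s), len(words_r & words_s))
            if jaccard(*sizes) >= threshold:
                matches.append((name_r, name_s))
    return matches

docs_r = {"r1": {"a", "b", "c", "d"}}
docs_s = {"s1": {"c", "d", "e", "f"}, "s2": {"x", "y"}}
print(matching_pairs(docs_r, docs_s, threshold=0.3))  # [('r1', 's1')]
```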
32Selective Document Sharing: Cost Analysis
- If
- |DR| = 10 documents, |DS| = 100 documents,
- each document has 1000 words,
- 10 parallel processors,
- then
- 2 hours computation time
- 35 minutes communication time (on T1 line).
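The 2-hour computation figure is consistent with the intersection cost of 2 (|VR| + |VS|) Ce applied to every document pair. The check below assumes each document contributes 1000 values and uses the rounded rate of 2 × 10^5 exponentiations per processor-hour; the function name is illustrative.

```python
EXPONENTIATIONS_PER_HOUR = 200_000   # rounded per-processor rate from the cost slide

def document_sharing_hours(docs_r: int, docs_s: int,
                           words_per_doc: int, processors: int) -> float:
    pairs = docs_r * docs_s                             # one protocol run per pair
    per_pair = 2 * (words_per_doc + words_per_doc)      # 2 (|VR| + |VS|) exponentiations
    return pairs * per_pair / EXPONENTIATIONS_PER_HOUR / processors

print(document_sharing_hours(10, 100, 1000, 10))  # 2.0
```

That is 1000 pairs at 4000 exponentiations each, or 4 × 10^6 in total: 20 processor-hours spread over 10 processors.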
33Medical Research: Implementation
- Let
- VR: set of ids in R's database that took the drug.
- V'R: subset of VR with adverse reaction.
- VS: set of ids in S's database.
- V'S: subset of VS with DNA sequence.
- Execute intersection size protocol 4 times:
- |(VR - V'R) ∩ (VS - V'S)|, |(VR - V'R) ∩ V'S|,
- |V'R ∩ (VS - V'S)|, |V'R ∩ V'S|
- Modified version of protocol that sends results directly to researchers.
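The four runs fill the cells of a 2 × 2 table of reaction against DNA sequence. In the sketch below, a plain set intersection stands in for each run of the intersection size protocol, and the ids are made-up sample data.

```python
def contingency_counts(v_r: set, v_r_adverse: set, v_s: set, v_s_dna: set) -> dict:
    """v_r_adverse is V'R (a subset of VR); v_s_dna is V'S (a subset of VS)."""
    def size(a: set, b: set) -> int:
        # Stand-in for one run of the intersection size protocol.
        return len(a & b)

    return {
        ("no reaction", "no sequence"): size(v_r - v_r_adverse, v_s - v_s_dna),
        ("no reaction", "sequence"): size(v_r - v_r_adverse, v_s_dna),
        ("reaction", "no sequence"): size(v_r_adverse, v_s - v_s_dna),
        ("reaction", "sequence"): size(v_r_adverse, v_s_dna),
    }

counts = contingency_counts(
    v_r={1, 2, 3, 4, 5}, v_r_adverse={1, 2},
    v_s={1, 2, 3, 4, 6}, v_s_dna={1, 3, 6},
)
for cell, n in counts.items():
    print(cell, n)
```

The four counts add up to the number of ids present in both databases, which is all the researchers need to test the hypothesis.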
34Medical Research: Cost Analysis
- If |VR| = |VS| = 1 million ids, and 10 parallel processors:
- 4 hours computation time.
- 1.5 hours communication time.
35Talk Outline
- Motivation
- Problem Statement
- Protocols
- Cost Analysis
- Conclusions
36Summary
- Identified information sharing across private databases as a new area for database research.
- Developed novel protocols for intersection, intersection size, and equijoin, and proved that these protocols disclose minimal information.
- Also gave protocol for equijoin size. This protocol reveals some information about which tuples joined, based on the distribution of duplicates.
- Showed how new applications can be built using these protocols.
37Future Work
- What is the tradeoff between the additional information disclosed and efficiency?
- Will we be able to obtain much faster protocols if we are willing to disclose additional information?
- Can we formalize models of minimal disclosure and discover corresponding protocols for higher-level database operations?
38Backup
39System Components
Cryptographic Protocol
Secure Communication
Libraries (incl. Encryption Primitives)
Database
Operating System
40Lemma 1
- For polynomial m, the distribution of the 2 × m tuple
- is indistinguishable from the distribution of the tuple
- where
41Lemma 2
- For polynomial m and n, the distribution of the 2 × n tuple
- is indistinguishable from the distribution of the tuple
- where
42Lemma 3
- For polynomial m and n, the distribution of the 3 × n tuple
- is indistinguishable from the distribution of the tuple
- where
43Limitations
- Multiple Queries
- Schema Discovery and Heterogeneity