Title: How to compile searching software so that it is impossible to reverse-engineer.
1 How to compile searching software so that it is impossible to reverse-engineer.
(Private Keyword Search on Streaming Data)
Rafail Ostrovsky, William Skeith
UCLA
(patent pending)
2 MOTIVATION: Problem 1.
- Each hour, we wish to find out whether any of hundreds of passenger lists contains a name from the Possible Terrorists list and, if so, his/her itinerary.
- The Possible Terrorists list is classified and should not be revealed to airports.
- Tantalizing question: can the airports help (and do all the search work) if they are not allowed to get the Possible Terrorists list?
PROBLEM 1: Is it possible to design mobile software that can be transmitted to all airports (including potentially revealing this software to the adversary due to leaks) so that this software collects ONLY the information needed, without revealing what it is collecting at each node?
Non-triviality requirement: must send back only the needed information, not everything!
3 MOTIVATION: Problem 2.
- Looking for communication by malicious insiders and/or terrorists:
- (I) First, we must identify some signature criteria (rules) for suspicious behavior; typically, this is done by analysts.
- (II) Second, we must detect which nodes/stations transmit these signatures.
- Here, we want to tackle part (II).
Public networks
PROBLEM 2: Is it possible to design software that can capture all messages (and network locations) that match a secret/classified set of rules?
Key challenge: the software must not reveal the secret rules.
Non-triviality requirement: the software must send back only the locations and messages that match the given rules, not everything it sees.
4 What we want
Punch line: we can send executable code publicly (it won't reveal its secrets!).
5 Current Practice
- Continuously transfer all data to a secure environment.
- After the data is transferred, filter in the classified environment and keep only a small fraction of the documents.
6 Current practice
[Figure: three streams, → D(1,3) → D(1,2) → D(1,1) →, → D(2,3) → D(2,2) → D(2,1) →, and → D(3,3) → D(3,2) → D(3,1) →, are all transferred into Storage (holding every document D(1,1) through D(3,3)), which then feeds the Filter.]
Filter rules are written by an analyst and are classified!
The amount of data that must be transferred to a classified environment is enormous!
7 Current Practice
- Drawbacks
- Communication
- Processing
- Cost and timeliness
8 How to improve performance?
- Distribute the work to many locations on a network, where you decide on the fly which data is useful.
- Seemingly an ideal solution, but
- Major problem:
- It is not clear how to maintain security, which is the focus of this technology.
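9
[Figure: each site filters its own stream locally. Stream → D(1,3) → D(1,2) → D(1,1) → passes through a Filter into Storage holding E(D(1,2)), E(D(1,3)); stream → D(2,3) → D(2,2) → D(2,1) → passes through a Filter into Storage holding E(D(2,2)); stream → D(3,3) → D(3,2) → D(3,1) → passes through a Filter into Storage. After Decrypt, Storage holds D(1,2), D(1,3), D(2,2).]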
10
- Example Filters
- Look for all documents that contain special classified keywords (or strings or data items, and/or do not contain some other data), selected by an analyst.
- Privacy
- Must hide what rules are used to create the filter
- Output must be encrypted
11 More generally
- We define the notion of Public Key Program Obfuscation
- Encrypted version of a program
- Performs the same functionality as the un-obfuscated program, but
- Produces encrypted output
- Impossible to reverse engineer
- A little more formally:
12 Public Key Program Obfuscation
- Can compile any code into an obfuscated code with small storage.
- Think of the Compiler as a mapping:
- Source code → Smart Public-Key Encryption with initial Encrypted Storage and Decryption Key.
- Non-triviality: the sizes of the compiled program, encrypted storage, and encrypted output are not much bigger than those of the uncompiled code.
- Nothing about the program is revealed, given the compiled code and storage.
- Yet someone who has the decryption key can recover the original output.
13 Privacy
14 Related Notions
- PIR (Private Information Retrieval) [CGKS, KO, CMS]
- Keyword PIR [KO, CGN, FIPR]
- Cryptographic counters [KMO]
- Program Obfuscation [BGIRSVY]
- Here the output is identical to that of the un-obfuscated program, but in our case it is encrypted.
- Public Key Program Obfuscation
- A more general notion than PIR, with lots of applications
15 What do we want?
[Figure: stream → D(1,3) → D(1,2) → D(1,1) → passes through the Filter into Storage holding E(D(1,2)), E(D(1,3)).]
2 requirements:
- Correctness: only matching documents are saved, nothing else.
- Efficiency: the decoding time is proportional to the length of the buffer, not the size of the entire stream.
Conundrum: the Compiled Filter Code is not allowed to have ANY branches (i.e., any if-then-else executables). Only straight-line code is allowed!
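The straight-line requirement can be illustrated with a minimal plaintext sketch in Python. The function names and the integer encoding of documents are hypothetical, and in the talk's setting the match indicator would be encrypted rather than a visible 0 or 1; the point is only that a conditional update can be rewritten as arithmetic that runs identically for every document.

```python
def update_with_branch(buffer: int, doc: int, match: int) -> int:
    # Branching version: which path runs depends on whether the document matched.
    if match:
        return buffer + doc
    return buffer


def update_straight_line(buffer: int, doc: int, match: int) -> int:
    # Straight-line version: the same instructions run for every document;
    # a non-matching document (match == 0) simply adds 0.
    return buffer + match * doc
```

Both functions return the same value for match in {0, 1}, but only the second has a control flow that is independent of the match bit.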
16 Simplifying Assumptions for this Talk
- All keywords come from some poly-size dictionary
- Truncate documents beyond a certain length
17 Sneak peek: the compiled code
- Suppose we are looking for all documents that contain some secret word from the Webster dictionary.
- Here is how it looks to the adversary: for each document, execute the same code, as follows.
18 Look up the encryptions of all words appearing in the document and multiply them together. Take this value and apply a fixed formula to it to get the value g.
[Figure: a Dictionary table listing an encryption for every word (w1 → E(), w2 → E(), ..., wn-1 → E(), wn → E()). A document D is mapped to the value g, which is written into a Small Output Buffer of triples (,,).]
19 How should a solution look?
20
[Figure: a stream of documents labelled "This is matching document 1", "This is matching document 2", and "This is matching document 3", interleaved with documents labelled "This is a Non-matching document".]
21 How do we accomplish this?
22 Reminder: PKE
- Key-generation(1^k) → (PK, SK)
- E(PK, m, r) → c
- D(c, SK) → m
- We will use PKE with additional properties.
23 Several Solutions based on Homomorphic Public-Key Encryption
- For this talk: Paillier Encryption
- Properties:
- E(x) is probabilistic; in particular, a single bit can be encrypted in many different ways, such that any instance of E(0) and any instance of E(1) cannot be distinguished.
- Homomorphic, i.e., E(x) · E(y) = E(x + y)
24 Using Paillier Encryption
- E(x) · E(y) = E(x + y)
- Important to note:
- E(0)^c = E(0) · ... · E(0) = E(0 + 0 + ... + 0) = E(0)
- E(1)^c = E(1) · ... · E(1) = E(1 + 1 + ... + 1) = E(c)
- Assume we can somehow compute an encrypted value v, where we don't know what v stands for, but v = E(0) for un-interesting documents and v = E(1) for interesting documents.
- What's v^c? It is either E(0) or E(c), and we don't know which one it is.
25
[Figure: the Dictionary maps each word to an encryption: w1 → E(0), w2 → E(1), w3 → E(0), w4 → E(0), w5 → E(1), ..., wn-2 → E(1), wn-1 → E(0), wn → E(0). For a document D, the pair (g, g^D) is written into the Output Buffer, whose entries are all initially E(0).]
g = E(0) if there are no matching words; g = E(c) if there are c matching words.
g^D = E(0) if there are no matching words; g^D = E(cD) if there are c matching words.
Thus if we keep g = E(c) and g^D = E(cD), we can calculate D exactly: after decryption, D = cD / c.
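The (g, g^D) idea can be sketched end to end with a toy Paillier scheme. Everything concrete here is an assumption made for illustration: the five-word dictionary, the secret keywords, the encoding of a document as a big-endian integer, and the toy Mersenne-prime key (not secure). The sketch also assumes every word comes from the dictionary and that cD stays below N.

```python
import math
import secrets

# Toy Paillier setup (assumption: illustration only, not secure).
P, Q = 2**89 - 1, 2**127 - 1
N = P * Q
N2 = N * N
LAM = math.lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)


def encrypt(m: int) -> int:
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


# Analyst side: E(1) for secret keywords, E(0) for every other dictionary word.
dictionary = ["apple", "bomb", "cat", "plot", "zebra"]   # hypothetical
secret_keywords = {"bomb", "plot"}                        # hypothetical
table = {w: encrypt(1 if w in secret_keywords else 0) for w in dictionary}


def filter_document(text: str):
    """Public step, identical for every document: returns (g, g^D)."""
    d = int.from_bytes(text.encode(), "big")   # the document as an integer D
    g = 1                                      # 1 is a trivial encryption of 0
    for word in set(text.split()):
        g = (g * table[word]) % N2             # homomorphically adds the bits
    return g, pow(g, d, N2)


def decode(g: int, g_d: int):
    """Key-holder step: recover D = cD / c, or None if nothing matched."""
    c = decrypt(g)
    if c == 0:
        return None
    d = decrypt(g_d) // c
    return d.to_bytes((d.bit_length() + 7) // 8, "big").decode()
```

For example, `decode(*filter_document("cat bomb plot"))` returns the original text, while a document with no secret keyword decodes to `None`. Only `decode` branches on the match count, and it runs on the side that holds the decryption key.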
26 Here's another matching document
- Collisions cause two problems:
- 1. Good documents are destroyed
- 2. Non-existent documents could be fabricated
[Figure: buffer locations holding "This is matching document 1", "This is matching document 2", and "This is matching document 3".]
27
- We'll make use of two combinatorial lemmas
29 Combinatorial Lemma 1
- Claim: the color survival game succeeds with probability > 1 - neg(γ)
30 How to detect collisions?
- Idea: append a highly structured (yet random) short combinatorial object to the message, with the property that if 2 or more of them collide, the combinatorial property is destroyed.
- ⇒ can always detect collisions!
31
- 100001100010010100001010010
010001010001100001100001010
010100100100010001010001010
100100010111100100111010010
32 Combinatorial Lemma 2
Claim: collisions are detected with probability > 1 - exp(-k/3)
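One way to read the 27-bit rows on slide 31 is as nine 3-bit blocks, each containing exactly one 1; the first three rows fit that pattern and the fourth (which contains the block 111) does not. The slides do not spell out the construction, so the following Python sketch rests on that reading, and the names `random_tag` and `is_valid` are hypothetical. How colliding tags combine inside the encrypted buffer is not modeled here.

```python
import secrets

VALID_BLOCKS = {"100", "010", "001"}


def random_tag(k: int) -> str:
    """Build k random 3-bit blocks, each with exactly one bit set."""
    blocks = []
    for _ in range(k):
        block = ["0", "0", "0"]
        block[secrets.randbelow(3)] = "1"
        blocks.append("".join(block))
    return "".join(blocks)


def is_valid(tag: str) -> bool:
    """A tag keeps its structure only if every 3-bit block has exactly one 1."""
    if len(tag) % 3 != 0:
        return False
    return all(tag[i:i + 3] in VALID_BLOCKS for i in range(0, len(tag), 3))


# The four rows shown on slide 31.
rows = [
    "100001100010010100001010010",
    "010001010001100001100001010",
    "010100100100010001010001010",
    "100100010111100100111010010",
]
checks = [is_valid(row) for row in rows]
```

Under this reading, `checks` flags the first three rows as intact tags and the fourth as one whose structure has been destroyed, which is the signal a decoder would use to discard a buffer location.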
33 We do the same for all documents!
34 For every document in the stream do the same:
- Look up the encryptions of all words appearing in the document and multiply them together (to get g).
- Compute g^D and f(g).
- Multiply (g, g^D, f(g)) into γ randomly chosen locations of the buffer.
[Figure: a Dictionary table listing an encryption for every word (w1 → E(), w2 → E(), ..., wn-1 → E(), wn → E()). A document D is mapped to the triple (g, g^D, f(g)), which is written into a Small Output Buffer of triples (,,).]
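The bookkeeping of this per-document procedure can be simulated in plaintext. This is a sketch under several assumptions, not the scheme itself: there is no encryption, so each buffer slot stores the sums (c, cD, c·tag) that the real buffer would hold under E(); the tag is assumed to be k one-hot 3-slot blocks; the γ locations are sampled without repetition; and all names and default parameters are hypothetical.

```python
import random


def one_hot_tag(rng, k):
    """k blocks of three slots, each block with exactly one 1."""
    tag = [0] * (3 * k)
    for b in range(k):
        tag[3 * b + rng.randrange(3)] = 1
    return tag


def process_stream(docs, keywords, buf_len=200, gamma=5, k=16, seed=0):
    rng = random.Random(seed)
    # Each slot holds running sums (c, c*D, c*tag); all start at zero.
    buf = [[0, 0, [0] * (3 * k)] for _ in range(buf_len)]
    for text in docs:
        d = int.from_bytes(text.encode(), "big")
        c = len(set(text.split()) & keywords)   # plaintext stand-in for g = E(c)
        tag = one_hot_tag(rng, k)
        # The same writes happen for every document; a non-match adds zeros.
        for pos in rng.sample(range(buf_len), gamma):
            slot = buf[pos]
            slot[0] += c
            slot[1] += c * d
            slot[2] = [s + c * t for s, t in zip(slot[2], tag)]
    return buf


def decode(buf):
    """Key-holder side: keep slots whose tag still has the one-hot structure."""
    found = set()
    for c, cd, tag in buf:
        blocks = [tag[i:i + 3] for i in range(0, len(tag), 3)]
        if c > 0 and all(sorted(b) == [0, 0, c] for b in blocks):
            d = cd // c
            found.add(d.to_bytes((d.bit_length() + 7) // 8, "big").decode())
    return found
```

With a buffer much longer than the number of matching documents, each matching document is very likely to land in at least one location that no other matching document touches, and `decode` recovers it from there; locations where two matching documents collide fail the tag check and are skipped.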
35 Overflow: how to always collect at least m items (with arbitrary overflow of matching documents)
- Idea: create a logarithmic (in the stream size) number of original buffers.
- The first buffer is processed for every stream item
- The second buffer takes every item with probability 1/2
- The third buffer takes every item with (independent) probability 1/4
- The i-th buffer takes every item with probability 1/2^(i-1)
- Key point: if the number of matching documents is > m, at least one buffer will get O(m) matching documents!
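The sampling rule for the overflow buffers can be simulated directly. In this minimal Python sketch the function name, the choice of ceil(log2(stream size)) + 1 levels, and the seeded random generator are assumptions made for illustration; the code only counts how many matching documents each level would receive and does not model the buffers themselves.

```python
import math
import random


def matching_docs_per_level(num_matching: int, stream_size: int, seed: int = 0):
    """Count the matching documents that each level's buffer would receive."""
    rng = random.Random(seed)
    levels = max(1, math.ceil(math.log2(stream_size)) + 1)
    counts = [0] * levels
    for _ in range(num_matching):
        for i in range(levels):
            # Level i (0-based) takes each item independently with probability 1/2^i,
            # so level 0 takes every item.
            if rng.random() < 2.0 ** -i:
                counts[i] += 1
    return counts


counts = matching_docs_per_level(num_matching=1000, stream_size=1000)
```

Because the expected count halves from one level to the next, some level ends up with a number of matching documents on the order of m even when the stream contains far more than m matches, and that level's buffer can be decoded without overflow.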
36 Comparison of our work to Bethencourt, Song, Waters 06
- OS-05
- Buffer size to store m items O(m log m)
- Efficiency decoding time is proportional to the
buffer size.
- BSW-06
- Buffer size to store m items O(m)
- Efficiency decoding time is proportional to the
length of the entire stream.
37 More from the paper that we don't have time to discuss
- Reducing program size below dictionary size (using Φ-Hiding from CMS)
- Queries containing AND (using BGN machinery)
- Eliminating negligible error (using perfect hashing)
- Scheme based on arbitrary homomorphic encryption
- Extending to words not from dictionary (with small error prob.)
38 Conclusions
- We introduced private searching on streaming data
- More generally: public key program obfuscation, which is more general than PIR or cryptographic counters
- Practical, efficient protocols
- Eat your cake and have it too: ensure that only useful documents are collected.
- Many possible extensions and lots of open problems
- THANK YOU!