How To Tell A Secret Without Revealing It - PowerPoint PPT Presentation



Slides: 47

Provided by: informatio57

Learn more at: https://facultystaff.richmond.edu


Transcript and Presenter's Notes

1
How To Tell A Secret Without Revealing It

Enhanced Data Privacy in a Distributed
Implementation of the Smith-Waterman Genome
Sequence Comparison Algorithm
Barry Lawson
University of Richmond

2
Outline

Distributed volunteer computing
A problem
Related work
A real-world application
Our enhanced privacy approach
Results and analysis
Conclusions, future and ongoing work

3
Our Scenario

You have a very large, compute-intensive project

4
What To Do?
5
Distributed Volunteer Computing (DVC)
Participants (Bob)
(< 200 TFLOPS)
Supervisor (Alice)
6
DVC Computation

Large-scale distributed computation
compute-intensive
easily parallelizable
Supervisor
divide computation into tasks (independent)
ship tasks to participants
collect significant results
Participants
download and execute tasks (when otherwise idle)
return significant results

7
Real-world Examples
8
DVC Group @ Richmond

Faculty
Doug Szajda (CS)
Barry Lawson (CS)
Jason Owen (Statistics)
Students
Current
Mike Pohl, Greg Steffensen, Andy White
Past
Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin C.
Four more this summer
Stefan Chipilov, Brittany Williams, Matt King, Ivan Jibaja

9
Telling a Secret Without Revealing It

A problem
Participants are untrustworthy
Code executes outside supervisor's control
Computation data may be proprietary
Goal
Participants provide meaningful results
Supervisor does not divulge data

10
Related Work (not exhaustive)

Computing with encrypted data
Alice has x, wants Bob to compute f(x)
But does not want to divulge x
Alice gives Bob x' and f'( )
Bob computes f'(x')
The key
Alice can determine f(x) from f'(x')
Bob cannot determine x from x' and/or f'(x')
Difficult (often impossible) in practice

11
Less Formally
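[Figure: Alice passes a transformed input and function to Bob; Bob computes the transformed result and returns it; Alice recovers f(x), while Bob cannot recover x]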
12
Flexibility In Our Context

The computation
Alice (supervisor) has many x's
Bob (participant) determines which x's are
significant
Alice doesn't need the value f(x)
Alice will post-process
A few false positives are OK
Sufficient accuracy: flexibility in f( )

13
The Adversary

Assumed to be intelligent
can decompile, analyze, modify code
understands task algorithms
understands enhanced privacy scheme(s)
Motivation
may not be obvious (business competitor?)
may not care if leak is detected

14
General Model

The Computation
evaluate an algorithm f : D → R for all
x in D
Task T( )
partition D into subsets D_i
T(D_i) evaluates f(x_i) for all x_i in D_i
Filter function G( )
determines significance
returns indices of significant xi
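
The general model above can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the names partition and run_task, the chunk size, and the toy f and G (squaring, and "result greater than 50") are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")
R = TypeVar("R")


def partition(domain: Sequence[X], size: int) -> List[Sequence[X]]:
    """Split D into consecutive subsets D_i of at most `size` elements."""
    return [domain[i:i + size] for i in range(0, len(domain), size)]


def run_task(subset: Sequence[X],
             f: Callable[[X], R],
             g: Callable[[R], bool]) -> List[int]:
    """T(D_i): evaluate f on every x_i and return the indices G marks significant."""
    return [i for i, x in enumerate(subset) if g(f(x))]


# Hypothetical computation: f squares its input, G flags results above 50.
domain = list(range(12))
tasks = partition(domain, 4)  # the supervisor would ship one subset per task
significant = [run_task(d, lambda x: x * x, lambda r: r > 50) for d in tasks]
```

Only the index lists travel back to the supervisor, which matches the point that participants return significant results rather than every f(x).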

15
Our General Approach

Transform D_i, f, G into D_i', f', G'
Replace task T(D_i) with T'(D_i')
Desirable properties
T'(D_i') does not leak additional info about
values in D_i
significance in T'(D_i') ≈ significance in
T(D_i)
any difference is reasonably small

16
In Reality

Providing desired properties is difficult
even with increased flexibility
impossible for some apps
When possible, application-specific
Bottom line: we have a potential approach
where few, if any, others exist

17
Application: Genome Sequence Comparison

Compare sequences over genome alphabet
Σ = {A, C, G, T}
Track evolutionary changes by aligning columns of
sequences (an alignment)
E.g. CTGTTA
CAGTTA

18
Sequence Evolution
Deletion: CTGTTA → CTGTA
Insertion: CTGTTA → CGTGTTA
Substitution: CTGTTA → CAGTTA

Deletions and insertions are collectively called indels
19
Sequence Evolution

After several generations
Note: the number of alignments is huge
(for realistic-length sequences)

CTGTTA → CTATGCTACG
20
Alignment Types

Global alignment
considers entire sequence
Local alignment
considers substrings
biologists usually use local

21
Measuring Alignments

Scoring function
+1 if symbols match
-1 if not
Gap penalty
g(k) = a + b(k-1)
k is gap length
(number of consecutive dashes in a single sequence)
Alignment score
sum of column scores minus gap penalties

22
A Simple Example

Global alignment
Scoring function: +1 match, -1 no match
Gap penalty: g(k) = 2 + 1(k-1)

Sequences: CTGTTA and CTATGCTACG
Column scores and gap penalties: 1, -2, -1, -2, 1, -3, 1, 1, -3

Alignment score: 4 - 11 = -7
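
The score on this slide can be reproduced with a short scoring routine. The sketch below assumes the slide's scoring (+1 match, -1 mismatch) and gap penalty g(k) = 2 + 1(k-1). The dash placement in the two rows is an assumption: it is one alignment of CTGTTA and CTATGCTACG that is consistent with the column scores listed on the slide, not necessarily the one drawn there.

```python
def gap_penalty(k: int, a: int = 2, b: int = 1) -> int:
    """g(k) = a + b(k - 1) for a run of k consecutive dashes in one sequence."""
    return a + b * (k - 1)


def alignment_score(top: str, bottom: str) -> int:
    """Sum of column scores (+1 match, -1 mismatch) minus gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    run_top = run_bottom = 0  # lengths of the dash runs currently open
    for s, t in zip(top, bottom):
        # A dash run ends as soon as its row shows a symbol again.
        if s != "-" and run_top:
            score -= gap_penalty(run_top)
            run_top = 0
        if t != "-" and run_bottom:
            score -= gap_penalty(run_bottom)
            run_bottom = 0
        if s == "-":
            run_top += 1
        elif t == "-":
            run_bottom += 1
        else:
            score += 1 if s == t else -1
    # Charge any run still open at the end of the alignment.
    if run_top:
        score -= gap_penalty(run_top)
    if run_bottom:
        score -= gap_penalty(run_bottom)
    return score


# Assumed dash placement (hypothetical reconstruction of the slide's rows).
example = alignment_score("CTG-T--TA--", "C-TATGCTACG")
```

With this placement there are four matches, one mismatch, two gaps of length 1 and two of length 2, so the total is 4 - (1 + 2 + 2 + 3 + 3) = -7.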

23
Smith-Waterman

Dynamic programming algorithm
Produces an optimal alignment
Global: O(n^2), Local: O(n^3)
Implemented on commercial DVC platforms
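
For reference, a compact sketch of the local-alignment recurrence with a general gap penalty g(k) = a + b(k-1) is given below. It is a textbook-style formulation written for this transcript, not the presenters' implementation; the function name and default parameters are assumptions. The inner maximum over gap lengths is what gives the cubic running time.

```python
def smith_waterman(s, t, match=1, mismatch=-1, a=2, b=1):
    """Best local alignment score of s and t with gap penalty g(k) = a + b(k-1).

    s and t may be strings or lists of comparable symbols.
    """
    n, m = len(s), len(t)
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Try every gap length k ending at this cell, in either sequence.
            up = max(H[i - k][j] - (a + b * (k - 1)) for k in range(1, i + 1))
            left = max(H[i][j - k] - (a + b * (k - 1)) for k in range(1, j + 1))
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


score = smith_waterman("CTGTTA", "CAGTTA")
```

Because the function only compares symbols for equality, the same routine can be applied to lists of integers as well as to nucleotide strings.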

24
Significance in S-W

Significance of scores based on probability
Empirical evidence
given randomly-generated sequences
scores exhibit extreme value distribution

25
Determining Significance

Choosing a significance threshold p
want small probability that a random score > p
typically, probability < 0.003
26
A Smith-Waterman Task

Pairwise comparison of two sets of sequences, A
and B
A: proprietary sequences
B: sequences from public database
Returned: indices of well-matched pairs
Notation: T(A,B,s,g,p)

27
Our Transformation

Use offset sequences
compare relative distances between specific
nucleotides
X = GCACTTACGCCCTTACGACG
F(X,A) = 3,4,8,3
F(X,C) = 2,2,4,2,1,1,4,3
F(X,G) = 1,8,8,3
F(X,T) = 5,1,7,1
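
The offset sequences on this slide can be reproduced as follows. This is a minimal sketch; the function name offsets is an assumption. The first entry is the 1-based position of the first occurrence, and each later entry is the distance from the previous occurrence.

```python
from typing import List


def offsets(seq: str, nucleotide: str) -> List[int]:
    """F(seq, nucleotide): gaps between successive occurrences of one nucleotide."""
    positions = [i + 1 for i, c in enumerate(seq) if c == nucleotide]
    previous = [0] + positions[:-1]
    return [p - q for p, q in zip(positions, previous)]


X = "GCACTTACGCCCTTACGACG"
offsets_by_nucleotide = {c: offsets(X, c) for c in "ACGT"}
```

Each offset sequence reveals where one nucleotide occurs but says nothing directly about which of the other three symbols fills the remaining positions.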

28
Modified Tasks

X = GCACTTACGCCCTTACGACG
F(X,C) = 2,2,4,2,1,1,4,3
Y = GCACTCGCCACTTAGCACG
F(Y,C) = 2,2,2,2,1,2,5,2
Apply S-W to F(X,C) and F(Y,C)
Scoring function, gap penalty
Goodness threshold

29
Intuition

Similar sequences ⇒ similar offsets
consider effects of indels, substitutions
What about false positives?
multiple nucleotides
e.g., assign A and C tasks to distinct participants
good match if both tasks indicate significance

30
[Figure: for the sequences CAGGATCTCAAGC and CAGCATATCACGT, the A task and the C task go to distinct participants (Bob 1 and Bob 2)]
31
Using Multiple Nucleotides

Maximum method
one task for each of A,C,G,T
result significant if any of the four indicates
significance
Adding method
one task for each of A,C,G,T
result significant if sum of four scores
indicates significance
Costs reduced in either case
on average, an offset sequence is 1/4 the length of the original sequence
runtime for an offset sequence ≈ 1/64 of the original
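
The two combination rules and the cost estimate can be written down directly. The sketch below is illustrative only: the per-nucleotide scores and both thresholds are made-up numbers, and the function names are assumptions. The cost line assumes the cubic-time local alignment quoted earlier, so a sequence 1/4 as long costs (1/4)^3 of the time.

```python
from typing import Dict


def significant_max(scores: Dict[str, float], threshold: float) -> bool:
    """Maximum method: significant if any single nucleotide task exceeds the threshold."""
    return any(score > threshold for score in scores.values())


def significant_sum(scores: Dict[str, float], sum_threshold: float) -> bool:
    """Adding method: significant if the four scores together exceed a threshold."""
    return sum(scores.values()) > sum_threshold


# Hypothetical scores returned by the four per-nucleotide tasks.
scores = {"A": 12, "C": 31, "G": 9, "T": 14}
by_max = significant_max(scores, threshold=30)      # hypothetical threshold
by_sum = significant_sum(scores, sum_threshold=70)  # hypothetical threshold

# Relative runtime of one offset-sequence comparison under O(n^3) alignment.
relative_runtime = (1 / 4) ** 3
```

In this made-up case the maximum method reports significance because the C task alone is above its threshold, while the adding method does not.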

32
Does This Provide Real Data Privacy?

Recall desired properties
T'(D_i') does not leak additional info about
values in D_i
significance in T'(D_i') ≈ significance in T(D_i)
any difference is reasonably small

33
Data Privacy?

Property 1 fails
T'(D_i') does leak additional info about values in
D_i
adversary knows all info about one nucleotide
How much info is leaked?
conditional entropy gives rough estimate
e.g., N = 600, #C ≈ N/4 ⇒
487 bits (of 1200) leaked
713 bits of uncertainty remain
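
The figures on this slide can be checked with a rough calculation. The sketch assumes each position is an independent, uniformly random nucleotide (2 bits per position), so that learning whether each position is a C reveals one binary-entropy term per position; this modelling assumption is ours, made only to reproduce the estimate.

```python
from math import log2


def binary_entropy(q: float) -> float:
    """Entropy in bits of a yes/no event that occurs with probability q."""
    return -q * log2(q) - (1 - q) * log2(1 - q)


N = 600                             # sequence length from the slide
total_bits = 2 * N                  # 2 bits per nucleotide
leaked = N * binary_entropy(1 / 4)  # "is this position a C?" for every position
remaining = total_bits - leaked     # uncertainty left about the non-C positions
```

The remainder equals (3N/4) · log2(3): each of the roughly 450 non-C positions is still one of three equally likely symbols.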

34
Analysis

Clearly, not provable security
Suggests two questions
Can the adversary determine additional symbols? If
so, how many?
How much info leakage is too much?

35
4 out of 5 Biologists Agree

Given only the positions of a single nucleotide
literal
No additional nucleotides can be inferred
No biologically useful information can be
inferred
Given current understanding of the structure and
function of the genome

36
Does It Work?

In general, yes
strong correlation between our scores and S-W
not as sensitive as S-W
some weak matches missed
Via statistical inference
very few false positives (< 10^-4)
very few false negatives (usually none)

37
An Extension

Sequences can be masked
For each task, choose random binary mask
Remove from the sequence all elements whose mask bit is zero
Our experiments suggest a mask with 1 in 90% of
positions works well

X:      2 2 4 2 1 1 4 3
Mask:   1 1 1 0 1 1 1 0
Masked: 2 2 4 1 1 4
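
The masking step is easy to sketch. The helper names below are assumptions, and the way the random mask is drawn (each bit independently set to 1 with probability 0.9) is one plausible reading of "1 in 90% of positions", not a detail confirmed by the slides.

```python
import random
from typing import List, Optional


def apply_mask(offset_seq: List[int], mask: List[int]) -> List[int]:
    """Keep only the offsets whose mask bit is 1."""
    return [x for x, keep in zip(offset_seq, mask) if keep]


def random_mask(length: int, keep_fraction: float = 0.9,
                rng: Optional[random.Random] = None) -> List[int]:
    """Draw a binary mask; each bit is 1 with probability keep_fraction (assumed)."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_fraction else 0 for _ in range(length)]


# The example from the slide.
masked = apply_mask([2, 2, 4, 2, 1, 1, 4, 3], [1, 1, 1, 0, 1, 1, 1, 0])
```

A fresh mask per task means a participant sees a thinned offset sequence and does not know which offsets were dropped.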
38
Simulation Results

Well-matched sequences artificially generated
Substring mutated over several generations
Placed at random location into random sequences
Scoring function: +1 match, -1 no match
Gap penalty: g(k) = 2 + 1(k-1)

10000 comparisons, no mask, maximum method
Sequence length 600-800, matching portion length
300, average of 52.5 subs and 52.5 indels

10000 comparisons, no mask, adding method
Sequence length 600-800, matching portion length
300, average of 52.5 subs and 52.5 indels

1000 comparisons, no mask, maximum method
Sequence length 2000, matching portion length
1000, average of 150 subs and 150 indels

1000 comparisons, 90% mask, maximum method
Sequence length 1000-1300, matching portion
length 500, average of 86.25 subs and 86.25 indels

43
Conclusions

Introduced notion of sufficient accuracy
Presented a strategy for enhancing data privacy
in an important real-world application
Presented an important real-world app that
requires privacy
is efficiently parallelizable
Potential first entry for benchmark suite of apps
for privacy study

44
Future Work