How To Tell A Secret Without Revealing It

1
How To Tell A Secret Without Revealing It
  • Enhanced Data Privacy in a Distributed
    Implementation of the Smith-Waterman Genome
    Sequence Comparison Algorithm
  • Barry Lawson
  • University of Richmond

2
Outline
  • Distributed volunteer computing
  • A problem
  • Related work
  • A real-world application
  • Our enhanced privacy approach
  • Results and analysis
  • Conclusions, future and ongoing work

3
Our Scenario
  • You have a very large, compute-intensive project

4
What To Do?
5
Distributed Volunteer Computing (DVC)
Participants (Bob)
(< 200 TFLOPS)
Supervisor (Alice)
6
DVC Computation
  • Large-scale distributed computation
  • compute-intensive
  • easily parallelizable
  • Supervisor
  • divide computation into tasks (independent)
  • ship tasks to participants
  • collect significant results
  • Participants
  • download and execute tasks (when otherwise idle)
  • return significant results

7
Real-world Examples
8
DVC Group @ Richmond
  • Faculty
  • Doug Szajda (CS)
  • Barry Lawson (CS)
  • Jason Owen (Statistics)
  • Students
  • Current
  • Mike Pohl, Greg Steffensen, Andy White
  • Past
  • Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin
    C.
  • Four more this summer
  • Stefan Chipilov, Brittany Williams, Matt King,
    Ivan Jibaja

9
Telling a Secret Without Revealing It
  • A problem
  • Participants are untrustworthy
  • Code executes outside supervisor's control
  • Computation data may be proprietary
  • Goal
  • Participants provide meaningful results
  • Supervisor does not divulge data

10
Related Work (not exhaustive)
  • Computing with encrypted data
  • Alice has x, wants Bob to compute f(x)
  • But does not want to divulge x
  • Alice gives Bob x' and f'( ), transformed versions
    of x and f( )
  • Bob computes f'(x')
  • The key
  • Alice can determine f(x) from f'(x')
  • Bob cannot determine x from x' and/or f'(x')
  • Difficult (often impossible) in practice

11
Less Formally
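[Diagram: Alice gives Bob a transformed input and a transformed function; Bob returns the transformed result, from which Alice recovers f(x).]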
12
Flexibility In Our Context
  • The computation
  • Alice (supervisor) has many x's
  • Bob (participant) determines x's that are
    significant
  • Alice doesn't need the value f(x)
  • Alice will post-process
  • A few false positives are OK
  • Sufficient accuracy: flexibility in f( )

13
The Adversary
  • Assumed to be intelligent
  • can decompile, analyze, modify code
  • understands task algorithms
  • understands enhanced privacy scheme(s)
  • Motivation
  • may not be obvious (business competitor?)
  • may not care if leak is detected

14
General Model
  • The Computation
  • evaluate an algorithm f : D → R for all
    x in D
  • Task T( )
  • partition D into subsets D_i
  • T(D_i) evaluates f(x_i) for all x_i in D_i
  • Filter function G( )
  • determines significance
  • returns indices of significant xi
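The general model can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a list-like domain, fixed-size subsets); the names partition, run_task, and is_significant are hypothetical and do not come from the slides.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")
R = TypeVar("R")


def partition(domain: Sequence[X], size: int) -> List[Sequence[X]]:
    """Split the domain D into consecutive subsets D_i of at most `size` items."""
    return [domain[i:i + size] for i in range(0, len(domain), size)]


def run_task(
    subset: Sequence[X],
    f: Callable[[X], R],
    is_significant: Callable[[R], bool],
) -> List[int]:
    """T(D_i): evaluate f on every x in D_i, then apply the filter G.

    Returns the indices (within D_i) of the significant inputs.
    """
    results = [f(x) for x in subset]
    return [i for i, r in enumerate(results) if is_significant(r)]


if __name__ == "__main__":
    # Toy computation: f squares its input, G keeps results above 50.
    for d_i in partition(list(range(10)), 4):
        print(d_i, run_task(d_i, lambda x: x * x, lambda r: r > 50))
```

The supervisor would call partition once and ship each subset as a task; each participant runs run_task and returns only the indices.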

15
Our General Approach
  • Transform D_i, f, G into D_i', f', G'
  • Replace task T(D_i) with T'(D_i')
  • Desirable properties
  • T'(D_i') does not leak additional info about
    values in D_i
  • significance in T'(D_i') ≈ significance in
    T(D_i)
  • any difference is reasonably small

16
In Reality
  • Providing desired properties is difficult
  • even with increased flexibility
  • impossible for some apps
  • When possible, application-specific
  • Bottom line: we have a potential approach
  • where few, if any, others exist

17
Application: Genome Sequence Comparison
  • Compare sequences over genome alphabet
  • Σ = {A, C, G, T}
  • Track evolutionary changes by aligning columns of
    sequences (an alignment)
  • E.g. CTGTTA
         CAGTTA

18
Sequence Evolution
  • Deletion: CTGTTA → CTGTA
  • Insertion: CTGTTA → CGTGTTA
  • Substitution: CTGTTA → CAGTTA
  • Insertions and deletions together are called indels

19
Sequence Evolution
  • After several generations
  • Note: # of alignments is huge
  • (for realistic-length sequences)

CTGTTA → CTATGCTACG
20
Alignment Types
  • Global alignment
  • considers entire sequence
  • Local alignment
  • considers substrings
  • biologists usually use local

21
Measuring Alignments
  • Scoring function
  • +1 if symbols match
  • -1 if not
  • Gap penalty
  • g(k) = a + b(k-1)
  • k is gap length
  • (# of consecutive dashes in a single sequence)
  • Alignment score
  • sum of column scores minus gap penalties

22
A Simple Example
  • Global alignment
  • Scoring function: +1 match, -1 no match
  • Gap penalty: g(k) = 2 + 1(k-1)

Sequences: CTGTTA and CTATGCTACG
Column scores: 1 -2 -1 -2 1 -3 1 1 -3
  • Alignment score: 4 - 11 = -7
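The arithmetic of this example can be checked with a short Python sketch. The gap placement used below is an assumption: it is one placement of dashes that reproduces the column scores on the slide, not necessarily the one drawn there. The function names are illustrative.

```python
def gap_penalty(k: int, a: int = 2, b: int = 1) -> int:
    """g(k) = a + b(k-1) for a gap of k consecutive dashes."""
    return a + b * (k - 1)


def alignment_score(top: str, bottom: str, match: int = 1,
                    mismatch: int = -1, a: int = 2, b: int = 1) -> int:
    """Sum of column scores minus gap penalties for a given alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    i = 0
    n = len(top)
    while i < n:
        if top[i] == "-" or bottom[i] == "-":
            # Measure the run of consecutive dashes in the gapped sequence.
            row = top if top[i] == "-" else bottom
            k = 0
            while i < n and row[i] == "-":
                k += 1
                i += 1
            score -= gap_penalty(k, a, b)
        else:
            score += match if top[i] == bottom[i] else mismatch
            i += 1
    return score


if __name__ == "__main__":
    # Assumed gap placement, consistent with the column scores above.
    print(alignment_score("CTG-T--TA--", "C-TATGCTACG"))
```

With this placement there are four matches (+4), one mismatch (-1), two gaps of length 1 (-2 each), and two gaps of length 2 (-3 each), which gives 4 - 11 = -7.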

23
Smith-Waterman
  • Dynamic programming algorithm
  • Produces an optimal alignment
  • Global: O(n^2), Local: O(n^3)
  • Implemented on commercial DVC platforms
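For readers unfamiliar with the dynamic program, here is a minimal Python sketch of local alignment scoring. It is a simplification, not the implementation discussed in the talk: it uses a linear gap cost (each gap column costs the same amount) instead of the g(k) = a + b(k-1) penalty, and it returns only the best score, not the alignment. The function name and defaults are assumptions.

```python
from typing import Sequence


def smith_waterman_score(s: Sequence, t: Sequence, match: int = 1,
                         mismatch: int = -1, gap: int = 2) -> int:
    """Best local alignment score of s and t with a linear gap cost."""
    prev = [0] * (len(t) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] - gap, curr[j - 1] - gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("CTGTTA", "CAGTTA"))
```

Because the function only compares elements for equality, it accepts any indexable sequence, including lists of integers.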

24
Significance in S-W
  • Significance of scores based on probability
  • Empirical evidence
  • given randomly-generated sequences
  • scores exhibit extreme value distribution

25
Determining Significance
  • Choosing a significance threshold p
  • want small probability that a random score > p
  • typically, probability < 0.003

26
A Smith-Waterman Task
  • Pairwise comparison of two sets of sequences, A
    and B
  • A: proprietary sequences
  • B: sequences from public database
  • Returned: indices of well-matched pairs
  • Notation: T(A, B, s, g, p)

27
Our Transformation
  • Use offset sequences
  • compare relative distances between specific
    nucleotides
  • X = GCACTTACGCCCTTACGACG
  • F(X,A) = 3,4,8,3
  • F(X,C) = 2,2,4,2,1,1,4,3
  • F(X,G) = 1,8,8,3
  • F(X,T) = 5,1,7,1
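The offset values above can be reproduced with a small Python function. This is a sketch inferred from the example: each entry is the distance from the previous occurrence of the chosen nucleotide, with the first entry measured from just before the start of the sequence. The function name offsets is illustrative.

```python
from typing import List


def offsets(sequence: str, nucleotide: str) -> List[int]:
    """F(X, n): distances between successive occurrences of `nucleotide`."""
    result = []
    last = 0  # 1-based position of the previous occurrence (0 = before the start)
    for pos, symbol in enumerate(sequence, start=1):
        if symbol == nucleotide:
            result.append(pos - last)
            last = pos
    return result


if __name__ == "__main__":
    x = "GCACTTACGCCCTTACGACG"
    for n in "ACGT":
        print(n, offsets(x, n))
```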

28
Modified Tasks
  • X = GCACTTACGCCCTTACGACG
  • F(X,C) = 2,2,4,2,1,1,4,3
  • Y = GCACTCGCCACTTAGCACG
  • F(Y,C) = 2,2,2,2,1,2,5,2
  • Apply S-W to F(X,C) and F(Y,C)
  • Scoring function, gap penalty
  • Goodness threshold

29
Intuition
  • Similar sequences ⇒ similar offsets
  • consider effects of indels, substitutions
  • What about false positives?
  • multiple nucleotides
  • e.g., assign A and C tasks to distinct participants
  • good match if both tasks indicate significance

30
[Diagram: the A task and the C task for sequences CAGGATCTCAAGC and CAGCATATCACGT are assigned to two distinct participants, Bob 1 and Bob 2.]
31
Using Multiple Nucleotides
  • Maximum method
  • one task for each of A,C,G,T
  • result significant if any of the four indicates
    significance
  • Adding method
  • one task for each of A,C,G,T
  • result significant if sum of four scores
    indicates significance
  • Costs reduced in either case
  • on average, 1/4 length of original sequence
  • runtime for an offset sequence: about 1/64 of the
    original, i.e. (1/4)^3 at O(n^3) cost
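The two combination rules and the cost estimate can be written down as a short Python sketch. The thresholds and score values are hypothetical placeholders; the slides do not say how the threshold for the summed score is chosen.

```python
from typing import Dict


def maximum_method(scores: Dict[str, float], threshold: float) -> bool:
    """Significant if any of the four per-nucleotide scores exceeds the threshold."""
    return any(score > threshold for score in scores.values())


def adding_method(scores: Dict[str, float], sum_threshold: float) -> bool:
    """Significant if the sum of the four scores exceeds a threshold for the sum."""
    return sum(scores.values()) > sum_threshold


def relative_cost(length_fraction: float = 0.25, exponent: int = 3) -> float:
    """Relative runtime when input length shrinks by `length_fraction`
    and the algorithm costs O(n^exponent)."""
    return length_fraction ** exponent


if __name__ == "__main__":
    scores = {"A": 12.0, "C": 31.0, "G": 9.0, "T": 14.0}  # made-up scores
    print(maximum_method(scores, threshold=30.0))
    print(adding_method(scores, sum_threshold=80.0))
    print(relative_cost())  # 0.015625, i.e. 1/64
```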

32
Does This Provide Real Data Privacy?
  • Recall desired properties
  • T'(D_i') does not leak additional info about
    values in D_i
  • significance in T'(D_i') ≈ significance in T(D_i)
  • any difference is reasonably small

33
Data Privacy?
  • Property 1 fails
  • T'(D_i') does leak additional info about values in
    D_i
  • adversary knows all info about one nucleotide
  • How much info is leaked?
  • conditional entropy gives rough estimate
  • e.g., N = 600, |C| ≈ N/4
  • 487 bits (of 1200) leaked
  • 713 bits of uncertainty remain
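One way to arrive at these figures is the following Python sketch. It assumes each position is an independent, uniformly random nucleotide (2 bits per position), so that revealing the positions of one nucleotide leaks the entropy of a yes/no indicator per position. This model is an assumption made here for illustration; the slides only call the number a rough estimate.

```python
import math
from typing import Tuple


def binary_entropy(q: float) -> float:
    """Entropy in bits of a yes/no event with probability q (0 < q < 1)."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def leakage_bits(n: int, fraction: float = 0.25,
                 alphabet: int = 4) -> Tuple[float, float, float]:
    """Return (total, leaked, remaining) bits for a length-n sequence when the
    positions of one nucleotide (a `fraction` of all positions) are revealed."""
    total = n * math.log2(alphabet)
    leaked = n * binary_entropy(fraction)
    # Each unrevealed position is still one of the other alphabet-1 symbols.
    remaining = n * (1 - fraction) * math.log2(alphabet - 1)
    return total, leaked, remaining


if __name__ == "__main__":
    total, leaked, remaining = leakage_bits(600)
    print(round(total), round(leaked), round(remaining))
```

For N = 600 this gives roughly 487 leaked bits and 713 remaining bits out of 1200.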

34
Analysis
  • Clearly, not provable security
  • Suggests two questions
  • Can adversary determine additional symbols if
    so, how many?
  • How much info leakage is too much?

35
4 out of 5 Biologists Agree
  • Given only the position of a single nucleotide
    literal
  • No additional nucleotides can be inferred
  • No biologically useful information that can be
    inferred
  • Given current understanding of the structure and
    function of the genome

36
Does It Work?
  • In general, yes
  • strong correlation between our scores and S-W
  • not as sensitive as S-W
  • some weak matches missed
  • Via statistical inference
  • very few false positives (< 10^-4)
  • very few false negatives (usually none)

37
An Extension
  • Sequences can be masked
  • For each task, choose random binary mask
  • Remove from the sequence all elements where the
    mask is zero
  • Our experiments suggest a mask with 1 in 90% of
    positions works well

X:      2 2 4 2 1 1 4 3
mask:   1 1 1 0 1 1 1 0
result: 2 2 4 1 1 4
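The masking step can be sketched in Python as follows. The interpretation of "1 in 90% of positions" as keeping each position independently with probability 0.9 is an assumption, and the function names are illustrative.

```python
import random
from typing import List, Optional


def random_mask(length: int, keep_fraction: float = 0.9,
                rng: Optional[random.Random] = None) -> List[int]:
    """Random binary mask; each position is 1 with probability keep_fraction."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_fraction else 0 for _ in range(length)]


def apply_mask(offset_sequence: List[int], mask: List[int]) -> List[int]:
    """Drop every element of the offset sequence whose mask bit is 0."""
    if len(offset_sequence) != len(mask):
        raise ValueError("mask and sequence must have the same length")
    return [o for o, m in zip(offset_sequence, mask) if m == 1]


if __name__ == "__main__":
    x = [2, 2, 4, 2, 1, 1, 4, 3]
    print(apply_mask(x, [1, 1, 1, 0, 1, 1, 1, 0]))  # the example from the slide
    print(apply_mask(x, random_mask(len(x))))       # a fresh random mask
```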
38
Simulation Results
  • Well-matched sequences artificially generated
  • Substring mutated over several generations
  • Placed at random location into random sequences
  • Scoring function: +1 match, -1 no match
  • Gap penalty: g(k) = 2 + 1(k-1)

39
  • 10000 comparisons, no mask, maximum method
  • Sequence length 600-800, matching portion length
    300, average of 52.5 subs and 52.5 indels

40
  • 10000 comparisons, no mask, adding method
  • Sequence length 600-800, matching portion length
    300, average of 52.5 subs and 52.5 indels

41
  • 1000 comparisons, no mask, maximum method
  • Sequence length 2000, matching portion length
    1000, average of 150 subs and 150 indels

42
  • 1000 comparisons, 90% mask, maximum method
  • Sequence length 1000-1300, matching portion
    length 500, average of 86.25 subs and 86.25 indels

43
Conclusions
  • Introduced notion of sufficient accuracy
  • Presented a strategy for enhancing data privacy
    in important real-world application
  • Presented an important real-world app that
  • requires privacy
  • efficiently parallelizable
  • Potential first entry for benchmark suite of apps
    for privacy study

44
Future Work
  • Solution is less than ideal
  • lack of formal privacy model / provable security
  • need more testing on real genetic data
  • But it's a start
  • general problem is very difficult
  • this is a potential avenue of attack
  • S-W requires more careful study in this context
  • Consider additional apps

45
Ongoing DVC Work @ UR
  • Augmenting BOINC software for campus-wide
    distribution
  • want to collect participant/server/data info
    patterns
  • Greg Steffensen
  • Exploring AI to catch malicious behavior
  • can we catch omitted results?
  • Andy White, Matt Kretchmar (Denison CS)

46
Thanks
  • NSF CyberTrust
  • Doug Szajda, Jason Owen
  • All the UR students
  • UR Biologists
  • Rafael de Sa, Laura Runyen-Janecky, Joe Gindhart
  • Tadayoshi Kohno (UCSD)