Title: Privacy-preserving Reinforcement Learning
1. Privacy-preserving Reinforcement Learning
Jun Sakuma, Shigenobu Kobayashi (Tokyo Inst. of Tech.)
Rebecca N. Wright (Rutgers Univ.)
2. Motivating application: Load balancing
[Figure: two factories, each receiving orders, producing, and shipping; a job is redirected to the other factory when one is heavily loaded]
- Load balancing among competing factories
- A factory obtains a reward by processing a job, but suffers a large penalty if an overflow happens
- A factory may need to redirect jobs to the other factory when it is heavily loaded
- When should a factory redirect jobs to the other factory?
3. Motivating application: Load balancing
[Figure: the same two competing factories; jobs are redirected when heavily loaded]
- If the two factories are competitors:
  - The frequency of orders and the speed of production are private (private model)
  - The backlog is private (private state observation)
  - The profit is private (private reward)
- Privacy-preserving reinforcement learning:
  - States, actions, and rewards are not shared
  - But the learned policy is shared in the end
4. Definition of Privacy
- Partitioned-by-time model
  - Agents share the state space, the action space, and the reward function
  - Agents cannot interact with the environment simultaneously
[Figure: Alice and Bob take turns interacting with the environment, each observing state s_t and reward r_t and issuing action a_t; Alice's (s_t, a_t, r_t) are observed for t = 0 to T1, Bob's for t = T1 to T1+T2, Alice's again for t = T1+T2 to T1+T2+T3]
5. Definition of Privacy
- Partitioned-by-observation model
  - State spaces and action spaces are mutually exclusive between agents
  - Agents interact with the environment simultaneously
[Figure: Alice and Bob interact with the environment at the same time; Alice observes state s_t^A and reward r_t^A and takes action a_t^A, so her perception is (s_t^A, a_t^A, r_t^A) for t = 0, 1, ...; Bob likewise perceives (s_t^B, a_t^B, r_t^B)]
6. Are existing RL approaches privacy-preserving?
- Centralized RL (CRL): a leader agent collects all observations and learns
- Distributed RL (DRL) [Schneider99, Ng05]: each distributed agent shares partial observations and learns
- Independent DRL (IDRL): each agent learns independently, with no sharing
- Our target: achieve privacy preservation without sacrificing optimality
7. Privacy-preserving Reinforcement Learning
- Algorithm
  - Tabular SARSA learning with epsilon-greedy action selection
- Overview
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
8. Building block: Homomorphic public-key cryptosystem
- Public-key cryptosystem
  - A pair of public and secret keys (pk, sk)
  - Encryption: c = e_pk(m; r), where m is an integer and r is a random integer
  - Decryption: m = d_sk(c)
- Homomorphic public-key cryptosystem
  - Addition of ciphertexts: e_pk(m1; r1) * e_pk(m2; r2) = e_pk(m1 + m2; r1*r2)
  - Multiplication by a constant: e_pk(m; r)^k = e_pk(k*m; r^k)
  - The Paillier cryptosystem [Pai99] is homomorphic (toy sketch below)
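To make the two homomorphic properties above concrete, here is a minimal, self-contained Paillier sketch in Python. The key sizes, function names, and dictionary layout are illustrative assumptions for readability, not the talk's actual implementation (which was in Java with Fairplay), and the toy primes are far too small for real security.

```python
# Toy Paillier cryptosystem illustrating the homomorphic properties used above:
#   e_pk(m1) * e_pk(m2) is an encryption of m1 + m2, and e_pk(m)^k of k*m.
# Toy primes only -- real deployments use keys of 1024 bits or more.
import math
import secrets

def keygen(p=1_000_003, q=1_000_033):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                 # valid because we fix the generator g = n + 1
    return (n,), (n, lam, mu)            # public key, secret key

def encrypt(pk, m):
    (n,) = pk
    r = secrets.randbelow(n - 1) + 1     # fresh randomness for each encryption
    return (pow(n + 1, m % n, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    n, lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pk, sk = keygen()
c1, c2 = encrypt(pk, 17), encrypt(pk, 25)
n2 = pk[0] ** 2
assert decrypt(sk, (c1 * c2) % n2) == 42        # addition of plaintexts under encryption
assert decrypt(sk, pow(c1, 3, n2)) == 51        # multiplication by a public constant
```

The protocol only ever adds plaintexts and multiplies them by public constants, which is exactly what this additive homomorphism supports.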
9. Building block: Random shares
[Figure: a secret x is split between Alice and Bob as random shares a and b; the modulus N is public]
- (a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N
10. Building block: Random shares
[Figure: secret x = 6 is split into shares a = 15 and b = 14 with public N = 23]
- (a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N
- Example (see the sketch below)
  - a = 15 and b = 14
  - 6 = 15 + 14 (= 29) mod 23
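A few-line sketch of this splitting, using the N and x from the example above; the function name is illustrative.

```python
# Additive random shares: split a secret x into (a, b) with a + b = x (mod N).
# Each share on its own is uniformly distributed and reveals nothing about x.
import secrets

def split(x, N):
    a = secrets.randbelow(N)        # first share, uniform in [0, N)
    b = (x - a) % N                 # second share
    return a, b

N, x = 23, 6
a, b = split(x, N)
assert (a + b) % N == x             # e.g. a = 15, b = 14: 15 + 14 = 29 = 6 (mod 23)
```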
11. Building block: Private comparison
- Private comparison via Secure Function Evaluation [Yao86]
  - allows parties to evaluate a specified function f of their private inputs (the evaluated function is sketched below)
  - after the SFE, the parties learn nothing about each other's inputs beyond the output
[Figure: Alice with private input x and Bob with private input y run a private comparison; the output is 0 if x > y and 1 otherwise (here, output 0)]
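For reference, the function that the SFE evaluates is just the comparison below, written in the clear for illustration only. In the protocol it would be evaluated with a secure realization such as Yao's garbled circuits (Fairplay in the experiments), so neither x nor y is disclosed.

```python
# The comparison functionality, in the clear (illustration only).  A secure
# realization (e.g. Yao's garbled circuits) computes the same one-bit result
# without either party seeing the other's input.
def compare(x, y):
    return 0 if x > y else 1

assert compare(8, 3) == 0    # matches the slide: x > y gives output 0
```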
12. Privacy-preserving Reinforcement Learning
- Protocol for the partitioned-by-time model
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
13. Step 1: Initialization of Q-values
- Alice: Learn Q-values Q(s,a) from t = 0 to T1
- Alice: Generate a pair of keys (pk, sk)
- Alice: Compute c(s,a) = enc_pk(Q(s,a)) and send the ciphertexts to Bob (sketched below)
[Figure: during t = 0 to T1, Alice interacts with the environment (state s_t, reward r_t, action a_t) and learns her Q-values; the encrypted Q-values are then handed to Bob]
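A sketch of Step 1, reusing the toy keygen/encrypt helpers from the Paillier sketch above. The dictionary layout of the Q-table and the assumption that Q-values are already encoded as non-negative integers are illustrative choices, not the paper's exact representation.

```python
# Step 1 sketch: Alice encrypts her learned Q-table entry-wise with her public key
# and ships the ciphertexts to Bob.  Bob holds c(s, a) but cannot decrypt it.
# Assumes Q-values are already encoded as non-negative integers (e.g. fixed-point).
def encrypt_q_table(pk, Q):
    return {(s, a): encrypt(pk, q) for (s, a), q in Q.items()}

Q_alice = {(0, "redirect"): 38, (0, "no redirect"): 50}   # toy values learned up to t = T1
c = encrypt_q_table(pk, Q_alice)                          # this is what Alice sends to Bob
```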
16. Privacy-preserving Reinforcement Learning
- The protocol overview
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
17-21. Step 2-3: Private Action Selection (greedy)
- Bob: Observe state s_t and reward r_t
- Bob: For all a, split the encrypted Q(s_t, a) into random shares and send Alice her shares
- Bob and Alice: Run a private comparison of the random shares to learn the greedy action a_t (sketched below)
[Figure: during Bob's interaction period, Bob observes s_t from the environment, the encrypted Q-values are split into random shares, a private comparison over the shares selects the greedy action, and Bob issues a_t to the environment]
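The following sketch walks through Steps 2-3 with the toy Paillier helpers from above. The share modulus is taken to be the Paillier plaintext modulus n (an assumption for simplicity), and greedy_action stands in for the SFE-based private comparison: it reconstructs the Q-values only to pick the argmax, which a real run would never do.

```python
# Steps 2-3 sketch: Bob masks each encrypted Q(s_t, a) with a random share, Alice
# decrypts her shares, and the greedy action is chosen by comparing shares.
import secrets

def bob_split(pk, c_row):
    """Bob: for each action a, pick r_B(a) and form c'(a) = c(a) * enc_pk(-r_B(a))."""
    n = pk[0]
    r_B, masked = {}, {}
    for a, c in c_row.items():
        r_B[a] = secrets.randbelow(n)
        masked[a] = (c * encrypt(pk, -r_B[a])) % (n * n)
    return r_B, masked            # Bob keeps r_B, sends `masked` to Alice

def alice_decrypt_shares(sk, masked):
    """Alice: r_A(a) = dec_sk(c'(a)) = Q(s_t, a) - r_B(a)  (mod n)."""
    return {a: decrypt(sk, c) for a, c in masked.items()}

def greedy_action(r_A, r_B, n):
    """Stand-in for the private comparison: argmax_a of (r_A(a) + r_B(a)) mod n."""
    return max(r_A, key=lambda a: (r_A[a] + r_B[a]) % n)
```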
22. Privacy-preserving Reinforcement Learning
- The protocol overview
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
23-24. Step 4: Private Update of Q-values
- After the greedy action selection, Bob observes (r_t, s_{t+1})
- How can Bob update the encrypted Q-value c(s_t, a_t) from (s_t, a_t, r_t, s_{t+1})?
- Regular SARSA update:
  Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
  where a_{t+1} is taken by Bob (greedy), r_t and s_{t+1} are observed, and the Q-values are available to Bob only as encryptions
- Can Bob update the encrypted Q-values without decrypting them?
25-29. Step 4: Private Update of Q-values
- The SARSA update can be rewritten as Q(s_t, a_t) <- K * Q(s_t, a_t) + L, where the coefficient K = 1 - alpha is public, and L = alpha * (r_t + gamma * Q(s_{t+1}, a_{t+1})) depends only on r_t, which Bob observes, and on Q(s_{t+1}, a_{t+1}), which Bob holds in encrypted form
- Using the homomorphic operations (addition under encryption and multiplication by a public constant), Bob applies this update to the ciphertext itself:
  c(s_t, a_t) <- c(s_t, a_t)^K * c(s_{t+1}, a_{t+1})^(alpha*gamma) * enc_pk(alpha * r_t),
  with the real-valued coefficients handled through a public fixed-point (integer) encoding
- Bob can update c(s,a) without knowledge of Q(s,a)! (sketched below)
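A sketch of this homomorphic update, reusing the toy encrypt helper from above. Writing alpha = a/D and gamma = g/D for public integers a, g, D is an illustrative simplification: it makes every coefficient an integer, at the cost of the result being scaled by D^2, and the growing scale would have to be tracked across updates. The paper's fixed-point treatment differs in detail.

```python
# Step 4 sketch: the SARSA target
#   Q(s_t,a_t) <- (1 - alpha)*Q(s_t,a_t) + alpha*(r_t + gamma*Q(s_{t+1},a_{t+1}))
# has only PUBLIC coefficients, so Bob evaluates it on ciphertexts alone.
# Here alpha = a/D and gamma = g/D for public integers, so the result encrypts
# D^2 times the new Q-value (illustrative simplification of fixed-point encoding).
def bob_update(pk, c, c_next, r_t, a=1, g=9, D=10):
    """c, c_next: encryptions of integer Q(s_t,a_t) and Q(s_{t+1},a_{t+1}); r_t is Bob's."""
    n = pk[0]
    n2 = n * n
    out = pow(c, (D - a) * D, n2)                   # (1 - alpha) * D^2 * Q(s_t, a_t)
    out = (out * pow(c_next, a * g, n2)) % n2       # + alpha*gamma * D^2 * Q(s_{t+1}, a_{t+1})
    out = (out * encrypt(pk, a * D * r_t)) % n2     # + alpha * D^2 * r_t
    return out                                      # encryption of D^2 * (updated Q-value)
```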
30. Privacy-preserving Reinforcement Learning
- The protocol overview
  - (Step 1) Initialization of Q-values
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
  - (Step 4) Private update of Q-values
  - Go to Step 2
- Not covered in this talk, but treated in a similar manner:
  - The partitioned-by-observation model
  - Epsilon-greedy action selection
  - Q-learning
31. Experiment: Load balancing among factories
[Figure: each factory receives a job w.p. p_in and processes a job w.p. p_out; the snapshot shows backlogs s^A = 5 and s^B = 2 with actions a^A = redirect and a^B = no redirect]
- Setting
  - State space: s^A, s^B in {0, 1, ..., 5}
  - Action space: a^A, a^B in {redirect, no redirect}
  - Reward
    - Backlog cost: r^A = 50 - (s^A)^2
    - Redirection cost: r^A <- r^A - 2
    - Overflow cost: r^A = 0
    - Reward r^B is set similarly
    - System reward: r_t = r_t^A + r_t^B (sketched below)
- Compared methods: Regular RL / PPRL, DRL (rewards are shared), IDRL (no sharing)
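A small sketch of the per-factory reward listed above. How overflow and redirection are detected is an assumption about the environment that the slide does not spell out; the function and argument names are illustrative.

```python
# Reward for factory A as described in the setting (toy interpretation: overflow means
# the backlog would exceed the maximum of 5; redirecting a job costs an extra 2).
def reward_A(backlog_A, redirected, overflowed):
    if overflowed:
        return 0                    # overflow: the reward is wiped out
    r = 50 - backlog_A ** 2         # smaller backlog -> larger reward
    if redirected:
        r -= 2                      # cost of redirecting the job to the other factory
    return r

# The learner optimizes the system reward r_t = r_t^A + r_t^B.
```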
32. Experiment: Load balancing among factories
[Figure: results on the same load-balancing setup; implementation in Java 1.5.0 with Fairplay, run on a 1.2 GHz Core Solo]
33. Summary
- Reinforcement learning from private observations
  - Achieves the same optimality as regular RL
  - Privacy preservation is guaranteed theoretically
  - The computational load is higher than regular RL, but the protocol runs efficiently on a 36-state / 4-action problem
- Future work
  - Scalability
  - Treatment of agents with competing reward functions
  - Game-theoretic analysis
34. Thank you!
35. Step 2-3: Private Action Selection (greedy)
- Bob: Observe state s_t and reward r_t
- Bob: For all a, pick random shares r^B(s_t, a), compute c'(s_t, a) = c(s_t, a) * enc_pk(-r^B(s_t, a)), and send the c'(s_t, a) to Alice
- Alice: For all a, decrypt to obtain her shares r^A(s_t, a) = dec_sk(c'(s_t, a))
- Bob and Alice: Run a private comparison of the random shares to learn the greedy action a_t
[Figure: Bob masks the encrypted Q-values with his random shares, Alice decrypts to get her shares, and a private comparison over the shares yields a_t]
36. Distributed Reinforcement Learning
[Figure: Alice and Bob each interact with the environment, observing (s^A, r^A, a^A) and (s^B, r^B, a^B) respectively]
- Distributed Value Function [Schneider99]
  - Manages a huge state-action space
  - Suppresses memory consumption
- Policy gradient approach [Peshkin00, Moallemi03, Bagnell05]
  - Limits the communication
- DRL learns good, but sub-optimal, policies with minimal or limited sharing of the agents' perceptions