Title: Privacy-preserving Reinforcement Learning
1. Privacy-preserving Reinforcement Learning
Jun Sakuma, Shigenobu Kobayashi (Tokyo Inst. of Tech.)
Rebecca N. Wright (Rutgers Univ.)
2. Motivating application: Load balancing
[Figure: two factories, each receiving orders, producing, and shipping; a job is redirected to the other factory when one is heavily loaded]
- Load balancing among competing factories
- A factory obtains a reward by processing a job, but suffers a large penalty if an overflow happens
- A factory may need to redirect jobs to the other factory when it is heavily loaded
- When should a factory redirect jobs to the other factory?
3. Motivating application: Load balancing
[Figure: the same two competing factories; jobs are redirected when heavily loaded]
- If the two factories are competitors:
  - The frequency of orders and the speed of production are private (private model)
  - The backlog is private (private state observation)
  - The profit is private (private reward)
- Privacy-preserving reinforcement learning:
  - States, actions, and rewards are not shared
  - But the learned policy is shared in the end
4. Definition of Privacy
- Partitioned-by-time model
  - Agents share the state space, the action space, and the reward function
  - Agents cannot interact with the environment simultaneously
[Figure: Alice and Bob take turns interacting with the environment, each observing state s_t and reward r_t and issuing action a_t; Alice's (s_t, a_t, r_t) are observed for t = 0 to T1, Bob's for t = T1 to T1+T2, Alice's again for t = T1+T2 to T1+T2+T3]
5. Definition of Privacy
- Partitioned-by-observation model
  - State spaces and action spaces are mutually exclusive between agents
  - Agents interact with the environment simultaneously
[Figure: Alice and Bob interact with the environment at the same time; Alice observes state s_t^A and reward r_t^A and takes action a_t^A, so her perception is (s_t^A, a_t^A, r_t^A) for t = 0, 1, ...; Bob likewise perceives (s_t^B, a_t^B, r_t^B)]
6. Are existing RL approaches privacy-preserving?
- Centralized RL (CRL): a leader agent collects all observations and learns
- Distributed RL (DRL) [Schneider99, Ng05]: each distributed agent shares partial observations and learns
- Independent DRL (IDRL): each agent learns independently, with no sharing
- Our target: achieve privacy preservation without sacrificing optimality
7. Privacy-preserving Reinforcement Learning
- Algorithm
  - Tabular SARSA learning with epsilon-greedy action selection
- Overview
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
8. Building block: Homomorphic public-key cryptosystem
- Public-key cryptosystem
  - A pair of public and secret keys (pk, sk)
  - Encryption: c = e_pk(m; r), where m is an integer and r is a random integer
  - Decryption: m = d_sk(c)
- Homomorphic public-key cryptosystem
  - Addition of ciphertexts: e_pk(m1; r1) * e_pk(m2; r2) = e_pk(m1 + m2; r1*r2)
  - Multiplication by a constant: e_pk(m; r)^k = e_pk(k*m; r^k)
  - The Paillier cryptosystem [Pai99] is homomorphic (toy sketch below)
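To make the two homomorphic properties above concrete, here is a minimal, self-contained Paillier sketch in Python. The key sizes, function names, and dictionary layout are illustrative assumptions for readability, not the talk's actual implementation (which was in Java with Fairplay), and the toy primes are far too small for real security.

```python
# Toy Paillier cryptosystem illustrating the homomorphic properties used above:
#   e_pk(m1) * e_pk(m2) is an encryption of m1 + m2, and e_pk(m)^k of k*m.
# Toy primes only -- real deployments use keys of 1024 bits or more.
import math
import secrets

def keygen(p=1_000_003, q=1_000_033):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                 # valid because we fix the generator g = n + 1
    return (n,), (n, lam, mu)            # public key, secret key

def encrypt(pk, m):
    (n,) = pk
    r = secrets.randbelow(n - 1) + 1     # fresh randomness for each encryption
    return (pow(n + 1, m % n, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    n, lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pk, sk = keygen()
c1, c2 = encrypt(pk, 17), encrypt(pk, 25)
n2 = pk[0] ** 2
assert decrypt(sk, (c1 * c2) % n2) == 42        # addition of plaintexts under encryption
assert decrypt(sk, pow(c1, 3, n2)) == 51        # multiplication by a public constant
```

The protocol only ever adds plaintexts and multiplies them by public constants, which is exactly what this additive homomorphism supports.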
9. Building block: Random shares
[Figure: a secret x is split between Alice and Bob as random shares a and b; the modulus N is public]
- (a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N
10. Building block: Random shares
[Figure: secret x = 6 is split into shares a = 15 and b = 14 with public N = 23]
- (a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N
- Example (see the sketch below)
  - a = 15 and b = 14
  - 6 = 15 + 14 (= 29) mod 23
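A few-line sketch of this splitting, using the N and x from the example above; the function name is illustrative.

```python
# Additive random shares: split a secret x into (a, b) with a + b = x (mod N).
# Each share on its own is uniformly distributed and reveals nothing about x.
import secrets

def split(x, N):
    a = secrets.randbelow(N)        # first share, uniform in [0, N)
    b = (x - a) % N                 # second share
    return a, b

N, x = 23, 6
a, b = split(x, N)
assert (a + b) % N == x             # e.g. a = 15, b = 14: 15 + 14 = 29 = 6 (mod 23)
```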
11. Building block: Private comparison
- Private comparison via Secure Function Evaluation [Yao86]
  - allows parties to evaluate a specified function f of their private inputs (the evaluated function is sketched below)
  - after the SFE, the parties learn nothing about each other's inputs beyond the output
[Figure: Alice with private input x and Bob with private input y run a private comparison; the output is 0 if x > y and 1 otherwise (here, output 0)]
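For reference, the function that the SFE evaluates is just the comparison below, written in the clear for illustration only. In the protocol it would be evaluated with a secure realization such as Yao's garbled circuits (Fairplay in the experiments), so neither x nor y is disclosed.

```python
# The comparison functionality, in the clear (illustration only).  A secure
# realization (e.g. Yao's garbled circuits) computes the same one-bit result
# without either party seeing the other's input.
def compare(x, y):
    return 0 if x > y else 1

assert compare(8, 3) == 0    # matches the slide: x > y gives output 0
```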
12. Privacy-preserving Reinforcement Learning
- Protocol for the partitioned-by-time model
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
13. Step 1: Initialization of Q-values
- Alice: Learn Q-values Q(s,a) from t = 0 to T1
- Alice: Generate a pair of keys (pk, sk)
- Alice: Compute c(s,a) = enc_pk(Q(s,a)) and send the ciphertexts to Bob (sketched below)
[Figure: during t = 0 to T1, Alice interacts with the environment (state s_t, reward r_t, action a_t) and learns her Q-values; the encrypted Q-values are then handed to Bob]
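A sketch of Step 1, reusing the toy keygen/encrypt helpers from the Paillier sketch above. The dictionary layout of the Q-table and the assumption that Q-values are already encoded as non-negative integers are illustrative choices, not the paper's exact representation.

```python
# Step 1 sketch: Alice encrypts her learned Q-table entry-wise with her public key
# and ships the ciphertexts to Bob.  Bob holds c(s, a) but cannot decrypt it.
# Assumes Q-values are already encoded as non-negative integers (e.g. fixed-point).
def encrypt_q_table(pk, Q):
    return {(s, a): encrypt(pk, q) for (s, a), q in Q.items()}

Q_alice = {(0, "redirect"): 38, (0, "no redirect"): 50}   # toy values learned up to t = T1
c = encrypt_q_table(pk, Q_alice)                          # this is what Alice sends to Bob
```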
16. Privacy-preserving Reinforcement Learning
- The protocol overview
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
17-21. Step 2-3: Private Action Selection (greedy)
- Bob: Observe state s_t and reward r_t
- Bob: For all a, split the encrypted Q(s_t, a) into random shares and send Alice her shares
- Bob and Alice: Run a private comparison of the random shares to learn the greedy action a_t (sketched below)
[Figure: during Bob's interaction period, Bob observes s_t from the environment, the encrypted Q-values are split into random shares, a private comparison over the shares selects the greedy action, and Bob issues a_t to the environment]
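The following sketch walks through Steps 2-3 with the toy Paillier helpers from above. The share modulus is taken to be the Paillier plaintext modulus n (an assumption for simplicity), and greedy_action stands in for the SFE-based private comparison: it reconstructs the Q-values only to pick the argmax, which a real run would never do.

```python
# Steps 2-3 sketch: Bob masks each encrypted Q(s_t, a) with a random share, Alice
# decrypts her shares, and the greedy action is chosen by comparing shares.
import secrets

def bob_split(pk, c_row):
    """Bob: for each action a, pick r_B(a) and form c'(a) = c(a) * enc_pk(-r_B(a))."""
    n = pk[0]
    r_B, masked = {}, {}
    for a, c in c_row.items():
        r_B[a] = secrets.randbelow(n)
        masked[a] = (c * encrypt(pk, -r_B[a])) % (n * n)
    return r_B, masked            # Bob keeps r_B, sends `masked` to Alice

def alice_decrypt_shares(sk, masked):
    """Alice: r_A(a) = dec_sk(c'(a)) = Q(s_t, a) - r_B(a)  (mod n)."""
    return {a: decrypt(sk, c) for a, c in masked.items()}

def greedy_action(r_A, r_B, n):
    """Stand-in for the private comparison: argmax_a of (r_A(a) + r_B(a)) mod n."""
    return max(r_A, key=lambda a: (r_A[a] + r_B[a]) % n)
```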
22. Privacy-preserving Reinforcement Learning
- The protocol overview
  - (Step 1) Initialization of Q-values
    - Building block 1: Homomorphic cryptosystem
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
    - Building block 2: Random shares
    - Building block 3: Private comparison by Secure Function Evaluation
  - (Step 4) Private update of Q-values
  - Go to Step 2
23-24. Step 4: Private Update of Q-values
- After the greedy action selection, Bob observes (r_t, s_{t+1})
- How can Bob update the encrypted Q-value c(s_t, a_t) from (s_t, a_t, r_t, s_{t+1})?
- Regular SARSA update:
  Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
  where a_{t+1} is taken by Bob (greedy), r_t and s_{t+1} are observed, and the Q-values are available to Bob only as encryptions
- Can Bob update the encrypted Q-values without decrypting them?
25-29. Step 4: Private Update of Q-values
- The SARSA update can be rewritten as Q(s_t, a_t) <- K * Q(s_t, a_t) + L, where the coefficient K = 1 - alpha is public, and L = alpha * (r_t + gamma * Q(s_{t+1}, a_{t+1})) depends only on r_t, which Bob observes, and on Q(s_{t+1}, a_{t+1}), which Bob holds in encrypted form
- Using the homomorphic operations (addition under encryption and multiplication by a public constant), Bob applies this update to the ciphertext itself:
  c(s_t, a_t) <- c(s_t, a_t)^K * c(s_{t+1}, a_{t+1})^(alpha*gamma) * enc_pk(alpha * r_t),
  with the real-valued coefficients handled through a public fixed-point (integer) encoding
- Bob can update c(s,a) without knowledge of Q(s,a)! (sketched below)
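A sketch of this homomorphic update, reusing the toy encrypt helper from above. Writing alpha = a/D and gamma = g/D for public integers a, g, D is an illustrative simplification: it makes every coefficient an integer, at the cost of the result being scaled by D^2, and the growing scale would have to be tracked across updates. The paper's fixed-point treatment differs in detail.

```python
# Step 4 sketch: the SARSA target
#   Q(s_t,a_t) <- (1 - alpha)*Q(s_t,a_t) + alpha*(r_t + gamma*Q(s_{t+1},a_{t+1}))
# has only PUBLIC coefficients, so Bob evaluates it on ciphertexts alone.
# Here alpha = a/D and gamma = g/D for public integers, so the result encrypts
# D^2 times the new Q-value (illustrative simplification of fixed-point encoding).
def bob_update(pk, c, c_next, r_t, a=1, g=9, D=10):
    """c, c_next: encryptions of integer Q(s_t,a_t) and Q(s_{t+1},a_{t+1}); r_t is Bob's."""
    n = pk[0]
    n2 = n * n
    out = pow(c, (D - a) * D, n2)                   # (1 - alpha) * D^2 * Q(s_t, a_t)
    out = (out * pow(c_next, a * g, n2)) % n2       # + alpha*gamma * D^2 * Q(s_{t+1}, a_{t+1})
    out = (out * encrypt(pk, a * D * r_t)) % n2     # + alpha * D^2 * r_t
    return out                                      # encryption of D^2 * (updated Q-value)
```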
30. Privacy-preserving Reinforcement Learning
- The protocol overview
  - (Step 1) Initialization of Q-values
  - (Step 2) Observation from the environment
  - (Step 3) Private action selection
  - (Step 4) Private update of Q-values
  - Go to Step 2
- Not covered in this talk, but treated in a similar manner:
  - The partitioned-by-observation model
  - Epsilon-greedy action selection
  - Q-learning
31. Experiment: Load balancing among factories
[Figure: each factory receives a job w.p. p_in and processes a job w.p. p_out; the snapshot shows backlogs s^A = 5 and s^B = 2 with actions a^A = redirect and a^B = no redirect]
- Setting
  - State space: s^A, s^B in {0, 1, ..., 5}
  - Action space: a^A, a^B in {redirect, no redirect}
  - Reward
    - Backlog cost: r^A = 50 - (s^A)^2
    - Redirection cost: r^A <- r^A - 2
    - Overflow cost: r^A = 0
    - Reward r^B is set similarly
    - System reward: r_t = r_t^A + r_t^B (sketched below)
- Compared methods: Regular RL / PPRL, DRL (rewards are shared), IDRL (no sharing)
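A small sketch of the per-factory reward listed above. How overflow and redirection are detected is an assumption about the environment that the slide does not spell out; the function and argument names are illustrative.

```python
# Reward for factory A as described in the setting (toy interpretation: overflow means
# the backlog would exceed the maximum of 5; redirecting a job costs an extra 2).
def reward_A(backlog_A, redirected, overflowed):
    if overflowed:
        return 0                    # overflow: the reward is wiped out
    r = 50 - backlog_A ** 2         # smaller backlog -> larger reward
    if redirected:
        r -= 2                      # cost of redirecting the job to the other factory
    return r

# The learner optimizes the system reward r_t = r_t^A + r_t^B.
```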
32. Experiment: Load balancing among factories
[Figure: results on the same load-balancing setup; implementation in Java 1.5.0 with Fairplay, run on a 1.2 GHz Core Solo]
33. Summary
- Reinforcement learning from private observations
  - Achieves the same optimality as regular RL
  - Privacy preservation is guaranteed theoretically
  - The computational load is higher than regular RL, but the protocol runs efficiently on a 36-state / 4-action problem
- Future work
  - Scalability
  - Treatment of agents with competing reward functions
  - Game-theoretic analysis
34. Thank you!
35. Step 2-3: Private Action Selection (greedy)
- Bob: Observe state s_t and reward r_t
- Bob: For all a, pick random shares r^B(s_t, a), compute c'(s_t, a) = c(s_t, a) * enc_pk(-r^B(s_t, a)), and send the c'(s_t, a) to Alice
- Alice: For all a, decrypt to obtain her shares r^A(s_t, a) = dec_sk(c'(s_t, a))
- Bob and Alice: Run a private comparison of the random shares to learn the greedy action a_t
[Figure: Bob masks the encrypted Q-values with his random shares, Alice decrypts to get her shares, and a private comparison over the shares yields a_t]
36. Distributed Reinforcement Learning
[Figure: Alice and Bob each interact with the environment, observing (s^A, r^A, a^A) and (s^B, r^B, a^B) respectively]
- Distributed Value Function [Schneider99]
  - Manages a huge state-action space
  - Suppresses memory consumption
- Policy gradient approach [Peshkin00, Moallemi03, Bagnell05]
  - Limits the communication
- DRL learns good, but sub-optimal, policies with minimal or limited sharing of the agents' perceptions