Title: Giving Advice about Preferred Actions to Reinforcement Learners Via Knowledge-Based Kernel Regression

1. Giving Advice about Preferred Actions to Reinforcement Learners Via Knowledge-Based Kernel Regression
- Richard Maclin
- University of Minnesota-Duluth
- Jude Shavlik, Lisa Torrey, Trevor Walker, Edward Wild
- University of Wisconsin-Madison
2. Goal
- Given
  - Environment to explore
  - Reinforcements for that environment
  - Advice from a human observer
- Do
  - Learn a good policy for the environment
(Figure caption: "Pass to your teammate!!")
3. Our Contribution
- A natural form of advice for reinforcement learning (RL): preference advice
- Advice format:
  - If <agent is in this region of feature space>
  - Then Prefer Action1 To Action2
- Advice about the policy rather than about Q values
4. Desiderata for Advice-Taking
- Human observer expresses advice naturally and without knowledge of the ML agent's internals
- Agent incorporates advice directly into the function it is learning
- Additional feedback (rewards, more advice) is used to continually refine the learner
5. Advice in Knowledge-Based Kernel Regression (KBKR; Mangasarian, Shavlik & Wild, JMLR 2004)
- If the goal center is close and the goalie isn't covering it
- Then shoot!
- Formally:
  If distGoalCenter ≤ 15 and angleGoalieGCenter ≥ 25
  Then Q(shoot) ≥ 0.9
6. Preference Advice (Pref-KBKR)
- Likely hard for the user to generate "Q(shoot) ≥ 0.9"
- Would be more useful to say "Shoot is better than Pass":
  If distGoalCenter ≤ 15 and angleGoalieGCenter ≥ 25
  Then Prefer Shoot to Pass
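A piece of preference advice like the one above can be represented as a region of feature space plus a preferred/nonpreferred action pair. The sketch below is not the authors' code; the class name, field names, and the margin value β = 0.1 are made up for illustration.

```python
# Hypothetical container for one piece of preference advice:
# "If Bx <= d Then Prefer `preferred` to `nonpreferred`".
from dataclasses import dataclass
import numpy as np

@dataclass
class PreferenceAdvice:
    B: np.ndarray        # (m, n) matrix defining the advice region Bx <= d
    d: np.ndarray        # (m,) right-hand side of the region constraints
    preferred: str       # action the advice prefers inside the region
    nonpreferred: str    # action it is preferred to
    beta: float          # required margin Q_pref(x) - Q_nonpref(x) >= beta

    def region_contains(self, x: np.ndarray) -> bool:
        """True if state x falls inside the advice region."""
        return bool(np.all(self.B @ x <= self.d))

# Encoding "If distGoalCenter <= 15 and angleGoalieGCenter >= 25
# Then Prefer Shoot to Pass" with features
# x = [distGoalCenter, angleGoalieGCenter]:
advice = PreferenceAdvice(
    B=np.array([[1.0, 0.0],     # distGoalCenter <= 15
                [0.0, -1.0]]),  # -angleGoalieGCenter <= -25
    d=np.array([15.0, -25.0]),
    preferred="shoot", nonpreferred="pass", beta=0.1)

print(advice.region_contains(np.array([10.0, 30.0])))  # True: in region
print(advice.region_contains(np.array([20.0, 30.0])))  # False: too far out
```

Note how a ">=" condition becomes a "<=" row by negating both sides, so the whole region is one system Bx ≤ d.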
7. Knowledge-Based SVMs: Generalizing an Example from a POINT to a REGION
(Figure: advice generalizes a single labeled point to a whole labeled region of feature space, with POS and NEG areas shown.)
8. Support-Vector Regression for RL
- min ||w||_1 + ν|b| + C||s||_1
- such that, for all training examples x:
  Q_a(x) - s ≤ w·x + b ≤ Q_a(x) + s
  (the middle term is the learned model's prediction; s is the error slack)
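The LP on this slide can be solved directly with an off-the-shelf solver. The sketch below (my own, assuming made-up weights ν = 1, C = 10 and synthetic data, not the paper's setup) linearizes the 1-norms by splitting each variable into nonnegative parts and calls `scipy.optimize.linprog`:

```python
# Sketch of the slide's support-vector regression LP:
#   min ||w||_1 + nu*|b| + C*||s||_1
#   s.t. Q(x) - s <= w.x + b <= Q(x) + s  for every training example,
# linearized via w = wp - wm, b = bp - bm with all parts nonnegative.
import numpy as np
from scipy.optimize import linprog

def fit_q_model(X, Q, nu=1.0, C=10.0):
    m, n = X.shape
    # Variable order: wp (n), wm (n), bp, bm, s (m); all >= 0 by default.
    c = np.concatenate([np.ones(2 * n), [nu, nu], C * np.ones(m)])
    ones, I = np.ones((m, 1)), np.eye(m)
    upper = np.hstack([X, -X, ones, -ones, -I])   # w.x + b - s <= Q
    lower = np.hstack([-X, X, -ones, ones, -I])   # -(w.x + b) - s <= -Q
    res = linprog(c, A_ub=np.vstack([upper, lower]),
                  b_ub=np.concatenate([Q, -Q]), method="highs")
    z = res.x
    return z[:n] - z[n:2 * n], z[2 * n] - z[2 * n + 1]   # w, b

# Tiny check: data generated by Q = 2*x0 - x1 + 0.5 is fit exactly,
# since with a large C the slack s is driven to zero.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 2))
Q = X @ np.array([2.0, -1.0]) + 0.5
w, b = fit_q_model(X, Q)
print(np.round(w, 3), round(b, 3))   # approximately [2, -1] and 0.5
```

With smaller C the solver trades training error for a sparser w, which is the usual 1-norm regularization effect.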
9. Mathematically
- min ||w||_1 + ν|b| + C||s||_1 + (penalties for not following advice, hence advice can be refined)
- such that, for all training examples x:
  Q_a(x) - s ≤ w·x + b ≤ Q_a(x) + s
  (plus constraints that represent the advice)
10. Incorporating Advice in KBKR
- Advice format: Bx ≤ d ⇒ f(x) ≥ h·x + β
- Example:
  If distGoalCenter ≤ 15 and angleGoalieGCenter ≥ 25
  Then Q(shoot) ≥ 0.9
11. Preference Advice
- If distGoalCenter ≤ 15 and angleGoalieGCenter ≥ 25
- Then Prefer Shoot to Pass
- Advice format: Bx ≤ d ⇒ Q_shoot(x) - Q_pass(x) ≥ β
- Note: we learn (w, b) for each action simultaneously
12. Preference Advice Theorem
- Let {x | Bx ≤ d} be nonempty. For fixed (w_p, b_p, w_n, b_n, β),
  Bx ≤ d ⇒ Q_p(x) - Q_n(x) ≥ β
  is equivalent to the following system having a solution u (by Motzkin's Theorem of the Alternative):
  B^T u + w_p - w_n = 0
  -d^T u + b_p - b_n - β ≥ 0,  u ≥ 0
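The theorem can be sanity-checked numerically: the implication holds iff the minimum of (w_p - w_n)·x + (b_p - b_n) over the region is at least β, and the theorem says this is equivalent to the u-system being feasible. The toy instance below (unit box region, linear models I invented for illustration) checks both sides with `scipy.optimize.linprog`:

```python
# Numeric check of the theorem on a made-up instance: region
# 0 <= x <= 1 in R^2, preferred-minus-nonpreferred model
# (w, b) = ([1, 0], 0.2), required margin beta = 0.1.
import numpy as np
from scipy.optimize import linprog

B = np.array([[1., 0.], [0., 1.], [-1., 0.], [0., -1.]])  # unit box
d = np.array([1., 1., 0., 0.])
w = np.array([1., 0.])   # w_p - w_n
b, beta = 0.2, 0.1       # b_p - b_n and required margin

# Direct check: implication holds iff min_{Bx<=d} w.x + b >= beta.
primal = linprog(w, A_ub=B, b_ub=d, bounds=[(None, None)] * 2,
                 method="highs")
holds_directly = primal.fun + b >= beta - 1e-9

# Alternative system: exists u >= 0 with B^T u + w = 0 and
# -d^T u + b - beta >= 0.  Minimize d^T u subject to B^T u = -w
# (linprog's default bounds already enforce u >= 0).
alt = linprog(d, A_eq=B.T, b_eq=-w, method="highs")
holds_by_alternative = alt.status == 0 and -alt.fun + b - beta >= -1e-9

print(holds_directly, holds_by_alternative)  # both True for this instance
```

Shrinking b below beta makes both checks flip to False together, which is exactly the equivalence the slide states.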
13. Pref-KBKR Linear Program
- min  Σ_a ( ||w_a||_1 + ν|b_a| + C||s_a||_1 )  +  Σ_k ( μ1||z_k||_1 + μ2 ζ_k )
- such that
  for each action a, for all training examples x:
    Q_a(x) - s_a ≤ w_a·x + b_a ≤ Q_a(x) + s_a
  for each piece of advice k (preferring action p to action n):
    -z_k ≤ w_p - w_n + B_k^T u_k ≤ z_k
    -d_k^T u_k + b_p - b_n - β_k + ζ_k ≥ 0
    u_k ≥ 0,  ζ_k ≥ 0
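The full program is just the regression constraints of slide 8 plus the soft advice constraints from the theorem. Below is a runnable sketch (not the paper's implementation) for one feature, two actions (shoot, pass), and one piece of advice "If x ≤ 5 Then Prefer Shoot to Pass" with β = 0; the weights ν, C, μ1, μ2 and the synthetic Q values are made-up:

```python
# Pref-KBKR LP sketch: regression constraints for each action plus
# soft advice constraints, all in one linprog call.
import numpy as np
from scipy.optimize import linprog

nu, C, mu1, mu2 = 1.0, 10.0, 1.0, 1.0
beta, B, d = 0.0, np.array([[1.0]]), np.array([5.0])   # region x <= 5

# Synthetic training data: Q_shoot = 1 - 0.1x, Q_pass = 0.5.
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
Qs, Qp = 1.0 - 0.1 * x, np.full_like(x, 0.5)
m = len(x)

# Nonnegative variables, in order: [w+, w-, b+, b-, s(m)] for shoot,
# the same block for pass, then z, u, zeta for the advice.
blk = 4 + m
nvar = 2 * blk + 3
iz, iu, izeta = nvar - 3, nvar - 2, nvar - 1

c = np.zeros(nvar)
for off in (0, blk):
    c[[off, off + 1]] = 1.0        # ||w_a||_1
    c[[off + 2, off + 3]] = nu     # nu * |b_a|
    c[off + 4:off + 4 + m] = C     # C * ||s_a||_1
c[iz], c[izeta] = mu1, mu2         # advice penalties

rows, rhs = [], []
def add(entries, b):
    r = np.zeros(nvar)
    for i, v in entries:
        r[i] += v
    rows.append(r)
    rhs.append(b)

for off, Q in ((0, Qs), (blk, Qp)):          # regression constraints
    for j in range(m):
        # w*x_j + b - s_j <= Q_j   and   -(w*x_j + b) - s_j <= -Q_j
        add([(off, x[j]), (off + 1, -x[j]), (off + 2, 1),
             (off + 3, -1), (off + 4 + j, -1)], Q[j])
        add([(off, -x[j]), (off + 1, x[j]), (off + 2, -1),
             (off + 3, 1), (off + 4 + j, -1)], -Q[j])

# Advice constraints (p = shoot, n = pass):
# -z <= w_p - w_n + B^T u <= z
add([(0, 1), (1, -1), (blk, -1), (blk + 1, 1), (iu, B[0, 0]),
     (iz, -1)], 0.0)
add([(0, -1), (1, 1), (blk, 1), (blk + 1, -1), (iu, -B[0, 0]),
     (iz, -1)], 0.0)
# -d^T u + b_p - b_n - beta + zeta >= 0, written as a <= row:
add([(iu, d[0]), (2, -1), (3, 1), (blk + 2, 1), (blk + 3, -1),
     (izeta, -1)], -beta)

res = linprog(c, A_ub=np.vstack(rows), b_ub=np.array(rhs),
              method="highs")
sol = res.x
w_shoot, b_shoot = sol[0] - sol[1], sol[2] - sol[3]
w_pass, b_pass = sol[blk] - sol[blk + 1], sol[blk + 2] - sol[blk + 3]
print(w_shoot, b_shoot, w_pass, b_pass)
```

Here the data already satisfy the advice, so the advice constraints cost nothing and the solver recovers the generating models; with conflicting data, the slacks z and ζ let the learner override the advice at a price set by μ1 and μ2.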
14. Methodology
- Test on 2-on-1 BreakAway
- The learner chooses the action when its player has the ball
- 13 basic features describe the world, each augmented by 32 tiles
- Average over 10 runs
- Batch learning every 100 games using SARSA estimates
- Learn on 2000 examples chosen stochastically (favoring recent examples)
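The slide does not say how "favoring recent examples" is weighted, so the sketch below shows one plausible scheme (geometric recency weighting with a made-up decay rate), not the paper's actual sampler:

```python
# Hypothetical recency-weighted sampler: each stored example is chosen
# with probability proportional to decay^age, so newer examples are
# favored but older ones can still appear.
import numpy as np

def sample_recent(num_stored, num_chosen, decay=0.999, seed=0):
    """Sample distinct indices, weighted toward the most recent."""
    rng = np.random.default_rng(seed)
    age = np.arange(num_stored)[::-1]       # last-stored example has age 0
    p = decay ** age
    p /= p.sum()
    return rng.choice(num_stored, size=num_chosen, replace=False, p=p)

idx = sample_recent(10000, 2000)
print(len(idx), idx.mean() > 10000 / 2)    # 2000 distinct, skewed recent
```

A decay near 1 approaches uniform sampling; a smaller decay concentrates almost all mass on the newest games.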
15. (Results figure; no slide text available.)
16. Related Work
- Advice-taking RL
  - Gordon & Subramanian, Informatica 1994
  - Maclin & Shavlik, AAAI 1994, MLJ 1996
  - Andre & Russell, NIPS 2001
- RL and SVMs
  - Dietterich & Wang, ECML 2001
  - Lagoudakis & Parr, ICML 2003
17. Current and Future Work
- Knowledge transfer via preference advice
- A wider variety of problems
- Large numbers of examples
- Large numbers of pieces of advice
- Other types of advice (e.g., multi-step plans)
18. Conclusions
- Pref-KBKR
  - Allows a human user to advise an RL agent in a natural manner
  - Accepts rules about policies rather than Q values:
    If distGoalCenter ≤ 15 and angleGoalieGCenter ≥ 25
    Then Prefer Shoot to Pass
  - When applied to a complex RL problem, significantly outperforms agents without advice and agents with KBKR advice
19. Acknowledgements
- DARPA Grant HR0011-04-0007
- US Naval Research Lab Grant N00173-04-1-G026