Title: Decision Making in Intelligent Systems Lecture 9
1. Decision Making in Intelligent Systems, Lecture 9
- BSc course Kunstmatige Intelligentie 2008
- Bram Bakker
- Intelligent Systems Lab Amsterdam
- Informatics Institute
- Universiteit van Amsterdam
- bram_at_science.uva.nl
2. Overview of this lecture
- Last lecture!
- Illustrate trade-offs and issues that arise in real applications
- Illustrate use of domain knowledge
- Describe some RL topics that UvA is working on
3. Final exam May 2008
- Time/place still unknown (to me)!
- Check the website: http://staff.science.uva.nl/bram/DMIS/
4. Case study 1: TD-Gammon
Tesauro 1992, 1994, 1995, ...
- White has just rolled a 5 and a 2, so can move one of his pieces 5 steps and one (possibly the same piece) 2 steps
- Objective is to advance all pieces to points 19-24
- 30 pieces, 24 locations implies an enormous number of configurations
- Effective branching factor of 400
5. A Few Details
- Reward: 0 at all times except those in which the game is won, when it is 1
- Episodic (game = episode), undiscounted
- Gradient-descent TD(λ) with a multi-layer neural network (see the sketch below)
- weights initialized to small random numbers
- backpropagation of TD error
- four input units for each point: unary encoding of the number of white pieces, plus other features
- Learning during self-play
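To make the gradient-descent TD(λ) self-play update concrete, here is a minimal sketch in Python/NumPy. The layer sizes, step size, and λ are illustrative placeholders rather than Tesauro's actual settings.

    # Minimal sketch of gradient-descent TD(lambda) with a sigmoid network,
    # trained from self-play games; sizes and constants are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    N_IN, N_HID = 198, 40                     # input features and hidden units (placeholders)
    W1 = rng.normal(0, 0.1, (N_HID, N_IN))    # weights initialized to small random numbers
    W2 = rng.normal(0, 0.1, N_HID)
    ALPHA, LAM = 0.1, 0.7                     # step size and trace decay (placeholders)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def value(x):
        """Estimated probability of winning from board feature vector x."""
        h = sigmoid(W1 @ x)
        return sigmoid(W2 @ h), h

    def learn_from_game(boards, outcome):
        """boards: feature vectors of successive positions in one self-play game;
        outcome: 1 if the game was won, else 0 (reward only at the end, undiscounted)."""
        global W1, W2
        e1, e2 = np.zeros_like(W1), np.zeros_like(W2)      # eligibility traces
        v_prev, h_prev, x_prev = None, None, None
        for x in list(boards) + [None]:                    # None marks the terminal position
            v, h = (outcome, None) if x is None else value(x)
            if v_prev is not None:
                delta = v - v_prev                         # TD error
                g2 = v_prev * (1 - v_prev) * h_prev        # backpropagated gradients of v_prev
                g1 = np.outer(g2 * W2 * (1 - h_prev), x_prev)
                e2, e1 = LAM * e2 + g2, LAM * e1 + g1
                W2 += ALPHA * delta * e2                   # update along the eligibility traces
                W1 += ALPHA * delta * e1
            if x is not None:
                v_prev, h_prev, x_prev = v, h, x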
6. Multi-layer Neural Network
7. Summary of TD-Gammon Results
8. Samuel's Checkers Player
Arthur Samuel 1959, 1967
- Minimax to determine the backed-up score of a position
- Rote learning: save each board configuration encountered together with its backed-up score
- Learning similar to the TD algorithm
9. Samuel's Backups
10. The Basic Idea
"... we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play."
A. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers", 1959
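In modern terms, Samuel's scheme updates the current position's evaluation toward a search-bootstrapped target. A minimal sketch of that idea with a linear evaluation function; features() and minimax_backup() are hypothetical helpers, not Samuel's actual procedure.

    # Sketch: move the evaluation of the current position toward the minimax
    # backed-up score of the lookahead from that position.
    # features() and minimax_backup() are hypothetical helpers.
    def samuel_style_update(weights, position, alpha=0.01):
        x = features(position)                 # board feature vector (NumPy array)
        score = weights @ x                    # current linear evaluation
        target = minimax_backup(position)      # backed-up score from the search tree
        return weights + alpha * (target - score) * x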
11. More Samuel Details
- Did not include explicit rewards
- Instead used the so-called "piece advantage" feature
- No special treatment of terminal positions
- Generalization method produced better-than-average play: tricky but beatable
- Supervised mode: "book learning"
12. The Acrobot
Spong 1994
13. Acrobot Learning Curves for Sarsa(λ)
14. Typical Acrobot Learned Behavior
15. Elevator Dispatching
Crites and Barto 1996
16. Control Strategies
- Zoning: divide the building into zones; park in zone when idle. Robust in heavy traffic.
- Search-based methods: greedy or non-greedy; Receding Horizon control
- Rule-based methods: expert systems / fuzzy logic from human experts
- Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)
- Adaptive/learning methods: NNs for prediction, parameter-space search using simulation, DP on a simplified model, non-sequential RL
17. The Elevator Model (from Lewis, 1991)
Discrete event system: continuous time, asynchronous elevator operation
- Parameters:
- Floor Time (time to move one floor at max speed): 1.45 secs.
- Stop Time (time to decelerate, open and close doors, and accelerate again): 7.19 secs.
- Turn Time (time needed by a stopped car to change directions): 1 sec.
- Load Time (the time for one passenger to enter or exit a car): a random variable ranging from 0.6 to 6.0 secs, mean 1 sec.
- Car Capacity: 20 passengers
Traffic profile: Poisson arrivals with rates changing every 5 minutes; down-peak (see the sampling sketch below)
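A minimal sketch of sampling such an arrival stream, with the Poisson rate changing every 5 minutes; the rates and floor choices are illustrative placeholders, not the actual down-peak profile.

    # Sample a passenger arrival stream whose Poisson rate changes every
    # 5 minutes (300 s); rates and floors are placeholder values.
    import numpy as np

    rng = np.random.default_rng(1)
    RATES_PER_MIN = [4, 8, 12, 8, 4]                      # arrivals/minute per 5-minute slot (placeholders)

    def sample_arrivals():
        events = []
        for slot, rate in enumerate(RATES_PER_MIN):
            t = slot * 300.0
            while True:
                t += rng.exponential(60.0 / rate)         # exponential inter-arrival time
                if t >= (slot + 1) * 300.0:
                    break
                origin = int(rng.integers(2, 11))         # down-peak: start on an upper floor
                events.append((t, origin, 1))             # (time, origin floor, destination = lobby)
        return events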
18. State Space
- 18 hall call buttons: 2^18 combinations
- positions and directions of cars: 18^4 (rounding to nearest floor)
- motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
- 40 car buttons: 2^40
- Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons were pushed; we discretize these.
- Set of passengers riding each car and their destinations: observable only through the car buttons
Conservatively about 10^22 states
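As a rough check, the quoted order of magnitude follows from multiplying the observable combinatorial factors above (ignoring the motion states and the unobservable parts); the grouping of factors here is one reading of the slide.

    # Rough order-of-magnitude check of the state count quoted above.
    hall_calls  = 2 ** 18     # combinations of the 18 hall call buttons
    car_buttons = 2 ** 40     # combinations of the 40 car buttons
    car_pos_dir = 18 ** 4     # positions/directions of the 4 cars
    print(f"{hall_calls * car_buttons * car_pos_dir:.1e}")   # roughly 3e22, i.e. about 10^22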
19. Actions
- When moving (halfway between floors)
- stop at next floor
- continue past next floor
- When stopped at a floor
- go up
- go down
20. Constraints
- A car cannot pass a floor if a passenger wants to get off there
- A car cannot change direction until it has serviced all onboard passengers traveling in the current direction
- Don't stop at a floor if another car is already stopping, or is stopped, there
- Don't stop at a floor unless someone wants to get off there
- Given a choice, always move up
[slide labels: standard / special heuristic]
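A minimal sketch of how these constraints could be applied as an action filter before the learner chooses; the CarState fields and the simplifications are hypothetical, not Crites and Barto's implementation.

    # Reduce the action set of one car with the constraints above before RL decides.
    # CarState and its fields are hypothetical; this only illustrates action filtering.
    from dataclasses import dataclass, field

    @dataclass
    class CarState:
        next_floor: int
        direction: int                              # +1 up, -1 down
        onboard_destinations: set = field(default_factory=set)
        other_car_stopping_at: set = field(default_factory=set)

    def legal_actions(car, moving):
        if moving:
            if car.next_floor in car.onboard_destinations:
                return ["stop"]                     # cannot pass a floor where a passenger gets off
            if car.next_floor in car.other_car_stopping_at:
                return ["continue"]                 # don't stop where another car already stops
            return ["stop", "continue"]             # otherwise the learner chooses
        # stopped at a floor: cannot reverse while onboard passengers still travel this way
        ahead = [f for f in car.onboard_destinations
                 if (f - car.next_floor) * car.direction > 0]
        if ahead:
            return ["up"] if car.direction > 0 else ["down"]
        return ["up"]                               # given a choice, always move up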
21. Performance Criteria
Minimize:
- Average wait time
- Average system time (wait + travel time)
- Percentage waiting > T seconds (e.g., T = 60)
- Average squared wait time (to encourage fast and fair service)
22. Average Squared Wait Time
- Instantaneous cost: the sum of squared wait times of the currently waiting passengers, $r_\tau = \sum_p \big(\text{wait}_p(\tau)\big)^2$
- Define the return as an integral rather than a sum (Bradtke and Duff, 1994): $\sum_{t=0}^{\infty} \gamma^t r_t$ becomes $\int_0^{\infty} e^{-\beta \tau}\, r_\tau \, d\tau$
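A small numerical sketch of this continuous-time return, assuming the cost is piecewise constant between events; the interval boundaries, costs, and β below are made-up values.

    # Continuous-time discounted return: the integral of exp(-beta*tau) * r(tau),
    # evaluated in closed form per interval when r is constant between events.
    import math

    def continuous_return(intervals, beta=0.01):
        """intervals: list of (t_start, t_end, cost_rate), cost constant per interval."""
        total = 0.0
        for t0, t1, r in intervals:
            total += r * (math.exp(-beta * t0) - math.exp(-beta * t1)) / beta
        return total

    # Made-up squared-wait costs over three inter-event intervals (times in seconds)
    print(continuous_return([(0, 10, 4.0), (10, 25, 9.0), (25, 40, 1.0)]))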
23. Algorithm
24. Neural Networks
47 inputs, 20 sigmoid hidden units, 1 or 2 output units
Inputs:
- 9 binary: state of each hall down button
- 9 real: elapsed time of each hall down button, if pushed
- 16 binary: one on at a time; position and direction of the car making the decision
- 10 real: location/direction of other cars ("footprint")
- 1 binary: at highest floor with a waiting passenger?
- 1 binary: at floor with the longest-waiting passenger?
- 1 bias unit: constant 1
25. Elevator Results
26. Dynamic Channel Allocation
Singh and Bertsekas 1997
27. Summary
- RL can lead to successful applications
- Background knowledge is important
- Learning directly in the real world is rarely possible
- You need a more or less accurate simulation
- Function approximation (e.g. neural networks) is important to deal with large state spaces
28. Frontier Dimensions
- Smart function approximation
- Non-Markov case
- Partially Observable MDPs (POMDPs)
- Bayesian approach: belief states
- construct state from sequence of observations
- Modularity and hierarchies
- Learning and planning at several different levels
- Theory of options, MAXQ
- Multi-agent RL
29. Adaptive resolution function approximation
- Learn, in your state(-action) space:
- where you can generalize over many states (coarse-grained view)
- where you must distinguish between states (fine-grained view)
- Learn based on experienced rewards
- Returns guide the formation of boundaries between regions in state(-action) space
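A toy sketch of the idea on a one-dimensional state space: each region keeps the returns observed inside it and is split where those returns disagree too much; the split rule and thresholds are illustrative, not a specific published method.

    # Toy adaptive-resolution value table on the interval [0, 1]: coarse regions
    # are split when the returns observed inside them vary too much.
    import statistics

    class Region:
        def __init__(self, lo, hi):
            self.lo, self.hi, self.returns = lo, hi, []

        def value(self):
            return statistics.mean(self.returns) if self.returns else 0.0

    regions = [Region(0.0, 1.0)]                     # start with a single coarse region

    def update(state, ret, split_std=1.0, min_width=0.01):
        r = next(reg for reg in regions if reg.lo <= state <= reg.hi)
        r.returns.append(ret)
        # refine where states must be distinguished: high spread of observed returns
        if (len(r.returns) > 20 and statistics.pstdev(r.returns) > split_std
                and (r.hi - r.lo) > min_width):
            mid = (r.lo + r.hi) / 2
            regions.remove(r)
            regions.extend([Region(r.lo, mid), Region(mid, r.hi)])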
30. Frontier Dimensions
- Smart function approximation
- Non-Markov case
- Partially Observable MDPs (POMDPs)
- Bayesian approach: belief states
- construct state from sequence of observations
- Modularity and hierarchies
- Learning and planning at several different levels
- Theory of options, MAXQ
- Multi-agent RL
31. Architectures: NNs for fully observable MDPs
Direct value function approximation
Actor-critic
32. Architectures: partially observable case
Direct value function approximation
Actor-critic
33. Long Short-Term Memory (LSTM)
The memory cells can learn to remember relevant information from the time series of inputs for long periods of time (e.g. Hochreiter & Schmidhuber, 1997; Bakker, 2001)
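A minimal sketch of one step of a standard LSTM layer (forward pass only), showing how the cell state carries information across time steps; the sizes are placeholders and this is not Bakker's actual architecture.

    # One forward step of an LSTM layer; the cell state c is what lets the agent
    # remember relevant observations over long periods. Sizes are placeholders.
    import numpy as np

    rng = np.random.default_rng(3)
    N_IN, N_CELL = 8, 16
    W = rng.normal(0, 0.1, (4 * N_CELL, N_IN + N_CELL))   # input/forget/output gates + cell input
    b = np.zeros(4 * N_CELL)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h, c):
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g           # memory cells accumulate and forget information
        h = o * np.tanh(c)          # hidden state fed to the value function / policy
        return h, c

    # Build an internal state from a sequence of observations (placeholder zeros)
    h, c = np.zeros(N_CELL), np.zeros(N_CELL)
    for obs in np.zeros((5, N_IN)):
        h, c = lstm_step(obs, h, c)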
34. Frontier Dimensions
- Smart function approximation
- Non-Markov case
- Partially Observable MDPs (POMDPs)
- Bayesian approach: belief states
- construct state from sequence of observations
- Modularity and hierarchies
- Learning and planning at several different levels
- Theory of options, MAXQ
- Multi-agent RL
35. Hierarchical methods
[Diagram: a policy for the overall task at the HIGH level selects among subtasks; each subtask has its own policy, possibly invoking further subtasks, down to the LOW level]
36. Frontier Dimensions
- Smart function approximation
- Non-Markov case
- Partially Observable MDPs (POMDPs)
- Bayesian approach: belief states
- construct state from sequence of observations
- Modularity and hierarchies
- Learning and planning at several different levels
- Theory of options, MAXQ
- Multi-agent RL
37. Multi-agent RL
- Structure the overall task such that a team of agents (rather than a single agent) can solve it
- The task must be decomposable in this way
- Find a way of distributing rewards between the agents
- Agents must be rewarded for good contributions to the overall team
- Agents must not be rewarded for bad contributions or selfish behavior
38. Traffic simulator
39. Approach
- Model-based multi-agent reinforcement learning
- Traffic behavior model is estimated online (maximum-likelihood model; see the sketch below)
- Value function/policy is estimated online using approximate real-time dynamic programming
- Each traffic light junction makes a locally optimal decision by using the value function and sensing the local cars around the junction
- Recently an explicit coordination mechanism was added by means of coordination graphs and max-plus
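A minimal sketch of the maximum-likelihood model estimate and one model-based backup; the state and action encoding of a junction is left abstract, so this illustrates the idea rather than the actual simulator code.

    # Online maximum-likelihood transition model: count transitions, normalize,
    # and back up values through the estimated model (approximate DP style).
    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))       # counts[(s, a)][s_next] = visits

    def observe(s, a, s_next):
        counts[(s, a)][s_next] += 1

    def transition_probs(s, a):
        """Maximum-likelihood estimate of P(s_next | s, a) from the counts so far."""
        c = counts[(s, a)]
        total = sum(c.values())
        return {s_next: n / total for s_next, n in c.items()} if total else {}

    def backup(s, a, values, reward, gamma=0.95):
        """One approximate dynamic-programming backup using the estimated model."""
        return reward(s, a) + gamma * sum(p * values.get(s_next, 0.0)
                                          for s_next, p in transition_probs(s, a).items())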
40. 3D visualisation of the traffic simulator
Thanks to Matthijs Amelink
41. Multi-agent example: RoboCup simulation league
42. Studying for the final exam
- Literature:
- Sutton & Barto (1998) book
- Kaelbling, Littman, & Cassandra (1998), "Planning and acting in partially observable stochastic domains", AI Journal article (on the website), Sections 1-3
- Questions will test:
- General insight into important issues
- Ability to apply the mathematics of RL
- Use the slides!
- Open book
- Try answering some questions at the end of each
chapter
43. Final exam May 2008
- ? May
- Location: ?
- This information will be posted on the DMIS website: http://staff.science.uva.nl/bram/DMIS/