Title: Machine Learning: Symbol-based
1 Machine Learning: Symbol-based
10.0 Introduction
10.1 A Framework for Symbol-based Learning
10.2 Version Space Search
10.3 The ID3 Decision Tree Induction Algorithm
10.4 Inductive Bias and Learnability
10.5 Knowledge and Learning
10.6 Unsupervised Learning
10.7 Reinforcement Learning
10.8 Epilogue and References
10.9 Exercises
Additional references for the slides: Thomas Dean, James Allen, and Yiannis Aloimonos, Artificial Intelligence: Theory and Practice, Addison-Wesley, 1995, Section 5.9.
2 Reinforcement Learning
- A form of learning where the agent can explore and learn through interaction with the environment
- The agent learns a policy, which is a mapping from states to actions. The policy tells what the best move is in a particular state.
- It is a general methodology: planning, decision making, and search can all be viewed as some form of reinforcement learning.
3 Tic-tac-toe: a different approach
- Recall the minimax approach: the agent knows its current state, generates a two-layer search tree taking into account all the possible moves for itself and the opponent, backs up values from the leaf nodes, and takes the best move assuming that the opponent will also do so.
- An alternative is to directly start playing with an opponent (the opponent does not have to be perfect, but could as well be). Assume no prior knowledge or lookahead. Assign values to states: 1 is a win, 0 is a loss or a draw, 0.5 is anything else.
4 Notice that 0.5 is arbitrary; it cannot differentiate between good moves and bad moves. So, the learner has no guidance initially. It engages in playing. When the game ends, if it is a win, the value 1 will be propagated backwards. If it is a draw or a loss, the value 0 is propagated backwards. Eventually, earlier states will be labeled to reflect their true value. After several plays, the learner will learn the best move given a state (a policy).
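As a rough illustration of this backup idea (not taken from the slides: the value table, the step size alpha, and the helper successor are all assumptions; nudging each value toward the value of what followed it is one common way to realize "propagating backwards"):

```python
from collections import defaultdict

values = defaultdict(lambda: 0.5)   # every state starts at the arbitrary value 0.5

def backup(visited_states, outcome, alpha=0.1):
    """Propagate the end-of-game value (1 = win, 0 = loss or draw) backwards
    through the states visited during the game, most recent first."""
    target = outcome
    for state in reversed(visited_states):
        values[state] += alpha * (target - values[state])   # nudge toward what followed
        target = values[state]

def best_move(state, legal_moves, successor):
    """The learned policy: pick the move whose successor state has the highest value."""
    return max(legal_moves, key=lambda move: values[successor(state, move)])
```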
5 Issues in generalizing this approach
- How will the state values be initialized or propagated backwards?
- What if there is no end to the game (infinite horizon)?
- This is an optimization problem, which suggests that it is hard. How can an optimal policy be learned?
6A simple robot domain
The robot is in one of the states 0, 1, 2, 3.
Each one represents an office, the offices are
connected in a ring. Three actions are
available moves to the next state
- moves to the previous state _at_
remains at the same state
_at_
_at_
0
1
-
-
-
-
3
2
_at_
_at_
7 The robot domain (cont'd)
- The robot can observe the label of the state it is in and perform any action corresponding to an arc leading out of its current state.
- We assume that there is a clock governing the passage of time, and that at each tick of the clock the robot has to perform an action.
- The environment is deterministic: there is a unique state resulting from any initial state and action.
- Each state has a reward: 10 for state 3, 0 for the others.
8 The reinforcement learning problem
- Given information about the environment
  - States
  - Actions
  - State-transition function (or diagram)
- Output a policy π: states → actions, i.e., find the best action to execute at each state
- Assumes that the state is completely observable (the agent always knows which state it is in)
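As a concrete (and entirely hypothetical) illustration, the four-office domain and the two policies compared on the next slides might be encoded like this; STATES, ACTIONS, next_state, reward, policy0, and policy1 are names made up for this sketch:

```python
# The four offices, connected in a ring.
STATES = [0, 1, 2, 3]
ACTIONS = ["+", "-", "@"]    # next office, previous office, stay put

def next_state(j, a):
    """Deterministic state-transition function f(j, a)."""
    if a == "+":
        return (j + 1) % 4
    if a == "-":
        return (j - 1) % 4
    return j                 # "@" keeps the robot in the same office

def reward(j):
    """R(j): 10 for office 3, 0 for the others."""
    return 10 if j == 3 else 0

# The policies called "policy 0" and "policy 1" on the following slides.
policy0 = {j: "+" for j in STATES}                       # always move to the next office
policy1 = {j: ("@" if j == 3 else "+") for j in STATES}  # move forward, then stay at office 3
```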
9 Compare three policies
- a. Every state is mapped to @.
  - The value of this policy is 0, because the robot will never get to office 3.
- b. Every state is mapped to + (policy 0).
  - The value of this policy is ∞, because the robot will end up in office 3 infinitely often.
- c. Every state except 3 is mapped to +, and 3 is mapped to @ (policy 1).
  - The value of this policy is also ∞, because the robot will end up in (stay in) office 3 infinitely often.
10 Compare three policies (cont'd)
So, it is easy to rule case a out, but how can we show that policy 1 is better than policy 0? One way would be to compute the average reward per tick:
- POLICY 1: the average reward per tick for state 0 is 10.
- POLICY 0: the average reward per tick for state 0 is 10/4 = 2.5.
Another way would be to assign higher values to immediate rewards and apply a discount to future rewards.
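A quick numerical check of these averages, simulating the assumed domain encoding sketched after slide 8:

```python
def average_reward_per_tick(policy, start=0, ticks=10_000):
    """Estimate the long-run average reward per tick by running the policy."""
    state, total = start, 0
    for _ in range(ticks):
        state = next_state(state, policy[state])
        total += reward(state)
    return total / ticks

print(average_reward_per_tick(policy1))   # ≈ 10.0
print(average_reward_per_tick(policy0))   # = 2.5
```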
11 Discounted cumulative reward
- Assume that the robot associates a higher value with more immediate rewards and therefore discounts future rewards.
- The discount rate (γ) is a number between 0 and 1 used to discount future rewards.
- The discounted cumulative reward for a particular state with respect to a given policy is the sum, for n from 0 to infinity, of γ^n times the reward associated with the state reached after the n-th tick of the clock.
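Written out as a formula (a reconstruction using the R, f, and π notation that slide 13 introduces), the definition above is:

$$V^{\pi}(j) = \sum_{n=0}^{\infty} \gamma^{n} R(s_n), \qquad s_0 = j,\quad s_{n+1} = f\big(s_n, \pi(s_n)\big).$$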
POLICY 1: the discounted cumulative reward for state 0 is 2.5.
POLICY 0: the discounted cumulative reward for state 0 is 1.33.
12 Discounted cumulative reward (cont'd)
- Take γ = 0.5.
- For state 0 with respect to policy 0: 0.5^0 × 0 + 0.5^1 × 0 + 0.5^2 × 0 + 0.5^3 × 10 + 0.5^4 × 0 + 0.5^5 × 0 + 0.5^6 × 0 + 0.5^7 × 10 + ... = 1.25 + 0.078 + ... = 1.33 in the limit
- For state 0 with respect to policy 1: 0.5^0 × 0 + 0.5^1 × 0 + 0.5^2 × 0 + 0.5^3 × 10 + 0.5^4 × 10 + 0.5^5 × 10 + 0.5^6 × 10 + 0.5^7 × 10 + ... = 2.5 in the limit
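The same limits can be approximated numerically with the assumed domain encoding from slide 8; truncating the sum after 50 ticks is more than enough at γ = 0.5:

```python
def discounted_cumulative_reward(policy, start=0, gamma=0.5, ticks=50):
    """Sum of gamma^n * R(state reached after n ticks), truncated at 'ticks' terms."""
    state, total = start, 0.0
    for n in range(ticks):
        total += (gamma ** n) * reward(state)
        state = next_state(state, policy[state])
    return total

print(discounted_cumulative_reward(policy0))   # ≈ 1.333...
print(discounted_cumulative_reward(policy1))   # ≈ 2.5
```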
13 Discounted cumulative reward (cont'd)
- Let
  - j be a state,
  - R(j) be the reward for ending up in state j,
  - π be a fixed policy,
  - π(j) be the action dictated by π in state j,
  - f(j, a) be the next state given that the robot starts in state j and performs action a,
  - Vπi(j) be the estimated value of state j with respect to the policy π after the i-th iteration of the algorithm.
- Using a dynamic programming algorithm, one can obtain a good estimate of Vπ, the value function for policy π, as i → ∞.
14 A dynamic programming algorithm to compute values of states for a policy π
- 1. For each j, set Vπ0(j) to 0.
- 2. Set i to 0.
- 3. For each j, set Vπi+1(j) to R(j) + γ Vπi( f(j, π(j)) ).
- 4. Set i to i + 1.
- 5. If i is equal to the maximum number of iterations, then return Vπi; otherwise, return to step 3.
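A minimal Python sketch of this policy-evaluation loop, reusing the hypothetical STATES / next_state / reward encoding sketched after slide 8 (the function name evaluate_policy is made up):

```python
def evaluate_policy(policy, gamma=0.5, iterations=4):
    """Dynamic-programming estimate of Vpi for a fixed policy, per the slide above."""
    V = {j: 0.0 for j in STATES}           # step 1: Vpi_0(j) = 0 for every state j
    for _ in range(iterations + 1):        # iterations 0, 1, ..., iterations (as in slides 15-20)
        # step 3: synchronous update Vpi_{i+1}(j) = R(j) + gamma * Vpi_i(f(j, pi(j)))
        V = {j: reward(j) + gamma * V[next_state(j, policy[j])] for j in STATES}
    return V
```

The traces on the next slides correspond to calling this with gamma = 0.5 and iterations = 4.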
15 Values of states for policy 0
- initialize
- V(0) = 0
- V(1) = 0
- V(2) = 0
- V(3) = 0
- iteration 0
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 0 = 0
- For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10
- (iteration 0 essentially initializes the values of states to their immediate rewards)
16 Values of states for policy 0 (cont'd)
- iteration 0: V(0) = V(1) = V(2) = 0, V(3) = 10
- iteration 1
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
- For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10
- iteration 2
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
- For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10
17 Values of states for policy 0 (cont'd)
- iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 5, V(3) = 10
- iteration 3
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
- For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10
- iteration 4
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
- For office 3: R(3) + γ V(0) = 10 + 0.5 × 1.25 = 10.625
18 Values of states for policy 1
- initialize
- V(0) = 0
- V(1) = 0
- V(2) = 0
- V(3) = 0
- iteration 0
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 0 = 0
- For office 3: R(3) + γ V(3) = 10 + 0.5 × 0 = 10
19 Values of states for policy 1 (cont'd)
- iteration 0: V(0) = V(1) = V(2) = 0, V(3) = 10
- iteration 1
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
- For office 3: R(3) + γ V(3) = 10 + 0.5 × 10 = 15
- iteration 2
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 15 = 7.5
- For office 3: R(3) + γ V(3) = 10 + 0.5 × 15 = 17.5
20 Values of states for policy 1 (cont'd)
- iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 7.5, V(3) = 17.5
- iteration 3
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 7.5 = 3.75
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 17.5 = 8.75
- For office 3: R(3) + γ V(3) = 10 + 0.5 × 17.5 = 18.75
- iteration 4
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 3.75 = 1.875
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 8.75 = 4.375
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 18.75 = 9.375
- For office 3: R(3) + γ V(3) = 10 + 0.5 × 18.75 = 19.375
21 Compare policies
- Policy 0 after iteration 4
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5
- For office 3: R(3) + γ V(0) = 10 + 0.5 × 1.25 = 10.625
- Policy 1 after iteration 4
- For office 0: R(0) + γ V(1) = 0 + 0.5 × 3.75 = 1.875
- For office 1: R(1) + γ V(2) = 0 + 0.5 × 8.75 = 4.375
- For office 2: R(2) + γ V(3) = 0 + 0.5 × 18.75 = 9.375
- For office 3: R(3) + γ V(3) = 10 + 0.5 × 18.75 = 19.375
- Policy 1 is better because each state has a higher value than under policy 0.
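These iteration-4 numbers can be checked with the evaluate_policy sketch given after slide 14 (again assuming the hypothetical domain encoding from slide 8):

```python
V0 = evaluate_policy(policy0, gamma=0.5, iterations=4)
V1 = evaluate_policy(policy1, gamma=0.5, iterations=4)
print(V0)                                   # {0: 1.25, 1: 2.5, 2: 5.0, 3: 10.625}
print(V1)                                   # {0: 1.875, 1: 4.375, 2: 9.375, 3: 19.375}
print(all(V1[j] > V0[j] for j in STATES))   # True: policy 1 dominates policy 0
```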
22 Temporal credit assignment problem
- It is the problem of assigning credit or blame to the actions in a sequence of actions where feedback is available only at the end of the sequence.
- When you lose a game of chess or checkers, the blame for your loss cannot necessarily be attributed to the last move you made, or even the next-to-the-last move.
- Dynamic programming solves the temporal credit assignment problem by propagating rewards backwards to earlier states, and hence to actions earlier in the sequence of actions determined by a policy.
23 Computing an optimal policy
- Given a method for estimating the value of states with respect to a fixed policy, it is possible to find an optimal policy. We would like to maximize the discounted cumulative reward.
- Policy iteration [Howard, 1960] is an algorithm that uses the algorithm for computing the value of a state as a subroutine.
24 Policy iteration algorithm
- 1. Let π0 be an arbitrary policy.
- 2. Set i to 0.
- 3. Compute Vπi(j) for each j.
- 4. Compute a new policy πi+1 so that πi+1(j) is the action a maximizing R(j) + γ Vπi( f(j, a) ).
- 5. If πi+1 = πi, then return πi; otherwise, set i to i + 1, and go to step 3.
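A minimal sketch of this loop in Python, built on the hypothetical domain encoding and the evaluate_policy function sketched earlier (policy_iteration is a made-up name; ties between equally good actions are broken by the order of ACTIONS):

```python
def policy_iteration(gamma=0.5, eval_iterations=20):
    """Alternate policy evaluation (step 3) and greedy improvement (step 4) until stable."""
    policy = {j: "@" for j in STATES}                    # step 1: an arbitrary initial policy
    while True:
        V = evaluate_policy(policy, gamma, eval_iterations)          # step 3
        new_policy = {                                               # step 4
            j: max(ACTIONS, key=lambda a: reward(j) + gamma * V[next_state(j, a)])
            for j in STATES
        }
        if new_policy == policy:                                     # step 5: unchanged, so stop
            return policy
        policy = new_policy

print(policy_iteration())   # e.g. {0: '-', 1: '+', 2: '+', 3: '@'}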
25 Policy iteration algorithm (cont'd)
- A policy π is said to be optimal if there is no other policy π′ and state j such that Vπ′(j) > Vπ(j) while, for all k ≠ j, Vπ′(k) ≥ Vπ(k).
- The policy iteration algorithm is guaranteed to terminate in a finite number of steps with an optimal policy.
26 Comments on reinforcement learning
- A general model where an agent can learn to function in dynamic environments
- The agent can learn while interacting with the environment
- No prior knowledge except the (probabilistic) transitions is assumed
- Can be generalized to stochastic domains (an action might have several different probabilistic consequences, i.e., the state-transition function is not deterministic)
- Can also be generalized to domains where the reward function is not known
27 Famous example: TD-Gammon (Tesauro, 1995)
- Learns to play backgammon
- Immediate reward: +100 if win, -100 if lose, 0 for all other states
- Trained by playing 1.5 million games against itself (several weeks)
- Now approximately equal to the best human player (won the World Cup of Backgammon in 1992; among the top 3 since 1995)
- Predecessor NeuroGammon [Tesauro and Sejnowski, 1989] learned from examples of labeled moves (very tedious for a human expert)
28 Other examples
- Robot learning to dock on a battery charger
- Pole balancing
- Elevator dispatching [Crites and Barto, 1995]: better than the industry standard
- Inventory management [Van Roy et al.]: 10-15% improvement over industry standards
- Job-shop scheduling for NASA space missions [Zhang and Dietterich, 1997]
- Dynamic channel assignment in cellular phones [Singh and Bertsekas, 1994]
- Robotic soccer
29 Common characteristics
- delayed reward
- opportunity for active exploration
- possibility that the state is only partially observable
- possible need to learn multiple tasks with the same sensors/effectors
- there may not be an adequate teacher