1
Markov Games as a Framework for Multi-agent Reinforcement Learning
Mike L. Littman
  • Presented by Jinzhong Niu
  • March 30, 2004

2
Overview
  • An MDP can describe only single-agent environments.
  • A new mathematical framework is needed to support multi-agent reinforcement learning: Markov games.
  • A single step in this direction is explored here: two-player zero-sum Markov games.

3
Definitions
  • Markov Decision Process (MDP)
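In standard notation, matching the paper, an MDP is a tuple

    \langle S, A, T, R \rangle

with a finite set of states S, a set of actions A, a transition function T : S \times A \to \Pi(S) (where \Pi(S) is the set of probability distributions over S), and a reward function R : S \times A \to \mathbb{R}. The agent seeks to maximize the expected discounted return E[\sum_{t} \gamma^{t} r_{t}] for a discount factor 0 \le \gamma < 1.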

4
Definitions (cont.)
  • Markov Game (MG)
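A Markov game extends the MDP tuple with one action set and one reward function per agent:

    \langle S, A_1, \ldots, A_k, T, R_1, \ldots, R_k \rangle,
    T : S \times A_1 \times \cdots \times A_k \to \Pi(S),
    R_i : S \times A_1 \times \cdots \times A_k \to \mathbb{R},

and each agent i maximizes its own expected discounted return.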

5
Definitions (cont.)
  • Two-player zero-sum Markov Game (2P-MG)
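With two players whose action sets are A (the agent) and O (the opponent), zero-sum means the rewards cancel,

    R_1(s, a, o) = -R_2(s, a, o),

so a single function R : S \times A \times O \to \mathbb{R} suffices: the agent maximizes it while the opponent minimizes it.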

6
Is the 2P-MG Framework Expressive Enough?
Yes
  • The zero-sum restriction precludes cooperation!
  • It generalizes:
  • MDPs (when |O| = 1): the opponent has constant behavior, which may be viewed as part of the environment.
  • Matrix games (when |S| = 1): the environment holds no information, and rewards are decided entirely by the actions.

7
Matrix Games
  • Example: rock, paper, scissors
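The row player's payoffs (win = 1, loss = -1, tie = 0), with rows the agent's action and columns the opponent's:

                 rock   paper   scissors
    rock           0     -1        1
    paper          1      0       -1
    scissors      -1      1        0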

8
What exactly does optimality mean?
  • MDP
  • A stationary, deterministic, and undominated optimal policy always exists.
  • MG
  • The performance of a policy depends on the opponent's policy, so policies cannot be evaluated out of context.
  • Game theory therefore uses a different definition of optimality:
  • an optimal policy is one that performs best in its worst case, compared with all other policies.
  • At least one optimal policy exists, and it may or may not be deterministic, because the agent is uncertain of its opponent's move; in rock, paper, scissors, for example, the only optimal policy plays each action with probability 1/3.

9
Finding Optimal Policy - Matrix Games
  • The optimal agent's minimum expected reward should be as large as possible.
  • Use V to denote this minimum value, then consider how to maximize it, as formalized below.
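Letting \pi range over the probability distributions \Pi(A) on the agent's actions, the quantity to maximize is

    V = \max_{\pi \in \Pi(A)} \min_{o \in O} \sum_{a \in A} R(a, o)\, \pi_a,

which can be computed by linear programming.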

10
Finding Optimal Policy - MDP
  • Value of a state
  • Quality of a state-action pair
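In Bellman-equation form:

    V(s) = \max_{a \in A} Q(s, a),
    Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s').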

11
Finding Optimal Policy - 2P-MG
  • Value of a state
  • Quality of a state-action-opponent-action (s, a, o) triple
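The max over single actions in the MDP equations becomes a maximin over mixed policies:

    V(s) = \max_{\pi \in \Pi(A)} \min_{o \in O} \sum_{a \in A} Q(s, a, o)\, \pi_a,
    Q(s, a, o) = R(s, a, o) + \gamma \sum_{s'} T(s, a, o, s')\, V(s').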

12
Learning Optimal Policies
  • Q-learning
  • minimax-Q learning
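After experience \langle s, a, r, s' \rangle, Q-learning performs the update

    Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma \max_{a'} Q(s', a') - Q(s, a) ];

minimax-Q also observes the opponent's action o and replaces the max with the maximin value V(s') of the matrix game induced by the Q-values at s':

    Q(s, a, o) \leftarrow Q(s, a, o) + \alpha [ r + \gamma V(s') - Q(s, a, o) ].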

13
Minimax-Q Algorithm
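Below is a minimal Python sketch of the two core steps, the Q backup and re-solving the per-state matrix game by linear programming. The use of numpy/scipy and the dict-of-arrays layout of Q, V, and pi are assumptions of this sketch, not details taken from the slides.

    import numpy as np
    from scipy.optimize import linprog

    def solve_matrix_game(Q_s):
        """Solve max_pi min_o sum_a pi[a] * Q_s[a, o] as a linear program.

        Q_s is an |A| x |O| array of Q-values for a single state.
        Returns the maximin policy pi and the game value v.
        """
        n_a, n_o = Q_s.shape
        # Variables are [pi_1, ..., pi_|A|, v]; maximizing v = minimizing -v.
        c = np.zeros(n_a + 1)
        c[-1] = -1.0
        # One inequality per opponent action o:
        #   v - sum_a pi[a] * Q_s[a, o] <= 0.
        A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
        b_ub = np.zeros(n_o)
        # The policy must be a probability distribution: sum_a pi[a] = 1.
        A_eq = np.zeros((1, n_a + 1))
        A_eq[0, :n_a] = 1.0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n_a + [(None, None)])
        return res.x[:n_a], res.x[-1]

    def minimax_q_update(Q, V, pi, s, a, o, r, s_next, alpha, gamma):
        """One minimax-Q backup after experience (s, a, o, r, s_next)."""
        Q[s][a, o] = (1 - alpha) * Q[s][a, o] + alpha * (r + gamma * V[s_next])
        pi[s], V[s] = solve_matrix_game(Q[s])

In the full algorithm, actions are drawn from pi[s] with occasional uniform exploration, and the learning rate alpha decays toward zero over training.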
14
Experiment - Problem
  • Soccer: a zero-sum grid game in which two players compete for the ball, each trying to carry it into the opponent's goal.

15
Experiment - Training
  • Four agents were trained through 10^6 steps:
  • minimax-Q learning
  • vs. a random opponent - MR
  • vs. itself - MM
  • Q-learning
  • vs. a random opponent - QR
  • vs. itself - QQ

16
Experiment - Testing
  • Test 1
  • QR > MR?
  • Test 2
  • QR << QQ?
  • Test 3
  • Are QR and QQ 100% losers against their worst-case opponents?

17
Contributions
  • A solution method for two-player zero-sum Markov games: a modified Q-learning algorithm in which minimax takes the place of max.
  • Minimax can also be used in single-agent environments to avoid risky behavior.

18
Future work
  • Possible performance improvements for the minimax-Q learning method
  • Linear programming accounts for most of the computational cost.
  • Iterative methods may yield approximate minimax solutions much faster, and such approximations may be sufficiently accurate.

19
Discussions
  • The paper claims that the training was not sufficient for MR and MM to attain the optimal policy. How soon would it be possible for them to do so?
  • It is claimed that MR and MM should break even with even the strongest opponent. Why?
  • After training and before testing, the agents' policies are fixed. What if they were instead left unfixed, with their learning abilities in place? We could then examine how they adapt over the long run, e.g., how their winning rates change.
  • What counts as a "slow enough exponentially weighted average"?