Iterated Prisoner's Dilemma using Zhu's Algorithm

1
Iterated Prisoner's Dilemma using Zhu's Algorithm
  • By
  • Rohan Shetty (rohans_at_)
  • Satyam Sharma (ssatyam_at_)
  • Course instructor: Amitabha Mukerjee

2
Contents
  • Motivation
  • Basics
  • Previous Work
  • Zhu's Algorithm
  • Results Expected
  • References

3
Motivation
  • Game theory (von Neumann & Morgenstern, 1947) is
    useful for understanding the performance of human
    and autonomous game players.
  • It is widely employed to solve resource
    allocation problems in distributed decision-making
    systems (multi-agent systems).
  • It is applied in various other fields, e.g.:
  • Economics: innovated antitrust policy
  • Political Science: developed election laws
  • Computer Science: new software algorithms and
    routing protocols
  • Military Strategy: created nuclear policies and
    notions of strategic deterrence
  • Sports: coaching staffs decide new team strategies
  • Biology: determined which species have the
    greatest likelihood of extinction

4
Basics
  • Normal single-agent learning
  • The environment has observable states
  • The agent performs some action, leading to a state
    transition
  • The agent gets some reward for its action and
    tries to improve its actions to increase the
    reward (a minimal Q-learning sketch follows this
    list)
  • Here the environment is stationary.
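
A minimal sketch of the single-agent update described above, assuming a dictionary-based Q table; the learning rate and discount below are illustrative placeholders, not values from the slides.

    # Tabular Q-learning update for a stationary single-agent environment.
    # Q maps (state, action) pairs to estimated long-term reward.
    def q_update(Q, state, action, reward, next_state, actions,
                 alpha=0.1, gamma=0.9):
        best_next = max(Q.get((next_state, a), 0.0) for a in actions)
        old = Q.get((state, action), 0.0)
        # Move the estimate toward reward + discounted value of the next state.
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)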

5
Basics (contd): Multi-Agent Learning Problem
  • Each agent tries to solve its own learning
    problem, while other agents in the environment
    are also trying to solve theirs.
  • The environment cannot be considered stationary.

6
Basics (contd)
  • Definition 2
  • A Nash equilibrium is a strategy profile in which
    any player who unilaterally changes her strategy
    would be rewarded less.
  • E.g. (Defect, Defect) is a Nash equilibrium for
    the IPD.

7
Basics (contd)
  • Definition 3
  • A profile p is more Pareto efficient than profile
    q if and only if the reward for each player under
    profile p is greater than the reward for each
    player under profile q.
  • E.g. (C, C) is more Pareto efficient than (D, D).

8
Iterated Prisoner's Dilemma
  • A two-player, binary-choice, non-zero-sum game
    can be described using a 2 x 2 payoff matrix of
    outcomes (temptation T, reward R, punishment P,
    sucker's payoff S), whose payoffs satisfy the two
    conditions below (checked in the code sketch
    after this list):
  • 1. T > R > P > S
  • 2. 2R > S + T
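
The payoff matrix on this slide did not survive as text. The sketch below fills in the conventional IPD payoffs T = 5, R = 3, P = 1, S = 0 (an assumption, not values from the slides) and checks the two conditions together with the Nash and Pareto claims from the earlier definitions.

    # Conventional IPD payoffs (assumed values): T = temptation, R = reward,
    # P = punishment, S = sucker's payoff.
    T, R, P, S = 5, 3, 1, 0
    assert T > R > P > S      # condition 1
    assert 2 * R > S + T      # condition 2

    # Row player's payoff for (my_move, opponent_move); moves are 'C' or 'D'.
    payoff = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}

    # (D, D) is a Nash equilibrium: deviating to C lowers the deviator's payoff.
    assert payoff[('C', 'D')] < payoff[('D', 'D')]

    # (C, C) is more Pareto efficient than (D, D): both players earn more.
    assert payoff[('C', 'C')] > payoff[('D', 'D')]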

9
Previous work
  • Littman (1994) [2] proposed a framework for
    learning in zero-sum games called
    minimax-Q-learning, which is an extension of
    traditional Q-learning for MDPs.
  • The modification lies in using Q(s, a, o) instead
    of Q(s, a), where o is the action of the opponent
    (see the sketch after this list).
  • The minimum of the Q values over the opponent's
    actions corresponds to the opponent's optimal
    action.
  • So the agent can estimate the action a rational
    opponent would take.
  • It is limited to learning in zero-sum games, in
    which payoffs are negatively correlated.
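
A minimal sketch of the Q(s, a, o) idea, assuming dictionary-based tables. Littman's full minimax-Q maximizes over mixed strategies via a linear program; the simplification below takes the pure-strategy max-min value instead.

    # Minimax-Q sketch: Q is indexed by (state, own_action, opponent_action).
    def minimax_value(Q, state, actions, opp_actions):
        # Pessimistic state value: assume the opponent picks the action
        # that minimizes our Q value (its optimal action in a zero-sum game).
        return max(min(Q.get((state, a, o), 0.0) for o in opp_actions)
                   for a in actions)

    def minimax_q_update(Q, s, a, o, reward, s_next, actions, opp_actions,
                         alpha=0.1, gamma=0.9):
        v_next = minimax_value(Q, s_next, actions, opp_actions)
        old = Q.get((s, a, o), 0.0)
        Q[(s, a, o)] = old + alpha * (reward + gamma * v_next - old)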

10
Previous work (contd)
  • For RL in non-zero-sum games, Carmel and
    Markovitch (1996) [3] describe a model-based
    approach in which the learning process splits
    into two separate stages.
  • In the first stage, the agent infers a model of
    the other agent based on the interaction history.
  • In the second stage, the agent utilizes the
    learned model to predict the opponent's future
    strategies (a rough sketch follows this list).
  • But this work is limited to the learning of an
    agent against a non-learning agent.
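
A rough two-stage sketch in the same spirit; the frequency-count model and the choice of the previous round as the context are illustrative assumptions, not Carmel and Markovitch's actual inference procedure.

    from collections import Counter, defaultdict

    # Stage 1: infer a simple frequency model of the opponent from the joint
    # history, keyed on the previous round's joint moves. history is a list
    # of (my_move, opp_move) pairs, with moves 'C' or 'D'.
    def infer_opponent_model(history):
        model = defaultdict(Counter)
        for prev_round, (_, opp_move) in zip(history, history[1:]):
            model[prev_round][opp_move] += 1
        return model

    # Stage 2: predict the opponent's next move and best-respond to it.
    # payoff maps (my_move, opp_move) to my reward, e.g. the IPD matrix above.
    def best_response(model, last_round, payoff):
        counts = model.get(last_round)
        predicted = counts.most_common(1)[0][0] if counts else 'C'
        return max(('C', 'D'), key=lambda my: payoff[(my, predicted)])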

11
Previous work (contd)
  • Sandholm and Crites (1996) [4] built Q-learning
    agents for the IPD. They showed that the optimal
    strategy learned against an opponent with a fixed
    policy was tit-for-tat, and studied the behaviors
    when two Q-learning agents face each other.
  • The convergence of the algorithm is not proved in
    a non-stationary environment.
  • Some approaches, like Sen and Arora (1997) [5],
    try to build an opponent model while playing, but
    they expect the opponent's strategy to be
    deterministic.

12
Zhu's Algorithm
  • Objectives
  • To find a policy which is more Pareto efficient
    than the Nash equilibrium
  • To make the agents learn to cooperate with other
    learning agents
  • To exploit weak opponents
  • To perform well even in the presence of noise

13
Probabilistic Tit-for-Tat
  • The tit-for-tat strategy is not optimal when
    there is noise in the system.
  • Probabilistic tit-for-tat (PTFT) solves the
    problem of noise for the original tit-for-tat
    (see the sketch below).
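
A minimal sketch of a probabilistic tit-for-tat player. The exact rule is not spelled out on the slide, so the interpretation below (forgive an observed defection with probability lam) is an assumption; it keeps a single noisy defection from locking two tit-for-tat players into endless mutual retaliation, and lam = 0 reduces to plain tit-for-tat.

    import random

    # Probabilistic tit-for-tat: reciprocate cooperation, but forgive an
    # observed defection with probability lam (assumed semantics of lambda).
    def ptft_move(opponent_last_move, lam=0.1):
        if opponent_last_move is None or opponent_last_move == 'C':
            return 'C'            # cooperate on the first move and after C
        return 'C' if random.random() < lam else 'D'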

14
The Hill Climbing Exploration algorithm
  • The idea behind the hill-climbing exploration
    (HCE) algorithm is that the agents explore
    strategies and hope that the opponents respond
    with strategies that benefit them and compensate
    for the exploration cost.
  • If the agents are constantly benefited by such
    strategies, these strategies are cooperative ones.
15
HCE (contd)
  • The algorithm has two levels:
  • Hyper level
  • Hypo level
  • At the hyper level, the hyper-controller decides
    the state of the hypo-controller.
  • The hypo-controller acts differently according to
    the state and also reports its observations to
    the hyper-controller.

16
HCE (contd)
  • The hypo-controller decides whether the
    opponent's behavior is Malicious or Generous
    (a code sketch follows this list).
  • It reports Malicious behavior if
  • The agent's own action dominates its payoff
  • The current change of the opponent is destructive
  • The agent's payoff is less than the payoff
    reference
  • Otherwise it reports Generous
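
A sketch of this reporting rule. The three tests come from the slide, but how "dominates", "destructive" and the payoff reference are actually computed in Zhu's algorithm is not specified here, so they are passed in as pre-computed values.

    # Hypo-controller report: classify the opponent's recent behavior.
    # own_action_dominates: the agent's own action largely determines its payoff.
    # opponent_change_destructive: the opponent's latest strategy change hurt us.
    # payoff / payoff_reference: the agent's current payoff and its reference level.
    def hypo_report(own_action_dominates, opponent_change_destructive,
                    payoff, payoff_reference):
        if (own_action_dominates
                or opponent_change_destructive
                or payoff < payoff_reference):
            return 'Malicious'
        return 'Generous'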

17
HCE (contd)
  • The hyper-controller plays a hyper game with two
    strategies, foresight (cooperate) and myopia
    (defect).
  • The hyper game is like the prisoner's dilemma,
    played with the PTFT strategy with lambda set
    to 0.
  • The hyper-controller instructs the
    hypo-controller to be (see the sketch after this
    list):
  • Foreseeing with probability delta, if the
    hypo-controller reported that the opponent is
    malicious
  • Foreseeing, if the hypo-controller reported that
    the opponent is generous
  • Myopic otherwise
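
A sketch of that decision rule; delta is the probability named on the slide, and its value of 0.1 below is only a placeholder.

    import random

    # Hyper-controller: choose the hypo-controller's state from the latest report.
    # A Generous report always yields the foreseeing (cooperative) state; a
    # Malicious report yields foreseeing with probability delta, myopic otherwise.
    def hyper_decide(report, delta=0.1):
        if report == 'Generous':
            return 'foreseeing'
        if report == 'Malicious' and random.random() < delta:
            return 'foreseeing'
        return 'myopic'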

18
HCE (contd)
  • In the myopic state the agents act as selfish
    agents and try to maximize their own reward
    without considering the behavior of the opponent.
  • In the foreseeing state the agents explore the
    neighborhood of the current profile to find a
    more Pareto efficient profile than the current
    one (a rough sketch of both behaviors follows).
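
A rough sketch of the two behaviors. The "neighborhood" here is just a small random perturbation of the agent's cooperation probability, an illustrative stand-in for HCE's actual exploration scheme (which the slides do not detail).

    import random

    # Myopic state: ignore the opponent's adaptation and pick the action with
    # the best immediate expected payoff against its observed cooperation rate.
    # payoff maps (my_move, opp_move) to my reward, e.g. the IPD matrix above.
    def myopic_move(payoff, opp_coop_prob):
        def expected(my):
            return (opp_coop_prob * payoff[(my, 'C')]
                    + (1 - opp_coop_prob) * payoff[(my, 'D')])
        return max(('C', 'D'), key=expected)

    # Foreseeing state: nudge the current cooperation probability and see
    # whether the resulting profile is more Pareto efficient than the old one.
    def foreseeing_explore(coop_prob, step=0.1):
        return min(1.0, max(0.0, coop_prob + random.uniform(-step, step)))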

19
HCE (contd)
20
Results Expected
21
Results Expected (contd)
22
Results Expected (contd)
23
Results Expected (contd)
24
References
  • [1] Zhu, S. (2003). Learning to Cooperate.
  • [2] Littman, M. L. (1994). Markov games as a
    framework for multi-agent reinforcement learning.
    ICML '94.
  • [3] Carmel, D., & Markovitch, S. (1996). Learning
    models of intelligent agents. AAAI-96.
  • [4] Sandholm, T. W., & Crites, R. H. (1996).
    Multiagent reinforcement learning in the iterated
    prisoner's dilemma. Biosystems, 37, 147-66.
  • [5] Sen, S., & Arora, N. (1997). Learning to take
    risks. Collected papers from the AAAI-97 workshop
    on Multiagent Learning (pp. 59-64). AAAI.
  • [6] Bowling, M., & Veloso, M. (2000). An Analysis
    of Stochastic Game Theory for Multiagent
    Reinforcement Learning. CMU-CS-00-165.

25
References
  • Multi-Agent Learning Mini-Tutorial, Gerry Tesauro,
    IBM T. J. Watson Research Center
  • www.gametheory.net
  • http://www.giaur.qs.pl/ipdt/strategies.php