Randomized Strategies and Temporal Difference Learning in Poker - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Randomized Strategies and Temporal Difference Learning in Poker


1
Randomized Strategies and Temporal Difference
Learning in Poker
  • Michael Oder
  • April 4, 2002
  • Advisor: Dr. David Mutchler

2
Overview
  • Perfect vs. Imperfect Information Games
  • Poker as Imperfect Information Game
  • Randomization
  • Neural Nets and Temporal Difference
  • Experiments
  • Conclusions
  • Ideas for Further Study

3
Perfect vs. Imperfect Information
  • World-class AI agents exist for many popular
    games
  • Checkers
  • Chess
  • Othello
  • These are games of perfect information
  • All relevant information is available to each
    player
  • Good understanding of imperfect information games
    would be a breakthrough

4
Poker as an Imperfect Information Game
  • Other players' hands affect how much will be
    won or lost. However, each player is not aware
    of this vital information.
  • Non-deterministic aspects as well

5
Enter Loki
  • One of the most successful computer poker players
    created
  • Produced at the University of Alberta by Jonathan
    Schaeffer et al.
  • Employs randomized strategy
  • Makes player less predictable
  • Allows for bluffing

6
Probability Triples
  • At any point in a poker game, a player has 3
    choices
  • Bet/Raise
  • Check/Call
  • Fold
  • Assign a probability to each possible move
  • A single move is now a probability triple
  • Problem: associate a payoff with the hand, betting
    history, and triple (move selected); a minimal
    sampling sketch follows below

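A probability triple can be pictured as a tiny data structure plus a
sampling step, as in the minimal Python sketch below; the action names and
example numbers are illustrative assumptions, not taken from the
presentation.

    import random

    # The three poker actions from the slide above.
    ACTIONS = ("bet_raise", "check_call", "fold")

    def sample_action(triple):
        # triple = (p_bet, p_call, p_fold); the probabilities must sum to 1.
        assert abs(sum(triple) - 1.0) < 1e-9, "probabilities must sum to 1"
        return random.choices(ACTIONS, weights=triple, k=1)[0]

    # Example triple: call most of the time, occasionally raise or fold.
    print(sample_action((0.2, 0.7, 0.1)))

Sampling from the triple, rather than always taking the most probable
action, is what makes the strategy randomized and allows for bluffing.
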
7
Neural Nets
  • One promising way to learn such functions is with
    a neural network
  • Neural Networks consist of connected neurons
  • Each connection has a weight
  • Input: a game state; output: a prediction of the
    payoff
  • Train by modifying the weights
  • Weights are modified by an amount proportional to
    the learning rate (a forward-pass sketch follows
    the diagram below)

8
Neural Net Example
[Diagram: a network whose inputs encode the hand, betting history, and
probability triple, and whose four outputs are P(2), P(1), P(-1), and
P(-2), the estimated probabilities of each possible payoff]
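As a rough companion to the diagram, the sketch below runs a forward pass
through a small two-layer network; the layer sizes, activation function,
and input encoding are assumptions for illustration, since the slides do
not specify the actual architecture.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative layer sizes: inputs encode hand, history, and triple;
    # the four outputs estimate P(payoff = +2), P(+1), P(-1), P(-2).
    N_IN, N_HIDDEN, N_OUT = 8, 6, 4
    W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_IN))
    W2 = rng.normal(scale=0.1, size=(N_OUT, N_HIDDEN))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(x):
        # Forward pass: encoded game state -> payoff-probability estimates.
        return sigmoid(W2 @ sigmoid(W1 @ x))

    print(predict(rng.random(N_IN)))

Training would then adjust W1 and W2 by amounts proportional to the
learning rate, as described on the previous slide.
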
9
Temporal Difference
  • The most common way to train a multilayer neural
    net is with backpropagation
  • Relies on simple input-output pairs
  • Problem: need to know the correct answer right
    away in order to train the net
  • Solution: Temporal Difference (TD) learning
  • The TD(λ) algorithm was developed by Richard Sutton

10
Temporal Difference (cont'd)
  • Trains responses over the course of a game over
    many time steps
  • Tries to make each prediction closer to the
    prediction in the next time step

[Diagram: a sequence of predictions P1, P2, P3, P4, P5 made at successive
time steps during one game]
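A minimal sketch of that idea, assuming a scalar prediction per time step
and an illustrative learning rate; it shows the simplest TD(0) case,
whereas the lambda in TD(λ) controls how far back each correction reaches.

    ALPHA = 0.1   # learning rate (illustrative value)

    def td_update(predictions, final_payoff):
        # predictions: the successive payoff estimates P1..Pn made during a game.
        # Each estimate is nudged toward the next one; the last estimate is
        # nudged toward the actual payoff observed at the end of the game.
        targets = predictions[1:] + [final_payoff]
        return [p + ALPHA * (t - p) for p, t in zip(predictions, targets)]

    print(td_update([0.2, 0.4, 0.3, 0.6, 0.7], final_payoff=1.0))
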
11
University of Mauritius Group
  • TD poker program produced by a group supervised
    by Dr. Mutchler
  • Provides environment for playing poker variants
    and testing agents

12
Simple Poker Game
  • Experiments were conducted on an extremely simple
    variant of poker
  • The deck consists of the 2, 3, and 4 of Hearts
  • Each player gets one card
  • One round of betting
  • The player with the highest card wins the pot
  • Goal: get the net to produce accurate payoff
    values as outputs (a sketch of the game follows
    below)

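A minimal simulation of this variant appears below; the betting amounts
(an ante of 1 and a single bet of 1) are assumptions chosen to be
consistent with the +/-1 and +/-2 payoffs that appear elsewhere in these
slides.

    import random

    DECK = [2, 3, 4]   # the 2, 3, and 4 of Hearts

    def play_hand(p1_bets, p2_calls_if_bet):
        # Returns (player 1 net payoff, player 2 net payoff) for one hand.
        c1, c2 = random.sample(DECK, 2)   # each player is dealt one card
        stake = 1                         # both players ante 1
        if p1_bets:
            if not p2_calls_if_bet:
                return 1, -1              # player 2 folds, forfeits the ante
            stake = 2                     # bet and call: 2 at risk per player
        # Showdown: the player with the highest card wins the pot.
        return (stake, -stake) if c1 > c2 else (-stake, stake)

    print(play_hand(p1_bets=True, p2_calls_if_bet=True))
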
13
Early Results
  • Started by pitting a neural net player against a
    random one
  • Results were inconsistent
  • Problem: an inappropriate value for the learning
    rate
  • Too low: outputs never approach the true payoffs
  • Too high: outputs fluctuate between too high and
    too low

14
Experiment Set I
  • Conjecture: learning should occur with a very
    small learning rate over many games
  • Learning rate: 0.01
  • Train for 50,000 games
  • Only set to train when the card is a 4
  • First player always bets, second player tested
  • Two choices
  • call 80%, fold 20% -> avg. payoff 1.4
  • call 20%, fold 80% -> avg. payoff -0.4
  • Want payoffs to settle in on the average values
    (checked in the sketch below)

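The quoted averages can be checked directly: if the second player holds
the 4 and the first player has bet, calling always wins +2 and folding
always loses the ante of 1 (these per-outcome payoffs are inferred from
the rest of the slides, not stated here).

    def avg_payoff(p_call):
        # Expected payoff when holding the 4 against a bet, assuming a call
        # wins +2 and a fold loses the ante of 1.
        return p_call * 2 + (1 - p_call) * (-1)

    for p in (0.8, 0.2):
        print(f"call {p:.0%}: avg payoff {avg_payoff(p):+.1f}")
    # call 80%: avg payoff +1.4
    # call 20%: avg payoff -0.4
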
15
Results
  • 3 out of 10 trials came within 0.1 of the correct
    result for the highest payoff
  • 2 out of 10 trials came within 0.1 of the correct
    result for the lowest payoff
  • None of the trials came within 0.1 of the correct
    result for both
  • The results were in the correct order in only
    half of the trials

16
More Distributions
  • Repeated the experiment with six choices instead
    of two
  • call 100% -> avg. payoff 2.0
  • call 80%, fold 20% -> avg. payoff 1.4
  • call 60%, fold 40% -> avg. payoff 0.8
  • call 40%, fold 60% -> avg. payoff 0.2
  • call 20%, fold 80% -> avg. payoff -0.4
  • fold 100% -> avg. payoff -1.0
  • Using more distributions did help the program
    learn to order the values of the distributions
    correctly
  • All six distributions were ranked correctly 7 out
    of 10 times (0.14 chance for any one trial)

17
Output Encoding
  • Distributions are ranked correctly, but many
    output values are still inaccurate.
  • This seems to be largely caused by the encoding
    of the outputs
  • The network has four outputs, each representing
    the probability of a specific payoff
  • This encoding is not expandable, and all four
    outputs must be correct for a good payoff
    prediction

18
Relative Payoff Encoding
  • Replace the four outputs with a single number
  • The number represents the payoff relative to the
    highest payoff possible: P = 0.5 (winnings /
    total possible)
  • The total possible winnings are determined at the
    beginning of the game (the sum of the other
    players' holdings)
  • Repeated the previous experiments using this
    encoding

19
Results (Experiment Set 2)
  • Payoff predictions were generally more accurate
    using this encoding
  • 5 out of 10 trials got the exact payoff (0.502)
    for the best distribution choice with six choices
    available
  • Most trials had a very close value for the payoff
    associated with one of the distributions
  • However, no trial was significantly close on
    multiple probability distributions

20
Observations/Conclusions
  • A neural-net player can learn strategies based on
    probability
  • Payoff is successfully learned as a function of
    the betting action
  • Consistency is still a problem
  • Trouble learning correct payoffs for more than
    one distribution

21
Further Study
  • Issues of expandability
  • Coding for multiple-round history
  • Can previous learning be extended?
  • Variable learning rate
  • Study distribution choices
  • Sample some bad distribution choices
  • Test against a variety of other players

22
Questions?