1
Progressive Strategies For Monte-Carlo Tree Search
Authors: G.M.J.B. Chaslot, M.H.M. Winands,
J.W.H.M. Uiterwijk, H.J. van den Herik, and B. Bouzy
  • Presenter: Ling Zhao
  • University of Alberta
  • November 5, 2007

2
Outline
  • Monte-Carlo Tree Search (MCTS) and its
    implementation in MANGO.
  • Progressive strategies: progressive bias and
    progressive unpruning.
  • Experiments.
  • Conclusions and future work.

3
MCTS
  • The four phases of a simulated game: selection,
    expansion, simulation, and backpropagation
    (each detailed on the following slides).
4
Selection
  • Process: select moves in the UCT tree for the best
    balance between exploitation and exploration.
  • A multi-armed bandit problem.
  • UCB formula (sketched below):
    $k \in \operatorname{argmax}_i \left( v_i + C \sqrt{\ln n_p / n_i} \right)$
  • k: the child of node p that is selected
  • $v_i$: value of node i; $n_i$: visit count of node i;
    $n_p$: visit count of node p
  • C: a constant
  • Selection precondition: $n_p > T$ (= 30).
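
A minimal Python sketch of this selection step. The Node structure and
the value of C are illustrative assumptions; the slides do not give
MANGO's actual constants.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0          # n_i: visit count of this node
    value: float = 0.0       # v_i: mean result of simulations through it
    children: list = field(default_factory=list)

def ucb_select(parent: Node, C: float = 0.7) -> Node:
    """Return the child maximizing v_i + C * sqrt(ln(n_p) / n_i)."""
    def ucb(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # unvisited children are tried first
        return child.value + C * math.sqrt(
            math.log(parent.visits) / child.visits)
    return max(parent.children, key=ucb)
```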

5
Expansion
  • Process: for a given leaf node, determine whether
    it will be expanded by storing one or more of its
    children in the UCT tree.
  • Simple rule: expand one node per simulated game
    (the first node encountered that is not yet in the
    UCT tree).
  • In MANGO, once $n_p = T$ (= 30), all of the node's
    children are expanded (sketch below).
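
A sketch of MANGO's expansion rule as described above, reusing the Node
class from the selection sketch; `legal_moves` is a hypothetical list of
legal moves in the node's position.

```python
T = 30  # expansion threshold from the slides

def maybe_expand(node: Node, legal_moves: list) -> None:
    """Store all children in the tree once the node reaches T visits."""
    if node.visits >= T and not node.children:
        node.children = [Node() for _ in legal_moves]
```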

6
Simulation
  • Process: self-play until the end of the game.
  • Rules: 1. disallow play in its own eyes;
    2. stop the game after a certain number of moves.
  • In MANGO, the probability of a move being selected
    in simulation is proportional to its urgency: the
    sum of its capture value, 3x3 pattern value, and
    proximity modification (sketch below).
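
A sketch of the urgency-proportional move choice; `urgency` is a
hypothetical function returning the (assumed non-negative) sum of
capture value, 3x3 pattern value, and proximity modification.

```python
import random

def pick_simulation_move(moves: list, urgency):
    """Sample one move with probability proportional to its urgency."""
    weights = [urgency(m) for m in moves]  # assumed non-negative
    return random.choices(moves, weights=weights, k=1)[0]
```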

7
Backpropagation
  • Process: use the result of a simulated game to
    update the nodes it traverses.
  • Result: +1 for a win, -1 for a loss, 0 for a draw.
  • $v_i$ of node i is computed by averaging the results
    of all simulated games played through it (sketch
    below).
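
A sketch of backpropagation with an incremental mean, reusing the Node
class from the selection sketch.

```python
def backpropagate(path: list, result: float) -> None:
    """Update each Node on the root-to-leaf path with the game result
    (+1 win, -1 loss, 0 draw). A full implementation would negate the
    result between plies for the alternating players."""
    for node in path:
        node.visits += 1
        # running mean: v_i is the average of all results through node i
        node.value += (result - node.value) / node.visits
```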

8
Progressive Strategies
  • A soft transition between the selection strategy and
    the simulation strategy.
  • Intuition: the selection strategy becomes more
    accurate than the simulation strategy only when the
    number of simulated games is large.
  • A progressive strategy uses the information already
    available to the selection strategy, plus some
    expensive domain knowledge.
  • A progressive strategy behaves like the simulation
    strategy when few games have been played, and
    converges to the selection strategy as many games
    are played.

9
Progressive Bias
  • Directs the search using possibly expensive
    heuristic knowledge.
  • Modifies the selection strategy, ensuring that the
    knowledge's influence decreases quickly as more
    games are played.

10
Progressive Bias Formula
  • The bias term $f(n_i) = \frac{H_i}{n_i + 1}$ is added
    to the selection score:
    $k \in \operatorname{argmax}_i \left( v_i + C \sqrt{\ln n_p / n_i} + f(n_i) \right)$
  • $H_i$ is a coefficient representing heuristic
    knowledge.
  • For children with $n_i = 0$, the value is replaced
    by M, with M >> any $v_i$; thus the unvisited child
    with the highest $f(n_i)$ is selected first.
  • If $n_p \in [30, 100]$, $f(n_i)$ is dominant.
  • If $n_p \in (100, 500]$, $f(n_i)$ has partial impact.
  • When $n_p > 500$, $f(n_i)$ is dominated, but can
    serve as a tie-breaker (sketch below).
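
A sketch of selection with the progressive bias term. C and M are
illustrative constants, and each child is assumed to carry a
precomputed `heuristic` field holding H_i (not part of the earlier Node
sketch).

```python
import math

def biased_select(parent, C: float = 0.7, M: float = 1e6):
    """UCB selection plus the progressive bias f(n_i) = H_i / (n_i + 1)."""
    def score(child):
        f = child.heuristic / (child.visits + 1)   # f(n_i)
        if child.visits == 0:
            # v_i replaced by M >> any v_i, so unvisited children
            # are ordered purely by the bias term f(n_i)
            return M + f
        return (child.value
                + C * math.sqrt(math.log(parent.visits) / child.visits)
                + f)
    return max(parent.children, key=score)
```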

11
Alternative Approach
  • Use prior knowledge to initialize node estimates
    (Gelly and Silver).
  • The scalability of this approach to larger board
    sizes is an open question.

12
Progressive Unpruning
  • Artificially reduce the branching factor when the
    selection strategy begins to be used.
  • Progressively increase the branching factor as more
    games are simulated.
  • Pruning and unpruning are done according to the
    heuristic values of the children.

13
Progressive Unpruning (Details)
  • If $n_p = T$, only the $k_0$ (= 5) children with the
    highest heuristic values are left unpruned.
  • If $n_p > T$, roughly $k = 2.67 \, \log_2(n_p / 40) + k_0$
    children are left unpruned.
  • Examples: k = 5 ($n_p$ = 40), 7 ($n_p$ = 80),
    10 ($n_p$ = 120).
  • A similar idea was used by Coulom (progressive
    widening); see the sketch below.
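
A sketch of the unpruning rule above; the floor rounding is an
assumption, since the slide's example values do not pin down MANGO's
exact rounding.

```python
import math

def unpruned_children(node, k0: int = 5) -> list:
    """Keep only the k children with the highest heuristic values."""
    if node.visits <= 40:
        k = k0
    else:
        # k = 2.67 * log2(n_p / 40) + k0, floored (rounding assumed)
        k = int(2.67 * math.log2(node.visits / 40)) + k0
    ranked = sorted(node.children, key=lambda c: c.heuristic, reverse=True)
    return ranked[:k]
```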

14
Heuristic Values
  • Pattern value: learned offline using pattern
    matching (89,119 patterns from 2,000 pro games).
  • Capture value: the number of stones the move would
    capture or rescue from capture.
  • Proximity value: based on the Euclidean distance to
    the last move.

15
Heuristic Value Formula
  • The formula for $H_i$ combines the following
    components (illustrative sketch below):
  • $C_i$: capture value
  • $P_i$: pattern value
  • $D_{k,i}$: distance to the k-th last move
  • $\alpha_k = 1.25^{k/2}$: weight for the k-th last move
  • Computing $P_i$ is the time-consuming part.
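
The formula itself appears to have been an image lost in transcription;
the sketch below is a purely illustrative combination of the legend's
components, with the way distance enters chosen as an assumption rather
than taken from the paper.

```python
def heuristic_value(capture: float, pattern: float,
                    distances: list) -> float:
    """Illustrative only: combine C_i, P_i, and the distances D_{k,i}
    to the last moves, weighted by alpha_k = 1.25**(k/2). MANGO's real
    combination is not recoverable from this transcript."""
    h = capture + pattern
    for k, d in enumerate(distances, start=1):
        alpha = 1.25 ** (k / 2)
        h += 1.0 / (alpha * d)  # assumed form: nearer moves weigh more
    return h
```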

16
Time For Computing Heuristics
  • Computing H is around 1000 times slower than
    playing a move in a simulated game.
  • So H is computed only once per node, when T (= 30)
    games have been played through it (sketch below).
  • The speed reduction is only 4%, since the number of
    nodes with visit count > 30 is low compared to the
    total number of moves in simulated games.
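
A sketch of this once-per-node caching; `compute_H` stands in for the
expensive knowledge evaluation, and `H` is assumed to be a node field
initialized to None.

```python
T = 30  # visit-count threshold from the slides

def get_heuristic(node, compute_H):
    """Compute the expensive knowledge H at most once per node, only
    after T games have been played through it."""
    if node.H is None and node.visits >= T:
        node.H = compute_H(node)  # cached; never recomputed
    return node.H
```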

17
Domain Knowledge Calls Vs. T
18
Visit Count Vs. Number of Nodes
19
Experiments
  • Self-play games on the 13x13 board (10 sec per
    move): MANGO with progressive strategies won 91%
    of 500 games against MANGO without progressive
    strategies.
  • MANGO: 20,000 simulated games per move; about 1 sec
    on 9x9, 2 sec on 13x13, and 5 sec on 19x19.
  • GNU Go: level 10 on 9x9 and 13x13, level 0 on
    19x19.

20
MANGO Vs. GNU Go
21
MANGO Vs. GNU Go
  • Plain MCTS does not scale well to the 13x13 and
    19x19 boards.
  • Progressive strategies are useful on every board
    size.
  • The two progressive strategies combined are the
    most powerful, especially on 19x19.

22
Tournament Results
  • MANGO always finished in the top half.
  • But were negative results omitted?

23
Conclusions and Future Work
  • The two progressive strategies are useful, providing
    a soft transition between selection and simulation.
  • Their overhead is negligible.
  • Future work: combine with RAVE and with UCT using
    prior knowledge.
  • Combine with the advanced knowledge developed by
    Coulom.
  • Use life-and-death information.
  • Develop a better progressive bias.
  • Reference: P.-A. Coquelin and R. Munos. Bandit
    Algorithms for Tree Search. Technical Report 6141,
    INRIA, 2007.