Case Study: GPU-based Implementation of Sequence Pair Based Floorplanning Using CUDA
1
Case Study: GPU-based Implementation of Sequence Pair Based Floorplanning Using CUDA
  • Won Ha Choi and Xun Liu
  • Department of Electrical and Computer
    Engineering, North Carolina State University
  • ISCAS 2010

2
Outline
  • Introduction
  • Reviews of preliminaries
  • CUDA and GPU architecture
  • Sequence pair representations
  • Parallelization of floorplanning
  • Parameter setup for kernel call
  • Experimental results
  • Conclusions

3
Introduction
  • The runtime of VLSI CAD tools grows rapidly due to increasing design complexity.
  • Parallel programming offers an opportunity to reduce runtime at low cost.
  • GPUs in particular deliver high computing power at a low price.

4
Introduction (cont.)
  • Two challenges of CAD on GPUs
  • Tools are developed for sequential computing platforms.
  • Traditional CAD algorithms contain data dependencies.
  • In this work, a dependency-removal technique and GPU programming are used to address these challenges.

5
Introduction (cont.)
  • As a result, sequence pair based floorplanning, whose core is a simulated annealing (SA) algorithm, is chosen as the demonstration.

6
Review CUDA
  • Compute Unified Device Architecture
  • Developed by NVIDIA.
  • The CPU code handles the sequential part.
  • The highly parallel part is usually implemented in GPU code, called a kernel.
  • Invoking a GPU function from CPU code is called a kernel launch.

8
Review GPU architecture
9
Review Sequence pair
  • Sequence pair (SP) representation basics
  • A packing is represented by a pair of module permutations called a sequence pair, e.g., (s1, s2) = (abdecf, cbfade).
  • Moves of simulated annealing
  • Swap two modules in the first sequence.
  • Swap two modules in both sequences.
  • Rotate a module.
  • Module x after (before) module y in both sequences ⇒ x is to the right (left) of y.
  • Module x after (before) y in s1 and before (after) y in s2 ⇒ x is below (above) y.
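The two position rules above can be checked directly from each module's index in s1 and s2 — a minimal C sketch (function and array names are illustrative, not from the paper):

```c
#include <stdbool.h>

/* pos1[m] / pos2[m]: the index of module m within s1 / s2. */

/* b is to the right of a iff b follows a in BOTH sequences. */
bool right_of(const int *pos1, const int *pos2, int a, int b) {
    return pos1[b] > pos1[a] && pos2[b] > pos2[a];
}

/* b is below a iff b follows a in s1 but precedes a in s2. */
bool below(const int *pos1, const int *pos2, int a, int b) {
    return pos1[b] > pos1[a] && pos2[b] < pos2[a];
}
```

For the slide's example (abdecf, cbfade), numbering modules a..f as 0..5, `right_of` reports that d is to the right of a, and `below` reports that c is below a.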

10
Packing in SP
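A packing like the one illustrated on this slide can be evaluated with the classical O(n²) longest-path interpretation of a sequence pair; the paper does not list code for it, so this is a minimal C sketch with illustrative names (modules are visited in s2 order so that every module left of or below the current one is already placed):

```c
/* Classical O(n^2) evaluation of a sequence pair (s1, s2): a module's
 * x is pushed past every module to its left, and its y past every
 * module below it.  Returns the bounding-box area of the packing. */
long pack_area(int n, const int *s1, const int *s2,
               const int *w, const int *h) {
    int pos1[64];                      /* index of each module in s1 */
    int x[64] = {0}, y[64] = {0};      /* assumes n <= 64 for brevity */
    long width = 0, height = 0;
    for (int i = 0; i < n; i++)
        pos1[s1[i]] = i;
    for (int i = 0; i < n; i++) {
        int b = s2[i];
        for (int j = 0; j < i; j++) {
            int a = s2[j];             /* a precedes b in s2 ... */
            if (pos1[a] < pos1[b]) {   /* ... and in s1: a is left of b */
                if (x[a] + w[a] > x[b]) x[b] = x[a] + w[a];
            } else {                   /* a after b in s1: a is below b */
                if (y[a] + h[a] > y[b]) y[b] = y[a] + h[a];
            }
        }
        if (x[b] + w[b] > width)  width  = x[b] + w[b];
        if (y[b] + h[b] > height) height = y[b] + h[b];
    }
    return width * height;
}
```

For two modules of sizes 2×4 and 3×5, the pair ((0,1), (0,1)) places them side by side (bounding box 5×5), while ((0,1), (1,0)) stacks them (bounding box 3×9).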
11
Parallelization of floorplanning
  • Traditional SA-based floorplanning
  • Generate an initial solution.
  • Perturb it to obtain a new solution.
  • Compute the packing area.
  • Check the stopping criterion and update the solution if needed.
  • Which is the slowest part?

The slowest part is computing the packing area.
12
Parallelization of floorplanning (cont.)
  • Data dependency: the ith solution depends on the (i-1)th solution.
  • How can it be removed?
  • The proposed approach: generate multiple SPs in advance and store them in an array that is transferred to the GPU for area calculations.
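The proposed batching step might look like the following CPU-side C sketch (names are hypothetical; `perturb` stands in for one of the three SA moves, and each SP is derived from the previous one whether or not that one would be accepted):

```c
#include <stdlib.h>

/* Stand-in perturbation: swap two modules in the first sequence
 * (the simplest of the three SA moves). */
static void perturb(int *s1, int n) {
    int i = rand() % n, j = rand() % n;
    int t = s1[i]; s1[i] = s1[j]; s1[j] = t;
}

/* Generate `count` SPs in advance from a seed pair, laid out in one
 * flat array (count * 2 * n ints) so that a single host-to-device
 * copy can hand the whole batch to the GPU for parallel evaluation. */
void generate_batch(const int *s1, const int *s2, int n,
                    int count, int *batch) {
    const int *p1 = s1, *p2 = s2;
    for (int k = 0; k < count; k++) {
        int *d1 = batch + k * 2 * n;   /* this SP's first sequence */
        int *d2 = d1 + n;              /* this SP's second sequence */
        for (int i = 0; i < n; i++) { d1[i] = p1[i]; d2[i] = p2[i]; }
        perturb(d1, n);
        p1 = d1; p2 = d2;              /* next SP chains off this one */
    }
}
```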

13
Schematic view
14
Parallelization of floorplanning (cont.)
  • In fact, the proposed parallelization scheme does not follow the order of traditional SA.
  • In traditional SA, if the ith SP is rejected, the (i+1)th SP would not be generated from it; here it is generated from the ith SP regardless.
  • Hence, some degradation in solution quality may occur.
  • In practice, the degradation is quite small, and some runs even achieve a better packing area.

15
Parameter setup for kernel call
  • A kernel call incurs a long runtime overhead.
  • Therefore, minimize the number of kernel calls.
  • Each batch contains Q × M SPs.
  • Q: the maximum number of temperature iterations that a kernel call can execute at once.
  • M: the number of solution comparisons executed per temperature iteration.

16
Parameter setup for kernel call (cont.)
  • Maximizing Q would minimize the kernel-call overhead.
  • But the OS's constraint on the runtime of a host-to-device (kernel) call limits Q.
  • Finally, Q = Nmax / M is chosen, where Nmax is the maximum number of threads allowed in a kernel call.
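Under this choice, each launch evaluates at most Q × M = Nmax SPs — a minimal C sketch of the batch sizing (function and parameter names are illustrative):

```c
/* One kernel launch evaluates a batch of Q * M sequence pairs, one
 * per thread.  Choosing Q = Nmax / M keeps the total thread count
 * within the Nmax limit. */
int batch_size(int n_max, int m) {
    int q = n_max / m;   /* temperature iterations per kernel call */
    return q * m;        /* SPs (threads) per launch, at most n_max */
}
```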

17
Parameter setup for kernel call (cont.)
  • For M: L ≤ M ≤ P.
  • L is the maximum number of consecutive rejections allowed; at least the first L SPs must be packed.
  • P is the maximum number of solution comparisons allowed per temperature iteration.
  • In the experiments, M = P.

18
Experimental results
  • Environment
  • CPU: Intel Core 2 Duo E8200, 2.66 GHz.
  • GPU: NVIDIA Quadro FX 5600, 1.35 GHz.
  • The OS is not mentioned.
  • Implemented in C with CUDA.
  • Benchmarks
  • Blocks: 16 to 256, in steps of 16.
  • Block dimensions are randomly generated within the range (0, 15].

19
Experimental results (cont.)
  • Runtime comparisons
  • For a fair comparison, the solution quality, in terms of area, is kept the same.

20
Experimental results (cont.)
  • Final area comparisons
  • The maximum difference is less than 5%.

21
Change of code percentage
  • Most of the code does not need to be changed to run in parallel on the GPU.

22
Some discussions
  • How about using a decision tree to generate SPs?
  • Similar to the idea of a carry-select adder.
  • How about using multiple independent SPs?

23
Conclusions
  • The runtime of VLSI CAD algorithms can be greatly reduced using GPUs.
  • By removing data dependencies appropriately, a sequential algorithm can be made much faster with only small modifications.
  • This suggests encouraging potential for an automatic tool that converts traditional CAD programs into GPU-based parallel programs.