Case Study: GPU-based Implementation of Sequence Pair Based Floorplanning Using CUDA
1
Case Study: GPU-based Implementation of Sequence Pair Based Floorplanning Using CUDA
  • Won Ha Choi and Xun Liu
  • Department of Electrical and Computer
    Engineering, North Carolina State University
  • ISCAS 2010

2
Outline
  • Introduction
  • Reviews of preliminaries
  • CUDA and GPU architecture
  • Sequence pair representations
  • Parallelization of floorplanning
  • Parameter setup for kernel call
  • Experimental results
  • Conclusions

3
Introduction
  • The runtime of VLSI CAD tools grows rapidly due to increasing design complexity.
  • Parallel programming offers an opportunity to reduce runtime at low cost.
  • GPUs in particular deliver high computing power at a low price.

4
Introduction (cont.)
  • Two challenges of CAD on GPUs
  • Tools are developed for sequential computing platforms.
  • Traditional CAD algorithms contain data dependencies.
  • In this work, a dependency-removal technique and GPU programming are used to address these challenges.

5
Introduction (cont.)
  • As a result, sequence pair based floorplanning, whose core is a simulated annealing (SA) algorithm, is chosen as the demonstration.

6
Review CUDA
  • Compute Unified Device Architecture
  • Developed by NVIDIA.
  • The CPU code handles the sequential part.
  • The highly parallel part is usually implemented in GPU code, called a kernel.
  • Invoking a GPU function from CPU code is called a kernel launch.

8
Review GPU architecture
9
Review Sequence pair
  • Sequence pair (SP) representation basics
  • A packing is represented by a pair of module permutations called a sequence pair, e.g., (s1, s2) = (abdecf, cbfade).
  • Moves of simulated annealing
  • Swap two modules in the first sequence.
  • Swap two modules in both sequences.
  • Rotate a module.
  • Module x after (before) module y in both sequences ⇒ x is to the right (left) of y.
  • Module x after (before) y in s1 and before (after) y in s2 ⇒ x is below (above) y.
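The two position rules above can be checked directly from each module's index in s1 and s2 — a minimal C sketch (function and array names are illustrative, not from the paper):

```c
#include <stdbool.h>

/* pos1[m] / pos2[m]: the index of module m within s1 / s2. */

/* b is to the right of a iff b follows a in BOTH sequences. */
bool right_of(const int *pos1, const int *pos2, int a, int b) {
    return pos1[b] > pos1[a] && pos2[b] > pos2[a];
}

/* b is below a iff b follows a in s1 but precedes a in s2. */
bool below(const int *pos1, const int *pos2, int a, int b) {
    return pos1[b] > pos1[a] && pos2[b] < pos2[a];
}
```

For the slide's example (abdecf, cbfade), numbering modules a..f as 0..5, `right_of` reports that d is to the right of a, and `below` reports that c is below a.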

10
Packing in SP
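A packing like the one illustrated on this slide can be evaluated with the classical O(n²) longest-path interpretation of a sequence pair; the paper does not list code for it, so this is a minimal C sketch with illustrative names (modules are visited in s2 order so that every module left of or below the current one is already placed):

```c
/* Classical O(n^2) evaluation of a sequence pair (s1, s2): a module's
 * x is pushed past every module to its left, and its y past every
 * module below it.  Returns the bounding-box area of the packing. */
long pack_area(int n, const int *s1, const int *s2,
               const int *w, const int *h) {
    int pos1[64];                      /* index of each module in s1 */
    int x[64] = {0}, y[64] = {0};      /* assumes n <= 64 for brevity */
    long width = 0, height = 0;
    for (int i = 0; i < n; i++)
        pos1[s1[i]] = i;
    for (int i = 0; i < n; i++) {
        int b = s2[i];
        for (int j = 0; j < i; j++) {
            int a = s2[j];             /* a precedes b in s2 ... */
            if (pos1[a] < pos1[b]) {   /* ... and in s1: a is left of b */
                if (x[a] + w[a] > x[b]) x[b] = x[a] + w[a];
            } else {                   /* a after b in s1: a is below b */
                if (y[a] + h[a] > y[b]) y[b] = y[a] + h[a];
            }
        }
        if (x[b] + w[b] > width)  width  = x[b] + w[b];
        if (y[b] + h[b] > height) height = y[b] + h[b];
    }
    return width * height;
}
```

For two modules of sizes 2×4 and 3×5, the pair ((0,1), (0,1)) places them side by side (bounding box 5×5), while ((0,1), (1,0)) stacks them (bounding box 3×9).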
11
Parallelization of floorplanning
  • Traditional SA-based floorplanning
  • Generate an initial solution.
  • Perturb it to obtain a new solution.
  • Compute the packing area.
  • Check the stopping criterion and update the solution if needed.
  • Which is the slowest part?

The slowest part is computing the packing area.
12
Parallelization of floorplanning (cont.)
  • Data dependency: the ith solution depends on the (i-1)th solution.
  • How can it be removed?
  • The proposed approach: generate multiple SPs in advance and store them in an array that is transferred to the GPU for area calculations.
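The proposed batching step might look like the following CPU-side C sketch (names are hypothetical; `perturb` stands in for one of the three SA moves, and each SP is derived from the previous one whether or not that one would be accepted):

```c
#include <stdlib.h>

/* Stand-in perturbation: swap two modules in the first sequence
 * (the simplest of the three SA moves). */
static void perturb(int *s1, int n) {
    int i = rand() % n, j = rand() % n;
    int t = s1[i]; s1[i] = s1[j]; s1[j] = t;
}

/* Generate `count` SPs in advance from a seed pair, laid out in one
 * flat array (count * 2 * n ints) so that a single host-to-device
 * copy can hand the whole batch to the GPU for parallel evaluation. */
void generate_batch(const int *s1, const int *s2, int n,
                    int count, int *batch) {
    const int *p1 = s1, *p2 = s2;
    for (int k = 0; k < count; k++) {
        int *d1 = batch + k * 2 * n;   /* this SP's first sequence */
        int *d2 = d1 + n;              /* this SP's second sequence */
        for (int i = 0; i < n; i++) { d1[i] = p1[i]; d2[i] = p2[i]; }
        perturb(d1, n);
        p1 = d1; p2 = d2;              /* next SP chains off this one */
    }
}
```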

13
Schematic view
14
Parallelization of floorplanning (cont.)
  • In fact, the proposed parallelization scheme does not follow the order of traditional SA.
  • In traditional SA, if the ith SP is rejected, the (i+1)th SP would not be generated from it; here it is generated from the ith SP regardless.
  • Hence, some degradation in solution quality may occur.
  • In practice, the degradation is quite small, and some runs even achieve a better packing area.

15
Parameter setup for kernel call
  • A kernel call incurs a long runtime overhead.
  • Therefore, minimize the number of kernel calls.
  • Each batch contains Q × M SPs.
  • Q: the maximum number of temperature iterations that a kernel call can execute at once.
  • M: the number of solution comparisons executed per temperature iteration.

16
Parameter setup for kernel call (cont.)
  • Maximizing Q would minimize the kernel-call overhead.
  • But the OS's constraint on the runtime of a host-to-device (kernel) call limits Q.
  • Finally, Q = Nmax / M is chosen, where Nmax is the maximum number of threads allowed in a kernel call.
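Under this choice, each launch evaluates at most Q × M = Nmax SPs — a minimal C sketch of the batch sizing (function and parameter names are illustrative):

```c
/* One kernel launch evaluates a batch of Q * M sequence pairs, one
 * per thread.  Choosing Q = Nmax / M keeps the total thread count
 * within the Nmax limit. */
int batch_size(int n_max, int m) {
    int q = n_max / m;   /* temperature iterations per kernel call */
    return q * m;        /* SPs (threads) per launch, at most n_max */
}
```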

17
Parameter setup for kernel call (cont.)
  • For M: L ≤ M ≤ P.
  • L is the maximum number of consecutive rejections allowed; at least the first L SPs must be packed.
  • P is the maximum number of solution comparisons allowed per temperature iteration.
  • In the experiments, M = P.

18
Experimental results
  • Environment
  • CPU: Intel Core 2 Duo E8200, 2.66 GHz.
  • GPU: NVIDIA Quadro FX 5600, 1.35 GHz.
  • The OS is not mentioned.
  • Implemented in C with CUDA.
  • Benchmarks
  • Blocks: 16 to 256, in steps of 16.
  • Block dimensions are randomly generated within the range (0, 15].

19
Experimental results (cont.)
  • Runtime comparisons
  • For a fair comparison, the solution quality, in terms of area, is kept the same.

20
Experimental results (cont.)
  • Final area comparisons
  • The maximum difference is less than 5%.

21
Change of code percentage
  • Most of the code does not need to be changed to run in parallel on the GPU.

22
Some discussions
  • How about using a decision tree to generate SPs?
  • Similar to the idea of a carry-select adder.
  • How about using multiple independent SPs?

23
Conclusions
  • The runtime of VLSI CAD algorithms can be greatly reduced using GPUs.
  • By removing data dependencies appropriately, a sequential algorithm can be made much faster with only small modifications.
  • This suggests encouraging potential for an automatic tool that converts traditional CAD programs into GPU-based parallel programs.