Title: Exploiting On-chip Memory Bandwidth in the VIRAM Compiler
1. Exploiting On-chip Memory Bandwidth in the VIRAM Compiler
Dave Judd, Katherine Yelick, Christoforos Kozyrakis, David Martin, and David Patterson
http://iram.cs.berkeley.edu/
2. IRAM Overview
- A processor architecture for embedded/portable systems running media applications
- MIPS scalar core with vector co-processor
- Embedded DRAM
[Block diagram: MIPS64 5Kc core (8KB instruction cache, FPU, TLB, SysAD and JTAG interfaces) coupled through a co-processor interface to the vector unit: an 8KB vector register file, a 512B flag register file, two arithmetic units (Arith 0/1), two flag units (Flag 0/1), and a memory unit, on 256b datapaths; a memory crossbar plus DMA connects to eight on-chip 2MB DRAM banks (DRAM0-DRAM7).]
3. Why Vectors?
- Utilizes the on-chip bandwidth of IRAM
  - parallelism within instructions
- Efficient architecture for vectorizable code
  - avoids the area, power, and design cost of reorder logic
  - low instruction decode overhead
- Multimedia algorithms are vectorizable
  - e.g., vectorize across pixels in an image
- Scales easily across chip generations
  - e.g., 32-way parallelism in an instruction can be implemented by 1-, 2-, 4-, or 8-way hardware
- Leverages well-known compiler technology
4. Architecture Details
- MIPS64 5Kc core (200 MHz)
  - Single-issue scalar core with 8 KByte I and D caches
- Vector unit (200 MHz)
  - 8 KByte register file (32 64b elements per register)
  - 256b datapaths, can be subdivided into 16b, 32b, 64b
  - 2 arithmetic units (1 FP, single precision), 2 flag processing units
- Memory unit
  - 4 address generators for strided/indexed accesses
- Main memory system
  - 8 2-MByte DRAM macros
  - 25ns random access time, 7.5ns page access time
  - Crossbar interconnect
    - 12.8 GBytes/s peak bandwidth per direction (load/store)
- Off-chip interface
  - 2-channel DMA engine and 64b SysAD bus
5. Floorplan
- Technology: IBM SA-27E
  - 0.18µm CMOS, 6 metal layers
- 290 mm² die area
  - 225 mm² for memory/logic
- Transistor count: 150M
- Power supply
  - 1.2V for logic, 1.8V for DRAM
- Typical power consumption: 2.0 W
  - 0.5 W (scalar), 1.0 W (vector), 0.2 W (DRAM), 0.3 W (misc)
- Peak vector performance
  - 1.6/3.2/6.4 Gops w/o multiply-add (64b/32b/16b operations)
  - 3.2/6.4/12.8 Gops w/ madd
  - 1.6 Gflops (single precision)
- Tape-out planned for Spring '01
6. Scalable Design
[Floorplans: from 1 lane, 2 MB (0.8 Gops, 32-bit) up to 4 lanes, 8 MB (3.2 Gops, 32-bit)]
- Scaling the number of lanes trades off performance, energy, and area
- Number of DRAM banks may scale independently
  - e.g., 16 banks rather than 8
7. Vector Architectural State
- Number of VPs (virtual processors) given by the vector length register vl
- Width of each VP given by the register vpw
  - vpw is one of 8b, 16b, 32b, 64b
- Maximum vector length is given by a read-only register mvl
  - mvl depends on the implementation and on vpw: 128, 128, 64, 32 in VIRAM-1 for 8b, 16b, 32b, 64b (see the strip-mining sketch below)
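To make the vl/mvl interaction concrete, here is a minimal strip-mining sketch in C (my illustration, not compiler output; the function name and the scalar modeling of the vector add are assumptions). It assumes vpw = 32b, for which VIRAM-1 has mvl = 64:

    #include <stddef.h>

    enum { MVL = 64 };  /* assumed mvl for 32b virtual processors (VIRAM-1) */

    void vadd_f32(const float *a, const float *b, float *c, size_t n)
    {
        for (size_t i = 0; i < n; ) {
            /* set vl for this strip: at most mvl elements */
            size_t vl = (n - i < MVL) ? (n - i) : MVL;
            /* one vector add over vl virtual processors,
               modeled here element by element */
            for (size_t j = 0; j < vl; j++)
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }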
8. VIRAM Compiler
[Structure: C, C++, and Fortran95 frontends feed Cray's PDGCS optimizer, which drives code generators for the T3D/T3E, C90/T90/SV1, and SV2/VIRAM targets.]
- Based on Cray's production compiler
- Challenges
  - narrow data types and scalar/vector memory consistency
- Advantages relative to media extensions
  - powerful addressing modes and an ISA independent of datapath width
9. Compiler Challenges
- Can compiled code effectively use the VIRAM design?
  - Is on-chip DRAM bandwidth sufficient?
  - How well do multimedia applications vectorize?
  - Generating code for variable-width data
10. Matrix-Vector Multiplication
- Vector-matrix multiply (mvm with column layout)
  - saxpy: 2 vloads, 1 vstore (all unit stride; see the C sketch below)
- Matrix-vector multiply
  - dot: 2 vloads (both unit stride; a reduction)
  - saxpy: 2 vloads, 1 vstore (2 strided, 1 unit)
- Sparse matrix-vector multiply
  - dot: 3 vloads (1 indexed, 2 unit; a reduction)
  - saxpy: 3 vloads, 1 vstore (2 indexed, 2 unit)
    - needs column layout
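To make the all-unit-stride case concrete, here is a minimal C version of column-layout mvm in saxpy form (the signature and names are mine, though they mirror the mvm.c excerpt on slide 39):

    /* y += A*x with A stored by columns; lda is the column stride.
       The inner j loop is a saxpy: 2 unit-stride vector loads
       (y and a column of A) and 1 unit-stride vector store (y). */
    void mvm_col(const float *A, const float *x, float *y, int n, int lda)
    {
        for (int i = 0; i < n; i++) {   /* walk the columns of A */
            float xi = x[i];
            for (int j = 0; j < n; j++)
                y[j] += A[i * lda + j] * xi;
        }
    }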
11. Matrix-Vector Multiplication
- Performance of various source optimizations
[Chart: column-layout performance vs. peak]
12. Comparison of MVM Performance
- Double-precision floating point
  - compiled for VIRAM (note: the chip itself only does single precision)
  - hand- or Atlas-optimized for the other machines
- 100x100 matrix
- As matrix size increases, performance
  - drops on cache-based designs
  - increases on vector designs
[Chart: MFLOPS by machine]
13. Sparse MVM Performance
- Performance is matrix-dependent (lp matrix shown)
  - compiled for VIRAM using the independent pragma
  - sparse column layout
- Sparsity-optimized for the other machines
  - sparse row (or blocked row) layout
[Chart: MFLOPS by machine]
14. Generating Code for Variable VPW
- Strategy: the vectorizer determines the minimum correct vpw for each loop nest
  - The vectorizer assumes vpw = 64 initially
  - At the end of vectorization, if the greatest width encountered is less than 64, discard the vectorized copy of the loop and restart vectorization with the new vpw
  - Code gen checks vpw for each loop nest
- Limitation: a single loop nest runs at the speed of its widest type (see the example below)
  - Reason: simplicity, and performance of the common case
  - No attempt to split/combine loops based on vpw
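A hypothetical pair of loops (names and operations are mine) illustrating how the widest type in a nest sets vpw under this strategy:

    #include <stdint.h>

    /* Only 16b data: the vectorizer can settle on vpw = 16,
       sixteen 16b elements per 256b datapath. */
    void offset16(const int16_t *in, int16_t *out, int n, int16_t bias)
    {
        for (int i = 0; i < n; i++)
            out[i] = (int16_t)(in[i] + bias);
    }

    /* One 64b operand forces vpw = 64 for the whole nest, so even
       the 16b loads run at a quarter of the element rate. */
    void accum64(const int16_t *in, int64_t *acc, int n)
    {
        for (int i = 0; i < n; i++)
            acc[i] += in[i];
    }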
15. Media Benchmarks
- Mostly from U. of Toronto's benchmark suite
- 8-bit data, 16-bit operations
  - Colorspace: strided loads/stores
  - Composition: unit stride
  - Convolve: strided
- Mixed 16- and 32-bit integer
  - Detect
  - Decrypt
- 32-bit floating point
  - FIR filter
  - SAXPY 64: 64 elements
  - SAXPY 1K: 1024 elements
  - matmul: matrix multiplication
16. Integer Benchmarks
- Strided access is important, e.g., for RGB data
  - narrow types are limited by address generation
- Outer-loop vectorization and unrolling are used (see the sketch below)
  - helps avoid short vectors
  - spilling can be a problem
- Tiling could probably help
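A minimal sketch of the outer-loop trick, using a hypothetical 3-tap FIR (the kernel, names, and shift are assumptions): the tap loop is far too short to fill a vector register, so vectorizing across pixels keeps vectors long:

    #include <stdint.h>

    void fir3(const int16_t *in, int16_t *out, int n, const int16_t k[3])
    {
        /* Vectorizing the t loop would give vl = 3; vectorizing the
           i loop (with the taps unrolled) gives vl up to mvl. */
        for (int i = 0; i < n - 2; i++) {
            int32_t acc = 0;
            for (int t = 0; t < 3; t++)   /* unrolled by the compiler */
                acc += in[i + t] * k[t];
            out[i] = (int16_t)(acc >> 8);
        }
    }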
17. Floating-Point DSP Benchmarks
- Performance is competitive with hand coding
- Vector length is important (e.g., saxpy)
  - but multiple vectors are fine (e.g., matmul)
18. Conclusions
- The VIRAM ISA shows high performance on compiled code
  - competitive with modern processors
  - limitations are address generation for strided and indexed memory operations
- The compiler effectively uses variable-width data
  - allows media applications to vectorize
  - performance scales with inverse data width
- Future compiler work
  - Tiling
  - Fixed-point support
  - Better register allocation
19. Backup Slides
20. Performance Summary
- Performance of compiled code is generally good
  - matmul and saxpy meet or beat hand-coded versions
  - the 3 addressing modes are very useful
- Limitations to performance
  - Dependencies or inadequate compiler analysis
  - Inadequate memory bandwidth
  - Lack of address generators
  - Short vectors
- Future compiler work
  - Tiling
  - Fixed-point support
  - Better register allocation
21. Scaling Media Benchmarks
22. Compiled Matrix-Vector Multiplication (2 flops/element)
- An easy compilation problem that stresses memory bandwidth
  - Compare to 304 Mflops (64-bit) for the Power3 (hand-coded)
- Performance scales with the number of lanes up to 4
  - 8 lanes would need more memory banks than the default DRAM macro provides
23. Outline
- Why vectors for IRAM?
- Including media types
- The virtual lane model
- Virtual processor width
- Limitations to performance
- Dependencies or inadequate compiler analysis
- Inadequate memory bandwidth
- Lack of address generators
- Short vectors
- Comparisons to other architectures
- Conclusions
24. Matrix-Vector Multiply
- Scaling Matrix-Vector Multiplication
25. Performance on Media Benchmarks
- Using compiled code with 1, 2, 4, and 8 lanes
26. Compiled Matrix-Vector Multiplication (2 flops/element)
- An easy compilation problem that stresses memory bandwidth
  - Compare to 304 Mflops (64-bit) for the Power3 (hand-coded)
- Performance scales with the number of lanes up to 4
  - 8 lanes would need more memory banks than the default DRAM macro provides
[Chart: MFLOPS vs. number of lanes]
27. Compiling Media Kernels on IRAM
- The compiler generates code for narrow data widths, e.g., 16-bit integer
- The compilation model is simple and more scalable (across generations) than MMX, VIS, etc.
  - Strided and indexed loads/stores are simpler than pack/unpack
  - Maximum vector length is longer than the datapath width (256 bits): all lane scalings run from a single executable
28. Vector vs. SIMD Example
- Simple image processing example: conversion from RGB to YUV (scalar C reference below)
  - Y = ( 9798R + 19235G +  3736B) / 32768
  - U = (-4784R -  9437G + 14221B) / 32768 + 128
  - V = (20218R - 16941G -  3277B) / 32768 + 128
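For reference, a scalar C rendering of exactly these formulas (my sketch; pointer names are assumptions, and clamping is omitted), which the VIRAM and MMX listings below apply across many pixels at once:

    #include <stdint.h>

    void rgb_to_yuv(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                    uint8_t *y, uint8_t *u, uint8_t *v, int n)
    {
        for (int i = 0; i < n; i++) {
            int R = r[i], G = g[i], B = b[i];
            y[i] = (uint8_t)(( 9798*R + 19235*G +  3736*B) / 32768);
            u[i] = (uint8_t)((-4784*R -  9437*G + 14221*B) / 32768 + 128);
            v[i] = (uint8_t)((20218*R - 16941*G -  3277*B) / 32768 + 128);
        }
    }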
29. VIRAM Code (22 instructions)
RGBtoYUV:
- vlds.u.b    r_v, r_addr, stride3, addr_inc    # load R
- vlds.u.b    g_v, g_addr, stride3, addr_inc    # load G
- vlds.u.b    b_v, b_addr, stride3, addr_inc    # load B
- xlmul.u.sv  o1_v, t0_s, r_v                   # calculate Y
- xlmadd.u.sv o1_v, t1_s, g_v
- xlmadd.u.sv o1_v, t2_s, b_v
- vsra.vs     o1_v, o1_v, s_s
- xlmul.u.sv  o2_v, t3_s, r_v                   # calculate U
- xlmadd.u.sv o2_v, t4_s, g_v
- xlmadd.u.sv o2_v, t5_s, b_v
- vsra.vs     o2_v, o2_v, s_s
- vadd.sv     o2_v, a_s, o2_v
- xlmul.u.sv  o3_v, t6_s, r_v                   # calculate V
- xlmadd.u.sv o3_v, t7_s, g_v
- xlmadd.u.sv o3_v, t8_s, b_v
- vsra.vs     o3_v, o3_v, s_s
- vadd.sv     o3_v, a_s, o3_v
- vsts.b      o1_v, y_addr, stride3, addr_inc   # store Y
30. MMX Code (part 1)
RGBtoYUV:
- movq mm1, eax
- pxor mm6, mm6
- movq mm0, mm1
- psrlq mm1, 16
- punpcklbw mm0, ZEROS
- movq mm7, mm1
- punpcklbw mm1, ZEROS
- movq mm2, mm0
- pmaddwd mm0, YR0GR
- movq mm3, mm1
- pmaddwd mm1, YBG0B
- movq mm4, mm2
- pmaddwd mm2, UR0GR
- movq mm5, mm3
- pmaddwd mm3, UBG0B
- punpckhbw mm7, mm6
- pmaddwd mm4, VR0GR
- paddd mm0, mm1
- paddd mm4, mm5
- movq mm5, mm1
- psllq mm1, 32
- paddd mm1, mm7
- punpckhbw mm6, ZEROS
- movq mm3, mm1
- pmaddwd mm1, YR0GR
- movq mm7, mm5
- pmaddwd mm5, YBG0B
- psrad mm0, 15
- movq TEMP0, mm6
- movq mm6, mm3
- pmaddwd mm6, UR0GR
- psrad mm2, 15
- paddd mm1, mm5
- movq mm5, mm7
- pmaddwd mm7, UBG0B
- psrad mm1, 15
- pmaddwd mm3, VR0GR
31. MMX Code (part 2)
- paddd mm6, mm7
- movq mm7, mm1
- psrad mm6, 15
- paddd mm3, mm5
- psllq mm7, 16
- movq mm5, mm7
- psrad mm3, 15
- movq TEMPY, mm0
- packssdw mm2, mm6
- movq mm0, TEMP0
- punpcklbw mm7, ZEROS
- movq mm6, mm0
- movq TEMPU, mm2
- psrlq mm0, 32
- paddw mm7, mm0
- movq mm2, mm6
- pmaddwd mm2, YR0GR
- movq mm0, mm7
- pmaddwd mm7, YBG0B
- movq mm4, mm6
- pmaddwd mm6, UR0GR
- movq mm3, mm0
- pmaddwd mm0, UBG0B
- paddd mm2, mm7
- pmaddwd mm4,
- pxor mm7, mm7
- pmaddwd mm3, VBG0B
- punpckhbw mm1,
- paddd mm0, mm6
- movq mm6, mm1
- pmaddwd mm6, YBG0B
- punpckhbw mm5,
- movq mm7, mm5
- paddd mm3, mm4
- pmaddwd mm5, YR0GR
- movq mm4, mm1
- pmaddwd mm4, UBG0B
- psrad mm0, 15
32. MMX Code (part 3; 121 instructions total)
- pmaddwd mm7, UR0GR
- psrad mm3, 15
- pmaddwd mm1, VBG0B
- psrad mm6, 15
- paddd mm4, OFFSETD
- packssdw mm2, mm6
- pmaddwd mm5, VR0GR
- paddd mm7, mm4
- psrad mm7, 15
- movq mm6, TEMPY
- packssdw mm0, mm7
- movq mm4, TEMPU
- packuswb mm6, mm2
- movq mm7, OFFSETB
- paddd mm1, mm5
- paddw mm4, mm7
- psrad mm1, 15
- movq ebx, mm6
- packuswb mm4,
- movq ecx, mm4
- packuswb mm5, mm3
- add ebx, 8
- add ecx, 8
- movq edx, mm5
- dec edi
- jnz RGBtoYUV
33. IRAM Status
- Chip
  - ISA has not changed significantly in over a year
  - Verilog complete, except SRAM for the scalar cache
  - Testing framework in place
- Compiler
  - Backend code generation complete
  - Continued performance improvements, especially for narrow data widths
- Application benchmarks
  - Hand-coded kernels better than MMX, VIS, and general-purpose DSPs
    - DCT, FFT, MVM, convolution, image composition, ...
  - Compiled kernels demonstrate ISA advantages
    - MVM, sparse MVM, decrypt, image composition, ...
  - Full applications: H.263 encoding (done), speech (underway)
34. Backup from Dave Judd's Talk
35. VIRAM Tools
- vas: assembler
- vdis: disassembler
- vsim-isa: simulator
- vsim-db: debugger
- vsim-p: performance simulator
- vsim-sync: memory consistency simulator
36. Compiler Testing
- C regression test suite (commercial test suite)
  - Scalar emphasis, C conformance
  - All tests pass except small numerical differences due to lack of 128-bit f.p. support
- C++ test suite
  - 1167 of 1183 tests execute correctly
  - 12 failures in compilation: undefined variables
  - 4 failures in execution: bad answers
37. Compiler Testing
- Vector regression test suites (Cray)
  - Specifically tests for vectorization
  - Compares vector and scalar results
  - Easy to isolate problems
- vector suite status
  - 59 of 62 tests pass
  - Some minor numerical differences
  - 1 bad answer, 2 integer overflows
- vector4 suite status
  - 163 of 165 tests execute correctly
  - 1 bad answer, 1 illegal use of a vector instruction
38. Kernel Performance: mvm (matrix-vector multiplication)
64x64, 32-bit floating point

  Hand-optimized assembly code                                     579 MFLOPS
  vcc w/ restrict keywords added                                   352 MFLOPS
  + 1-element padding to avoid bank conflicts                      401 MFLOPS
  + shortloop directive; loops interchanged, outer loop
    vectorized by vcc                                              592 MFLOPS
39. Mods to mvm Code

/* Original code (mvm.c) */
void mvm (float *A, float *X, float *Y, int n, int acol)
{
    int i, j;
    if (n < 64)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                Y[j] += A[j + acol*i] * X[i];
}

/* Modified code */
void mvm (float * restrict A, float * restrict X,
          float * restrict Y, int n, int acol)
{
    int i, j;
    float x_elem;
    if (n < 64)
        for (i = 0; i < n; i++) {
            x_elem = X[i];
#pragma shortloop
            for (j = 0; j < n; j++)
                Y[j] += A[j + acol*i] * x_elem;
        }
}
40. Kernel Performance: mm_mul (matrix-matrix multiplication)
64x64x64, 32-bit float, 1.6 Gflop theoretical peak

  Hand-coded assembly (mm-mul-small.s)                             1.58 Gflops
  vcc w/ restrict and shortloop keywords                           0.852 Gflops
  + inner two loops in a separate function, allowing
    outer-loop vectorization                                       1.51 Gflops
41. Kernel Performance: saxpy
32-bit floating-point ops (MFLOPS)

  N                           64    256   1024   4096
  Hand-coded assembly         379   593   691    720
  vcc w/ restrict keywords    385   596   692    721
42. Kernel Performance: motion_estimate
32-bit integer ops; finds the minimum sum of absolute differences between a reference block and a region in an image (see the sketch below).

  Hand-optimized assembly                  1.181 Gops
  vcc w/ restrict keywords                 170 Mops
  + shortloop directives                   253 Mops
  + outer-loop unroll directive            257 Mops (no improvement, because of spilling)
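A minimal C sketch of the kernel being measured, assuming a 16x16 block (the block size, names, and strides are my assumptions, not the benchmark source):

    #include <stdlib.h>

    /* Sum of absolute differences for one candidate position; motion
       estimation minimizes this over all positions in the region. */
    int sad_16x16(const unsigned char *ref, int ref_stride,
                  const unsigned char *img, int img_stride)
    {
        int sad = 0;
        for (int i = 0; i < 16; i++)        /* block rows */
            for (int j = 0; j < 16; j++)    /* vectorizable inner loop */
                sad += abs(ref[i*ref_stride + j] - img[i*img_stride + j]);
        return sad;
    }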
43. Dongarra Loops
- 100 loops to test compiler vectorization capability
  - Rewritten in C by Cray (?)
- vcc vectorizes 74 loops
- vcc partially vectorizes 3 loops
- vcc conditionally vectorizes 3 loops
- 1 loop not vectorized because vector sin/cos are not currently available on VIRAM
- 19 other loops not vectorized
- Data provided by Sam Williams
44. Features Remaining
- Support version 3 and version 4 ISAs
  - ISA changes required by the MIPS Inc. scalar core
  - The performance simulator only supports the old ISA
- Finish sync support
  - take advantage of the Cray implementation
- VIRAM machine target
  - allows easier maintenance of frontend and optimizer mods for VIRAM
- User documentation
  - Summary of differences w/ the Cray compiler
  - Useful options, hints for vector code
45. Performance Features Remaining
- Additional tuning of the instruction scheduler
- Support the new SV2 inliner for C/C++
- Shortloop enhancements
- Reduce spilling
  - Scheduler awareness of register pressure
  - Ordering of blocks for register assignment within priority groups
  - Special vector registers carried across calls
- Loop unrolling for vector loops
- Tune for key benchmarks
46. Other Future Compiler Features?
- Support for speculative execution
- Compiler extensions for fixed-point hardware
- Support for vector functions (vector mlib)
47. Summary
- vcc is a reasonably robust compiler for VIRAM
- Performance on kernels is good with appropriate directives; optimum vectorization takes some effort
- Need to prioritize the remaining work
48. Codegen/Optimizer Issues for VIRAM
- Variable virtual processor width (VPW)
- Variable maximum vector register length (MVL)
- Vector flag registers treated as 1-bit-wide vector registers
- Multiple base, increment, and stride registers; auto-increment
- Fixed-point arithmetic (saturating add, etc.; see the sketch below)
- Memory consistency
- New vector instructions not available on the SV2
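On the fixed-point item, a minimal sketch of saturating-add semantics in portable C (my illustration): without ISA support, the compiler must emit a widen/compare/clamp sequence like this for every element, which is what makes hardware saturating arithmetic attractive:

    #include <stdint.h>

    /* 16-bit saturating add: clamp to the int16_t range instead of
       wrapping around on overflow. */
    static inline int16_t sat_add16(int16_t a, int16_t b)
    {
        int32_t s = (int32_t)a + (int32_t)b;  /* widen: cannot overflow */
        if (s > INT16_MAX) return INT16_MAX;  /* clamp high */
        if (s < INT16_MIN) return INT16_MIN;  /* clamp low */
        return (int16_t)s;
    }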
49. Generating Code for Variable MVL
- Maximum vector length is not specified in the IRAM ISA
- However, the compiler assumes an mvl at compile time
  - mvl is based on vpw
  - the mvl assumption depends on the VIRAM-1 hardware implementation
  - recompiling is required for future hardware versions if mvl changes
- MVL knowledge is useful for code gen and the vectorizer
  - register spilling
  - short-loop vectorization
  - length-dependent vectorization (and may eliminate safe vector length computation at run time), e.g.:
      for (i = 0; i < n; i++)
          a[i] = a[i+32];
    if mvl <= 32 for the loop's vpw (e.g., vpw = 64 gives mvl = 32 in VIRAM-1), no strip spans elements 32 apart, so the loop vectorizes without a run-time safe-vector-length check
50. Memory Consistency
- Cases to enforce: scalar-after-vector (SaV), vector-after-scalar (VaS), vector-after-vector (VaV), and per-VP ordering, each for RaW, WaR, and WaW hazards