Title: Exploiting On-chip Memory Bandwidth in the VIRAM Compiler
1. Exploiting On-chip Memory Bandwidth in the VIRAM Compiler
Dave Judd, Katherine Yelick, Christoforos Kozyrakis, David Martin, and David Patterson
http://iram.cs.berkeley.edu/
2. IRAM Overview
- A processor architecture for embedded/portable systems running media applications
- MIPS scalar core with vector co-processor
- Embedded DRAM
[Block diagram: MIPS64 5Kc core (8KB instruction cache, FPU, TLB, SysAD and JTAG interfaces) coupled through a co-processor interface to the vector unit: an 8KB vector register file, a 512B flag register file, two arithmetic units (Arith 0/1), two flag units (Flag 0/1), and a memory unit, on 256b datapaths; a memory crossbar plus DMA connects to eight on-chip 2MB DRAM banks (DRAM0-DRAM7).]
3. Why Vectors?
- Utilizes the on-chip bandwidth of IRAM
  - parallelism within instructions
- Efficient architecture for vectorizable code
  - avoids the area, power, and design cost of reorder logic
  - low instruction decode overhead
- Multimedia algorithms are vectorizable
  - e.g., vectorize across pixels in an image
- Scales easily across chip generations
  - e.g., 32-way parallelism in an instruction can be implemented by 1-, 2-, 4-, or 8-way hardware
- Leverages well-known compiler technology
4. Architecture Details
- MIPS64 5Kc core (200 MHz)
  - Single-issue scalar core with 8 KByte I and D caches
- Vector unit (200 MHz)
  - 8 KByte register file (32 64b elements per register)
  - 256b datapaths, can be subdivided into 16b, 32b, 64b
  - 2 arithmetic units (1 FP, single precision), 2 flag processing units
- Memory unit
  - 4 address generators for strided/indexed accesses
- Main memory system
  - 8 2-MByte DRAM macros
  - 25ns random access time, 7.5ns page access time
  - Crossbar interconnect
    - 12.8 GBytes/s peak bandwidth per direction (load/store)
- Off-chip interface
  - 2-channel DMA engine and 64b SysAD bus
5. Floorplan
- Technology: IBM SA-27E
  - 0.18µm CMOS, 6 metal layers
- 290 mm² die area
  - 225 mm² for memory/logic
- Transistor count: 150M
- Power supply
  - 1.2V for logic, 1.8V for DRAM
- Typical power consumption: 2.0 W
  - 0.5 W (scalar), 1.0 W (vector), 0.2 W (DRAM), 0.3 W (misc)
- Peak vector performance
  - 1.6/3.2/6.4 Gops w/o multiply-add (64b/32b/16b operations)
  - 3.2/6.4/12.8 Gops w/ madd
  - 1.6 Gflops (single precision)
- Tape-out planned for Spring '01
6. Scalable Design
[Floorplans: from 1 lane, 2 MB (0.8 Gops, 32-bit) up to 4 lanes, 8 MB (3.2 Gops, 32-bit)]
- Scaling the number of lanes trades off performance, energy, and area
- Number of DRAM banks may scale independently
  - e.g., 16 banks rather than 8
7. Vector Architectural State
- Number of VPs (virtual processors) given by the vector length register vl
- Width of each VP given by the register vpw
  - vpw is one of 8b, 16b, 32b, 64b
- Maximum vector length is given by a read-only register mvl
  - mvl depends on the implementation and on vpw: 128, 128, 64, 32 in VIRAM-1 for 8b, 16b, 32b, 64b (see the strip-mining sketch below)
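To make the vl/mvl interaction concrete, here is a minimal strip-mining sketch in C (my illustration, not compiler output; the function name and the scalar modeling of the vector add are assumptions). It assumes vpw = 32b, for which VIRAM-1 has mvl = 64:

    #include <stddef.h>

    enum { MVL = 64 };  /* assumed mvl for 32b virtual processors (VIRAM-1) */

    void vadd_f32(const float *a, const float *b, float *c, size_t n)
    {
        for (size_t i = 0; i < n; ) {
            /* set vl for this strip: at most mvl elements */
            size_t vl = (n - i < MVL) ? (n - i) : MVL;
            /* one vector add over vl virtual processors,
               modeled here element by element */
            for (size_t j = 0; j < vl; j++)
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }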
8. VIRAM Compiler
[Structure: C, C++, and Fortran95 frontends feed Cray's PDGCS optimizer, which drives code generators for the T3D/T3E, C90/T90/SV1, and SV2/VIRAM targets.]
- Based on Cray's production compiler
- Challenges
  - narrow data types and scalar/vector memory consistency
- Advantages relative to media extensions
  - powerful addressing modes and an ISA independent of datapath width
9. Compiler Challenges
- Can compiled code effectively use the VIRAM design?
  - Is on-chip DRAM bandwidth sufficient?
  - How well do multimedia applications vectorize?
  - Generating code for variable-width data
10. Matrix-Vector Multiplication
- Vector-matrix multiply (mvm with column layout)
  - saxpy: 2 vloads, 1 vstore (all unit stride; see the C sketch below)
- Matrix-vector multiply
  - dot: 2 vloads (both unit stride; a reduction)
  - saxpy: 2 vloads, 1 vstore (2 strided, 1 unit)
- Sparse matrix-vector multiply
  - dot: 3 vloads (1 indexed, 2 unit; a reduction)
  - saxpy: 3 vloads, 1 vstore (2 indexed, 2 unit)
    - needs column layout
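To make the all-unit-stride case concrete, here is a minimal C version of column-layout mvm in saxpy form (the signature and names are mine, though they mirror the mvm.c excerpt on slide 39):

    /* y += A*x with A stored by columns; lda is the column stride.
       The inner j loop is a saxpy: 2 unit-stride vector loads
       (y and a column of A) and 1 unit-stride vector store (y). */
    void mvm_col(const float *A, const float *x, float *y, int n, int lda)
    {
        for (int i = 0; i < n; i++) {   /* walk the columns of A */
            float xi = x[i];
            for (int j = 0; j < n; j++)
                y[j] += A[i * lda + j] * xi;
        }
    }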
11. Matrix-Vector Multiplication
- Performance of various source optimizations
[Chart: column-layout performance vs. peak]
12. Comparison of MVM Performance
- Double-precision floating point
  - compiled for VIRAM (note: the chip itself only does single precision)
  - hand- or Atlas-optimized for the other machines
- 100x100 matrix
- As matrix size increases, performance
  - drops on cache-based designs
  - increases on vector designs
[Chart: MFLOPS by machine]
13. Sparse MVM Performance
- Performance is matrix-dependent (lp matrix shown)
  - compiled for VIRAM using the independent pragma
  - sparse column layout
- Sparsity-optimized for the other machines
  - sparse row (or blocked row) layout
[Chart: MFLOPS by machine]
14. Generating Code for Variable VPW
- Strategy: the vectorizer determines the minimum correct vpw for each loop nest
  - The vectorizer assumes vpw = 64 initially
  - At the end of vectorization, if the greatest width encountered is less than 64, discard the vectorized copy of the loop and restart vectorization with the new vpw
  - Code gen checks vpw for each loop nest
- Limitation: a single loop nest runs at the speed of its widest type (see the example below)
  - Reason: simplicity, and performance of the common case
  - No attempt to split/combine loops based on vpw
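A hypothetical pair of loops (names and operations are mine) illustrating how the widest type in a nest sets vpw under this strategy:

    #include <stdint.h>

    /* Only 16b data: the vectorizer can settle on vpw = 16,
       sixteen 16b elements per 256b datapath. */
    void offset16(const int16_t *in, int16_t *out, int n, int16_t bias)
    {
        for (int i = 0; i < n; i++)
            out[i] = (int16_t)(in[i] + bias);
    }

    /* One 64b operand forces vpw = 64 for the whole nest, so even
       the 16b loads run at a quarter of the element rate. */
    void accum64(const int16_t *in, int64_t *acc, int n)
    {
        for (int i = 0; i < n; i++)
            acc[i] += in[i];
    }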
15. Media Benchmarks
- Mostly from U. of Toronto's benchmark suite
- 8-bit data, 16-bit operations
  - Colorspace: strided loads/stores
  - Composition: unit stride
  - Convolve: strided
- Mixed 16- and 32-bit integer
  - Detect
  - Decrypt
- 32-bit floating point
  - FIR filter
  - SAXPY 64: 64 elements
  - SAXPY 1K: 1024 elements
  - matmul: matrix multiplication
16. Integer Benchmarks
- Strided access is important, e.g., for RGB data
  - narrow types are limited by address generation
- Outer-loop vectorization and unrolling are used (see the sketch below)
  - helps avoid short vectors
  - spilling can be a problem
- Tiling could probably help
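A minimal sketch of the outer-loop trick, using a hypothetical 3-tap FIR (the kernel, names, and shift are assumptions): the tap loop is far too short to fill a vector register, so vectorizing across pixels keeps vectors long:

    #include <stdint.h>

    void fir3(const int16_t *in, int16_t *out, int n, const int16_t k[3])
    {
        /* Vectorizing the t loop would give vl = 3; vectorizing the
           i loop (with the taps unrolled) gives vl up to mvl. */
        for (int i = 0; i < n - 2; i++) {
            int32_t acc = 0;
            for (int t = 0; t < 3; t++)   /* unrolled by the compiler */
                acc += in[i + t] * k[t];
            out[i] = (int16_t)(acc >> 8);
        }
    }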
17. Floating-Point DSP Benchmarks
- Performance is competitive with hand coding
- Vector length is important (e.g., saxpy)
  - but multiple vectors are fine (e.g., matmul)
18. Conclusions
- The VIRAM ISA shows high performance on compiled code
  - competitive with modern processors
  - limitations are address generation for strided and indexed memory operations
- The compiler effectively uses variable-width data
  - allows media applications to vectorize
  - performance scales with inverse data width
- Future compiler work
  - Tiling
  - Fixed-point support
  - Better register allocation
19. Backup Slides
20. Performance Summary
- Performance of compiled code is generally good
  - matmul and saxpy meet or beat hand-coded versions
  - the 3 addressing modes are very useful
- Limitations to performance
  - Dependencies or inadequate compiler analysis
  - Inadequate memory bandwidth
  - Lack of address generators
  - Short vectors
- Future compiler work
  - Tiling
  - Fixed-point support
  - Better register allocation
21. Scaling Media Benchmarks
22. Compiled Matrix-Vector Multiplication (2 flops/element)
- An easy compilation problem that stresses memory bandwidth
  - Compare to 304 Mflops (64-bit) for the Power3 (hand-coded)
- Performance scales with the number of lanes up to 4
  - 8 lanes would need more memory banks than the default DRAM macro provides
23. Outline
- Why vectors for IRAM?
- Including media types
- The virtual lane model
- Virtual processor width
- Limitations to performance
- Dependencies or inadequate compiler analysis
- Inadequate memory bandwidth
- Lack of address generators
- Short vectors
- Comparisons to other architectures
- Conclusions
24. Matrix-Vector Multiply
- Scaling Matrix-Vector Multiplication
25. Performance on Media Benchmarks
- Using compiled code with 1, 2, 4, and 8 lanes
26. Compiled Matrix-Vector Multiplication (2 flops/element)
- An easy compilation problem that stresses memory bandwidth
  - Compare to 304 Mflops (64-bit) for the Power3 (hand-coded)
- Performance scales with the number of lanes up to 4
  - 8 lanes would need more memory banks than the default DRAM macro provides
[Chart: MFLOPS vs. number of lanes]
27. Compiling Media Kernels on IRAM
- The compiler generates code for narrow data widths, e.g., 16-bit integer
- The compilation model is simple and more scalable (across generations) than MMX, VIS, etc.
  - Strided and indexed loads/stores are simpler than pack/unpack
  - Maximum vector length is longer than the datapath width (256 bits): all lane scalings run from a single executable
28. Vector vs. SIMD Example
- Simple image processing example: conversion from RGB to YUV (scalar C reference below)
  - Y = ( 9798R + 19235G +  3736B) / 32768
  - U = (-4784R -  9437G + 14221B) / 32768 + 128
  - V = (20218R - 16941G -  3277B) / 32768 + 128
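For reference, a scalar C rendering of exactly these formulas (my sketch; pointer names are assumptions, and clamping is omitted), which the VIRAM and MMX listings below apply across many pixels at once:

    #include <stdint.h>

    void rgb_to_yuv(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                    uint8_t *y, uint8_t *u, uint8_t *v, int n)
    {
        for (int i = 0; i < n; i++) {
            int R = r[i], G = g[i], B = b[i];
            y[i] = (uint8_t)(( 9798*R + 19235*G +  3736*B) / 32768);
            u[i] = (uint8_t)((-4784*R -  9437*G + 14221*B) / 32768 + 128);
            v[i] = (uint8_t)((20218*R - 16941*G -  3277*B) / 32768 + 128);
        }
    }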
29. VIRAM Code (22 instructions)
RGBtoYUV:
- vlds.u.b    r_v, r_addr, stride3, addr_inc    # load R
- vlds.u.b    g_v, g_addr, stride3, addr_inc    # load G
- vlds.u.b    b_v, b_addr, stride3, addr_inc    # load B
- xlmul.u.sv  o1_v, t0_s, r_v                   # calculate Y
- xlmadd.u.sv o1_v, t1_s, g_v
- xlmadd.u.sv o1_v, t2_s, b_v
- vsra.vs     o1_v, o1_v, s_s
- xlmul.u.sv  o2_v, t3_s, r_v                   # calculate U
- xlmadd.u.sv o2_v, t4_s, g_v
- xlmadd.u.sv o2_v, t5_s, b_v
- vsra.vs     o2_v, o2_v, s_s
- vadd.sv     o2_v, a_s, o2_v
- xlmul.u.sv  o3_v, t6_s, r_v                   # calculate V
- xlmadd.u.sv o3_v, t7_s, g_v
- xlmadd.u.sv o3_v, t8_s, b_v
- vsra.vs     o3_v, o3_v, s_s
- vadd.sv     o3_v, a_s, o3_v
- vsts.b      o1_v, y_addr, stride3, addr_inc   # store Y
30. MMX Code (part 1)
RGBtoYUV:
- movq mm1, eax
- pxor mm6, mm6
- movq mm0, mm1
- psrlq mm1, 16
- punpcklbw mm0, ZEROS
- movq mm7, mm1
- punpcklbw mm1, ZEROS
- movq mm2, mm0
- pmaddwd mm0, YR0GR
- movq mm3, mm1
- pmaddwd mm1, YBG0B
- movq mm4, mm2
- pmaddwd mm2, UR0GR
- movq mm5, mm3
- pmaddwd mm3, UBG0B
- punpckhbw mm7, mm6
- pmaddwd mm4, VR0GR
- paddd mm0, mm1
- paddd mm4, mm5
- movq mm5, mm1
- psllq mm1, 32
- paddd mm1, mm7
- punpckhbw mm6, ZEROS
- movq mm3, mm1
- pmaddwd mm1, YR0GR
- movq mm7, mm5
- pmaddwd mm5, YBG0B
- psrad mm0, 15
- movq TEMP0, mm6
- movq mm6, mm3
- pmaddwd mm6, UR0GR
- psrad mm2, 15
- paddd mm1, mm5
- movq mm5, mm7
- pmaddwd mm7, UBG0B
- psrad mm1, 15
- pmaddwd mm3, VR0GR
31. MMX Code (part 2)
- paddd mm6, mm7
- movq mm7, mm1
- psrad mm6, 15
- paddd mm3, mm5
- psllq mm7, 16
- movq mm5, mm7
- psrad mm3, 15
- movq TEMPY, mm0
- packssdw mm2, mm6
- movq mm0, TEMP0
- punpcklbw mm7, ZEROS
- movq mm6, mm0
- movq TEMPU, mm2
- psrlq mm0, 32
- paddw mm7, mm0
- movq mm2, mm6
- pmaddwd mm2, YR0GR
- movq mm0, mm7
- pmaddwd mm7, YBG0B
- movq mm4, mm6
- pmaddwd mm6, UR0GR
- movq mm3, mm0
- pmaddwd mm0, UBG0B
- paddd mm2, mm7
- pmaddwd mm4,
- pxor mm7, mm7
- pmaddwd mm3, VBG0B
- punpckhbw mm1,
- paddd mm0, mm6
- movq mm6, mm1
- pmaddwd mm6, YBG0B
- punpckhbw mm5,
- movq mm7, mm5
- paddd mm3, mm4
- pmaddwd mm5, YR0GR
- movq mm4, mm1
- pmaddwd mm4, UBG0B
- psrad mm0, 15
32. MMX Code (part 3; 121 instructions total)
- pmaddwd mm7, UR0GR
- psrad mm3, 15
- pmaddwd mm1, VBG0B
- psrad mm6, 15
- paddd mm4, OFFSETD
- packssdw mm2, mm6
- pmaddwd mm5, VR0GR
- paddd mm7, mm4
- psrad mm7, 15
- movq mm6, TEMPY
- packssdw mm0, mm7
- movq mm4, TEMPU
- packuswb mm6, mm2
- movq mm7, OFFSETB
- paddd mm1, mm5
- paddw mm4, mm7
- psrad mm1, 15
- movq ebx, mm6
- packuswb mm4,
- movq ecx, mm4
- packuswb mm5, mm3
- add ebx, 8
- add ecx, 8
- movq edx, mm5
- dec edi
- jnz RGBtoYUV
33. IRAM Status
- Chip
  - ISA has not changed significantly in over a year
  - Verilog complete, except SRAM for the scalar cache
  - Testing framework in place
- Compiler
  - Backend code generation complete
  - Continued performance improvements, especially for narrow data widths
- Application benchmarks
  - Hand-coded kernels better than MMX, VIS, and general-purpose DSPs
    - DCT, FFT, MVM, convolution, image composition, ...
  - Compiled kernels demonstrate ISA advantages
    - MVM, sparse MVM, decrypt, image composition, ...
  - Full applications: H.263 encoding (done), speech (underway)
34. Backup from Dave Judd's Talk
35. VIRAM Tools
- vas: assembler
- vdis: disassembler
- vsim-isa: simulator
- vsim-db: debugger
- vsim-p: performance simulator
- vsim-sync: memory consistency simulator
36. Compiler Testing
- C regression test suite (commercial test suite)
  - Scalar emphasis, C conformance
  - All tests pass except small numerical differences due to lack of 128-bit f.p. support
- C++ test suite
  - 1167 of 1183 tests execute correctly
  - 12 failures in compilation: undefined variables
  - 4 failures in execution: bad answers
37. Compiler Testing
- Vector regression test suites (Cray)
  - Specifically tests for vectorization
  - Compares vector and scalar results
  - Easy to isolate problems
- vector suite status
  - 59 of 62 tests pass
  - Some minor numerical differences
  - 1 bad answer, 2 integer overflows
- vector4 suite status
  - 163 of 165 tests execute correctly
  - 1 bad answer, 1 illegal use of a vector instruction
38. Kernel Performance: mvm (matrix-vector multiplication)
64x64, 32-bit floating point

  Hand-optimized assembly code                                     579 MFLOPS
  vcc w/ restrict keywords added                                   352 MFLOPS
  + 1-element padding to avoid bank conflicts                      401 MFLOPS
  + shortloop directive; loops interchanged, outer loop
    vectorized by vcc                                              592 MFLOPS
39. Mods to mvm Code

/* Original code (mvm.c) */
void mvm (float *A, float *X, float *Y, int n, int acol)
{
    int i, j;
    if (n < 64)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                Y[j] += A[j + acol*i] * X[i];
}

/* Modified code */
void mvm (float * restrict A, float * restrict X,
          float * restrict Y, int n, int acol)
{
    int i, j;
    float x_elem;
    if (n < 64)
        for (i = 0; i < n; i++) {
            x_elem = X[i];
#pragma shortloop
            for (j = 0; j < n; j++)
                Y[j] += A[j + acol*i] * x_elem;
        }
}
40. Kernel Performance: mm_mul (matrix-matrix multiplication)
64x64x64, 32-bit float, 1.6 Gflop theoretical peak

  Hand-coded assembly (mm-mul-small.s)                             1.58 Gflops
  vcc w/ restrict and shortloop keywords                           0.852 Gflops
  + inner two loops in a separate function, allowing
    outer-loop vectorization                                       1.51 Gflops
41. Kernel Performance: saxpy
32-bit floating-point ops (MFLOPS)

  N                           64    256   1024   4096
  Hand-coded assembly         379   593   691    720
  vcc w/ restrict keywords    385   596   692    721
42. Kernel Performance: motion_estimate
32-bit integer ops; finds the minimum sum of absolute differences between a reference block and a region in an image (see the sketch below).

  Hand-optimized assembly                  1.181 Gops
  vcc w/ restrict keywords                 170 Mops
  + shortloop directives                   253 Mops
  + outer-loop unroll directive            257 Mops (no improvement, because of spilling)
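A minimal C sketch of the kernel being measured, assuming a 16x16 block (the block size, names, and strides are my assumptions, not the benchmark source):

    #include <stdlib.h>

    /* Sum of absolute differences for one candidate position; motion
       estimation minimizes this over all positions in the region. */
    int sad_16x16(const unsigned char *ref, int ref_stride,
                  const unsigned char *img, int img_stride)
    {
        int sad = 0;
        for (int i = 0; i < 16; i++)        /* block rows */
            for (int j = 0; j < 16; j++)    /* vectorizable inner loop */
                sad += abs(ref[i*ref_stride + j] - img[i*img_stride + j]);
        return sad;
    }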
43. Dongarra Loops
- 100 loops to test compiler vectorization capability
  - Rewritten in C by Cray (?)
- vcc vectorizes 74 loops
- vcc partially vectorizes 3 loops
- vcc conditionally vectorizes 3 loops
- 1 loop not vectorized because vector sin/cos are not currently available on VIRAM
- 19 other loops not vectorized
- Data provided by Sam Williams
44. Features Remaining
- Support version 3 and version 4 ISAs
  - ISA changes required by the MIPS Inc. scalar core
  - The performance simulator only supports the old ISA
- Finish sync support
  - take advantage of the Cray implementation
- VIRAM machine target
  - allows easier maintenance of frontend and optimizer mods for VIRAM
- User documentation
  - Summary of differences w/ the Cray compiler
  - Useful options, hints for vector code
45. Performance Features Remaining
- Additional tuning of the instruction scheduler
- Support the new SV2 inliner for C/C++
- Shortloop enhancements
- Reduce spilling
  - Scheduler awareness of register pressure
  - Ordering of blocks for register assignment within priority groups
  - Special vector registers carried across calls
- Loop unrolling for vector loops
- Tune for key benchmarks
46. Other Future Compiler Features?
- Support for speculative execution
- Compiler extensions for fixed-point hardware
- Support for vector functions (vector mlib)
47. Summary
- vcc is a reasonably robust compiler for VIRAM
- Performance on kernels is good with appropriate directives; optimum vectorization takes some effort
- Need to prioritize the remaining work
48. Codegen/Optimizer Issues for VIRAM
- Variable virtual processor width (VPW)
- Variable maximum vector register length (MVL)
- Vector flag registers treated as 1-bit-wide vector registers
- Multiple base, increment, and stride registers; auto-increment
- Fixed-point arithmetic (saturating add, etc.; see the sketch below)
- Memory consistency
- New vector instructions not available on the SV2
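On the fixed-point item, a minimal sketch of saturating-add semantics in portable C (my illustration): without ISA support, the compiler must emit a widen/compare/clamp sequence like this for every element, which is what makes hardware saturating arithmetic attractive:

    #include <stdint.h>

    /* 16-bit saturating add: clamp to the int16_t range instead of
       wrapping around on overflow. */
    static inline int16_t sat_add16(int16_t a, int16_t b)
    {
        int32_t s = (int32_t)a + (int32_t)b;  /* widen: cannot overflow */
        if (s > INT16_MAX) return INT16_MAX;  /* clamp high */
        if (s < INT16_MIN) return INT16_MIN;  /* clamp low */
        return (int16_t)s;
    }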
49. Generating Code for Variable MVL
- Maximum vector length is not specified in the IRAM ISA
- However, the compiler assumes an mvl at compile time
  - mvl is based on vpw
  - the mvl assumption depends on the VIRAM-1 hardware implementation
  - recompiling is required for future hardware versions if mvl changes
- MVL knowledge is useful for code gen and the vectorizer
  - register spilling
  - short-loop vectorization
  - length-dependent vectorization (and may eliminate safe vector length computation at run time), e.g.:
      for (i = 0; i < n; i++)
          a[i] = a[i+32];
    if mvl <= 32 for the loop's vpw (e.g., vpw = 64 gives mvl = 32 in VIRAM-1), no strip spans elements 32 apart, so the loop vectorizes without a run-time safe-vector-length check
50. Memory Consistency
- Cases to enforce: scalar-after-vector (SaV), vector-after-scalar (VaS), vector-after-vector (VaV), and per-VP ordering, each for RaW, WaR, and WaW hazards