The Future of Vector Processors

M. Valero, R. Espasa and J. Corbal, UPC, Barcelona. Kyoto, May 28th, 1999

1
The Future of Vector Processors

M. Valero, R. Espasa and J. Corbal
UPC, Barcelona

Kyoto, May 28th, 1999
2
TOP-500 and Vector Processors
[Chart: vector systems in the TOP-500 list. Data labels: 310, 96, 65, 43, 15]
November 98: Fujitsu 27, NEC 18, SGI 15, Hitachi 5
3
The Future of Vector ISAs

Cross-Pollination of Vector/Superscalar/VLIW
MMX, Embedded...
Very-high Performance Architectures
ILP techniques, IRAM, SDRAM
Vector Microprocessors
Numerical Accelerators
Multimedia Applications

4
Talk Outline

The Past
Initial Motivation for Vector ISA
Evolution of Vector Processors
The Present
Recent Announcements
The Decline of Vector Processors
Cross-Pollination of Vector/Superscalars/VLIW
The Future
Very-high Performance Architectures
Vector Microprocessors
Numerical Accelerators
Multimedia Applications
Conclusions

5
Characteristics of Numerical Applications

Examples: weather prediction, mechanical engineering
Data structures: huge matrices (dense, sparse)
Data types: 64 bits, floating point
Highly repetitive loops
Compute-intensive
Data-Level Parallel

6
Initial Motivations for Vector Processors
Dependence Graph
      real*8 x(9992), y(9992), u(9984)
      subroutine loop
      integer I
      real*8 q
      do I = 1, 9984
        q = u(I) * y(I)
        y(I) = x(I) + q
        x(I) = q - u(I) * x(I)
      enddo
      end
[Figure: dependence graph over y(I), u(I) and x(I), for I = 1 to 9984]
7
Execution of scalar code
Loop: ld   R1,0(R10)
      ld   R2,0(R11)
      ld   R3,0(R12)
      mulf R4,R1,R2
      mulf R5,R2,R3
      add  R11,R11,8
      addf R6,R4,R3
      subf R7,R4,R5
      st   0(R12),R7
      add  R10,R10,8
      st   0(R12),R7
      sub  R13,R13,1
      bne  Loop
      add  R12,R12,8
[Figure: instruction schedule on the memory (M) and ALU units]
14 cycles / iteration
Perfect memory !!!
8
Generation of Vector Code
A vector iteration is equivalent to 128 scalar iterations
      ld.w   9984,s2
      ld.w   0,a2
      ld.w   8,vs
      ...
Loop: mov    s2,vl         ; vl <- min(s2,128)
      ld.l   -y(a2),v0     ; v0 <- y(I:I+127)
      ld.l   -u(a2),v1     ; v1 <- u(I:I+127)
      mul.d  v1,v0,v2      ; q(I:I+127) <- u(I:I+127) * y()
      ld.l   -x(a2),v3     ; v3 <- x(I:I+127)
      add.d  v3,v2,v0      ; v0 <- x(I:I+127) + q(I:I+127)
      st.l   v0,-y(a2)     ; y(I:I+127) <- x(I:I+127) + q()
      mul.d  v1,v3,v1      ; v1 <- u(I:I+127) * x(I:I+127)
      sub.d  v2,v1,v0      ; v0 <- q() - u() * x()
      st.l   v0,-x(a2)     ; x(I:I+127) <- q() - u() * x()
      add.w  1024,a2       ; increment index (128 * 8)
      add.w  -128,s2       ; 128 iterations less to process
      lt.w   0,s2
      jbrs.t loop
[Figure: vector elements 0, 1, 2, ..., 127]
DLP !!!
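
The strip-mined structure of the vector loop above can be sketched in Python. This is only an illustrative model, not Convex code: `MVL`, `scalar_loop` and `strip_mined_loop` are names assumed here, and list slices stand in for vector registers.

```python
MVL = 128  # maximum vector length, as on the slide


def scalar_loop(x, y, u, n):
    """Reference version: one element per iteration."""
    for i in range(n):
        q = u[i] * y[i]
        y[i] = x[i] + q
        x[i] = q - u[i] * x[i]


def strip_mined_loop(x, y, u, n):
    """Same computation in strips of at most MVL elements."""
    start = 0
    remaining = n
    while remaining > 0:
        vl = min(remaining, MVL)              # mov s2,vl
        s = slice(start, start + vl)
        v0 = y[s]                             # ld.l y
        v1 = u[s]                             # ld.l u
        v2 = [a * b for a, b in zip(v1, v0)]  # mul.d: q = u * y
        v3 = x[s]                             # ld.l x
        v0 = [a + b for a, b in zip(v3, v2)]  # add.d: x + q
        y[s] = v0                             # st.l y
        v1 = [a * b for a, b in zip(v1, v3)]  # mul.d: u * x
        v0 = [a - b for a, b in zip(v2, v1)]  # sub.d: q - u * x
        x[s] = v0                             # st.l x
        start += vl                           # add.w 1024,a2 (in bytes on the slide)
        remaining -= vl                       # add.w -128,s2
```

When the trip count is not a multiple of `MVL`, the last strip simply runs with a shorter `vl`, which is the role of the `min(s2,128)` step on the slide.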
9
Execution of vector code
One L/S port, one adder, one multiplier
Loop: mov    s2,vl
      ld.l   -y(a2),v0
      ld.l   -u(a2),v1
      mul.d  v1,v0,v2
      ld.l   -x(a2),v3
      add.d  v3,v2,v0
      st.l   v0,-y(a2)
      mul.d  v1,v3,v1
      sub.d  v2,v1,v0
      st.l   v0,-x(a2)
      add.w  1024,a2
      add.w  -128,s2
      lt.w   0,s2
      jbrs.t loop
A vector iteration is equivalent to 128 scalar iterations
5.1 cycles / iteration
Memory latency: 24 cycles !!!
14 vector instructions = 1792 scalar instructions
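
The instruction-count comparison can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, the scalar loop is taken to execute 14 instructions per iteration (as 1792 / 128 implies), and both cycle figures are assumed to be per element.

```python
import math

N = 9984            # loop trip count from the example
MVL = 128           # elements covered by one vector iteration
SCALAR_INSTR = 14   # instructions per scalar iteration (assumed from 1792 / 128)
VECTOR_INSTR = 14   # instructions per vector iteration (slide)


def instruction_counts(n=N, mvl=MVL):
    """Return (vector iterations, scalar instructions, vector instructions)."""
    vector_iters = math.ceil(n / mvl)
    return vector_iters, SCALAR_INSTR * n, VECTOR_INSTR * vector_iters


def cycle_ratio(scalar_cycles=14.0, vector_cycles=5.1):
    """Scalar cycles per element divided by vector cycles per element."""
    return scalar_cycles / vector_cycles
```

Under these assumptions the whole loop needs 78 vector iterations, and the vector version fetches 128 times fewer instructions even though the scalar figure assumes perfect memory.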
10
Vector Processor
11
Why Vector ISA ?

Natural way to express Data-Level Parallelism
Fewer instructions (3)
Easy way to convey this information to the hardware
Good hardware implementation
Affordable/incremental parallelism (2)
Simple control/faster clock (1)
Mechanism to deal with memory latency
Problem: memory bandwidth...

12
Vector versus Scalar Architectures
Number of instructions (in millions)
Vector instruction semantics encode many different scalar instructions:
- Loop counters
- Branch computations
- Address generation
Ratio: from 140 to 2
F. Quintana, R. Espasa and M. Valero, "A case for merging the ILP..", PDP-98
13
Easy to convey information to the hardware

Data path
No pressure at fetch, decode and issue
Decentralized control
Faster cycle times
Vector memory instructions
Spatial locality can be made clearly visible to
the hardware through strides
No overhead and good prefetching
Reduction of memory latency overhead
Memory uses facts, not guesses

14
Key parameters for vector processors

Cycle time
Scalar processor
# of registers and FUs
Cache
Vector processor
# of vector registers
# of FUs and # of pipes/FU
Connection to memory
# of busses and width
Number of processors

15
Cray Y-MP Architecture
[Figure: Cray Y-MP memory system, with processors P0 to P7 connected to interleaved memory modules, plus synchronization]
256 modules, ta = 30 ns, tc = 6 ns, 333 Mflops / processor
16
Vector Processors (1 of 2)
17
Vector Processors (2 of 2 )
18
Evolution of Cray Machines
Tc: x6, ILP: x2, # of proc.: x32, Total: x400
Courtesy of SGI/Cray
19
Vector Innovations (1 of 2 )

Star-100/Cyber-200 had many of them
Gather/scatter
Masked operations for conditionals
Cray-1 introduced vector registers
BSP had instructions for recurrences and multioperand
Instructions to optimize masked vector operations
Instructions to handle index and bit sequences on the mask register
Flexible addressing of subvector registers (C4)
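
Gather/scatter can be illustrated with a short sketch. The helper names below are hypothetical and plain Python lists stand in for memory and vector registers; this is not any particular machine's instruction set.

```python
def gather(memory, index_vector):
    """Load memory[i] for each i in index_vector into a 'vector register'."""
    return [memory[i] for i in index_vector]


def scatter(memory, index_vector, values):
    """Store values[k] to memory[index_vector[k]] for every k."""
    for i, v in zip(index_vector, values):
        memory[i] = v


def indexed_scale(x, idx, alpha):
    """x[idx[k]] = alpha * x[idx[k]], written as gather, multiply, scatter."""
    v = gather(x, idx)
    v = [alpha * e for e in v]
    scatter(x, idx, v)
```

This is the access pattern sparse-matrix codes need: the index vector is data, so the addresses are not a constant stride.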

20
Vector Innovations ( 2 of 2 )

Multi-pipes (Star/Cyber)
Vector with Virtual Memory
Flexible chaining (multi-ported register-file)
Multilevel register-file (NEC)
Scalar units sharing vector FUs (Fujitsu)
Combined vector and scalar instructions (Titan)
Short vectors (CS-2 and CM-5)
Scalar processor: LIW (Fujitsu), SS (NEC)

21
Automatic vectorization

Compiler technology for vectorization: over 25 years of development
Dependence analysis
Elimination of false dependences
Strip mining
Loop interchange
Partial vectorization
Idiom recognition
IF conversion
Vector parallelization
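
As an illustration of IF conversion from the list above, the sketch below replaces a branch inside a loop with a mask that selects which results are kept. The loop body and the function names are made up for the example; a real vectorizing compiler would emit a vector compare and a masked store rather than Python lists.

```python
def loop_with_branch(a, b):
    """Original form: control flow inside the loop body."""
    for i in range(len(a)):
        if a[i] > 0:
            b[i] = 2 * a[i]


def if_converted(a, b):
    """IF-converted form: the condition becomes a mask vector."""
    mask = [v > 0 for v in a]        # vector compare sets the mask
    doubled = [2 * v for v in a]     # computed for every element
    # Masked store: only elements whose mask bit is set are overwritten.
    b[:] = [d if m else old for d, m, old in zip(doubled, mask, b)]
```

Both versions leave `b` in the same state; the second has no data-dependent branch, so the whole body can run as vector operations.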

22
Vector Architectures Present

New announcements (NEC, Cray, Fujitsu)
The decline of vector processors
Cross-pollination of Vector/ Superscalar/ VLIW
processors

23
NEC SX-5

Announced on June 5th, 1998
8 Gflops, CMOS, tc = 4 ns
Superscalar processor at 500 Mflops
32 results/cycle (2 FPUs, 16 pipes)
32 data memory accesses/cycle (2 ports, 16 data/port). Memory bandwidth of 64 GB/s
System composed of 32 nodes of 128 Gflops each, providing 4 Tflop/s
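
The SX-5 figures above are mutually consistent, as a quick calculation shows. This is a sketch: the 8-byte word size is an assumption (64-bit data), and the function names are hypothetical.

```python
TC_NS = 4.0               # cycle time in ns (slide)
RESULTS_PER_CYCLE = 32    # 2 FPUs x 16 pipes (slide)
ACCESSES_PER_CYCLE = 32   # 2 ports x 16 data/port (slide)
BYTES_PER_WORD = 8        # assumption: 64-bit words


def peak_gflops():
    """Results per nanosecond equals Gflops."""
    return RESULTS_PER_CYCLE / TC_NS


def memory_bandwidth_gb_s():
    """Bytes per nanosecond equals GB/s."""
    return ACCESSES_PER_CYCLE * BYTES_PER_WORD / TC_NS


def system_tflops(nodes=32, node_gflops=128):
    """Aggregate peak of the full system in Tflop/s."""
    return nodes * node_gflops / 1000.0
```

With these inputs the processor peaks at 8 Gflops with 64 GB/s of memory bandwidth, and 32 nodes of 128 Gflops give about 4.1 Tflop/s, which the slide rounds to 4.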

24
Cray SV1

Announced on June 16th, 1998
CMOS, 250 MHz and 4 Gigaflops/proc.
Vector cache memory
2 FUs of 8 operations/cycle
Multi-Streaming Processor
Scalable vector architecture (32 nodes of 32 processors = 4 Teraflops)
Future processor enhancements !!!

25
Fujitsu VPP5000

Announced on April 20th, 1999
9.2 Gflop/s, CMOS, 0.22 micron, 33 Mtrs/chip
Linpack 1000 x 1000 gives 8758 Mflop/s
Crossbar provides 21.6 GB/s per processor
System composed of 512 PEs, or 4.9 Teraflops
Maximum of 16 GB/PE, or 8 TB for 512 PEs

26
The decline of vector processors

Why have vector machines declined so fast in
popularity?
Cost (Scalar parallel machines use commodity
parts)
Too restricted in applications (lack of
vectorization in many programs)
Massive use of computers to run so called
Non-numerical Applications

27
Characteristics of non-numerical Applications

Examples: OLTP, DSS, simulators, games
General data structures: lists, trees, tables
Data types: scalar integers of 8 to 64 bits
Frequent control flow changes => Speculation
Short distance data dependencies => Forwarding
Instruction/data locality => Caches
Fine-grain ILP => Out-of-order

28
Micro Killers ???
Peak performance: Tc and ILP
29
Bandwidth and Performance
30
Peak performance and Bandwidth
[Figure: efficiency (%), 0 to 100, versus vector length, 0 to 4000, for VPP500 and IBM RS6000]
Kernel: Z(I) = C0 + A(I)*(C1 + B(I)*(C2 + C(I)*(C3 + D(I)*(C4 + E(I)*(C5 + F(I)*(C6 + G(I)*(C7 + H(I)*(C8 + K(I)*(C9 + L(I))))))))))
Measurement condition: RS6000-590 (66.6 MHz), FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2
Courtesy of Fujitsu
31
Vector ideas used in SSs/VLIW processors

Address prediction and prefetching
Exploitation of data locality (the stride value is used for locality detection and exploitation)
Predicated execution (VLIW)
Multiply and add, chaining
Multi-size operands
Data reuse and vectorization
Addressing modes (auto-increment)
Multithreading (2 scalar processors in Fujitsu machines)
Dynamic load/store elimination

32
Predictions for ALL instructions
Y. Sazeides and J.E. Smith, "The predictability of data values", MICRO-30, 1997
33
Characterization of Vector Programs
R. Espasa, "Advanced Vector Architectures", PhD Thesis, Feb. 97
34
SSs ideas usable in vector processors

Decoupled Vector Architectures
Multithreaded Vector Architectures
Out-of-order Vector Architectures
Simultaneous Multithreaded Vector Architecture
Victim Register File

R. Espasa, M. Valero and J.E. Smith: HPCA96, HPCA97, MICRO97, ICS97...
35
ILP+DLP: Out-of-order Vector
[Figure: block diagram with Fetch, Decode/Rename, S registers, A registers, V registers, LD/ST, Memory and Reorder Buffer]
R. Espasa, M. Valero and J.E. Smith, "Out-of-order Vector Architecture", MICRO-30, 1997.
36
OOO Vector Performance
R. Espasa, M. Valero and J.E. Smith, "Out-of-order Vector Architecture", MICRO-30, 1997.
37
Vector Processors The Future

Very high-performance architectures
Vector Microprocessors
Numerical Accelerators
Multimedia Applications

38
Architectures for a Billion Transistors

Advanced/Superspeculative Architectures
Trace Processors
Simultaneous Multithreading
Multiprocessor on a chip
RAW processors
IRAM

"Billion-Transistor Architectures", IEEE Computer, Sept. 1997
39
SMV

Simultaneous Multithreaded Vector Arch.
Mixes three paradigms
DLP: vector unit
ILP: O-o-O execution
TLP: multithreaded fetch unit
Requires a memory system with
high performance at low cost
low pin-count

R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism", IEEE Micro, Sep. 1997
40
Billion Trans. Vector Architecture
[Figure: block diagram of the architecture and its memory]
R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism", IEEE Micro, Sep. 1997
41
SMV Performance
R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism", IEEE Micro, Sep. 1997
42
V-IRAM1
0.18 µm, 200 MHz, 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
[Figure: chip floorplan, including serial I/O]
D.A. Patterson, "New directions in Computer Architecture", Berkeley, June 1998
43
Conflict-free access to vectors
Idea: out-of-order access
[Figure: processors P1 to Pn connected through an interconnection network to memory modules grouped in sections]
M. Valero et al., ISCA 92, ISCA 95, IEEE-TC 95, ICS 92, ICS 94,...
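
Why conflict-free access matters can be seen from how a constant-stride stream maps onto interleaved modules. The sketch below assumes simple low-order interleaving (module = address mod number of modules), which is an assumption made here for illustration and not the out-of-order scheme of the cited papers; the function names are hypothetical.

```python
from math import gcd


def in_order_modules(base, stride, length, num_modules):
    """Module visited by each element when accessed in program order."""
    return [(base + k * stride) % num_modules for k in range(length)]


def modules_touched(stride, num_modules):
    """Distinct modules a long constant-stride stream can ever reach."""
    return num_modules // gcd(stride, num_modules)
```

With 256 modules, a stride of 1 spreads over all of them, a stride of 8 reaches only 32, and a stride of 256 hits a single module every time. Reordering the element accesses is one way to keep more modules busy for such strides.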
44
Command Memory System
Command: <@, Length, Stride, size>
Break commands into bursts at the section controller
J. Corbal, R. Espasa and M. Valero, "Command-Vector Memory System", PACT98
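
A command of this form, and its splitting into bursts, might be modeled as below. Everything here is a hypothetical sketch: the `Command` class, the byte units for stride and size, and the fixed `burst_len` are assumptions for illustration, not the design from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class Command:
    address: int  # base address ("@" on the slide)
    length: int   # number of elements
    stride: int   # distance between elements, in bytes (assumed unit)
    size: int     # element size, in bytes (assumed unit)


def expand(cmd):
    """Element addresses described by one command."""
    return [cmd.address + k * cmd.stride for k in range(cmd.length)]


def split_into_bursts(cmd, burst_len):
    """Break one command into consecutive sub-commands of at most burst_len elements."""
    bursts = []
    done = 0
    while done < cmd.length:
        n = min(burst_len, cmd.length - done)
        bursts.append(Command(cmd.address + done * cmd.stride, n, cmd.stride, cmd.size))
        done += n
    return bursts
```

The point of the encoding is that a single short command stands for a whole vector access, so the processor sends a few words instead of one address per element.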
45
System configuration in 2009
T. Watanabe, SC98, Orlando.
46
Vector Microprocessors

Ways of reducing the design impact
Short vectors (64 x 16 words = 8 Kbytes)
Vector functional units shared with INT/FP units
Vector register renaming to allow precise exceptions
Cache hierarchy tuned to vector execution
Vector data locality allows large data transactions
Very large bandwidth between cache and vector registers
High performance for numerical and multimedia applications

47
General Architecture
[Figure: block diagram with I-Cache, Fetch, Decode, VRF, Vector Cache and Rambus Controller; path labels 1024 and 8]
48
Vector PC Vs SuperScalar
49
Cache Hierarchy

Where should the Vector Cache be placed?

[Figure: two candidate hierarchies built from Direct Rambus, L2, L1, the vector cache (VC) and the CPU]
50
Performance of the cache hierarchies
[Figure: FLOPS/cycle (EIPC) for BDNA, FLO52 and HYDRO2D, comparing the vector cache on L1, the vector cache on L2, and a perfect cache]
51
Importance of media Applications
In the next five years (1998-2002), we believe that media processing will become the dominant force in computer architecture (K. Diefendorff and P. K. Dubey, IEEE Computer, Sep. 97, pp. 43-45)
90% of desktop cycles will be spent on media applications by 2000 (Scott Kirkpatrick of IBM)
52
Characteristics of media Applications

Examples: image/speech processing, communications, virtual reality, graphics
Data structures: matrices and vectors
Data types: integer (8-32 bits), FP (32-64 bits)
Demand for high memory bandwidth
Low data locality and latency problem
No critical data-dependences
Real time necessity
Fine/coarse grain parallelism

53
Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
54
MMX-like processors

Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications
Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip area investment.
The highlights of multimedia extensions are:
Single Instruction, Multiple Data (SIMD) techniques
New data types (multimedia vectors, 32/64 bits)
Multimedia registers
SIMD-like instructions over small integer data types

55
MMX instruction example

PADDW: parallel add of 4 x 16-bit data with wrap-around (no saturation)

[Figure: 64-bit register divided into four 16-bit fields, bit positions 0, 15, 31, 47, 63]
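
The wrap-around behaviour of PADDW can be modeled on a 64-bit integer holding four 16-bit lanes. This is a software sketch of the semantics described on the slide, with hypothetical helper names; it is not how the hardware implements the instruction.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(lanes):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def paddw(a, b):
    """Add lane by lane; each sum wraps modulo 2**16 and never carries into its neighbour."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])
```

The difference from an ordinary 64-bit add is that carries stop at lane boundaries, so an overflow in one 16-bit element cannot corrupt the next one.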

56
Superscalar Multimedia Processors
Microprocessor Report, Vol. 12, No. 6, May 11, 1998
57
Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
58
Multimedia Embedded Systems

NEC V830R/AV includes MIX2, a multimedia
instruction extension (SIMD, MMX-like approach)
Hitachi SH4 includes FP 4-length vector
instructions, targeted at geometry transformation
in 3D rendering applications
ARM10 Thumb Family processors will include a
Vector FP unit capable of delivering 600 MFLOPS

59
Widen is better(?)

Most multimedia algorithms exhibit vectors no longer than 8/16 elements => widening the multimedia registers could provide diminishing returns.

[Figure labels: SS, Altivec, MMX]
60
VLIW: Widening vs Replication
Bus configurations
D. López et al., "Increasing Memory Bandwidth with Wide Busses", ICS-97
61
Widening and Replication Performance
D. López et al., "Widening versus replicating...", ICS98, MICRO98
62
Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
63
Torrent T0 Microprocessor

The first single-chip vector microprocessor.
Can sustain over 24 operations per cycle while having an issue rate of only one 32-bit instruction per cycle
Features
16 vector registers (32 x 32-bit elements each)
2 vector arithmetic units (8 pipes each)
Reconfigurable composite operation pipelines
128-bit wide external memory interface
MIPS-II 32-bit instruction set scalar unit

K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995
64
Torrent T0 Microprocessor
K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995
65
Vector versus Superscalar Processors

Comparison of Die Area
Processor die area (in mm², scaled to 0.25 µm)

[Chart: bar values 250.0, 69.81, 66.92, 67.77, 37.77, 21.86, 14.73]
C. G. Lee and D. J. DeVries, "Initial Results on ...", MICRO-30, 1997.
66
Vector versus Superscalar Processors

Component Percentages

C. G. Lee and D. J. DeVries, "Initial Results on ...", MICRO-30, 1997.
67
Imagine project

Focused on developing a programmable architecture
that achieves performance similar to special
purpose hardware on graphics and image
processing.
Matches media applications demands to the current
VLSI capabilities by using a stream-based
programming model.
Most multimedia kernels exhibit a streaming
nature.
Individual stream elements can be operated on in
parallel, thus exploiting data parallelism.

Bill Dally, "Tomorrow's Computing Engines", Keynote, HPCA98
68
Imagine architecture

Organized around a large stream register file
(64Kb)
Memory operations move entire streams of data
Data streams pass through a set of arithmetic
clusters (8)
Each cluster unit operates a single element under
VLIW control

Bill Dally, "Tomorrow's Computing Engines", Keynote, HPCA98
69
Matrix extensions for Multimedia

By combining the conventional vector approach with SIMD MMX-like instructions, we can exploit additional levels of DLP with matrix-oriented multimedia extensions.

[Figure: operand layouts for SS, MMX and MOM over 64-bit registers (bit positions 0, 15, 31, 47, 63), with operands A1 to A16 and B1 to B4 producing results C1 to C16]
70
Relative Performance
INVERSE DCT TRANSFORM
MPEG-2 MOTION ESTIMATION
RGB-YCC Color CONVERSION
71
Applications and Architectures
Numerical Applications

Integer
Very Slow
Subroutines

FPU
Very Big Improvement !!!

Additional Speed
FPU
72
Future Applications