Title: The Future of Vector Processors
1The Future of Vector Processors
- M. Valero, R. Espasa and J. Corbal
- UPC, Barcelona
Kyoto, May 28th, 1999
2TOP-500 and Vector Processors
[Figure: vector systems in the TOP-500 list. Values recovered from the chart: 310, 96, 65, 43, 15. November 98: Fujitsu 27, NEC 18, SGI 15, Hitachi 5]
3The Future of Vector ISAs
- Cross-Pollination of Vector/Superscalar/VLIW
- MMX, Embedded...
- Very-high Performance Architectures
- ILP techniques, IRAM, SDRAM
- Vector Microprocessors
- Numerical Accelerators
- Multimedia Applications
4Talk Outline
- The Past
- Initial Motivation for Vector ISA
- Evolution of Vector Processors
- The Present
- Recent Announcements
- The Decline of Vector Processors
- Cross-Pollination of Vector/Superscalars/VLIW
- The Future
- Very-high Performance Architectures
- Vector Microprocessors
- Numerical Accelerators
- Multimedia Applications
- Conclusions
5Characteristics of Numerical Applications
- Examples: weather prediction, mechanical engineering
- Data structures: huge matrices (dense, sparse)
- Data types: 64 bits, floating point
- Highly repetitive loops
- Compute-intensive
- Data-Level Parallel
6Initial Motivations for Vector Processors
      subroutine loop
      integer I
      real*8 x(9992), y(9992), u(9984)
      real*8 q
      do I = 1, 9984
         q = u(I) * y(I)
         y(I) = x(I) + q
         x(I) = q - u(I) * x(I)
      enddo
      end
[Figure: dependence graph of the loop body over y(I), u(I) and x(I), for I = 1 to 9984]
7Execution of scalar code
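Loop: ld   R1,0(R10)      ; y(I)
      ld   R2,0(R11)      ; u(I)
      ld   R3,0(R12)      ; x(I)
      mulf R4,R1,R2       ; q = u(I) * y(I)
      mulf R5,R2,R3       ; u(I) * x(I)
      add  R11,R11,8
      addf R6,R4,R3       ; x(I) + q
      subf R7,R4,R5       ; q - u(I) * x(I)
      st   0(R10),R6      ; y(I) = x(I) + q
      add  R10,R10,8
      st   0(R12),R7      ; x(I) = q - u(I) * x(I)
      sub  R13,R13,1
      bne  Loop
      add  R12,R12,8
[Figure: the 14 instructions scheduled on the memory (M) and ALU units]
14 cycles / iteration, assuming perfect memory !!!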
8Generation of Vector Code
A vector iteration is equivalent to 128 scalar iterations.
      ld.w   9984,s2
      ld.w   0,a2
      ld.w   8,vs
      ...
Loop: mov    s2,vl           ; vl <- min(s2,128)
      ld.l   -y(a2),v0       ; v0 <- y(I:I+127)
      ld.l   -u(a2),v1       ; v1 <- u(I:I+127)
      mul.d  v1,v0,v2        ; q(I:I+127) <- u(I:I+127) * y( )
      ld.l   -x(a2),v3       ; v3 <- x(I:I+127)
      add.d  v3,v2,v0        ; v0 <- x(I:I+127) + q(I:I+127)
      st.l   v0,-y(a2)       ; y(I:I+127) <- x(I:I+127) + q( )
      mul.d  v1,v3,v1        ; v1 <- u(I:I+127) * x(I:I+127)
      sub.d  v2,v1,v0        ; v0 <- q( ) - u( ) * x( )
      st.l   v0,-x(a2)       ; x(I:I+127) <- q( ) - u( ) * x( )
      add.w  1024,a2         ; increment index (128 * 8)
      add.w  -128,s2         ; 128 iterations less to process
      lt.w   0,s2
      jbrs.t Loop
[Figure: elements 0, 1, 2, ..., 127 of a vector operated on in parallel]
DLP !!!
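The strip-mined shape of this loop can be mimicked in Python as a rough sketch. The names `scalar_loop`, `vector_loop` and `MVL` are made up for illustration, list slices stand in for vector registers, and nothing here models the timing of a real machine.

```python
MVL = 128  # maximum vector length, as in "vl <- min(s2,128)"

def scalar_loop(x, y, u, n):
    """Reference version: one element per iteration."""
    for i in range(n):
        q = u[i] * y[i]
        y[i] = x[i] + q
        x[i] = q - u[i] * x[i]

def vector_loop(x, y, u, n):
    """Strip-mined version: each pass handles up to MVL elements."""
    passes = 0
    start = 0
    remaining = n
    while remaining > 0:
        vl = min(remaining, MVL)                        # mov s2,vl
        end = start + vl
        v0 = y[start:end]                               # ld.l y
        v1 = u[start:end]                               # ld.l u
        v2 = [a * b for a, b in zip(v1, v0)]            # mul.d: q = u * y
        v3 = x[start:end]                               # ld.l x
        y[start:end] = [a + b for a, b in zip(v3, v2)]  # add.d, st.l y
        v1 = [a * b for a, b in zip(v1, v3)]            # mul.d: u * x
        x[start:end] = [a - b for a, b in zip(v2, v1)]  # sub.d, st.l x
        start = end
        remaining -= vl
        passes += 1
    return passes
```

With n = 9984 the vector version makes 78 passes of 128 elements each and leaves x and y with the same contents as the scalar version.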
9Execution of vector code
One L/S port, one adder, one multiplier
Loop: mov    s2,vl
      ld.l   -y(a2),v0
      ld.l   -u(a2),v1
      mul.d  v1,v0,v2
      ld.l   -x(a2),v3
      add.d  v3,v2,v0
      st.l   v0,-y(a2)
      mul.d  v1,v3,v1
      sub.d  v2,v1,v0
      st.l   v0,-x(a2)
      add.w  1024,a2
      add.w  -128,s2
      lt.w   0,s2
      jbrs.t Loop
A vector iteration is equivalent to 128 scalar iterations.
5.1 cycles / iteration. Memory latency: 24 cycles !!!
14 vector instructions = 1792 scalar instructions
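The instruction-count claim on this slide is plain arithmetic, sketched below. The constant and function names are illustrative; the counts cover only the loop bodies (14 instructions each) and ignore the three setup instructions before the vector loop.

```python
SCALAR_INSTRS_PER_ITER = 14   # scalar loop body
VECTOR_INSTRS_PER_PASS = 14   # vector loop body, mov through jbrs.t
MVL = 128                     # elements handled by one vector pass
N = 9984                      # trip count of the example loop

def instruction_counts(n=N, mvl=MVL):
    """Return (vector passes, scalar instructions, vector instructions)."""
    passes = -(-n // mvl)     # ceiling division
    scalar_total = SCALAR_INSTRS_PER_ITER * n
    vector_total = VECTOR_INSTRS_PER_PASS * passes
    return passes, scalar_total, vector_total

# One vector pass replaces 128 scalar iterations of 14 instructions each.
SCALAR_EQUIVALENT_OF_ONE_PASS = SCALAR_INSTRS_PER_ITER * MVL
```

For the full loop this gives 78 passes, 139776 scalar instructions against 1092 vector instructions, and 1792 scalar instructions per vector pass.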
10Vector Processor
11Why Vector ISA ?
- Natural way to express Data-Level Parallelism
- Fewer instructions ( 3 )
- Easy way to convey this information to the hardware
- Good hardware implementation
- Affordable/incremental parallelism ( 2 )
- Simple control/faster clock ( 1 )
- Mechanism to deal with memory latency
- Problem: memory bandwidth...
12Vector versus Scalar Architectures
[Figure: number of instructions (in millions)]
Vector instruction semantics encode many different scalar instructions:
- Loop counters
- Branch computations
- Address generation
Ratio: from 140 to 2
F. Quintana, R. Espasa and M. Valero, "A case for merging the ILP..", PDP-98
13Easy to convey information to the hardware
- Data path
- No pressure at fetch, decode and issue
- Decentralized control
- Faster cycle times
- Vector memory instructions
- Spatial locality can be made clearly visible to the hardware through strides
- No overhead and good prefetching
- Reduction of memory latency overhead
- Memory uses facts, not guesses
14Key parameters for vector processors
- Cycle time
- Scalar processor
- # of registers and FUs
- Cache
- Vector processor
- # of vector registers
- # of FUs and # of pipes/FU
- Connection to memory
- # of busses and width
- Number of processors
15Cray Y-MP Architecture
[Figure: Cray Y-MP memory organization, with processors P0 to P7 connected to interleaved memory modules 0 to 255, plus synchronization]
256 modules. ta = 30 ns. tc = 6 ns. 333 Mflops / processor
16Vector Processors (1 of 2)
17Vector Processors (2 of 2 )
18Evolution of Cray Machines
Tc: x6, ILP: x2, # of proc.: x32, Total: x400
Courtesy of SGI/CRAY
19Vector Innovations (1 of 2 )
- Star-100/Cyber-200 had many of them
- Gather/scatter
- Masked operations for conditionals
- Cray-1 introduced vector registers
- BSP had instructions for recurrences and multioperand
- Instructions to optimize masked vector operations
- Instructions to handle Index and Bit sequence on mask register
- Flexible addressing of subvector registers (C4)
20Vector Innovations ( 2 of 2 )
- Multi-pipes (Star/Cyber)
- Vector with Virtual Memory
- Flexible chaining (multi-ported register-file)
- Multilevel register-file (NEC)
- Scalar units sharing vector FUs (Fujitsu)
- Combined vector and scalar instructions (Titan)
- Short vectors (CS-2 and CM-5)
- Scalar processor: LIW (Fujitsu), SS (NEC)
21Automatic vectorization
- Compiler technology for vectorization: over 25 years of development
- Dependence analysis
- Elimination of false dependences
- Strip mining
- Loop interchange
- Partial vectorization
- Idiom recognition
- IF conversion
- Vector parallelization
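As a rough illustration of IF conversion from the list above: a conditional inside a loop body is turned into a mask computation followed by a masked operation, so the loop body no longer contains a branch. The function names and the example condition are made up for this sketch, and plain Python lists stand in for vector and mask registers.

```python
def branchy(a, b):
    """Original loop: control flow depends on the data."""
    out = list(b)
    for i in range(len(a)):
        if a[i] > 0:
            out[i] = a[i] * 2
    return out

def if_converted(a, b):
    """IF-converted loop: compute a mask, then apply a masked update."""
    mask = [ai > 0 for ai in a]        # vector compare -> mask register
    doubled = [ai * 2 for ai in a]     # computed for every element
    return [d if m else bi             # masked merge into the old values
            for d, m, bi in zip(doubled, mask, b)]
```

Both versions produce the same result; the second one maps directly onto a vector compare plus a masked vector operation.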
22Vector Architectures Present
- New announcements (NEC, Cray, Fujitsu)
- The decline of vector processors
- Cross-pollination of Vector/ Superscalar/ VLIW
processors
23NEC SX-5
- Announced on June 5th, 1998
- 8 Gflops, CMOS, tc = 4 ns
- Superscalar processor at 500 Mflops
- 32 results/cycle (2 FPU, 16-pipe)
- 32 data memory accesses/cycle (2 ports, 16 data/port). Memory bandwidth of 64 GB/s
- System composed of 32 nodes of 128 Gflops, providing 4 Tflop/s
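The SX-5 figures above are consistent with each other, as the following back-of-the-envelope sketch shows. The function name is illustrative, and the 8-byte access size is an assumption (64-bit words) rather than something stated on the slide.

```python
def sx5_peak():
    """Recompute the quoted SX-5 peak numbers from the per-cycle figures."""
    tc_ns = 4.0                      # cycle time from the slide
    clock_ghz = 1.0 / tc_ns          # 0.25 GHz
    results_per_cycle = 2 * 16       # 2 FPUs x 16 pipes
    accesses_per_cycle = 2 * 16      # 2 ports x 16 data/port
    bytes_per_access = 8             # assumption: 64-bit words
    cpu_gflops = clock_ghz * results_per_cycle
    bandwidth_gbs = clock_ghz * accesses_per_cycle * bytes_per_access
    system_tflops = 32 * 128 / 1000.0    # 32 nodes x 128 Gflops
    return cpu_gflops, bandwidth_gbs, system_tflops
```

Under these assumptions the sketch reproduces 8 Gflops per processor, 64 GB/s of memory bandwidth, and roughly 4 Tflop/s for the full system.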
24Cray SV1
- Announced on June 16th, 1998
- CMOS, 250 MHz and 4 Gigaflops/proc.
- Vector cache memory
- 2 FUs of 8 operations/cycle
- Multi-Streaming Processor
- Scalable vector architecture (32 nodes of 32 processors = 4 Teraflops)
- Future processor enhancements !!!
25Fujitsu VP5000
- Announced on April 20th, 1999
- 9.6 Gflop/s, CMOS, 0.22 micron, 33 M transistors/chip
- Linpack 1000x1000 gives 8758 Mflop/s
- Crossbar provides 21.6 GB/s per processor
- System composed of up to 512 PEs, or 4.9 Teraflops
- Maximum of 16 GB/PE, or 8 TB for 512 PEs
26The decline of vector processors
- Why have vector machines declined so fast in popularity?
- Cost (scalar parallel machines use commodity parts)
- Too restricted in applications (lack of vectorization in many programs)
- Massive use of computers to run so-called non-numerical applications
27Characteristics of non-numerical Applications
- Examples: OLTP, DSS, simulators, games
- General data structures: lists, trees, tables
- Data types: scalar integers of 8 to 64 bits
- Frequent control flow change -> Speculation
- Short-distance data dependencies -> Forwarding
- Instruction/data locality -> Caches
- Fine-grain ILP -> Out-of-order
28Micro Killers ???
Peak performance Tc ILP
29Bandwidth and Performance
30Peak performance and Bandwidth
[Figure: efficiency (%) versus vector length (0 to 4000) for the VPP500 and the IBM RS6000]
Kernel: Z(I) C0 A(I) (C1 B(I) (C2 C(I) (C3 D(I) (C4 E(I) (C5 F(I) (C6 G(I) (C7 H(I) (C8 K(I) (C9 L(I))))))))))
Measurement condition: RS6000-590 (66.6 MHz), FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2
Courtesy of Fujitsu
31 Vector ideas used in SSs/VLIW processors
- Address prediction and prefetching
- Exploitation of data locality (the stride value is used for locality detection and exploitation)
- Predicated execution (VLIW)
- Multiply and add, chaining
- Multi-size operands
- Data reuse and vectorization
- Addressing modes (auto-increment)
- Multithreading (2 scalar processors in Fujitsu machines)
- Dynamic load/store elimination
32Predictions for ALL instructions
Y. Sazeides and J.E. Smith, "The predictability of data values", MICRO-30, 1997
33Characterization of Vector Programs
R. Espasa, "Advanced Vector Architectures", PhD Thesis, Feb. 97
34 SSs ideas usable in vector processors
- Decoupled Vector Architectures
- Multithreaded Vector Architectures
- Out-of-order Vector Architectures
- Simultaneous Multithreaded Vector Architecture
- Victim Register File
R. Espasa, M. Valero and J.E. Smith HPCA96,
HPCA97, MICRO97, ICS97...
35ILPDLP Out-of-order Vector
[Figure: out-of-order vector pipeline with Fetch, Decode/Rename, S registers, A registers, V registers, LD/ST, Memory and Reorder Buffer]
R. Espasa, M. Valero, J.E. Smith Out-of-order
Vector Architecture MICRO30, 1997.
36OOO Vector Performance
R. Espasa, M. Valero, J.E. Smith Out-of-order
Vector Architecture MICRO30, 1997.
37Vector Processors The Future
- Very high-performance architectures
- Vector Microprocessors
- Numerical Accelerators
- Multimedia Applications
38Architectures for a Billion Transistors
- Advanced/Superspeculative Architectures
- Trace Processors
- Simultaneous Multithreading
- Multiprocessor on a chip
- RAW processors
- IRAM
Billion -Transistor Architectures. IEEE Computer
Sept. 1997
39SMV
- Simultaneous Multithreaded Vector Arch.
- Mixes three paradigms
- DLP: vector unit
- ILP: O-o-O execution
- TLP: multithreaded fetch unit
- Requires a memory system with
- high performance at low cost
- low pin-count
R. Espasa and M. Valero, "Exploiting Instruction and Data-Level Parallelism", IEEE MICRO, Sep. 1997
40Billion Trans. Vector Architecture
[Figure: block diagram of the billion-transistor vector architecture and its memory]
R. Espasa and M. Valero, "Exploiting Instruction and Data-Level Parallelism", IEEE MICRO, Sep. 1997
41SMV Performance
R. Espasa and M. Valero, "Exploiting Instruction and Data-Level Parallelism", IEEE MICRO, Sep. 1997
42V-IRAM1
0.18 µm, 200 MHz, 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
[Figure: V-IRAM1 floorplan, including serial I/O]
D.A. Patterson, "New directions in Computer Architecture", Berkeley, June 1998
43Conflict-free access to vectors
Idea: out-of-order access
[Figure: processors P1, P2, P3, ..., Pn connected through an interconnection network to memory modules grouped into sections]
M. Valero et al. ISCA 92, ISCA 95, IEEE-TC 95,
ICS 92, ICS 94,...
44Command Memory System
Command: <@, Length, Stride, size>
Break commands into bursts at the section controller
J. Corbal, R. Espasa and M. Valero, "Command-Vector Memory System", PACT98
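A minimal sketch of what "breaking a command into bursts" could look like, assuming a command carries a base address, an element count, a stride and an element size. The `Command` type, the helper names, the stride being counted in elements, and the burst length of 8 are all assumptions for illustration; the slide does not say how the section controller actually forms its bursts.

```python
from typing import List, NamedTuple

class Command(NamedTuple):
    addr: int     # base address (the "@" field)
    length: int   # number of elements
    stride: int   # distance between elements, in elements (assumption)
    size: int     # element size in bytes

def element_addresses(cmd: Command) -> List[int]:
    """Expand a command into the byte address of every element."""
    return [cmd.addr + i * cmd.stride * cmd.size for i in range(cmd.length)]

def split_into_bursts(cmd: Command, burst_len: int = 8) -> List[Command]:
    """Break one command into sub-commands of at most burst_len elements."""
    bursts = []
    done = 0
    while done < cmd.length:
        n = min(burst_len, cmd.length - done)
        start = cmd.addr + done * cmd.stride * cmd.size
        bursts.append(Command(start, n, cmd.stride, cmd.size))
        done += n
    return bursts
```

The point of the sketch is that a single compact command describes the whole access, and the expansion into addresses happens next to the memory rather than in the processor.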
45System configuration in 2009
T. Watanabe SC98, Orlando.
46Vector Microprocessors
- Ways of reducing the design impact
- Short vectors (64 x 16 words = 8 Kbytes)
- Vector functional units shared with INT/FP units
- Vector register renaming to allow precise exceptions
- Cache hierarchy tuned to vector execution
- Vector data locality allows large data transactions
- Very large bandwidth between cache and vector registers
- High performance for numerical and multimedia applications
47General Architecture
[Figure: block diagram with I-Cache, Fetch, Decode, VRF, Vector Cache and Rambus Controller; paths labelled 1024 and 8]
48Vector PC Vs SuperScalar
49Cache Hierarchy
- Where should the Vector Cache be placed?
[Figure: two hierarchies between the CPU and Direct Rambus, one with the vector cache (VC) at the L1 level and one with the VC at the L2 level]
50Performance of the cache hierarchies
[Figure: EIPC and FLOPS/cycle on BDNA, FLO52 and HYDRO2D for three configurations: vector cache on L1, vector cache on L2, perfect cache]
51Importance of media Applications
"In the next five years (1998-2002), we believe that media processing will become the dominant force in computer architecture" (K. Diefendorff and P. K. Dubey, IEEE Computer, Sep. 97, pp. 43-45)
"90% of desktop cycles will be spent on media applications by 2000" (Scott Kirkpatrick, IBM)
52Characteristics of media Applications
- Examples: image/speech processing, communications, virtual reality, graphics
- Data structures: matrices and vectors
- Data types: integer (8-32 bits), FP (32-64 bits)
- Demand for high memory bandwidth
- Low data locality and latency problem
- No critical data-dependences
- Real time necessity
- Fine/coarse grain parallelism
53Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
54MMX-like processors
- Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications
- Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip area investment.
- The highlights of multimedia extensions are:
- Single Instruction, Multiple Data (SIMD) techniques
- New data types (multimedia vectors, 32/64 bits)
- Multimedia registers
- SIMD-like instructions, over small integer data types
55MMX instruction example
- PADDW: parallel ADD of 4 x 16-bit data with wraparound (no saturation)
[Figure: a 64-bit register split into four 16-bit fields at bits 0-15, 16-31, 32-47 and 48-63]
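The behaviour of PADDW described on this slide can be modelled in a few lines. This is only a software sketch of the semantics (four independent 16-bit lanes, each wrapping modulo 2^16); the function name mirrors the instruction, but the code says nothing about how the hardware implements it.

```python
MASK16 = 0xFFFF

def paddw(a: int, b: int) -> int:
    """Add four packed 16-bit lanes of two 64-bit values, wrapping each lane."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & MASK16
        y = (b >> shift) & MASK16
        # Mask the sum so a carry never crosses into the next lane.
        result |= ((x + y) & MASK16) << shift
    return result
```

For example, `paddw(0x0001FFFF7FFF0010, 0x0001000100010020)` returns `0x0002000080000030`: the lane holding 0xFFFF wraps to 0x0000 without disturbing its neighbours.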
56Superscalar Multimedia Processors
Microprocessor Report Vol 12, N 6, May 11, 1998
57Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
58Multimedia Embedded Systems
- NEC V830R/AV includes MIX2, a multimedia instruction extension (SIMD, MMX-like approach)
- Hitachi SH4 includes FP 4-length vector instructions, targeted at geometry transformation in 3D rendering applications
- ARM10 Thumb Family processors will include a Vector FP unit capable of delivering 600 MFLOPS
59Widen is better(?)
- Most multimedia algorithms exhibit vectors no longer than 8/16 elements => widening the multimedia registers could provide diminishing returns.
[Figure: SS, Altivec and MMX compared]
60VLIW Widening vs Replication
Bus configurations
D. López et al., "Increasing Memory Bandwidth with Wide Busses", ICS-97
61Widening and Replication Performance
D. López et al., "Widening versus replicating...", ICS98, MICRO98
62Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
63Torrent T0 Microprocessor
- The first single-chip vector microprocessor.
- Can sustain over 24 operations per cycle while having an issue rate of only one 32-bit instruction per cycle
- Features:
- 16 vector registers (32 32-bit elements each)
- 2 vector arithmetic units (8 pipes each)
- Reconfigurable composite operation pipelines
- 128-bit wide external memory interface
- MIPS-II, 32-bit instruction set, scalar unit.
K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995
64Torrent T0 Microprocessor
K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995
65Vector versus Superscalar Processors
- Comparison of Die Area
- Processor die area (in mm², scaled to 0.25 µm)
[Figure: bar chart of die areas; values recovered from the chart: 250.0, 69.81, 66.92, 67.77, 37.77, 21.86, 14.73]
C. G. Lee and D. J. DeVries Initial Results on
. MICRO-30, 1997.
66Vector versus Superscalar Processors
C. G. Lee and D. J. DeVries Initial Results on
. MICRO-30, 1997.
67Imagine project
- Focused on developing a programmable architecture that achieves performance similar to special-purpose hardware on graphics and image processing.
- Matches media application demands to current VLSI capabilities by using a stream-based programming model.
- Most multimedia kernels exhibit a streaming nature.
- Individual stream elements can be operated on in parallel, thus exploiting data parallelism.
Bill Dally, "Tomorrow's Computing Engines", Keynote, HPCA98
68Imagine architecture
- Organized around a large stream register file (64Kb)
- Memory operations move entire streams of data
- Data streams pass through a set of arithmetic clusters (8)
- Each cluster operates on a single element under VLIW control
Bill Dally, "Tomorrow's Computing Engines", Keynote, HPCA98
69Matrix extensions for Multimedia
- By combining conventional vector approach
together with SIMD MMX-like instructions, we can
exploit additional levels of DLP with matrix
oriented multimedia extensions.
[Figure: operands handled by one instruction in each approach. SS: single elements (A1, B1, C1). MMX: one 64-bit register of four 16-bit elements at bits 0-15, 16-31, 32-47, 48-63 (A1..A4, B1..B4, C1..C4). MOM: a matrix of such rows (A1..A16, C1..C16).]
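One way to picture the combination of vector and MMX-like SIMD execution is a "matrix register" made of several 64-bit rows, each row holding four 16-bit lanes, with a vector length selecting how many rows one instruction processes. The sketch below follows that picture only; the function names, the choice of a packed add, and the list-of-integers representation are assumptions for illustration, not the actual MOM instruction set.

```python
MASK16 = 0xFFFF

def packed_add16(a: int, b: int) -> int:
    """MMX-like step: add four 16-bit lanes of two 64-bit rows, with wraparound."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        s = ((a >> shift) & MASK16) + ((b >> shift) & MASK16)
        result |= (s & MASK16) << shift
    return result

def matrix_add16(ma, mb, vl):
    """Vector step: apply the packed add to the first vl rows of two matrices."""
    return [packed_add16(ra, rb) for ra, rb in zip(ma[:vl], mb[:vl])]
```

A single call then covers vl x 4 element additions, which is the extra level of DLP the slide refers to.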
70Relative Performance
INVERSE DCT TRANSFORM
MPEG-2 MOTION ESTIMATION
RGB-YCC Color CONVERSION
71Applications and Architectures
Numerical Applications
Integer
Very Slow
Subroutines
FPU
Very Big Improvement !!!
Additional Speed
FPU
72Future Applications
- Integer SPEC-like
- Commercial (OLTP,DSS)
Integer
Integer
Commercial
Numerical
Multimedia
73Acknowledgments
- Roger Espasa
- James E. Smith
- Luis A. Villa
- Francisca Quintana
- Jesús Corbal
- David López
- Josep Llosa
- Eduard Ayguade
- Krste Asanovic
- William Dally
- Christoforos E. Kozyrakis
- Corinna G. Lee
- David A. Patterson
- Steve Wallace
74The End