Title: The Vector-Thread Architecture
1. The Vector-Thread Architecture
- Ronny Krashinsky, Chris Batten, Krste Asanovic
- Computer Architecture Group
- MIT Laboratory for Computer Science
- ronny_at_mit.edu
- www.cag.lcs.mit.edu/scale
- Boston Area Architecture Workshop (BARC)
- January 30th, 2003
2. Introduction
- Architectures are all about exploiting the parallelism inherent to applications
  - Performance
  - Energy
- The Vector-Thread Architecture is a new approach which can flexibly take advantage of many forms of parallelism available in different applications
  - instruction, loop, data, thread
- The key goal of the vector-thread architecture is efficiency: high performance with low power consumption and small area
- A clean, compiler-friendly programming model is key to realizing these goals
3. Instruction Parallelism
- Independent instructions can execute concurrently
- Super-scalar architectures dynamically schedule instructions in hardware to enable out-of-order and parallel execution
- Software statically schedules parallel instructions on a VLIW machine
[Figure: super-scalar (hardware tracks instr. dependencies) compared with VLIW]
4. Loop Parallelism
- Operations from disjoint iterations of a loop can execute in parallel
- VLIW architectures use software pipelining to statically schedule instructions from different loop iterations to execute concurrently
[Figure: software pipeline on a VLIW machine; the load, add, and store operations of iterations 0-4 overlap in time]
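The overlap produced by software pipelining can be sketched with a small Python model. It assumes a three-stage loop body (load, add, store), one cycle per stage, and one new iteration started per cycle; the function name `software_pipeline` and these latencies are illustrative assumptions, not details from the slides.

```python
def software_pipeline(num_iters, stages=("load", "add", "store")):
    """Return, per cycle, the (iteration, stage) pairs issued together.

    Assumes one new iteration starts each cycle and each stage takes
    exactly one cycle (hypothetical latencies).
    """
    num_cycles = num_iters + len(stages) - 1
    schedule = []
    for cycle in range(num_cycles):
        slot = []
        for s, name in enumerate(stages):
            it = cycle - s  # iteration whose stage s falls in this cycle
            if 0 <= it < num_iters:
                slot.append((it, name))
        schedule.append(slot)
    return schedule


if __name__ == "__main__":
    for cycle, slot in enumerate(software_pipeline(5)):
        print(cycle, slot)
```

In the steady state of this model, each cycle issues a load, an add, and a store that belong to three different iterations, which is the overlap the slide's diagram depicts.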
5. Data Parallelism
- A single operation can be applied in parallel across a set of data
- In vector architectures, one instruction identifies a set of independent operations which can execute in parallel
- Control overhead can be amortized
[Figure: vector execution]
6. Thread Parallelism
- Separate threads of control can execute concurrently
- Multiprocessor architectures allow different threads to execute at the same time on different processors
- Multithreaded architectures execute multiple threads at the same time to better utilize a single set of processing resources
[Figure: SMT compared with multiprocessor]
7. Vector-Thread Architecture Overview
- Data parallelism: start with vector architecture
- Thread parallelism: give execution units local control
- Loop parallelism: allow fine-grain dataflow communication between execution units
- Instruction parallelism: add wide issue
8. Vector Architecture
Programming Model
[Figure: a control thread issues vector instructions to virtual processors VP0, VP1, ..., VP(N-1)]
- A control thread interacts with a set of virtual processors (VPs)
- VPs contain registers and execution units
- VPs execute instructions under slave control
- Each iteration in a vectorizable loop mapped to its own VP (w. stripmining)
Using VPs for Vectorizable Loops
for (i=0; i<N; i++) C[i] = A[i] + B[i];
[Figure: iterations i=0, i=1, ..., i=N-1 map to VP0, VP1, ..., VP(N-1); the control thread issues vector-execute commands for load A, load B, add, and store, and every VP performs each operation for its own iteration]
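A minimal Python sketch of stripmining the loop C[i] = A[i] + B[i], assuming a hypothetical machine with `num_vps` virtual processors. Each pass of the outer loop stands for the control thread issuing one round of load A, load B, add, and store vector commands, and each list element stands for one VP's register. The function name and the command counter are illustrative, not part of the architecture.

```python
def vector_add_stripmined(a, b, num_vps=4):
    """Compute c[i] = a[i] + b[i] in strips of at most num_vps elements."""
    n = len(a)
    c = [0] * n
    commands_issued = 0  # vector commands sent by the control thread
    for base in range(0, n, num_vps):
        vlen = min(num_vps, n - base)  # the last strip may be shorter
        # Each list below holds one register per active VP.
        reg_a = [a[base + vp] for vp in range(vlen)]            # load A
        reg_b = [b[base + vp] for vp in range(vlen)]            # load B
        reg_c = [reg_a[vp] + reg_b[vp] for vp in range(vlen)]   # add
        for vp in range(vlen):                                  # store
            c[base + vp] = reg_c[vp]
        commands_issued += 4
    return c, commands_issued
```

With six elements and four VPs, this model issues 8 vector commands (two strips of four commands each), where a scalar loop would step through 24 individual operations.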
9. Vector Microarchitecture
Microarchitecture
[Figure: 16 VPs striped across four lanes, fed from the control processor. Lane 0: VP0, VP4, VP8, VP12; Lane 1: VP1, VP5, VP9, VP13; Lane 2: VP2, VP6, VP10, VP14; Lane 3: VP3, VP7, VP11, VP15]
Execution on Vector Processor
- Lanes contain regfiles and execution units; VPs map to lanes and share physical resources
- Operations execute in parallel across lanes and sequentially for each VP mapped to a lane; control overhead amortized to save energy
[Figure: execution timeline on Lanes 0-3; each vector-execute command (load A, load B, add, store) runs in parallel across the four lanes and sequentially over the VPs mapped to each lane]
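The striping of VPs across lanes can be written down as a short Python sketch. It assumes the modulo mapping that the figure's layout suggests (VP0, VP4, VP8, VP12 on Lane 0, and so on); the helper names `lane_of` and `execution_order` are hypothetical.

```python
def lane_of(vp, num_lanes=4):
    """Lane that holds virtual processor `vp` when VPs are striped across lanes."""
    return vp % num_lanes


def execution_order(num_vps=16, num_lanes=4):
    """Group VPs by the time step in which one vector operation reaches them.

    Each step handles one VP per lane (parallel across lanes); successive
    steps walk through the VPs that share a lane (sequential within a lane).
    """
    steps = []
    for base in range(0, num_vps, num_lanes):
        steps.append(list(range(base, min(base + num_lanes, num_vps))))
    return steps
```

Under this mapping, one vector command for 16 VPs occupies each lane for four steps, so a single command fetch covers 16 element operations.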
10. Vector-Thread Architecture
Programming Model
[Figure: VP0, VP1, ..., VP(N-1) with slave control, micro-threaded control, and cross-VP communication]
- Vector of Virtual Processors (similar to traditional vector architecture)
- VPs are decoupled: local instruction queues break the rigid synchronization of vector architectures
- Under slave control, the control thread sends instructions to all VPs
- Under micro-threaded control, each VP fetches its own instructions
- Cross-VP communication allows each VP to send data to its successor
11. Using VPs for Do-Across Loops
for (i=0; i<N; i++) { x = x + A[i]; C[i] = x; }
[Figure: iterations i=0, i=1, ..., i=(N-1) run on VP0, VP1, ..., VP(N-1); a vector-execute command starts an AIB (load, recv, add, send, store) on every VP, and each send feeds the recv of the next VP]
- VPs execute atomic instruction blocks (AIB)
- Each iteration in a data dependent loop is mapped to its own VP
- Cross-VP send and recv operations communicate do-across results from one VP to the next VP (next iteration in time)
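The send and recv pattern above can be sketched in Python, assuming the loop body is x = x + A[i]; C[i] = x and that cross-VP links behave like FIFO queues. The `links` list, the `x0` seed for VP0, and the function name are hypothetical modelling choices, not details from the slides.

```python
from collections import deque


def do_across_prefix_sum(a, x0=0):
    """Run x = x + a[i]; c[i] = x with one VP per iteration.

    links[i] models the cross-VP queue feeding VP i; links[i + 1] carries
    its result to the next VP.
    """
    n = len(a)
    links = [deque() for _ in range(n + 1)]
    links[0].append(x0)  # assumed: the initial x is supplied to VP0
    c = [None] * n
    for vp in range(n):  # VPs visited in iteration order
        elem = a[vp]                 # load
        x = links[vp].popleft()      # recv from the previous VP
        x = x + elem                 # add
        links[vp + 1].append(x)      # send to the next VP
        c[vp] = x                    # store
    return c, links[n].popleft()     # final x leaves the last VP
```

Only the recv, add, send chain is serialized between VPs; the load and the store of each iteration do not depend on the neighbouring VPs.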
12. Vector-Thread Microarchitecture
Microarchitecture
[Figure: Lanes 0-3 holding VP0-VP15 (VP0, VP4, VP8, VP12 on Lane 0, and so on); each lane has an instruction cache with instruction fill, receives execute directives, and connects to the do-across network]
- VPs striped across lanes as in traditional vector machine
- Lanes have small instruction cache (e.g. 32 instrs), decoupled execution
- Execute directives point to atomic instruction blocks and indicate which VP(s) the AIB should be executed for; generated by control thread vector-execute command, or VP fetch instruction
- Do-across network includes dataflow handshake signals; receiver stalls until data is ready
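A toy Python model of a lane processing execute directives, as a hedged illustration of the description above. It assumes a directive is a pair (AIB identifier, list of VPs), that the lane cache holds whole AIBs, and that replacement is least-recently-used; the `Lane` class, the cache organisation, and the replacement policy are assumptions made for this sketch, since the slides give only the approximate cache size.

```python
from collections import OrderedDict


class Lane:
    """Toy lane with a small AIB cache; counts instruction fills."""

    def __init__(self, cache_blocks=2):
        self.cache = OrderedDict()  # aib_id -> instructions, in LRU order
        self.cache_blocks = cache_blocks
        self.fills = 0
        self.trace = []  # (vp, instruction) pairs in execution order

    def execute(self, directive, aib_memory):
        """directive = (aib_id, vps): run the AIB once for each listed VP."""
        aib_id, vps = directive
        if aib_id not in self.cache:
            if len(self.cache) >= self.cache_blocks:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[aib_id] = aib_memory[aib_id]  # instruction fill
            self.fills += 1
        self.cache.move_to_end(aib_id)
        for vp in vps:
            for instr in self.cache[aib_id]:
                self.trace.append((vp, instr))
```

For example, one directive naming VPs 0, 4, 8, and 12 triggers a single fill and then 20 instruction executions from the lane-local cache in this model.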
13. Do-Across Execution
[Figure: do-across execution timeline on Lanes 0-3 after a vector-execute command; each VP runs load, recv, add, send, store, and each recv waits for the send of the previous VP]
- Dataflow execution resolves do-across dependencies dynamically
- Independent instructions execute in parallel; performance adapts to software critical path
- Instruction fetch overhead amortized across loop iterations
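How dataflow execution tracks the critical path can be illustrated with a rough cycle-count model in Python. The assumptions are strong and purely illustrative: VP i runs on lane i % num_lanes, each lane issues one instruction per cycle in order, every instruction has single-cycle latency, and a recv stalls until the previous VP's send has delivered its value. None of these timings come from the slides.

```python
def do_across_cycles(num_iters, num_lanes=4):
    """Cycle count for load, recv, add, send, store per iteration.

    Toy model: in-order lanes, one instruction per cycle, and recv
    waits for the value sent by the previous VP.
    """
    lane_free = [0] * num_lanes  # next free issue cycle per lane
    avail = 0                    # cycle at which the incoming x is ready
    for vp in range(num_iters):
        lane = vp % num_lanes
        load = lane_free[lane]
        recv = max(load + 1, avail)  # dataflow handshake: wait for data
        add = recv + 1
        send = add + 1
        store = send + 1
        avail = send + 1             # next VP may receive from here on
        lane_free[lane] = store + 1
    return max(lane_free)
```

Under these assumptions, eight iterations take 26 cycles on four lanes and 40 cycles on one lane: with four lanes the loads and stores overlap with other iterations, and only the three-instruction recv, add, send chain is paid per iteration.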
14. Micro-Threading VPs
[Figure: VP0, VP1, ..., VP(N-1), each running its own micro-thread]
- VPs also have the ability to fetch their own instructions, enabling each VP to execute its own thread of control
- Control thread can send a vector fetch instruction to all VPs (i.e. vector fork); allows efficient thread startup
- Control thread can stall until micro-threads finish (stop fetching instructions)
- Enables data-dependent control flow within a loop iteration (alternative to predication)
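The fork-and-wait pattern above can be sketched in Python with generators standing in for micro-threads. This is an illustrative model only: `vector_fork`, `count_halvings`, and the round-robin interleaving of VPs are assumptions, with each `yield` modelling a VP fetching another instruction block and a return modelling the VP ceasing to fetch.

```python
def vector_fork(entry, vp_inputs):
    """Start one micro-thread per VP at `entry`, then wait for all to stop."""
    threads = {vp: entry(x) for vp, x in enumerate(vp_inputs)}
    results = {}
    while threads:  # control thread stalls until no VP is fetching
        for vp in list(threads):
            try:
                next(threads[vp])
            except StopIteration as done:
                results[vp] = done.value
                del threads[vp]
    return [results[vp] for vp in range(len(vp_inputs))]


def count_halvings(x):
    """Data-dependent loop: halve x until it is odd or zero."""
    steps = 0
    while x and x % 2 == 0:
        x //= 2
        steps += 1
        yield  # this VP fetches its next block
    return steps
```

Each VP runs a different number of loop trips depending on its own data, which is the case that a purely vector machine would have to handle with predication.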
15. Loop Parallelism and Architectures
Loops are ubiquitous and contain ample parallelism across iterations
- Super-scalar: must track dependencies between all instructions in a loop body (and correctly predict branches) before executing instructions in the subsequent iteration, and do this repeatedly for each loop iteration
- VLIW: software pipelining exposes parallelism, but requires static scheduling, which is difficult and inadequate with dynamic latencies and dependencies
- Vector: efficient, but limited to do-all loops, no do-across
- Vector-thread: software efficiently exposes parallelism, and dynamic dataflow automatically adapts to critical path. Uses simple in-order execution units, and amortizes instruction fetch overhead across loop iterations
16. Using the Vector-Thread Architecture
[Figure: Multi-paradigm Support. Labels: Virtual Processors, Control Thread, Vector, DO-ALL Loop, DO-ACROSS Loop, Loop, Threads, ILP, Micro-threading, Vector-Threading, Performance, Energy Efficiency]
- The Vector-Thread Architecture seeks to efficiently exploit the available parallelism in any given application
- Using the same set of resources, it can flexibly transition from pure data parallel operation, to parallel loop execution with do-across dependencies, to fine-grain multi-threading
17. SCALE-0 Overview
[Figure: SCALE-0 tile block diagram. Labels: ctrl proc; Vector Thread Unit; Clustered Virtual Processor with Cluster 0 (Mem) (Local Regfile, IALU), Cluster 1 (Local Regfile, IALU), Cluster 2 (Local Regfile, FP-ADD), Cluster 3 (Local Regfile, FP-MUL), Inter-Cluster Communication, Prev-VP and Next-VP links; 32KB L1 Configurable I/D Cache (4x128b); CMMU with Outstanding Trans. Table and Network Interface; datapath widths marked 32b, 128b, and 256b]