Title: Power-Efficient Microarchitectures
1. Power-Efficient Microarchitectures
- Krste Asanovic
- krste@mit.edu
- MIT Computer Science and Artificial Intelligence Laboratory
- http://cag.csail.mit.edu/scale
- IBM ACEED Conference, Austin, TX
- 1 March 2005
2. Academic Computer Architectures
- Build one flimsy (but expensive) prototype that is never really used
- Eventually, some ideas are adopted, enter mass production, millions sold
3. SpecInt 2000
[Chart: Horowitz, ISSCC 2004]
4. Power
[Chart: Horowitz, ISSCC 2004]
5. Where does the power go?
[Source: IBM, HPCA 2005]
- Parallel instruction fetch and decode
- Register renaming, issue window, reorder buffer
- Multiported register files and bypass networks
- Load and store queues
- Multiported primary caches and TLBs
- Energy-oblivious instruction sets (e.g., 360, x86, RISC) require most of this microarchitectural machinery to achieve high performance
6. Energy-Oblivious Instruction Sets
- Current RISC/VLIW ISAs only expose hardware features that affect the critical path through the computation
- Most energy is consumed in microarchitectural operations that are hidden from software!
7. Energy-Exposed Instruction Sets
- Rethinking the hardware-software interface for lower power
- Use compile-time knowledge to reduce run-time energy dissipation
  - Without reducing performance
  - Without using excessive energy to transmit compile-time knowledge to hardware at run time
8. IBM's Instruction Sets
- Pre-1964: IBM 701, 650, 702, 1401, ...
  - Prehistoric times
- 1964: IBM System/360
  - Invention of the instruction set architecture (ISA)
- 1978: IBM System/38, AS/400
  - Object-based capability systems
- 1990: IBM POWER
  - Superscalar RISC
- Maybe time to start working on the next energy-aware ISA?
9. Talk Outline
- Variable-Length Instruction Formats
- Vectors
- Exception Management
- The Vector-Thread Architecture
10. Problems with Fixed-Length Instructions
- Waste memory bandwidth/power at all levels of the instruction hierarchy
- Reduce effective cache capacity
- Introduce unnecessary serial dependencies to work around length limits:

    lui r1, 0x8765        # MIPS code to load the 32-bit
    ori r1, r1, 0x4321    # constant 0x87654321 into r1

- Advantages?
  - Easier pipelined or parallel fetch and decode
11-13. Heads and Tails Format
- Each instruction is split into two portions: a fixed-length head and a variable-length tail
- Multiple instructions are packed into a fixed-length bundle
- A cache line can hold multiple bundles
14-25. Heads and Tails Format
[Diagram, built up across these slides: three fixed-length bundles. Heads (H0, H1, ...) are packed left-to-right from the start of each bundle; tails (..., T1, T0) are packed right-to-left from the end, with unused space in between. Some heads (e.g., H2 in the second bundle) have no tail. Each bundle begins with a "last instr" field (4, 6, 5 in the example) giving the index of its last instruction.]
- Not all heads need tails
- Tails are placed at a fixed granularity
- The granularity of tails is independent of the size of heads
- The PC divides into {bundle, instruction} fields (sketched in C below):
  - Sequential execution: instruction index is incremented
  - End of bundle: bundle is incremented, instruction index reset to 0
  - Branch: the target's instruction index is checked (against the last-instr field)
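To make the {bundle, instruction} PC sequencing concrete, here is a minimal C sketch of the update rules above. The struct layout, field widths, and the last_instr helper are illustrative assumptions, not the actual HAT encoding.

    #include <stdint.h>

    /* Hypothetical HAT program counter: bundle number plus the index
       of a head within that bundle. */
    typedef struct {
        uint32_t bundle;
        uint32_t instr;
    } hat_pc;

    /* Stub for reading a bundle's "last instr" header field. */
    static uint32_t last_instr(uint32_t bundle) { (void)bundle; return 4; }

    /* Sequential execution: bump the instruction index; at the end of
       a bundle, advance to the next bundle and reset the index. */
    static hat_pc hat_next(hat_pc pc) {
        if (pc.instr == last_instr(pc.bundle)) {
            pc.bundle += 1;
            pc.instr = 0;
        } else {
            pc.instr += 1;
        }
        return pc;
    }

    /* Branch: the target's instruction index must exist in the
       target bundle. */
    static int hat_branch_ok(hat_pc target) {
        return target.instr <= last_instr(target.bundle);
    }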
26-29. Conventional VL Length-Decoding
[Diagram: Instr 1, Instr 2, Instr 3 packed end-to-end, each followed by its length decoder.]
- The 2nd length decoder needs to know Length 1 first
- The 3rd length decoder needs to know Length 1 + Length 2
- Need to know all 3 lengths to fetch and align more instructions
30-33. HAT Length-Decoding
[Diagram: Head1, Head2, Head3 at fixed positions at the front of the bundle; Tail3, Tail2, Tail1 packed from the end.]
- Length decoding is done in parallel (see the sketch below)
- Only the tail-length adders depend on previous length information (carry-save adders, delay O(log W))
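A short C sketch of why this parallelizes, assuming 16-bit heads and a 4-wide decode group (both invented parameters): every head offset is a constant, and only the tail offsets need a running sum of tail lengths, which hardware can compute with a log-depth adder tree.

    #define W 4            /* instructions decoded per cycle (assumed) */
    #define HEAD_BITS 16   /* fixed head size (assumed) */

    /* head_off[i] depends only on i, so all W head decoders work in
       parallel; tail_off[i] needs the prefix sum of tail lengths, the
       only serial dependence (O(log W) adder levels in hardware). */
    void hat_locate(const unsigned tail_len[W], unsigned bundle_bits,
                    unsigned head_off[W], unsigned tail_off[W]) {
        unsigned sum = 0;
        for (unsigned i = 0; i < W; i++) {
            head_off[i] = i * HEAD_BITS;     /* fixed position */
            sum += tail_len[i];              /* prefix sum */
            tail_off[i] = bundle_bits - sum; /* tails packed from the end */
        }
    }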
34. Heads and Tails Summary
- Density of variable-length instructions while retaining pipelined or superscalar instruction fetch
- For a recoded MIPS ISA, saves 25% of static and dynamic instruction bits using 256-bit bundles
- Can design an ISA to exploit HAT (e.g., avoid spurious serializations)
35. Vectors
36. Parallelism is Good
[Chart: Horowitz, ISSCC 2004]
37. Forms of Parallelism and Energy per Op
[Chart: energy/operation vs. performance, starting from a scalar pipelined machine.]
38. Vectors
- Omission of vectors is the single biggest mistake in commercial computer architectures
  - Simple
  - High performance
  - Low power
  - Works great with caches
  - Mature compiler technology
  - Easily understood performance-programming model
  - Good for everything, not just scientific computing
- Possibly the only valid reasons for omission:
  - A little harder to make work with virtual memory and rapid context swaps (see restart markers)
  - Large vector register files (see vector-thread architecture)
39. Automatic Code Vectorization

    for (i = 0; i < N; i++) C[i] = A[i] + B[i];

- Vectorization is a massive compile-time reordering of operation sequencing, which avoids many run-time overheads (sketch below)
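A C sketch of what the compiler's reordering amounts to: the loop is stripmined into chunks so that each vector instruction covers a whole chunk of elements. VLMAX is a hypothetical maximum hardware vector length.

    #define VLMAX 64   /* assumed maximum hardware vector length */

    void vadd(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i += VLMAX) {
            int vl = (n - i < VLMAX) ? n - i : VLMAX; /* set vector length */
            /* Conceptually two vector-loads, one vector-add, and one
               vector-store: instruction fetch/decode and loop
               bookkeeping are amortized over vl element operations. */
            for (int j = 0; j < vl; j++)
                C[i + j] = A[i + j] + B[i + j];
        }
    }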
40. Vector Energy Advantages
- Instruction fetch amortized over a vector of operations
- Loop bookkeeping factored out into a separate control processor
- Efficient vector memory operations move multiple memory operands with one cache tag/TLB lookup
- All arithmetic operations access only the local lane; no cross-lane wiring
- The length of the vector register effectively provides register renaming and loop unrolling without additional hardware
41. Vector Instruction Parallelism
- Can overlap execution of multiple vector instructions
- Example machine has 32 elements per vector register and 8 lanes
[Diagram: load, multiply, and add units each completing 8 elements per cycle, overlapped in time as instructions issue.]
- Completes 24 operations/cycle (3 units x 8 lanes) while issuing only 1 short instruction/cycle; each 32-element instruction occupies its unit for 32/8 = 4 cycles
42. Why SIMD extensions fall short of Vectors
- Only executes one cycle's worth of operands per instruction fetch
  - Requires superscalar dispatch to keep multiple functional units busy
- Scalar unit cannot run ahead to find the next vector loop
  - Tied up issuing SIMD instructions for the current loop
- No long vector memory operations
  - Memory system can't get ahead in fetching data without speculation
- Doesn't scale to wider datapaths without software rewrite
- Doesn't scale to large register files without bigger instructions
- Awkward interface for compilers
  - Extensive microarchitecture-specific loop unrolling and software pipelining required to keep pipelines busy
  - Load/store alignment constraints
  - No vector length register
  - No scatter/gather
- Causes larger loop startup delays than vectors
43. Vectors vs. Superscalar on General-Purpose Applications
[Chart: single-scalar runtime, vectorizing SPECint95.]
- Accelerating 28% of the code by a factor of 8 gives the same speedup as accelerating all code by 1.3x (Amdahl's Law: 1 / (0.72 + 0.28/8) is about 1.32)
44. Vectorizable Workloads
- Vectors are known to work well for scientific and media applications, but can also help many other codes, e.g.:
  - Databases
    - Hash-joins vectorizable
    - String operations
  - Operating systems
    - bzero/bcopy
- Many other important commercial algorithms can be vectorized
- All vendors will soon be telling customers to multithread their code to get better performance
  - Vectorization can be simpler, and give much better power-performance than multithreading
45. Exceptions
46. Exception Management Overhead
- A large part of the power cost in modern microarchitectures comes from the need to provide precise exceptions:
  - Reorder buffer to track original program order
  - Register renaming or bypass networks to allow undo of speculative register writes
  - Store queues to allow undo of speculative memory writes
  - (Even in-order architectures speculate on exceptions)
- But there is also a large opportunity cost, because some things are too difficult to make precise:
  - Deeply exposed machine state
  - Overlapped execution of multiple highly parallel instructions
  - Special-purpose execution units with embedded state
47. What's Important in Exceptions?
- For an operating system with multiprogramming and virtual memory:
  - Must allow fast (and simple) process state save, to allow process restart later
- These "swappable" exceptions are much easier to provide than precise exceptions, especially in highly parallel machines with large quantities of architectural state
48. Software Restart Markers
- Software explicitly marks restart points, e.g., by setting a barrier bit on each instruction
- Hardware saves the next PC into a machine register as each barrier instruction completes
  - Branches store the target PC
  - Must also wait for any earlier potentially exception-causing instructions to clear exception checks (trap barrier)
- After any trap, the OS resumes execution at the saved PC
49. Idempotent Regions
- Hardware does not buffer state updates and cannot undo state changes if a trap occurs in the middle of a region
- Can only restart cleanly if regions are idempotent, i.e., they can be re-executed from the beginning multiple times with the same effect

    add r3, r1, r2
    st.bar r3, 0(r5)    # restart point
    ld r2, 4(r7)
    ld r3, 8(r7)
    add.bar r4, r2, r3  # restart point
    st r4, 4(r7)
    st.bar r7, 8(r7)    # restart point
50. Rules for Idempotent Regions
- A sufficient rule is that the external read set is disjoint from the internal write set (illustrated below)
  - OK to overwrite a value if it was produced within the region
- The rule is not necessary, because of idempotent update operations, e.g.:
  - X <- X AND Y
  - Y <- Y OR Z
- Require that any prefix of a region is also idempotent
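A minimal C illustration of the sufficient rule (the function and the no-aliasing assumption are mine, not from the slides): the region reads only in[] (external read set) and writes only out[] and values it produced itself (internal write set), so running it again from the top has the same effect.

    /* Idempotent restart region, assuming in and out do not alias:
       the external read set (in[0..n-1]) is disjoint from the
       internal write set (out[0..n-1], t). */
    void region(double *out, const double *in, int n) {
        for (int i = 0; i < n; i++) {
            double t = in[i] * 2.0;  /* produced within the region */
            out[i] = t;              /* overwritten, never read, here */
        }
    }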
51. Some Idempotent Functions

    matmul(int m, int k, int n,
           const double *a, const double *b, double *c)
    int sprintf(char *s, const char *format, ...)
    int sscanf(const char *s, const char *format, ...)
    char *strcpy(char *, const char *)          /* also strcmp, strlen, ... */
    void *memcpy(void *, const void *, size_t)  /* also memset, ... */
    double sin(double)                          /* also sqrt, exp, etc. */
    double atof(const char *)                   /* also atoi, atol, strtod, ... */

- Can be protected with a single restart marker on the calling instruction, saving only the entry PC
  - Assuming arguments are untouched in stack memory
- For a vector machine, almost no (<1%) overhead to add restart markers to common loops
52. Temporary State
- Temporary state is only visible inside a restart region
  - Thrown away on any exception
  - Will be rebuilt when the restart region is restarted
- For SCALE, all vector-thread unit state is temporary
  - OS is unaware of the vector-thread unit
- Provides the advantages of exposing more machine state, without the headaches
53. Vector-Thread Architecture
54. Vector and Multithreaded Architectures
[Diagram: a control processor issuing vector control to processing elements PE0..PEN over a shared memory, versus independently threaded PE0..PEN with per-thread control.]
- Vector processors provide efficient DLP execution
  - Amortize instruction control
  - Amortize loop bookkeeping overhead
  - Exploit structured memory accesses
  - Unable to execute loops with loop-carried dependencies or complex internal control flow
- Multithreaded processors can flexibly exploit TLP
  - Unable to amortize common control overhead across threads
  - Unable to exploit structured memory accesses across threads
  - Costly memory-based synchronization and communication between threads
55. Vector-Thread Architecture
- VT unifies the vector and multithreaded compute models
- A control processor interacts with a vector of virtual processors (VPs)
  - Vector-fetch: the control processor fetches instructions for all VPs in parallel
  - Thread-fetch: a VP fetches its own instructions
- VT allows a seamless intermixing of vector and thread control
[Diagram: control processor issuing vector-fetches to VP0..VPN over memory; individual VPs issuing their own thread-fetches.]
56. Virtual Processor Abstraction
- VPs contain a set of registers
- VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
- VPs have no automatic program counter; AIBs must be explicitly fetched
  - VPs contain pending vector-fetch and thread-fetch addresses
- A fetch instruction allows a VP to fetch its own AIB
  - May be predicated for conditional branch
- If an AIB does not execute a fetch, the VP thread stops
[Diagram: a VP thread executing a chain of AIBs, each a block of instructions, linked by fetch instructions and thread-fetches.]
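A minimal C model of that fetch rule (the types and function-pointer encoding are illustrative, not SCALE's): executing an AIB either fetches a successor AIB or returns NULL, and the VP thread stops on NULL.

    typedef struct aib aib;
    struct aib {
        /* Execute the block; return the fetched AIB, or NULL if the
           block executed no fetch instruction. */
        aib *(*run)(const aib *self);
    };

    /* VP thread execution: there is no automatic PC; control advances
       only by explicit fetches (the initial AIB comes from a
       vector-fetch or thread-fetch). */
    void vp_thread(const aib *start) {
        for (const aib *cur = start; cur != NULL; cur = cur->run(cur))
            ;
    }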
57. Virtual Processor Vector
- A VT architecture includes a control processor and a virtual processor vector
  - Two interacting instruction sets
- A vector-fetch command allows the control processor to fetch an AIB for all the VPs in parallel
- Vector-load and vector-store commands transfer blocks of data between memory and the VP registers
[Diagram: control processor issuing vector-fetches to VP0..VPN; a vector memory unit handling vector-loads and vector-stores between memory and the VPs.]
58. Cross-VP Data Transfers
- Cross-VP connections provide fine-grain data operand communication and synchronization
  - VP instructions may target nextVP as a destination or use prevVP as a source
  - The crossVP queue holds wrap-around data; the control processor can push and pop
- The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs
[Diagram: VP0..VPN connected in a ring via nextVP/prevVP links, with a crossVP queue the control processor pushes to and pops from.]
59. Mapping Loops to VT
- A broad class of loops map naturally to VT:
  - Vectorizable loops
  - Loops with loop-carried dependencies
  - Loops with internal control flow
- Each VP executes one loop iteration (see the sketch below)
  - The control processor manages the execution
  - Stripmining enables implementation-dependent vector lengths
- Programmer or compiler only schedules one loop iteration on one VP
  - No cross-iteration scheduling
60. Vectorizable Loops
- Data-parallel loops with no internal control flow are mapped using vector commands
  - Predication for small conditionals
[Diagram: a loop-iteration DAG (two loads, a shift, a multiply, a store) replicated across VP0..VPN; the control processor issues two vector-loads, a vector-fetch for the compute, and a vector-store, each applying to all VPs.]
61. Loop-Carried Dependencies
- Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers (modeled in C below)
  - The vector-fetch introduces a chain of prevVP receives and nextVP sends
  - Vector-memory commands can still be used even when the compute is non-vectorizable
[Diagram: the same loop-iteration DAG, but with the compute chained from VP to VP through prevVP/nextVP transfers; the control processor issues vector-loads, a vector-fetch, and a vector-store.]
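A sequential C model of the cross-VP chain for a simple running-sum recurrence (the recurrence is an invented example): each VP receives the carried value from prevVP, updates it, and sends it to nextVP, while the control processor pushes the initial value and pops the wrap-around result.

    #include <stdio.h>

    int main(void) {
        int x[8] = {3, 1, 4, 1, 5, 9, 2, 6}, y[8];
        int chain = 0;                 /* control processor pushes initial value */
        for (int v = 0; v < 8; v++) {  /* VP v: receive from prevVP ... */
            y[v] = chain + x[v];       /* prevVP value used as a source */
            chain = y[v];              /* ... and send result to nextVP */
        }
        printf("final: %d\n", chain);  /* control processor pops the result */
        return 0;
    }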
62. Loops with Internal Control Flow
- Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches
  - Vector commands and thread-fetches are freely intermixed
  - Once launched, the VP threads execute to completion before the next control processor command
[Diagram: the control processor issues a vector-load and a vector-fetch; each VP then executes its own data-dependent sequence of loads and branches (some VPs iterate more than others) before the final vector-store.]
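A toy C model of why thread-fetches are needed here (the data and loop body are invented for illustration): each VP's inner loop runs a data-dependent number of times, which a single fixed-shape vector command cannot express.

    #include <stdio.h>

    int main(void) {
        int trips[5] = {1, 3, 2, 0, 4};  /* per-VP inner-loop counts (example) */
        for (int vp = 0; vp < 5; vp++) { /* VPs conceptually run in parallel */
            int i = 0;
            while (i < trips[vp])        /* predicated fetch: loop back or stop */
                i++;                     /* stand-in for the inner-loop AIB */
            printf("VP%d: %d inner iterations\n", vp, i);
        }
        return 0;
    }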
63. VT Physical Model
- A vector-thread unit contains an array of lanes with physical register files and execution units
- VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
- Independent parallel lanes exploit parallelism across VPs and data-operand locality within VPs
64. VP Execution Interleaving
- Hardware provides the benefits of loop unrolling by interleaving VPs
- Time-multiplexing can hide thread-fetch, memory, and functional unit latencies
[Diagram: four lanes with VPs time-multiplexed across them: Lane 0 runs VP0, VP4, VP8, VP12; Lane 1 runs VP1, VP5, VP9, VP13; Lane 2 runs VP2, VP6, VP10, VP14; Lane 3 runs VP3, VP7, VP11, VP15.]
65. VP Execution Interleaving
- Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
  - No static cross-iteration scheduling
  - Tolerant to variable dynamic latencies
[Diagram: the same four-lane time-multiplexing, with a vector-fetch shown propagating across the lanes over time.]
66. SCALE Registers and VP Configuration
- Atomic instruction blocks allow VPs to share temporary state that is only valid within the AIB
  - VP general registers are divided into private and shared
  - Chain registers (cr0, cr1) at the ALU inputs avoid reading and writing the general register file, to save energy
- The number of VP registers in each cluster is configurable
  - The hardware can support more VPs when each has fewer private registers (see the arithmetic check below):
    - 4 VPs with 0 shared regs + 8 private regs
    - 7 VPs with 4 shared regs + 4 private regs
    - 25 VPs with 7 shared regs + 1 private reg
  - Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
[Diagram: the cluster c0 register file partitioned between shared registers and per-VP private registers for each of the three configurations.]
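The three configurations follow from simple arithmetic over the 32 registers in each cluster (the register count appears in the prototype parameters on the next slide); a check in C:

    #include <stdio.h>

    /* VPs supported = (registers per cluster - shared) / private per VP */
    static int max_vps(int regs, int shared, int priv) {
        return (regs - shared) / priv;
    }

    int main(void) {
        printf("%d VPs\n", max_vps(32, 0, 8)); /* 4 VPs  */
        printf("%d VPs\n", max_vps(32, 4, 4)); /* 7 VPs  */
        printf("%d VPs\n", max_vps(32, 7, 1)); /* 25 VPs */
        return 0;
    }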
67. SCALE Prototype and Simulator
- Prototype SCALE processor in development
  - Control processor: MIPS, 1 instr/cycle
  - VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
  - Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
  - DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s = 3.2 GB/s)
- Estimated 10 mm2 in TSMC 180 nm, 400 MHz (25 FO4)
68. Summary
- Energy/operation for a given performance is the key parameter
- Increasing parallelism and locality are the standard tricks for improving performance
  - But standard microarchitectural techniques to achieve better parallelism and locality increase energy/operation
- An energy-exposed instruction set allows software to increase parallelism, increase locality, and reduce microarchitectural waste, for lower energy/op
69. SCALE Group
http://cag.csail.mit.edu/scale
- Seongmoo Heo
- Ronny Krashinsky
- Jae Lee
- Rose Liu
- Albert Ma
- Heidi Pan
- Brian Pharris
- Jessica Tseng
- Michael Zhang
- Krste Asanovic
- Gautham Arumilli
- Ken Barr
- Elizabeth Basha
- Chris Batten
- Vimal Bhalodia
- Jared Casper
- Steve Gerding
- Mark Hampton
Funding provided by DARPA, NSF, CMI, IBM,
Infineon, Intel, SGI, Xilinx, MIT Project Oxygen
71. Backup
72. Lane Execution
- Lanes execute decoupled from each other
- A command management unit handles vector-fetch and thread-fetch commands
- The execution cluster executes instructions in-order from a small AIB cache (e.g., 32 instructions)
  - AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
- Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for
  - For a thread-fetch command, the lane executes the AIB for the requesting VP
  - For a vector-fetch command, the lane executes the AIB for every VP
- AIBs and vector-fetch commands reduce control overhead
  - 10s to 100s of instructions executed per fetch address tag-check, even for non-vectorizable loops
[Diagram: Lane 0 internals: a command management unit receiving vector-fetch and thread-fetch commands, AIB tags and per-VP execute directives, an AIB cache feeding the ALU, and an AIB fill unit handling misses.]
73. SCALE Vector-Thread Processor
- SCALE is designed to be a complexity-effective all-purpose embedded processor
  - Exploit all available forms of parallelism and locality to achieve high performance and low energy
- Constrained to small area (estimated 10 mm2 in 0.18 µm)
  - Reduce wire delay and complexity
  - Support tiling of multiple SCALE processors for increased throughput
- Careful balance between software and hardware for code mapping and scheduling
  - Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model
74. SCALE Clusters
- VPs are partitioned into four clusters to exploit ILP and allow lane implementations to optimize area, energy, and circuit delay
- Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer mult/div
- Clusters execute decoupled from each other
[Diagram: four lanes, each with clusters c0-c3, fed by the control processor and AIB fill unit and sharing an L1 cache; a SCALE VP spans the four clusters.]
75. SCALE Micro-Ops
- The assembler translates the portable software ISA into hardware micro-ops
- Per-cluster micro-op bundles access local registers only
- Inter-cluster data transfers are broken into transports and writebacks
[Diagram: software VP code translated into per-cluster micro-op bundles (cluster 3 not shown).]
76. SCALE Cluster Decoupling
- Cluster execution is decoupled
  - Cluster AIB caches hold micro-op bundles
  - Each cluster has its own execute-directive queue and local control
  - Inter-cluster data transfers synchronize with handshake signals
- Memory access decoupling (see paper)
  - The load-data queue enables continued execution after a cache miss
  - The decoupled-store queue enables loads to slip ahead of stores
[Diagram: clusters 0-3, each with an AIB cache, registers, and ALU, connected by writeback and transport paths.]
77. Why it might be time for a new ISA
- Power-performance crisis
- Single-thread performance plateau
  - For real this time
- Memory wall
- Reliability scaling
- Hope for everyday large-scale multithreading
- Software quality crisis
78. SpecInt/MHz
79. Clock Frequency Scaling
80. Clock Cycle in FO4
[Chart: clock cycle in FO4, Alpha processors]
81. Forms of Parallelism and Energy per Op
[Chart: energy/operation vs. performance for a scalar pipelined machine (repeat of slide 37).]
82-85. Idempotency is Non-Monotonic with Region Size
[Built up across four slides: nested restart regions A, B, C over the same code, where B extends A by one instruction and C extends B.]

    C:  st  r1, (r4)
    B:  ld  r1, (r2)
    A:  st  r1, (r3)    # r3 != r2
        add r1, 1

- Region A (store + add) is not idempotent: the store reads r1, which the add then overwrites, so re-execution stores a different value
- Region B is idempotent: the load overwrites r1 before any use, and r3 != r2 keeps the external read set (mem[r2]) disjoint from the internal write set (r1, mem[r3])
- Region C is not idempotent again: its first store reads r1 before the load overwrites it
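A C check of the reading above (memory modeled as a tiny array, with r2=0, r3=1, r4=2 so that r3 != r2): each region is run once, then rerun twice from its start, and the final states are compared. Expected output: A and C are not idempotent, B is.

    #include <stdio.h>
    #include <string.h>

    static int mem[3], r1;

    static void region(int level) {   /* level: 0 = A, 1 = B, 2 = C */
        if (level >= 2) mem[2] = r1;  /* C: st r1, (r4) */
        if (level >= 1) r1 = mem[0];  /* B: ld r1, (r2) */
        mem[1] = r1;                  /* A: st r1, (r3) */
        r1 = r1 + 1;                  /* A: add r1, 1   */
    }

    static int idempotent(int level) {
        int m1[3], v1;
        mem[0] = 7; mem[1] = 0; mem[2] = 0; r1 = 5;
        region(level);                        /* run once */
        memcpy(m1, mem, sizeof mem); v1 = r1;
        mem[0] = 7; mem[1] = 0; mem[2] = 0; r1 = 5;
        region(level); region(level);         /* run, then restart from the top */
        return v1 == r1 && memcmp(m1, mem, sizeof mem) == 0;
    }

    int main(void) {
        printf("A: %d  B: %d  C: %d\n",
               idempotent(0), idempotent(1), idempotent(2)); /* 0 1 0 */
        return 0;
    }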