Title: Memory Consistency in Vector IRAM
1. Memory Consistency in Vector IRAM
2. The Memory Consistency Model
- The consistency model applies to instructions in a single instruction stream (this differs from multi-processor consistency!).
- Notation: a = after, V = vector, S = scalar, VP = virtual processor, R = read, W = write.
[Table legend: no sync required / sync required]
- Definition of an XaY sync:
  - All operations of type Y occurring before the sync in program order appear to execute before any operation of type X occurring after the sync in program order.
- Definition of an XaY sync to vector register vri:
  - The most recent operation of type Y to vri appears to execute before any operation of type X occurring after the sync in program order.
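The XaY definition above can be sketched as a small executable model. This is only an illustration under stated assumptions: the `Op` and `Sync` classes, the function `ordered_pairs`, and the reduction of operation types to just "S" (scalar) and "V" (vector) are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str  # assumed type label: "S" (scalar) or "V" (vector) memory operation
    name: str


@dataclass(frozen=True)
class Sync:
    x: str  # type of the operations that come after the sync
    y: str  # type of the operations that come before the sync


def ordered_pairs(program):
    """Return (earlier, later) name pairs that some XaY sync orders."""
    pairs = set()
    for i, item in enumerate(program):
        if not isinstance(item, Sync):
            continue
        # Every op of type Y before the sync ...
        before = [op for op in program[:i]
                  if isinstance(op, Op) and op.kind == item.y]
        # ... appears to execute before every op of type X after it.
        after = [op for op in program[i + 1:]
                 if isinstance(op, Op) and op.kind == item.x]
        pairs.update((b.name, a.name) for b in before for a in after)
    return pairs


program = [
    Op("S", "s_store"),
    Sync(x="V", y="S"),  # VaS: vector after scalar
    Op("V", "v_load"),
    Op("S", "s_load"),
]
print(ordered_pairs(program))  # {('s_store', 'v_load')}
```

In this model the VaS sync orders the scalar store before the vector load, but says nothing about the later scalar load, because that operation is not of type X.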
3. Why Relax Memory Consistency?
- Natural micro-architecture has multiple paths to memory.
- Want to decouple scalar and vector units without complex hardware.
[Figure: block diagram showing Fetch, Scalar Core, Vector Unit, Sync, and Memory]
- Trade-off between more complex hardware (speculation, disambiguation, cache coherence) and more complex software (sync instructions).
- Should explore solutions to this trade-off that involve more hardware, e.g. hardware guarantees SaV and VaS ordering but leaves VaV and VP orderings to software.
4. Software Conventions for Syncs
Conventions:
1. Execute VaS and VaV syncs on entry to vector code.
2. Execute SaV sync on exit from vector code.
[Figure: scalar code calling a vector function; VaS and VaV syncs on entry to the vector code, SaV sync on exit back to scalar code]
- Vector code is responsible for not messing things up.
  - Allows us to vectorize libraries to speed up existing programs.
  - Don't want to assume that our compiler will compile and globally optimize all non-vector code that we run.
- Alternative model: pass around flags to communicate sync requirements or history.
  - Must assume that our compiler compiles all code run on IRAM.
  - Not sure we want to accept that restriction.
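The entry and exit conventions can be illustrated with a wrapper that brackets every vector routine with the required syncs. This is a minimal sketch, not VIRAM code: the `trace` list, the `sync` helper, the `vector_function` decorator, and the `vadd` routine are all hypothetical names that only record which syncs would be issued.

```python
import functools

trace = []  # hypothetical instruction trace, for illustration only


def sync(name):
    """Record a sync instruction in the trace (stand-in for a real sync)."""
    trace.append(f"sync {name}")


def vector_function(fn):
    """Apply the slide's conventions: VaS and VaV on entry, SaV on exit."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        sync("VaS")
        sync("VaV")
        try:
            return fn(*args, **kwargs)
        finally:
            sync("SaV")  # issued on every exit path from the vector code
    return wrapper


@vector_function
def vadd(a, b):
    trace.append("vector add")
    return [x + y for x, y in zip(a, b)]


result = vadd([1, 2, 3], [10, 20, 30])
print(result)  # [11, 22, 33]
print(trace)   # ['sync VaS', 'sync VaV', 'vector add', 'sync SaV']
```

Because the wrapper knows nothing about its caller, it issues all three syncs on every call, which is exactly the cost the conventions accept in exchange for not constraining the scalar code.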
5. Sync Implementations and Costs
- SaV: Stall the fetch unit until the vector unit has committed all vector memory instructions.
  - Could take 1000s of cycles with many indexed vector memory operations in flight!
  - Very difficult to delay issue since it is often issued at the end of a vector routine.
- VaS: Stall the fetch unit until the scalar unit has committed all scalar memory instructions.
  - Not too expensive (10s of cycles?) because the scalar unit is ahead of the vector unit, the scalar core is simple, and the data cache is write-through.
  - Easy to delay issue because it is often issued at the start of a vector routine.
- VaV and VPaVP: No operation.
  - Nop because we have 1 vector memory unit and no vector caches.
6. Current Sync Analysis Tool
- Executes a program and tells you:
  1. Whenever two memory references are not
     - ordered by architectural guarantees,
     - ordered by register dependencies, or
     - ordered by an intervening sync instruction.
  2. Whenever a sync instruction is not used to resolve any hazard, as described in (1).
- Caveats:
  - Hazards are detected from a single program execution. Information may not hold true for all possible executions of the program.
  - Hazard detection is conservative in the presence of synchronization chains.
Two Examples of Synchronization Chains
Example 1:
  Write(A) <- r1
  RAW SYNC
  Read(A) <- r2
  WAR SYNC
  Write(A) <- r3
  Hazard?
Example 2:
  Write(A) <- r1
  RAW SYNC
  Read(A) <- r2
  Write(A) <- r2
  Hazard?
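The two reports described on this slide can be sketched as a check over a recorded trace. This is not the actual tool: the `Mem` and `Sync` classes, the `analyze` function, and in particular the rule that only scalar-after-scalar ordering is architecturally guaranteed are assumptions made for illustration. The sketch considers only direct ordering between a pair of references, so it is conservative for synchronization chains in the way the caveat describes.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mem:
    unit: str           # "S" (scalar) or "V" (vector)
    kind: str           # "R" (read) or "W" (write)
    addr: str
    reads: tuple = ()   # registers the instruction reads
    writes: tuple = ()  # registers the instruction writes


@dataclass(frozen=True)
class Sync:
    x: str  # type ordered after the sync
    y: str  # type ordered before the sync


def conflicts(a, b):
    """Same address and at least one write (RAW, WAR, or WAW)."""
    return a.addr == b.addr and "W" in (a.kind, b.kind)


def arch_ordered(a, b):
    # Assumption: only scalar-after-scalar order is an architectural guarantee.
    return a.unit == "S" and b.unit == "S"


def reg_ordered(a, b):
    # Direct register dependence: b reads a register that a wrote.
    return bool(set(a.writes) & set(b.reads))


def analyze(trace):
    """Return (unordered conflicting pairs, indices of unused syncs)."""
    hazards, used = [], set()
    mems = [(i, e) for i, e in enumerate(trace) if isinstance(e, Mem)]
    for (i, a), (j, b) in combinations(mems, 2):
        if not conflicts(a, b) or arch_ordered(a, b) or reg_ordered(a, b):
            continue
        covering = [k for k in range(i + 1, j)
                    if isinstance(trace[k], Sync)
                    and trace[k].y == a.unit and trace[k].x == b.unit]
        if covering:
            used.update(covering)
        else:
            hazards.append((i, j))
    unused = [k for k, e in enumerate(trace)
              if isinstance(e, Sync) and k not in used]
    return hazards, unused


trace = [
    Mem("V", "W", "A", reads=("vr1",)),  # 0: vector write to A
    Sync(x="S", y="V"),                  # 1: SaV
    Mem("S", "R", "A", writes=("r2",)),  # 2: scalar read of A
    Sync(x="V", y="S"),                  # 3: VaS
    Mem("V", "W", "A", reads=("vr3",)),  # 4: vector write to A
    Sync(x="V", y="V"),                  # 5: VaV with nothing after it
]
print(analyze(trace))  # ([(0, 4)], [5])
```

The pair (0, 4) is reported even though the chain SaV then VaS orders the two writes transitively, which mirrors the conservatism noted in the caveats. The trailing VaV sync is reported as unused because it resolves no hazard in this execution.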
7. Optimizing Code
- Basic problem:
  - Vector unit requires setup: VL, VPW, mask, exceptions.
  - Vector code is responsible for issuing syncs.
  - Both of these are required in a vector routine if nothing is known about the calling context!
- All solutions share the notion of giving control of the calling context to the compiler. Two options:
  - (1) Pass around flags so that syncs and setup code can be avoided at run-time.
  - (2) Do global optimizations so that syncs and setup code can be eliminated at compile-time.
. . .
Scalar code
Vector setup
VaS and VaV sync
Vector function
SaV sync
Scalar code
Vector setup
VaS and VaV sync
Vector function
SaV sync
Scalar code
. . .
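Option (1) might look roughly like the following, where flags record the sync history so that a sync is issued only when it is actually needed. This is a sketch under stated assumptions: the `SyncFlags` class and its methods are hypothetical, VaV syncs are left out for brevity, and the scheme assumes that every scalar memory operation also checks the flags, which corresponds to requiring that the compiler handles all code.

```python
class SyncFlags:
    """Hypothetical run-time flags recording sync history."""

    def __init__(self):
        # Unknown calling context: assume both kinds of memory
        # operations may still be outstanding.
        self.scalar_dirty = True  # scalar memory ops since the last VaS
        self.vector_dirty = True  # vector memory ops since the last SaV
        self.issued = []          # syncs actually issued, for inspection

    def scalar_mem_op(self):
        # The SaV is delayed until scalar code really touches memory.
        if self.vector_dirty:
            self.issued.append("SaV")
            self.vector_dirty = False
        self.scalar_dirty = True

    def vector_routine(self):
        # The VaS is issued only if scalar memory ops happened since the last one.
        if self.scalar_dirty:
            self.issued.append("VaS")
            self.scalar_dirty = False
        self.vector_dirty = True  # no SaV on exit; leave a flag instead


flags = SyncFlags()
flags.vector_routine()
flags.vector_routine()
flags.vector_routine()
flags.scalar_mem_op()
flags.vector_routine()
print(flags.issued)  # ['VaS', 'SaV', 'VaS']
```

Under the fixed conventions the same four vector calls would issue four VaS and four SaV syncs; with the flags, back-to-back vector routines issue none between them.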
8. Optimization Example
- Demonstrates potential benefit from optimizing scalar-vector communication.
- Code computes A+B+C+D+E+F in the following manner:
[Figure: order in which the operands A, D, B, C, E, F are combined]
- Unoptimized code calls a general vector add routine 5 times.
- First optimization inlines the 5 routines and removes vector initialization sequences.
- Second optimization also removes unnecessary sync instructions.
- Optimization goal is to avoid the sawtooth in instantaneous performance graphs caused by draining the vector pipelines between vector loops.
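The two optimization steps can be sketched as passes over a symbolic instruction list. This is an illustration only: the instruction names and the functions `unoptimized`, `remove_redundant_setup`, and `remove_redundant_syncs` are hypothetical, and the sketch assumes that all five adds share the same vector setup and that no scalar memory operations occur between them, so the syncs between adds are unnecessary.

```python
SETUP, VAS, VAV, SAV, VADD = "setup", "sync VaS", "sync VaV", "sync SaV", "vadd"
SYNCS = (VAS, VAV, SAV)


def call_vadd():
    """One call to a general vector add routine with full setup and syncs."""
    return [SETUP, VAS, VAV, VADD, SAV]


def unoptimized(n_calls=5):
    code = []
    for _ in range(n_calls):
        code += call_vadd()
    return code


def remove_redundant_setup(code):
    """First optimization: after inlining, keep only the first setup."""
    out, seen = [], False
    for ins in code:
        if ins == SETUP:
            if seen:
                continue
            seen = True
        out.append(ins)
    return out


def remove_redundant_syncs(code):
    """Second optimization: drop syncs between the first and last vadd."""
    first = code.index(VADD)
    last = len(code) - 1 - code[::-1].index(VADD)
    return [ins for i, ins in enumerate(code)
            if not (first < i < last and ins in SYNCS)]


def count_syncs(code):
    return sum(ins in SYNCS for ins in code)


base = unoptimized()
step1 = remove_redundant_setup(base)
step2 = remove_redundant_syncs(step1)
print(len(base), len(step1), len(step2))     # 25 21 9
print(count_syncs(base), count_syncs(step2))  # 15 3
```

The final sequence keeps one setup, the entry syncs before the first add, the five adds, and a single SaV after the last add, so the vector pipelines are not drained between the loops.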
9.
- Large optimization potential for short vector loops.
- SaV syncs are most important to eliminate or delay.
- VaS sync performance impact is unclear.
- VaV syncs are virtually free in VIRAM-1.
- Setup code is expensive. For this example, it is as expensive as the SaV syncs.