Title: Motivation
ALP: Energy Efficient Support for All Levels of
Parallelism for Complex Media Applications
Ruchira Sasanka, Manlap Li, Sarita Adve (Univ.
of Illinois), Yen-Kuang Chen, Eric Debes (Intel)
Motivation
Results
- Challenges of Complex Media Apps
- Real-time Performance
- Energy Efficiency
- Programmability
- Nature of DLP in Complex Media Apps
- DLP interspersed with control
- Exhibits different forms of DLP
- - sub-word, vectors, streams
- Existing Vector/Stream Processors
- Targeted for large amounts of DLP
- Not ideal for code with control
- New programming paradigms
- Cost of new ISA, vector registers, BW
- Forward/Backward compatibility
- Opportunities
- Lots of parallelism (DLP/TLP/ILP)
- Existing Support on General Purpose Procs
- - ILP/TLP: CMP/SMT processors
- - DLP: SIMD (e.g., MMX, AltiVec)
- Already multi-core (and SIMD multi-lane)
[Figure panels: MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
ALP (All Levels of Parallelism)
- ALP
- Based on CMP/SMT processors with SIMD
- Uses Indexed Vectors (vectors of SIMD records)
- Only a handful of new instructions
- - only vector loads use vector instructions
- Vector data stored in L1 cache
- Supports both vectors and streams
- Familiar SIMD programming and exception model
- Indexed Vectors
- Indexed Vector Registers (IVR) e.g., V0, V1
- Each IVR has a Current Record Pointer (CRP)
- An instruction can access only the current record
- CRPs auto-incremented on use
- Computation using SIMD instructions/registers
- CRPs allow scalar processing on vector data
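The indexed-vector behavior listed above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `IndexedVectorRegister`, its methods, and the helper `simd_add` are hypothetical names chosen for illustration, and a Python list stands in for a SIMD record. It shows the one property the bullets describe: an access sees only the current record, and the CRP advances on use.

```python
from dataclasses import dataclass
from typing import List

Record = List[int]  # one SIMD record: a fixed number of packed sub-words


@dataclass
class IndexedVectorRegister:
    """Toy model of an IVR: a list of SIMD records plus a Current Record Pointer."""
    records: List[Record]
    crp: int = 0  # Current Record Pointer (hypothetical field name)

    def read_current(self) -> Record:
        """Return the current record and auto-increment the CRP."""
        rec = self.records[self.crp]
        self.crp += 1
        return rec

    def exhausted(self) -> bool:
        """True once every record has been consumed."""
        return self.crp >= len(self.records)


def simd_add(a: Record, b: Record) -> Record:
    """Element-wise add over the sub-words of two records."""
    return [x + y for x, y in zip(a, b)]


v0 = IndexedVectorRegister([[1, 2, 3, 4], [5, 6, 7, 8]])
v1 = IndexedVectorRegister([[10, 20, 30, 40], [50, 60, 70, 80]])

sums = []
while not v0.exhausted():
    # Each operand access touches only the current record of its IVR.
    sums.append(simd_add(v0.read_current(), v1.read_current()))
```

Because each access is an ordinary per-record operation, scalar code can be placed between record accesses, which is the point of the last bullet above.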
[Figure panels: MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Legend: 1T, 4T, 4x2T = 1 thread, 4 threads (CMP), and 8 threads (CMP/SMT); S = with SIMD; SV = with indexed vectors (ALP is 4x2TSV).
ALP over 1T: energy savings of 1.5X-15X, EDP savings of 7.3X-873X, and speedups of 5X-58X.
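As a sanity check on how these three metrics relate: energy-delay product (EDP) is energy multiplied by execution time, so for a single application the EDP improvement is the energy savings times the speedup. The sketch below applies that to the endpoints of the reported ranges. Pairing the endpoints is an assumption made here for illustration; the poster does not say which application attains each endpoint, so the products need not match the reported EDP range exactly.

```python
def edp_improvement(energy_savings: float, speedup: float) -> float:
    """EDP = energy * delay, so the improvement factors multiply."""
    return energy_savings * speedup


# Endpoints of the reported ranges (pairing them is an assumption).
low = edp_improvement(1.5, 5.0)     # compare with the reported 7.3X
high = edp_improvement(15.0, 58.0)  # compare with the reported 873X
```

The products land close to the reported 7.3X and 873X, which is consistent with the ranges being per-application extremes plus rounding.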
[Figure: indexed vector register V0 (IVR) holding Record 0, Record 1, ...; each record is divided into sub-words (e.g., Sub-word 3)]
- Benefits Over SIMD
- Reduced load/store and overhead instructions
- Increased exposed parallelism
- Load latency tolerance and efficient use of L1
- Energy efficient IVR access (cf. cache accesses)
[Figure, continued: records run through Record N; each record consists of packed words (Packed Word 0, Packed Word 1, ...) that are contiguous in memory; the CRP for V0 holds 1, i.e., it currently points to Record 1]
Programming Example: V2 = k * (V0 + V1) - 16

Conventional Vector Code
(A) VLD addr:stride:length -> V0
(B) VLD addr:stride:length -> V1
(C) VADD V0, V1 -> V3
(D) VMUL V3, reg1 -> V4
(E) VSUB V4, 16 -> V2
(F) VSTORE addr:stride:length V2

Indexed Vector Code
(1) VLD addr:stride:length -> V0
(2) VLD addr:stride:length -> V1
(3) VALLOCst addr:stride:length -> V2
do for all records in vector
(4) simd_add V0, V1 -> simd_r0
(5) simd_mul simd_r0, simd_r1 -> simd_r2
(6) simd_sub simd_r2, 16 -> V2
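The indexed-vector listing can be mimicked in software to see what the per-record loop computes. The sketch below is an illustration under assumptions, not the authors' implementation: it reads the example as V2 = k * (V0 + V1) - 16, assumes `simd_r1` holds the constant k, uses a single loop index in place of the per-register CRPs, and counts dynamic instructions without any loop-control overhead. The function name `indexed_vector_kernel` is hypothetical.

```python
from typing import List, Tuple

Record = List[int]  # one SIMD record of packed sub-words


def indexed_vector_kernel(
    v0: List[Record], v1: List[Record], k: int
) -> Tuple[List[Record], int]:
    """Compute V2 = k * (V0 + V1) - 16 one record at a time.

    Returns (V2, dynamic_instruction_count).
    """
    dynamic = 3  # (1) VLD, (2) VLD, (3) VALLOCst
    v2: List[Record] = []
    crp = 0  # one shared index stands in for the CRPs of V0, V1, V2
    while crp < len(v0):
        simd_r0 = [a + b for a, b in zip(v0[crp], v1[crp])]  # (4) simd_add
        simd_r2 = [k * x for x in simd_r0]                   # (5) simd_mul
        v2.append([x - 16 for x in simd_r2])                 # (6) simd_sub -> V2
        crp += 1      # models the CRP auto-increment
        dynamic += 3  # three SIMD instructions per record
    return v2, dynamic


v0 = [[1, 2, 3, 4], [5, 6, 7, 8]]
v1 = [[4, 3, 2, 1], [8, 7, 6, 5]]
v2, dynamic = indexed_vector_kernel(v0, v1, k=2)
```

Under these assumptions the conventional vector version issues six instructions regardless of vector length, while the indexed version issues three plus three per record, so its dynamic instruction count grows with the number of records.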
- Benefits/Drawbacks Over Vectors
- Few new instructions
- Easily handles control-intensive code w/o masks
- Supports streams and while loops
- Flexible scheduling and scalar exception model
- Can be scaled back (e.g., for legacy support)
- Drawback: more dynamic instructions
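The claims about control-intensive code, streams, and while loops can be made concrete with a per-record loop. This is a minimal sketch with hypothetical names (`record_stream`, `process_until`) and made-up data: because records are consumed one at a time, an ordinary branch can choose the operation for each record, and a data-dependent test can end the loop without knowing the vector length in advance.

```python
from typing import Iterable, Iterator, List

Record = List[int]


def record_stream(n: int) -> Iterator[Record]:
    """Hypothetical stream source; the consumer does not know its length."""
    for i in range(n):
        yield [i, i + 1, i + 2, i + 3]


def process_until(stream: Iterable[Record], limit: int) -> List[Record]:
    out: List[Record] = []
    for rec in stream:               # each iteration consumes the current record
        if sum(rec) > limit:         # data-dependent exit, as in a while loop
            break
        if rec[0] % 2 == 0:          # ordinary branch instead of a vector mask
            out.append([2 * x for x in rec])
        else:
            out.append(rec)
    return out


result = process_until(record_stream(10), limit=20)
```

A conventional vector ISA would typically express the inner branch with a mask over the whole vector; here the decision is made per record with scalar control flow.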