Title: EPIC Architecture (Explicitly Parallel Instruction Computing)
1. EPIC Architecture (Explicitly Parallel Instruction Computing)
- Yangyang Wen
- CDA5160--Advanced Computer Architecture I
- University of Central Florida
2. Outline
- What is EPIC?
- EPIC Philosophy
- Architectural Features Supporting EPIC
- Intel's IA-64 Architectural Features
- IA-64's Key Technologies
- Summary and References
3. Traditional Architectures: Limited Parallelism
Today's processors are often 60% idle.
4. EPIC Architecture: Explicit Parallelism
Better parallel machine code increases parallel execution.
5. What is EPIC?
- EPIC stands for Explicitly Parallel Instruction Computing. An EPIC architecture provides features that allow the compiler to take a proactive role in enhancing instruction-level parallelism (ILP) without unacceptable hardware complexity.
6. EPIC's Performance
7. EPIC Design Philosophy
- EPIC gives the compiler advanced features for enhancing ILP: predication and speculation.
- The compiler designs the plan of execution (POE) at compile time and communicates the POE to the hardware.
- EPIC hardware must provide massive resources for parallel execution.
8. Introducing IA-64
- IA-64 is Intel's first 64-bit architecture.
- The first instance of a commercially available EPIC ISA.
- The first architecture to bring ILP features to general-purpose microprocessors.
9. IA-64's Architectural Basics
- Explicit Parallelism
- Enhanced ILP
- Compiler-oriented
- Extremely large physical memory
- A huge virtual address space for applications
- 64-bit computation
- Extremely large register files
11. IA-64's Key Technologies
- Instruction Bundling
- Predication
- Control Speculation
- Data Speculation
- Software pipelining
12. Instruction Bundling

Figure: a 128-bit bundle (bits 0 through 127) contains three 41-bit instruction slots (instruction 0, instruction 1, instruction 2) plus a template field.
- Uses a form of VLIW architecture
- Three instructions are combined into one 128-bit bundle
- Parallel instructions are executed in groups
- Template bits decode and route instructions and mark the end of groups of parallel instructions
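As a rough illustration of the packing arithmetic (three 41-bit slots plus a 5-bit template field fill exactly 128 bits), here is a Python sketch. The field order and bit layout are assumptions for illustration, not the exact IA-64 encoding:

```python
# Toy model of a 128-bit bundle: 3 x 41-bit slots + 5-bit template.
# Field placement is an assumption, not the real IA-64 bit layout.

TEMPLATE_BITS = 5
SLOT_BITS = 41

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit int."""
    assert 0 <= template < (1 << TEMPLATE_BITS)
    assert len(slots) == 3
    bundle = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < (1 << SLOT_BITS)
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Recover (template, [slot0, slot1, slot2]) from a 128-bit bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
             for i in range(3)]
    return template, slots
```

The template value is what tells the hardware how to route the three slots to execution units and where a group of parallel instructions ends.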
13. ILP Bottlenecks
- Branches
  - To deal with branches, IA-64 uses predication.
  - Branch mispredictions cause a 20% to 30% loss in processor performance.
- Memory latency
  - Latency is the time it takes to get data from memory. The longer it takes to access memory for code and data, the longer the CPU sits idle.
  - For memory latency, loads are the big problem, not stores.
14. Predication

Figure: the source "if A>B then SA else SB" compiled two ways. (a) Traditional predication: a branch predicts one path; when the prediction is wrong (PS), the processor throws away SA and executes SB. (b) IA-64 predication: both SA and SB are issued under predicates, and the wrong result is simply discarded.

Branching is a major cause of lost performance.
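The idea of if-conversion can be sketched in Python (illustrative names, not IA-64 syntax): both arms execute, each notionally guarded by a predicate, and the false arm's result is discarded without any branch prediction:

```python
# Sketch of if-conversion (predication): both arms of the "if" are
# computed, and a predicate selects which result survives. Names are
# illustrative only.

def branchy(a, b):
    if a > b:           # control flow: a branch the hardware must predict
        return a - b    # SA
    else:
        return b - a    # SB

def predicated(a, b):
    p = a > b           # compare sets predicate p (and its complement)
    s_a = a - b         # executes "under" p
    s_b = b - a         # executes "under" not-p
    return s_a if p else s_b  # only the result with a true predicate survives
```

Both functions compute the same value; the predicated form trades a hard-to-predict branch for a little extra (discarded) work.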
15. EPIC Predication Process
16. Predication Benefits
- Reduces branches
- Reduces misprediction penalties
- Reduces critical paths
17. Control Speculation

Figure: in a traditional architecture the branch acts as a barrier, so "load a; use" cannot be moved above it. In IA-64, a speculative load (ld.s r8 = [a]) is hoisted above the branch, and a later chk.s r8 before the use validates it.

- IA-64 allows elevation of a load, even above a branch.
- In traditional architectures, elevating the load above a branch is not possible.
- Memory latency is a major performance bottleneck.
18. Introducing the Token Bit

Figure: IA-64 code with ld.s r8 = [a] hoisted above the branch. Exception detection occurs at the speculative load; the exception is propagated as a token and delivered only when chk.s r8 executes before the use.

- When a load is elevated, exception detection moves to the speculative load (ld.s).
- If the load address is valid, execution proceeds normally.
- If the load address is invalid, the token bit is set on the target register and execution continues past the fault point.
- If control reaches the chk.s and it detects the token bit, it jumps to fix-up code, which re-executes the load.
19. Data Speculation

Figure: in a traditional architecture the store acts as a barrier, so the later load and its use cannot be moved above it. In IA-64, an advanced load (ld.a) is hoisted above the store and records its address in the ALAT; a checking load (ld.c) or chk.a before the use detects any conflict.

- Allows the compiler to elevate the load even when it is not sure whether the memory references overlap.
- A traditional architecture cannot elevate the load, which prevents it from reordering the instructions.
20. Advanced Load Address Table (ALAT)

The ALAT is a table of (register, address) entries.

- When a load is elevated with ld.a, an entry is inserted into the ALAT.
- When a store executes, overlapping address records are removed from the ALAT.
- When chk.a executes and no matching address is found, there was a conflict, and it jumps to fix-up code to re-execute the load.
21. Speculation Benefits
- Reduces the impact of memory latency
- One study demonstrates a performance improvement of 80% when combined with predication
- Greatest improvement for code with many cache accesses
- Scheduling flexibility enables new levels of performance headroom
22. Software Pipelining
- Overlaps the execution of different loop iterations
- Completes more iterations in the same amount of time
23Software Pipelining Example
For(I0Ilt1000I) xIxIs
Loop Ld f0,0(r1) Add f0,f0,f1 Sd
f0,0(r1) Add r1,r1,8 Subi r2,r2,1 Benz loop
Loop SD f2, -4(r1) Add f2,f0,f1 Subi
r2,r2,1 Ld f0, 4(r1) Benz loop
Software pipelining
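The overlap of load, add, and store stages from different iterations can be sketched in Python. This is an illustration of the scheduling idea, not a translation of the slide's assembly; the prologue fills the pipeline and the epilogue drains it:

```python
# Software-pipelined version of "for i in range(n): x[i] += s".
# Each kernel "cycle" stores iteration i-2, adds iteration i-1, and
# loads iteration i, so three iterations are in flight at once.

def pipelined_add(x, s):
    n = len(x)
    if n < 2:                # too short to pipeline: do it directly
        for i in range(n):
            x[i] += s
        return x
    # prologue: fill the pipeline
    f0 = x[0]                # load iteration 0
    f2 = f0 + s              # add iteration 0
    f0 = x[1]                # load iteration 1
    # kernel: one store, one add, one load per cycle
    for i in range(2, n):
        x[i - 2] = f2        # store iteration i-2
        f2 = f0 + s          # add iteration i-1
        f0 = x[i]            # load iteration i
    # epilogue: drain the pipeline
    x[n - 2] = f2
    x[n - 1] = f0 + s
    return x
```

On real hardware the three operations in the kernel are independent and can issue in the same cycle, which is the whole point of the transformation; in Python they simply run in sequence.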
24. Software Pipelining Advantages
- Traditionally this effect is achieved through loop unrolling
- Less code compared to loop unrolling, and increased regularity
- Smaller code means fewer cache misses
- Especially useful for integer code with a small number of loop iterations
25. Software Pipelining Disadvantages
- Requires many additional instructions to manage the loop
- Without hardware support, the overhead may greatly increase code size
- Typically only used in special technical computing applications
26. IA-64 Features Supporting Software Pipelining
- Full predication
- A circular buffer of general and FP registers
- Loop branches decrement the RRBs (register rename bases)
27. Summary
- Predication removes branches; parallel compares increase parallelism; benefits complex control flow (e.g., large databases)
- Speculation reduces the impact of memory latency; IA-64 removes recovery from the critical path; benefits applications with poor cache locality (server applications, OS)
- S/W pipelining support with minimal overhead enables broad usage; delivers performance for small integer loops with unknown trip counts as well as monster FP loops
28. References
- M. S. Schlansker and B. R. Rau, "EPIC: Explicitly Parallel Instruction Computing", IEEE Computer, vol. ?, no. ?, pp. 37-45, 2000.
- J. Huck et al., "Introducing the IA-64 Architecture", IEEE Micro, Sept.-Oct. 2000, pp. 12-23.
- C. Dulong, "The IA-64 Architecture at Work", IEEE Computer, Computing Practices.