Title: Continuous Optimization
1Continuous Optimization
- Brian Fahs, Todd Rafacz, Sanjay J. Patel, Steven S. Lumetta
- Advanced Computer Systems Group
- Department of Electrical and Computer Engineering
- University of Illinois at Urbana-Champaign
2Continuous Optimization
- Concept
- Optimize instructions in processor pipeline
- Technique
- Streaming table-based optimization hardware
- Motivation
- Reduce dataflow height
- Pre-execute instructions
- Catch branch mispredictions early
3Outline
- Continuous optimizer design
- Performance characterization
- Current work
4Continuous Optimization
5Symbolic Values
- Expression format
- Simple enough to implement
- Optimize a large fraction of instructions
value = (physical register << scale) +/- offset
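The expression format above can be sketched as a small record type. This is only an illustration, not the hardware encoding: the class name SymbolicValue, its methods, and the use of unbounded Python integers (no fixed register width) are assumptions made for the example. A signed offset field covers both the "+" and "-" forms, and a missing base register stands for a fully known constant.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SymbolicValue:
    """Hypothetical model of value = (base << scale) + offset."""
    base: Optional[int]   # physical register number, or None for a known constant
    scale: int = 0        # left-shift amount applied to the base register
    offset: int = 0       # signed constant, so it covers "+ offset" and "- offset"

    def is_constant(self) -> bool:
        return self.base is None

    def add_const(self, imm: int) -> "SymbolicValue":
        # Adding an immediate only changes the offset field.
        return SymbolicValue(self.base, self.scale, self.offset + imm)

    def shift_left(self, amount: int) -> "SymbolicValue":
        # ((b << s) + o) << k  ==  (b << (s + k)) + (o << k)
        return SymbolicValue(self.base, self.scale + amount, self.offset << amount)

    def evaluate(self, regfile: Dict[int, int]) -> int:
        # Resolve the expression once the base register's value is available.
        base_val = 0 if self.base is None else regfile[self.base] << self.scale
        return base_val + self.offset


# (pr32 + 1) << 2 stays expressible in the same format.
v = SymbolicValue(base=32).add_const(1).shift_left(2)
result = v.evaluate({32: 10})
```

Both operations keep the result inside the single-register format, which is one way to read "simple enough to implement" while still covering many instructions.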
6Optimizer Organization
Computation Simplification
Memory Simplification
7Computation Simplification
RAT CP/RA Table
CP/RA Optimizer Logic
add r3, 1 -> r6
8Computation Simplification: Three Cases
Optimization not possible
Early execution
Dataflow height reduction
add r3, 1 -> r6
add pr32, 1 -> pr38
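The three cases can be sketched with a table that maps each physical register to a symbolic value. This is a simplified software model, not the optimizer hardware: the class ComputationSimplifier, its method names, and the "mov" mnemonic for an early-executed result are hypothetical, and the scale field is left out to keep the example short.

```python
from typing import Dict, Optional, Tuple

# Symbolic value as (base physical register or None for a constant, offset).
Sym = Tuple[Optional[int], int]


class ComputationSimplifier:
    def __init__(self) -> None:
        self.table: Dict[int, Sym] = {}   # physical register -> symbolic value

    def lookup(self, preg: int) -> Sym:
        # A register with no entry is only known as "itself plus zero".
        return self.table.get(preg, (preg, 0))

    def set_constant(self, preg: int, value: int) -> None:
        self.table[preg] = (None, value)

    def add_imm(self, src: int, imm: int, dst: int) -> str:
        base, offset = self.lookup(src)
        self.table[dst] = (base, offset + imm)
        if base is None:
            # Early execution: the source is a known constant.
            return f"mov {offset + imm} -> pr{dst}"
        # Dataflow height reduction when base differs from src;
        # otherwise the instruction is emitted unchanged.
        return f"add pr{base}, {offset + imm} -> pr{dst}"

    def add_reg(self, src1: int, src2: int, dst: int) -> str:
        (b1, o1), (b2, o2) = self.lookup(src1), self.lookup(src2)
        if b1 is None and b2 is None:
            self.table[dst] = (None, o1 + o2)
            return f"mov {o1 + o2} -> pr{dst}"
        if b1 is None or b2 is None:
            base = b2 if b1 is None else b1
            self.table[dst] = (base, o1 + o2)
            return f"add pr{base}, {o1 + o2} -> pr{dst}"
        # Optimization not possible: two unknown bases do not fit the format.
        self.table.pop(dst, None)
        return f"add pr{src1}, pr{src2} -> pr{dst}"


opt = ComputationSimplifier()
first = opt.add_imm(32, 1, 38)        # add r3, 1 -> r6 after renaming
second = opt.add_imm(38, 1, 40)       # now depends on pr32, not pr38
opt.set_constant(50, 7)
early = opt.add_imm(50, 1, 51)        # source known, so result is known
blocked = opt.add_reg(32, 33, 60)     # neither source is a constant
```

In this model the second add no longer waits on the first one, which is the sense in which the dependence chain (dataflow height) gets shorter.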
9Memory Simplification
Produced during Computation Simplification
Data Address 0x12345
RLE/SF/SSR Table
Unknown store address flushes table
RLE/SF/SSR Optimizer Logic
10Optimizing Loads
Optimizing Stores
st r6 -> 0x12345
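A rough software model of the memory table is sketched below. It reads RLE/SF/SSR as redundant load elimination, store forwarding, and silent store removal, which is an expansion assumed here rather than spelled out on the slides. The class MemorySimplifier, the "mov" mnemonic, and the "?" placeholder for an unknown address are hypothetical, and the sketch ignores access sizes and physical register reuse.

```python
from typing import Dict, Optional


class MemorySimplifier:
    def __init__(self) -> None:
        # data address -> physical register known to hold the value at that address
        self.table: Dict[int, int] = {}

    def load(self, addr: Optional[int], dst: int) -> str:
        if addr is None:
            return f"ld ? -> pr{dst}"          # address not known in the optimizer
        if addr in self.table:
            # Value already in a register: the load becomes a register move.
            return f"mov pr{self.table[addr]} -> pr{dst}"
        self.table[addr] = dst                 # remember it for later loads
        return f"ld {addr:#x} -> pr{dst}"

    def store(self, src: int, addr: Optional[int]) -> Optional[str]:
        if addr is None:
            self.table.clear()                 # unknown store address flushes table
            return f"st pr{src} -> ?"
        if self.table.get(addr) == src:
            return None                        # silent store: same value already there
        self.table[addr] = src
        return f"st pr{src} -> {addr:#x}"


mem = MemorySimplifier()
s1 = mem.store(38, 0x12345)       # st r6 -> 0x12345 after renaming
l1 = mem.load(0x12345, 40)        # forwarded from the earlier store
s2 = mem.store(38, 0x12345)       # writes the value memory already holds
s3 = mem.store(41, None)          # address unknown: table is flushed
l2 = mem.load(0x12345, 42)        # must go to memory again
```

Flushing on an unknown store address is the conservative choice: any entry might alias the store, so none can be trusted afterwards.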
11Value Feedback
12Implementation Issues
- Processing dependent instructions (default no)
- Optimizer pipe stages (default 2 stages)
- Xmit delay (default 1 cycle)
- Figure: add r1, 1, r1 followed by a second add r1, 1, r1, shown across the fetch, optimizer, and execute pipeline stages
13Performance Evaluation
- Experimental Setup
- Alpha ISA
- SPECint, SPECfp, and mediabench
- Pentium 4 style pipeline
- 20 stages minimum for branch resolution
- 22 stages min. with continuous optimizer
14Performance
Average speed up
15Performance Factors
- Dataflow height reduction
- Early instruction execution
- Early branch resolution
- Removal of forwarded loads
- Silent store removal
- Early load address resolution
- Feedback of execution results
16Optimizer Performance
benchmark  | executed early | recovered mispredicted branches | load/store address generated | loads forwarded | silent stores removed
SPECint    | 32             | 10                              | 73                           | 13              | 3
SPECfp     | 27             | 29                              | 78                           | 17              | 3
mediabench | 42             | 20                              | 97                           | 32              | 3
average    | 34             | 20                              | 83                           | 21              | 3
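As a quick consistency check on the table, the average row matches the unweighted mean of the three suite rows, rounded to the nearest integer. The slides do not state how the average was computed, so treating it as an unweighted mean is an assumption.

```python
# Rows as given in the table; column order follows the table header.
rows = {
    "SPECint":    [32, 10, 73, 13, 3],
    "SPECfp":     [27, 29, 78, 17, 3],
    "mediabench": [42, 20, 97, 32, 3],
}
reported_average = [34, 20, 83, 21, 3]

# Column-wise unweighted mean, rounded to the nearest integer.
computed = [round(sum(col) / len(col)) for col in zip(*rows.values())]
matches = computed == reported_average
```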
17Performance Factors
No early load address resolution
No early branch resolution
No early execution
No feedback
18Related Works
- Early load address resolution
- M. Bekerman et al.
- Physical register reuse
- S. Jourdan, R. Ronen, and M. Bekerman
- Speculative memory bypassing
- A. Moshovos and G. S. Sohi
- S. Onder and R. Gupta
- G. S. Tyson and T. M. Austin
19Insights
- Primary benefit: reduce resource contention and absorb data cache miss stalls
- Reduces execution workload -- rebalance
- Certain optimizations are stream-based
- Those requiring only past information
- Single-pass copy propagation
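Single-pass copy propagation is a natural fit for a stream because each instruction is rewritten using only what earlier instructions established. The sketch below is a generic software version over a toy instruction tuple, not the hardware design: the (opcode, destination, sources) format and the function name copy_propagate are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]   # (opcode, destination, sources)


def copy_propagate(stream: List[Instr]) -> List[Instr]:
    copies: Dict[str, str] = {}            # register -> register it is a copy of
    out: List[Instr] = []
    for op, dst, srcs in stream:
        # Rewrite sources using only information from earlier instructions.
        srcs = tuple(copies.get(s, s) for s in srcs)
        # dst is overwritten, so drop every mapping that mentions it.
        copies.pop(dst, None)
        for reg in [r for r, origin in copies.items() if origin == dst]:
            del copies[reg]
        if op == "mov" and srcs[0] != dst:
            copies[dst] = srcs[0]
        out.append((op, dst, srcs))
    return out


stream = [
    ("mov", "r2", ("r1",)),
    ("add", "r3", ("r2", "r2")),
    ("mov", "r1", ("r4",)),         # r1 changes, so r2 is no longer a copy of it
    ("add", "r5", ("r2", "r1")),
]
optimized = copy_propagate(stream)
```

No instruction is revisited and no future instruction is consulted, so the pass can run at stream rate.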
20Summary
- Concept
- Place streaming table-based optimizer in processor pipeline
- Benefits
- Dataflow height reduction
- Early branch resolution
- Early instruction execution
- Early load address resolution
- Novel extensions
- Value feedback
21Current Work: Alternative Application
Hardware-based trace optimizer
- Upsides
- Off critical path
- Allows dead instrs to be removed
- Downsides
- No value feedback
- No early address resolution, no early execution
Processor Pipeline
22Continuous and Trace Optimization
Preliminary data
23Thanks for your time!
- http://www.crhc.uiuc.edu/bfahs
- ACS: http://www.crhc.uiuc.edu/ACS