Title: Application of Instruction Analysis/Synthesis Tools to x86
1Application of Instruction Analysis/Synthesis
Tools to x86s Functional Unit Allocation
- Ing-Jer Huang and Ping-Huei Xie
- Institute of Computer Information Engineering
- National Sun Yat-sen University
- Kaohsiung, Taiwan 80441
- R. O. C.
- ijhuang_at_cie.nsysu.edu.tw
2Superscalar Model under Investigation
- Decoupled superscalar architecture
- register renaming
- branch prediction
- Assumptions
- no cache miss
- fast instruction fetcher and decoder
- 100 branch prediction correct
- load/store unit 2 cyclesothers 1 cycle
- large RS and ROB
3The Problem
- Q How many functional units are needed in an x86
compatible superscalar core? - A The distribution of functional unit usage in
typical x86 programs
4How to Obtain FU Distribution?
- Simulation-based approaches
- Shinatani, 1995, Davidson, 1995, Hara et
al., 1996, etc. - Running on different CPU platforms
- Slow, but can explore many configurations
- Monitoring-based approaches
- Adams et al., 1989, Bhandarkar et al., 1997,
Huang, 1997, etc. - Directly running on the same CPU platform
- Fast, but work for only the configuration of the
underlying CPU platform
5A Fast Performance/Cost Approximation Environment
6ASIA Automatic Synthesis of Instruction Set
Architedcture
- GOAL analyzes and synthesizes application-specifi
c instruction set for pipelined uni-processors. - APPROACH a micro-operation scheduling engine
based on a simulated annealing algorithm - ? The superscalar core is an application-specific
RISC core for x86 emulation
7ASIA-II Extensions for Superscalar Architecture
- Register renaming
- Temporary registers are used on the fly to
resolve anti and data dependencies. - Execution window
- Instructions are dispatched sequentially.
- Branch prediction
- Effective sizes of basic blocks are enlarged.
8Register Renaming
- In ASIA-II ignore output, anti dependencies
during scheduling
9Realistic Patterns in the Execution Window
- Balanced distribution 0bjective function
includes both time steps and H/W counts - Window effect MOPs are displaced with a limited
distance long distance is possible with many
iterations of displacement .as long as
performance is improved.
10Basic Block Expansion (Eblocks) Due to Branch
Prediction
11A Small Example from Word97
12Extended Basic Blocks
13Scheduled Eblocks
14A Small Example - FU Usage
15Description of Benchmark
16Micro-operation Level Parallelism (MSP)
17Functional Unit Usage
- Notation
- A - Integer unit
- M - Memory unit
- B - Branch unit
- F - Floating unit
- Others is the sum of that frequent less than 1.0
18Accumulated Coverage of Functional Unit Allocation
(NSC 98)
(IA-64)
(AMD K6)
(Pentium Pro)
(Base Machine)
19Conclusions
- Synthesis/analysis tools have been used to
observe the functional unit usage and MLP in
superscalar core. - Speedup over simulation is over 600 times.
- FUTURE WORK investigate various
microarchitecture features - register renaming vs. branch prediction
- functional unit optimization