A Reconfigurable Functional Unit for Adaptable Custom Instructions


1
A Reconfigurable Functional Unit for Adaptable
Custom Instructions

H. Noori (1), F. Mehdipour (2), K. Murakami (1), K. Inoue (1)
and M. Saheb Zamani (2)

(1) Kyushu University, (2) Amirkabir University of
Technology
2
Agenda

Research goal
General overview of the architecture
Modes of operation
Profiler
Reconfigurable Functional Unit (RFU)
Sequencer
RFU Architecture: A Quantitative Approach
Tool Chain
Generating Custom Instructions
Mapping Custom Instructions
Integrating RFU with base processor
Configuration Memory
Performance Evaluation
Conclusions
Future work

3
Some definitions

Hot Basic Block (HBB)
A basic block whose execution frequency is
greater than a given threshold specified in the
profiler
Custom Instructions (CIs)
Extensions to the Instruction Set Architecture
(ISA) that are executed on the RFU
Reconfigurable Functional Unit (RFU)
Custom hardware for executing CIs
Training mode
Operation mode for detecting HBBs and generating
CIs
Normal mode
Normal operation mode where CIs are executed on
the RFU

4
Research Goal

Proposal of an Adaptive Dynamic Extensible
Processor for Embedded Systems
Custom instructions are adaptable to the
applications
Custom instructions are detected and created
during execution/training
Generation of custom instructions is done
transparently and automatically
Advantages of the novel approach
Higher performance than GPPs
Higher flexibility compared to Extensible
Processors
Lower design and verification cost and shorter
design and verification time compared to ASIPs
and Extensible Processors

5
General overview of the architecture
Adaptive Dynamic Extensible Processor
[Figure: block diagram.
Base Processor: N-way in-order general RISC with Reg File and Fetch, Decode, Execute, Memory and Write stages.
Augmented Hardware: Profiler (detects start addresses of Hot Basic Blocks (HBBs)), RFU (executes Custom Instructions), Sequencer (switches between main processor and RFU).]
6
General overview of the architecture

Modes of operation
Training mode
Profiling
Detecting start address of Hot Basic Blocks
(HBBs)
Generating Custom Instructions
Generating Configuration Data for the RFU
Binary rewriting
Initializing the Sequencer Table
Online training:
Needs a simple hardware for profiling
All tasks are run on the base processor
Offline training:
Needs a PC trace after taken branches/jumps
Normal mode
Profiling (optional)
Executing Custom Instructions on the RFU and
other parts of the code on the base processor

7
Components
8
Operation modes
[Figure: three panels, each showing Applications running on a Processor with Profiler, ACC and Sequencer.
Training Mode (Binary-Level Profiling): detecting the start addresses of HBBs.
Training Mode (Binary Rewriting): running tools for generating Custom Instructions, generating configuration data for the ACC and initializing the Sequencer Table.
Normal Mode (Executing CIs): the Sequencer monitors the PC and switches between the main processor and the ACC.]
9
Profiler
Profiler Table
After a taken branch or jump we look at the BBSA
to see if the target PC is in the table. If it is
a miss we add this address and initialize the
counter to 1; otherwise we increment its value.
[Flowchart: Compare Current PC with Previous PC.
Is the difference greater than the instruction length?
No: do nothing.
Yes: is Current PC in the table?
Yes: increment the counter.
No: add it as a new entry and set the counter to one.]
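
The profiler behaviour on this slide can be sketched in software as follows. This is only an illustrative model, not the authors' hardware: the class name, the method names and the 8-byte instruction length (taken from the address spacing in the listings of this presentation) are assumptions. The sketch also treats any PC step other than one instruction length as a taken branch or jump so that backward jumps are counted; the slide itself only states the "greater than" comparison.

```python
from typing import Dict, List, Optional

# Assumption: 8-byte instructions, matching the address spacing in the listings.
INSTRUCTION_LENGTH = 8


class Profiler:
    """Software model of a table counting how often each branch target is reached."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold          # HBB threshold (hypothetical parameter)
        self.table: Dict[int, int] = {}     # target PC -> execution counter
        self.previous_pc: Optional[int] = None

    def observe(self, current_pc: int) -> None:
        """Feed one executed PC value to the profiler."""
        previous_pc = self.previous_pc
        self.previous_pc = current_pc
        if previous_pc is None:
            return  # first instruction: nothing to compare against
        if current_pc - previous_pc == INSTRUCTION_LENGTH:
            return  # sequential execution: do nothing
        # Taken branch or jump: count the target as a basic block start address.
        if current_pc in self.table:
            self.table[current_pc] += 1     # hit: increment the counter
        else:
            self.table[current_pc] = 1      # miss: new entry, counter set to one

    def hot_basic_blocks(self) -> List[int]:
        """Start addresses whose counter is greater than the threshold."""
        return sorted(pc for pc, n in self.table.items() if n > self.threshold)
```

Feeding the model a PC trace and then calling hot_basic_blocks() gives the candidate start addresses that the training mode would pass on to custom instruction generation.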
10
Reconfigurable Functional Unit (RFU)

RFU is a matrix of Functional Units (FUs)
RFU is a multi-cycle FU with variable delay
RFU has a two level configuration memory
A multi-context memory (keeps two or four configurations)
A cache
FUs support only logical operations,
add/subtract, shifts and compare
RFU updates the PC
RFU has variable delay which depends on size of
Custom Instruction

11
Sequencer

The sequencer mainly determines the microcode
execution sequence.
Selects between decoder and config memory for
reading RF
Selects between the output of Functional Unit and
Accelerator
Distinguishes when to switch between different
contexts of multi-context memory
Determines when to load configuration data from
cache to multi-context memory.
Checks the configuration data of custom
instruction
If it is in multi-context memory, custom
instructions will be executed on the accelerator
If it is not in multi-context memory
If there is enough time to load it from cache to
multi-context memory, it is loaded and the CI is
executed on the ACC
If there is not enough time, the original code is
executed.
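
The configuration check described above can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the names, the cycle-based notion of "enough time" and the assumption that the configuration is always present in the cache are illustrative and are not given on the slide.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    EXECUTE_ON_ACC = "execute CI on the ACC"
    LOAD_THEN_EXECUTE_ON_ACC = "load configuration, then execute CI on the ACC"
    EXECUTE_ORIGINAL_CODE = "execute the original code on the base processor"


def sequencer_decision(ci_id: int,
                       multi_context_memory: Set[int],
                       cycles_available: int,
                       load_cycles: int) -> Action:
    """Decide how a custom instruction is handled (hypothetical interface).

    ci_id                 identifier of the custom instruction
    multi_context_memory  identifiers of the configurations currently loaded
    cycles_available      cycles left before the CI must start (assumption)
    load_cycles           cycles needed to copy a configuration from the cache
    """
    if ci_id in multi_context_memory:
        return Action.EXECUTE_ON_ACC
    # Assumption: the configuration is in the cache, so only timing matters.
    if cycles_available >= load_cycles:
        return Action.LOAD_THEN_EXECUTE_ON_ACC
    return Action.EXECUTE_ORIGINAL_CODE
```

Falling back to the original code keeps the program correct even when a configuration cannot be loaded in time, at the cost of losing the speedup for that execution.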

12
Tool Chain
13
Generation of Custom Instructions

Custom instructions
Exclude floating point, multiply, divide and load
instructions
Include at most one STORE, at most one
BRANCH/JUMP and all other fixed point
instructions
Simple algorithm for generating custom
instructions
HBBs usually include 10 to 40 instructions for
MiBench
Custom instruction generator will be
executed on the base processor (in online
training mode)

14
Generating Custom Instructions

4052c0 addiu 29,29,-32
4052c8 mov.d f0,f12
4052d0 sw 18,24(29)
4052d8 addu 18,0,6
4052e0 sw 31,28(29)
4052e8 sw 16,16(29)
4052f0 mfc1 16,f0
4052f8 mfc1 17,f1
405300 srl 6,17,0x14
405308 andi 6,6,2047
405310 sltiu 2,6,2047
405318 addu 6,6,18
405320 sltiu 2,6,2047
405328 lui 2,32783
405330 and 17,17,2
405338 andi 2,6,2047
405340 sll 2,2,0x14
405348 or 17,17,2
405350 mtc1 16,f0

Finding the biggest sequence of instructions in
the HBB that can be executed on the ACC
Moving the instructions and appending supportable
instructions to the head of the detected
instruction sequence after checking
flow-dependency and anti-dependency
Moving the instructions and appending supportable
instructions to the tail of the detected
instruction sequence after checking
flow-dependency and anti-dependency
Rewriting object code if instructions have been
moved
Moving instructions should not modify the logic
of the application
Custom instruction generation is done without
considering any other constraints.
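
The first step of this procedure, finding the biggest sequence of instructions that the ACC can execute, might look like the sketch below. It covers only that first step (no instruction moving or object code rewriting). The mnemonic sets and the function name are assumptions made for illustration; the constraints (no floating point, multiply, divide or load, at most one store and at most one branch/jump) come from the previous slide.

```python
from typing import List, Tuple

# Assumption: a small illustrative subset of fixed-point mnemonics.
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne", "j", "jr"}
SUPPORTED = STORES | BRANCHES | {
    "addu", "addiu", "subu", "and", "andi", "or", "ori", "xor", "nor",
    "sll", "srl", "sra", "slt", "sltu", "slti", "sltiu", "lui",
}


def longest_ci_candidate(mnemonics: List[str]) -> Tuple[int, int]:
    """Return (start, end) of the longest contiguous supportable run.

    The range is half-open: instructions start .. end-1 belong to the run.
    A run holds only supported mnemonics, at most one store and at most
    one branch/jump. On a tie the earliest run wins.
    """
    best = (0, 0)
    for start in range(len(mnemonics)):
        stores = branches = 0
        end = start
        while end < len(mnemonics):
            m = mnemonics[end]
            if m not in SUPPORTED:
                break                       # e.g. load, multiply, floating point
            stores += m in STORES
            branches += m in BRANCHES
            if stores > 1 or branches > 1:
                break                       # a second store or branch ends the run
            end += 1
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


# Mnemonics of the hot basic block listed on this slide.
hbb = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1", "srl",
       "andi", "sltiu", "addu", "sltiu", "lui", "and", "andi", "sll", "or",
       "mtc1"]
print(longest_ci_candidate(hbb))  # (8, 18): the ten instructions from srl to or
```

The quadratic scan is acceptable here because hot basic blocks are short, which also fits the slide's point that the generator has to stay simple enough to run on the base processor.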

15
Generating Custom Instructions

Block 3 (B3) is selected as the biggest
instruction sequence that can be executed on the
ACC
Block 2 (B2) cannot be executed on the ACC
Block 1 (B1) can be executed on the ACC
If there is no flow dependency or anti-dependency
between B1 and B2, exchange them.
The same is done for B3, B4 and B5.
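
The exchange test between two adjacent blocks can be sketched as a register-set comparison. This is an illustrative model only: the Instr type and function name are assumptions, the check covers just the flow dependency and anti-dependency named on the slide, and memory dependencies between loads and stores are ignored. A complete implementation would presumably also have to look at output dependencies (both blocks writing the same register).

```python
from typing import FrozenSet, Iterable, NamedTuple, Sequence


class Instr(NamedTuple):
    writes: FrozenSet[int]   # register numbers written by the instruction
    reads: FrozenSet[int]    # register numbers read by the instruction


def instr(writes: Iterable[int], reads: Iterable[int]) -> Instr:
    """Convenience constructor (hypothetical helper)."""
    return Instr(frozenset(writes), frozenset(reads))


def can_exchange(first: Sequence[Instr], second: Sequence[Instr]) -> bool:
    """True if block `second` may be moved in front of block `first`.

    Flow dependency: `second` reads a register that `first` writes.
    Anti-dependency: `second` writes a register that `first` reads.
    """
    first_writes = frozenset().union(*(i.writes for i in first))
    first_reads = frozenset().union(*(i.reads for i in first))
    second_writes = frozenset().union(*(i.writes for i in second))
    second_reads = frozenset().union(*(i.reads for i in second))
    flow = first_writes & second_reads
    anti = first_reads & second_writes
    return not flow and not anti


# "and 2,8,9" followed by "lw 3,0(29)": no shared registers, exchange allowed.
print(can_exchange([instr([2], [8, 9])], [instr([3], [29])]))   # True
# "and 2,8,9" followed by "srl 2,2,0x1e": register 2 carries a flow dependency.
print(can_exchange([instr([2], [8, 9])], [instr([2], [2])]))    # False
```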

16
Example 2 (rewriting obj code)

400d10 addiu 29,29,-8
400d18 addu 8,0,4
400d20 addu 7,0,0
400d28 lui 9,49152
400d30 sll 4,4,0x2
400d38 and 2,8,9
400d40 srl 2,2,0x1e
400d48 lw 22,0(29)
400d50 addu 4,4,2
400d58 sll 8,8,0x2
400d60 sll 6,3,0x1
400d68 sll 3,3,0x2
400d70 sltu 2,4,3
400d78 bne 2,0,400db8 <usqrt+0xa8>

17
Example 1

400d10 addiu 29,29,-8
400d18 addu 8,0,4
400d20 sw 0,0(29)
400d28 addu 4,0,0
400d30 addu 7,0,0
400d38 lui 9,49152
400d40 sll 4,4,0x2
400d48 and 2,8,9
400d50 srl 29,2,0x1e
400d58 lw 3,0(29)
400d60 addu 4,4,3
400d68 sll 8,8,0x2
400d70 sll 6,3,0x1
400d78 sll 3,3,0x2
400d80 addiu 3,3,1
400d88 sltu 2,4,3
400d90 sw 6,0(29)
400d98 bne 2,0,400db8 <usqrt+0xa8>

Customized Instruction 1
Customized Instruction 2
18
An Example for Mapping CI on the RFU
19
RFU Architecture: A Quantitative Approach

22 programs of MiBench were chosen
SimpleScalar toolset was utilized for simulation
RFU is a matrix of FUs
Number of inputs
Number of outputs
Number of FUs
Connections
Location of inputs and outputs
Some definitions
Considering frequency and weight in measurement
Freq: CI execution frequency
Weight: equal to the number of executed instructions
Average for all CIs (Σ Freq × Weight)
Rejection: percentage of CIs that could not be
mapped on the RFU
Coverage: percentage of CIs that could be mapped
on the RFU
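
One possible reading of these measures is that every CI contributes its execution frequency multiplied by its weight, and that coverage and rejection are the mapped and unmapped shares of that total. The sketch below follows this reading; the function name, the tuple layout and the sample numbers are assumptions for illustration and are not taken from the presentation.

```python
from typing import Iterable, Tuple


def coverage_and_rejection(
        cis: Iterable[Tuple[int, int, bool]]) -> Tuple[float, float]:
    """Return (coverage %, rejection %) weighted by frequency x weight.

    Each CI is given as (frequency, weight, mapped), where `mapped` tells
    whether the CI fits on the RFU under the constraint being studied.
    """
    total = 0
    mapped_total = 0
    for frequency, weight, mapped in cis:
        contribution = frequency * weight
        total += contribution
        if mapped:
            mapped_total += contribution
    if total == 0:
        raise ValueError("no executed custom instructions")
    coverage = 100.0 * mapped_total / total
    return coverage, 100.0 - coverage


# Hypothetical data: one mapped CI and one rejected CI with equal contributions.
print(coverage_and_rejection([(100, 10, True), (50, 20, False)]))  # (50.0, 50.0)
```

Weighting by frequency and length means that a rarely executed or very short CI that fails to map barely lowers the coverage, which seems to be the intent of "considering frequency and weight in measurement".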

20
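RFU Inputs (no constraint)
[Chart: labeled values 96.37, 89.37, 98.48 and 8]
21
RFU Outputs (no constraint)
[Chart: labeled values 96.58 and 6]
22
RFU Node Number (Inputs = 8, Outputs = 8)
[Chart: labeled values 94.74 and 16]
23
RFU Width (Inputs = 8, Outputs = 8, Nodes = 16)
[Chart: labeled values 97.65, 95.65 and 6]
24
RFU Depth (Inputs = 8, Outputs = 8, Nodes = 16)
[Chart: labeled values 93.41 and 6]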
25
RFU Architecture 1

Inputs: 8
Outputs: 8
Nodes: 16
Width: 6, 4, 3, 2, 1
Depth: 5
Inputs are applied to the first row
Outputs of each row are connected only to the
inputs of the subsequent row
MOVE is used for transferring data
Mapping rate is 77.53%
Rejection rate is 22.47%

Synthesis results using Hitachi 0.18 µm
Area: 0.9069 mm²
Delay: 7.54 ns
26
RFU Architecture 2

Distributing inputs in different rows
Row 1: 7
Row 2: 2
Row 3: 2
Row 4: 2
Row 5: 1
Connections with variable length
Row 1 → Row 3: 1
Row 1 → Row 4: 1
Row 1 → Row 5: 1
Row 2 → Row 4: 1
Mapping rate is 90.48%
Rejection rate is 9.52%

Synthesis results using Hitachi 0.18 µm
Area: 1.1534 mm²
Delay: 9.66 ns
27
Function Types

Three types of functions
logical operations (type 1)
add/sub/compare (type 2)
shift operations (type 3)

28
Integrating RFU with the Base Processor
[Figure: datapath diagram showing the register file (Reg0 ... Reg31), Config Mem, Decoder, Sequencer, DEC/EXE Pipeline Registers, FU1 to FU4, ACC, and EXE/MEM Pipeline Registers]
29
Control Bits and Immediate Data

308 bits are needed as Control Bits for
Multiplexers
Functional Units
204 bits are needed for Immediates
Each CI configuration needs 308 + 204 = 512 bits
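
The storage arithmetic can be checked with a few lines of code. The bit counts come from this slide and the two-or-four context options from the RFU slide; the function names are illustrative assumptions.

```python
CONTROL_BITS = 308     # multiplexer and functional unit control bits per CI
IMMEDIATE_BITS = 204   # immediate data bits per CI


def config_bits_per_ci() -> int:
    """Bits needed to store one custom instruction configuration."""
    return CONTROL_BITS + IMMEDIATE_BITS


def multi_context_bits(contexts: int) -> int:
    """Bits needed by a multi-context memory holding `contexts` configurations."""
    if contexts not in (2, 4):
        raise ValueError("the presentation mentions two or four configurations")
    return contexts * config_bits_per_ci()


print(config_bits_per_ci())        # 512 bits, i.e. 64 bytes per configuration
print(multi_context_bits(2))       # 1024
print(multi_context_bits(4))       # 2048
```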

30
Performance Evaluation

SimpleScalar was configured to behave as a
4-issue in-order RISC processor. The base
processor supports the MIPS instruction set.
22 applications of MiBench

31
Delay of RFU according to CI length