A Reconfigurable Functional Unit for Adaptable Custom Instructions

1
A Reconfigurable Functional Unit for Adaptable
Custom Instructions
  • H. Noori (1), F. Mehdipour (2), K. Murakami (1), K. Inoue (1)
    and M. Saheb Zamani (2)

(1) Kyushu University, (2) Amirkabir University of
Technology
2
Agenda
  • Research goal
  • General overview of the architecture
  • Modes of operation
  • Profiler
  • Reconfigurable Functional Unit (RFU)
  • Sequencer
  • RFU Architecture: A Quantitative Approach
  • Tool Chain
  • Generating Custom Instructions
  • Mapping Custom Instructions
  • Integrating RFU with base processor
  • Configuration Memory
  • Performance Evaluation
  • Conclusions
  • Future work

3
Some definitions
  • Hot Basic Block (HBB)
  • A basic block whose execution frequency is
    greater than a given threshold specified in the
    profiler
  • Custom Instructions (CIs)
  • Extensions to the Instruction Set Architecture
    (ISA) that are executed on the RFU
  • Reconfigurable Functional Unit (RFU)
  • Custom hardware for executing CIs
  • Training mode
  • Operation mode for detecting HBBs and generating
    CIs
  • Normal mode
  • Normal operation mode where CIs are executed on
    the RFU

4
Research Goal
  • Proposal of an Adaptive Dynamic Extensible
    Processor for Embedded Systems
  • Custom instructions are adaptable to the
    applications
  • Custom instructions are detected and created
    during execution/training
  • Generation of custom instructions is done
    transparently and automatically
  • Advantages of the novel approach
  • Higher performance than GPPs
  • Higher flexibility compared to Extensible
    Processors
  • Lower design and verification cost and shorter
    design and verification time than ASIPs and
    Extensible Processors

5
General overview of the architecture
[Block diagram: Adaptive Dynamic Extensible Processor]
  • Base Processor: N-way in-order general RISC
    (Reg File; Fetch, Decode, Execute, Memory and
    Write stages)
  • Augmented Hardware
  • Profiler: detects start addresses of Hot Basic
    Blocks (HBBs)
  • RFU: executes Custom Instructions
  • Sequencer: switches between main processor and
    RFU
6
General overview of the architecture
  • Modes of operation
  • Training mode
  • Profiling
  • Detecting start address of Hot Basic Blocks
    (HBBs)
  • Generating Custom Instructions
  • Generating Configuration Data for the RFU
  • Binary rewriting
  • Initializing the Sequencer Table
  • Online training
  • Needs simple hardware for profiling
  • All tasks are run on the base processor
  • Offline training
  • Needs a PC trace after taken branches/jumps
  • Normal mode
  • Profiling (optional)
  • Executing Custom Instructions on the RFU and
    other parts of the code on the base processor

7
Components
8
Operation modes
[Diagram: three panels, each showing Applications
running on the Processor with the Profiler, ACC and
Sequencer]
  • Training Mode: Binary-Level Profiling; Detecting
    Start Address of HBBs
  • Training Mode: Running Tools for Generating
    Custom Instructions, Generating Configuration
    Data for ACC and Initializing Sequencer Table;
    Binary Rewriting
  • Normal Mode: Monitors PC and Switches between
    main processor and ACC; Executing CIs
9
Profiler
[Flowchart: Profiler Table update]
  • Compare Current PC with Previous PC
  • If the difference is not greater than the
    instruction length: do nothing
  • If it is greater: is Current PC in the table?
  • Yes: increment the counter
  • No: add it as a new entry and set the counter to
    one
After a taken branch or jump we look at the BBSA
to see if the target PC is on the table. If it is
a miss we include this address and initialize the
counter to 1, otherwise we increment its value.
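The profiler update on this slide can be sketched in software as follows. This is only an illustrative model, not the hardware design from the presentation: the class name, the `threshold` parameter, the use of an absolute difference (so that backward branches are also caught) and the 8-byte instruction length (taken from the address stride in the code listings) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: 8-byte instructions, matching the address stride in the listings.
INSTRUCTION_LENGTH = 8


@dataclass
class Profiler:
    """Hypothetical software model of the profiler table."""
    threshold: int                                   # assumed hotness threshold
    table: Dict[int, int] = field(default_factory=dict)
    previous_pc: Optional[int] = None

    def observe(self, current_pc: int) -> None:
        """Feed one PC value; count it if it follows a taken branch or jump."""
        previous, self.previous_pc = self.previous_pc, current_pc
        if previous is None:
            return
        # Sequential execution: the PC moved by at most one instruction.
        if abs(current_pc - previous) <= INSTRUCTION_LENGTH:
            return
        # Miss: new entry with counter 1. Hit: increment the counter.
        self.table[current_pc] = self.table.get(current_pc, 0) + 1

    def hot_blocks(self) -> List[int]:
        """Start addresses whose counter exceeds the threshold."""
        return sorted(pc for pc, count in self.table.items()
                      if count > self.threshold)
```

Feeding the model a PC trace and then calling `hot_blocks()` yields the candidate HBB start addresses for the training mode.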
10
Reconfigurable Functional Unit (RFU)
  • RFU is a matrix of Functional Units (FUs)
  • RFU is a multi-cycle FU with variable delay
  • RFU has a two level configuration memory
  • A multi-context memory (keeps two or four
    configurations)
  • A cache
  • FUs support only logical operations,
    add/subtract, shifts and compare
  • RFU updates the PC
  • RFU has variable delay which depends on size of
    Custom Instruction

11
Sequencer
  • The sequencer mainly determines the microcode
    execution sequence.
  • Selects between decoder and config memory for
    reading RF
  • Selects between the output of Functional Unit and
    Accelerator
  • Distinguishes when to switch between different
    contexts of multi-context memory
  • Determines when to load configuration data from
    cache to multi-context memory.
  • Checks the configuration data of custom
    instruction
  • If it is in multi-context memory, custom
    instructions will be executed on the accelerator
  • If it is not in multi-context memory
  • If there is enough time to load it from cache to
    multi-context memory, loads it and executes the
    CI on the ACC
  • If there is not enough time, the original code is
    executed.
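The three-way check at the end of this slide can be written as a small decision function. This is a sketch under assumptions: the slide does not say how "enough time" is measured, so `cycles_until_ci` and `load_latency` are hypothetical parameters, and the function and enum names are illustrative.

```python
from enum import Enum


class Action(Enum):
    RUN_ON_ACC = "execute the CI on the accelerator"
    LOAD_THEN_RUN = "load configuration from cache, then execute the CI"
    RUN_ORIGINAL = "execute the original code on the base processor"


def sequencer_decision(ci_id, contexts, cycles_until_ci, load_latency):
    """Choose what to do when custom instruction `ci_id` is reached.

    contexts        -- set of CI ids held in the multi-context memory
    cycles_until_ci -- assumed estimate of cycles left before the CI starts
    load_latency    -- assumed cache-to-context load time in cycles
    """
    if ci_id in contexts:
        return Action.RUN_ON_ACC
    if cycles_until_ci >= load_latency:
        return Action.LOAD_THEN_RUN
    return Action.RUN_ORIGINAL
```

Falling back to the original code keeps the program correct even when a configuration is not ready in time.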

12
Tool Chain
13
Generation of Custom Instructions
  • Custom instructions
  • Exclude floating point, multiply, divide and load
    instructions
  • Include at most one STORE, at most one
    BRANCH/JUMP and all other fixed point
    instructions
  • Simple algorithm for generating custom
    instructions
  • HBBs usually include 10 to 40 instructions for
    MiBench
  • Custom instruction generator is going to be
    executed on the base processor (in online
    training mode)

14
Generating Custom Instructions
  • 4052c0 addiu 29,29,-32
  • 4052c8 mov.d f0,f12
  • 4052d0 sw 18,24(29)
  • 4052d8 addu 18,0,6
  • 4052e0 sw 31,28(29)
  • 4052e8 sw 16,16(29)
  • 4052f0 mfc1 16,f0
  • 4052f8 mfc1 17,f1
  • 405300 srl 6,17,0x14
  • 405308 andi 6,6,2047
  • 405310 sltiu 2,6,2047
  • 405318 addu 6,6,18
  • 405320 sltiu 2,6,2047
  • 405328 lui 2,32783
  • 405330 and 17,17,2
  • 405338 andi 2,6,2047
  • 405340 sll 2,2,0x14
  • 405348 or 17,17,2
  • 405350 mtc1 16,f0
  • Finding the biggest sequence of instructions in
    the HBB that can be executed on the ACC
  • Moving the instructions and appending supportable
    instructions to the head of the detected
    instruction sequence after checking
    flow-dependency and anti-dependency
  • Moving the instructions and appending supportable
    instructions to the tail of the detected
    instruction sequence after checking
    flow-dependency and anti-dependency
  • Rewriting object code if instructions have been
    moved
  • Moving instructions should not modify the logic
    of the application
  • Custom instruction generation is done without
    considering any other constraints.
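The first step, finding the biggest supportable sequence in an HBB, can be sketched with a sliding window over the opcodes. The sketch applies the rules stated for custom instructions (no floating point, multiply, divide or load; at most one store; at most one branch/jump). The opcode sets are illustrative and not exhaustive, the function names are hypothetical, and the instruction-moving steps are not modelled here.

```python
# Illustrative opcode classes (assumptions, not a complete MIPS list).
UNSUPPORTED = {"lw", "lb", "lbu", "lh", "lhu",
               "mult", "multu", "div", "divu", "mfc1", "mtc1"}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne", "j", "jal", "jr"}


def is_supported(op):
    # A '.' in the mnemonic marks floating-point operations such as mov.d.
    return op not in UNSUPPORTED and "." not in op


def longest_ci_candidate(opcodes):
    """Return (start, end) of the longest supportable run, end exclusive."""
    best = (0, 0)
    start = stores = branches = 0
    for i, op in enumerate(opcodes):
        if not is_supported(op):
            start, stores, branches = i + 1, 0, 0
            continue
        stores += op in STORES
        branches += op in BRANCHES
        # Shrink from the left until the store/branch limits hold again.
        while stores > 1 or branches > 1:
            left = opcodes[start]
            stores -= left in STORES
            branches -= left in BRANCHES
            start += 1
        if i + 1 - start > best[1] - best[0]:
            best = (start, i + 1)
    return best


# Opcodes of the listing on this slide, in order.
SLIDE_OPCODES = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1",
                 "srl", "andi", "sltiu", "addu", "sltiu", "lui", "and",
                 "andi", "sll", "or", "mtc1"]
print(longest_ci_candidate(SLIDE_OPCODES))  # (8, 18): srl through or
```

For this listing the window covers the ten instructions from `srl` to `or`, which sit between the `mfc1` and `mtc1` floating-point moves.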

15
Generating Custom Instructions
  • Block 3 (B3) is selected as the biggest
    instruction sequence that can be executed on the
    ACC
  • Block 2 (B2) cannot be executed on the ACC
  • Block 1 (B1) can be executed on the ACC
  • If there is no flow dependency or anti-dependency
    between B1 and B2, exchange them.
  • This is done for B3, B4 and B5.
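The exchange test between two adjacent blocks can be sketched as a register-set comparison. This is a minimal illustration, assuming each instruction is described by the registers it writes and reads; the `Instr` type and helper names are hypothetical. Only the two checks named on the slide are implemented.

```python
from typing import FrozenSet, NamedTuple, Sequence


class Instr(NamedTuple):
    text: str
    writes: FrozenSet[int]   # destination registers
    reads: FrozenSet[int]    # source registers


def regs_written(block: Sequence[Instr]) -> set:
    return set().union(*(i.writes for i in block))


def regs_read(block: Sequence[Instr]) -> set:
    return set().union(*(i.reads for i in block))


def can_exchange(first: Sequence[Instr], second: Sequence[Instr]) -> bool:
    """True if `second` may be moved in front of `first`."""
    # Flow dependency: `second` reads a register that `first` writes.
    flow = regs_written(first) & regs_read(second)
    # Anti-dependency: `second` writes a register that `first` reads.
    anti = regs_read(first) & regs_written(second)
    return not flow and not anti


srl = Instr("srl 2,2,0x1e", frozenset({2}), frozenset({2}))
lw = Instr("lw 22,0(29)", frozenset({22}), frozenset({29}))
addu = Instr("addu 4,4,2", frozenset({4}), frozenset({4, 2}))
print(can_exchange([srl], [lw]))    # True: no shared registers
print(can_exchange([srl], [addu]))  # False: addu reads register 2
```

A complete implementation would also have to respect output dependencies and the ordering of memory accesses, which this sketch does not check.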

16
Example 2 (rewriting obj code)
  • 400d10 addiu 29,29,-8
  • 400d18 addu 8,0,4
  • 400d20 addu 7,0,0
  • 400d28 lui 9,49152
  • 400d30 sll 4,4,0x2
  • 400d38 and 2,8,9
  • 400d40 srl 2,2,0x1e
  • 400d48 lw 22,0(29)
  • 400d50 addu 4,4,2
  • 400d58 sll 8,8,0x2
  • 400d60 sll 6,3,0x1
  • 400d68 sll 3,3,0x2
  • 400d70 sltu 2,4,3
  • 400d78 bne 2,0,400db8 <usqrt+0xa8>

17
Example 1
  • 400d10 addiu 29,29,-8
  • 400d18 addu 8,0,4
  • 400d20 sw 0,0(29)
  • 400d28 addu 4,0,0
  • 400d30 addu 7,0,0
  • 400d38 lui 9,49152
  • 400d40 sll 4,4,0x2
  • 400d48 and 2,8,9
  • 400d50 srl 29,2,0x1e
  • 400d58 lw 3,0(29)
  • 400d60 addu 4,4,3
  • 400d68 sll 8,8,0x2
  • 400d70 sll 6,3,0x1
  • 400d78 sll 3,3,0x2
  • 400d80 addiu 3,3,1
  • 400d88 sltu 2,4,3
  • 400d90 sw 6,0(29)
  • 400d98 bne 2,0,400db8 <usqrt+0xa8>

Customized Instruction 1
Customized Instruction 2
18
An Example for Mapping CI on the RFU
19
RFU Architecture: A Quantitative Approach
  • 22 programs of MiBench were chosen
  • SimpleScalar toolset was utilized for simulation
  • RFU is a matrix of FUs
  • Number of inputs
  • Number of outputs
  • Number of FUs
  • Connections
  • Location of inputs and outputs
  • Some definitions
  • Frequency and weight are considered in the
    measurements
  • CI execution frequency
  • Weight (equal to the number of executed
    instructions)
  • Average over all CIs (Σ Freq × Weight)
  • Rejection: percentage of CIs that could not be
    mapped on the RFU
  • Coverage: percentage of CIs that could be mapped
    on the RFU
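One plausible reading of these definitions is that coverage and rejection are percentages of the frequency-weighted total, so that a frequently executed, long CI counts for more than a rare, short one. The sketch below follows that reading; the exact formula is an assumption, the function name is hypothetical, and the numbers in the example are made up for illustration.

```python
def coverage_and_rejection(cis):
    """Return (coverage %, rejection %) weighted by frequency x weight.

    cis -- list of (frequency, weight, mappable) tuples, one per CI, where
           weight is the number of instructions the CI replaces.
    """
    total = sum(freq * weight for freq, weight, _ in cis)
    if total == 0:
        raise ValueError("no executed custom instructions")
    mapped = sum(freq * weight for freq, weight, ok in cis if ok)
    coverage = 100.0 * mapped / total
    return coverage, 100.0 - coverage


# Made-up example: one mappable CI run 300 times, one rejected CI run 100 times.
print(coverage_and_rejection([(300, 10, True), (100, 10, False)]))  # (75.0, 25.0)
```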

20
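RFU Inputs (no constraint)
[Chart: values marked on the plot are 96.37, 89.37, 98.48 and 8]
21
RFU Outputs (no constraint)
[Chart: values marked on the plot are 96.58 and 6]
22
RFU Number of Nodes (Inputs = 8, Outputs = 8)
[Chart: values marked on the plot are 94.74 and 16]
23
RFU Width (Inputs = 8, Outputs = 8, Nodes = 16)
[Chart: values marked on the plot are 97.65, 95.65 and 6]
24
RFU Depth (Inputs = 8, Outputs = 8, Nodes = 16)
[Chart: values marked on the plot are 93.41 and 6]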
25
RFU Architecture 1
  • Inputs: 8
  • Outputs: 8
  • Nodes: 16
  • Width: 6, 4, 3, 2, 1
  • Depth: 5
  • Inputs are applied to the first row
  • Outputs of each row are connected only to the
    inputs of the subsequent row
  • MOVE is used for transferring data
  • Mapping rate is 77.53%
  • Rejection rate is 22.47%
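To see how the row-to-next-row restriction and the MOVE operations interact with the row widths, the following sketch checks whether a CI dataflow graph fits this architecture. It is not the mapping tool from the presentation. It assumes each operation is placed in the earliest possible row, that every value crossing a row occupies one FU there as a MOVE, and that one MOVE chain is shared by all consumers of a value. Input and output counts are not checked, and all names are hypothetical.

```python
ROW_WIDTHS = [6, 4, 3, 2, 1]   # FUs per row: 16 nodes in total, depth 5


def fits_architecture_1(preds):
    """preds maps each operation to the values it consumes.

    Keys must be listed in topological order. Any name that is not a key
    is treated as a CI input, available in front of the first row.
    """
    row = {}
    for node, sources in preds.items():
        row[node] = 1 + max((row.get(s, 0) for s in sources), default=0)
    if max(row.values(), default=0) > len(ROW_WIDTHS):
        return False                      # deeper than the RFU
    used = [0] * len(ROW_WIDTHS)
    for r in row.values():
        used[r - 1] += 1
    # Farthest row in which each value is consumed.
    last_use = {}
    for node, sources in preds.items():
        for s in sources:
            last_use[s] = max(last_use.get(s, 0), row[node])
    # One MOVE in every row the value has to pass through.
    for value, last in last_use.items():
        for r in range(row.get(value, 0) + 1, last):
            used[r - 1] += 1
    return all(u <= w for u, w in zip(used, ROW_WIDTHS))
```

For example, a chain of five dependent operations fits, while a chain of six exceeds the depth, and seven independent operations exceed the width of the first row.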

Synthesis results using Hitachi 0.18 µm: Area
0.9069 mm², Delay 7.54 ns
26
RFU Architecture 2
  • Distributing inputs across different rows
  • Row 1: 7
  • Row 2: 2
  • Row 3: 2
  • Row 4: 2
  • Row 5: 1
  • Connections with variable length
  • Row 1 → Row 3: 1
  • Row 1 → Row 4: 1
  • Row 1 → Row 5: 1
  • Row 2 → Row 4: 1
  • Mapping rate is 90.48%
  • Rejection rate is 9.52%

Synthesis results using Hitachi 0.18 µm: Area
1.1534 mm², Delay 9.66 ns
27
Function Types
  • Three types of functions
  • logical operations (type 1)
  • add/sub/compare (type 2)
  • shift operations (type 3)

28
Integrating RFU with the Base Processor
[Block diagram: register file (Reg0 ... Reg31),
Config Mem, Decoder, Sequencer, DEC/EXE Pipeline
Registers, FU1 to FU4 alongside the ACC, a second
Sequencer block, and EXE/MEM Pipeline Registers]
29
Control Bits and Immediate Data
  • 308 bits are needed as Control Bits for:
  • Multiplexers
  • Functional Units
  • 204 bits are needed for Immediates
  • Each CI configuration needs 308 + 204 = 512 bits

30
Performance Evaluation
  • SimpleScalar was configured to behave as a
    4-issue in-order RISC processor. The base
    processor supports the MIPS instruction set.
  • 22 applications of MiBench

31
Delay of RFU according to CI length
  • Synopsys tools, Hitachi 0.18 µm

32
Performance Evaluation - Results
33
Conclusions
  • Adaptive Dynamic Extensible Processor
  • Binary Profiler
  • RFU (Inputs = 8, Outputs = 6, Nodes = 16,
    Width = 6, 4, 3, 2, 1, Depth = 5)
  • Sequencer
  • Average Speedup is 1.21 for 300 MHz base
    processor
  • Adaptive Dynamic Extensible Processor
  • No design time
  • No extra read port and write port
  • No design and verification cost
  • No compiler
  • No new opcode

34
Future Work
  • Generating multi-exit CIs
  • Proposing an RFU for supporting multi-exit CIs
  • Scheduling CIs on the processor
  • Details of sequencer

35
  • Thank you for listening