Code Compaction for UniCore on Link-Time Optimization Platform

Slides: 25

Provided by: KoenD3

Learn more at: http://cadlab.cs.ucla.edu

Transcript and Presenter's Notes

1
Code Compaction for UniCore on Link-Time Optimization Platform

Zhang Jiyu
Compilation Toolchain Group
MPRC

2
Compilation Process
3
Our Optimization Process
4
CLOU is a Link-time Optimizer for UniCore
[Figure: CLOU flow, a graph modified from Diablo. Stages: linking of object files (code, data, meta), translation to IR, CFG construction, optimizations, layout, assembling; output: executable]
5
Code Compaction based on CLOU

Motivation of code compaction
Limited memory and energy resources in embedded systems
Code density affects both memory and energy consumption
Goal: reduce code size without losing performance
Code compaction at different levels
1. Typical optimizations for code size reduction at link time
2. Hot/cold code splitting
3. New mixed code generation method

6
Typical Optimizations for Code Size Reduction

Redundant code elimination
Computations whose results have already been computed and are guaranteed to be available at that point
Unreachable code elimination
Code fragments to which there is no control flow path from the entry node
Many of them follow useless comparisons
Dead code elimination
Computations whose results are never used
Peephole optimization
Procedural abstraction (might lead to performance loss)
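
Unreachable code elimination as described above can be sketched as a reachability walk over the control flow graph. This is a minimal illustration, not CLOU's implementation: the dictionary-based CFG, the block names, and the function `remove_unreachable` are all hypothetical.

```python
from collections import deque


def remove_unreachable(cfg, entry):
    """Return a copy of cfg keeping only blocks reachable from entry.

    cfg maps a block name to the list of its successor block names
    (a hypothetical representation chosen for this sketch).
    """
    reachable = {entry}
    worklist = deque([entry])
    while worklist:
        block = worklist.popleft()
        for succ in cfg.get(block, []):
            if succ not in reachable:
                reachable.add(succ)
                worklist.append(succ)
    return {b: list(succs) for b, succs in cfg.items() if b in reachable}


# Hypothetical CFG: a useless comparison in "entry" has been folded, so only
# the edge to "then" remains and "else" has no path from the entry node.
cfg = {
    "entry": ["then"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}
compacted = remove_unreachable(cfg, "entry")
```

In this example the block `else` is dropped, while `entry`, `then`, and `exit` are kept.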

7
Experiments for Typical Optimizations for Code Size Reduction

Benchmark: Mediabench
Code size reduction
Average: 12.8%
Max: 22.3%
Performance improvement
Average: 2.4%
Max: 4.2%

8
Hot/Cold Code Splitting

Less code transferred from remote to local, from disk to memory, or from memory to cache
Question: might it be too conservative or lead to performance loss?
Hot and cold code are split through basic block reordering

9
Hot/Cold Code Splitting

PH: a popular greedy approach
Structural-analysis-based basic block reordering
Most parts of a program can be decomposed into several typical structures
Cost model for each structure
Minimal-cost layout → optimal layout for each local structure based on profiling information
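
For comparison with the structural approach, a greedy chain-building layout can be sketched as follows. PH is read here as the Pettis-Hansen style of bottom-up chaining, which is an assumption since the slide does not expand the abbreviation. The sketch is simplified (the final ordering of chains is reduced to "entry chain first"), and the block names, frequencies, and function name are hypothetical.

```python
def greedy_chains(blocks, edge_freq, entry):
    """Greedy bottom-up chaining: hottest edges become fall-throughs first."""
    chain_of = {b: [b] for b in blocks}  # block -> the chain (list) holding it
    # Visit edges from hottest to coldest (ties keep insertion order).
    for (src, dst), _ in sorted(edge_freq.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts a different one.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Collect the distinct chains, then place the entry chain first.
    chains = []
    for b in blocks:
        c = chain_of[b]
        if not any(c is seen for seen in chains):
            chains.append(c)
    chains.sort(key=lambda c: entry not in c)  # stable sort
    return [b for c in chains for b in c]


# Hypothetical diamond: A branches to a cold block B or a hot block C,
# and both rejoin at D.
blocks = ["A", "B", "C", "D"]
edge_freq = {("A", "B"): 10, ("A", "C"): 90, ("B", "D"): 10, ("C", "D"): 90}
layout = greedy_chains(blocks, edge_freq, "A")
```

Here the hot path A, C, D ends up contiguous and the cold block B is moved to the end, which is the hot/cold splitting effect the slides aim for.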

10
Basic Block Reordering

Cost Model
Different kinds of control flow edges have different costs
For a specific order,
A list can be obtained for each structure
f(structure, frequencies of all edges) → the best order of basic blocks for the local structure

[Figure: control flow edges]
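
The mapping from a structure and its edge frequencies to the best block order can be illustrated by exhaustive search over one small local structure. This is only a sketch under stated assumptions: the two cost constants are invented for illustration (the slide does not give the actual edge costs), and brute force is used for clarity, not the O(N log N) method reported in the experiments.

```python
from itertools import permutations

FALL_THROUGH_COST = 0  # assumed: successor is laid out right after the block
TAKEN_BRANCH_COST = 1  # assumed: control transfers to a non-adjacent block


def layout_cost(order, edge_freq):
    """Sum edge frequency times edge cost for one candidate block order."""
    pos = {b: i for i, b in enumerate(order)}
    total = 0
    for (src, dst), freq in edge_freq.items():
        adjacent = pos[dst] == pos[src] + 1
        total += freq * (FALL_THROUGH_COST if adjacent else TAKEN_BRANCH_COST)
    return total


def best_order(blocks, edge_freq, entry):
    """Try every order that keeps the entry block first; return the cheapest."""
    rest = [b for b in blocks if b != entry]
    candidates = ([entry, *p] for p in permutations(rest))
    return min(candidates, key=lambda order: layout_cost(order, edge_freq))


# Hypothetical if-then-else structure with profiled edge frequencies.
blocks = ["A", "B", "C", "D"]
edge_freq = {("A", "B"): 10, ("A", "C"): 90, ("B", "D"): 10, ("C", "D"): 90}
order = best_order(blocks, edge_freq, "A")
cost = layout_cost(order, edge_freq)
```

Because a typical structure has only a handful of blocks, the best order for each structure could in principle be precomputed once per frequency pattern, which is one way to read the "list for each structure" mentioned above.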
11
Experiments

Complexity: O(N log N), where N is the number of basic blocks
Experiment results (not using other link-time optimizations)
[Figures: normalized cycle counts; normalized cache miss rate]

12
Mixed Code Generation

Dual-width Instruction Set
32-bit ISA: more powerful
16-bit ISA: more compact
Less coding space for operations
Less space for register fields
Less space for immediate fields

32-bit:
add r0, r0, 0xff800000
16-bit:
str r2, addr
mov r2, 0xff
lsl r2, 1
add r2, 1
lsl r2, 23
add r0, r2
ld r2, addr
13
Mixed Code Generation

Related work in dual-width instruction set design and mixed code generation
Coarse-grained function-level mixed code generation
By BX in ARM and JALX in MIPS
Simple fine-grained instruction-level mixed code generation
By BX in ARM and JALX in MIPS
By a single specific mode-changing instruction
Specialized coding
A one-leading instruction word indicates one 32-bit instruction
A zero-leading instruction word indicates two 16-bit instructions
16-bit ISA extensions
Problem: always leads to performance loss
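
The specialized coding scheme listed above can be sketched as a decoder that inspects the leading bit of each 32-bit word. The bit layout is an assumption made for illustration: the leading bit is taken to be the most significant bit, and a zero-leading word is simply split into its upper and lower 16-bit halves. The function name `split_word` is hypothetical.

```python
def split_word(word):
    """Classify one 32-bit instruction word by its leading (top) bit."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit unsigned value")
    if word >> 31:
        # Leading one: the whole word is a single 32-bit instruction.
        return [("32-bit", word)]
    # Leading zero: the word carries two 16-bit instructions
    # (assumed order: upper half first, then lower half).
    return [("16-bit", (word >> 16) & 0xFFFF), ("16-bit", word & 0xFFFF)]


wide = split_word(0x80000001)
pair = split_word(0x12345678)
```

Note that under this scheme the leading bit is spent on the width marker in every word, which reduces the coding space left for the instructions themselves.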

14
Potential benefit

Analysis of Programs in Mediabench

27851 different instructions in all programs
log2(27851) ≈ 15
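
The logarithm above can be checked directly. As a small sketch, the number of bits needed to give each distinct instruction its own index is the base-2 logarithm rounded up; reading the figure as an index width is an interpretation, not something the slide states.

```python
import math

distinct_instructions = 27851

# Bits needed to assign a unique index to every distinct instruction.
exact = math.log2(distinct_instructions)
index_bits = math.ceil(exact)
```

The exact value lies between 14 and 15 because 2**14 = 16384 and 2**15 = 32768, so 15 bits are enough.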

Rank   Unicore32 Instruction   Average Percentage (%)
1      mov                     23
2      ldr                     16
3      cmp                     8
4      add                     8
5      str                     6
6      b                       5
Total                          66
15
Two Main Kinds of Frequent Instructions

Two-operand instructions
mov rd, rm (or short immediate)
cmp rn, rm (or short immediate)

Branch/Jump
Distribution of immediate offsets of branch instructions

16
The Idea of Mode-Changing Instruction Set (MC)

Extend the 32-bit ISA with a small MC Instruction Set (using the reserved coding space)
Changes the CPU mode
Performs its own normal operation
Scan for suitable 32-bit instructions to be encoded as 16-bit instructions
A mixed code fragment with MC instructions

32-bit instructions      32-bit instructions
MC instruction           UniCore16 instruction
UniCore16 instruction    UniCore16 instruction

UniCore16 instruction    UniCore16 instruction
MC instruction           UniCore16 instruction
32-bit instructions      32-bit instructions
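
The scanning step can be sketched as a search for runs of consecutive instructions that are each encodable in 16 bits; such runs are the candidates to be bracketed by MC instructions. This is a minimal sketch under assumptions: the per-instruction `encodable` flags, the `min_run` threshold, and the function name are hypothetical, and the encoding and size of the MC instructions themselves are not modeled.

```python
def find_compressible_runs(encodable, min_run=2):
    """Return (start, end) pairs, end exclusive, of maximal runs of
    instructions flagged as 16-bit encodable and at least min_run long."""
    runs = []
    start = None
    for i, ok in enumerate(encodable):
        if ok and start is None:
            start = i                      # a new run begins here
        elif not ok and start is not None:
            if i - start >= min_run:       # keep only runs worth switching for
                runs.append((start, i))
            start = None
    if start is not None and len(encodable) - start >= min_run:
        runs.append((start, len(encodable)))
    return runs


# Hypothetical flags for nine consecutive 32-bit instructions.
flags = [False, True, True, True, False, True, False, True, True]
runs = find_compressible_runs(flags)
```

The threshold reflects the idea that a mode change only pays off if enough instructions follow in 16-bit form; the right value would depend on the actual MC encoding.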
17
Modification to Micro Architecture

Mixed code execution in Unicore-I pipeline

Improved mixed code execution in Unicore-I pipeline

No extra cycles
One more 16-bit instruction-fetch buffer
An MC-decoder

18
Mixed Code Generation
[Figure: mixed code generation flow. Labels: Instruction Analyzer, Link-Time Optimizer, Simulator, program (four times), Mode-Changing Instructions, Mixed-coded Program]
19
Experiment Results

[Figure: normalized code size (results not using other link-time optimizations)]

20
Conclusion

Code compaction on Link-Time Optimization Platform
Compiler optimizations applied at link time
Typical optimizations for code size reduction
Program layout optimization
Hot/cold code splitting through basic block reordering
Machine code generation
Mixed code generation
Experiment Results
Average code size reduction: 32.9%
Average performance improvement: 9.1%

Thank you

23

Instruction Analysis

Instruction format type classifications
3 regs, all in r0-r7 / r8-r15 / r16-r23 / r24-r31
2 regs, one in r0-r31, one in r0-r16 / r17-r31
1 reg and 1 imme, imme field 4-6 bits
1 imme, imme field 9 bits
(reg: short for register; imme: short for immediate field)
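
The format types above can be turned into a small classifier that decides whether an instruction's operands fit one of them. This is a rough sketch, not the Instruction Analyzer itself: the function names and return labels are hypothetical, the "4-6 bits" immediate is assumed to mean at most 6 bits, immediates are assumed unsigned, and the two-register type is assumed to accept any pair of registers.

```python
def same_bank(regs, bank_size=8):
    """True if all registers fall in one bank (r0-r7, r8-r15, ...)."""
    return len({r // bank_size for r in regs}) == 1


def format_type(regs, imm=None):
    """Return a label for the matching format type, or None if none fits."""
    if any(not 0 <= r <= 31 for r in regs):
        raise ValueError("register numbers must be in r0-r31")
    if len(regs) == 3 and imm is None and same_bank(regs):
        return "3 regs"
    if len(regs) == 2 and imm is None:
        return "2 regs"
    if len(regs) == 1 and imm is not None and 0 <= imm < 2 ** 6:
        return "1 reg + imme"      # assumed upper bound: 6-bit immediate
    if len(regs) == 0 and imm is not None and 0 <= imm < 2 ** 9:
        return "1 imme"            # 9-bit immediate field
    return None


three_ok = format_type([1, 2, 3])      # all in r0-r7
three_bad = format_type([1, 2, 9])     # r9 is in a different bank
reg_imm = format_type([4], 63)
imm_only = format_type([], 511)
```

An instruction for which `format_type` returns `None` would stay in its 32-bit form under this reading.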
24
EXPERIMENT RESULTS