Title: Code Compaction for UniCore on Link-Time Optimization Platform
1 Code Compaction for UniCore on Link-Time Optimization Platform
- Zhang Jiyu
- Compilation Toolchain Group
- MPRC
2 Compilation Process
3 Our Optimization Process
4 CLOU is a Link-time Optimizer for UniCore
[Figure: CLOU processing flow, a graph modified from Diablo. Stages shown: linking of the code, data, and meta sections of the input objects; translation to IR; CFG construction; optimizations; layout; assembling; executable output.]
5 Code Compaction based on CLOU
- Motivation of code compaction
  - Limited memory and energy resources for embedded systems
  - Code density affects both memory and energy consumption
  - Goal: reducing code size without losing performance
- Code compaction at different levels
  - 1. Typical optimizations for code size reduction at link time
  - 2. Hot/cold code splitting
  - 3. New mixed code generation method
6 Typical Optimizations for Code Size Reduction
- Redundant code elimination
  - Computations whose results have already been computed and are guaranteed to be available at that point
- Unreachable code elimination
  - Code fragments to which there is no control flow path from the entry node
  - Many of them follow useless comparisons
- Dead code elimination
  - Computations whose results are never used
- Peephole optimization
- Procedural abstraction (might lead to performance loss)
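Unreachable code elimination can be sketched as a reachability walk over the control flow graph. This is a minimal illustration, not CLOU's implementation: the dictionary-based CFG, the block names, and the function name `remove_unreachable` are all hypothetical.

```python
from collections import deque

def remove_unreachable(cfg, entry):
    """Return a copy of cfg keeping only blocks reachable from entry.

    cfg maps a block name to the list of its successor block names
    (a hypothetical representation chosen for this sketch).
    """
    reachable = {entry}
    worklist = deque([entry])
    while worklist:
        block = worklist.popleft()
        for succ in cfg.get(block, []):
            if succ not in reachable:
                reachable.add(succ)
                worklist.append(succ)
    return {b: list(s) for b, s in cfg.items() if b in reachable}

# A useless comparison has been folded, so "check" no longer branches to "else".
example_cfg = {
    "entry": ["check"],
    "check": ["then"],
    "then": ["exit"],
    "else": ["exit"],   # no control flow path from the entry node
    "exit": [],
}
compacted = remove_unreachable(example_cfg, "entry")
```

In this example the block `else` is dropped because no path from `entry` reaches it, which mirrors the case of code following a useless comparison.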
7 Experiments for Typical Optimizations for Code Size Reduction
- Benchmark: Mediabench
- Code size reduction
  - Average: 12.8%
  - Max: 22.3%
- Performance improvement
  - Average: 2.4%
  - Max: 4.2%
8 Hot/Cold Code Splitting
- Less code transferred from remote to local, from disk to memory, or from memory to cache
- Question: might it be too conservative or lead to performance loss?
- Hot and cold code are split through basic block reordering
9 Hot/Cold Code Splitting
- PH (Pettis-Hansen): a popular greedy approach
- Structural analysis based basic block reordering
  - Most of a program can be decomposed into several typical structures
  - Cost model for each structure
  - Minimal-cost layout -> optimal layout for each local structure, based on profiling information
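The greedy PH approach mentioned above can be sketched as bottom-up chain merging: edges are visited from hottest to coldest, and two chains are joined when the edge runs from the tail of one to the head of another. This is a simplified illustration under stated assumptions, not the code used in the experiments; the function name `ph_layout`, the edge-frequency dictionary, and the rule for ordering the final chains (entry chain first, the rest in block order) are hypothetical.

```python
def ph_layout(blocks, edge_freq, entry):
    """Greedy chain merging in the style of Pettis-Hansen (simplified sketch)."""
    chain_of = {b: [b] for b in blocks}   # every block starts as its own chain
    hottest_first = sorted(edge_freq.items(), key=lambda kv: -kv[1])
    for (src, dst), _freq in hottest_first:
        if dst == entry:
            continue                      # keep the entry block at the head
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts a different one.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Emit the entry chain first, then the remaining chains in block order.
    seen, layout = set(), []
    for blk in [entry] + list(blocks):
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout

blocks = ["A", "B", "C", "D"]
edge_freq = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
layout = ph_layout(blocks, edge_freq, "A")
```

With these profile counts the hot path A, B, D becomes contiguous and the cold block C is pushed to the end, which is the hot/cold split the slide refers to.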
10 Basic Block Reordering
- Cost model
  - Different kinds of control flow edges have different costs
  - For a specific order, the cost is computed over the control flow edges [equation not recovered]
- A list can be obtained for each structure
  - f(structure, frequencies of all edges) -> the best order of basic blocks for the local structure
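The idea of a cost model over edge kinds can be illustrated as follows. The actual cost formula is not recoverable from the slide, so everything numeric here is an assumption: the three edge kinds, the weights in `EDGE_COST`, and the names `layout_cost` and `best_order` are hypothetical. The sketch also searches all orders exhaustively, which is only reasonable for a small local structure and is not the O(N log N) method of the talk.

```python
from itertools import permutations

# Hypothetical per-execution costs for each kind of control flow edge.
EDGE_COST = {"fallthrough": 0, "forward": 2, "backward": 3}

def layout_cost(order, edge_freq):
    """Sum frequency * cost over all edges for one specific block order."""
    pos = {b: i for i, b in enumerate(order)}
    total = 0
    for (src, dst), freq in edge_freq.items():
        if pos[dst] == pos[src] + 1:
            kind = "fallthrough"
        elif pos[dst] > pos[src]:
            kind = "forward"
        else:
            kind = "backward"
        total += freq * EDGE_COST[kind]
    return total

def best_order(blocks, edge_freq):
    """Try every order that keeps the first block first; return the cheapest."""
    entry, rest = blocks[0], blocks[1:]
    candidates = ([entry, *p] for p in permutations(rest))
    return min(candidates, key=lambda order: layout_cost(order, edge_freq))

# An if-then-else structure whose "then" side is hot.
if_else = ["cond", "then", "else", "join"]
freq = {("cond", "then"): 90, ("cond", "else"): 10,
        ("then", "join"): 90, ("else", "join"): 10}
order = best_order(if_else, freq)
cost = layout_cost(order, freq)
```

Under these assumed weights the cheapest order places the cold `else` block after `join`, so the hot path runs entirely on fall-through edges.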
11 Experiments
- Complexity: O(N log N), where N is the number of basic blocks
- Experiment results (not using other link-time optimizations)
  - [Charts: normalized cycle counts; normalized cache miss rate]
12 Mixed Code Generation
- Dual-width instruction set
  - 32-bit ISA: more powerful
  - 16-bit ISA: more compact
    - Less coding space for operations
    - Smaller register fields
    - Smaller immediate fields
32-bit: add r0, r0, 0xff800000
16-bit: str r2, addr
        mov r2, 0xff
        lsl r2, 1
        add r2, 1
        lsl r2, 23
        add r0, r2
        ld  r2, addr
13 Mixed Code Generation
- Related work in dual-width instruction set design and mixed code generation
  - Coarse-grained function-level mixed code generation
    - By BX in ARM and JALX in MIPS
  - Simple fine-grained instruction-level mixed code generation
    - By BX in ARM and JALX in MIPS
    - By a single specific mode-changing instruction
  - Specialized coding
    - A one-leading instruction word indicates one 32-bit instruction
    - A zero-leading instruction word indicates two 16-bit instructions
  - 16-bit ISA extensions
- Problem: always leads to performance loss
14 Potential benefit
- Analysis of programs in Mediabench
  - 27851 different instructions in all programs
  - log2(27851) ≈ 15

Rank   Unicore32 instruction   Average percentage (%)
1      mov                     23
2      ldr                     16
3      cmp                     8
4      add                     8
5      str                     6
6      b                       5
Total                          66
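The two numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes the logarithm is base 2 (the slide does not state the base, but only base 2 gives a value near 15), so the result reads as the number of bits needed to index every distinct instruction.

```python
import math

distinct_instructions = 27851

# Bits needed to give each distinct instruction its own index.
bits_needed = math.ceil(math.log2(distinct_instructions))

# Share of the six most frequent Unicore32 instructions, from the table.
shares = {"mov": 23, "ldr": 16, "cmp": 8, "add": 8, "str": 6, "b": 5}
top_six_share = sum(shares.values())
```

Since 2^14 = 16384 < 27851 < 32768 = 2^15, fifteen bits are enough, and the six listed instructions together account for the 66 percent total shown in the table.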
15 Two Main Kinds of Frequent Instructions
- Two-operand instructions
  - mov rd, rm (or short immediate)
  - cmp rn, rm (or short immediate)
- Branch/Jump
  - [Chart: distribution of immediate offsets of branch instructions]
16 The Idea of Mode-Changing Instruction Set (MC)
- Extend the 32-bit ISA with a small MC instruction set (using the reserved coding space)
  - Change the CPU mode
  - Perform its own normal operation
- Scan for suitable 32-bit instructions to be encoded into 16-bit instructions
- A mixed code fragment with MC instructions:

32-bit instructions      32-bit instructions
MC instruction           UniCore16 instruction
UniCore16 instruction    UniCore16 instruction
UniCore16 instruction    UniCore16 instruction
MC instruction           UniCore16 instruction
32-bit instructions      32-bit instructions
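The scanning step can be pictured as a search for runs of consecutive instructions that all have a 16-bit form. The sketch below is only an illustration under loud assumptions: the predicate `fits_16bit`, the tuple representation of instructions, the `min_run` threshold, and the saving of 2 bytes per converted instruction are hypothetical, and the encoding of the MC instructions themselves is ignored.

```python
def find_16bit_runs(instrs, fits_16bit, min_run=2):
    """Return (start, end) index pairs of runs that could switch to 16-bit mode."""
    runs, start = [], None
    for i, ins in enumerate(instrs):
        if fits_16bit(ins):
            if start is None:
                start = i                 # a new candidate run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    if start is not None and len(instrs) - start >= min_run:
        runs.append((start, len(instrs)))
    return runs

def fits_16bit(ins):
    # Hypothetical rule: only two-operand mov/cmp are treated as compressible.
    mnemonic, operands = ins
    return mnemonic in {"mov", "cmp"} and len(operands) == 2

code = [
    ("add", ["r0", "r0", "0xff800000"]),
    ("mov", ["r1", "r2"]),
    ("cmp", ["r1", "r3"]),
    ("mov", ["r4", "r1"]),
    ("ldr", ["r5", "addr"]),
    ("mov", ["r6", "r5"]),
]
runs = find_16bit_runs(code, fits_16bit)
saved_bytes = sum(2 * (end - start) for start, end in runs)
```

Here only the three-instruction run in the middle qualifies; the isolated `mov` at the end is left in 32-bit form because a single instruction is below the assumed threshold.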
17 Modification to Micro Architecture
- Mixed code execution in Unicore-I pipeline
- Improved mixed code execution in Unicore-I pipeline
  - No extra cycles
  - One more 16-bit instruction-fetch buffer
  - An MC decoder
18 Mixed Code Generation
[Figure: mixed code generation tool flow. Components shown: input programs, Instruction Analyzer, Mode-Changing Instructions, Link-Time Optimizer, mixed coded program, Simulator.]
19 Experiment Results
- Normalized code size (results not using other link-time optimizations)
20 Conclusion
- Code compaction on Link-Time Optimization Platform
  - Compiler optimizations applied at link time
    - Typical optimizations for code size reduction
  - Program layout optimization
    - Hot/cold code splitting through basic block reordering
  - Machine code generation
    - Mixed code generation
- Experiment results
  - Average code size reduction: 32.9%
  - Average performance improvement: 9.1%
23 Instruction format type classifications
- 3 regs, all in r0-r7 / r8-r15 / r16-r23 / r24-r31
- 2 regs, one in r0-r31, one in r0-r16 / r17-r31
- 1 reg and 1 imme, imme field 4-6 bits
- 1 imme, imme field 9 bits
(reg: short for register; imme: short for immediate field)
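The first format class can be read as a test that all three registers fall into the same group of eight. That reading is an assumption, as are the function name `same_bank` and the use of plain integers for register numbers; the sketch only shows how such a check might look.

```python
def same_bank(regs, bank_size=8):
    """True if every register number lies in the same group of bank_size registers."""
    return len({r // bank_size for r in regs}) == 1

# r0, r3, r7 share the r0-r7 group; r7, r8, r9 straddle two groups.
low_group = same_bank([0, 3, 7])
straddling = same_bank([7, 8, 9])
high_group = same_bank([24, 31, 27])
```

Under this reading, a three-register instruction would need only three bits per register field plus a group selector, which is consistent with the smaller register fields of the 16-bit ISA.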
24 Experiment Results
- Normalized dynamic instruction counts