Code Compaction for UniCore on Link-Time Optimization Platform - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Code Compaction for UniCore on Link-Time Optimization Platform
  • Zhang Jiyu
  • Compilation Toolchain Group
  • MPRC

2
Compilation Process
3
Our Optimization Process
4
CLOU is a Link-time Optimizer for UniCore
[Figure, modified from Diablo: code, data, and metadata from several input files → linking → translation to IR → CFG construction → optimizations → layout → assembling → executable]
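The CFG construction step in the pipeline above begins by partitioning the linked instruction stream into basic blocks. The sketch below shows the classic leader-based partitioning under stated assumptions: the `Insn` type, the mnemonic sets, and the use of instruction indices as branch targets are all hypothetical and are not CLOU's or Diablo's actual data structures. Edges would then connect each block to its branch targets and to its fall-through successor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical mnemonics: which instructions branch, and which end a basic block.
BRANCHES = {"b", "beq", "bne"}
BLOCK_ENDERS = BRANCHES | {"ret"}


@dataclass
class Insn:
    op: str                       # mnemonic, e.g. "mov", "cmp", "beq"
    target: Optional[int] = None  # branch target as an instruction index (assumed encoding)


def find_leaders(code: List[Insn]) -> List[int]:
    """Return the sorted indices of instructions that start a basic block."""
    leaders = {0}  # the first instruction always starts a block
    for i, insn in enumerate(code):
        if insn.op in BRANCHES and insn.target is not None:
            leaders.add(insn.target)  # every branch target starts a block
        if insn.op in BLOCK_ENDERS and i + 1 < len(code):
            leaders.add(i + 1)        # so does the instruction after a branch or return
    return sorted(leaders)


def split_blocks(code: List[Insn]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, one per basic block."""
    bounds = find_leaders(code) + [len(code)]
    return list(zip(bounds, bounds[1:]))
```

For `cmp; beq 3; mov; add; ret`, the leaders are instructions 0, 2 (after the branch) and 3 (the branch target), giving three blocks.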
5
Code Compaction based on CLOU
  • Motivation for code compaction
  • Embedded systems have limited memory and energy
    resources
  • Code density affects both memory and energy
    consumption
  • Goal: reduce code size without losing
    performance
  • Code compaction at different levels
  • 1. Typical optimizations for code size
    reduction at link time
  • 2. Hot/cold code splitting
  • 3. A new mixed code generation method

6
Typical Optimizations for Code Size Reduction
  • Redundant code elimination
  • Computations whose results have been computed
    previously and are guaranteed to be available at
    that point
  • Unreachable code elimination
  • Code fragments to which there is no control flow
    path from the entry node
  • Many of them follow useless comparisons
  • Dead code elimination
  • Computations whose results are never used
  • Peephole optimization
  • Procedural abstraction (may lead to
    performance loss)
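Unreachable code elimination and dead code elimination are, respectively, a graph reachability problem and a backward liveness problem. The minimal sketch below assumes a CFG stored as a dict from block name to successor names, and instructions modeled as hypothetical `(destination, sources)` pairs with no side effects; stores, calls, and flag-setting instructions would have to be kept in a real optimizer. It is an illustration of the techniques, not CLOU's implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple

CFG = Dict[str, List[str]]  # block name -> successor block names


def reachable(cfg: CFG, entry: str) -> Set[str]:
    """Blocks with a control flow path from the entry node (iterative DFS)."""
    seen, stack = {entry}, [entry]
    while stack:
        for succ in cfg.get(stack.pop(), []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def remove_unreachable(cfg: CFG, entry: str) -> CFG:
    """Unreachable code elimination: keep only blocks reachable from the entry."""
    live = reachable(cfg, entry)
    return {block: succs for block, succs in cfg.items() if block in live}


# (destination register, source registers); instructions assumed side-effect free
Insn = Tuple[str, List[str]]


def remove_dead(block: List[Insn], live_out: Iterable[str]) -> List[Insn]:
    """Dead code elimination in one block: drop writes whose result is never read."""
    live = set(live_out)
    kept = []
    for dest, srcs in reversed(block):  # walk backwards, tracking live registers
        if dest in live:
            kept.append((dest, srcs))
            live.discard(dest)  # this write defines dest ...
            live.update(srcs)   # ... and needs its sources
    kept.reverse()
    return kept
```

In a CFG where block `B` has no predecessors, `remove_unreachable` drops `B`; in `r1 = f(r0); r2 = g(r0); r3 = h(r1)` with only `r3` live afterwards, `remove_dead` drops the write to `r2`.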

7
Experiments for Typical Optimizations for Code
Size Reduction
  • Benchmark: Mediabench
  • Code size reduction
  • Average: 12.8%
  • Max: 22.3%
  • Performance improvement
  • Average: 2.4%
  • Max: 4.2%

8
Hot/Cold Code Splitting
  • Less code transferred from remote to local, from
    disk to memory, or from memory to cache
  • Question: might this be too conservative, or lead
    to performance loss?
  • Approach: split hot and cold code through basic
    block reordering
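One simple way to decide which blocks count as hot is sketched below, assuming per-block execution counts from profiling. The coverage heuristic (hot blocks are the most frequently executed ones that together account for a chosen share of all executions), the `coverage` parameter, and the function name are hypothetical illustrations, not the method from these slides, which obtains the split through basic block reordering.

```python
from typing import Dict, List


def split_hot_cold(order: List[str], counts: Dict[str, int], coverage: float = 0.99) -> List[str]:
    """Place hot blocks first and cold blocks after, each group in its original order.

    A block is "hot" if it is among the most frequently executed blocks that
    together cover `coverage` of all profiled block executions (assumed heuristic).
    """
    total = sum(counts.get(b, 0) for b in order)
    hot, covered = set(), 0
    for block in sorted(order, key=lambda b: counts.get(b, 0), reverse=True):
        if covered >= coverage * total:
            break  # enough executions are covered; the remaining blocks are cold
        hot.add(block)
        covered += counts.get(block, 0)
    return [b for b in order if b in hot] + [b for b in order if b not in hot]
```

With counts A=100, B=1, C=50, D=0, blocks A and C already cover more than 90% of executions, so the layout becomes A, C, B, D and the rarely executed code moves out of the hot region.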

9
Hot/Cold Code Splitting
  • PH (Pettis-Hansen): a popular greedy approach
  • Structural-analysis-based basic block reordering
  • Most of a program can be decomposed into several
    typical structures
  • A cost model for each structure
  • Minimal-cost layout: the optimal layout for each
    local structure, based on profiling information
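For comparison with the structural approach, the PH greedy baseline can be sketched as bottom-up chain merging: edges are visited from most to least frequent, and two chains are joined whenever that turns the edge into a fall-through. This is a simplified sketch assuming profiled edge counts as `(source, target, count)` tuples; the original algorithm also orders the resulting chains, which here are simply emitted in order of first appearance.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (source block, target block, profiled execution count)


def pettis_hansen_chains(blocks: List[str], edges: List[Edge]) -> List[List[str]]:
    """Greedy bottom-up chain formation in the style of Pettis and Hansen.

    blocks[0] is assumed to be the entry block; it is never appended behind another block.
    """
    chain_of: Dict[str, List[str]] = {b: [b] for b in blocks}  # each block starts alone
    for src, dst, _count in sorted(edges, key=lambda e: e[2], reverse=True):
        src_chain, dst_chain = chain_of[src], chain_of[dst]
        # Merge only if the edge becomes a fall-through: src ends one chain
        # and dst starts a different chain.
        if (src_chain is not dst_chain and src_chain[-1] == src
                and dst_chain[0] == dst and dst != blocks[0]):
            src_chain.extend(dst_chain)
            for b in dst_chain:
                chain_of[b] = src_chain
    # Emit each distinct chain once, in order of first appearance of any of its
    # blocks; the entry chain therefore comes first.
    chains, seen = [], set()
    for b in blocks:
        chain = chain_of[b]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains
```

For a diamond A → {B, C} → D where the A → C → D path carries 90% of the executions, the sketch yields the chains [A, C, D] and [B], so the cold block B is placed after the hot path.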

10
Basic Block Reordering
  • Cost model
  • Different kinds of control flow edges have
    different costs
  • For a specific order:
  • A list can be obtained for each structure
  • f(structure, frequencies of all edges) → the
    best order of basic blocks for the local structure

[Figure: control flow edges]
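The mapping f(structure, edge frequencies) → best order can be illustrated for one tiny structure by exhaustive search. The sketch below assumes only two edge kinds with made-up costs (a free fall-through and a fixed penalty for a taken branch); the slides distinguish more kinds of edges, and this brute force is exponential in the structure size, so it only illustrates the cost model and does not reflect the complexity of the authors' method.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Assumed per-execution costs of two edge kinds (illustrative values, not measured).
FALL_THROUGH_COST = 0  # target is laid out right after the source
TAKEN_BRANCH_COST = 2  # target is elsewhere, so a branch must be taken


def layout_cost(order: List[str], edge_freq: Dict[Tuple[str, str], int]) -> int:
    """Sum of frequency x edge cost over all edges, for one block order."""
    pos = {b: i for i, b in enumerate(order)}
    cost = 0
    for (src, dst), freq in edge_freq.items():
        kind_cost = FALL_THROUGH_COST if pos[dst] == pos[src] + 1 else TAKEN_BRANCH_COST
        cost += freq * kind_cost
    return cost


def best_local_order(blocks: List[str], edge_freq: Dict[Tuple[str, str], int]) -> List[str]:
    """Try every order of a small structure (entry block fixed first); keep the cheapest."""
    entry, rest = blocks[0], blocks[1:]
    candidates = ([entry, *perm] for perm in permutations(rest))
    return min(candidates, key=lambda order: layout_cost(order, edge_freq))
```

For an if-then-else whose else arm runs 90 times out of 100, the cheapest order is cond, else, join, then (cost 40 under these assumed costs): the frequent path falls through and the rarely executed then block moves to the end.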
11
Experiments
  • Complexity: O(N log N), where N is the number of
    basic blocks
  • Experimental results (without other link-time
    optimizations)
[Figures: normalized cycle counts; normalized cache miss rate]

12
Mixed Code Generation
  • Dual-width instruction set
  • 32-bit ISA: more powerful
  • 16-bit ISA: more compact
  • Less coding space for operations
  • Narrower register fields
  • Narrower immediate fields

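Example: add the constant 0xff800000 to r0
32-bit version (1 instruction):
    add r0, r0, 0xff800000
16-bit version (7 instructions, r2 saved and restored):
    str r2, addr
    mov r2, 0xff
    lsl r2, 1
    add r2, 1
    lsl r2, 23
    add r0, r2
    ld r2, addr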
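To check the arithmetic behind the example, the sketch below replays the mov/lsl/add/lsl steps on r2, assuming the 16-bit instructions behave as ordinary 32-bit integer operations. It shows that 0xff800000 is 0x1ff shifted left by 23 bits (a shift of 24 would give 0xff000000), and that the 16-bit version occupies 14 bytes against 4 bytes for the single 32-bit instruction, which is why a blind 16-bit translation can grow code.

```python
MASK32 = 0xFFFFFFFF  # registers assumed to be 32 bits wide


def lsl(value: int, shift: int) -> int:
    """Logical shift left, truncated to 32 bits."""
    return (value << shift) & MASK32


def synthesize(final_shift: int) -> int:
    """Replay mov/lsl/add/lsl of the 16-bit sequence and return the value left in r2."""
    r2 = 0xFF               # mov r2, 0xff
    r2 = lsl(r2, 1)         # lsl r2, 1  -> 0x1fe
    r2 = (r2 + 1) & MASK32  # add r2, 1  -> 0x1ff
    return lsl(r2, final_shift)


# Code size in bytes: one 32-bit instruction vs. seven 16-bit instructions
# (str, mov, lsl, add, lsl, add, ld).
SIZE_32BIT = 1 * 4
SIZE_16BIT = 7 * 2
```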
13
Mixed Code Generation
  • Related work on dual-width instruction set
    design and mixed code generation
  • Coarse-grained, function-level mixed code
    generation
  • By BX in ARM and JALX in MIPS
  • Simple fine-grained, instruction-level mixed code
    generation
  • By BX in ARM and JALX in MIPS
  • By a single specific mode-changing instruction
  • Specialized encoding
  • An instruction word with a leading 1 bit holds
    one 32-bit instruction
  • An instruction word with a leading 0 bit holds
    two 16-bit instructions
  • 16-bit ISA extensions
  • Problem: these approaches always lead to
    performance loss
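The specialized encoding above can be pictured as a decoder that inspects the leading bit of each 32-bit fetch word. The sketch assumes a hypothetical bit layout (flag in bit 31, upper halfword decoded first); the actual layouts of such schemes are not given on the slide. In this layout the flag bit is taken from the encoding space, leaving 31 bits for a 32-bit instruction and 15 for the first instruction of each 16-bit pair.

```python
from typing import List, Tuple


def decode_word(word: int) -> List[Tuple[str, int]]:
    """Split one 32-bit fetch word under a leading-bit scheme (hypothetical layout).

    Leading bit 1: the whole word is one 32-bit instruction.
    Leading bit 0: the word holds two 16-bit instructions, upper half first.
    """
    if (word >> 31) & 1:
        return [("32-bit", word)]
    first = (word >> 16) & 0xFFFF  # its top bit is the 0 flag
    second = word & 0xFFFF
    return [("16-bit", first), ("16-bit", second)]
```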

14
Potential benefit
  • Analysis of programs in Mediabench
  • 27851 distinct instructions across all programs
  • log2(27851) ≈ 15

Rank  UniCore32 instruction  Average percentage
1     mov                    23%
2     ldr                    16%
3     cmp                    8%
4     add                    8%
5     str                    6%
6     b                      5%
Total                        66%
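The two numbers on this slide can be checked directly: since 2^14 = 16384 < 27851 ≤ 32768 = 2^15, 15 bits suffice to index every distinct instruction observed, which in principle fits in a 16-bit word; and the six most frequent instructions in the table account for 66% of the total. The constant and function names below are illustrative.

```python
import math

DISTINCT_INSTRUCTIONS = 27851  # distinct instructions found across Mediabench (from the analysis above)

# Average percentages of the six most frequent UniCore32 instructions (from the table above).
TOP_SIX = {"mov": 23, "ldr": 16, "cmp": 8, "add": 8, "str": 6, "b": 5}


def index_bits(n: int) -> int:
    """Smallest b with 2**b >= n, i.e. ceil(log2(n)) for n >= 1, using exact integer math."""
    return (n - 1).bit_length()


LOG2_DISTINCT = math.log2(DISTINCT_INSTRUCTIONS)  # about 14.77
```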
15
Two Main Kinds of Frequent Instructions
  • Two-operand instructions
  • mov rd, rm (or a short immediate)
  • cmp rn, rm (or a short immediate)
  • Branch/jump instructions
  • Distribution of the immediate offsets of branch
    instructions
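Whether a `mov`/`cmp` immediate or a branch offset can use a 16-bit form depends on the width of the available field. The sketch below checks values against unsigned and two's-complement fields; the field widths used in the checks (6 and 9 bits) come from the instruction format classification later in this deck, but whether offsets are signed, and in what unit they are counted, are assumptions. The offsets in the usage check are made up.

```python
from typing import Iterable


def fits_unsigned(value: int, bits: int) -> bool:
    """True if value can be stored in an unsigned field of the given width."""
    return 0 <= value < (1 << bits)


def fits_signed(value: int, bits: int) -> bool:
    """True if value can be stored in a two's-complement field of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def fraction_fitting(offsets: Iterable[int], bits: int) -> float:
    """Share of branch offsets that a signed field of `bits` bits could encode."""
    offsets = list(offsets)
    if not offsets:
        return 0.0
    return sum(fits_signed(o, bits) for o in offsets) / len(offsets)
```

A 9-bit signed field covers offsets from -256 to 255, so of the sample offsets 10, -300, 200, and 1000, half fit.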

16
The Idea of Mode-Changing Instruction Set (MC)
  • Extend the 32-bit ISA with a small MC
    instruction set (using the reserved coding space)
  • An MC instruction changes the CPU mode
  • It also performs its own normal operation
  • Scan for suitable 32-bit instructions to be
    encoded as 16-bit instructions
  • Result: a mixed code fragment with MC instructions

[Figure: a mixed code fragment in which MC instructions separate 32-bit instructions from runs of UniCore16 instructions]
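The scanning step can be sketched as a search for maximal runs of instructions that have a 16-bit equivalent; each such run is a candidate for being entered through an MC instruction. Instructions are modeled as mnemonic strings, and both the `encodable` predicate and the `min_run` threshold are hypothetical. A real analyzer would also check register and immediate fields against the 16-bit formats, as well as branch targets.

```python
from typing import Callable, List, Sequence, Tuple


def find_16bit_runs(
    code: Sequence[str],
    encodable: Callable[[str], bool],
    min_run: int = 2,
) -> List[Tuple[int, int]]:
    """Return maximal runs of 16-bit-encodable instructions as (start, end) pairs, end exclusive.

    Runs shorter than `min_run` are skipped, since switching modes is assumed
    not to pay off for them.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, insn in enumerate(code):
        if encodable(insn):
            if start is None:
                start = i  # a new candidate run begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    if start is not None and len(code) - start >= min_run:
        runs.append((start, len(code)))  # run reaching the end of the code
    return runs
```

Raising `min_run` trades coverage for fewer mode switches: with `min_run=1` every encodable instruction is a candidate, while larger values keep only long runs.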
17
Modification to Micro Architecture
  • Mixed code execution in the Unicore-I pipeline
  • Improved mixed code execution in the Unicore-I
    pipeline
  • No extra cycles
  • One additional 16-bit instruction-fetch buffer
  • An MC decoder

18
Mixed Code Generation
[Figure: mixed code generation flow with an instruction analyzer, the link-time optimizer, mode-changing instructions, the resulting mixed-coded program, and a simulator]
19
Experiment Results
  • Normalized code size (results without other
    link-time optimizations)


20
Conclusion
  • Code compaction on a link-time optimization
    platform
  • Compiler optimizations applied at link time
  • Typical optimizations for code size reduction
  • Program layout optimization
  • Hot/cold code splitting through basic block
    reordering
  • Machine code generation
  • Mixed code generation
  • Experimental results
  • Average code size reduction: 32.9%
  • Average performance improvement: 9.1%

21
  • Thank you

23
  • Instruction Analysis

Instruction format type classifications
  • 3 regs, all in r0-r7 / r8-r15 / r16-r23 / r24-r31
  • 2 regs, one in r0-r31, one in r0-r16 / r17-r31
  • 1 reg and 1 imme, imme field 4-6 bits
  • 1 imme, imme field 9 bits
  (reg: register; imme: immediate field)
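The first format class reads as: a three-register instruction qualifies when all three registers fall in the same group of eight, so each register needs only a 3-bit field once the group is known. The sketch below performs that check; how the group itself is encoded is not stated on the slide and is left out, and the function names are hypothetical.

```python
from typing import Optional


def three_reg_bank(rd: int, rn: int, rm: int) -> Optional[int]:
    """Return the 8-register bank (0: r0-r7, 1: r8-r15, 2: r16-r23, 3: r24-r31)
    shared by all three registers, or None if they span several banks."""
    regs = (rd, rn, rm)
    if not all(0 <= r <= 31 for r in regs):
        raise ValueError("register numbers must be in r0-r31")
    banks = {r // 8 for r in regs}
    return banks.pop() if len(banks) == 1 else None


def fits_three_reg_format(rd: int, rn: int, rm: int) -> bool:
    """Within a shared bank, each register needs only a 3-bit field (r % 8)."""
    return three_reg_bank(rd, rn, rm) is not None
```

For example, r1, r2, r7 share bank 0 and qualify, while r1, r9, r2 span two banks and would stay in 32-bit form under this class.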
24
Experiment Results
  • Normalized dynamic instruction numbers
  • Normalized cycle counts
