1. Reducing Code Size with Run-time Decompression
- Charles Lefurgy, Eva Piccininni,
- and Trevor Mudge
- Advanced Computer Architecture Laboratory
- Electrical Engineering and Computer Science Dept.
- The University of Michigan, Ann Arbor
- High-Performance Computer Architecture (HPCA-6)
- January 10-12, 2000
2. Motivation
- Problem: embedded code size
- Constraints: cost, area, and power
- Fit the program in on-chip memory
- Compilers vs. hand-coded assembly
- Portability
- Development costs
- Code bloat
- Solution: code compression
- Reduce compiled code size
- Take advantage of instruction repetition
- Implementation
- Hardware or software?
- Code size?
- Execution speed?
[Figure: two embedded-system diagrams, each with CPU, RAM, ROM, and I/O. In the original system the ROM holds the original program; with code compression the ROM holds the compressed program.]
3. Software decompression
- Previous work
- Decompression unit: whole program [Taunton91]
- No memory savings
- Decompression unit: procedures [Kirovski97, Ernst97]
- Requires large decompression memory
- Fragmentation of decompression memory
- Slow
- Our work
- Decompression unit: 1 or 2 cache lines
- High-performance focus
- New profiling method
4. Dictionary compression algorithm
- Goal: fast decompression
- Dictionary contains the program's unique instructions
- Replace program instructions with short indices

[Figure: the original program's .text segment holds 32-bit instructions with repeats (e.g. "lw r15,r3"). In the compressed program, the .text segment holds 16-bit indices, and a .dictionary segment stores each unique 32-bit instruction once.]
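The scheme above can be sketched in Python. The function names and the size accounting (2-byte indices, 4-byte instructions and dictionary entries) are illustrative; the paper's tool operates on real object files.

```python
def compress(program):
    """Replace each 32-bit instruction with a 16-bit index into a
    dictionary of the program's unique instructions."""
    dictionary = []   # the .dictionary segment: each unique instruction once
    index_of = {}
    indices = []      # the compressed .text segment
    for insn in program:
        if insn not in index_of:
            index_of[insn] = len(dictionary)
            dictionary.append(insn)
        indices.append(index_of[insn])
    return dictionary, indices

def compression_ratio(dictionary, indices):
    # Original: 4 bytes per instruction.
    # Compressed: 2-byte indices plus the 4-byte dictionary entries.
    original = 4 * len(indices)
    compressed = 2 * len(indices) + 4 * len(dictionary)
    return compressed / original
```

The more an instruction repeats, the better the ratio: eight instructions drawn from only two unique values compress to 0.75 of the original size.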
5. Decompression
- Algorithm
- 1. I-cache miss invokes the decompressor (exception handler)
- 2. Fetch index
- 3. Fetch dictionary word
- 4. Place instruction in I-cache (special instruction)
- Write directly into the I-cache
- Decompressed instructions exist only in the I-cache
[Figure: on an I-cache miss, the processor reads a 16-bit index (e.g. 5) from the indices in memory, fetches the corresponding dictionary entry (e.g. "add r1,r2,r3") through the D-cache, and writes the decoded instruction into the I-cache.]
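A minimal simulation of the four steps above, assuming a software-visible I-cache model; the class and constant names are invented for illustration, and the real handler runs as a CPU exception handler.

```python
LINE_WORDS = 8  # 32-byte I-cache line / 4-byte instructions

class ICache:
    def __init__(self):
        self.lines = {}  # line number -> decoded instructions

    def miss_handler(self, line_num, indices, dictionary):
        # Step 2: fetch the 16-bit indices covering this cache line.
        start = line_num * LINE_WORDS
        line_indices = indices[start:start + LINE_WORDS]
        # Step 3: fetch each dictionary word.
        # Step 4: place the decoded instructions directly into the I-cache;
        # they exist nowhere else in decompressed form.
        self.lines[line_num] = [dictionary[i] for i in line_indices]
        return self.lines[line_num]
```

Decompressing one or two cache lines at a time keeps the working memory tiny, in contrast to the whole-program and per-procedure schemes of prior work.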
6. CodePack
- Overview
- IBM, PowerPC
- First system with instruction-stream compression
- Decompress during I-cache miss
- Software CodePack (this work)
7. Compression ratio
- CodePack: 55-63%
- Dictionary: 65-82%
8. Simulation environment
- SimpleScalar
- Pipeline: 5-stage, in-order
- I-cache: 16 KB, 32 B lines, 2-way
- D-cache: 8 KB, 16 B lines, 2-way
- Memory: 10-cycle latency, 2-cycle rate
9. Performance
- CodePack: very high overhead
- Reduce overhead by reducing cache misses
10. Cache miss
- Control slowdown by optimizing the I-cache miss ratio
11. Selective compression
- Hybrid programs
- Only compress some procedures
- Trade size for speed
- Avoid decompression overhead
- Profile methods
- Count dynamic instructions
- Example: Thumb
- Use when compressed code executes more instructions
- Reduces the number of executed instructions
- Count cache misses (new)
- Example: CodePack
- Use when compressed code has longer cache-miss latency
- Reduces cache-miss latency
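One way to realize the cache-miss profile method above is a greedy selection: keep the procedures that miss most often in native form until a native-code size budget runs out. The greedy policy and the field names are assumptions for illustration, not the paper's exact algorithm.

```python
def select_native(procedures, native_budget):
    """Pick which procedures to leave uncompressed.

    procedures: list of (name, size_bytes, cache_misses) from a profile run.
    native_budget: bytes of memory we may spend on native (uncompressed) code.
    """
    native = []
    used = 0
    # Most miss-prone procedures first: they pay decompression cost most often.
    for name, size, misses in sorted(procedures, key=lambda p: -p[2]):
        if used + size <= native_budget:
            native.append(name)
            used += size
    return native
```

Everything not selected stays compressed, trading a little size for the removal of the hottest decompression overhead.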
12. Cache miss profiling
- Cache-miss profiling reduces overhead by 50%
- Loop-oriented benchmarks benefit most
- Approaches the performance of native code
13. CodePack vs. Dictionary
- More compression may yield better performance
- CodePack produces smaller code than Dictionary compression
- Even with some native code, CodePack is smaller
- CodePack is faster due to using more native code
14. Conclusions
- High-performance SW decompression is possible
- Dictionary is faster than CodePack, but at a 5-25% compression-ratio difference
- Hardware support
- I-cache miss exception
- Instruction to store into the I-cache
- Tune performance by reducing cache misses
- Cache size
- Code placement
- Selective compression
- Use cache-miss profiles for loop-oriented benchmarks
- Code placement affects decompression overhead
- Future: unify code placement and compression
15. Web page
- http://www.eecs.umich.edu/compress
16. Code placement
[Figure: memory layouts. Original code: one contiguous region in memory. Whole-program compression: memory holds compressed code, which decompresses into a region that exists only in the L1 cache, in the same order as the original. Selective compression: memory holds a native region plus compressed code, and the decompressed region ends up in a different order from the original.]
17. Hardware or software decompression?
- Hardware
- Fast translation
- Potential speedup
- Tune compression for each benchmark
- Software
- Low cost
- Re-target for new algorithms
- New algorithm for each benchmark
- Slow
18. CodePack encoding
- A 32-bit instruction is split into two 16-bit half-words
- Each 16-bit half-word is compressed separately
Encoding for upper 16 bits (tag + dictionary index, or raw bits):
- 00  + 3-bit index (8 dictionary entries)
- 01  + 5-bit index (32 entries)
- 100 + 6-bit index (64 entries)
- 101 + 7-bit index (128 entries)
- 110 + 8-bit index (256 entries)
- 111 + 16 raw bits (escape: uncompressed half-word)

Encoding for lower 16 bits:
- 00  encodes zero (no index bits)
- 01  + 4-bit index (16 entries)
- 100 + 5-bit index (32 entries)
- 101 + 7-bit index (128 entries)
- 110 + 8-bit index (256 entries)
- 111 + 16 raw bits (escape)
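As a sketch of the lower-half encoding, assuming the tag widths and dictionary capacities shown in the table above (the exact assignments may differ from IBM's specification; names here are illustrative):

```python
LOW_TAGS = [        # (tag, index bits, cumulative dictionary capacity)
    ("01",  4, 16),
    ("100", 5, 32),
    ("101", 7, 128),
    ("110", 8, 256),
]

def encode_low(halfword, dictionary):
    """Encode one lower 16-bit half-word as a bit string."""
    if halfword == 0:
        return "00"                      # zero gets its own 2-bit code
    if halfword in dictionary:
        idx = dictionary.index(halfword)
        # Shorter tags reach only the front of the dictionary, so the
        # most frequent half-words should be placed at low indices.
        for tag, bits, capacity in LOW_TAGS:
            if idx < capacity:
                return tag + format(idx, f"0{bits}b")
    # Escape: emit the half-word uncompressed.
    return "111" + format(halfword, "016b")
```

A frequent half-word at index 0 costs 6 bits; a half-word missing from the dictionary costs 19 bits, which is why dictionary construction matters.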
19. CodePack decompression
- L1 I-cache miss address: fields at bits 31-26, 25-6, and 5-0
- Fetch index: the miss address selects an entry in the index table (in main memory), which gives the byte-aligned block address of the compressed bytes (in main memory)
- Fetch compressed instructions: a compression block holds 16 instructions
- Each compressed instruction consists of a hi tag, low tag, hi index, and low index
- Decompress: the hi and low indices select entries in the high and low dictionaries, yielding the high 16 bits and low 16 bits of the native instruction
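The index-table lookup above can be sketched as follows, assuming the 20-bit field at bits 25-6 of the miss address numbers the compression block and the index table maps block numbers to byte-aligned addresses of the compressed bytes (names are illustrative):

```python
BLOCK_INSNS = 16  # a compression block holds 16 instructions (64 bytes)

def compressed_block_address(miss_addr, index_table):
    """Map an I-cache miss address to the byte address of its
    compression block's compressed bytes in main memory."""
    block = (miss_addr >> 6) & ((1 << 20) - 1)  # bits 25-6: block number
    return index_table[block]                   # byte-aligned block address
```

The index table is needed because compressed blocks have variable length, so a block's position can no longer be computed from its address alone.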