Instruction Stream Compression

1
Instruction Stream Compression

Trevor Mudge
Peter Bird
Richard Brown
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science
The University of Michigan, Ann Arbor
http://www.eecs.umich.edu/tnm/compress
DARPA review, December 10, 1997

2
Program Overview
3
Instruction Stream Compression
New Ideas
MIRV: a high-level intermediate representation
that simplifies compression and code generation
for multiple platforms
Compiler-directed code compression: templates
define sequences of instructions
Variable-length codewords replace sequences of
instructions
Instruction set redefinition in hardware
(microcode) or software
[Figure: embedded system block diagrams (CPU, RAM, ROM, I/O) for the original program and for the compressed program, with ROM holding program memory and tables; caption: smaller systems on a chip with lower development costs]
Schedule
Impact
Lower hardware costs for high-volume,
cost-sensitive systems
High-level language programming for embedded
systems: code size no longer a problem
Easy retargeting simplifies heterogeneous
embedded systems programming
Program development and life cycle costs reduced
The University of Michigan: Trevor Mudge, Peter
Bird, and Richard Brown
4
Schedule of Major Tasks
5
Program Background
6
Program Background

Facilities
Research is being conducted in the Advanced
Computer Architecture Lab (ACAL; T. Mudge,
director; P. Bird) and the Solid-State Lab
(R. Brown)
ACAL has seven faculty (Bird, Chen, Davidson,
Hayes, Papaefthymiou, Patt, Reinhardt, Sakallah)
and about 70 PhD students working in core
technologies, emphasizing experimental work:
computer architecture, compilers, operating
systems, CAD
ACAL has a network of about 80 workstations plus
a simulation pool of about 100 Intel Pentium II
class machines; it also has an installed base of
CAD tools suitable for this project:
HDL tools, cell libraries, compiler front-ends,
many homegrown simulators
The Solid-State Lab has extensive facilities for
chip test (including HP82000)

7
Principal Investigators

Trevor Mudge
Designed and built prototype experimental
computers in two DARPA-funded projects (w/Brown):
exotic technologies and area interconnect
Conducts research into computer architecture,
compiler/architecture trade-offs, OS/architecture
trade-offs, verification (DARPA w/Hayes)
S-Corp. that designs embedded systems on a chip
(IS Inc)
20 PhDs: Olukotun (Stanford), Nagle (CMU), Uhlig
(MRL), Pierce (Merced compiler), Abdelrahman
(Toronto), Golden (AMD)
170 papers, Fellow of IEEE ... recent winner of
Heaviside premium
Teaching computers 100 to 500 freshmen
Peter Bird
83-90: Compiler/ISA for mini-supercomputer, ADI
Corp
91-94: ACRI France, principal architect for
multinode supercomputer w/compiler-controlled
multithreading
95-present: UoM and software for embedded
applications (power grid management); designed
mixed-signal 8-bit processor
Teaches compiler classes

8
Principal Investigators

Richard Brown
Vice-President of Engineering, Holman Industries,
Oakdale, CA
Manager of Computer Development, Cardinal
Industries, Webb City, MO
PhD in solid-state/VLSI at the University of Utah
in 1985.
VLSI program and courses at U. of M.
Chairman of the 1997 Advanced Research in VLSI
conference
Guest editor of the IEEE Journal of Solid-State
Circuits
Research: solid-state sensors, high-temperature
CMOS, and high-performance computing systems,
esp. high-performance circuits and related CAD
tools
Associate Head of EE

9
Prior Related Work

Chip designs
ARM 2 core
Missing MUL and MLA (mul acc), incomplete bus
interface
0.6 um 3M rules gave a 3.5 x 3.1 = 11 mm sq. chip
(with pad ring)
Processes: HP14B (fabbed)
IBM SOI process too: 3M 0.8 um, die size 3.4 x
5.3 = 18 mm sq.
Verified on a Quickturn Enterprise at 1 MHz
ARM 2 differs from more recent ARMs in that
address bus is only 26 bits
2-3 fewer instructions
only 4 processor modes (vs. 6)
PIP chip: PCI interface chip
Compilers for high-performance prototypes
Cygnus C compiler (gcc w/support) for PowerPC:
source used to subset the PPC ISA
Green Hills C compiler for PowerPC
Homegrown C compiler using Edison Group frontend
Current targets
PowerPC

10
Genesis of Compression Ideas

Improving the hit rate of caches
Low integration levels imply small on-chip caches
"Impact of Instruction Compression on I-cache
Performance," I-Cheng Chen, P. Bird, T. Mudge, CSE
Tech. Rept. CSE-TR-330-97
(http://www.eecs.umich.edu/DCO/techreports/cse97.html)
Preliminary experiments
looked for recurring word patterns in application
binaries
proposed replacing them by byte codes
Observation
we were discovering instruction patterns created
by compiler templates
Integrate compression into the compiler

11
Technical Program Task Overview
12
Work Accomplished So Far

Compression Algorithms
Summary of our MICRO30 paper
Improving Code Density Using Compression
Techniques - Charles Lefurgy, Peter Bird, I-Cheng
Chen, and Trevor Mudge
C Compiler for Embedded Systems
MIRV 0: Michigan Intermediate Representation
Version 0
Retargeting and distribution are easy
Compression is integral to compilation

[Figure: frontends (J, C) feed MIRV, which targets IA32, ARM, and PPC]
13
MICRO 30

Challenges for embedded systems
Cost, size, and power
Instruction memory dominates size and power
Fit program in on-chip memory
Compilers vs. hand-coded assembly
Size/speed optimization
Portability
Development costs
Code bloat
Program Compression
Reduce compiled code size
Take advantage of instruction repetition
Trade-off performance for code density
Use cheaper processor with smaller on-chip memory

14
Related Work: Instruction Sets

SuperH, MCore
16-bit instructions, 32-bit data
Thumb, MIPS-16
Instructions are subsets of the base instruction
sets
Reduce instruction size from 32 bits to 16 bits
TriCore, V8xx (NEC)
16-bit and 32-bit instructions may be mixed
together
CISC/RISC

15
MCore (Motorola)

16-bit instructions
16 32-bit general registers
1 condition bit
Destructive (2 register) instructions
Load/store architecture (8,16,32-bit data,
multiple regs)
Divide, multiply, bit twiddle, find first 1
Alternate register file
14% of opcode space unused
Low power modes

16
Thumb (ARM)

16-bit instructions
8 32-bit general registers
Destructive (2 register) instructions
Load/store architecture (8,32-bit data, reg mask)
Multiply
Missing from ARM
Multiply-accumulate
Atomic memory operations
Reverse subtract
Co-processor operations
Conditional execution
In-line shifts

17
MIPS-16 (SGI)

16-bit instructions
8 32-bit general registers
Destructive (2 register) instructions
Load/store architecture (8,16,32,64-bit data)
Divide, Multiply
64-bit arithmetic
Extend instruction allows many instructions to be
extended to 32 bits and to use longer immediate
values
Missing from MIPS
Branch delay
Signed arithmetic
Floating-point

18
SuperH (Hitachi)

16-bit instructions
16 32-bit general registers, plus an additional
bank of 8
Destructive (2 register) instructions
Load/store architecture (8,16,32-bit data)
Divide, multiply-acc, bit twiddle memory ops
Delayed unconditional branch
No barrel shifter (only shifts by 1,2,8,16)
Pre/post increment addressing
Power mode instructions

19
TriCore (Siemens)

16- and 32-bit instructions (16-bit are a subset
of 32-bit)
32 32-bit general registers (16 address, 16 data)
Destructive (2 register) instructions
Load/store architecture (8,16,32-bit data)
Divide, multiply-acc, bit twiddle memory ops
Circular and reverse addressing modes for
filters/FFTs
Packed DSP data format for operations on small
data
Count leading 1/0s, absolute value, min/max

20
Related Work: Researchers

Wolfe et al. (MICRO92, ICCD94, ARVLSI97)
Compressed Code RISC Processor (CCRP)
Huffman encoding on cache lines
Ernst et al. (PLDI97)
Small code size for network transmission
Represent multiple instructions with 1 codeword
Liao et al. (ARVLSI95)
Dictionary compression
Codeword is "mini-subroutine" call with implicit
return
Kirovski et al. (MICRO97)
Procedure based compression
Decompressed procedure cache

21
Organization

Text Compression
Our Compression Technique
Compression Algorithm
Implementations
Conclusions

22
Text Compression

Classes of text compression
Statistical
Codeword represents one character
Example: Huffman coding
Dictionary
Codeword represents entire phrase
Example: Ziv-Lempel coding
Evaluation Metrics
Compression ratio
Decode efficiency

23
Overview of Compression Technique

Dictionary method
Put common sequences of instructions in
dictionary (sequences do not cross basic block
boundaries)
Replace sequences in program with dictionary
codewords
Final program contains compressed code and the
dictionary

[Figure: original program, compressed program, and dictionary]
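
The dictionary method above can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names `compress` and `decompress`, the token tags `"CW"` and `"INSN"`, and the sample PowerPC-style instruction strings are all assumptions made for illustration. The sketch works on a single basic block, so sequences never cross a block boundary.

```python
def compress(block, dictionary):
    """Replace dictionary sequences in one basic block with codewords.

    block: list of instruction strings (a single basic block).
    dictionary: list of instruction tuples; the list index is the codeword ID.
    """
    # Try longer entries first so shorter entries do not shadow them.
    order = sorted(range(len(dictionary)), key=lambda i: -len(dictionary[i]))
    out, pos = [], 0
    while pos < len(block):
        for idx in order:
            seq = dictionary[idx]
            if tuple(block[pos:pos + len(seq)]) == seq:
                out.append(("CW", idx))          # emit a codeword
                pos += len(seq)
                break
        else:
            out.append(("INSN", block[pos]))     # leave instruction as-is
            pos += 1
    return out


def decompress(stream, dictionary):
    """Expand codewords back into the original instruction sequence."""
    out = []
    for kind, value in stream:
        if kind == "CW":
            out.extend(dictionary[value])
        else:
            out.append(value)
    return out


# Hypothetical basic block with one repeated three-instruction sequence.
block = ["lwz r9,0(r3)", "addi r9,r9,1", "stw r9,0(r3)", "mr r4,r5",
         "lwz r9,0(r3)", "addi r9,r9,1", "stw r9,0(r3)"]
dictionary = [("lwz r9,0(r3)", "addi r9,r9,1", "stw r9,0(r3)")]
compressed = compress(block, dictionary)
restored = decompress(compressed, dictionary)
```

Here seven instructions shrink to two codewords plus one uncompressed instruction, and the final program must carry both `compressed` and `dictionary`.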
24
Compression Algorithm

Greedy algorithm
Put all potential dictionary entries in pool
For each codeword
Estimate savings of all remaining entries in pool
Pick entry with highest savings and place in
dictionary
For each instance in program
Replace with codeword
Remove replaced instructions from pool
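
The greedy selection steps above can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation. The savings formula, the 4-byte instruction and 2-byte codeword sizes, and all function names are assumptions. Recounting candidates after each replacement plays the role of removing replaced instructions from the pool.

```python
from collections import Counter

INSN_BYTES = 4      # assumed fixed-width 32-bit instructions
CODEWORD_BYTES = 2  # assumed fixed 2-byte codewords


def count_candidates(blocks, max_len):
    """Count non-overlapping uses of every still-uncompressed sequence."""
    counts = Counter()
    for block in blocks:
        last_end = {}  # sequence -> end position of its last counted use
        for start in range(len(block)):
            for n in range(1, max_len + 1):
                seq = tuple(block[start:start + n])
                if len(seq) < n or any(not isinstance(t, str) for t in seq):
                    break  # ran off the block or hit an existing codeword
                if last_end.get(seq, 0) <= start:
                    counts[seq] += 1
                    last_end[seq] = start + n
    return counts


def savings(seq, uses):
    """Assumed estimate: bytes saved by codewords minus dictionary storage."""
    size = len(seq) * INSN_BYTES
    return uses * (size - CODEWORD_BYTES) - size


def replace_all(block, seq, codeword):
    """Replace each instance of seq in the block with the codeword token."""
    out, pos = [], 0
    while pos < len(block):
        if tuple(block[pos:pos + len(seq)]) == seq:
            out.append(codeword)
            pos += len(seq)
        else:
            out.append(block[pos])
            pos += 1
    return out


def greedy_compress(blocks, max_entries, max_len=4):
    dictionary = []
    for index in range(max_entries):          # "for each codeword"
        counts = count_candidates(blocks, max_len)
        if not counts:
            break
        best = max(counts, key=lambda s: (savings(s, counts[s]), s))
        if savings(best, counts[best]) <= 0:
            break                             # nothing left worth adding
        dictionary.append(best)
        blocks = [replace_all(b, best, ("CW", index)) for b in blocks]
    return blocks, dictionary


# Two hypothetical basic blocks with placeholder instruction names.
compressed, dictionary = greedy_compress(
    [["a", "b", "c", "a", "b"], ["a", "b", "d"]], max_entries=8)
```

In this toy input the pair `("a", "b")` is used three times, so it is picked first. The remaining single instructions would cost more in dictionary space than they save, so the loop stops after one entry.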
Branch instructions
Branch instructions are not compressed
Patch branches to use compressed program
addresses
Scale branch offsets to codeword alignment (need
to address subwords)
Range of branches reduced, but few branches are
affected
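
A rough Python sketch of the branch patching step, under stated assumptions: codewords are 2 bytes, uncompressed instructions are 4 bytes, and branch offsets are re-expressed in 2-byte units. The helper names `layout` and `patch_branch` are hypothetical. Because sequences do not cross basic block boundaries, every branch target is the first instruction of some item in the compressed layout.

```python
CODEWORD_BYTES = 2  # assumed codeword size, also the new branch granularity
INSN_BYTES = 4      # assumed size of an uncompressed instruction


def layout(items):
    """Map original instruction indices to compressed byte addresses.

    items: list of (original_instruction_count, size_in_bytes) in program
    order, one tuple per codeword or uncompressed instruction.
    Returns {index of the first original instruction of each item: address}.
    """
    address_of = {}
    orig_index, addr = 0, 0
    for insn_count, size in items:
        address_of[orig_index] = addr
        orig_index += insn_count
        addr += size
    return address_of


def patch_branch(branch_index, target_index, address_of):
    """Return the new branch offset, counted in codeword-sized units."""
    delta = address_of[target_index] - address_of[branch_index]
    assert delta % CODEWORD_BYTES == 0
    return delta // CODEWORD_BYTES


# Hypothetical program: a codeword covering instructions 0-2, a branch
# (instruction 3), a codeword covering instructions 4-5, and instruction 6.
items = [(3, CODEWORD_BYTES), (1, INSN_BYTES), (2, CODEWORD_BYTES), (1, INSN_BYTES)]
address_of = layout(items)
back = patch_branch(3, 0, address_of)     # branch back to the block start
forward = patch_branch(3, 6, address_of)  # branch forward to instruction 6
```

Since the same offset field now counts 2-byte units instead of 4-byte words, its reach in bytes is halved, which is the range reduction the slide mentions.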

25
Compression Architecture

Compression is tuned for each application
All levels of memory contain compressed
instructions
Dictionary is in program memory or dedicated
on-chip table

[Figure: compressed instruction memory (usually ROM) supplies a codeword to the dictionary index logic, which produces an index into the dictionary; the dictionary supplies the uncompressed instruction to the CPU core]
26
Fixed-length Codeword Implementation

Implementation
Used PowerPC as base architecture
Codewords are 2 bytes in length
Codewords are specified using illegal PowerPC
opcodes
A maximum of 8192 codewords can be specified
Benchmarks
Spec95 Integer
Compiled with GCC -O2

[Figure: 16-bit codeword format. Bits 0-5: illegal opcode (1, 4, 5, 6, 56, 57, 60, 61). Bits 6-15: ID, value range 0-1023]
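
The codeword format can be made concrete with a short Python sketch. The field widths and the list of illegal opcodes come from the slide. The mapping from a dictionary entry number to an (opcode, ID) pair, and the function names, are assumptions for illustration. PowerPC numbers bits from the most significant end, so the 6-bit opcode occupies the top bits of the halfword.

```python
ILLEGAL_OPCODES = (1, 4, 5, 6, 56, 57, 60, 61)  # opcode values from the slide
IDS_PER_OPCODE = 1024                           # 10-bit ID field, 0-1023


def encode_codeword(entry):
    """Map dictionary entry 0..8191 to a 16-bit codeword (assumed mapping)."""
    if not 0 <= entry < len(ILLEGAL_OPCODES) * IDS_PER_OPCODE:
        raise ValueError("entry number out of range")
    opcode = ILLEGAL_OPCODES[entry // IDS_PER_OPCODE]
    return (opcode << 10) | (entry % IDS_PER_OPCODE)


def decode_codeword(halfword):
    """Return the dictionary entry, or None for a normal instruction."""
    opcode = halfword >> 10
    if opcode not in ILLEGAL_OPCODES:
        return None
    return ILLEGAL_OPCODES.index(opcode) * IDS_PER_OPCODE + (halfword & 0x3FF)


max_entries = len(ILLEGAL_OPCODES) * IDS_PER_OPCODE  # 8 opcodes x 1024 IDs
first = encode_codeword(0)
last = encode_codeword(max_entries - 1)
```

Eight illegal opcodes with 1024 IDs each give the maximum of 8192 codewords stated on the slide. A halfword whose top six bits hold a legal opcode is the start of an ordinary 32-bit instruction.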
27
Fixed-length Codeword Results

Dictionary size has strong effect on compression
ratio
Long instruction sequences (>4) provide only
small improvement
Small dictionaries can be effective

[Figure: compression ratio (axis 0 to 100) versus maximum number of dictionary entries (16, 128, 1024, 8192), with one curve per maximum number of instructions in each dictionary entry (1, 2, 4, 8)]
28
Top dictionary entries for ijpeg on PowerPC
29
Top dictionary entries for go on PowerPC
30
Composition of compressed program
31
Variable-length Codeword Implementation

Use smaller codewords to obtain better
compression
Re-code all instructions
Useful for instruction sets without unused
opcodes
Codeword lengths are multiples of 4 bits
Short codewords are assigned to patterns with
high frequency
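
One way to picture nibble-aligned, frequency-ordered codewords is the Python sketch below. The slide only states that lengths are multiples of 4 bits and that frequent patterns get short codewords. The specific layout in `CLASSES` (first nibble selects an 8-, 12-, or 16-bit codeword) and the function names are hypothetical.

```python
# Hypothetical nibble-aligned code: (codeword bits, number of codewords).
# First nibble 0-7 -> 8-bit, 8-11 -> 12-bit, 12-15 -> 16-bit codewords,
# so the first nibble alone tells the decoder how long the codeword is.
CLASSES = [(8, 8 * 16), (12, 4 * 256), (16, 4 * 4096)]


def assign_lengths(frequencies):
    """Give the most frequent patterns the shortest codewords.

    frequencies: {pattern: use count}. Returns {pattern: codeword bits}.
    Patterns beyond the total capacity are left unassigned in this sketch.
    """
    ranked = sorted(frequencies, key=lambda p: (-frequencies[p], p))
    lengths = {}
    rank = 0
    for bits, capacity in CLASSES:
        for pattern in ranked[rank:rank + capacity]:
            lengths[pattern] = bits
        rank += capacity
    return lengths


def encoded_bits(frequencies, lengths):
    """Total bits spent on codewords for all assigned patterns."""
    return sum(frequencies[p] * lengths[p] for p in lengths)


# Hypothetical frequency table: 200 patterns with decreasing use counts.
frequencies = {f"p{i}": 1000 - i for i in range(200)}
lengths = assign_lengths(frequencies)
total = encoded_bits(frequencies, lengths)
```

With this layout the 128 most frequent patterns take 8 bits each and the next 1024 take 12 bits, so pattern `p127` gets an 8-bit codeword and `p128` a 12-bit one.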

32
Comparison of Compression Ratios
33
Comparison of Overall Program Size

Sizes shown are relative to original PowerPC
program size.
The smallest programs are compressed MIPS-16 code.

34
Comparison with MIPS-16

Compressing MIPS-2 is better than using MIPS-16
on large programs due to more repeated
instructions
Compressing MIPS-16 yields the smallest programs

35
Summary of Initial Studies

Combined previous techniques
Liao: dictionary compression
Wolfe: small, variable-length codewords
Achieved compression ratio similar to Thumb and
MIPS-16
Advantages over Thumb and MIPS-16
Number of instructions executed does not increase
Retain all modes and operations of underlying
instruction set
Floating-point instructions can be compressed
No overhead to switch between compressed and
uncompressed modes
Variable-length compression is generalizable to
other instruction sets

36
Action Items

Improve compression algorithm
Select instruction patterns with cover algorithm
Compiler
Try not to produce instructions with encodings
that are used only once
Produce code with identical byte sequences
Prologue and epilogue code should save registers
to the same stack locations
Reduce the number of instructions left
uncompressed

37
Thoughts on Implementation

Hardware
non-aligned memory accesses
fast creation of indices from variable-length
codewords: solved
requires RAM/ROM store for tables
Software
use page 0 (known) for tables
create indices and length of sequence with bit
twiddling ops: time consuming
cache the results
Microcode
identify a compact (stack) ISA and compile
directly to it
support an emulator for the original ISA zero
address form

38
C Compiler for Embedded Systems: MIRV

Platform independent program distribution form
Fast and high quality code generation at run-time
Perform analysis off-line
Annotate program off-line to help code generation
Optimize in background using run-time feedback
Platform independence
Some degree of source language independence
Current source language: C/C++
Program management/distribution

39
Comparison of Intermediate Representations

ANDF
Intended as an architecture neutral program
distribution format
Compilation at installation time
Problem: too low level; does not preserve
high-level program semantics
SUIF
Intended as a high level intermediate language
for code optimizations
Problem: SUIF is not a program distribution
format
MIRV
Platform independent distribution format
Preserves high level program semantics
Compilation at installation or run-time
Off-line annotations carried with the program
representation to aid fast, high quality code
generation
Can also be used as IR for compiler

40
First Example of Templates

Here are 2 instruction sequences from go
The only difference is that the offset 16 changes
to 32
Large opportunity for compression!

41
Second Example of Templates

Most templates are much smaller and have many
variations

42
Possible Future Directions

Allocate registers in HW (use stack-like
templates)
Templates contain dependency information between
instructions
Let all state be stored in memory on statement
boundaries
Trade-off: removes register names from the code,
reducing its size, but adds extra load/store
instructions
These extra load/store instructions may be
generated by the processor automatically instead
of putting them explicitly in the program
Many constants are frequently re-used
Store them in a table and use a small index to
access them
Example from go benchmark
C code ba x
MIRV produces: li, lw, li, mulu, add, lw, sw
Template covers 18% of program size
The template contains 4 values that are different
in each template instance (base address of array,
global variable addresses)
A small table of 16 values accounts for >80% of
the uses of these 4 values
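
The constant-table idea can be sketched in a few lines of Python. The function names and the sample values are made up for illustration and are not data from the go benchmark. A 16-entry table would let each covered value be named by a 4-bit index.

```python
from collections import Counter


def build_table(values, size=16):
    """Return the `size` most frequent values; the list index is the code."""
    return [value for value, _ in Counter(values).most_common(size)]


def coverage(values, table):
    """Fraction of uses that can be served from the table."""
    hits = sum(1 for value in values if value in table)
    return hits / len(values)


# Hypothetical per-instance template values (offsets and addresses).
values = [0, 4, 8, 4, 0, 0, 1024, 4]
table = build_table(values, size=2)
covered = coverage(values, table)
```

In this toy input a two-entry table holding 0 and 4 serves six of the eight uses. The remaining values would still need to be carried in full alongside the template.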