Instruction Stream Compression - PowerPoint PPT Presentation

Title: Instruction Stream Compression


1
Instruction Stream Compression
  • Trevor Mudge
  • Peter Bird
  • Richard Brown
  • Advanced Computer Architecture Laboratory,
    Electrical Engineering and Computer Science, The
    University of Michigan, Ann Arbor
  • http://www.eecs.umich.edu/tnm/compress
  • DARPA review, December 10, 1997

2
Program Overview
3
Instruction Stream Compression
  • New Ideas
  • MIRV: a high-level intermediate representation
    that simplifies compression and code generation
    for multiple platforms
  • Compiler-directed code compression: templates
    define sequences of instructions
  • Variable-length codewords replace sequences of
    instructions
  • Instruction set redefinition in hardware
    (microcode) or software
[Figure: an embedded system (CPU, RAM, ROM, I/O) holding the original program, beside the same system holding the compressed program; further labels: "Program memory", "tables". Caption: smaller systems on a chip with lower development costs]
  • Impact
  • Lower hardware costs for high-volume,
    cost-sensitive systems
  • High-level language programming for embedded
    systems: code size no longer a problem
  • Easy retargeting simplifies heterogeneous
    embedded systems programming
  • Program development and life cycle costs reduced
  • Schedule
The University of Michigan: Trevor Mudge, Peter
Bird, and Richard Brown
4
Schedule of Major Tasks
5
Program Background
6
Program Background
  • Facilities
  • Research is being conducted in the Advanced
    Computer Architecture Lab (T. Mudge, director;
    P. Bird) and the Solid-State Lab (R. Brown)
  • ACAL has seven faculty (Bird, Chen, Davidson,
    Hayes, Papaefthymiou, Patt, Reinhardt, Sakallah)
    and about 70 PhD students working in
    core technologies, emphasizing experimental work
  • Computer architecture, compilers, operating
    systems, CAD
  • ACAL has a network of about 80 workstations plus
    a simulation pool of about 100 Intel Pentium II
    class machines; it also has an installed base of
    CAD tools suitable for this project
  • HDL tools, cell libraries, compiler front-ends,
    many homegrown simulators
  • Solid-State Lab has extensive facilities for chip
    test (including HP82000)

7
Principal Investigators
  • Trevor Mudge
  • Designed and built prototype experimental
    computers in two DARPA-funded projects (w/Brown):
    exotic technologies and area interconnect
  • Conducts research into computer architecture,
    compiler/architecture trade-offs, OS/architecture
    trade-offs, verification (DARPA, w/Hayes)
  • S-Corp. that designs embedded systems on a chip
    (IS Inc)
  • 20 PhDs: Olukotun (Stanford), Nagle (CMU), Uhlig
    (MRL), Pierce (Merced compiler), Abdelrahman
    (Toronto), Golden (AMD)
  • 170 papers, Fellow of IEEE ... recent winner of
    Heaviside premium
  • Teaching computers 100 to 500 freshmen
  • Peter Bird
  • 83-90: Compiler/ISA for mini-supercomputer, ADI
    Corp
  • 91-94: ACRI France, principal architect for
    multinode supercomputer w/compiler-controlled
    multithreading
  • 95-present: UoM, and software for embedded
    applications (power grid management); designed
    mixed-signal 8-bit processor
  • Teaches compiler classes

8
Principal Investigators
  • Richard Brown
  • Vice-President of Engineering, Holman Industries,
    Oakdale, CA
  • Manager of Computer Development, Cardinal
    Industries, Webb City, MO
  • PhD in solid-state/VLSI at the University of Utah
    in 1985.
  • VLSI program and courses at U. of M.
  • Chairman of the 1997 Advanced Research in VLSI
    conference
  • Guest editor of the IEEE Journal of Solid-State
    Circuits
  • Research: solid-state sensors, high-temperature
    CMOS, and high-performance computing systems, esp.
    high-performance circuits and related CAD tools
  • Associate Head of EE

9
Prior Related Work
  • Chip designs
  • ARM 2 core
  • Missing MUL and MLA (multiply-accumulate);
    incomplete bus interface
  • 0.6 um 3M rules gave a 3.5 x 3.1 = 11 mm sq. chip
    (with pad ring)
  • Processes: HP14B (fabbed)
  • IBM SOI process too: 3M, 0.8 um, die size 3.4 x
    5.3 = 18 mm sq.
  • Verified on a Quickturn Enterprise at 1 MHz
  • ARM 2 differs from more recent ARMs in that
  • address bus is only 26 bits
  • 2-3 fewer instructions
  • only 4 processor modes (vs. 6)
  • PIP chip: PCI interface chip
  • Compilers for high-performance prototypes
  • Cygnus C compiler (gcc w/support) for PowerPC
  • source used to subset the PPC ISA
  • Green Hills C compiler for PowerPC
  • Homegrown C compiler using Edison Group frontend
  • Current targets
  • PowerPC
  • Intel IA32

10
Genesis of Compression Ideas
  • Improving the hit rate of caches
  • Low integration levels imply small on-chip caches
  • "Impact of Instruction Compression on I-cache
    Performance," I-Cheng Chen, P. Bird, T. Mudge, CSE
    Tech Rept. CSE-TR-330-97
    (http://www.eecs.umich.edu/DCO/techreports/cse97.html)
  • Preliminary experiments
  • looked for recurring word patterns in application
    binaries
  • proposed replacing them by byte codes
  • Observation
  • we were discovering instruction patterns created
    by compiler templates
  • Integrate compression into the compiler

11
Technical Program Task Overview
12
Work Accomplished So Far
  • Compression Algorithms
  • Summary of our MICRO30 paper
  • Improving Code Density Using Compression
    Techniques - Charles Lefurgy, Peter Bird, I-Cheng
    Chen, and Trevor Mudge
  • C Compiler for Embedded Systems
  • MIRV 0: Michigan Intermediate Representation
    Version 0
  • Retargeting and distribution are easy;
    compression is integral to compilation

[Figure: J and C frontends feed MIRV, which targets IA32, ARM, and PPC]
13
MICRO 30
  • Challenges for embedded systems
  • Cost, size, and power
  • Instruction memory dominates size and power
  • Fit program in on-chip memory
  • Compilers vs. hand-coded assembly
  • Size/speed optimization
  • Portability
  • Development costs
  • Code bloat
  • Program Compression
  • Reduce compiled code size
  • Take advantage of instruction repetition
  • Trade-off performance for code density
  • Use cheaper processor with smaller on-chip memory

14
Related Work: Instruction Sets
  • SuperH, MCore
  • 16-bit instructions, 32-bit data
  • Thumb, MIPS-16
  • Instructions are subsets of the base instruction
    sets
  • Reduce instruction size from 32-bits to 16-bits
  • TriCore, V8xx (NEC)
  • 16-bit and 32-bit instructions may be mixed
    together
  • CISC/RISC

15
MCore (Motorola)
  • 16-bit instructions
  • 16 32-bit general registers
  • 1 condition bit
  • Destructive (2 register) instructions
  • Load/store architecture (8,16,32-bit data,
    multiple regs)
  • Divide, multiply, bit twiddle, find first 1
  • Alternate register file
  • 14% of opcode space unused
  • Low power modes

16
Thumb (ARM)
  • 16-bit instructions
  • 8 32-bit general registers
  • Destructive (2 register) instructions
  • Load/store architecture (8,32-bit data, reg mask)
  • Multiply
  • Missing from ARM
  • Multiply-accumulate
  • Atomic memory operations
  • Reverse subtract
  • Co-processor operations
  • Conditional execution
  • In-line shifts

17
MIPS-16 (SGI)
  • 16-bit instructions
  • 8 32-bit general registers
  • Destructive (2 register) instructions
  • Load/store architecture (8,16,32,64-bit data)
  • Divide, Multiply
  • 64-bit arithmetic
  • Extend instruction allows many instructions
    to be extended to 32 bits and use longer
    immediate values
  • Missing from MIPS
  • Branch delay
  • Signed arithmetic
  • Floating-point

18
SuperH (Hitachi)
  • 16-bit instructions
  • 16 32-bit general registers, plus an additional
    bank of 8
  • Destructive (2 register) instructions
  • Load/store architecture (8,16,32-bit data)
  • Divide, multiply-acc, bit twiddle memory ops
  • Delayed unconditional branch
  • No barrel shifter (only shifts by 1,2,8,16)
  • Pre/post increment addressing
  • Power mode instructions

19
TriCore (Siemens)
  • 16- and 32-bit instructions (16-bit are a subset
    of 32-bit)
  • 32 32-bit general registers (16 address, 16 data)
  • Destructive (2 register) instructions
  • Load/store architecture (8,16,32-bit data)
  • Divide, multiply-acc, bit twiddle memory ops
  • Circular,reverse addressing modes for
    filters/FFTs
  • Packed DSP data format for operations on small
    data
  • Count leading 1/0s, absolute value, min/max

20
Related Work: Researchers
  • Wolfe et al. (MICRO92, ICCD94, ARVLSI97)
  • Compressed Code RISC Processor (CCRP)
  • Huffman encoding on cache lines
  • Ernst et al. (PLDI97)
  • Small code size for network transmission
  • Represent multiple instructions with 1 codeword
  • Liao et al. (ARVLSI95)
  • Dictionary compression
  • Codeword is "mini-subroutine" call with implicit
    return
  • Kirovski et al. (MICRO97)
  • Procedure based compression
  • Decompressed procedure cache

21
Organization
  • Text Compression
  • Our Compression Technique
  • Compression Algorithm
  • Implementations
  • Conclusions

22
Text Compression
  • Classes of text compression
  • Statistical
  • Codeword represents one character
  • Example: Huffman coding
  • Dictionary
  • Codeword represents entire phrase
  • Example: Ziv-Lempel coding
  • Evaluation Metrics
  • Compression ratio
  • Decode efficiency
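
The slides do not define compression ratio. A common convention in the code-compression literature, assumed here rather than taken from the slides, is the compressed size as a fraction of the original size, so smaller is better. The compressed size would then count the dictionary as well as the compressed code:

```latex
\text{compression ratio} = \frac{\text{compressed size}}{\text{original size}} \times 100\%
```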

23
Overview of Compression Technique
  • Dictionary method
  • Put common sequences of instructions in
    dictionary (sequences do not cross basic block
    boundaries)
  • Replace sequences in program with dictionary
    codewords
  • Final program contains compressed code and the
    dictionary

[Figure: original program, compressed program, and dictionary]
24
Compression Algorithm
  • Greedy algorithm
  • Put all potential dictionary entries in pool
  • For each codeword
  • Estimate savings of all remaining entries in pool
  • Pick entry with highest savings and place in
    dictionary
  • For each instance in program
  • Replace with codeword
  • Remove replaced instructions from pool
  • Branch instructions
  • Branch instructions are not compressed
  • Patch branches to use compressed program
    addresses
  • Scale branch offsets to codeword alignment (need
    to address subwords)
  • Range of branches reduced, but few branches are
    affected
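
The greedy selection above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the 4-byte instruction size, the 2-byte codeword size, the `("CW", index)` token, the savings formula, and the rule that any mnemonic starting with "b" is a branch are all assumptions made for the example. Branch patching is not modelled.

```python
INSN_BYTES = 4       # assumption: fixed 32-bit instructions
CODEWORD_BYTES = 2   # assumption: fixed 2-byte codewords


def is_plain(insn):
    """True for an uncompressed, non-branch instruction (branches stay as-is)."""
    return isinstance(insn, str) and not insn.startswith("b")


def candidates(blocks, max_len):
    """Pool of every instruction sequence that lies inside one basic block."""
    pool = set()
    for block in blocks:
        for n in range(1, max_len + 1):
            for i in range(len(block) - n + 1):
                window = tuple(block[i:i + n])
                if all(is_plain(x) for x in window):
                    pool.add(window)
    return pool


def replace(block, seq, token):
    """Replace non-overlapping occurrences of seq, scanning left to right."""
    out, i, n = [], 0, len(seq)
    while i < len(block):
        if tuple(block[i:i + n]) == seq:
            out.append(token)
            i += n
        else:
            out.append(block[i])
            i += 1
    return out


def savings(blocks, seq):
    """Bytes saved by this entry, net of storing the entry in the dictionary."""
    token = object()
    uses = sum(replace(b, seq, token).count(token) for b in blocks)
    raw = len(seq) * INSN_BYTES
    return uses * (raw - CODEWORD_BYTES) - raw


def compress(blocks, max_entries, max_len=4):
    """Greedily fill the dictionary; returns (compressed blocks, dictionary)."""
    blocks = [list(b) for b in blocks]
    dictionary = []
    for index in range(max_entries):
        pool = candidates(blocks, max_len)
        if not pool:
            break
        # Re-estimate savings for every remaining entry, then take the best.
        best = max(pool, key=lambda s: (savings(blocks, s), s))
        if savings(blocks, best) <= 0:
            break
        dictionary.append(best)
        blocks = [replace(b, best, ("CW", index)) for b in blocks]
    return blocks, dictionary
```

Because replaced instructions become codeword tokens, they drop out of the pool on the next pass, which mirrors the "remove replaced instructions from pool" step.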

25
Compression Architecture
  • Compression is tuned for each application
  • All levels of memory contain compressed
    instructions
  • Dictionary is in program memory or dedicated
    on-chip table

[Figure: compressed instruction memory (usually ROM) sends each codeword to the dictionary index logic, which produces an index into the dictionary; uncompressed instructions go to the CPU core]
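
The fetch path described on this slide can be modelled in a few lines. This is a sketch under stated assumptions: the `("CW", index)` token that stands for a codeword and the function name `expand` are hypothetical, and real hardware would work on bits, not Python objects.

```python
def expand(stream, dictionary):
    """Yield the uncompressed instructions the CPU core would see, in order.

    stream     -- compressed instruction memory: plain instructions (str)
                  mixed with ("CW", index) codeword tokens
    dictionary -- list of instruction tuples, indexed by codeword
    """
    for item in stream:
        if isinstance(item, tuple) and item[0] == "CW":
            # Dictionary index logic: the codeword selects one entry,
            # which may hold several instructions.
            yield from dictionary[item[1]]
        else:
            # Uncompressed instructions pass straight through.
            yield item
```

For example, `list(expand([("CW", 0), "blr"], [("lwz r9,0(r3)", "addi r9,r9,1")]))` gives the two dictionary instructions followed by `"blr"`.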
26
Fixed-length Codeword Implementation
  • Implementation
  • Used PowerPC as base architecture
  • Codewords are 2 bytes long
  • Codewords are specified using illegal PowerPC
    opcodes
  • A maximum of 8192 codewords can be specified
  • Benchmarks
  • SPEC95 integer
  • Compiled with GCC -O2

[Figure: 16-bit codeword format. Bits 0-5: opcode, one of the illegal opcodes (1, 4, 5, 6, 56, 57, 60, 61). Bits 6-15: ID, value range 0-1023]
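
The 8192 figure follows from the format: 8 illegal opcodes times 1024 ID values. The sketch below packs and unpacks such a codeword. The opcode list and field widths come from the slide; the mapping from a flat dictionary index to an (opcode, ID) pair is an assumption made for illustration.

```python
ILLEGAL_OPCODES = (1, 4, 5, 6, 56, 57, 60, 61)   # from the slide
ID_BITS = 10                                      # bits 6-15, values 0-1023
ID_MASK = (1 << ID_BITS) - 1
MAX_CODEWORDS = len(ILLEGAL_OPCODES) << ID_BITS   # 8 * 1024


def encode(index):
    """Pack a dictionary index into a 16-bit codeword (opcode in the top 6 bits)."""
    if not 0 <= index < MAX_CODEWORDS:
        raise ValueError("dictionary index out of range")
    opcode = ILLEGAL_OPCODES[index >> ID_BITS]
    return (opcode << ID_BITS) | (index & ID_MASK)


def decode(halfword):
    """Return the dictionary index, or None if the opcode is a legal one."""
    opcode = halfword >> ID_BITS
    if opcode not in ILLEGAL_OPCODES:
        return None  # start of an ordinary, uncompressed instruction
    return (ILLEGAL_OPCODES.index(opcode) << ID_BITS) | (halfword & ID_MASK)
```

Using illegal opcodes means a decoder can tell a codeword from an ordinary instruction by looking at the first 6 bits alone.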
27
Fixed-length Codeword Results
  • Dictionary size has strong effect on compression
    ratio
  • Long instruction sequences (>4) provide only
    small improvement
  • Small dictionaries can be effective

[Figure: compression ratio (axis 0 to 100) versus maximum number of dictionary entries (16, 128, 1024, 8192), with one curve per maximum number of instructions in each dictionary entry (1, 2, 4, 8)]
28
Top dictionary entries for ijpeg on PowerPC
29
Top dictionary entries for go on PowerPC
30
Composition of compressed program
31
Variable-length Codeword Implementation
  • Use smaller codewords to obtain better
    compression
  • Re-code all instructions
  • Useful for instruction sets without unused
    opcodes
  • Codeword lengths are multiples of 4 bits
  • Short codewords are assigned to patterns with
    high frequency
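
One way to realise nibble-aligned, variable-length codewords is to let the first nibble select the length. The layout below (8-, 12- and 16-bit codewords plus an escape nibble in front of a raw 32-bit instruction) is a hypothetical scheme chosen for illustration; the slides do not give the actual encoding.

```python
# (range of first-nibble values, total codeword length in nibbles); assumed layout
CLASSES = [(range(0, 8), 2), (range(8, 12), 3), (range(12, 15), 4)]
ESCAPE = 15  # assumed marker: the next 8 nibbles are an uncompressed instruction


def codeword_for_rank(rank):
    """Nibbles for the dictionary entry with this frequency rank (0 = most frequent)."""
    for firsts, length in CLASSES:
        capacity = len(firsts) * 16 ** (length - 1)
        if rank < capacity:
            value = (firsts[0] << 4 * (length - 1)) + rank
            return [(value >> 4 * k) & 0xF for k in reversed(range(length))]
        rank -= capacity
    raise ValueError("rank exceeds codeword space")


def decode(nibbles):
    """Yield ('entry', rank) or ('raw', word) items from a nibble stream."""
    i = 0
    while i < len(nibbles):
        first = nibbles[i]
        if first == ESCAPE:
            word = 0
            for n in nibbles[i + 1:i + 9]:
                word = (word << 4) | n
            yield ("raw", word)
            i += 9
            continue
        base = 0  # number of entries covered by the shorter classes
        for firsts, length in CLASSES:
            if first in firsts:
                value = 0
                for n in nibbles[i:i + length]:
                    value = (value << 4) | n
                yield ("entry", base + value - (firsts[0] << 4 * (length - 1)))
                i += length
                break
            base += len(firsts) * 16 ** (length - 1)
```

Ranking entries by frequency before calling `codeword_for_rank` gives the shortest codewords to the most common patterns.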

32
Comparison of Compression Ratios
33
Comparison of Overall Program Size
  • Sizes shown are relative to original PowerPC
    program size.
  • The smallest programs are compressed MIPS-16 code.

34
Comparison with MIPS-16
  • Compressing MIPS-2 is better than using MIPS-16
    on large programs due to more repeated
    instructions
  • Compressing MIPS-16 yields the smallest programs

35
Summary of Initial Studies
  • Combined previous techniques
  • Liao: dictionary compression
  • Wolfe: small, variable-length codewords
  • Achieved compression ratio similar to Thumb and
    MIPS-16
  • Advantages over Thumb and MIPS-16
  • Number of instructions executed does not increase
  • Retain all modes and operations of underlying
    instruction set
  • Floating-point instructions can be compressed
  • No overhead to switch between compressed and
    uncompressed modes
  • Variable-length compression is generalizable to
    other instruction sets

36
Action Items
  • Improve compression algorithm
  • Select instruction patterns with cover algorithm
  • Compiler
  • Try not to produce instructions with encodings
    that are used only once
  • Produce code with identical byte sequences
  • Prologue and epilogue code should save registers
    to the same stack locations
  • Reduce the number of instructions that are not
    compressed

37
Thoughts on Implementation
  • Hardware
  • non-aligned memory accesses
  • fast creation of indices from variable-length
    codewords: solved
  • requires RAM/ROM store for tables
  • Software
  • use page 0 (known) for tables
  • create indices and length of sequence with
    bit-twiddling ops: time consuming
  • cache the results
  • Microcode
  • identify a compact (stack) ISA and compile
    directly to it
  • support an emulator for the original ISA zero
    address form

38
C Compiler for Embedded Systems: MIRV
  • Platform independent program distribution form
  • Fast and high quality code generation at run-time
  • Perform analysis off-line
  • Annotate program off-line to help code generation
  • Optimize in background using run-time feedback
  • Platform independence
  • Some degree of source language independence
  • Current source language: C/C++
  • Program management/distribution

39
Comparison of Intermediate Representations
  • ANDF
  • Intended as an architecture-neutral program
    distribution format
  • Compilation at installation time
  • Problem: too low level; does not preserve
    high-level program semantics
  • SUIF
  • Intended as a high-level intermediate language
    for code optimizations
  • Problem: SUIF is not a program distribution
    format
  • MIRV
  • Platform independent distribution format
  • Preserves high level program semantics
  • Compilation at installation or run-time
  • Off-line annotations carried with the program
    representation to aid fast, high quality code
    generation
  • Can also be used as IR for compiler

40
First Example of Templates
  • Here are 2 instruction sequences from go
  • The only difference is that the offset 16 changes
    to 32
  • Large opportunity for compression!
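
The two sequences from the slide are not in this transcript, so the sketch below uses made-up PowerPC-style instructions that differ only in an offset of 16 versus 32. It shows one possible way to factor such sequences into a shared template plus per-instance parameters; the function names and the "#" hole marker are hypothetical.

```python
import re
from collections import defaultdict

# An integer literal that is not part of a register name such as r9.
LITERAL = re.compile(r"(?<![\w])-?\d+")


def to_template(seq):
    """Replace each integer literal with a hole; return (template, parameters)."""
    params = []

    def hole(match):
        params.append(int(match.group()))
        return "#"

    template = tuple(LITERAL.sub(hole, insn) for insn in seq)
    return template, tuple(params)


def group_by_template(sequences):
    """Map each template to the parameter tuples of its instances."""
    groups = defaultdict(list)
    for seq in sequences:
        template, params = to_template(seq)
        groups[template].append(params)
    return dict(groups)
```

Two sequences that land in the same group can share one dictionary entry, with only the parameters stored per instance.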

41
Second Example of Templates
  • Most templates are much smaller and have many
    variations

42
Possible Future Directions
  • Allocate registers in HW (use stack-like
    templates)
  • Templates contain dependency information between
    instructions
  • Let all state be stored in memory on statement
    boundaries
  • Trade-off: removes register names from code size,
    but adds extra load/store instructions
  • These extra load/store instructions may be
    generated by the processor automatically instead
    of putting them explicitly in the program
  • Many constants frequently re-used
  • Store them in a table. Use a small index to
    access them.
  • Example from go benchmark
  • C code: b[a] = x
  • MIRV produces: li, lw, li, mulu, add, lw, sw
  • template covers 18% of program size
  • The template contains 4 values that are different
    in each template instance (base address of array,
    global variable addresses)
  • A small table of 16 values accounts for >80% of
    the uses of these 4 values
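
The constant-table idea can be checked with a short script: count how often each value appears across template instances, keep the most frequent ones, and measure the share of uses they cover. This is a sketch; `build_constant_table` is a hypothetical helper and any input data would have to come from a real compiled program.

```python
from collections import Counter


def build_constant_table(values, table_size=16):
    """Return (table, coverage): the most frequent values and the share of uses they cover.

    With table_size=16, each covered use can be encoded as a 4-bit index.
    """
    if not values:
        return [], 0.0
    counts = Counter(values)
    table = [value for value, _ in counts.most_common(table_size)]
    covered = sum(counts[value] for value in table)
    return table, covered / len(values)
```

Values outside the table would still need a full-width encoding, so the benefit depends on how skewed the distribution is.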

43
Initial Work Accomplished
  • Status
  • Compiles SPECINT95
  • Retargeted to PowerPC and IA32
  • User manual for MIRV0 is scheduled for the new
    year
  • Initial compression experiments
  • Examine templates produced by compilation
  • instead of learning them through frequency
    analysis
  • don't have to rediscover structure
  • Cluster analysis of parameterized templates
    started

44
Program Challenges/Issues
  • Obtain real-world benchmarks
  • DECT and airbag code (National Semiconductor)
  • Fabrication possibilities
  • IBM Austin
  • National Semiconductor Santa Clara
  • Action items
  • Implementation studies
  • Hardware
  • Software
  • Microcode
  • Cluster analysis
  • Retarget MIRV to ARM ISA
  • Start behavioral simulator for demonstration
    system