Hamming Transcoders for Power Reduction on Internal Buses - PowerPoint PPT Presentation

1
Hamming Transcoders for Power Reduction on
Internal Buses
  • Victor Wen
  • July 12, 2000
  • University of California, Berkeley

2
Outline
  • Motivations
  • Related Work
  • General Coding Setup
  • Transition Code Technique
  • Simulation Results
  • What does it cost?
  • Future Work/Conclusion

3
Motivation
  • Increasing importance of wires relative to
    transistors
  • Spend transistors to drive wires more
    efficiently?
  • Try to reduce transitions over wires to decrease
    capacitive charge/discharge.
  • Orthogonal to other power-saving techniques, e.g.
    voltage reduction, low-swing drivers/receivers,
    clock gating, and parallel function blocks (like
    vectors!)
  • Important for portable devices, where power and
    energy are major constraints

4
Power reduction through coding
[Figure: Input → Encoder → Encoded Value → Decoder → Output]
  • Can we encode information in a way that takes
    less power?
  • Do this on chip?!
  • Do this dynamically?! Apply value-prediction
    techniques to track data patterns.
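Since a wire dissipates energy mainly when it switches, a natural proxy for bus power is the number of bit transitions between consecutive words, i.e. their Hamming distance. A minimal Python sketch of that metric, assuming the bus starts at all zeros; `count_transitions` is an illustrative helper, not something from the talk:

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def count_transitions(words, width=32, initial=0):
    """Total wire toggles when `words` are driven onto a `width`-bit bus in order."""
    mask = (1 << width) - 1
    prev = initial & mask
    total = 0
    for w in words:
        w &= mask
        total += popcount(prev ^ w)  # every differing bit toggles one wire
        prev = w
    return total


print(count_transitions([0x00, 0xFF, 0x0F], width=8))  # 12
```

Any encoding scheme can then be judged by comparing this count for the raw words against the count for the encoded words.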

5
Related Work
  • Bus-Invert Coding, by M. R. Stan and W. P.
    Burleson
  • Reduces peak power by 50% and average power by
    up to 25%
  • Working-Zone Encoding, by E. Musoll et al.
  • Compares favorably with other techniques
  • Test Vector Ordering, by P. Girard et al.
  • Results: 8.2% to 54.1% less switching activity
  • Minimizing Power Consumption, by A. Chandrakasan
    and R. Brodersen
  • Introduces power-saving techniques at different
    levels
  • The Predictability of Data Values, by Y. Sazeides
    and J. Smith
  • The context-based predictor suggests using the
    previous n values to help track data
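Bus-invert coding, the first related technique listed above, is simple enough to sketch. A minimal Python version of the scheme as it is usually described: if more than half of the data wires would toggle, send the complement and raise one extra invert line. The 8-bit width, the all-zero reset state, and the function names are assumptions for illustration:

```python
def popcount(x):
    return bin(x).count("1")


def bus_invert_encode(words, width=8):
    """Return (data wires, invert line) for each word, starting from an all-zero bus."""
    mask = (1 << width) - 1
    bus = 0
    sent = []
    for w in words:
        w &= mask
        if popcount(bus ^ w) > width // 2:
            bus, inv = ~w & mask, 1  # complement toggles fewer than half the wires
        else:
            bus, inv = w, 0
        sent.append((bus, inv))
    return sent


def bus_invert_decode(sent, width=8):
    mask = (1 << width) - 1
    return [(~b & mask) if inv else b for b, inv in sent]


data = [0x00, 0xFF, 0x0F, 0xF0]
sent = bus_invert_encode(data)
print(sent)                             # [(0, 0), (0, 1), (15, 0), (15, 1)]
print(bus_invert_decode(sent) == data)  # True
```

At most width/2 data wires toggle per transfer (plus the invert line), which is where the peak-power reduction comes from.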

6
Dynamic Transcoder
  • Place an FSM and a hardware cache at each end of
    the bus, tracking each other. The values sent
    across the bus keep the two FSMs synchronized, so
    no extra communication overhead on the bus is
    needed.
  • The FSM decides when to admit a new data value into
    the hardware table(s) and which low-Hamming-weight
    code to assign to it.
  • The hardware table varies in size and design.

[Figure: Input → FSM + Hardware Table(s) → Encoded Value → FSM + Hardware Table(s) → Output]
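The lock-step idea can be sketched in Python. This is a minimal illustration under assumptions not taken from the talk: each end keeps a plain frequency counter (no shift registers or pending table), the k most frequent values get the k lowest-weight codes, and a separate coded flag marks raw words. `TranscoderEnd` and its methods are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def low_weight_codes(width, k):
    """First k `width`-bit codes in order of increasing Hamming weight."""
    codes = []
    for weight in range(width + 1):
        for bits in combinations(range(width), weight):
            if len(codes) == k:
                return codes
            codes.append(sum(1 << b for b in bits))
    return codes


class TranscoderEnd:
    """Table state at one end of the bus. The encoder and the decoder each
    own an instance and apply the same update after every transfer, so the
    two tables stay identical without any extra traffic on the bus."""

    def __init__(self, width=8, k=4):
        self.codes = low_weight_codes(width, k)
        self.counts = Counter()

    def _ranking(self):
        # Most frequent values first; ties broken by value so both ends agree.
        ranked = sorted(self.counts, key=lambda v: (-self.counts[v], v))
        return ranked[: len(self.codes)]

    def encode(self, value):
        ranked = self._ranking()
        if value in ranked:
            sent = (self.codes[ranked.index(value)], 1)  # (payload, coded flag)
        else:
            sent = (value, 0)  # not in the table: send the raw value
        self.counts[value] += 1  # the decoder makes the same update
        return sent

    def decode(self, payload, coded):
        ranked = self._ranking()
        value = ranked[self.codes.index(payload)] if coded else payload
        self.counts[value] += 1
        return value


tx, rx = TranscoderEnd(), TranscoderEnd()
trace = [0x3C, 0x3C, 0xA5, 0x3C, 0xA5, 0x3C]
sent = [tx.encode(v) for v in trace]
print(sent)  # [(60, 0), (0, 1), (165, 0), (0, 1), (1, 1), (0, 1)]
print([rx.decode(p, c) for p, c in sent] == trace)  # True
```

Both ends update only from values they both see, so the decoder always ranks exactly as the encoder did before each transfer. In the talk's design the chosen code would additionally be XORed onto the bus rather than driven directly.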
7
FSM Details
[Figure: State Transition Diagram; arcs are labeled with a code and a frequency, e.g. Code 0x00, Freq 2620 and Code 0xFF, Freq 10]
Potentially 2^32 entries in the table for a 32-bit bus!
  • The most frequent arc is assigned the lowest-weight
    code (e.g. 0x0); codes are reassigned dynamically to
    reflect the most frequent values in the current
    phase of the trace.
  • A state can represent an actual value or a class of
    values (e.g. state 1 could stand for a difference of
    1 between the current and previous inputs). We call
    the latter a filtered input. A filtered input can
    cover more distinct values with one entry (e.g. all
    inputs that differ from the previous input by 1 are
    captured in one entry).
  • The output code is XORed onto the bus lines.
  • Every 1 in the coded word causes a transition on the
    bus.
  • The most frequent arcs cause the fewest
    transitions.
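The XOR step can be sketched in a few lines of Python. It assumes the bus powers up at zero and that the code for each arc has already been chosen; `xor_encode` and `xor_decode` are illustrative names. Because the wires toggle exactly where the code has 1s, a code of 0x00 for the most frequent arc costs no transitions at all.

```python
def xor_encode(codes, width=8, bus=0):
    """Transition signaling: each code is XORed onto the current bus value,
    so exactly popcount(code) wires toggle per transfer."""
    mask = (1 << width) - 1
    out = []
    for c in codes:
        bus = (bus ^ c) & mask
        out.append(bus)
    return out


def xor_decode(bus_values, bus=0):
    """The receiver recovers each code as (new bus value) XOR (previous bus value)."""
    codes = []
    for b in bus_values:
        codes.append(b ^ bus)
        bus = b
    return codes


codes = [0x00, 0x01, 0x00, 0x02]  # e.g. the most frequent arc was given 0x00
wires = xor_encode(codes)
print([hex(w) for w in wires])     # ['0x0', '0x1', '0x1', '0x3']
print(xor_decode(wires) == codes)  # True
```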

8
Hardware Table Details
  • The hardware table consists of a filter and a
    combination of shift registers, a pending table and
    the actual map table.
  • The tables store the most frequent values (either
    actual or filtered inputs).
  • Shift registers and the pending table are used to
    admit new frequent values and to hold values evicted
    from the map table.
  • We are currently investigating different hardware
    combinations and admission policies to find a
    balance between tracking ability and hardware
    complexity.
  • Three setups are currently under consideration.

[Figure: three candidate table setups between Input and Output, built from Filter(s), Shift Regs, a Pending Table (P. Table) and a Map Table (M. Table)]
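One possible admission policy for the map-table/pending-table combination, sketched in Python. The talk leaves the policies open, so everything here is an assumption for illustration: the table sizes, the rule that a candidate is promoted once its count exceeds the weakest map entry, and the 8-bit saturating counters (borrowed from the hardware-cost slide). Shift registers and filters are omitted, and `MapWithPending` is a hypothetical name.

```python
class MapWithPending:
    """Hypothetical admission policy: a small map table of frequent values
    (the ones that receive low-weight codes) backed by a pending table that
    holds new candidates and values evicted from the map table."""

    MAX_COUNT = 255  # assumption: 8-bit saturating hit counters

    def __init__(self, map_size=4, pending_size=4):
        assert map_size >= 1 and pending_size >= 1
        self.map_size = map_size
        self.pending_size = pending_size
        self.map = {}      # value -> hit count
        self.pending = {}  # value -> hit count

    def access(self, value):
        """Record one occurrence of `value`; return True on a map-table hit."""
        if value in self.map:
            self.map[value] = min(self.map[value] + 1, self.MAX_COUNT)
            return True
        self.pending[value] = min(self.pending.get(value, 0) + 1, self.MAX_COUNT)
        if len(self.map) < self.map_size:
            self.map[value] = self.pending.pop(value)  # free slot: admit at once
        else:
            victim = min(self.map, key=lambda v: (self.map[v], v))
            if self.pending[value] > self.map[victim]:
                # Promote the candidate; the evicted value waits in pending.
                self.map[value] = self.pending.pop(value)
                self.pending[victim] = self.map.pop(victim)
        if len(self.pending) > self.pending_size:
            # Pending table full: drop its weakest entry other than `value`.
            weakest = min((v for v in self.pending if v != value),
                          key=lambda v: (self.pending[v], v))
            del self.pending[weakest]
        return False


table = MapWithPending(map_size=2, pending_size=2)
hits = [table.access(v) for v in [1, 2, 1, 2, 3, 3, 3]]
print(hits)           # [False, False, True, True, False, False, False]
print(table.map)      # {2: 2, 3: 3}
print(table.pending)  # {1: 2}
```

Keeping evicted values in the pending table with their counts lets a value that was only briefly displaced win its slot back quickly, one way of trading tracking ability against table size.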
9
Evaluation: Input Pattern and Unique Values (gcc)
[Figure (a): Inputs to Transcoder]
  • The input-pattern graph (a) shows the transitions
    of the raw and filtered inputs to the transcoder.
    The filtered inputs actually show more activity
    than the raw input.
  • The unique-value graph (b) shows the number of
    unique values in the gcc trace.
  • Uniqueness is measured over a sliding window of
    size n: the current input is not unique if it
    matches any value in the window, and unique
    otherwise.
  • Notice that when n > 31, the total number of
    unique values drops drastically.
  • Graphs (a) and (b) suggest that a transcoder with
    table size > 31 and no filter would work best.

[Figure (b): Unique values in the trace]
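The sliding-window uniqueness measure is easy to reproduce on any trace. A Python sketch, assuming the trace is a list of integer bus values; `count_unique` is a hypothetical helper name. Sweeping n over a range of window sizes gives a curve of the kind plotted in graph (b).

```python
from collections import deque


def count_unique(trace, n):
    """Count inputs that match none of the previous n inputs."""
    window = deque(maxlen=n)  # holds the n most recent inputs
    unique = 0
    for value in trace:
        if value not in window:
            unique += 1
        window.append(value)  # the oldest input drops out automatically
    return unique


trace = [1, 2, 1, 3, 2, 1]
print([count_unique(trace, n) for n in (1, 2, 3)])  # [6, 5, 3]
```

A window of size n plays the role of a table that remembers the last n values, which is why the count relates directly to the table size a transcoder needs.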
10
Evaluation: Transition Savings for gcc and compress
  • Graph (a) shows the resulting transitions of the gcc
    trace after passing through the dynamic transcoder
    with no filter, an XOR filter, and a subtract filter.
  • The trends show that the dynamic transcoder is able
    to track the input and reduce activity. The area
    between the nocode line and each of the other lines
    represents the energy saved.
  • Results for the static oracle transcoder are also
    shown; its three curves use table sizes of 1, 31,
    and all values.
  • Note: the oracle transcoder reads in the trace file
    and constructs the map table statically, assigning
    lower-weight codes to the most frequent values. The
    trace file is then re-read to count the resulting
    transitions.
  • Graph (b) shows the results of a similar experiment
    for the compress trace.
  • Note: both the static and dynamic transcoders perform
    much better here, because the input is more
    predictable and the filters give the transcoder
    better input to adapt to (the input-pattern graph
    for compress is not shown).

[Figure: (a) gcc; (b) compress]
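The static oracle can be sketched in Python as two passes over a trace. Assumptions not stated in the talk: the bus and a one-bit coded flag start at zero, symbols outside the table are driven raw with the flag cleared, the XOR and subtract filters combine the current input with the immediately previous one (the subtraction wrapped to the bus width), and the flag's own toggles are counted. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def low_weight_codes(width, k):
    """First k `width`-bit codes in order of increasing Hamming weight."""
    codes = []
    for weight in range(width + 1):
        for bits in combinations(range(width), weight):
            if len(codes) == k:
                return codes
            codes.append(sum(1 << b for b in bits))
    return codes


# Filters map (current input, previous input) to the symbol that is looked up.
FILTERS = {
    "none": lambda cur, prev: cur,
    "xor": lambda cur, prev: cur ^ prev,
    "subtract": lambda cur, prev: cur - prev,  # wrapped to the bus width below
}


def nocode_transitions(trace, width=32):
    """Toggles when the raw trace is driven onto the bus unchanged."""
    mask = (1 << width) - 1
    bus = toggles = 0
    for cur in trace:
        toggles += bin(bus ^ (cur & mask)).count("1")
        bus = cur & mask
    return toggles


def oracle_transitions(trace, width=32, table_size=31, filt="none"):
    """Static oracle: pass 1 builds the map table, pass 2 counts toggles.
    table_size=None puts every distinct symbol in the table."""
    mask = (1 << width) - 1
    f = FILTERS[filt]

    def symbols():
        prev = 0
        for cur in trace:
            yield f(cur, prev) & mask
            prev = cur

    # Pass 1: the most frequent symbols get the lowest-weight codes.
    ranked = [s for s, _ in Counter(symbols()).most_common(table_size)]
    table = dict(zip(ranked, low_weight_codes(width, len(ranked))))

    # Pass 2: replay the trace; count data-wire toggles plus the coded flag.
    bus = flag = toggles = 0
    for cur, s in zip(trace, symbols()):
        if s in table:
            new_bus, new_flag = bus ^ table[s], 1  # transition-coded
        else:
            new_bus, new_flag = cur & mask, 0  # miss: drive the raw input
        toggles += bin(bus ^ new_bus).count("1") + (flag ^ new_flag)
        bus, flag = new_bus, new_flag
    return toggles


trace = [10, 11, 12, 13, 14]
print(nocode_transitions(trace, width=8))  # 9
print(oracle_transitions(trace, width=8, table_size=1, filt="subtract"))  # 3
```

Comparing `oracle_transitions` with `nocode_transitions` for table sizes 1, 31 and `None` (all values) mirrors the three oracle curves; the small example also shows how a subtract filter turns a counting sequence into one repeated symbol.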
11
Conclusion / Future Work
  • Conclusion:
  • Transition coding attacks the root of the problem.
  • The static oracle transcoder still beats the
    dynamic transcoder, suggesting room for
    improvement.
  • Changes to existing circuits are transparent.
  • Orthogonal to other low-power techniques.
  • Future work:
  • Allow more context in filtering (e.g. use the
    previous n values instead of only the immediately
    previous value).
  • Simulate SPEC on the UltraSPARC RTL.
  • Implement all three architectures described above.
  • Implement actual hardware and estimate how much
    power the transcoder itself consumes.

12
Hardware Cost?
  • Given table size n and bus width m:
  • XOR (4 transistors/bit, pass-transistor logic) →
    136 T
  • n·m AND gates plus an n-input, m-bit OR gate for
    associative lookup → 2nm T + 2nm T = 4nm T
  • 6nm T for table storage
  • an n-bit encoder (for the code)
  • 32 1-bit inverters → 64 T
  • a majority-voting circuit (to decide whether to
    invert or not)
  • FSM circuit to perform the pattern-tracking
    algorithm
  • n 8-bit counters (to keep track of hit
    frequency)
  • muxes
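The itemized terms can be added up for a concrete configuration. A Python sketch that tallies only the items with stated transistor counts; the 34-wire bus (inferred from 136 T / 4 T per bit, which also matches the 34-bit bus in the Transition Code Overview) is an assumption, and the encoder, majority voter, FSM, counters and muxes are left out, so the result understates the full cost. `transistor_estimate` is a hypothetical name.

```python
def transistor_estimate(n, m, bus_wires=34):
    """Sum of the cost items that have transistor counts on the slide.
    n = table size, m = bus width. Encoder, majority voter, FSM, counters
    and muxes are not included, so this understates the real total."""
    xor = 4 * bus_wires             # pass-transistor XOR, 4 T per wire
    lookup = 2 * n * m + 2 * n * m  # n*m AND gates + n-input, m-bit OR
    storage = 6 * n * m             # 6 T per stored table bit
    inverters = 32 * 2              # 32 one-bit inverters at 2 T each
    return xor + lookup + storage + inverters


print(transistor_estimate(n=31, m=32))  # 10120
```

The n·m terms dominate: for a 31-entry table on a 32-bit bus, lookup and storage account for 9920 of the 10120 transistors counted here.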

13
Huffman-based Compression
  • Variable bit length problem!
  • Possible solution: macro clock
  • Fewer bits → fewer transitions
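To see where the variable-length problem comes from, a standard Huffman construction (not something the talk specifies) can be sketched in Python. It returns each symbol's code length: frequent symbols get shorter codes, so encoded words no longer line up with a fixed-width bus cycle. `huffman_code_lengths` is an illustrative name.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    if len(freq) <= 1:
        return {s: 1 for s in freq}  # a lone symbol still needs one bit
    # Heap entries: (total weight, tie-breaker, {symbol: depth in subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


lengths = huffman_code_lengths("aaaabbc")
print(sorted(lengths.items()))  # [('a', 1), ('b', 2), ('c', 2)]
```

Here seven symbols take 10 bits instead of 14 with a fixed 2-bit code, but the 1-bit and 2-bit words have to be framed somehow on the bus, which is the problem a macro clock would address.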

14
Hamming Weight
  • Find a map function that minimizes transitions
  • Search space is large: 256! maps for an 8-bit bus
  • Leads to the transition-code idea
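The size of that search space, and why transition coding sidesteps it, can be checked in Python. The sketch assumes the transition-coding cost model, where each transfer toggles exactly the 1 bits of its own code; under that assumption the total cost is a sum of frequency × code weight, and pairing the most frequent values with the lowest-weight codes is optimal (a rearrangement-inequality argument). The 2-bit brute-force check, the example frequencies, and all names are illustrative.

```python
import math
from itertools import permutations

# A one-to-one map from 8-bit values to 8-bit codes is a permutation of
# the 256 words, so exhaustive search faces 256! candidate maps.
print(len(str(math.factorial(256))))  # 507 decimal digits


def weight(x):
    return bin(x).count("1")


def cost(freq, mapping):
    """Total toggles if each occurrence of v costs weight(mapping[v]),
    as it does once codes are XORed onto the bus."""
    return sum(f * weight(mapping[v]) for v, f in freq.items())


def greedy_map(freq, width):
    values = sorted(freq, key=lambda v: -freq[v])  # most frequent first
    codes = sorted(range(1 << width), key=weight)  # lowest weight first
    return dict(zip(values, codes))


# Small check on a 2-bit bus: the greedy map matches the exhaustive optimum.
freq = {0: 5, 1: 1, 2: 7, 3: 3}
best = min(cost(freq, dict(zip(freq, p))) for p in permutations(range(4)))
print(best, cost(freq, greedy_map(freq, 2)))  # 10 10
```

Without transition coding, the cost of a transfer depends on the previous bus value as well, so no such sort-based shortcut applies.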

15
Simulation Setup
  • Sun offers processor descriptions in Verilog
  • picoJava (for now)
  • UltraSPARC (soon)

16
Simulation Results (1)
  • Savings:
  • Rank 9 saves 79.52%
  • Rank 256 saves 79.68%

9th-bit overhead: Rank 1: 23%, Rank 9: 0.29%
17
Simulation Results (2)
Number of transitions drops quickly as the rank
increases; a 256x256 table might not be necessary.
Other trace files show similar trends.
Note: icu_data connects the instruction cache unit
and the integer unit, a fairly long bus according
to picoJava's floorplan.
18
Hamming Transcoder (cont.)
  • Only transitions matter, not absolute values
  • Recognize the more frequent transitions and assign
    low-weight codes to them
  • Guarantees that more frequent transitions cause
    fewer bit changes on the wire

19
Transition Code Overview
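[Figure: Transition Code Overview. Encoder side: Cur input, Prev input, Filter, Hardware Table(s), Coder, Transcode, XOR with Cur bus value, Coded? and Invert? flags, To Bus; a Decoder on the receiving side. Data paths are 32 bits wide; the bus is 34 bits wide.]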
20
Simulation Setup
  • Now: running SPEC95 on a high-level processor
    model.
  • Future: running SPEC benchmarks on the Verilog RTL
    model offered by Sun (SPARC V8 release).