Title: An Overview of the Techniques of Space and Energy Reduction Using Data Compression

1. Nektarios Paisios
- An Overview of the Techniques of Space and Energy Reduction Using Data Compression.
2. Introduction
- Data compression: a technique of data reduction.
- Space is costly.
- Study: "The cost is not for the processor but for the memory."
- In the past, memory provided enough space for the then-current application footprint, but disk space was too small to hold all the data.
- Compression: an old method of saving disk space.
- A 1994 advertisement: up to 50-100% more free disk space.
3. Data compression: why?
- Explosive growth of disk space; drop in prices.
- But network links are still slow, and processors will soon reach their chip limits according to Moore's Law (circa 2010).
- e.g. 2 GHz 4 years ago, 3 GHz 2 years ago; what now?
- New methods of speeding up need to be invented:
- by bringing data closer to the processor, by providing data faster, by making predictions more accurate.
- Compression can:
- store more data in caches closer to the processor;
- store more data in predictors, giving more accurate predictions.
- But is it faster?
4. Data compression: why?
- Clusters of computers built out of commodity equipment can work wonders.
- Less cost due to commoditisation, but:
- more energy is needed, and more energy means more cooling.
- Compression can:
- reduce data structures, lowering energy requirements,
- while maintaining equal performance.
5. Data compression: what?
- Two forms: lossless and lossy.
- Pictures and music: lossy; other files: lossless.
- Both are useful in processors:
- data caches: lossless, because of program accuracy and integrity;
- predictors: lossy, up to an acceptable point. Why?
- Lossy is faster.
- Prediction needs to be faster than actual program execution.
6. Data compression: how?
- Commonest method:
- find common patterns,
- isolate them,
- replace them with a pointer.
- Example: "The fat cat sat like that."
- "at" followed by a space is a common pattern.
- Three techniques proposed in processors:
- pattern matching,
- pattern differentiation,
- common repeating bit elimination.
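The commonest method above can be sketched in a few lines on the slide's own example. This is a minimal illustration, not any specific hardware scheme: the "pointer" is a single reserved byte standing in for the common pattern.

```python
def compress(text, pattern, code):
    """Replace every occurrence of a common pattern with a short code."""
    assert code not in text  # the reserved code must not clash with real data
    return text.replace(pattern, code)

def decompress(data, pattern, code):
    """Expand the short code back into the original pattern."""
    return data.replace(code, pattern)

# The slide's example: "at " repeats in "fat", "cat", "sat".
original = "The fat cat sat like that."
packed = compress(original, "at ", "\x01")
assert decompress(packed, "at ", "\x01") == original
assert len(packed) < len(original)   # 3 occurrences of 3 chars became 1 char each
```

The same shape (find, isolate, replace with a reference) underlies the three hardware techniques listed above; they differ in what counts as a "pattern" and where the shared copy lives.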
7. Three techniques
- 1. Pattern matching:
- produces a dictionary of common items.
- But:
- how to build the dictionary (what to choose)?
- when to update it?
- how big should the dictionary be (speed)?
- 2. Pattern differentiation:
- finds common changes: increments and decrements.
- Used when we have a series of data with an expected dispersion,
- e.g. value predictors.
- Can it be used in other cases?
- 3. Common repeating bit elimination:
- large memory blocks are often all zeros.
- A series of 0s or a series of 1s can be replaced with a short code.
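The third technique is essentially run-length encoding of bit runs. A minimal sketch, with the run code represented as (bit, length) pairs rather than any particular hardware encoding:

```python
def rle_bits(bits):
    """Collapse runs of identical bits into (bit, run_length) pairs,
    so a long series of 0s or 1s becomes a single short code."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((b, 1))                # start a new run
    return runs

def unrle_bits(runs):
    """Expand the (bit, run_length) pairs back into the bit string."""
    return "".join(b * n for b, n in runs)

word = "0000000000000011"          # a mostly-zero memory word
runs = rle_bits(word)
assert runs == [("0", 14), ("1", 2)]   # 16 bits collapsed into 2 codes
assert unrle_bits(runs) == word
```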
8. Example 1: compression in caches and memory
- From Technical Report 1500, Computer Sciences Dept., UW-Madison, April 2004.
- Aims:
- increase effective memory size,
- reduce memory address and data bandwidth,
- increase effective cache size.
- Three approaches: dictionary, differential, significance.
- Dictionary: common patterns are stored in a separate table and a pointer to them is placed in the compressed data.
- Differential: the common patterns are stored with the compressed data, together with a list of differences amongst the various data parts.
- Significance: not all bits are required; the upper ones are usually zero.
9. Dictionary-based compression in main memory
- From Technical Report 1500, Computer Sciences Dept., UW-Madison, April 2004.
- IBM's memory compression:
- IBM's MXT technology employs real-time main-memory content compression,
- effectively doubling memory.
- Implemented in the Pinnacle chip, a single-chip memory controller.
- Franaszek, et al.: CRAM (MXT).
- Kjelso, et al.: X-Match hardware compression (4-byte entries).
- Lempel-Ziv (LZ77): a sequential algorithm.
- Block-Referential Compression with Directory Sharing:
- divides the input data block (1 KB in MXT) into sub-blocks,
- four 256-byte sub-blocks; dictionaries are constructed cooperatively while all sub-blocks are compressed in parallel.
10. Dictionary-based compression in caches
- Lee, et al.: selectively compress L2 cache memory blocks if they can be reduced to half their original size.
- (SCMS) uses the X-RL compression algorithm, similar to X-Match.
- Speed considerations?
- Parallel decompression.
- Selective compression: not everything is compressed if it is not worth it.
- Chen, et al.: divide the cache into sections of different compressibility.
- Use of the LZ algorithm.
11. Dictionary-based compression in caches
- Frequent-value-based compression:
- Yang and Gupta analysed the SPECint95 benchmarks.
- They discovered that a small number of distinct values occupy a large fraction of memory-access values.
- This value locality enabled the design of energy-efficient, data-compressed caches.
- How? Each line in the L1 cache can hold either one uncompressed line or two lines compressed to at least half, based on frequent values.
- Zhang, et al.: a value-centric data cache design called the frequent value cache (FVC).
- They added a small direct-mapped cache holding values frequently found in the benchmarks,
- greatly reducing the cache miss rate.
- Is this the right approach?
12. Differential-based compression in caches
- Benini, et al.: uncompressed caches but compressed memory.
- Assumption: it is likely for data words in the same cache line to have some bits in common.
- Zhang and Gupta added 6 new data compression instructions to MIPS.
- The new instructions:
- compress 32-bit data and addresses into 15 bits,
- via common prefixes and narrow-data transformations.
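The 32-to-15-bit idea above can be sketched as a compressibility test. The bit widths (14 payload bits plus a mode bit) and the exact criterion are illustrative assumptions, loosely after the common-prefix and narrow-data cases, not the actual MIPS instruction semantics:

```python
def compressible_to_15(value, base):
    """A 32-bit word fits in 15 bits (14 payload bits + 1 mode bit) if its
    upper 18 bits are sign extension (narrow data) or match the upper 18
    bits of a known base value (common prefix). Hypothetical criterion."""
    upper = value >> 14
    sign_ext = upper in (0, (1 << 18) - 1)    # narrow positive or negative data
    shares_prefix = upper == (base >> 14)     # common prefix with the base
    return sign_ext or shares_prefix

assert compressible_to_15(0x00001FFF, 0)           # small positive: narrow data
assert compressible_to_15(0xFFFFE000, 0)           # small negative: narrow data
assert compressible_to_15(0x10002000, 0x10003FFF)  # shares its prefix with base
assert not compressible_to_15(0x12345678, 0)       # stays uncompressed
```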
13. Significance-based compression
- The most significant bits are shared amongst data and instruction/data addresses.
- Addresses: why transfer long addresses with repeating patterns?
- Farrens and Park: "many address references transferred between processor and memory have redundant information in their high-order (most significant) portions".
- Solution: cache these high-order bits in a group of dynamically allocated base registers,
- and transfer only small register indexes, rather than the high-order address bits, between the processor and memory.
- Similarly, Citron and Rudolph store the common high-order bits of address and data words in a table,
- and transfer only an index plus the low-order bits between the processor and memory.
14. Significance-based compression
- Canal, et al.: compress addresses and instructions.
- Keep only the significant bytes.
- Maintain two to three extension bits to record the significant byte positions.
- Results: reduces power consumption in the pipeline.
- Kant and Iyer: the most significant bits of an address can be predicted with high accuracy, whilst data bits only with lower accuracy.
- Simple solution:
- compress individual cache lines on a word-by-word basis by storing common word patterns in a compressed format,
- storing each word with an appropriate prefix.
15. Significance-based compression
- The significant bits of processor structure entries are the same, or are to be found in a small data set.
- BTB: a 256-entry table can store 99% of the higher bits.
- Data bits: why have multiple instances of them in every BTB, cache, etc., entry?
- Solution: use multiple tables with different sizes,
- and use pointers amongst the different table levels.
16. Frequent value caches: how do they work?
- They work as follows:
- the cache data store is divided into two arrays,
- one holding, say, the 5 lower bits of each word and the other the 27 upper bits.
- If the lower 5 bits belong to a value which is frequent, the remaining 27 bits are not read from the cache; they are instead read from a small high-speed register file containing 2^5 entries.
- Otherwise, if the lower 5 bits do not belong to a frequent value, the remaining 27 bits are read from the second cache array.
- Thus, the actual value sharing is not done between the two cache arrays but amongst three tables: the two cache arrays and the small fast register file.
- Also, an extra flag bit indicates whether a value is frequent. So, although there is always an indirection (either between the two cache arrays, or between the first cache array and the special register file), and thus a delay, there is no extra pointer; the first of the two delays could therefore, in theory, have been avoided.
- Why don't they do it more simply?
17. Cache compression schemes: a summary
- Cache compression schemes:
- 1. Indirect tags: "The IIC does not associate a tag with a specific data block; instead, each tag contains a pointer into a data array which contains the blocks."
- 2. FVC: "The Frequent Value Cache (FVC) replaces the top N frequently used 32-bit values with log(N) bits. When built as a separate structure, the FVC can increase cache size if an entire cache block is made up of frequent values."
- The probability decreases, though, with larger caches, since a larger cache means more uniqueness in the data.
- So it is suitable for small structures; the paper mentions only the L1 cache.
- 3. Dynamic Zero Compression (DZC): if a byte is all zeros, then only one bit is used to signify this, saving the other 7 bits.
18. Cache compression schemes: a summary
- Cache compression schemes:
- 4. Separate banks: Kim et al. utilize the knowledge that most of the bits of values stored in an L1 data cache are merely sign bits.
- Their scheme compresses the upper portion of a word to a single bit if it is all 1s or all 0s.
- These high-order bits can be stored in a separate cache bank and accessed only if needed,
- or tags can be further modified to indicate whether an access to the second cache bank is necessary.
- 5. Alameldeen and Wood: an algorithm called frequent pattern compression (FPC).
- What? An adaptive scheme that sometimes compresses and sometimes does not, based on whether the penalty of decompression is more or less than the potential penalties incurred by cache misses. Very elegant!
- 6. A "general compression algorithm": cache lines are compressed in pairs (where the line address is the same except for the low-order bit). If both lines compress by 50% or more, they are stored in a single cache line, freeing a cache line in an adjacent set.
- The paper doesn't specify the compression algorithm, though. It also does not specify how these lines are tagged differently.
19. Cache compression schemes: a summary
- G. Hallnor and S. K. Reinhardt, "A Compressed Memory Hierarchy using an Indirect Index Cache".
- Compression through an indirect table of tags.
- The cache is fully associative and lines are referenced through a pointer stored alongside the tag.
- More than one pointer slot is present per tag, to allow compression.
- Algorithm used: LZSS.
- Compression is carried out only if the line can be compressed to fit into the architecturally specified sector size; otherwise there is no compression.
- Attains greater than 50% of the performance gain of doubling the cache size, with about one tenth the area overhead.
- Disadvantages:
- the speed of LZSS depends on the number of simultaneous compressions;
- 6 extra bytes per tag for the pointers;
- pointers may go unused if no compression is possible for that line,
- resulting in 134 KB of a 1 MB cache devoted to the indirection table (tags, pointers, etc.). Bad!
20. Cache compression schemes: a summary
- N. Kim, T. Austin, T. Mudge, "Low-Energy Data Cache using Sign Compression and Cache Line Bisection".
- How does the sign compression work?
- "Each word fetched during a cache line miss is not changed,
- but the upper half-words are replaced by a zero or one when the upper half-words are all zeros or all ones respectively."
- It uses a few sign compression bits instead.
- However:
- it allows uncompressed words in the line too,
- with extra bits to indicate uncompressed / compressed / sign bits.
- Innovation:
- two tags per cache line instead of one.
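The sign compression rule quoted above can be sketched directly. The tuple layout (compressed flag, sign bit, payload) is an illustrative encoding, not the paper's exact bit format:

```python
def sign_compress(word):
    """Replace the upper 16 bits of a 32-bit word with a single sign bit
    when they are all zeros or all ones; otherwise flag it uncompressed."""
    upper, lower = word >> 16, word & 0xFFFF
    if upper == 0:
        return (True, 0, lower)       # compressed: sign bit 0 + low half-word
    if upper == 0xFFFF:
        return (True, 1, lower)       # compressed: sign bit 1 + low half-word
    return (False, None, word)        # stored uncompressed, flagged as such

def sign_decompress(entry):
    """Rebuild the 32-bit word from the flag, sign bit and payload."""
    compressed, sign, payload = entry
    if not compressed:
        return payload
    return (0xFFFF0000 | payload) if sign else payload

for w in (0x00001234, 0xFFFF8000, 0x12345678):
    assert sign_decompress(sign_compress(w)) == w
assert sign_compress(0x00001234)[0]       # 17 bits stored instead of 32
assert not sign_compress(0x12345678)[0]   # left uncompressed
```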
21. Cache compression schemes: a summary
- N. Kim, T. Austin, T. Mudge, "Low-Energy Data Cache using Sign Compression and Cache Line Bisection".
- It allows energy savings, as only half the line is accessed, based on where the block in question is and given that sign compression is carried out.
- Energy: precharge using an MRU mechanism.
- Uses empty spaces in a block to store newly fetched blocks having the same index.
- Reduces misses.
22. Cache compression schemes: a summary
- "Adaptive Cache Compression for High-Performance Processors".
- Alaa R. Alameldeen and David A. Wood, Computer Sciences Department, University of Wisconsin-Madison.
- Adaptive simply means that sometimes you compress and sometimes not, based on two factors:
- 1. decompression latency, and
- 2. avoided cache misses.
- If the cost of decompressing is more than the time that would be saved by avoiding potential misses, then compression is not performed; otherwise compression is carried out.
- How? A single global saturating counter predicts whether the L2 cache should store a line in compressed or uncompressed form.
- The counter is updated by the L2 controller,
- based on whether "compression could (or did) eliminate a (potential) miss or incurs an unnecessary decompression overhead."
- Not a new idea, though: virtual memory.
23. Cache compression schemes: a summary
- L. Villa, M. Zhang, K. Asanovic, "Dynamic Zero Compression for Cache Energy Reduction".
- "Dynamic Zero Compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte."
- Invisible to software.
- Basically, for each byte, if it is all zeros, only one bit is used to store it.
- Disadvantages:
- a compression scheme for every byte of the cache line,
- which increases the complexity of the cache architecture,
- and loses the opportunity to compress all-ones bytes, dealing only with zeros.
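The per-byte rule above can be sketched as a pack/unpack pair. The flags-plus-payload representation is a modelling convenience; in the real circuit the indicator bit gates the read and write of the byte's bitlines:

```python
def dzc_pack(line):
    """Dynamic Zero Compression sketch: one indicator bit per byte; zero
    bytes contribute nothing further, so each costs 1 bit instead of 8."""
    flags = [b == 0 for b in line]            # one zero-indicator bit per byte
    nonzero = bytes(b for b in line if b != 0)
    return flags, nonzero

def dzc_unpack(flags, nonzero):
    """Rebuild the line: zero bytes from the flags, the rest from payload."""
    it = iter(nonzero)
    return bytes(0 if z else next(it) for z in flags)

line = bytes([0, 0, 7, 0, 0, 0, 42, 0])
flags, payload = dzc_pack(line)
assert dzc_unpack(flags, payload) == line
assert len(payload) == 2   # 8 indicator bits + 2 bytes, versus 64 bits raw
```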
24. Cache compression schemes: a summary
- Chuanjun Zhang, Jun Yang and Frank Vahid, "Low Static-Power Frequent-Value Data Caches".
- "Recently, a frequent value low power data cache design was proposed based on the observation that a major portion of data cache accesses involves frequent values that can be separated and stored only once."
- Basically it means that if a cache-line value is "frequent" then you store it only once and keep a pointer to it.
- Same idea.
- But: it proposes a method to shut off the unused bits to conserve energy in the case that a pointer is used.
- They also propose to reduce the latency of reading both the frequent value table and the ordinary cache.
25. Compression in caches: conclusion
- Cache designers might consider using cache compression to increase cache capacity and reduce off-chip bandwidth.
- "A key challenge in the design of a compressed data store is the management of variable-sized data blocks."
- Generally, in the studies carried out, a lot of work has been done.
- Compression has been examined from a thousand angles.
26. Compression in caches: conclusion
- Compression has been examined from a thousand angles:
- most use the idea that 0s and 1s come together in great numbers;
- some deal with common "frequent" bit patterns.
- However, I found none that shows a mechanism for finding those "frequent" values.
- They rely on profiling or on hard-coding the values, from what I understand.
- Marios' paper?
27. Example 2: compression in predictors
- Prediction is important for high parallelism.
- Branches: 15% of a program.
- Pentium: a 4K-entry BTB.
- Do branch targets exhibit the same pattern behaviour as cache lines?
- Surely targets might not be as compressible as cache lines by the removal of leading zero bits, but there might be pattern repetition in them.
28. Compression in predictors
- Ideal: dynamic allocation of target space according to the needs of each instruction.
- Rehashable BTB:
- recognises polymorphic branches and stores them in a common BTB space.
- Value predictors:
- Gabriel H. Loh stores values in separate tables based on length.
- Energy savings up to 25%, space savings up to 75%.
- However: cannot be used with the BTB.
29. What did we do with the BTB?
- Mission: minimize the waste of space in the BTB.
- Data compression to avoid duplicate entries, meaning bit sharing.
- How?
- A simple two-table structure.
30. Methodology
- Aim: find all entries/branches that have the same, or partially the same, target.
- We used: a BTB with no replacements; a BTB with multiple tables.
31. Results
- Questions:
- 1. What width will each table have?
- 2. How many entries?
- 3. How to join them up?
32. Q1: Bit ranges
- GCC95 results: BTB performance.
- BTB type | Number of correct hits | Hit percentage
- Normal BTB | 19164012 | 87.8982
- BTB with no replacements | 19892653 | 91.2402
- Bits 1-16 | 21802503 | 99.9999
- Bits 5-20 | 21791021 | 99.9473
- Bits 9-24 | 21706107 | 99.5578
- Bits 13-28 | 21304116 | 97.714
- Bits 17-32 | 20013651 | 91.7951
- Bits 25-32 | 21801499 | 99.9953
- Bits 1-24 | 21706107 | 99.5578
- The split into bits 1-24 and bits 25-32 performs better than the split into bits 1-16 and 17-32.
33. Q2: How much space for each table?
- BTB type | bits 1-32 hits (%) | bits 25-32 hits (%) | bits 1-24 hits (%)
- 4k normal | 19164012 (87.8982) | 19164385 (87.8999) | 19847639 (91.0337)
- 4k no replacement | 19892653 (91.2402) | 21801499 (99.9953) | 21706107 (99.5578)
- 2k normal | 18571627 (85.1811) | 18571985 (85.1827) | 19248618 (88.2862)
- 2k no replacement | 19567018 (89.7466) | 21801090 (99.9934) | 21701108 (99.5349)
- 1k normal | 17522962 (80.3713) | 17523248 (80.3726) | 18187048 (83.4172)
- 1k no replacement | 18885228 (86.6195) | 21799372 (99.9856) | 21688593 (99.4775)
- 512 normal | 16149816 (74.0732) | 16150001 (74.074) | 16786062 (76.9914)
- 512 no replacement | 17720797 (81.2787) | 21795542 (99.968) | 21671880 (99.4008)
- 256 normal | 14327910 (65.7168) | 14328057 (65.7174) | 14924424 (68.4528)
- 256 no replacement | 15905413 (72.9522) | 21770601 (99.8536) | 21582680 (98.9917)
- For the BTB without replacements, the critical point is at 256 places for the lower 8 bits (they are, after all, only 8 bits wide).
- The upper 24 bits are very common!
34. Results
- Benchmark | BTB size | Correct hits, normal BTB (%) | Correct hits, improved BTB (%)
- GCC95 | 8k places | 19528170 (89.5684) | 19384014 (88.9072)
- GCC95 | 4k places | 19164012 (87.8982) | 19034779 (87.3054)
- GCC95 | 2k places | 18571627 (85.1811) | 18456969 (84.6552)
- MCF2000 | 8k places | 149389263 (99.2259) | 149389263 (99.2259)
- MCF2000 | 4k places | 149387899 (99.225) | 149387899 (99.225)
- MCF2000 | 2k places | 147731903 (98.125) | 147731903 (98.125)
- Vortex2000 | 8k places | 89371437 (86.9625) | 88953923 (86.5563)
- Vortex2000 | 4k places | 88405444 (86.0226) | 87995598 (85.6238)
- Vortex2000 | 2k places | 85185389 (82.8893) | 84805232 (82.5194)
- At about 80% of the original size!
35. Costs
- Number of 1st-table entries | Size of normal BTB | Size of improved BTB | Reduction
- 8k entries | 376832 bits | 299520 bits | 20.516%
- 4k entries | 192512 bits | 156160 bits | 18.882%
- 2k entries | 98304 bits | 82432 bits | 16.145%
- Generally reduces size requirements by about 20%.
36. Don't use the page number, but a pointer to it
- André Seznec: a brilliant proposal.
- Caches: the relative size of addresses (tags) is huge, especially with small blocks.
- Predictors: accuracy is affected due to large addresses (targets and tags).
- Curious finding: addresses are represented multiple times: in cache tags, in instructions, in the BTB, in the TLB.
- Removed by:
- 1. storing page numbers only once,
- 2. not using the page number, but a pointer to it.
37. Don't use the page number, but a pointer to it
- André Seznec: a brilliant proposal.
- How?
- The page number is stored in a page number cache.
- This can be the TLB when virtual addresses are used, or another buffer if physical addresses are used.
- Store 5-bit pointers in place of addresses.
- Reduced cache tags, reduced predictor tags.
- Cache: if a page pointer is invalidated (page miss), all dependent entries are invalidated.
- But: why not invalidate the BTB entries as well as the TLB ones?
38. Don't use the page number, but a pointer to it
- How does it compare?
- André Seznec's comparison with other schemes (isolated compression scheme | Seznec's scheme):
- Touches only the targets | touches both tags and targets.
- Predictor size dependent on address width | predictor size independent of address width.
- 8-bit pointers | 6-bit pointers.
- Second-level table accessed every time | table with page pointers accessed only when going outside the processor, i.e. to RAM.
- A predictor-only solution | a BTB, cache and TLB solution.
- Not affected by page misses | affected by page misses, though according to the paper not by much.
- Only the specific predictor changed | cache, BTB, even the program counter have to be modified for the new scheme to be effective.
39. Conclusion
- Compression is a huge field, and we have only touched the surface.
- The keys to a successful algorithm:
- 1. Speed, speed, speed!
- 2. Simple to implement in hardware.
- 3. Balances space and energy savings against overhead.
- Based on the above, are:
- decision trees,
- classification algorithms,
- etc.,
- worth it?
40. References
- Technical Report 1500, Computer Sciences Dept., UW-Madison, April 2004.
- P.-Y. Chang, E. Hao, Y. N. Patt, "Target Prediction for Indirect Jumps".
- A. Seznec, "Don't use the page number, but a pointer to it".
- A. R. Alameldeen and D. Wood, "Adaptive Cache Compression for High-Performance Processors", Proc. of the 31st International Symposium on Computer Architecture, June 2004, pp. 212-223.
- G. Hallnor and S. K. Reinhardt, "A Compressed Memory Hierarchy using an Indirect Index Cache", Technical Report CSE-TR-488-04, 2004.
- L. Villa, M. Zhang, K. Asanovic, "Dynamic Zero Compression for Cache Energy Reduction", Proceedings of the 33rd International Symposium on Microarchitecture, Dec. 2000.
- P. R. Wilson, S. F. Kaplan, Y. Smaragdakis, "The Case for Compressed Caching in Virtual Memory Systems", Proceedings of USENIX 1999.
- J. Yang, R. Gupta, "Energy Efficient Frequent Value Data Cache Design", Proceedings of the 35th Annual International Symposium on Microarchitecture (MICRO-35), 2002.
- Y. Zhang, J. Yang, R. Gupta, "Frequent Value Locality and Value-Centric Data Cache Design", Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, Nov. 2000.
- N. Kim, T. Austin, T. Mudge, "Low-Energy Data Cache using Sign Compression and Cache Line Bisection", 2nd Annual Workshop on Memory Performance Issues, May 2002.
- Chuanjun Zhang, Jun Yang, Frank Vahid, "Low Static-Power Frequent-Value Data Caches".
41. References
- Li, T., John, L. K. (2001). Rehashable BTB: An Adaptive Branch Target Buffer to Improve the Target Predictability of Java Code. The University of Texas at Austin.
- Sazeides, Y., Smith, J. E. (1998). Implementations of Context-Based Value Predictors. University of Wisconsin-Madison.
- Loh, G. H. (2003). Width-Partitioned Load Value Predictors. Journal of Instruction-Level Parallelism. College of Computing, Georgia Institute of Technology, Atlanta.
- Gifford, S., Huang, C.-W., Yang, Z., Yu, C. (2003). A Comprehensive Front-end Architecture for the VeriSimple Alpha Pipeline. University of Michigan.
- Yung, R. (1996). Design of the UltraSPARC Instruction Fetch Unit. Sun Microsystems.
- Chang, P.-Y., Hao, E., Patt, Y. N. (1997). Target Prediction for Indirect Jumps. Department of Electrical Engineering and Computer Science, the University of Michigan.
- Calder, B., Grunwald, D. (1995). Next Cache Line and Set Prediction. Department of Computer Science, University of Colorado.
- McFarling, S. (1993). Combining Branch Predictors. Western Research Laboratory, California.
- Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., Roussel, P. (2001). The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal Q1.
- Loh, G. H., Henry, D. S., Krishnamurthy, A. (2003). Exploiting Bias in the Hysteresis Bit of Two-bit Saturating Counters in Branch Predictors. Journal of Instruction Level Parallelism.
- Kalla, R., Sinharoy, B., Tendler, J. M. (2004). IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Computer Society.
- Arora, K., Sharangpani, H. (2000). Itanium Processor Microarchitecture. IEEE Computer Society.
- Perleberg, C. H., Smith, A. J. (1993). Branch Target Buffer Design and Optimization. IEEE Transactions on Computers.
42. Other interesting references
- Gabriel H. Loh
- Simulation Differences Between Academia and Industry: A Branch Prediction Case Study.
- To appear in the International Symposium on Performance Analysis of Software and Systems (ISPASS), March 2005, Austin, TX, USA.
- Gabriel H. Loh
- The Frankenpredictor: Stitching Together Nasty Bits of Other Predictors.
- In the 1st Championship Branch Prediction Contest (CBP1), pp. 1-4, Dec 6, 2004, Portland, OR, USA. (Held in conjunction with MICRO-37.)
- Gabriel H. Loh
- The Frankenpredictor: Satisfying Multiple Objectives in a Balanced Branch Predictor Design.
- Invited to appear in the Journal of Instruction Level Parallelism (JILP).
- Gabriel H. Loh, Dana S. Henry
- Predicting Conditional Branches With Fusion-Based Hybrid Predictors.
- In the 11th Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 165-176, September 22-25, 2002, Charlottesville, VA, USA.