Title: Adaptive Cache Compression for High-Performance Processors
Slide 1: Adaptive Cache Compression for High-Performance Processors
- Alaa Alameldeen and David Wood
- University of Wisconsin-Madison
- Wisconsin Multifacet Project
- http://www.cs.wisc.edu/multifacet
Slide 2: Overview
- Design of high-performance processors
- Processor speed improves faster than memory
- Memory latency dominates performance
- Need more effective cache designs
- On-chip cache compression
- Increases effective cache size
- Increases cache hit latency
- Does cache compression help or hurt?
Slides 3-6: Does Cache Compression Help or Hurt?
- Adaptive compression determines when compression is beneficial
Slide 7: Outline
- Motivation
- Cache Compression Framework
- Compressed Cache Hierarchy
- Decoupled Variable-Segment Cache
- Adaptive Compression
- Evaluation
- Conclusions
Slide 8: Compressed Cache Hierarchy
Slides 9-14: Decoupled Variable-Segment Cache
- Objective: pack more lines into the same space
- [Figure: tag area and data area, built up incrementally across slides 9-14]
- Start from a 2-way set-associative cache with 64-byte lines
- Each tag contains an address tag, permissions, and LRU (replacement) bits
- Add two more tags per set (addresses A-D)
- Add compression size, compression status, and more LRU bits to each tag
- Divide the data area into 8-byte segments
- Data lines are composed of 1-8 segments
- Final state of the example set: Addr A uncompressed (CSize 3), Addr B compressed (CSize 2), Addr C compressed (CSize 6), Addr D compressed (CSize 4); Addr D's tag is present but its line isn't
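As a quick sanity check on this geometry, here is a minimal sketch (function and constant names are hypothetical, not from the slides) of how many 8-byte segments a line occupies in the data area:

```python
import math

SEGMENT_BYTES = 8   # the data area is divided into 8-byte segments
LINE_BYTES = 64     # an uncompressed line spans all 8 segments

def segments_needed(size_bytes: int, compressed: bool) -> int:
    """Segments a line occupies: uncompressed lines always take 8;
    compressed lines take ceil(size / 8), i.e. 1-8 segments."""
    if not compressed:
        return LINE_BYTES // SEGMENT_BYTES
    return max(1, math.ceil(size_bytes / SEGMENT_BYTES))

# Addr C from the example: a line compressed to 41-48 bytes occupies
# 6 segments, matching its CSize of 6.
assert segments_needed(44, compressed=True) == 6
```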
Slide 15: Outline
- Motivation
- Cache Compression Framework
- Adaptive Compression
- Key Insight
- Classification of L2 accesses
- Global compression predictor
- Evaluation
- Conclusions
Slide 16: Adaptive Compression
- Use the past to predict the future
- Key insight: the LRU stack [Mattson et al., 1970] indicates, for each reference, whether compression helps or hurts
Slide 17: Cost/Benefit Classification
- [Figure: LRU stack of four tags over the data area; running example: Addr A uncompressed (CSize 3), Addr B compressed (2), Addr C compressed (6), Addr D compressed (4)]
- Classify each cache reference
- Four-way set-associative cache with space for two 64-byte lines: 16 available segments in total
Slide 18: An Unpenalized Hit
- Read/Write Address A
- LRU stack order 1 ≤ 2 ⇒ hit regardless of compression
- Uncompressed line ⇒ no decompression penalty
- Neither cost nor benefit
Slide 19: A Penalized Hit
- Read/Write Address B
- LRU stack order 2 ≤ 2 ⇒ hit regardless of compression
- Compressed line ⇒ decompression penalty incurred
- Compression cost
Slide 20: An Avoided Miss
- Read/Write Address C
- LRU stack order 3 > 2 ⇒ hit only because of compression
- Compression benefit: eliminated an off-chip miss
Slide 21: An Avoidable Miss
- Read/Write Address D
- Line is not in the cache, but its tag exists at LRU stack order 4
- Sum(CSize) = 15 ≤ 16 ⇒ missed only because some lines are not compressed
- Potential compression benefit
Slide 22: An Unavoidable Miss
- Read/Write Address E
- LRU stack order > 4 ⇒ compression wouldn't have helped
- Line is not in the cache and its tag does not exist
- Neither cost nor benefit
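A minimal sketch of this five-way classification, assuming the 4-way set with 16 data segments from slide 17; the helper and its argument names are hypothetical:

```python
def classify_access(stack_order, line_present, line_compressed, sum_csize,
                    ways=4, total_segments=16):
    """Classify an L2 access (cf. slides 18-22 and 40).

    stack_order:     1-based LRU stack position of the matching tag,
                     or None if no tag matches.
    line_present:    True if the data line is actually stored.
    line_compressed: True if the stored line is compressed.
    sum_csize:       total compressed size (in segments) of the lines at
                     stack positions 1..stack_order.
    """
    top_half = ways // 2  # positions that would hit even without compression
    if stack_order is None:
        return "unavoidable miss"      # no tag: compression couldn't help
    if line_present:
        if stack_order <= top_half:
            # Hit regardless of compression; decompression is the only cost.
            return "penalized hit" if line_compressed else "unpenalized hit"
        return "avoided miss"          # hit only because of compression
    # Tag present but line absent: would compressing every line have fit?
    if sum_csize <= total_segments:
        return "avoidable miss"
    return "unavoidable miss"

# Slide 21's example: Address D, tag at stack order 4, line absent,
# Sum(CSize) = 15 <= 16.
assert classify_access(4, False, True, 15) == "avoidable miss"
```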
Slide 23: Compression Predictor
- Estimate Benefit(Compression) − Cost(Compression)
- Single counter: Global Compression Predictor (GCP)
- Saturating up/down 19-bit counter
- GCP updated on each cache access
- Benefit: increment by memory latency
- Cost: decrement by decompression latency
- Optimization: normalize to decompression latency = 1
- Cache allocation
- Allocate compressed line if GCP ≥ 0
- Allocate uncompressed line if GCP < 0
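A minimal sketch of the predictor, assuming both avoided and avoidable misses count as benefit (per slides 20-21), the 400-cycle memory latency from slide 26, and the five-cycle decompression from slide 38; after normalizing the decompression cost to 1, an avoided or avoidable miss is worth 400/5 = 80. Names are illustrative:

```python
GCP_BITS = 19
GCP_MAX = (1 << (GCP_BITS - 1)) - 1   # saturating signed-counter bounds
GCP_MIN = -(1 << (GCP_BITS - 1))

MEMORY_LATENCY = 400        # cycles (slide 26)
DECOMPRESSION_LATENCY = 5   # cycles (slide 38)
BENEFIT = MEMORY_LATENCY // DECOMPRESSION_LATENCY  # 80; normalized cost is 1

gcp = 0  # the single global compression predictor

def update_gcp(access_class: str) -> None:
    """Update the GCP on each L2 access (slide 23): avoided and avoidable
    misses add the normalized memory latency; penalized hits subtract the
    normalized decompression latency; other accesses are neutral."""
    global gcp
    if access_class in ("avoided miss", "avoidable miss"):
        gcp = min(GCP_MAX, gcp + BENEFIT)
    elif access_class == "penalized hit":
        gcp = max(GCP_MIN, gcp - 1)

def allocate_compressed() -> bool:
    """On allocation, store the line compressed iff GCP >= 0 (slide 23)."""
    return gcp >= 0
```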
Slide 24: Outline
- Motivation
- Cache Compression Framework
- Adaptive Compression
- Evaluation
- Simulation Setup
- Performance
- Conclusions
Slide 25: Simulation Setup
- Simics full-system simulator, augmented with:
- Detailed OoO processor simulator [TFSim, Mauer et al., 2002]
- Detailed memory timing simulator [Martin et al., 2002]
- Workloads
- Commercial workloads
- Database servers: OLTP and SPECJBB
- Static Web serving: Apache and Zeus
- SPEC2000 benchmarks
- SPECint: bzip, gcc, mcf, twolf
- SPECfp: ammp, applu, equake, swim
Slide 26: System Configuration
- A dynamically scheduled SPARC V9 uniprocessor
- Configuration parameters:

  L1 Cache:           split I/D, 64KB each, 2-way SA, 64B lines, 2-cycle access
  L2 Cache:           unified, 4MB, 8-way SA, 64B lines, 20 cycles + decompression latency per access
  Memory:             4GB DRAM, 400-cycle access time, 128 outstanding requests
  Processor pipeline: 4-wide superscalar, 11 stages: fetch (3), decode (3), schedule (1), execute (1), retire (3)
  Reorder buffer:     64 entries
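For concreteness, the same parameters as a plain data structure; this is a hypothetical rendering, since the actual simulators (Simics/TFSim) use their own configuration formats:

```python
SYSTEM_CONFIG = {
    "l1_cache": {"split": "I/D", "size_kb": 64, "assoc": 2,
                 "line_bytes": 64, "access_cycles": 2},
    "l2_cache": {"unified": True, "size_mb": 4, "assoc": 8,
                 "line_bytes": 64, "access_cycles": 20,
                 "plus_decompression_latency": True},
    "memory":   {"size_gb": 4, "access_cycles": 400,
                 "outstanding_requests": 128},
    "pipeline": {"width": 4, "stages": 11,
                 "breakdown": {"fetch": 3, "decode": 3, "schedule": 1,
                               "execute": 1, "retire": 3}},
    "reorder_buffer_entries": 64,
}
```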
Slide 27: Simulated Cache Configurations
- Always: all compressible lines are stored in compressed format
- Decompression penalty for all compressed lines
- Never: all cache lines are stored in uncompressed format
- Cache is 8-way set-associative with half the number of sets
- Does not incur a decompression penalty
- Adaptive: our adaptive compression scheme
Slides 28-31: Performance
- [Figure: runtime results for SPECint, SPECfp, and commercial workloads under Always, Never, and Adaptive]
- Callouts: up to 35% speedup, up to 18% slowdown
- Slide 31 callout: bug in the GCP update
- Adaptive performs similar to the best of Always and Never
Slide 32: Effective Cache Capacity
Slide 33: Cache Miss Rates
- [Figure: per-benchmark miss rates. Callouts: misses per 1000 instructions of 0.09, 2.52, 12.28, 14.38; penalized hits per avoided miss of 6709, 489, 12.3, 4.7]
Slide 34: Adapting to L2 Sizes
- [Figure: sensitivity to L2 size. Callouts: misses per 1000 instructions of 104.8, 36.9, 0.09, 0.05; penalized hits per avoided miss of 0.93, 5.7, 6503, 326000]
Slide 35: Conclusions
- Cache compression increases cache capacity but slows down cache hit time
- Helps some benchmarks (e.g., apache, mcf)
- Hurts other benchmarks (e.g., gcc, ammp)
- Our proposal: adaptive compression
- Uses the (LRU) replacement stack to determine whether compression helps or hurts
- Updates a single global saturating counter on cache accesses
- Adaptive compression performs similar to the better of Always Compress and Never Compress
Slide 36: Backup Slides
- Frequent Pattern Compression (FPC)
- Decoupled Variable-Segment Cache
- Classification of L2 Accesses
- (LRU) Stack Replacement
- Cache Miss Rates
- Adapting to L2 Sizes (mcf)
- Adapting to L1 Size
- Adapting to Decompression Latency (mcf)
- Adapting to Decompression Latency (ammp)
- Phase Behavior (gcc)
- Phase Behavior (mcf)
- Can We Do Better Than Adaptive?
Slide 37: Decoupled Variable-Segment Cache
- Each set contains four tags and space for two uncompressed lines
- Data area divided into 8-byte segments
- Each tag is composed of:
- Address tag (same as an uncompressed cache)
- Permissions (same as an uncompressed cache)
- CStatus: 1 if the line is compressed, 0 otherwise
- CSize: size of the compressed line in segments
- LRU/replacement bits (same as an uncompressed cache)
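A minimal sketch of this per-tag metadata; the slide specifies only the fields, so the types and the helper method here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class VSCacheTag:
    """One of the four tags in a decoupled variable-segment cache set."""
    address_tag: int   # same as an uncompressed cache
    permissions: int   # permission bits, same as an uncompressed cache
    cstatus: bool      # True if the line is stored compressed
    csize: int         # compressed size in 8-byte segments (1-8)
    lru_bits: int      # replacement stack position (more bits: 4 tags/set)

    def segments_used(self) -> int:
        """Segments the line occupies in the data area: compressed lines
        use csize, uncompressed lines always use all 8 segments."""
        return self.csize if self.cstatus else 8
```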
Slides 38-39: Frequent Pattern Compression (FPC)
- A significance-based compression algorithm combined with zero run-length encoding
- Compresses each 32-bit word separately
- Suitable for short (32-256 byte) cache lines
- Compressible patterns: zero runs, sign-extended 4-, 8-, and 16-bit values, zero-padded half-word, two sign-extended half-words, repeated byte
- A 64-byte line is decompressed in a five-stage pipeline (five cycles)
- Related work:
- X-Match and X-RL algorithms [Kjelso et al., 1996]
- Address and data significance-based compression [Farrens and Park, 1991; Citron and Rudolph, 1995; Canal et al., 2000]
- More details in the technical report: "Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches," Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR-2004-1500, April 2004 (available online).
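A minimal sketch of per-word pattern detection for the patterns listed above; the real hardware detects patterns in parallel and also emits prefix bits and zero-run lengths, all of which are omitted here, and the match ordering is illustrative:

```python
def fits_signed(value: int, total_bits: int, bits: int) -> bool:
    """True if `value` (an unsigned total_bits-bit pattern) equals the
    sign extension of its low `bits` bits."""
    if value >= (1 << (total_bits - 1)):
        value -= 1 << total_bits          # reinterpret as signed
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def match_pattern(word: int) -> str:
    """Classify one 32-bit word by FPC pattern (slide 39). Zero-run
    length encoding spans words and is handled by the caller."""
    if word == 0:
        return "zero"                     # folded into a zero run
    for bits in (4, 8, 16):
        if fits_signed(word, 32, bits):
            return f"sign-extended {bits}-bit"
    if word & 0xFFFF == 0:
        return "zero-padded half-word"    # data in the upper half only
    hi, lo = word >> 16, word & 0xFFFF
    if fits_signed(hi, 16, 8) and fits_signed(lo, 16, 8):
        return "two sign-extended half-words"
    if word == (word & 0xFF) * 0x01010101:
        return "repeated byte"
    return "uncompressed"

assert match_pattern(0xFFFFFFFD) == "sign-extended 4-bit"
assert match_pattern(0xABABABAB) == "repeated byte"
```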
Slide 40: Classification of L2 Accesses
- Cache hits:
- Unpenalized hit: hit to an uncompressed line that would have hit without compression
- Penalized hit: hit to a compressed line that would have hit without compression
- Avoided miss: hit to a line that would NOT have hit without compression
- Cache misses:
- Avoidable miss: miss to a line that would have hit with compression
- Unavoidable miss: miss to a line that would have missed even with compression
Slide 41: (LRU) Stack Replacement
- How to differentiate penalized hits from avoided misses?
- Only hits to the top half of the tags in the LRU stack are penalized hits
- How to differentiate avoidable from unavoidable misses?
- The classification does not depend on strict LRU replacement:
- Any replacement algorithm for the top half of the tags
- Any stack algorithm for the remaining tags
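A minimal sketch (hypothetical helper) of the per-set LRU stack bookkeeping that yields the stack order used by the classification; positions 1-2 correspond to lines that would hit even without compression:

```python
from typing import Optional

def touch(stack: list, tag) -> Optional[int]:
    """Move `tag` to the MRU position of a per-set LRU stack and return
    its previous 1-based stack order, or None if the tag was absent."""
    if tag in stack:
        order = stack.index(tag) + 1   # 1 = most recently used
        stack.remove(tag)
        stack.insert(0, tag)
        return order
    stack.insert(0, tag)
    if len(stack) > 4:                 # at most four tags per set (slide 37)
        stack.pop()                    # evict the LRU tag
    return None
```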
Slide 42: Cache Miss Rates
Slide 43: Adapting to L2 Sizes (mcf)
- [Figure: sensitivity of mcf to L2 size. Callouts: misses per 1000 instructions of 98.9, 88.1, 12.4, 0.02; penalized hits per avoided miss of 11.6, 4.4, 12.6, 2×10^6]
Slide 44: Adapting to L1 Size
Slide 45: Adapting to Decompression Latency (mcf)
Slide 46: Adapting to Decompression Latency (ammp)
Slide 47: Phase Behavior (gcc)
- [Figure: predictor value (K) and cache size (MB) during execution]
Slide 48: Phase Behavior (mcf)
- [Figure: predictor value (K) and cache size (MB) during execution]
Slide 49: Can We Do Better Than Adaptive?
- Optimal is an unrealistic configuration: Always with no decompression penalty