1
Adaptive Cache Compression for High-Performance
Processors
  • Alaa Alameldeen and David Wood
  • University of Wisconsin-Madison
  • Wisconsin Multifacet Project
  • http://www.cs.wisc.edu/multifacet

2
Overview
  • Design of high-performance processors
  • Processor speed improves faster than memory
  • Memory latency dominates performance
  • Need more effective cache designs
  • On-chip cache compression
  • Increases effective cache size
  • Increases cache hit latency
  • Does cache compression help or hurt?

3
Does Cache Compression Help or Hurt?
4
Does Cache Compression Help or Hurt?
5
Does Cache Compression Help or Hurt?
6
Does Cache Compression Help or Hurt?
  • Adaptive Compression determines when compression
    is beneficial

7
Outline
  • Motivation
  • Cache Compression Framework
  • Compressed Cache Hierarchy
  • Decoupled Variable-Segment Cache
  • Adaptive Compression
  • Evaluation
  • Conclusions

8
Compressed Cache Hierarchy
9
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
  • 2-way set-associative with 64-byte lines
  • Tag Contains Address Tag, Permissions, LRU
    (Replacement) Bits

10
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Add two more tags
11
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Add Compression Size, Status, More LRU bits
12
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Divide Data Area into 8-byte segments
13
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Data lines composed of 1-8 segments
14
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
Tag is present but line isn't
Compression Status
Compressed Size
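
To make the segment accounting concrete, here is a minimal C++ sketch (the names and helper functions are mine, not from the slides): an uncompressed 64-byte line always occupies all 8 segments, while a compressed line occupies only its compressed size.

```cpp
// Minimal sketch of the segment accounting implied above; names and
// helpers are illustrative, not taken from the actual design.
#include <utility>
#include <vector>

constexpr int kLineSegments = 8;   // an uncompressed 64-byte line spans 8 segments
constexpr int kSetSegments  = 16;  // data area per set: room for two uncompressed lines

// Segments a line actually occupies: all 8 if stored uncompressed,
// otherwise only its compressed size (CSize, already in segments).
int segmentsUsed(bool cstatus, int csize) {
    return cstatus ? csize : kLineSegments;
}

// A set can keep all of its tagged lines resident only if their
// segment counts sum to at most the 16 segments available.
bool setFits(const std::vector<std::pair<bool, int>>& lines) {
    int total = 0;
    for (const auto& [cstatus, csize] : lines)
        total += segmentsUsed(cstatus, csize);
    return total <= kSetSegments;
}
```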
15
Outline
  • Motivation
  • Cache Compression Framework
  • Adaptive Compression
  • Key Insight
  • Classification of L2 accesses
  • Global compression predictor
  • Evaluation
  • Conclusions

16
Adaptive Compression
  • Use past to predict future
  • Key Insight
  • LRU stack [Mattson et al., 1970] indicates for
    each reference whether compression helps or hurts

17
Cost/Benefit Classification
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Classify each cache reference
  • Four-way SA cache with space for two 64-byte
    lines
  • Total of 16 available segments

18
An Unpenalized Hit
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address A
  • LRU stack order 1 ≤ 2 → Hit regardless of compression
  • Uncompressed line → No decompression penalty
  • Neither cost nor benefit

19
A Penalized Hit
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address B
  • LRU stack order 2 ≤ 2 → Hit regardless of compression
  • Compressed line → Decompression penalty incurred
  • Compression cost

20
An Avoided Miss
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address C
  • LRU stack order 3 > 2 → Hit only because of compression
  • Compression benefit: eliminated off-chip miss

21
An Avoidable Miss
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
Sum(CSize) = 15 ≤ 16
  • Read/Write Address D
  • Line is not in the cache but tag exists at LRU
    stack order 4
  • Missed only because some lines are not compressed
  • Potential compression benefit

22
An Unavoidable Miss
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address E
  • LRU stack order > 4 → Compression wouldn't have helped
  • Line is not in the cache and tag does not exist
  • Neither cost nor benefit
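
The five cases walked through above can be summarized in one classification routine. The sketch below is illustrative (the interface and names are mine); it assumes a 4-way tag array over data space for two uncompressed lines (16 segments).

```cpp
// Illustrative sketch of the cost/benefit classification from the
// preceding examples; interface and names are mine.
enum class AccessClass {
    UnpenalizedHit,   // hit, uncompressed line: neither cost nor benefit
    PenalizedHit,     // hit, compressed line: decompression cost
    AvoidedMiss,      // hit only because of compression: benefit
    AvoidableMiss,    // miss compression would have avoided: potential benefit
    UnavoidableMiss   // miss even with compression: neither cost nor benefit
};

AccessClass classify(bool tagFound,        // tag present in the 4-way tag array
                     bool dataResident,    // line currently has segments in the data area
                     bool lineCompressed,  // CStatus of the referenced line
                     int  lruDepth,        // LRU stack order of the reference (1 = MRU)
                     int  sumCSize) {      // sum of CSize over lines at depth <= lruDepth
    constexpr int kUncompressedWays = 2;   // an uncompressed cache holds 2 lines per set
    constexpr int kSetSegments      = 16;  // data segments per set

    if (dataResident) {
        if (lruDepth <= kUncompressedWays)          // would hit even without compression
            return lineCompressed ? AccessClass::PenalizedHit
                                  : AccessClass::UnpenalizedHit;
        return AccessClass::AvoidedMiss;            // hit only thanks to compression
    }
    if (tagFound && sumCSize <= kSetSegments)       // would have fit had all lines compressed
        return AccessClass::AvoidableMiss;
    return AccessClass::UnavoidableMiss;
}
```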

23
Compression Predictor
  • Estimate Benefit(Compression) − Cost(Compression)
  • Single counter Global Compression Predictor
    (GCP)
  • Saturating up/down 19-bit counter
  • GCP updated on each cache access
  • Benefit: Increment by memory latency
  • Cost: Decrement by decompression latency
  • Optimization: Normalize to decompression latency = 1
  • Cache Allocation
  • Allocate compressed line if GCP ≥ 0
  • Allocate uncompressed lines if GCP < 0
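
A compact sketch of this predictor is shown below. The 19-bit saturating counter and the GCP ≥ 0 allocation rule follow the slide; the class layout and method names are illustrative assumptions.

```cpp
// Sketch of the global compression predictor (GCP); class layout and
// method names are illustrative. Updates are normalized so that one
// decompression costs 1.
#include <algorithm>
#include <cstdint>

class GlobalCompressionPredictor {
    static constexpr int32_t kMax = (1 << 18) - 1;  // 19-bit signed saturating
    static constexpr int32_t kMin = -(1 << 18);     // up/down counter
    int32_t gcp_ = 0;

public:
    // Avoided (or avoidable) miss: compression saved, or could have saved,
    // an off-chip access. Credit the memory latency, normalized to the
    // decompression latency.
    void recordBenefit(int32_t normalizedMemoryLatency) {
        gcp_ = std::min(kMax, gcp_ + normalizedMemoryLatency);
    }

    // Penalized hit: the reference paid a decompression it did not need.
    void recordCost() { gcp_ = std::max(kMin, gcp_ - 1); }

    // Allocation decision: store the incoming line compressed only while
    // the predictor estimates that compression's benefit outweighs its cost.
    bool allocateCompressed() const { return gcp_ >= 0; }
};
```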

24
Outline
  • Motivation
  • Cache Compression Framework
  • Adaptive Compression
  • Evaluation
  • Simulation Setup
  • Performance
  • Conclusions

25
Simulation Setup
  • Simics full system simulator augmented with
  • Detailed OoO processor simulator [TFSim; Mauer et al., 2002]
  • Detailed memory timing simulator [Martin et al., 2002]
  • Workloads
  • Commercial workloads
  • Database servers: OLTP and SPECjbb
  • Static Web serving: Apache and Zeus
  • SPEC2000 benchmarks
  • SPECint: bzip, gcc, mcf, twolf
  • SPECfp: ammp, applu, equake, swim

26
System Configuration
  • A dynamically scheduled SPARC V9 uniprocessor
  • Configuration parameters

L1 Cache: Split I&D, 64KB each, 2-way SA, 64B line, 2-cycle access
L2 Cache: Unified 4MB, 8-way SA, 64B line, 20 cycles + decompression latency per access
Memory: 4GB DRAM, 400-cycle access time, 128 outstanding requests
Processor pipeline: 4-wide superscalar, 11-stage pipeline: fetch (3), decode (3), schedule (1), execute (1), retire (3)
Reorder buffer: 64 entries
27
Simulated Cache Configurations
  • Always: All compressible lines are stored in
    compressed format
  • Decompression penalty for all compressed lines
  • Never: All cache lines are stored in uncompressed
    format
  • Cache is 8-way set-associative with half the
    number of sets
  • Does not incur decompression penalty
  • Adaptive: Our adaptive compression scheme

28
Performance
(performance figure: SPECint, SPECfp, and commercial workload groups)
29
Performance
30
Performance
35% Speedup
18% Slowdown
31
Performance
Bug in GCP update
Adaptive performs similarly to the better of Always and Never
32
Effective Cache Capacity
33
Cache Miss Rates
(figure: misses per 1000 instructions and penalized hits per avoided miss, by benchmark)
34
Adapting to L2 Sizes
(figure: misses per 1000 instructions and penalized hits per avoided miss across L2 sizes)
35
Conclusions
  • Cache compression increases effective cache capacity
    but adds latency to cache hits
  • Helps some benchmarks (e.g., apache, mcf)
  • Hurts other benchmarks (e.g., gcc, ammp)
  • Our proposal: Adaptive compression
  • Uses (LRU) replacement stack to determine whether
    compression helps or hurts
  • Updates a single global saturating counter on
    cache accesses
  • Adaptive compression performs similarly to the
    better of Always Compress and Never Compress

36
Backup Slides
  • Frequent Pattern Compression (FPC)
  • Decoupled Variable-Segment Cache
  • Classification of L2 Accesses
  • (LRU) Stack Replacement
  • Cache Miss Rates
  • Adapting to L2 Sizes (mcf)
  • Adapting to L1 Size
  • Adapting to Decompression Latency (mcf)
  • Adapting to Decompression Latency (ammp)
  • Phase Behavior (gcc)
  • Phase Behavior (mcf)
  • Can We Do Better Than Adaptive?

37
Decoupled Variable-Segment Cache
  • Each set contains four tags and space for two
    uncompressed lines
  • Data area divided into 8-byte segments
  • Each tag is composed of
  • Address tag
  • Permissions
  • CStatus: 1 if the line is compressed, 0 otherwise
  • CSize: Size of compressed line in segments
  • LRU/replacement bits

Same as uncompressed cache
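
A tag entry with these fields could be sketched roughly as follows; the field widths and struct layout are illustrative guesses, not the actual hardware encoding.

```cpp
// Rough sketch of one tag entry; bit widths are illustrative only.
#include <cstdint>

struct TagEntry {
    uint64_t addressTag;   // address tag, same as an uncompressed cache
    uint8_t  permissions;  // coherence permissions
    uint8_t  lru;          // replacement (LRU stack) position among the 4 tags
    bool     cstatus;      // true if the line is stored compressed
    uint8_t  csize;        // compressed size in 8-byte segments (1-8)
};
```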
38
Frequent Pattern Compression
  • A significance-based compression algorithm
  • Related Work
  • X-Match and X-RL algorithms [Kjelso et al., 1996]
  • Address and data significance-based compression
    [Farrens and Park, 1991; Citron and Rudolph, 1995;
    Canal et al., 2000]
  • A 64-byte line is decompressed in five cycles
  • More details in technical report
  • Frequent Pattern Compression A
    Significance-Based Compression Algorithm for L2
    Caches,
    Alaa R. Alameldeen and
    David A. Wood, Dept. of Computer Sciences
    Technical Report CS-TR-2004-1500, April 2004
    (available online).

39
Frequent Pattern Compression (FPC)
  • A significance-based compression algorithm
    combined with zero run-length encoding
  • Compresses each 32-bit word separately
  • Suitable for short (32-256 byte) cache lines
  • Compressible patterns: zero runs, sign-extended
    4-, 8-, and 16-bit words, zero-padded half-word,
    two sign-extended half-words, repeated byte
  • A 64-byte line is decompressed in a five-stage
    pipeline
  • More details in technical report
  • Frequent Pattern Compression A
    Significance-Based Compression Algorithm for L2
    Caches,
    Alaa R. Alameldeen and
    David A. Wood, Dept. of Computer Sciences
    Technical Report CS-TR-2004-1500, April 2004
    (available online).
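
A rough per-word pattern check, corresponding to the pattern list above, might look like the sketch below. The prefix codes and exact encodings from the technical report are not reproduced; the names, helpers, and ordering of checks are illustrative.

```cpp
// Illustrative per-word pattern matching for FPC; prefix codes and exact
// encodings from the technical report are omitted.
#include <cstdint>

enum class FpcPattern {
    Zero, SignExt4, SignExt8, SignExt16,
    ZeroPaddedHalf, TwoSignExtHalves, RepeatedByte, Uncompressed
};

// True if the 32-bit word equals the sign-extension of its low `bits` bits.
static bool signExtends(uint32_t w, int bits) {
    const int shift = 32 - bits;
    const int32_t v = static_cast<int32_t>(w << shift) >> shift;  // arithmetic shift back
    return static_cast<uint32_t>(v) == w;
}

// True if a halfword is the sign-extension of its low byte.
static bool halfSignExtends(uint16_t h) {
    return static_cast<int16_t>(h) == static_cast<int16_t>(static_cast<int8_t>(h));
}

FpcPattern classifyWord(uint32_t w) {
    if (w == 0)              return FpcPattern::Zero;            // folded into zero runs
    if (signExtends(w, 4))   return FpcPattern::SignExt4;
    if (signExtends(w, 8))   return FpcPattern::SignExt8;
    if (signExtends(w, 16))  return FpcPattern::SignExt16;
    if ((w & 0xFFFFu) == 0)  return FpcPattern::ZeroPaddedHalf;  // illustrative: low half zero
    if (halfSignExtends(static_cast<uint16_t>(w)) &&
        halfSignExtends(static_cast<uint16_t>(w >> 16)))
        return FpcPattern::TwoSignExtHalves;                     // two sign-extended bytes
    const uint32_t b = w & 0xFFu;
    if (w == (b | (b << 8) | (b << 16) | (b << 24)))
        return FpcPattern::RepeatedByte;
    return FpcPattern::Uncompressed;
}
```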

40
Classification of L2 Accesses
  • Cache hits
  • Unpenalized hit: Hit to an uncompressed line that
    would have hit without compression
  • Penalized hit: Hit to a compressed line that
    would have hit without compression
  • Avoided miss: Hit to a line that would NOT have
    hit without compression
  • Cache misses
  • Avoidable miss: Miss to a line that would have
    hit with compression
  • Unavoidable miss: Miss to a line that would have
    missed even with compression

41
(LRU) Stack Replacement
  • Differentiate penalized hits and avoided misses?
  • Only hits to top half of the tags in the LRU
    stack are penalized hits
  • Differentiate avoidable and unavoidable misses?
  • Does not depend on LRU replacement
  • Any replacement algorithm for top half of tags
  • Any stack algorithm for the remaining tags

42
Cache Miss Rates
43
Adapting to L2 Sizes
(figure: misses per 1000 instructions and penalized hits per avoided miss across L2 sizes)
44
Adapting to L1 Size
45
Adapting to Decompression Latency
46
Adapting to Decompression Latency
47
Phase Behavior
Predictor Value (K)
Cache Size (MB)
48
Phase Behavior
Predictor Value (K)
Cache Size (MB)
49
Can We Do Better Than Adaptive?
  • Optimal is an unrealistic configuration: Always
    with no decompression penalty