Adaptive Cache Compression for High-Performance Processors - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Adaptive Cache Compression for High-Performance
Processors
  • Alaa Alameldeen and David Wood
  • University of Wisconsin-Madison
  • Wisconsin Multifacet Project
  • http://www.cs.wisc.edu/multifacet

2
Overview
  • Design of high-performance processors
  • Processor speed improves faster than memory
  • Memory latency dominates performance
  • Need more effective cache designs
  • On-chip cache compression
  • Increases effective cache size
  • Increases cache hit latency
  • Does cache compression help or hurt?

3
Does Cache Compression Help or Hurt?
4
Does Cache Compression Help or Hurt?
5
Does Cache Compression Help or Hurt?
6
Does Cache Compression Help or Hurt?
  • Adaptive Compression determines when compression
    is beneficial

7
Outline
  • Motivation
  • Cache Compression Framework
  • Compressed Cache Hierarchy
  • Decoupled Variable-Segment Cache
  • Adaptive Compression
  • Evaluation
  • Conclusions

8
Compressed Cache Hierarchy
9
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
  • 2-way set-associative with 64-byte lines
  • Tag contains: Address tag, Permissions, LRU
    (Replacement) bits

10
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Add two more tags
11
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Add Compression Size, Status, More LRU bits
12
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Divide Data Area into 8-byte segments
13
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Address A
Address B
Address C
Address D
Data lines composed of 1-8 segments
14
Decoupled Variable-Segment Cache
  • Objective: pack more lines into the same space

Tag Area
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
Tag is present but line isn't
Compression Status
Compressed Size
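The CSize field counts 8-byte segments; a minimal sketch of the segment arithmetic these slides imply (the 64-byte line and 8-byte segment sizes come from the slides, the function name is illustrative):

```c
#include <stdint.h>

#define SEGMENT_BYTES 8    /* data area is divided into 8-byte segments */
#define LINE_BYTES   64    /* uncompressed cache line size */

/* Segments a line occupies in the data area: an uncompressed line always
   takes 64/8 = 8 segments; a compressed line takes ceil(bytes/8), i.e. 1-8. */
static unsigned segments_needed(unsigned compressed_bytes, int is_compressed)
{
    if (!is_compressed)
        return LINE_BYTES / SEGMENT_BYTES;
    return (compressed_bytes + SEGMENT_BYTES - 1) / SEGMENT_BYTES;
}
```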
15
Outline
  • Motivation
  • Cache Compression Framework
  • Adaptive Compression
  • Key Insight
  • Classification of L2 accesses
  • Global compression predictor
  • Evaluation
  • Conclusions

16
Adaptive Compression
  • Use past to predict future
  • Key Insight
  • LRU stack [Mattson et al., 1970] indicates for
    each reference whether compression helps or hurts

17
Cost/Benefit Classification
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Classify each cache reference
  • Four-way SA cache with space for two 64-byte
    lines
  • Total of 16 available segments

18
An Unpenalized Hit
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address A
  • LRU stack order 1 ≤ 2 → Hit regardless of
    compression
  • Uncompressed line → No decompression penalty
  • Neither cost nor benefit

19
A Penalized Hit
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address B
  • LRU stack order 2 ≤ 2 → Hit regardless of
    compression
  • Compressed line → Decompression penalty incurred
  • Compression cost

20
An Avoided Miss
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address C
  • LRU stack order 3 > 2 → Hit only because of
    compression
  • Compression benefit: Eliminated off-chip miss

21
An Avoidable Miss
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
Sum(CSize) = 15 ≤ 16
  • Read/Write Address D
  • Line is not in the cache but tag exists at LRU
    stack order 4
  • Missed only because some lines are not compressed
  • Potential compression benefit

22
An Unavoidable Miss
LRU Stack
Data Area
Addr A uncompressed 3
Addr B compressed 2
Addr C compressed 6
Addr D compressed 4
  • Read/Write Address E
  • LRU stack order > 4 → Compression wouldn't have
    helped
  • Line is not in the cache and tag does not exist
  • Neither cost nor benefit

23
Compression Predictor
  • Estimate Benefit(Compression) − Cost(Compression)
  • Single counter: Global Compression Predictor
    (GCP)
  • Saturating up/down 19-bit counter
  • GCP updated on each cache access
  • Benefit: Increment by memory latency
  • Cost: Decrement by decompression latency
  • Optimization: Normalize to decompression latency
    = 1
  • Cache Allocation
  • Allocate compressed line if GCP ≥ 0
  • Allocate uncompressed lines if GCP < 0
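A minimal sketch of how the GCP could be maintained, assuming a signed saturating counter and latencies normalized so that one decompression costs 1 (the 19-bit width, the increment/decrement rule, and the allocation threshold come from this slide; names and saturation details are illustrative):

```c
#define GCP_BITS 19
#define GCP_MAX  ((1 << (GCP_BITS - 1)) - 1)
#define GCP_MIN  (-(1 << (GCP_BITS - 1)))

static int gcp;  /* Global Compression Predictor, a single counter */

/* Saturating up/down update. */
static void gcp_add(int delta)
{
    long v = (long)gcp + delta;
    if (v > GCP_MAX) v = GCP_MAX;
    if (v < GCP_MIN) v = GCP_MIN;
    gcp = (int)v;
}

/* Called on every L2 access after it has been classified (slides 17-22). */
static void gcp_update(int compression_benefit, int compression_cost,
                       int norm_memory_latency /* memory / decompression latency */)
{
    if (compression_benefit)          /* avoided or avoidable miss */
        gcp_add(norm_memory_latency);
    if (compression_cost)             /* penalized hit */
        gcp_add(-1);                  /* one (normalized) decompression latency */
}

/* Allocation policy: store an incoming line compressed only while the
   predictor indicates compression has been beneficial. */
static int allocate_compressed(void)
{
    return gcp >= 0;
}
```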

24
Outline
  • Motivation
  • Cache Compression Framework
  • Adaptive Compression
  • Evaluation
  • Simulation Setup
  • Performance
  • Conclusions

25
Simulation Setup
  • Simics full-system simulator augmented with:
  • Detailed OoO processor simulator (TFSim) [Mauer
    et al., 2002]
  • Detailed memory timing simulator [Martin et al.,
    2002]
  • Workloads
  • Commercial workloads
  • Database servers: OLTP and SPECjbb
  • Static Web serving: Apache and Zeus
  • SPEC2000 benchmarks
  • SPECint: bzip, gcc, mcf, twolf
  • SPECfp: ammp, applu, equake, swim

26
System configuration
  • A dynamically scheduled SPARC V9 uniprocessor
  • Configuration parameters

L1 Cache: Split I/D, 64KB each, 2-way SA, 64B line, 2 cycles/access
L2 Cache: Unified 4MB, 8-way SA, 64B line, 20 cycles + decompression latency per access
Memory: 4GB DRAM, 400-cycle access time, 128 outstanding requests
Processor pipeline: 4-wide superscalar, 11-stage pipeline: fetch (3), decode (3), schedule (1), execute (1), retire (3)
Reorder buffer: 64 entries
27
Simulated Cache Configurations
  • Always: All compressible lines are stored in
    compressed format
  • Decompression penalty for all compressed lines
  • Never: All cache lines are stored in uncompressed
    format
  • Cache is 8-way set-associative with half the
    number of sets
  • Does not incur decompression penalty
  • Adaptive: Our adaptive compression scheme

28
Performance
SPECint
SPECfp
Commercial
29
Performance
30
Performance
35% Speedup
18% Slowdown
31
Performance
Bug in GCP update
Adaptive performs similarly to the better of Always
and Never
32
Effective Cache Capacity
33
Cache Miss Rates
(Chart annotations: Misses per 1000 instructions: 0.09, 2.52, 12.28, 14.38;
Penalized hits per avoided miss: 6709, 489, 12.3, 4.7)
34
Adapting to L2 Sizes
(Chart annotations: Misses per 1000 instructions: 104.8, 36.9, 0.09, 0.05;
Penalized hits per avoided miss: 0.93, 5.7, 6503, 326000)
35
Conclusions
  • Cache compression increases effective cache
    capacity but increases cache hit latency
  • Helps some benchmarks (e.g., apache, mcf)
  • Hurts other benchmarks (e.g., gcc, ammp)
  • Our proposal: Adaptive compression
  • Uses (LRU) replacement stack to determine whether
    compression helps or hurts
  • Updates a single global saturating counter on
    cache accesses
  • Adaptive compression performs similarly to the
    better of Always Compress and Never Compress

36
Backup Slides
  • Frequent Pattern Compression (FPC)
  • Decoupled Variable-Segment Cache
  • Classification of L2 Accesses
  • (LRU) Stack Replacement
  • Cache Miss Rates
  • Adapting to L2 Sizes (mcf)
  • Adapting to L1 Size
  • Adapting to Decompression Latency (mcf)
  • Adapting to Decompression Latency (ammp)
  • Phase Behavior (gcc)
  • Phase Behavior (mcf)
  • Can We Do Better Than Adaptive?

37
Decoupled Variable-Segment Cache
  • Each set contains four tags and space for two
    uncompressed lines
  • Data area divided into 8-byte segments
  • Each tag is composed of:
  • Address tag
  • Permissions
  • CStatus: 1 if the line is compressed, 0
    otherwise
  • CSize: Size of compressed line in segments
  • LRU/replacement bits

(Address tag, permissions, and LRU bits: same as in an uncompressed cache)
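A minimal sketch of the per-tag state listed above (the fields come from this slide; the field widths and names are illustrative):

```c
#include <stdint.h>

/* One tag entry in the decoupled variable-segment cache.  Each set holds
   four such tags sharing the data space of two uncompressed 64-byte lines. */
struct dvs_tag {
    uint64_t addr_tag;     /* address tag */
    uint8_t  permissions;  /* permission / coherence state */
    uint8_t  cstatus;      /* CStatus: 1 if line is stored compressed, 0 otherwise */
    uint8_t  csize;        /* CSize: compressed size in 8-byte segments (1-8) */
    uint8_t  lru;          /* LRU / replacement stack position */
};
```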
38
Frequent Pattern Compression
  • A significance-based compression algorithm
  • Related Work
  • X-Match and X-RL algorithms [Kjelso et al.,
    1996]
  • Address and data significance-based compression
    [Farrens and Park, 1991; Citron and Rudolph,
    1995; Canal et al., 2000]
  • A 64-byte line is decompressed in five cycles
  • More details in technical report
  • Frequent Pattern Compression: A
    Significance-Based Compression Algorithm for L2
    Caches,
    Alaa R. Alameldeen and
    David A. Wood, Dept. of Computer Sciences
    Technical Report CS-TR-2004-1500, April 2004
    (available online).

39
Frequent Pattern Compression (FPC)
  • A significance-based compression algorithm
    combined with zero run-length encoding
  • Compresses each 32-bit word separately
  • Suitable for short (32-256 byte) cache lines
  • Compressible patterns: zero runs, sign-extended
    4-, 8-, and 16-bit values, zero-padded half-word,
    two sign-extended half-words, repeated byte
    (sketched at the end of this slide)
  • A 64-byte line is decompressed in a five-stage
    pipeline
  • More details in technical report
  • Frequent Pattern Compression: A
    Significance-Based Compression Algorithm for L2
    Caches,
    Alaa R. Alameldeen and
    David A. Wood, Dept. of Computer Sciences
    Technical Report CS-TR-2004-1500, April 2004
    (available online).
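A minimal sketch of per-word pattern matching for the patterns named above; the pattern set is from this slide, but the ordering, names, and encoding details are illustrative guesses rather than the exact FPC format (see the technical report for that):

```c
#include <stdint.h>

/* Which frequent pattern, if any, a 32-bit word matches. */
enum fp_pattern {
    FP_ZERO,            /* word is zero (candidate for zero run-length coding) */
    FP_SIGN_EXT_4,      /* sign-extended 4-bit value */
    FP_SIGN_EXT_8,      /* sign-extended 8-bit value */
    FP_SIGN_EXT_16,     /* sign-extended 16-bit value */
    FP_ZERO_PAD_HALF,   /* zero-padded half-word (lower 16 bits zero) */
    FP_TWO_SE_HALVES,   /* two sign-extended (byte) half-words */
    FP_REPEATED_BYTE,   /* all four bytes identical */
    FP_UNCOMPRESSED     /* no pattern matches; store the word as-is */
};

/* True if w fits in a signed field of the given width. */
static int fits_signed(int32_t w, int bits)
{
    int32_t lo = -(1 << (bits - 1)), hi = (1 << (bits - 1)) - 1;
    return w >= lo && w <= hi;
}

static enum fp_pattern classify_word(uint32_t u)
{
    int32_t w = (int32_t)u;
    uint8_t b = (uint8_t)u;

    if (u == 0)                                   return FP_ZERO;
    if (fits_signed(w, 4))                        return FP_SIGN_EXT_4;
    if (fits_signed(w, 8))                        return FP_SIGN_EXT_8;
    if (fits_signed(w, 16))                       return FP_SIGN_EXT_16;
    if ((u & 0xFFFF) == 0)                        return FP_ZERO_PAD_HALF;
    if (fits_signed((int16_t)(u >> 16), 8) &&
        fits_signed((int16_t)(u & 0xFFFF), 8))    return FP_TWO_SE_HALVES;
    if (((u >> 8) & 0xFF) == b && ((u >> 16) & 0xFF) == b &&
        ((u >> 24) & 0xFF) == b)                  return FP_REPEATED_BYTE;
    return FP_UNCOMPRESSED;
}
```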

40
Classification of L2 Accesses
  • Cache hits
  • Unpenalized hit: Hit to an uncompressed line that
    would have hit without compression
  • Penalized hit: Hit to a compressed line that
    would have hit without compression
  • Avoided miss: Hit to a line that would NOT have
    hit without compression
  • Cache misses
  • Avoidable miss: Miss to a line that would have
    hit with compression
  • Unavoidable miss: Miss to a line that would have
    missed even with compression
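A minimal sketch of this classification for the decoupled variable-segment cache, assuming LRU stack depth counted from 1 and the 4-tag / 2-line / 16-segment set organization from the earlier slides (names and parameters are illustrative):

```c
enum l2_access_class {
    UNPENALIZED_HIT,  /* hit, uncompressed line: no cost, no benefit */
    PENALIZED_HIT,    /* hit, compressed line that would have hit anyway: cost */
    AVOIDED_MISS,     /* hit only because of compression: benefit */
    AVOIDABLE_MISS,   /* miss that compression could have avoided: potential benefit */
    UNAVOIDABLE_MISS  /* miss even with compression: no cost, no benefit */
};

/* uncompressed_lines_per_set = 2 and segments_per_set = 16 in the
   configuration shown on the earlier slides. */
static enum l2_access_class classify(int tag_found, int data_present,
                                     int lru_depth, int line_compressed,
                                     int sum_csize_segments,
                                     int uncompressed_lines_per_set,
                                     int segments_per_set)
{
    if (data_present) {
        if (lru_depth <= uncompressed_lines_per_set)
            return line_compressed ? PENALIZED_HIT : UNPENALIZED_HIT;
        return AVOIDED_MISS;   /* would have missed without compression */
    }
    /* Miss: avoidable only if the tag is still tracked in the set and all of
       the set's lines would have fit had they all been stored compressed. */
    if (tag_found && sum_csize_segments <= segments_per_set)
        return AVOIDABLE_MISS;
    return UNAVOIDABLE_MISS;
}
```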

41
(LRU) Stack Replacement
  • Differentiate penalized hits and avoided misses?
  • Only hits to top half of the tags in the LRU
    stack are penalized hits
  • Differentiate avoidable and unavoidable misses?
  • Does not depend on LRU replacement
  • Any replacement algorithm for top half of tags
  • Any stack algorithm for the remaining tags

42
Cache Miss Rates
43
Adapting to L2 Sizes
(Chart annotations: Misses per 1000 instructions: 98.9, 88.1, 12.4, 0.02;
Penalized hits per avoided miss: 11.6, 4.4, 12.6, 2×10^6)
44
Adapting to L1 Size
45
Adapting to Decompression Latency
46
Adapting to Decompression Latency
47
Phase Behavior
(Chart axes: Predictor Value (K), Cache Size (MB))
48
Phase Behavior
(Chart axes: Predictor Value (K), Cache Size (MB))
49
Can We Do Better Than Adaptive?
  • Optimal is an unrealistic configuration: Always
    with no decompression penalty