Cache Pipelining with Partial Operand Knowledge
Transcript and Presenter's Notes

1
Cache Pipelining with Partial Operand Knowledge
  • Erika Gunadi and Mikko H. Lipasti
  • Department of Electrical and Computer Engineering
  • University of Wisconsin-Madison

http://www.ece.wisc.edu/pharm
2
Cache Power Consumption
  • Increasing on-chip cache size
  • Increases cache power consumption
  • Increasing clock frequency
  • Increases dynamic power
  • Lots of prior work to reduce cache power
    consumption

3
Prior Work
  • Cache subbanking, bitline segmentation [Su et al.
    1995, Ghose et al. 2001]
  • Cache decomposition [Huang et al. 2001]
  • Block buffering [Su et al. 1995]
  • Reducing leakage power
  • Drowsy caches [Flautner et al. 2002, Kim et al.
    2002]
  • Cache decay [Kaxiras et al. 2001]
  • Gated Vdd [Powell et al. 2000]

4
Cache Subbanking
  • Proposed by [Su et al. 1995]
  • Fetches only the requested subline
  • Partitions the data array vertically into several
    subbanks
  • Further studied by [Ghose et al. 2001]
  • Partitions the data array vertically and
    horizontally
  • Activates only the requested subbanks (see the
    sketch after this list)
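
A minimal C sketch of subline fetch, under assumed sizes (a 64 B block
split into eight 8 B sublines, one subline per subbank); the point is
that the subbank to activate falls directly out of the block-offset
bits of the address, so nothing extra must be computed before enabling it.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 64 B block, eight 8 B sublines, one per subbank. */
    #define BLOCK_BYTES   64
    #define SUBLINE_BYTES 8

    /* Activate only the subbank that holds the requested subline. */
    void select_subbank(uint32_t addr)
    {
        uint32_t offset  = addr % BLOCK_BYTES;      /* offset within the block */
        uint32_t subbank = offset / SUBLINE_BYTES;  /* subbank to enable */
        printf("activate subbank %u only\n", subbank);
    }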

5
Bit-sliced ALU
  • Originally proposed by [Hsu et al. 1985]
  • Slices the addition operation
  • e.g., a 32-bit addition -> four 8-bit additions
  • Avoids waiting for the full-width addition
  • Bypasses partial operand results
  • Successfully implemented in the Pentium 4's
    staggered adder (sketched below)
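
A C sketch of the idea, using the slide's 8-bit slice example (the
Pentium 4 adder actually works on 16-bit halves); slice 0 of the sum
is ready after a single slice add, which is what lets dependent work
start early.

    #include <stdint.h>
    #include <stdio.h>

    /* A 32-bit add performed as four 8-bit slice additions, with the
     * carry propagated between slices. In hardware each slice finishes
     * in its own cycle, so low slices of the sum are available early. */
    uint32_t bitsliced_add32(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0;
        unsigned carry = 0;
        for (int slice = 0; slice < 4; slice++) {
            unsigned sa = (a >> (8 * slice)) & 0xFF;
            unsigned sb = (b >> (8 * slice)) & 0xFF;
            unsigned s  = sa + sb + carry;   /* one 8-bit slice add */
            carry = s >> 8;                  /* carry into the next slice */
            sum |= (uint32_t)(s & 0xFF) << (8 * slice);
        }
        return sum;
    }

    int main(void)
    {
        /* Sanity check: matches an ordinary 32-bit add. */
        printf("%08x\n", bitsliced_add32(0x12345678u, 0x0000ff00u));
        return 0;
    }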

6
Outline
  • Motivation
  • Prior Work
  • Bit-sliced Cache
  • Experimental Results
  • Conclusion

7
Power Consumption in Cache
  • Row decoding consumes up to 40% of active power

8
Bit-sliced Cache
  • Extends the cache subbanking technique
  • Saves decoding power
  • Enables only the row decoders that are accessed
  • Serializes subarray decoding with row decoding
  • Uses low-order index bits to select the row decoder
    (see the sketch after this list)
  • Minimal changes to the subbanking technique
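
A C sketch of the serialized decode; the geometry (64 B blocks, 128
sets, 4 subarrays) is assumed for illustration and is not taken from
the paper.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 7 index bits (128 sets), the 2 low-order bits
     * selecting one of 4 subarrays; 64 B blocks. */
    #define INDEX_BITS    7
    #define SUBARRAY_BITS 2

    void serialized_decode(uint32_t addr)
    {
        uint32_t index    = (addr >> 6) & ((1u << INDEX_BITS) - 1);
        uint32_t subarray = index & ((1u << SUBARRAY_BITS) - 1); /* step 1: subarray decode */
        uint32_t row      = index >> SUBARRAY_BITS;              /* step 2: row decode, only
                                                                    in the enabled subarray */
        printf("enable subarray %u, then decode row %u\n", subarray, row);
    }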

9
Pipelining the Cache Access
  • Cache access time increases due to
  • Serializing the subarray decoder with the row
    decoder
  • Pipeline the access to hide the delay
  • Need to balance the latency of each stage
  • Choose operations for each stage carefully
  • Provides more throughput
  • An n-stage pipeline matches the throughput of a
    conventional cache with n ports

10
Pipelined-Cache Access Steps
  • Cycle 1 <Cycle 1>
  • Start subarray decoding for data and tag
  • Cycle 2
  • Activate the necessary row decoders
  • Read the tag array while waiting
  • Cycle 3 <Cycle 2>
  • Read the data array
  • Concurrently, do a partial tag comparison
  • Cycle 4
  • Compare the rest of the tag bits
  • Use the tag comparison result to select data
    (a behavioral sketch of these steps follows)
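
A behavioral C sketch of the four steps for a direct-mapped cache; the
sizes and the 8-bit partial-tag width are assumptions, and the cycle
boundaries exist only as comments.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed geometry: direct-mapped, 128 sets, 64 B blocks (tag
     * starts at bit 13); the low 8 tag bits are compared early. */
    #define SETS        128
    #define PARTIAL_TAG 8

    typedef struct { uint32_t tag; uint8_t data[64]; bool valid; } line_t;
    static line_t cache[SETS];

    bool pipelined_access(uint32_t addr, uint8_t *out)
    {
        uint32_t index = (addr >> 6) % SETS;
        uint32_t tag   = addr >> 13;
        line_t *set    = &cache[index];

        /* Cycle 1: subarray decode for data and tag
         * (folded into the indexing above). */

        /* Cycle 2: activate only the needed row decoder; read the tag. */
        uint32_t stored_tag = set->tag;

        /* Cycle 3: read the data array; concurrently compare the
         * PARTIAL_TAG low-order tag bits. */
        bool partial_hit = set->valid &&
            ((stored_tag ^ tag) & ((1u << PARTIAL_TAG) - 1)) == 0;
        uint8_t word = set->data[addr & 63];

        /* Cycle 4: compare the remaining tag bits; select data on a hit. */
        bool hit = partial_hit && stored_tag == tag;
        if (hit) *out = word;
        return hit;
    }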

11
Bit-sliced Cache
12
Bit-sliced Cache + Bit-sliced ALU
  • Optimal performance benefit
  • Cache access starts sooner
  • As soon as the first slice is available (see the
    sketch after this list)
  • Limits the number of subarrays
  • According to the number of bits per slice
  • When the bit slice is too small
  • Optimal power saving cannot be achieved
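
A C sketch of the overlap: with 64 B blocks and 8-bit slices (widths
assumed), slice 0 of the effective-address add already carries the
block offset and the two low-order index bits, so subarray decode can
start before the remaining slices finish. start_subarray_decode is a
hypothetical stand-in for the hardware trigger.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for enabling the subarray decoder. */
    static void start_subarray_decode(uint32_t subarray)
    {
        printf("subarray %u decode started early\n", subarray);
    }

    void issue_load(uint32_t base, uint32_t offset)
    {
        /* Slice 0 of the staggered add needs only the low 8 bits of
         * each operand, so it is ready after one slice add. */
        uint32_t slice0 = ((base & 0xFF) + (offset & 0xFF)) & 0xFF;

        /* With 64 B blocks, bits [7:6] are the low-order index bits
         * that select the subarray (see slide 8). */
        start_subarray_decode((slice0 >> 6) & 0x3);

        /* Slices 1-3 complete later and feed the row decoder. */
    }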

13
Pipelining with Bit-sliced Cache
(Pipeline diagrams for the example load lw R1, 0(R3), comparing a
pipelined execution stage with a pipelined cache, a bit-sliced
execution stage with a pipelined cache, and a bit-sliced execution
stage with a bit-sliced cache.)
14
Cache Model Simulation
  • Estimates energy consumption and cache latency
  • Uses a modified version of CACTI 3.0
  • Parameters: Ntbl, Ndbl, Ntwl, Ndwl
  • Enumerates all possible configurations
  • Chooses the one with the best weighted value
    (cycle time and energy consumption), as sketched
    after this list
  • Simulates
  • Various cache sizes (8K-512K), 64 B blocks
  • DM, 2-way, 4-way, and 8-way
  • Uses 0.18 µm technology
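
A C sketch of the selection step; the weighted-sum cost model and the
field names are assumptions about how such a search might look, not
CACTI's actual internals.

    #include <float.h>

    /* One candidate organization: CACTI's array-partitioning
     * parameters plus its modeled cycle time and per-access energy. */
    typedef struct {
        int ntwl, ntbl, ndwl, ndbl;
        double cycle_ns;
        double energy_nj;
    } cfg_t;

    /* Enumerate candidates; keep the one with the best weighted
     * combination of cycle time and energy consumption. */
    cfg_t pick_best(const cfg_t *cfgs, int n, double w_time, double w_energy)
    {
        cfg_t best = cfgs[0];
        double best_cost = DBL_MAX;
        for (int i = 0; i < n; i++) {
            double cost = w_time * cfgs[i].cycle_ns
                        + w_energy * cfgs[i].energy_nj;
            if (cost < best_cost) { best_cost = cost; best = cfgs[i]; }
        }
        return best;
    }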

15
Processor Simulation
  • Estimates performance benefit
  • Uses a heavily modified SimpleScalar 3.0
  • Supports a bit-sliced execution stage
  • Supports speculative slice execution
  • Benchmarks
  • Eight SPEC2000 integer benchmarks
  • Full reference input set
  • Fast-forward 500M instructions, simulate 100M

16
Machine Configuration
  • 4-wide fetch, issue, commit
  • 128-entry ROB
  • 32-entry scheduler
  • 20-stage pipeline
  • 64K-entry gshare branch predictor
  • L1 I-Cache: 32KB, 2-way, 64B blocks
  • L1 D-Cache: 8KB, 4-way, 64B blocks
  • L2 Cache: 512KB, 8-way, 128B blocks

17
Energy Consumption per Access
18
Cycle Time Comparison
19
Speedup Comparison
20
Speedup Comparison
21
Conclusion
  • Bit-sliced cache
  • Achieves significant power reduction
  • Without adding much complexity
  • Adds some delay to access latency
  • Pipelined bit-sliced cache
  • Reduces cycle time
  • Provides more bandwidth
  • Measurable speedup (with a bit-sliced ALU)

22
Questions?
  • Thank you