Cache Pipelining with Partial Operand Knowledge
Transcript and Presenter's Notes

1
Cache Pipelining with Partial Operand Knowledge
  • Erika Gunadi and Mikko H. Lipasti
  • Department of Electrical and Computer Engineering
  • University of Wisconsin-Madison

http://www.ece.wisc.edu/pharm
2
Cache Power Consumption
  • Increasing on-chip cache size
  • Increases cache power consumption
  • Increasing clock frequency
  • Increases dynamic power
  • Lots of prior work to reduce cache power
    consumption

3
Prior Work
  • Cache subbanking, bitline segmentation [Su et al.
    1995, Ghose et al. 2001]
  • Cache decomposition [Huang et al. 2001]
  • Block buffering [Su et al. 1995]
  • Reducing leakage power
  • Drowsy caches [Flautner et al. 2002, Kim et al.
    2002]
  • Cache decay [Kaxiras et al. 2001]
  • Gated Vdd [Powell et al. 2000]

4
Cache Subbanking
  • Proposed by [Su et al. 1995]
  • Fetches only the requested subline
  • Partitions the data array vertically into several
    subbanks
  • Further studied by [Ghose et al. 2001]
  • Partitions the data array vertically and
    horizontally
  • Activates only the requested subbanks (see the
    sketch after this list)
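
A minimal C sketch of subline fetch, under assumed sizes (a 64 B block
split into eight 8 B sublines, one subline per subbank); the point is
that the subbank to activate falls directly out of the block-offset
bits of the address, so nothing extra must be computed before enabling it.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 64 B block, eight 8 B sublines, one per subbank. */
    #define BLOCK_BYTES   64
    #define SUBLINE_BYTES 8

    /* Activate only the subbank that holds the requested subline. */
    void select_subbank(uint32_t addr)
    {
        uint32_t offset  = addr % BLOCK_BYTES;      /* offset within the block */
        uint32_t subbank = offset / SUBLINE_BYTES;  /* subbank to enable */
        printf("activate subbank %u only\n", subbank);
    }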

5
Bit-sliced ALU
  • Originally proposed by [Hsu et al. 1985]
  • Slices the addition operation
  • e.g., a 32-bit addition -> four 8-bit additions
  • Avoids waiting for the full-width addition
  • Bypasses partial operand results
  • Successfully implemented in the Pentium 4's
    staggered adder (sketched below)
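
A C sketch of the idea, using the slide's 8-bit slice example (the
Pentium 4 adder actually works on 16-bit halves); slice 0 of the sum
is ready after a single slice add, which is what lets dependent work
start early.

    #include <stdint.h>
    #include <stdio.h>

    /* A 32-bit add performed as four 8-bit slice additions, with the
     * carry propagated between slices. In hardware each slice finishes
     * in its own cycle, so low slices of the sum are available early. */
    uint32_t bitsliced_add32(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0;
        unsigned carry = 0;
        for (int slice = 0; slice < 4; slice++) {
            unsigned sa = (a >> (8 * slice)) & 0xFF;
            unsigned sb = (b >> (8 * slice)) & 0xFF;
            unsigned s  = sa + sb + carry;   /* one 8-bit slice add */
            carry = s >> 8;                  /* carry into the next slice */
            sum |= (uint32_t)(s & 0xFF) << (8 * slice);
        }
        return sum;
    }

    int main(void)
    {
        /* Sanity check: matches an ordinary 32-bit add. */
        printf("%08x\n", bitsliced_add32(0x12345678u, 0x0000ff00u));
        return 0;
    }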

6
Outline
  • Motivation
  • Prior Work
  • Bit-sliced Cache
  • Experimental Results
  • Conclusion

7
Power Consumption in Cache
  • Row decoding consumes up to 40% of active power

8
Bit-sliced Cache
  • Extends the cache subbanking technique
  • Saves decoding power
  • Enables only the row decoders that are accessed
  • Serializes subarray decoding with row decoding
  • Uses low-order index bits to select the row decoder
    (see the sketch after this list)
  • Minimal changes to the subbanking technique
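
A C sketch of the serialized decode; the geometry (64 B blocks, 128
sets, 4 subarrays) is assumed for illustration and is not taken from
the paper.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 7 index bits (128 sets), the 2 low-order bits
     * selecting one of 4 subarrays; 64 B blocks. */
    #define INDEX_BITS    7
    #define SUBARRAY_BITS 2

    void serialized_decode(uint32_t addr)
    {
        uint32_t index    = (addr >> 6) & ((1u << INDEX_BITS) - 1);
        uint32_t subarray = index & ((1u << SUBARRAY_BITS) - 1); /* step 1: subarray decode */
        uint32_t row      = index >> SUBARRAY_BITS;              /* step 2: row decode, only
                                                                    in the enabled subarray */
        printf("enable subarray %u, then decode row %u\n", subarray, row);
    }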

9
Pipelining the Cache Access
  • Cache access time increases due to
  • Serializing the subarray decoder with the row
    decoder
  • Pipeline the access to hide the delay
  • Need to balance the latency of each stage
  • Choose operations for each stage carefully
  • Provides more throughput
  • An n-stage pipeline matches the throughput of a
    conventional cache with n ports

10
Pipelined-Cache Access Steps
  • Cycle 1 <Cycle 1>
  • Start subarray decoding for data and tag
  • Cycle 2
  • Activate the necessary row decoders
  • Read the tag array while waiting
  • Cycle 3 <Cycle 2>
  • Read the data array
  • Concurrently, do a partial tag comparison
  • Cycle 4
  • Compare the rest of the tag bits
  • Use the tag comparison result to select data
    (a behavioral sketch of these steps follows)
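
A behavioral C sketch of the four steps for a direct-mapped cache; the
sizes and the 8-bit partial-tag width are assumptions, and the cycle
boundaries exist only as comments.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed geometry: direct-mapped, 128 sets, 64 B blocks (tag
     * starts at bit 13); the low 8 tag bits are compared early. */
    #define SETS        128
    #define PARTIAL_TAG 8

    typedef struct { uint32_t tag; uint8_t data[64]; bool valid; } line_t;
    static line_t cache[SETS];

    bool pipelined_access(uint32_t addr, uint8_t *out)
    {
        uint32_t index = (addr >> 6) % SETS;
        uint32_t tag   = addr >> 13;
        line_t *set    = &cache[index];

        /* Cycle 1: subarray decode for data and tag
         * (folded into the indexing above). */

        /* Cycle 2: activate only the needed row decoder; read the tag. */
        uint32_t stored_tag = set->tag;

        /* Cycle 3: read the data array; concurrently compare the
         * PARTIAL_TAG low-order tag bits. */
        bool partial_hit = set->valid &&
            ((stored_tag ^ tag) & ((1u << PARTIAL_TAG) - 1)) == 0;
        uint8_t word = set->data[addr & 63];

        /* Cycle 4: compare the remaining tag bits; select data on a hit. */
        bool hit = partial_hit && stored_tag == tag;
        if (hit) *out = word;
        return hit;
    }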

11
Bit-sliced Cache
12
Bit-sliced Cache + Bit-sliced ALU
  • Optimal performance benefit
  • Cache access starts sooner
  • As soon as the first slice is available (see the
    sketch after this list)
  • Limits the number of subarrays
  • According to the number of bits per slice
  • When the bit slice is too small
  • Optimal power saving cannot be achieved
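
A C sketch of the overlap: with 64 B blocks and 8-bit slices (widths
assumed), slice 0 of the effective-address add already carries the
block offset and the two low-order index bits, so subarray decode can
start before the remaining slices finish. start_subarray_decode is a
hypothetical stand-in for the hardware trigger.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for enabling the subarray decoder. */
    static void start_subarray_decode(uint32_t subarray)
    {
        printf("subarray %u decode started early\n", subarray);
    }

    void issue_load(uint32_t base, uint32_t offset)
    {
        /* Slice 0 of the staggered add needs only the low 8 bits of
         * each operand, so it is ready after one slice add. */
        uint32_t slice0 = ((base & 0xFF) + (offset & 0xFF)) & 0xFF;

        /* With 64 B blocks, bits [7:6] are the low-order index bits
         * that select the subarray (see slide 8). */
        start_subarray_decode((slice0 >> 6) & 0x3);

        /* Slices 1-3 complete later and feed the row decoder. */
    }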

13
Pipelining with Bit-sliced Cache
(Pipeline diagrams for the example load lw R1, 0(R3), comparing a
pipelined execution stage with a pipelined cache, a bit-sliced
execution stage with a pipelined cache, and a bit-sliced execution
stage with a bit-sliced cache.)
14
Cache Model Simulation
  • Estimates energy consumption and cache latency
  • Uses a modified version of CACTI 3.0
  • Parameters: Ntbl, Ndbl, Ntwl, Ndwl
  • Enumerates all possible configurations
  • Chooses the one with the best weighted value
    (cycle time and energy consumption), as sketched
    after this list
  • Simulates
  • Various cache sizes (8K-512K), 64 B blocks
  • DM, 2-way, 4-way, and 8-way
  • Uses 0.18 µm technology
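
A C sketch of the selection step; the weighted-sum cost model and the
field names are assumptions about how such a search might look, not
CACTI's actual internals.

    #include <float.h>

    /* One candidate organization: CACTI's array-partitioning
     * parameters plus its modeled cycle time and per-access energy. */
    typedef struct {
        int ntwl, ntbl, ndwl, ndbl;
        double cycle_ns;
        double energy_nj;
    } cfg_t;

    /* Enumerate candidates; keep the one with the best weighted
     * combination of cycle time and energy consumption. */
    cfg_t pick_best(const cfg_t *cfgs, int n, double w_time, double w_energy)
    {
        cfg_t best = cfgs[0];
        double best_cost = DBL_MAX;
        for (int i = 0; i < n; i++) {
            double cost = w_time * cfgs[i].cycle_ns
                        + w_energy * cfgs[i].energy_nj;
            if (cost < best_cost) { best_cost = cost; best = cfgs[i]; }
        }
        return best;
    }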

15
Processor Simulation
  • Estimates performance benefit
  • Uses a heavily modified SimpleScalar 3.0
  • Supports a bit-sliced execution stage
  • Supports speculative slice execution
  • Benchmarks
  • Eight SPEC2000 integer benchmarks
  • Full reference input set
  • Fast-forward 500M instructions, simulate 100M

16
Machine Configuration
  • 4-wide fetch, issue, commit
  • 128-entry ROB
  • 32-entry scheduler
  • 20-stage pipeline
  • 64K-entry gshare branch predictor
  • L1 I-Cache: 32KB, 2-way, 64B blocks
  • L1 D-Cache: 8KB, 4-way, 64B blocks
  • L2 Cache: 512KB, 8-way, 128B blocks

17
Energy Consumption per Access
18
Cycle Time Comparison
19
Speedup Comparison
20
Speedup Comparison
21
Conclusion
  • Bit-sliced cache
  • Achieves significant power reduction
  • Without adding much complexity
  • Adds some delay to access latency
  • Pipelined bit-sliced cache
  • Reduces cycle time
  • Provides more bandwidth
  • Measurable speedup (with a bit-sliced ALU)

22
Questions?
  • Thank you