1

Effective Hardware-Based Two-Way Loop Cache for
High Performance Low Power Processors
  • Tim Anderson and Sanjive Agarwala
  • ICCD '00

2
Overview
  • Introduction
  • Loop buffer placement in the pipeline
  • Benchmarks used for the loop buffer
    configurations and the experimental setup
  • Experimental results
  • Conclusions

3
Introduction
  • The increasing level of processor integration,
    coupled with higher clock frequencies in
    battery-operated devices, is increasing power
    consumption.
  • Recent studies have concentrated on methods to
    reduce the amount of energy needed to supply
    instructions to the DSP core.
  • These target the instruction RAM's power
    dissipation through the addition of a small
    cache or buffer.
  • In this paper, we investigate a method to reduce
    the power consumption of the instruction supply
    pipeline up to the point where the instructions
    interface with the data computation of the
    processor.

4
Introduction (cont'd)
  • TMS320C62xx
  • We propose to insert an instruction buffer in the
    C62x pipeline between the instruction
    dispatch/decode logic and the core data
    processing engine.
  • The allocation strategy for the buffer attempts
    to capture loop kernels in the instruction
    buffer and allows nested loops.

5
Loop Buffer Operation
  • The instruction supply portion of the C62x
    pipeline is divided into six stages
  • PG performs the address calculations for the
    next packet address
  • PS sends the fetch packet address to the memory
    system
  • PW waits for the instruction RAM to be read
  • PR receives the fetch packet data from the
    instruction RAM
  • DS dispatches the instructions to the appropriate
    functional units
  • DC decodes the functional unit instructions
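A rough way to picture the flow: a single fetch packet
advances one stage per cycle. The sketch below is not
from the paper — stage names come from the slide, while
the hazard-free timing and the post-DC "EXEC" label are
illustrative assumptions.

```python
# Stage names taken from the slide; one fetch packet
# advances exactly one stage per cycle in this
# simplified, stall-free model.
STAGES = ["PG", "PS", "PW", "PR", "DS", "DC"]

def trace_packet(n_cycles):
    """Stage occupied by a single fetch packet on each cycle."""
    return [STAGES[c] if c < len(STAGES) else "EXEC"  # "EXEC" is assumed
            for c in range(n_cycles)]

print(trace_packet(7))
# → ['PG', 'PS', 'PW', 'PR', 'DS', 'DC', 'EXEC']
```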

6
Loop Buffer Operation
  • The loop buffer consists of two major parts, one
    which resides in the PS stage, and one which
    resides in the DC stage.
  • The hardware in the PS stage consists of a
    candidate address register, a counter, a state
    machine, a base address register, a number-valid
    register, and a comparator
  • The hardware in the DC stage consists of the
    decoded instruction storage elements

7
Loop Buffer Operation
  • PS state machine
  • Three states: IDLE, FILL, and RUN
  • FILL when the loop buffer's state elements need
    to capture the decoded instruction information
    into the loop buffer.
  • RUN when the loop buffer's state elements should
    supply instruction information to the functional
    units

8
Loop Buffer Operation
  • When a branch occurs, the target of the branch is
    compared with the address in the base address
    registers.
  • If there is a match
  • a signal is sent to the instruction RAM to
    indicate that the request should be canceled.
  • The state machine is placed in the RUN state.
  • In subsequent cycles, the entry index will be
    incremented until another branch occurs, or the
    entry index exceeds the value stored in the
    number valid register

9
Loop Buffer Operation
  • If the branch target does not match either of the
    stored base addresses, then the branch target is
    also compared to the candidate address.
  • If the candidate address matches the branch
    target address, the loop buffer allocates the
    branch target to the least-recently-used way and
    the FSM enters the FILL state.
  • A write command is also issued, which will write
    the instruction data into the loop buffer storage
    elements when the instruction data is available
    in the DC stage.
  • In subsequent cycles, the entry number is
    incremented and a write command is generated for
    the loop storage elements until the next branch
    is detected.
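The hit/allocate decision described on the last three
slides can be sketched as a small state machine. This is
a hedged Python sketch: the two base-address registers,
the candidate address register, LRU way replacement, and
the IDLE/FILL/RUN states come from the slides; the method
names and the exact point at which the candidate register
is loaded are illustrative assumptions.

```python
class LoopBuffer:
    """Simplified model of the two-way loop buffer control logic."""

    def __init__(self):
        self.state = "IDLE"
        self.base = [None, None]   # per-way base address registers
        self.lru = 0               # index of the least-recently-used way
        self.candidate = None      # candidate address register
        self.entry = 0             # current entry index

    def on_branch(self, target):
        """Handle a taken branch; returns the action taken."""
        if target in self.base:
            # Hit: cancel the instruction RAM request and replay
            # decoded instructions out of the loop buffer.
            self.lru = self.base.index(target) ^ 1
            self.state = "RUN"
            self.entry = 0
            return "cancel_ram_request"
        if target == self.candidate:
            # Target seen before: allocate the LRU way and start
            # capturing decoded instructions into the buffer.
            self.base[self.lru] = target
            self.lru ^= 1
            self.state = "FILL"
            self.entry = 0
            return "fill"
        # Miss: remember the target as a candidate (assumed policy).
        self.candidate = target
        self.state = "IDLE"
        return "miss"

lb = LoopBuffer()
print(lb.on_branch(0x100))  # miss: becomes the candidate
print(lb.on_branch(0x100))  # fill: allocated to the LRU way
print(lb.on_branch(0x100))  # cancel_ram_request: served from buffer
```

In RUN or FILL, the entry index would then increment each
cycle until another branch occurs or the index exceeds
the number-valid register, as the slides describe.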

10
Benchmarks and Simulation
  • Six DSP applications were used to evaluate the
    performance of the various loop buffer
    configurations.
  • Compiled with the TMS320C6x optimizing C
    compiler: GSM Coder, G.723.1 Speech Coder, and
    Multichannel codec
  • Hand-optimized assembly code: ADSL error
    correction, ADSL time equalization, and ADSL
    steady-state
  • The simulation model used for the loop buffer
    experiments was an in-house C62x instruction set
    simulator.

11
Benchmarks and Simulation
  • We evaluated both 1-way and 2-way loop buffers
    with 4, 6, 8, 12, 16, and 32 entries
  • We then evaluated the number of instruction RAM
    accesses per clock, instructions decoded per
    clock, and percent reduction in those access
    rates.
  • A fictitious processor implementation
  • The instruction RAM consumes the same amount of
    power as the CPU core
  • The CPU dispatch/decode logic consumes 50% of
    the power of the CPU core.
  • The loop buffer consumes 5% of the instruction
    RAM power.
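This fictitious power model lends itself to a quick
back-of-the-envelope check. The weights (core = 1.0,
RAM = 1.0, dispatch/decode = 0.5, loop buffer = 0.05)
come from the slide; the example activity factors plugged
in below are placeholders, not the paper's measured
values.

```python
def total_power(ram_activity, decode_activity, has_loop_buffer=True):
    """Relative power in units where the CPU core consumes 1.0.

    ram_activity / decode_activity: fraction of baseline accesses
    that still reach the instruction RAM / dispatch-decode logic.
    """
    core, ram, decode = 1.0, 1.0, 0.5       # weights from the slide
    loop_buf = 0.05 if has_loop_buffer else 0.0
    return core + ram * ram_activity + decode * decode_activity + loop_buf

baseline = total_power(1.0, 1.0, has_loop_buffer=False)
print(baseline)                          # 2.5
# Illustrative 80% activity reduction in both RAM and decode:
print(round(total_power(0.2, 0.2), 2))   # 1.35, ~46% below baseline
```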

12
Results
13
Results (cont'd)
14
Results (cont'd)
15
Results (cont'd)
16
Results (cont'd)
17
Results (cont'd)
18
Conclusion
  • A method for reducing the power consumption of
    processors has been shown to reduce the
    instruction fetch and decode activity by up to
    83%, and the instruction RAM activity by 82%.
  • We have shown that overall instruction RAM plus
    CPU power may be reduced by 63%.