1

Effective Hardware-Based Two-Way Loop Cache for
High Performance Low Power Processors
  • Tim Anderson and Sanjive Agarwala
  • ICCD '00

2
Overview
  • Introduction
  • Loop buffer placement in the pipeline
  • Benchmarks used for the loop buffer
    configurations and the experimental setup
  • Experimental results
  • Conclusions

3
Introduction
  • The increasing level of processor integration,
    coupled with higher clock frequencies in
    battery-operated devices, is increasing power
    consumption.
  • Recent studies have concentrated on methods to
    reduce the amount of energy needed to supply
    instructions to the DSP core.
  • These target the instruction RAM's power
    dissipation through the addition of a small
    cache or buffer.
  • In this paper, we investigate a method to reduce
    the power consumption of the instruction supply
    pipeline up to the point where the instructions
    interface with the data computation of the
    processor.

4
Introduction (cont'd)
  • TMS320C62xx
  • We propose to insert an instruction buffer in the
    C62x pipeline between the instruction
    dispatch/decode logic and the core data
    processing engine.
  • The allocation strategy for the buffer attempts
    to capture loop kernels in the instruction
    buffer and allows nested loops.

5
Loop Buffer Operation
  • The instruction supply portion of the C62x
    pipeline is divided into six stages
  • PG performs the address calculations for the
    next packet address
  • PS sends the fetch packet address to the memory
    system
  • PW waits for the instruction RAM to be read
  • PR receives the fetch packet data from the
    instruction RAM
  • DS dispatches the instructions to the appropriate
    functional units
  • DC decodes the functional unit instructions
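A rough way to picture the flow: a single fetch packet
advances one stage per cycle. The sketch below is not
from the paper — stage names come from the slide, while
the hazard-free timing and the post-DC "EXEC" label are
illustrative assumptions.

```python
# Stage names taken from the slide; one fetch packet
# advances exactly one stage per cycle in this
# simplified, stall-free model.
STAGES = ["PG", "PS", "PW", "PR", "DS", "DC"]

def trace_packet(n_cycles):
    """Stage occupied by a single fetch packet on each cycle."""
    return [STAGES[c] if c < len(STAGES) else "EXEC"  # "EXEC" is assumed
            for c in range(n_cycles)]

print(trace_packet(7))
# → ['PG', 'PS', 'PW', 'PR', 'DS', 'DC', 'EXEC']
```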

6
Loop Buffer Operation
  • The loop buffer consists of two major parts, one
    which resides in the PS stage, and one which
    resides in the DC stage.
  • The hardware in the PS stage consists of a
    candidate address register, a counter, a state
    machine, a base address register, a number-valid
    register, and a comparator
  • The hardware in the DC stage consists of the
    decoded instruction storage elements

7
Loop Buffer Operation
  • PS state machine
  • Three states: IDLE, FILL, and RUN
  • FILL when the loop buffer's state elements need
    to capture the decoded instruction information
    into the loop buffer.
  • RUN when the loop buffer's state elements should
    supply instruction information to the functional
    units

8
Loop Buffer Operation
  • When a branch occurs, the target of the branch is
    compared with the address in the base address
    registers.
  • If there is a match
  • a signal is sent to the instruction RAM to
    indicate that the request should be canceled.
  • The state machine is placed in the RUN state.
  • In subsequent cycles, the entry index will be
    incremented until another branch occurs, or the
    entry index exceeds the value stored in the
    number valid register

9
Loop Buffer Operation
  • If the branch target does not match either of the
    stored base addresses, then the branch target is
    also compared to the candidate address.
  • If the candidate address matches the branch
    target address, the loop buffer allocates the
    branch target to the least-recently-used way and
    the FSM enters the FILL state.
  • A write command is also issued, which will write
    the instruction data into the loop buffer storage
    elements when the instruction data is available
    in the DC stage.
  • In subsequent cycles, the entry number is
    incremented and a write command is generated for
    the loop storage elements until the next branch
    is detected.
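The hit/allocate decision described on the last three
slides can be sketched as a small state machine. This is
a hedged Python sketch: the two base-address registers,
the candidate address register, LRU way replacement, and
the IDLE/FILL/RUN states come from the slides; the method
names and the exact point at which the candidate register
is loaded are illustrative assumptions.

```python
class LoopBuffer:
    """Simplified model of the two-way loop buffer control logic."""

    def __init__(self):
        self.state = "IDLE"
        self.base = [None, None]   # per-way base address registers
        self.lru = 0               # index of the least-recently-used way
        self.candidate = None      # candidate address register
        self.entry = 0             # current entry index

    def on_branch(self, target):
        """Handle a taken branch; returns the action taken."""
        if target in self.base:
            # Hit: cancel the instruction RAM request and replay
            # decoded instructions out of the loop buffer.
            self.lru = self.base.index(target) ^ 1
            self.state = "RUN"
            self.entry = 0
            return "cancel_ram_request"
        if target == self.candidate:
            # Target seen before: allocate the LRU way and start
            # capturing decoded instructions into the buffer.
            self.base[self.lru] = target
            self.lru ^= 1
            self.state = "FILL"
            self.entry = 0
            return "fill"
        # Miss: remember the target as a candidate (assumed policy).
        self.candidate = target
        self.state = "IDLE"
        return "miss"

lb = LoopBuffer()
print(lb.on_branch(0x100))  # miss: becomes the candidate
print(lb.on_branch(0x100))  # fill: allocated to the LRU way
print(lb.on_branch(0x100))  # cancel_ram_request: served from buffer
```

In RUN or FILL, the entry index would then increment each
cycle until another branch occurs or the index exceeds
the number-valid register, as the slides describe.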

10
Benchmarks and Simulation
  • Six DSP applications were used to evaluate the
    performance of the various loop buffer
    configurations.
  • Compiled with the TMS320C6x optimizing C
    compiler: GSM Coder, G.723.1 Speech Coder, and
    Multichannel codec
  • Hand-optimized assembly code: ADSL error
    correction, ADSL time equalization, and ADSL
    steady-state
  • The simulation model used for the loop buffer
    experiments was an in-house C62x instruction set
    simulator.

11
Benchmarks and Simulation
  • We evaluated both 1-way and 2-way loop buffers
    with 4, 6, 8, 12, 16, and 32 entries
  • We then evaluated the number of instruction RAM
    accesses per clock, instructions decoded per
    clock, and percent reduction in those access
    rates.
  • A fictitious processor implementation
  • The instruction RAM consumes the same amount of
    power as the CPU core
  • The CPU dispatch/decode logic consumes 50% of
    the power of the CPU core.
  • The loop buffer consumes 5% of the instruction
    RAM power.
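This fictitious power model lends itself to a quick
back-of-the-envelope check. The weights (core = 1.0,
RAM = 1.0, dispatch/decode = 0.5, loop buffer = 0.05)
come from the slide; the example activity factors plugged
in below are placeholders, not the paper's measured
values.

```python
def total_power(ram_activity, decode_activity, has_loop_buffer=True):
    """Relative power in units where the CPU core consumes 1.0.

    ram_activity / decode_activity: fraction of baseline accesses
    that still reach the instruction RAM / dispatch-decode logic.
    """
    core, ram, decode = 1.0, 1.0, 0.5       # weights from the slide
    loop_buf = 0.05 if has_loop_buffer else 0.0
    return core + ram * ram_activity + decode * decode_activity + loop_buf

baseline = total_power(1.0, 1.0, has_loop_buffer=False)
print(baseline)                          # 2.5
# Illustrative 80% activity reduction in both RAM and decode:
print(round(total_power(0.2, 0.2), 2))   # 1.35, ~46% below baseline
```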

12
Results
13
Results (cont'd)
14
Results (cont'd)
15
Results (cont'd)
16
Results (cont'd)
17
Results (cont'd)
18
Conclusion
  • A method for reducing the power consumption of
    processors has been shown to reduce the
    instruction fetch and decode activity by up to
    83%, and the instruction RAM activity by 82%.
  • We have shown that overall instruction RAM plus
    CPU power may be reduced by 63%.