1
Self-Improving Configurable IC Platforms
  • Frank Vahid
  • Associate Professor
  • Dept. of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems at UC Irvine
  • http://www.cs.ucr.edu/vahid
  • Co-PI: Walid Najjar, Professor, CSE, UCR

2
Goal: Platform Self-Tunes to Executing Application
  • Download standard binary
  • Platform adjusts to executing application
  • Result: better speed and lower energy
  • Why and How?

3
Platforms
  • Pre-designed programmable platforms
  • Reduce NRE cost, time-to-market, and risk
  • Platform designer amortizes design cost over
    large volumes
  • Many (if not most) will include FPGA
  • Today: Triscend, Altera, Xilinx, Atmel
  • More sure to come
  • As FPGA vendors license to SoC makers

Sample Platform: processor, cache, memory, FPGA, etc.
Modern IC costs are feasible mostly at very high volumes
4
Hardware/Software Partitioning Improves Speed and
Energy
  • But requires partitioning CAD tool
  • O.K. in some flows
  • In mainstream software flows, hard to integrate

[Figure: standard software tool flow with a hw/sw partitioner inserted]
5
Idea: Perform Partitioning Dynamically (and hence Transparently)
  • Add components on-chip
  • Profile
  • Decompile frequent loops
  • Optimize
  • Synthesize
  • Place and route onto FPGA
  • Update Sw to call FPGA
  • Transparent
  • No impact on tool flow
  • Dynamic software optimization, software binary
    updating, and dynamic binary translation are
    proven technologies
  • But how can you profile, decompile, optimize,
    synthesize, and place-and-route, on-chip?

[Figure: processor with L1 cache and memory, an FPGA, and a Dynamic Partitioning Module containing Profiler, Explorer, and DAG & LC components]
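The flow above (profile, decompile, synthesize, place and route, update software) can be sketched as a greedy loop-selection routine. This is a hypothetical illustration, not the actual on-chip implementation; all names and numbers are made up.

```python
# Hypothetical sketch of the dynamic partitioning flow; each stage name
# mirrors a slide bullet, and the implementations are simple stand-ins.

def profile(binary):
    """Return loops sorted by execution frequency (stand-in data)."""
    return sorted(binary["loops"], key=lambda l: l["count"], reverse=True)

def partition_dynamically(binary, fpga_capacity):
    """Greedily move the most frequent loops that fit onto the FPGA."""
    moved, used = [], 0
    for loop in profile(binary):
        size = loop["size"]              # area estimate after decompile/synthesis
        if used + size <= fpga_capacity:
            moved.append(loop["name"])   # place-and-route onto the FPGA
            used += size
            loop["in_hw"] = True         # update software to call the FPGA
    return moved

app = {"loops": [
    {"name": "inner_dct", "count": 90_000, "size": 40, "in_hw": False},
    {"name": "crc",       "count": 50_000, "size": 70, "in_hw": False},
    {"name": "init",      "count": 10,     "size": 10, "in_hw": False},
]}
print(partition_dynamically(app, fpga_capacity=100))  # → ['inner_dct', 'init']
```

The greedy most-frequent-first order reflects the observation on the next slide that a few small loops dominate execution time.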
6
Dynamic Partitioning Requires Lean Tools
  • How can you run Synopsys/Cadence/Xilinx tools
    on-chip, when they currently run on powerful
    workstations?
  • Key: our tools need only be good enough to
    speed up critical loops
  • Most time is spent in small loops (e.g.,
    MediaBench, NetBench, EEMBC)
  • Created ultra-lean versions of the tools
  • Quality not necessarily as good, but good enough
  • Runs on a 60 MHz ARM 7

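As a toy stand-in for the profiling step, the sketch below finds the most frequent loop by counting backward-branch targets in a branch trace. The real profiler is hardware and works nonintrusively; this trace-based version and its addresses are illustrative only.

```python
# Toy frequent-loop detection: a backward branch (target below the branch PC)
# marks a loop back-edge, so tallying backward-branch targets ranks loop heads.
from collections import Counter

def frequent_loops(branch_trace, top=1):
    """branch_trace: (pc, target) pairs; returns (loop_head, count) pairs."""
    counts = Counter(t for pc, t in branch_trace if t < pc)  # backward only
    return counts.most_common(top)

trace = [(0x120, 0x100)] * 500 + [(0x200, 0x300)] * 3 + [(0x320, 0x300)] * 40
print(frequent_loops(trace))  # → [(256, 500)], i.e. loop head 0x100, 500 iterations
```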
7
Dynamic Hw/Sw Partitioning Tool Chain
We've developed efficient profiler hardware
We're continuing to extend these tools to handle more benchmarks
[Figure: tool chain mapped onto the on-chip architecture: processor, L1 cache, memory, profiler, explorer, DAG & LC, partitioner, FPGA]
Architecture targeted for loop speedup, simple place and route
8
Dynamic Hw/Sw Partitioning Results
9
Dynamic Hw/Sw Partitioning Results
  • Powerstone, NetBench, and EEMBC examples; only
    the single most frequent loop was partitioned
  • Average speedup very close to ideal speedup of
    2.4
  • Not much left on the table in these examples
  • Dynamically speeding up inner loops on FPGAs is
    feasible using on-chip tools
  • ICCAD'02 (Stitt/Vahid): binary-level
    partitioning in general is very effective
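The "ideal speedup" figure can be read through Amdahl's law: if the most frequent loop takes fraction f of execution time and hardware makes it essentially free, whole-program speedup is bounded by 1/(1-f). The fraction below is chosen only to reproduce the 2.4 figure, not taken from the benchmarks.

```python
# Amdahl's-law view of the ideal speedup above.

def ideal_speedup(loop_fraction):
    """Bound when the loop's hardware time goes to zero."""
    return 1.0 / (1.0 - loop_fraction)

def actual_speedup(loop_fraction, loop_speedup):
    """Remaining software time (1-f) plus hardware loop time f/s."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)

f = 7.0 / 12.0  # illustrative: a loop taking ~58% of time gives ideal 2.4
print(round(ideal_speedup(f), 1))         # → 2.4
print(round(actual_speedup(f, 10.0), 2))  # → 2.11, near the bound at finite s
```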

10
Configurable Cache: Why?
  • ARM920T: caches consume half of total processor
    system power (Segars '01)
  • MCORE: unified cache consumes half of total
    processor system power (Lee/Moyer/Arends '99)

11
Best Cache for Embedded Systems?
  • Diversity of associativity, line size, total size

12
Cache Design Dilemmas
  • Associativity
  • Low: low power, good performance for many
    programs
  • High: better performance on more programs
  • Total size
  • Small: lower power if working set is small (less
    area)
  • Big: better performance/power if working set is
    large
  • Line size
  • Small: better when spatial locality is poor
  • Big: better when spatial locality is good
  • Most caches are a compromise for many programs
  • Work best on average
  • But embedded systems run one/few programs
  • Want the best cache for that one program

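The line-size tradeoff above can be seen with a tiny direct-mapped cache model (parameters illustrative, not from the slides): larger lines cut misses for a sequential stream at no extra traffic, but inflate fetched bytes, and hence energy, for a strided stream with poor spatial locality.

```python
# Toy direct-mapped cache: count misses and the bytes fetched from memory
# (misses * line size), a rough proxy for memory-side energy.

def misses_and_traffic(addresses, line_size, num_lines=64):
    tags = [None] * num_lines            # one block tag per set
    miss = 0
    for a in addresses:
        block = a // line_size
        s = block % num_lines
        if tags[s] != block:
            tags[s] = block
            miss += 1
    return miss, miss * line_size        # (miss count, bytes fetched)

sequential = list(range(4096))                # good spatial locality
strided    = list(range(0, 4096 * 64, 64))    # poor spatial locality (64B stride)

for ls in (16, 64):
    print("line", ls,
          "sequential", misses_and_traffic(sequential, ls),
          "strided", misses_and_traffic(strided, ls))
```

For the sequential stream, 64-byte lines give 4x fewer misses at equal traffic; for the strided stream they give no fewer misses but 4x the traffic.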
13
Solution to the Cache Design Dilemma
  • Configurable cache
  • Design physical cache that can be reconfigured
  • 1-way, 2-way, or 4-way
  • Way concatenation: new technique, ISCA'03
    (Zhang/Vahid/Najjar)
  • Four 2K ways, plus concatenation logic
  • 8K, 4K, or 2K byte total size
  • Way shutdown, ISCA'03
  • Gates Vdd, saves both dynamic and static power,
    some performance overhead (~5%)
  • 16, 32, or 64 byte line size
  • Variable line fetch size, ISVLSI'03
  • Physical 16-byte line; one, two, or four
    physical line fetches
  • Note: this is a single physical cache, not a
    synthesizable core
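Assuming the numbers on this slide (four 2K physical ways, a 16-byte physical line, fetch factors of 1, 2, or 4), the three knobs map to effective cache geometry as sketched below. The function and its encoding are hypothetical; the real mechanism is circuit-level (way concatenation and Vdd gating), not software.

```python
# Hypothetical mapping from the three configuration knobs on this slide to
# effective cache geometry. Physical parameters are those stated on the slide.

PHYS_WAYS, WAY_BYTES, PHYS_LINE = 4, 2048, 16

def cache_config(ways_on, concat, fetch_factor):
    """ways_on: ways not shut down; concat: physical ways fused per logical way;
    fetch_factor: physical lines fetched per miss."""
    assert ways_on in (1, 2, 4) and concat in (1, 2, 4) and fetch_factor in (1, 2, 4)
    assert concat <= ways_on
    return {
        "total_bytes": ways_on * WAY_BYTES,      # way shutdown: 2K / 4K / 8K
        "associativity": ways_on // concat,      # way concatenation: 1/2/4-way
        "line_bytes": PHYS_LINE * fetch_factor,  # variable fetch: 16/32/64 B
    }

print(cache_config(4, 4, 2))  # 8 KB direct-mapped cache with 32-byte lines
```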

14
Configurable Cache Design: Way Concatenation (4, 2, or 1 way)
[Figure: way-concatenation circuit. The address splits into tag (a31 down to a13), index (down to a5), and line offset (a4-a0); configuration registers reg0/reg1 combine with bits a11/a12 (tag or index bits, depending on configuration) to generate way-select signals c0-c3 for the tag part and the data array banks, which feed the column mux, sense amps, mux driver, and data output; the critical path is unchanged]
Trivial area overhead, no performance overhead
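A behavioral sketch of the configuration circuit in the figure: registers reg0/reg1 set the mode, and index bits a11/a12 select banks when ways are concatenated, so fewer banks are probed (and less dynamic power spent) in the low-associativity modes. The exact encoding and gate polarities here are assumed, not taken from the design.

```python
# Assumed behavioral model of the way-select signals c0-c3. The key property
# is that 4-way mode probes all banks while 1-way mode probes just one.

def way_enables(reg0, reg1, a11, a12):
    """Return (c0, c1, c2, c3): which of the four physical ways to probe."""
    if reg0 and reg1:                    # 4-way: probe all ways in parallel
        sel = [1, 1, 1, 1]
    elif reg0:                           # 2-way: a11 becomes an index bit,
        sel = [not a11, not a11, a11, a11]  # picking one pair of ways
    else:                                # 1-way: a11 and a12 pick one way
        bank = a12 * 2 + a11
        sel = [i == bank for i in range(4)]
    return tuple(int(x) for x in sel)

print(way_enables(1, 1, 0, 1))  # → (1, 1, 1, 1): all ways active in 4-way mode
print(way_enables(0, 0, 1, 1))  # → (0, 0, 0, 1): one way active in 1-way mode
```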
15
Configurable Cache Design Metrics
  • We computed power, performance, energy and size
    using
  • CACTI models
  • Our own layout (0.13 µm TSMC CMOS), Cadence tools
  • Energy considered cache, memory, bus, and CPU
    stall
  • Powerstone, MediaBench, and SPEC benchmarks
  • Used SimpleScalar for simulations
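The energy accounting listed above (cache, memory, bus, and CPU stall) can be written as a one-line model. Every per-event number below is invented for illustration; the slides derive theirs from CACTI, layout, and SimpleScalar simulation.

```python
# Illustrative sketch of the energy accounting on this slide: total energy is
# cache access energy + memory energy on misses + bus energy + CPU stall energy.

def total_energy(accesses, misses, e_access, e_mem, e_bus,
                 stall_cycles, p_cpu_stall, cycle_time):
    cache  = accesses * e_access             # dynamic cache access energy
    memory = misses * e_mem                  # off-chip memory energy per miss
    bus    = misses * e_bus                  # bus transfer energy per miss
    stall  = stall_cycles * p_cpu_stall * cycle_time  # CPU power while stalled
    return cache + memory + bus + stall

# Made-up example: 1M accesses, 2% miss rate, 20-cycle miss penalty
e = total_energy(1_000_000, 20_000, 0.5e-9, 50e-9, 5e-9,
                 20_000 * 20, 0.1, 2e-9)
print(e)  # total energy in joules for these illustrative numbers
```

Because misses cost memory, bus, and stall energy all at once, a cache configuration that trims the miss rate can cut total energy even if its per-access energy is slightly higher.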

16
Configurable Cache Energy Benefits
  • 40-50% energy savings on average
  • Compared to conventional 4-way and 1-way assoc.,
    32-byte line size
  • AND best for every example (remember, a
    conventional cache is a compromise)

17
Future Work
  • Dynamic cache tuning
  • More advanced dynamic partitioning
  • Automatic frequent loop detection
  • On-chip exploration tool
  • Better decompilation, synthesis
  • Better FPGA fabric, place and route
  • Approach: continue to extend to support more
    benchmarks
  • Extend to platforms with multiple processors
  • Scales well: processors can share on-chip
    partitioning tools

18
Conclusions
  • Self-improving configurable ICs
  • Provide excellent speed and energy improvements
  • Require no modification to existing software
    flows
  • Can thus be widely adopted
  • We've shown the idea is practical
  • Lean on-chip tools are possible
  • Now need to make them even better
  • Extensive research into algorithms, designs and
    architecture is needed