Title: Self-Improving Configurable IC Platforms
1. Self-Improving Configurable IC Platforms
- Frank Vahid
- Associate Professor
- Dept. of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
- http://www.cs.ucr.edu/vahid
- Co-PI Walid Najjar, Professor, CSE, UCR
2. Goal: Platform Self-Tunes to Executing Application
- Download standard binary
- Platform adjusts to executing application
- Result is better speed and energy
- Why and How?
3. Platforms
- Pre-designed programmable platforms
- Reduce NRE cost, time-to-market, and risk
- Platform designer amortizes design cost over large volumes
- Many (if not most) will include FPGA
  - Today: Triscend, Altera, Xilinx, Atmel
  - More sure to come, as FPGA vendors license to SoC makers
[Figure: sample platform with processor, cache, memory, FPGA, etc. Modern IC costs are feasible mostly in very high volumes.]
4. Hardware/Software Partitioning Improves Speed and Energy
- But requires a partitioning CAD tool
  - O.K. in some flows
  - In mainstream software flows, hard to integrate
[Flow diagram: Standard Sw Tools feeding a Hw/Sw Partitioner]
5. Idea: Perform Partitioning Dynamically (and hence Transparently)
- Add components on-chip
  - Profile
  - Decompile frequent loops
  - Optimize
  - Synthesize
  - Place and route onto FPGA
  - Update Sw to call FPGA
- Transparent
  - No impact on tool flow
- Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies
- But how can you profile, decompile, optimize, synthesize, and place and route, on-chip?
[Block diagram: processor with L1 cache and memory, plus on-chip Dynamic Partitioning Module (Profiler, Explorer, DAG, LC) and FPGA]
6. Dynamic Partitioning Requires Lean Tools
- How can you run Synopsys/Cadence/Xilinx tools on-chip, when they currently run on powerful workstations?
- Key: our tools only need to be good enough to speed up critical loops
  - Most time is spent in small loops (e.g., Mediabench, Netbench, EEMBC)
- Created ultra-lean versions of the tools
  - Quality not necessarily as good, but good enough
  - Runs on a 60 MHz ARM7
7. Dynamic Hw/Sw Partitioning Tool Chain
- We've developed efficient profiler Hw
- We're continuing to extend these tools to handle more benchmarks
- Architecture targeted for loop speedup, simple place and route
8. Dynamic Hw/Sw Partitioning Results
[Results chart]
9. Dynamic Hw/Sw Partitioning Results
- Powerstone, NetBench, and EEMBC examples, single most frequent loop only
- Average speedup very close to ideal speedup of 2.4
  - Not much left on the table in these examples
- Dynamically speeding up inner loops on FPGAs is feasible using on-chip tools
- ICCAD'02 (Stitt/Vahid): binary-level partitioning in general is very effective
10. Configurable Cache: Why?
- ARM920T: caches consume half of total processor system power (Segars '01)
- MCORE: unified cache consumes half of total processor system power (Lee/Moyer/Arends '99)
11. Best Cache for Embedded Systems?
- Diversity of associativity, line size, and total size
12. Cache Design Dilemmas
- Associativity
  - Low: low power, good performance for many programs
  - High: better performance on more programs
- Total size
  - Small: lower power if working set is small (and less area)
  - Big: better performance/power if working set is large
- Line size
  - Small: better when spatial locality is poor
  - Big: better when spatial locality is good
- Most caches are a compromise for many programs
  - Work best on average
- But embedded systems run one or a few programs
  - Want the best cache for that one program
13. Solution to the Cache Design Dilemma
- Configurable cache
  - Design a physical cache that can be reconfigured
- 1, 2, or 4 ways
  - Way concatenation: new technique, ISCA'03 (Zhang/Vahid/Najjar)
  - Four 2K ways, plus concatenation logic
- 8K, 4K, or 2K byte total size
  - Way shutdown, ISCA'03
  - Gates Vdd; saves both dynamic and static power, some performance overhead (5%)
- 16, 32, or 64 byte line size
  - Variable line fetch size, ISVLSI'03
  - Physical 16-byte line; one, two, or four physical line fetches
- Note: this is a single physical cache, not a synthesizable core
14. Configurable Cache Design: Way Concatenation (4, 2, or 1 way)
- Address split: tag (a31–a13), index (a12–a5), line offset (a4–a0)
- Configuration circuit: registers reg0/reg1 combine with address bits a11/a12 to produce way-select signals c0–c3
- Trivial area overhead, no performance overhead
[Circuit diagram: tag part and 6x64 data array banks, column mux, sense amps, mux driver, data output; critical path highlighted]
15. Configurable Cache Design Metrics
- We computed power, performance, energy, and size using
  - CACTI models
  - Our own layout (0.13 µm TSMC CMOS), Cadence tools
- Energy considered cache, memory, bus, and CPU stall
- Powerstone, MediaBench, and SPEC benchmarks
- Used SimpleScalar for simulations
16. Configurable Cache Energy Benefits
- 40–50% energy savings on average
  - Compared to conventional 4-way and 1-way associative caches with 32-byte line size
- AND, best for every example (remember, a conventional cache is a compromise)
17. Future Work
- Dynamic cache tuning
- More advanced dynamic partitioning
  - Automatic frequent-loop detection
  - On-chip exploration tool
  - Better decompilation, synthesis
  - Better FPGA fabric, place and route
- Approach: continue to extend to support more benchmarks
- Extend to platforms with multiple processors
  - Scales well: processors can share on-chip partitioning tools
18. Conclusions
- Self-improving configurable ICs
  - Provide excellent speed and energy improvements
  - Require no modification to existing software flows
  - Can thus be widely adopted
- We've shown the idea is practical
  - Lean on-chip tools are possible
- Now we need to make them even better
  - Extensive research into algorithms, designs, and architecture is needed