Title: Self-Improving Configurable IC Platforms
1. Self-Improving Configurable IC Platforms
- Frank Vahid
- Associate Professor
- Dept. of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
- http://www.cs.ucr.edu/vahid
- Co-PI Walid Najjar, Professor, CSE, UCR
2. Goal: Platform Self-Tunes to Executing Application
- Download standard binary
- Platform adjusts to executing application
- Result is better speed and energy
- Why and How?
3. Platforms
- Pre-designed programmable platforms
- Reduce NRE cost, time-to-market, and risk
- Platform designer amortizes design cost over large volumes
- Many (if not most) will include FPGA
  - Today: Triscend, Altera, Xilinx, Atmel
  - More sure to come, as FPGA vendors license to SoC makers
[Figure: sample platform with processor, cache, memory, FPGA, etc. Modern IC costs are feasible mostly in very high volumes.]
4. Hardware/Software Partitioning Improves Speed and Energy
- But requires a partitioning CAD tool
  - O.K. in some flows
  - In mainstream software flows, hard to integrate
[Flow diagram: Standard Sw Tools feeding a Hw/Sw Partitioner]
5. Idea: Perform Partitioning Dynamically (and hence Transparently)
- Add components on-chip
  - Profile
  - Decompile frequent loops
  - Optimize
  - Synthesize
  - Place and route onto FPGA
  - Update Sw to call FPGA
- Transparent
  - No impact on tool flow
- Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies
- But how can you profile, decompile, optimize, synthesize, and place and route, on-chip?
[Block diagram: processor with L1 cache and memory, plus on-chip Dynamic Partitioning Module (Profiler, Explorer, DAG, LC) and FPGA]
6. Dynamic Partitioning Requires Lean Tools
- How can you run Synopsys/Cadence/Xilinx tools on-chip, when they currently run on powerful workstations?
- Key: our tools only need to be good enough to speed up critical loops
  - Most time is spent in small loops (e.g., Mediabench, Netbench, EEMBC)
- Created ultra-lean versions of the tools
  - Quality not necessarily as good, but good enough
  - Runs on a 60 MHz ARM7
7. Dynamic Hw/Sw Partitioning Tool Chain
- We've developed efficient profiler Hw
- We're continuing to extend these tools to handle more benchmarks
- Architecture targeted for loop speedup, simple place and route
8. Dynamic Hw/Sw Partitioning Results
[Results chart]
9. Dynamic Hw/Sw Partitioning Results
- Powerstone, NetBench, and EEMBC examples, single most frequent loop only
- Average speedup very close to ideal speedup of 2.4
  - Not much left on the table in these examples
- Dynamically speeding up inner loops on FPGAs is feasible using on-chip tools
- ICCAD'02 (Stitt/Vahid): binary-level partitioning in general is very effective
10. Configurable Cache: Why?
- ARM920T: caches consume half of total processor system power (Segars '01)
- MCORE: unified cache consumes half of total processor system power (Lee/Moyer/Arends '99)
11. Best Cache for Embedded Systems?
- Diversity of associativity, line size, and total size
12. Cache Design Dilemmas
- Associativity
  - Low: low power, good performance for many programs
  - High: better performance on more programs
- Total size
  - Small: lower power if working set is small (and less area)
  - Big: better performance/power if working set is large
- Line size
  - Small: better when spatial locality is poor
  - Big: better when spatial locality is good
- Most caches are a compromise for many programs
  - Work best on average
- But embedded systems run one or a few programs
  - Want the best cache for that one program
13. Solution to the Cache Design Dilemma
- Configurable cache
  - Design a physical cache that can be reconfigured
- 1, 2, or 4 ways
  - Way concatenation: new technique, ISCA'03 (Zhang/Vahid/Najjar)
  - Four 2K ways, plus concatenation logic
- 8K, 4K, or 2K byte total size
  - Way shutdown, ISCA'03
  - Gates Vdd; saves both dynamic and static power, some performance overhead (5%)
- 16, 32, or 64 byte line size
  - Variable line fetch size, ISVLSI'03
  - Physical 16-byte line; one, two, or four physical line fetches
- Note: this is a single physical cache, not a synthesizable core
14. Configurable Cache Design: Way Concatenation (4, 2, or 1 way)
- Address split: tag (a31–a13), index (a12–a5), line offset (a4–a0)
- Configuration circuit: registers reg0/reg1 combine with address bits a11/a12 to produce way-select signals c0–c3
- Trivial area overhead, no performance overhead
[Circuit diagram: tag part and 6x64 data array banks, column mux, sense amps, mux driver, data output; critical path highlighted]
15. Configurable Cache Design Metrics
- We computed power, performance, energy, and size using
  - CACTI models
  - Our own layout (0.13 µm TSMC CMOS), Cadence tools
- Energy considered cache, memory, bus, and CPU stall
- Powerstone, MediaBench, and SPEC benchmarks
- Used SimpleScalar for simulations
16. Configurable Cache Energy Benefits
- 40–50% energy savings on average
  - Compared to conventional 4-way and 1-way associative caches with 32-byte line size
- AND, best for every example (remember, a conventional cache is a compromise)
17. Future Work
- Dynamic cache tuning
- More advanced dynamic partitioning
  - Automatic frequent-loop detection
  - On-chip exploration tool
  - Better decompilation, synthesis
  - Better FPGA fabric, place and route
- Approach: continue to extend to support more benchmarks
- Extend to platforms with multiple processors
  - Scales well: processors can share on-chip partitioning tools
18. Conclusions
- Self-improving configurable ICs
  - Provide excellent speed and energy improvements
  - Require no modification to existing software flows
  - Can thus be widely adopted
- We've shown the idea is practical
  - Lean on-chip tools are possible
- Now we need to make them even better
  - Extensive research into algorithms, designs, and architecture is needed