Title: System-on-a-Chip Platform Tuning for Embedded Systems
1. System-on-a-Chip Platform Tuning for Embedded Systems
- Frank Vahid
- Associate Professor
- Dept. of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
- http://www.cs.ucr.edu/vahid
- This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
2. How Much is Enough?
3. How Much is Enough?
Perhaps a bit small
4. How Much is Enough?
Reasonably sized
5. How Much is Enough?
Probably plenty big
6. How Much is Enough?
More than typically necessary
7. How Much is Enough?
Very few people could use this
8. How Much is Enough for an IC?
1993: 1 million logic transistors
Perhaps a bit small
9. How Much is Enough for an IC?
1996: 5-8 million logic transistors
Reasonably sized
10. How Much is Enough for an IC?
1999: 10-50 million logic transistors
Probably plenty big
11. How Much is Enough for an IC?
2002: 100-200 million logic transistors
More than typically necessary
12. How Much is Enough for an IC?
- Point of diminishing returns
  - 8-bit uC: 15K transistors
  - 32-bit ARM: 30K transistors
  - MPEG decoder: 1M transistors
  - 100M good enough for audio/video/etc.?
- Other examples
  - Fast cars (> 100 mph)
  - High-res digital cameras (> 4M pixels)
  - Disk space
  - Even IC performance
13. Very Few Companies Can Design High-End ICs
[Figure: Design productivity gap. Source: ITRS'99]
- Designer productivity growing at a slower rate than IC capacity
  - 1981: 100 designer months → $1M
  - 2002: 30,000 designer months → $300M
14. Meanwhile, ICs Themselves are Costlier
Technology (micron)   0.8       0.35      0.18      0.13
NRE                   $40k      $100k     $350k     $1,000k
Turnaround            42 days   49 days   56 days   76 days
Market                $3.5B     $6B       $12B      $18B
Source: DAC'01 panel on embedded programmable logic
- And take longer to fabricate
- While market windows are shrinking
- Less than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron
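The volume argument can be made concrete with a little arithmetic. The sketch below spreads the NRE figures from the table above over a production run; the two volumes are hypothetical values chosen only for illustration, and `nre_per_unit` is a helper name introduced here, not something from the talk.

```python
# NRE figures per technology node (micron), as quoted in the table above.
NRE_DOLLARS = {
    "0.8": 40_000,
    "0.35": 100_000,
    "0.18": 350_000,
    "0.13": 1_000_000,
}


def nre_per_unit(nre_dollars: float, volume: int) -> float:
    """Return the share of NRE cost carried by each unit at a given volume."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_dollars / volume


if __name__ == "__main__":
    # Hypothetical production volumes, not data from the slides.
    for tech, nre in NRE_DOLLARS.items():
        for volume in (1_000, 100_000):
            cost = nre_per_unit(nre, volume)
            print(f"{tech} micron, {volume:>7,} units: ${cost:,.2f} NRE per unit")
```

At low volume the NRE share per unit dominates any per-die cost, which is the pressure toward pre-fabricated parts that the following slides describe.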
15. Summarizing So Far...
Designers
16. Trend Towards Pre-Fabricated Platforms: ASSPs
- ASSP: application specific standard product
  - Domain-specific pre-fabricated IC
  - e.g., digital camera IC
- ASIC: application specific IC
- ASSP revenue > ASIC
- ASSP design starts > ASIC
  - Unique IC design
  - Ignores quantity of same IC
- ASIC design starts decreasing
  - Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September '01
17. A Sample Pre-Fabricated Platform
- Must be programmable for use in variety of products
  - Ideally also configurable
- Means high volume
  - Platform designer's investment pays off
  - Cost per IC is reasonable
- Use additional (readily available) transistors for high configurability
- Our research focus
  - Design and use of highly configurable platforms
[Figure: pre-fabricated platform IC containing uP, L1 cache, L2 cache, DSP, JPEG decoder, peripherals, and FPGA]
18. Commercial Highly-Configurable Platform Type: Single-Chip Microprocessor/FPGA Platforms
- Triscend E5: based on 8-bit 8051 CISC core
- 10 Dhrystone MIPS at 40 MHz
- 60 kbytes on-chip RAM
- Up to 40K logic gates
- Cost only about $4 (in volume)
19. Single-Chip Microprocessor/FPGA Platforms
- Atmel FPSLIC
- Field-Programmable System-Level IC
- Based on AVR 8-bit RISC core
- 20 Dhrystone MIPS
- 5k-40k configurable logic gates
- On-chip RAM (20-36Kb) and EEPROM
- $5-10
Courtesy of Atmel
20. Single-Chip Microprocessor/FPGA Platforms
- Triscend A7 chip
- Based on ARM7 32-bit RISC processor
- 54 Dhrystone MIPS at 60 MHz
- Up to 40k logic gates
- On-chip cache and RAM
- $10-20 in volume
Courtesy of Triscend
21. Single-Chip Microprocessor/FPGA Platforms
- Altera's Excalibur EPXA 10
- ARM (922T) hard core
- 200 Dhrystone MIPS at 200 MHz
- Devices range from 200k to 2 million programmable logic gates
Source: www.altera.com
22. Single-Chip Microprocessor/FPGA Platforms
- Xilinx Virtex II Pro
- PowerPC based
- 420 Dhrystone MIPS at 300 MHz
- 1 to 4 PowerPCs
- 4 to 16 gigabit transceivers
- 12 to 216 multipliers
- 3,000 to 50,000 logic cells
- 200k to 4M bits RAM
- 204 to 852 I/O
- $100-500 (> 25,000 units)
- Up to 16 serial transceivers
  - 622 Mbps to 3.125 Gbps
[Figure: Virtex II Pro device with PowerPCs and configurable logic labeled. Courtesy of Xilinx]
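The processor figures quoted on these platform slides use different clock rates, so one rough way to compare the cores is Dhrystone MIPS per MHz. The sketch below only divides the numbers given on the slides; the Atmel FPSLIC is left out because no clock rate is quoted for it, and `dmips_per_mhz` is a helper name introduced here for illustration.

```python
# (Dhrystone MIPS, clock in MHz) exactly as quoted on the slides above.
PLATFORMS = {
    "Triscend E5 (8051)": (10, 40),
    "Triscend A7 (ARM7)": (54, 60),
    "Altera Excalibur (ARM922T)": (200, 200),
    "Xilinx Virtex II Pro (PowerPC)": (420, 300),
}


def dmips_per_mhz(dmips: float, mhz: float) -> float:
    """Normalize a Dhrystone MIPS rating by the clock rate it was measured at."""
    if mhz <= 0:
        raise ValueError("clock rate must be positive")
    return dmips / mhz


if __name__ == "__main__":
    for name, (dmips, mhz) in PLATFORMS.items():
        print(f"{name}: {dmips_per_mhz(dmips, mhz):.2f} DMIPS/MHz")
```

This ratio says nothing about the configurable logic itself; it only puts the hard processor cores on a common scale.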
23. Single-Chip Microprocessor/FPGA Platforms
- Why wouldn't future microprocessor chips include some amount of on-chip FPGA?
24. Single-Chip Microprocessor/FPGA Platforms
- Lots of silicon area taken up by configurable logic
  - As discussed earlier, less of an issue every year
- Smaller area doesn't necessarily mean higher yield (lower costs) any more
  - Previously could pack more die onto a wafer
  - But die are becoming pad (pin) limited in nanoscale technologies
- Configurable logic typically used for peripherals, glue logic, etc.
  - We have investigated another use...
25. Software Improvements using On-Chip Configurable Logic
- Partitioned critical software loops onto on-chip FPGA for several benchmarks
- Performed physical measurements on Triscend A7 and E5 devices
[Figure: Triscend A7 development board, with the A7 IC labeled]
Work done by Greg Stitt, Brian Grattan, Shawn Nematbakhsh at UCR
26. Software Improvements using On-Chip Configurable Logic
- Extensive simulated results for 8051 and MIPS
  - (Physical measurement very time consuming)
- For Powerstone (PS), MediaBench (MB) and Netbench (NB)
27. Speedup Gained with Relatively Few Gates
- Created several partitioned versions of each benchmark
- Most speedup gained with first 20,000 gates; diminishing returns after that
  - Surprisingly few gates
- Stitt, Grattan and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002
- Stitt and Vahid, IEEE Design and Test, Dec. 2002
- J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).
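One way to see why returns diminish is Amdahl's law: once the critical loops run much faster in configurable logic, the software left on the microprocessor dominates execution time. The sketch below is a generic illustration under that assumption, not the authors' measurement method; the loop fraction and loop speedups are hypothetical values.

```python
def overall_speedup(loop_fraction: float, loop_speedup: float) -> float:
    """Amdahl's law for hardware/software partitioning.

    loop_fraction: share of original execution time spent in the loops
                   moved to configurable logic (0.0 to 1.0).
    loop_speedup:  how many times faster those loops run in hardware.
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    if loop_speedup <= 0:
        raise ValueError("loop_speedup must be positive")
    remaining = (1.0 - loop_fraction) + loop_fraction / loop_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical program spending 80% of its time in critical loops.
    for hw_speedup in (2, 10, 100, 1000):
        total = overall_speedup(0.8, hw_speedup)
        print(f"loops {hw_speedup:>4}x faster -> program {total:.2f}x faster")
```

With 80% of the time in loops, the whole program can never exceed a 5x speedup no matter how much logic is spent on the loops, which matches the qualitative shape of the result on this slide.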
28. Other Types of Configurability
- Microprocessor (other researchers)
  - VLIW configurations
  - Voltage scaling
- Memory hierarchy
  - Our focus: build a highly-configurable cache that can be tuned to a particular program
  - Work by Chuanjun Zhang, along with Walid Najjar, at UCR
29. Cache Contributes Much to Performance and Power
- Well-known for performance
- Energy
  - ARM920T: caches consume nearly half of total power (Segars '01)
  - MCORE: unified cache consumes half of total power (Lee/Moyer/Arends '99)
[Figure: Processor, L1 Cache, and Mem blocks; ARM920T. Source: Segars, ISSCC'01]
30. Associativity Plays a Big Role
- Reduces miss rate, thus improving performance
- Impact on power and energy?
  - (Energy = Power × Time)
31. Associativity is Costly
- Associativity improves hit rate, but at the cost of more power per access
- Are the power savings from reduced misses outweighed by the increased power per hit?
[Figure: Energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only)]
[Figure: Energy per access for 8 Kbyte cache]
32. Associativity and Energy
- Best performing cache is not always lowest energy
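The trade-off between energy per access and miss rate can be sketched with a simple model: every access pays the cache's per-access energy, and every miss additionally pays an off-chip penalty. All numbers below are hypothetical and chosen only to show that either cache can win; they are not measurements from the talk, and the names `CacheConfig` and `total_energy_nj` are introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    name: str
    energy_per_access_nj: float  # hypothetical dynamic energy per access
    miss_rate: float             # hypothetical, depends on the program


def total_energy_nj(cfg: CacheConfig, accesses: int, miss_penalty_nj: float) -> float:
    """Total energy = per-access cache energy + extra energy for each miss."""
    misses = accesses * cfg.miss_rate
    return accesses * cfg.energy_per_access_nj + misses * miss_penalty_nj


if __name__ == "__main__":
    accesses = 1_000_000
    miss_penalty_nj = 20.0  # hypothetical cost of going to memory

    # Program A: few conflict misses, so associativity helps little.
    # Program B: many conflict misses in a direct-mapped cache.
    programs = {
        "A": (CacheConfig("direct-mapped", 0.3, 0.020), CacheConfig("4-way", 0.6, 0.015)),
        "B": (CacheConfig("direct-mapped", 0.3, 0.100), CacheConfig("4-way", 0.6, 0.020)),
    }
    for prog, configs in programs.items():
        best = min(configs, key=lambda c: total_energy_nj(c, accesses, miss_penalty_nj))
        print(f"program {prog}: lowest energy with {best.name}")
```

Under these assumed numbers, the four-way cache has the lower miss rate for both programs, yet the direct-mapped cache uses less energy for program A.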
33. So What's the Best Cache?
- Looking at popular embedded processors, there's obviously no standard cache
- Dilemma
  - Direct mapped: good performance and energy for most programs
  - Four-way: good performance for all programs, but at cost of higher power per access for all programs
  - Do we design for the average case or the worst case?
34. Solution to the Dilemma
- Configurable cache
  - Can be configured as four way, two way, or one way
  - Ways can be concatenated
  - Furthermore, ways can even be shut down to decrease total size
[Figure label: Memory]
35. Configurable Cache Design: Way Concatenation
- Small area and performance overhead
[Figure: way-concatenation circuit. Address fields: tag address a31-a13, bits a12 and a11, index a10-a5, line offset a4-a0. A configuration circuit (reg0, reg1, a11, a12) produces signals c0-c3 for the tag part and data array. Other labels: bitline, 6x64, column mux, sense amps, mux driver, data output, critical path]
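The address split in the figure suggests how way concatenation can work: bits a11 and a12 either take part in the tag comparison (four-way) or select which banks are active (two-way and direct-mapped). The sketch below is an assumed behavioral model, not the authors' circuit. The pairing of banks into logical ways is a guess, and the configuration is passed as an `associativity` argument because the encoding of reg0 and reg1 is not given on the slide.

```python
def split_address(addr: int) -> dict:
    """Split a 32-bit address into the fields labeled in the figure."""
    return {
        "offset": addr & 0x1F,            # a4..a0, line offset
        "index": (addr >> 5) & 0x3F,      # a10..a5, row within a bank
        "a11": (addr >> 11) & 1,
        "a12": (addr >> 12) & 1,
        "tag": (addr >> 13) & 0x7FFFF,    # a31..a13
    }


def enabled_banks(addr: int, associativity: int) -> tuple:
    """Return which of the four banks (c0..c3) are activated for an access.

    Assumed mapping (hypothetical): banks 0-1 form one logical way and
    banks 2-3 the other in two-way mode.
    """
    fields = split_address(addr)
    if associativity == 4:
        # Four-way: every bank is probed; a11 and a12 act as extra tag bits.
        return (True, True, True, True)
    if associativity == 2:
        # Two-way: a11 picks one bank inside each concatenated pair.
        return tuple(bank % 2 == fields["a11"] for bank in range(4))
    if associativity == 1:
        # Direct-mapped: a12 and a11 together pick exactly one bank.
        selected = 2 * fields["a12"] + fields["a11"]
        return tuple(bank == selected for bank in range(4))
    raise ValueError("associativity must be 1, 2, or 4")


if __name__ == "__main__":
    addr = 0x1800  # a12 = 1, a11 = 1
    for ways in (4, 2, 1):
        print(ways, enabled_banks(addr, ways))
```

In this model the total capacity stays the same in every mode, while the number of banks activated per access drops from four to two to one.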
36. Configurable Cache Experiments
[Figure: results normalized so that 100% = 4-way conventional cache]
- Configurable cache with both way concatenation and way shutdown is superior on every benchmark
- Considered Powerstone, MediaBench, and Spec2000
- Tuning the cache to the program is important
- Work submitted to High-Performance Computer Architectures 2003, Zhang, Vahid and Najjar
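Tuning the cache to a program can be viewed as a small search: evaluate each configuration for that program, normalize to the 4-way conventional cache, and keep the lowest. The sketch below assumes the per-configuration energies have already been obtained by simulation or measurement; the numbers and configuration names are hypothetical placeholders, not results from the talk.

```python
def normalize_to_baseline(energy_by_config: dict, baseline: str = "4-way") -> dict:
    """Express each configuration's energy as a percentage of the baseline."""
    base = energy_by_config[baseline]
    if base <= 0:
        raise ValueError("baseline energy must be positive")
    return {name: 100.0 * energy / base for name, energy in energy_by_config.items()}


def best_config(energy_by_config: dict) -> str:
    """Return the name of the configuration with the lowest energy."""
    return min(energy_by_config, key=energy_by_config.get)


if __name__ == "__main__":
    # Hypothetical energies (arbitrary units) for one program.
    measured = {
        "4-way": 50.0,
        "2-way": 40.0,
        "direct-mapped": 45.0,
        "2-way, half size": 42.0,
    }
    for name, pct in normalize_to_baseline(measured).items():
        print(f"{name}: {pct:.0f}% of 4-way")
    print("best:", best_config(measured))
```

With only a handful of configurations an exhaustive comparison like this is cheap; automating the tuning is listed as future work in the conclusions.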
37. Conclusions
- Trend is away from semi-custom IC fabrication
  - ICs are big enough; other pressures encourage buying pre-fabricated platforms
- Platforms must be highly configurable
  - To be useful for a variety of applications, and hence mass produced
- We have discussed
  - Software speedup/energy benefits of on-chip configurable logic: 3x speedups with only 10,000 gates
  - Creating a highly-configurable cache architecture: 40% energy savings compared to conventional cache
- Current/future work (collaborators: Walid Najjar, UCR; Nik Dutt, UCI)
  - Automatically partitioning software loops to configurable logic
    - Several approaches: platform-assisted, and dynamically on-chip
    - Work being done by Roman Lysecky, Susan Cotterell, Greg Stitt, and Shawn Nematbakhsh at UCR
  - Automatically tuning a configurable cache
    - Ann Gordon-Ross at UCR