Title: Architecture Tuning in Embedded Systems
1Architecture Tuning in Embedded Systems
- Greg Stitt, Frank Vahid, Tony Givargis
- Dept. of Computer Science & Engineering
- University of California, Riverside
Roman Lysecky Department of IP Management
Conexant Newport Beach
This work was supported by the National Science
Foundation under grants CCR-9811164 and
CCR-9876006, and by a Design Automation
Conference graduate scholarship.
This work is being presented at CASES'00
(Compilers, Architectures and Synthesis for
Embedded Systems), November 18-19, 2000, San
Jose, CA.
2A short list of embedded systems
Anti-lock brakes; Auto-focus cameras; Automatic teller machines;
Automatic toll systems; Automatic transmission; Avionic systems;
Battery chargers; Camcorders; Cell phones; Cell-phone base stations;
Cordless phones; Cruise control; Curbside check-in systems;
Digital cameras; Disk drives; Electronic card readers;
Electronic instruments; Electronic toys/games; Factory control;
Fax machines; Fingerprint identifiers; Home security systems;
Life-support systems; Medical testing systems;
Modems; MPEG decoders; Network cards; Network switches/routers;
On-board navigation; Pagers; Photocopiers; Point-of-sale systems;
Portable video games; Printers; Satellite phones; Scanners;
Smart ovens/dishwashers; Speech recognizers; Stereo systems;
Teleconferencing systems; Televisions; Temperature controllers;
Theft tracking systems; TV set-top boxes; VCRs, DVD players;
Video game consoles; Video phones; Washers and dryers
- And the list goes on and on
3Introduction: Traditional microprocessor use in embedded systems
- Tasks (not necessarily in the given order)
- (1) Buy a microprocessor IC (integrated circuit)
- (2) Integrate it with other ICs onto a board and insert it into an embedded system
- (3) Download a software program
[Figure: software, processor, and board, labeled with task numbers 1-3]
- Notice that the processor IC is designed independently of the software
- Different microprocessor variations thus exist, such as low-power or high-performance ICs
4Introduction: Modern core-based approach
- Tasks
- (1) Buy a microprocessor CORE
- Hard: layout. Firm: structural HDL. Soft: synthesizable HDL
- You are buying Intellectual Property, like a file that may come on a floppy, CD-ROM, over the web, etc. You are NOT buying hardware.
- (2) Design a system-on-a-chip (SOC) from this and other cores
- (3) Fabricate a SOC IC
- (4) Insert the IC into an embedded system
- (5) Download a software program
[Figure: core-based flow showing software, processor, and HDL blocks, labeled with task numbers 1-5]
5Introduction: embedded systems' unique feature of a fixed program
- SOCs implementing an embedded system have a unique feature
- Each implements a particular application
- Thus, the processor may execute a single fixed program that never changes
- Unlike desktop systems, which execute a variety of programs
- Examples: digital camera, automobile cruise controller
- We can exploit this fixed-program feature
- For example, by using mask-programmed ROM
- But much more can be done
The software in here never changes after
production
6Introduction: Proposed core-based approach with architecture tuning
- Tasks
- (1) Buy a microprocessor core
- (2) Design a system-on-a-chip (SOC) from this and other cores
- (3) TUNE the SOC architecture to a software program
- (4) Fabricate a SOC IC
- (5) Insert the IC into an embedded system
- (6) Download the software program
[Figure: core-based flow with a tuning step, showing software, processor, and HDL blocks, labeled with task numbers 1-6]
7Introduction: architecture tuning
- Architecture tuning
- A way to exploit the fixed-program feature of embedded systems
- First, do architecture design for the particular application
- Then, tune the core-based system architecture to the particular application program, before IC fabrication
- Goals: better performance, power, size
[Figure: three stages for a fixed program. Architecture design (program, processor, peripheral); architecture tuning (HDL, program, processor, peripheral); fabrication (HDL, program, tuned processor and peripheral cores, IC)]
8Introduction: architecture tuning
- Examples of tuning optimizations
- Memory hierarchy: no cache, L1 cache, L1+L2 cache
- Cache organization: size, associativity, write policies
- Bus structure, data/address encoding
- DMA block sizes
- Microprocessor optimizations
- Internal small-loop table
- Controller partitioning
- Datapath shortcuts
- Register file copies
9Introduction: Tuning is a special case of Y-Chart iteration
- Philips/TriMedia approach of simultaneously developing architecture and its applications
[Figure: Y-chart with boxes labeled Architecture, Applications, Mapping, Analysis, and Numbers]
10Problem description
- Focus of this work
- Tuning a microcontroller to its program
- Goal is reduced power without performance loss
- Restrict tuning to maintain exact instruction set compatibility
- No instructions may be added or deleted
- Thus, no modification to software development environment
- Also, no problems with porting software to/from other versions of the microcontroller
- Instruction set incompatibility can be a show stopper
- Maintenance/upgrades/re-porting of binaries over the lifetime of product and for product variations is a key issue
- Likewise, a stable software development environment is needed
11Previous work
- Application-specific instruction-set processors [Fisher99]
- Customize a microprocessor to its application(s)
- Delete unnecessary instructions, add new ones along with accompanying datapath extensions
- e.g., Tensilica
- Customized instruction-set requires customized development tools (e.g., compiler, debugger)
- Tuning compiler to architecture [Tiwari et al 94]
- Architectural description languages to inform compiler of architecture features [Halambi et al 99]
- Tuning cache and cache/bus organization to application [Givargis et al 99]
12Tuning environment
- Currently for the 8051 microcontroller
- Starts from VHDL synthesizable model of 8051 (soft core)
- Uses Synopsys synthesis, simulation and power analysis
- Uses 8051 instruction-set simulator
- Uses numerous scripts
- Goal of the environment
- Understand how power is being consumed for a particular application, so that modifications to the architecture (or application) can be made to minimize that power
- Three main tools
- Architectural view
- Instruction-set view
- Program/data memory view
13Tuning environment: architectural view tool
14Tuning environment: instruction-set view tool
Instruction  Power (mW)
ADDC_1       7.340834
ADD_1        7.350741
ANL_1        6.631394
CLR_1        3.76228
CPL_1        5.481627
DA           5.28897
DEC_1        5.368807
DIV          7.716592
INC_1        4.662862
MOVC_1       6.078014
MOVC_2       5.021021
MOV_1        5.577664
MOV_2        6.164267
MUL          5.522886
NOP          4.900275
ORL_1        6.954121
POP          8.103867
PUSH         8.7116
15Tuning environment: program/data memory view tool
Addr   Ins    Freq  Pwr      Freq*Pwr
00000  LJMP   1     0        0
00003  MOV_9  108   5.46067  589.752
00005  MOV_9  108   5.46067  589.752
00007  MOV_9  108   5.46067  589.752
00009  MOV_9  108   5.46067  589.752
00011  RET    108   0        0
00012  MOV_9  27    5.46067  147.438
00014  MOV_9  27    5.46067  147.438
00016  MOV_9  27    5.46067  147.438
00018  MOV_9  27    5.46067  147.438
00020  MOV_4  27    4.83507  130.547
00022  LCALL  27    0        0

Addr   Purpose  Accesses
00128  P0       1311
00129  SP       70317
00130  DPL      31189
00131  DPH      7977
00144  P1       161
00208  PSW      413527
00224  ACC      360949
00240  B        2598
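The last column of the program memory view is the product of an instruction's execution frequency and its power figure, so the rows with the largest product are the first places to look for savings. A minimal Python sketch of that calculation, using the rows shown on this slide; the tuple layout and the function names are illustrative assumptions, not the format or scripts of the actual tuning environment.

```python
# Rows copied from the program memory view: (address, instruction, frequency, power).
# The tuple layout and helper names are hypothetical, chosen only for this sketch.
ROWS = [
    (0, "LJMP", 1, 0.0),
    (3, "MOV_9", 108, 5.46067),
    (5, "MOV_9", 108, 5.46067),
    (7, "MOV_9", 108, 5.46067),
    (9, "MOV_9", 108, 5.46067),
    (11, "RET", 108, 0.0),
    (12, "MOV_9", 27, 5.46067),
    (14, "MOV_9", 27, 5.46067),
    (16, "MOV_9", 27, 5.46067),
    (18, "MOV_9", 27, 5.46067),
    (20, "MOV_4", 27, 4.83507),
    (22, "LCALL", 27, 0.0),
]


def weighted_power(rows):
    """Return (address, instruction, frequency * power) for every row."""
    return [(addr, ins, freq * pwr) for addr, ins, freq, pwr in rows]


def hottest(rows, count=3):
    """Return the `count` rows with the largest frequency * power product."""
    ranked = sorted(weighted_power(rows), key=lambda row: row[2], reverse=True)
    return ranked[:count]


for addr, ins, product in weighted_power(ROWS):
    print(f"{addr:05d} {ins:<6} {product:.3f}")
```

Calling `hottest(ROWS)` returns the MOV_9 instructions executed 108 times, which is consistent with the later choice of a MOV shortcut as a datapath optimization.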
16Tuning environment
17Design flow using the tuning environment
18Experiments
- Started with 8051 soft core in VHDL
- Tuning environment was used to
- Examine where power consumption was occurring for a given application
- Quickly evaluate the impact of tuning optimizations
- These are early results; much more work remains
19Power consumption of the initial 8051 model
- Power consumption
- Mainly due to switching wires
- Any wire whose value changes (from 0 to 1) consumes power
- Want to minimize switching
- 8051 power consumption
- 5 main components
- Controller, RAM, and ALU are the most expensive components
- These components have potential for general optimizations
- Total gates - 25854
Average power: 37.1824 mW
20General optimizations made to the 8051
- Prevent unnecessary switching on wires connecting to memories
- Wires connecting processor to memories are high capacitance
- They were switching even when not being used
- So we inserted latches to hold the previous value, a standard power-saving technique
- Prevent unnecessary switching in decoder and ALU
- Again, by latching the inputs coming from the controller
- Fetch instruction bytes only when needed
- Hold ROM output when not being read
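The effect of holding a bus at its previous value can be illustrated by counting bit transitions on a trace with and without a hold latch. The Python sketch below is only a behavioral illustration: the bus trace, the enable pattern, and the function names are made up for this example, and it counts transitions in both directions as a simple proxy for switching activity rather than modeling the 8051 VHDL or the Synopsys power analysis.

```python
def count_toggles(values, width=8):
    """Count bit transitions between consecutive values on a bus."""
    mask = (1 << width) - 1
    toggles = 0
    for prev, cur in zip(values, values[1:]):
        toggles += bin((prev ^ cur) & mask).count("1")
    return toggles


def hold_when_idle(values, enables):
    """Model a latch: pass the value through when enabled, otherwise hold the last one."""
    held = []
    last = 0
    for value, enable in zip(values, enables):
        if enable:
            last = value
        held.append(last)
    return held


# Hypothetical 8-bit bus trace: the bus is actually in use only on cycles 0 and 3.
bus = [0x00, 0xFF, 0x0F, 0xF0, 0x00]
enables = [True, False, False, True, False]

raw_toggles = count_toggles(bus)
latched_toggles = count_toggles(hold_when_idle(bus, enables))
print(raw_toggles, latched_toggles)
```

On this made-up trace the unlatched bus toggles 24 bits and the latched bus toggles 4, since the values driven during idle cycles never reach the high-capacitance wires.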
21Power after general optimizations
- Overall power reduction from 37.2 to 11.6 mW
- Total gates - 25951
- Improvements
- ROM: 82.9%
- RAM: 70.5%
- ALU: 60.0%
- CTR: 19.9%
Average power: 11.6025 mW
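The overall saving and its area cost follow directly from the reported averages and gate counts. A small Python sketch of that arithmetic, using only the figures quoted on the slides for the initial and generally optimized 8051 models; the variable and function names are illustrative.

```python
def percent_change(before, after):
    """Return the relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Figures reported for the initial model and after the general optimizations.
power_before_mw = 37.1824
power_after_mw = 11.6025
gates_before = 25854
gates_after = 25951

power_saving = -percent_change(power_before_mw, power_after_mw)
gate_overhead = percent_change(gates_before, gates_after)

print(f"power saving: {power_saving:.1f}%")    # about 68.8%
print(f"gate overhead: {gate_overhead:.2f}%")  # about 0.38%
```

In other words, the latches added 97 gates and removed roughly two thirds of the average power.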
22Tuning optimizations
- Sought to tune the microprocessor to a particular application
- GCD (greatest common divisor) computation
- Tuning optimizations invoked
- 1) Replace frequently accessed RAM locations by internal registers
- 2) Create datapath shortcuts for most common instructions
- 3) Partition the controller into a big controller and a small controller, with the small one handling the most frequently executed GCD instructions
23Sample tuning optimization
- Observation
- RAM consumes much power
- Address 224 accessed frequently
- Possible tuning optimization
- Replace this RAM location by a register
- Steps
- Modify VHDL model
- Run all three view tools
- Results
- Power reduction 7.67 to 7.27 mW
- RAM reduced from 1.42 to 0.8 mW, CTRL increased
slightly
Addr   Purpose  Accesses
00128  P0       1311
00129  SP       70317
00130  DPL      31189
00131  DPH      7977
00144  P1       161
00208  PSW      413527
00224  ACC      360949
00240  B        2598
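Choosing which RAM locations to move into registers amounts to ranking the data memory view by access count. A minimal Python sketch using the access counts listed on this slide; the dictionary layout and the function name are assumptions made for illustration, and the shares are computed only over the eight addresses shown here.

```python
# Access counts copied from the data memory view: address -> (purpose, accesses).
ACCESSES = {
    128: ("P0", 1311),
    129: ("SP", 70317),
    130: ("DPL", 31189),
    131: ("DPH", 7977),
    144: ("P1", 161),
    208: ("PSW", 413527),
    224: ("ACC", 360949),
    240: ("B", 2598),
}


def register_candidates(accesses, count=2):
    """Return the `count` most-accessed locations with their share of listed accesses."""
    total = sum(n for _, n in accesses.values())
    ranked = sorted(accesses.items(), key=lambda item: item[1][1], reverse=True)
    return [(addr, name, n, 100.0 * n / total) for addr, (name, n) in ranked[:count]]


for addr, name, n, share in register_candidates(ACCESSES):
    print(f"{addr:05d} {name:<4} {n:>7} {share:5.1f}%")
```

With these counts the top two candidates are PSW (address 208) and ACC (address 224), which together account for about 87% of the listed accesses.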
24Replacing certain RAM locations by registers
- PSW and accumulator are separated from RAM entity, placed in internal registers
- Total gates - 26465
- Improvements
- RAM: 46.1%
- Overall: 15.8%
Average power: 9.7684 mW
25Optimized datapath
Addr   Ins    Freq  Pwr      Freq*Pwr
00000  LJMP   1     0        0
00003  MOV_9  108   5.46067  589.752
00005  MOV_9  108   5.46067  589.752
00007  MOV_9  108   5.46067  589.752
00009  MOV_9  108   5.46067  589.752
00011  RET    108   0        0
00012  MOV_9  27    5.46067  147.438
00014  MOV_9  27    5.46067  147.438
00016  MOV_9  27    5.46067  147.438
00018  MOV_9  27    5.46067  147.438
00020  MOV_4  27    4.83507  130.547
00022  LCALL  27    0        0
- MOV from reg7 to ACC very common
- Add shortcut signal to register file
- Avoids having data go through ALU
- Total Gates - 26315
- Power reduced by 0.32 mW (2.7%)
Average power: 11.2857 mW
26Controller Partitioning
- Motivation
- In many applications, 90% of the time is spent in 10% of the code (or some similar ratio)
- So let's partition the controller into two, with one handling the 10% of frequently executed code
- This smaller controller should consume less power
- Results
- Average power reduced from 11.6 mW to 11.3 mW (2.6%)
- Total gates - 28731
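One way to decide which instructions the small controller should handle is to pick the fewest instruction types that cover a target share of dynamic executions. The Python sketch below does this for the program-view excerpt shown earlier in the slides. It uses execution counts as a stand-in for time, treats the 90% threshold as a parameter, and uses a hypothetical function name; it is an illustration of the selection idea, not the procedure used to build the partitioned controller.

```python
from collections import Counter

# (instruction, execution frequency) pairs from the program memory view excerpt.
EXECUTIONS = [
    ("LJMP", 1),
    ("MOV_9", 108), ("MOV_9", 108), ("MOV_9", 108), ("MOV_9", 108),
    ("RET", 108),
    ("MOV_9", 27), ("MOV_9", 27), ("MOV_9", 27), ("MOV_9", 27),
    ("MOV_4", 27),
    ("LCALL", 27),
]


def hot_instructions(executions, coverage=0.9):
    """Return the fewest instruction types covering `coverage` of all executions."""
    totals = Counter()
    for ins, freq in executions:
        totals[ins] += freq
    total = sum(totals.values())
    hot, covered = [], 0
    for ins, freq in totals.most_common():
        if covered >= coverage * total:
            break
        hot.append(ins)
        covered += freq
    return hot, covered / total


hot, fraction = hot_instructions(EXECUTIONS)
print(hot, f"{100 * fraction:.1f}%")
```

For this excerpt, two of the five instruction types (MOV_9 and RET) account for about 92% of executed instructions, which is the kind of skew that makes a small dedicated controller attractive.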
27Conclusions
- Described an environment for tuning a microprocessor to its application for low power
- Full instruction set compatibility
- Multiple views help find power hogs
- Fully automated
- Focus is now on developing tuning optimizations
- Controller partitioning, small-loop table, datapath shortcuts, register-file copies, etc.
- Investigate possibility of automating tuning optimizations, develop more general tuning methodology
- Environment for the 8051 is available on the web
- http://www.cs.ucr.edu/dalton