Title: ARMO: Synthesizable Modeling of Mobile Phones to Facilitate Architectural Studies of Performance and
1 ARMO: Synthesizable Modeling of Mobile Phones to Facilitate Architectural Studies of Performance and Energy-Efficiency
- Nokia-MIT Meeting
- Helsinki, Finland
- June 5, 2006
2 Nokia-MIT Architecture Team
- MIT
- Faculty: Arvind, Krste Asanovic, Anantha Chandrakasan
- Students: Nirav Dave, (Alfred) Man Cheuk Ng, ChunChieh V Lin, Daniel Finchelstein, Steve Gerding, Jae Lee, Christopher Batten, Michal Karczmarek, Ken Barr
- Nokia
- Gopal Raghavan, Soracha Nananukul, Jamey Hicks, John Ankcorn
3 Project Overview
- Goal: Provide 10x performance improvement with same power envelope and same semiconductor technology
- Approach: New design, simulation, and synthesis methodologies; new circuit techniques
- Specify components in a form (transactors) that supports flexible partitioning between software and hardware for effective design space exploration
- Run full-system simulations with components at multiple levels of abstraction for early stage performance and power analysis
- Ultra-low-power DSP technologies using sub-threshold logic (see poster)
- Application Target: Next-generation cellphone running next-generation applications
- We expect many exciting new applications (including those developed in the Nokia-MIT collaboration) to require unprecedented performance from a handheld device
4 Example Design Scenario
- Build an H.264 decoder for a digital video receiver that runs within 100mW
5 Engineering Design Problem
[Figure: New App. Processor; block labels: ARM, DSP, GPU, IVA]
6 Multicore Processors
IBM/Sony Cell Processor
- Advantages
- Easier to scale hardware design as complexity is contained within processors
- Develop complex applications using just software tools
- A few minor problems?
- Power: up to 1000X worse than customized standard cells
- Performance: up to 100X worse
- Area: up to 100X greater
- And, do we really know how to program these?
7 Another popular platform vision: Field-Programmable Gate Arrays
- Programmable logic
- Dramatically reduce the cost of chip design errors, spin-a-day
- Remove the mask costs from each design
- A few minor problems? (From Kuon & Rose, FPGA 2006)
- Switching power: around 12X worse than standard cells
- Performance: 3-4X worse than standard cells
- Area: 20-40X greater than standard cells
- And, requires tremendous low-level design effort
8 Standard Cells versus Hand-Crafted Design
- Even standard-cell methodology leaves a lot of capability on the table versus hand-crafted design (draw each transistor by hand)
- Power: 3-10X worse than hand-crafted
- Speed: 3-8X worse
- Area: 3-15X worse
- BUT hand-crafted needs 10-20X design effort
- IBM/Sony Cell microprocessor had $400M development budget but hit 4GHz in 90nm technology
- Standard-cell chip costs $20M to develop and hits 400MHz in 90nm technology
9 Power-Flexibility Conflict
Source: T. Claasen (ISSCC99)
[Figure: power efficiency (32b GOPS/Watt, log scale from 0.001 to 1000) versus feature size (2, 1, 0.5, 0.25, 0.13, 0.07 µm). Curve labels: Hand-Crafted Design; Standard-Cell Based Design; Coarse-Grain Reconfigurable; Field-Programmable Gate Array; Application-Specific Processor, DSP; General Purpose Processor. Group labels: Hardwired data paths; Reconfigurable Logic; Instruction Set Processors.]
10 Our Vision of Future System-on-Chip (SoC): Highly structured but heterogeneous
Near full-custom quality application-specific hardware
On-chip memory banks
Structured on-chip networks, high-bandwidth low-power global communications
General-purpose processors
Application-specific processors
11 Keys to Future Efficient SoCs
- Parallelism: Meet throughput goals using multiple parallel execution units to allow system to run with lower clock rate (1/2 clock rate → 1/4 energy)
- Locality: Avoid global communication to reduce latency and to reduce wire switching power
- Specialization: Use customized logic or new instructions to improve efficiency of application hot-spots (10-1000x improvement over general-purpose processing for certain tasks)
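The "1/2 clock rate, 1/4 energy" figure matches the usual first-order CMOS scaling argument. The sketch below is a rough illustration only. It assumes that dynamic energy per operation scales with the square of supply voltage, that supply voltage can be lowered in proportion to clock rate, and that leakage is ignored; none of these assumptions is stated on the slide, and the function names are made up for this example.

```python
def relative_energy_per_op(clock_scale: float) -> float:
    """Energy per operation relative to baseline.

    Assumptions (not from the slides): supply voltage scales linearly
    with clock rate, and dynamic energy per operation scales as V^2.
    """
    voltage_scale = clock_scale
    return voltage_scale ** 2


def relative_throughput(clock_scale: float, units: int) -> float:
    """Operations per second relative to one unit at the baseline clock."""
    return units * clock_scale


def relative_power(clock_scale: float, units: int) -> float:
    """Power = (energy per op) x (ops per second), relative to baseline."""
    return relative_energy_per_op(clock_scale) * relative_throughput(clock_scale, units)


# Two units at half the clock rate: same throughput as one unit at full rate.
print(relative_throughput(0.5, 2))    # 1.0
print(relative_energy_per_op(0.5))    # 0.25
print(relative_power(0.5, 2))         # 0.25
```

Under these assumptions, doubling the number of execution units while halving the clock keeps throughput constant and cuts power to one quarter. Real designs see less benefit because voltage cannot scale indefinitely and parallel hardware adds area and leakage.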
12 Design Representation
Partitioned, Concurrent Design Representation
- How to represent partitioned, concurrent design?
- How do pieces communicate and synchronize?
- Does representation partition design in way that supports mapping to heterogeneous components?
- How well can individual units be refined to efficient hardware and/or software implementations?
13 Partitioning and Mapping Today?
- Design Representation
- Design entered via a variety of languages, formalisms, and tool environments
- Implementation
- Design manually partitioned, different languages and tools for each target
- Verification & Evaluation
- System verification and evaluation difficult, error-prone, and slow
[Figure: design languages (C Program; Verilog/VHDL RTL Code; Verilog/VHDL RTL Code; C program, Assembly) and the separate tools used for each target (DSP compiler, Assembler; Logic synthesis, FPGA tools; General-purpose compiler, OS; Logic synthesis, Physical Design)]
15 Standard Cell Tool Flow
[Figure: tool flow from RTL through Front-End and Back-End stages. Stage labels: RTL DRC; RTL/Gate Simulation; RTL Floorplanning; Equivalence Checking; Physical Synthesis; Static Timing; Wire Delay; Reliability; Placement, Route; Leakage Power; Critical parameter variation; ATPG/Test Vectors; Layout; Wire Coupling; Test Program; Signal Integrity; Extraction; LVS/DRC; Signal Integrity Analysis]
Limited design points are considered in the front end, because implementing each design point can cause 10s, 100s, or 1000s of iterations in the back-end.
- Difficult to complete even one correct implementation
- Minimal design space exploration is practical
- Designers do not exploit current fabrication capabilities
- E.g., typical ASIC in 130nm has 200MHz clock rate (commercial microprocessor: over 2GHz)
16 ARMO Approach: Transactors
- Describe computation as a network of transactors (transactional actors)
- Single formal description used for design capture, verification, power-performance simulation, and hardware/software synthesis
17 H.264 Top Level Transactor Decomposition
[Figure: H.264 decoder drawn as transactors connected by channels. Block labels: Front-end parser, Control; CAVLC; CABAC; IQ, IDCT; Intra-Pred.; Inter-Pred.; Deb. Filter, Frame Buffer]
18 Transactor Mapping Process
- Transactors can be independently designed and tested, which simplifies reuse
- Channels can be automatically generated and mapped to shared memory, FIFO, user-level message passing, etc.
- Transactor description can be naturally refined into Bluespec RTL code to support hardware microarchitecture exploration
- Arvind's talk next
19 Transactor Anatomy
- Transactor unit comprises
- Architectural state (registers, RAMs)
- Input queues and output queues connected to other units
- Transactions (guarded atomic actions on state and queues)
- Scheduler (selects next ready transaction to run)
[Figure: Transactor containing Input queues, Transactions, Scheduler, and Output queues]
- Advantages
- Handles non-deterministic inputs
- Allows concurrent operations on mutable state within unit
- Natural representation for formal verification
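The anatomy above (state, queues, guarded atomic transactions, scheduler) can be illustrated with a small software model. This is only a sketch in Python, not the ARMO notation or the Bluespec refinement: the class `Transactor`, the round-robin scheduling policy, and the four-sample accumulator example are all assumptions made for illustration. Atomicity here comes for free because each action runs to completion in a single thread.

```python
from collections import deque


class Transactor:
    """Hypothetical model: local state, queues, and guarded transactions."""

    def __init__(self):
        self.state = {}            # architectural state (registers, RAMs)
        self.inputs = {}           # name -> deque (input queues)
        self.outputs = {}          # name -> deque (output queues)
        self.transactions = []     # list of (guard, action) pairs
        self._next = 0             # round-robin pointer (assumed policy)

    def add_transaction(self, guard, action):
        self.transactions.append((guard, action))

    def step(self):
        """Scheduler: fire at most one ready transaction. Returns True if one fired."""
        n = len(self.transactions)
        for i in range(n):
            idx = (self._next + i) % n
            guard, action = self.transactions[idx]
            if guard(self):        # guard only inspects state and queues
                action(self)       # action runs to completion (atomic here)
                self._next = (idx + 1) % n
                return True
        return False


def run_accumulator_demo():
    """Sum groups of four input samples and emit each sum."""
    t = Transactor()
    t.state.update(total=0, count=0)
    t.inputs["in"] = deque(range(1, 9))   # samples 1..8
    t.outputs["out"] = deque()

    def accumulate(u):
        u.state["total"] += u.inputs["in"].popleft()
        u.state["count"] += 1

    def emit(u):
        u.outputs["out"].append(u.state["total"])
        u.state.update(total=0, count=0)

    t.add_transaction(lambda u: bool(u.inputs["in"]) and u.state["count"] < 4, accumulate)
    t.add_transaction(lambda u: u.state["count"] == 4, emit)

    while t.step():                # run until no transaction is ready
        pass
    return list(t.outputs["out"])


print(run_accumulator_demo())      # prints [10, 26]
```

The guards make each transaction's enabling condition explicit, which is what lets a scheduler pick any ready transaction without corrupting state; a hardware refinement could fire non-conflicting transactions in the same cycle.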
20 Computation Described as Transactor Network
[Figure: network of transactors, annotated "Global inter-unit communication via FIFO buffered point-to-point channels" and "Short-range local communication within unit"]
- Decompose computation into network of transactor units
- Units communicate only via buffered point-to-point channels
- All computation operates only on local state and channel end-points
- Representation encodes concurrency and locality
- Semantics amenable to mixed hardware and software implementations
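A minimal sketch of such a network, again in Python and under stated assumptions: `Channel`, `Source`, `Scale`, `Sink`, and `run_pipeline` are hypothetical names, the channel capacity of 2 is arbitrary, and the three-stage pipeline stands in for a real decomposition such as the H.264 decoder. Each unit touches only its own state and its channel end-points, so the result does not depend on the order in which units are stepped.

```python
from collections import deque


class Channel:
    """Bounded point-to-point FIFO with one producer end and one consumer end."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._buf = deque()

    def can_put(self):
        return len(self._buf) < self.capacity

    def can_get(self):
        return bool(self._buf)

    def put(self, item):
        assert self.can_put(), "channel full"
        self._buf.append(item)

    def get(self):
        assert self.can_get(), "channel empty"
        return self._buf.popleft()


class Source:
    def __init__(self, data, out):
        self.pending = deque(data)   # local state
        self.out = out

    def step(self):
        if self.pending and self.out.can_put():
            self.out.put(self.pending.popleft())
            return True
        return False


class Scale:
    def __init__(self, factor, inp, out):
        self.factor, self.inp, self.out = factor, inp, out

    def step(self):
        if self.inp.can_get() and self.out.can_put():
            self.out.put(self.factor * self.inp.get())
            return True
        return False


class Sink:
    def __init__(self, inp):
        self.inp = inp
        self.received = []           # local state

    def step(self):
        if self.inp.can_get():
            self.received.append(self.inp.get())
            return True
        return False


def run_pipeline(data, order):
    """Step units in the given order until no unit can make progress."""
    a, b = Channel(), Channel()
    units = {"source": Source(data, a), "scale": Scale(3, a, b), "sink": Sink(b)}
    progressed = True
    while progressed:
        progressed = False
        for name in order:
            progressed = units[name].step() or progressed
    return units["sink"].received


print(run_pipeline([1, 2, 3], ["source", "scale", "sink"]))   # prints [3, 6, 9]
print(run_pipeline([1, 2, 3], ["sink", "scale", "source"]))   # prints [3, 6, 9]
```

Because the units share nothing except the channels, each `Channel` could be replaced by a hardware FIFO, a shared-memory buffer, or a message-passing link without changing the units themselves.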
21 Desired System Design Flow
Transactor Description
22 Status
- Initial decomposition of large applications into transactor networks
- H.264 video decoder (see poster)
- 802.11a wireless Ethernet transmitter/receiver (see poster)
- Transactor implementations on programmable processors
- Complete H.264 decoder mapped to OMAP 2420 ARM core
- Components being mapped to OMAP DSP
- Transactor implementations in Bluespec RTL
- 802.11a mapped into Bluespec for design space exploration (Arvind's talk next)
- H.264 component mapping in progress
- Transactor-level power characterization
- Real hardware power measurement setup based on SPUD OMAP 2420 development board (see poster)
23 Plan of Work
- Continue to port more applications into transactor model
- Evaluate/enhance transactor approach
- Provides input for design tool flow
- Develop tools for high-level design-space exploration
- Develop full-system power-performance simulation of entire cellphone based on transactor model
- Models for transactors mapped to software on processor cores
- Models for inter-transactor channels mapped to on-chip network and/or shared memory buffers
- Models for external memories (DRAM, FLASH)
- Models for other I/O
- Link to VLSI tools to include custom logic models
- Support near-realtime simulation on RAMP FPGA-based emulation platform
- Longer range: New field-programmable transactor array chip architecture
- Design processor cores and on-chip network to execute transactor model directly
- Goal is to achieve energy and performance close to custom logic but with software programmability