ARMO: Synthesizable Modeling of Mobile Phones to Facilitate Architectural Studies of Performance and Energy-Efficiency

1
ARMO: Synthesizable Modeling of Mobile Phones to
Facilitate Architectural Studies of Performance
and Energy-Efficiency

Nokia-MIT Meeting
Helsinki, Finland
June 5, 2006

2
Nokia-MIT Architecture Team

MIT
Faculty: Arvind, Krste Asanovic, Anantha Chandrakasan
Students: Nirav Dave, (Alfred) Man Cheuk Ng, ChunChieh V Lin, Daniel Finchelstein, Steve Gerding, Jae Lee, Christopher Batten, Michal Karczmarek, Ken Barr
Nokia
Gopal Raghavan, Soracha Nananukul, Jamey Hicks,
John Ankcorn

3
Project Overview

Goal: Provide 10x performance improvement with same power envelope and same semiconductor technology
Approach: New design, simulation, and synthesis methodologies; new circuit techniques
  Specify components in a form (transactors) that supports flexible partitioning between software and hardware for effective design space exploration
  Run full-system simulations with components at multiple levels of abstraction for early stage performance and power analysis
  Ultra-low-power DSP technologies using sub-threshold logic (see poster)
Application Target: Next-generation cellphone running next-generation applications
  We expect many exciting new applications (including those developed in the Nokia-MIT collaboration) to require unprecedented performance from a handheld device

4
Example Design Scenario

Build an H.264 decoder for a digital video receiver that runs in 100 mW

5
Engineering Design Problem
[Figure: New App. Processor, with blocks labeled ARM, DSP, GPU, and IVA]
6
Multicore Processors
IBM/Sony Cell Processor

Advantages
  Easier to scale hardware design as complexity is contained within processors
  Develop complex applications using just software tools

A few minor problems?
  Power: up to 1000X worse than customized standard cells
  Performance: up to 100X worse
  Area: up to 100X greater
And, do we really know how to program these?

7
Another popular platform vision: Field-Programmable Gate Arrays

Programmable logic
  Dramatically reduce the cost of chip design errors, spin-a-day
  Remove the mask costs from each design

A few minor problems? (From Kuon and Rose, FPGA 2006)
  Switching power: around 12X worse than standard cells
  Performance: 3-4X worse than standard cells
  Area: 20-40X greater than standard cells
And, requires tremendous low-level design effort

8
Standard Cells versus Hand-Crafted Design

Even standard-cell methodology leaves a lot of capability on the table versus hand-crafted design (draw each transistor by hand)
  Power: 3-10X worse than hand-crafted
  Speed: 3-8X worse
  Area: 3-15X worse
BUT hand-crafted needs 10-20X design effort
  IBM/Sony Cell microprocessor had $400M development budget but hit 4GHz in 90nm technology
  Standard-cell chip costs $20M to develop and hit 400MHz in 90nm technology

9
Power-Flexibility Conflict
Source: T. Claasen (ISSCC 1999)
[Figure: power efficiency (32b GOPS/Watt, log scale from 0.001 to 1000) versus feature size (2, 1, 0.5, 0.25, 0.13, 0.07 µm). Labeled curves: Hand-Crafted Design, Standard-Cell Based Design, Coarse-Grain Reconfigurable, Field-Programmable Gate Array, Application-Specific Processor/DSP, General Purpose Processor. Group labels: Hardwired data paths, Reconfigurable Logic, Instruction Set Processors]
10
Our Vision of Future System-on-Chip (SoC): Highly structured but heterogeneous
Near full-custom quality application-specific hardware
On-chip memory banks
Structured on-chip networks, high-bandwidth low-power global communications
General-purpose processors
Application-specific processors
11
Keys to Future Efficient SoCs

Parallelism: Meet throughput goals using multiple parallel execution units to allow system to run with lower clock rate (1/2 clock rate → 1/4 energy)
Locality: Avoid global communication to reduce latency and to reduce wire switching power
Specialization: Use customized logic or new instructions to improve efficiency of application hot-spots (10-1000x improvement over general-purpose processing for certain tasks)
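
The claim that halving the clock rate can cut energy to one quarter follows from the first-order CMOS switching-energy model, where energy per operation scales with C·V². The sketch below is a minimal illustration under a loudly idealized assumption that is not stated on the slide: supply voltage can be lowered in direct proportion to clock frequency, and leakage is ignored. All function names are hypothetical.

```python
def relative_energy_per_op(clock_scale: float) -> float:
    """Energy per operation relative to the baseline design.

    Assumption (idealized): supply voltage V scales linearly with clock
    frequency f, and switching energy per operation is proportional to V**2.
    Leakage and threshold-voltage limits are ignored.
    """
    if clock_scale <= 0:
        raise ValueError("clock_scale must be positive")
    voltage_scale = clock_scale          # assumed: V proportional to f
    return voltage_scale ** 2            # E_op proportional to V^2


def relative_throughput(clock_scale: float, parallel_units: int) -> float:
    """Operations per second relative to one unit at the full clock rate."""
    return parallel_units * clock_scale


def relative_power(clock_scale: float, parallel_units: int) -> float:
    """Total power = (ops per second) x (energy per op), relative to baseline."""
    return relative_throughput(clock_scale, parallel_units) * relative_energy_per_op(clock_scale)
```

Under these assumptions, two units at half the clock rate give `relative_throughput(0.5, 2)` equal to the baseline throughput, while `relative_energy_per_op(0.5)` is one quarter of the baseline. Real designs fall short of this because voltage cannot be scaled all the way down with frequency.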

12
Design Representation
Partitioned, Concurrent Design Representation

How to represent a partitioned, concurrent design?
How do pieces communicate and synchronize?
Does the representation partition the design in a way that supports mapping to heterogeneous components?
How well can individual units be refined to efficient hardware and/or software implementations?

13
Partitioning and Mapping Today?

Design Representation
  Design entered via a variety of languages, formalisms, and tool environments
Implementation
  Design manually partitioned, different languages and tools for each target
Verification & Evaluation
  System verification and evaluation difficult, error-prone, and slow

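[Figure: separate entry languages and tool chains per target. Labels: C program and assembly with DSP compiler and assembler; Verilog/VHDL RTL code with logic synthesis and FPGA tools; C program with general-purpose compiler and OS; Verilog/VHDL RTL code with logic synthesis and physical design]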
15
Standard Cell Tool Flow
[Figure: standard-cell tool flow from RTL to layout, spanning front-end and back-end. Steps shown: RTL DRC, RTL/gate simulation, RTL floorplanning, equivalence checking, physical synthesis, static timing, wire delay, reliability, placement and route, leakage power, critical parameter variation, ATPG/test vectors, wire coupling, test program, signal integrity, extraction, LVS/DRC, signal integrity analysis]
Limited design points are considered in the front end, because implementing each design point can cause 10s, 100s, or 1000s of iterations in the back-end

Difficult to complete even one correct implementation
Minimal design space exploration is practical
Designers do not exploit current fabrication capabilities
  E.g., a typical ASIC in 130nm has a 200MHz clock rate (commercial microprocessors run at over 2GHz)

16
ARMO Approach: Transactors

Describe computation as a network of transactors
(transactional actors)
Single formal description used for design
capture, verification, power-performance
simulation, and hardware/software synthesis

17
H.264 Top Level Transactor Decomposition
[Figure: H.264 decoder as transactors connected by channels. Transactors shown: front-end parser, control, CAVLC, CABAC, IQ/IDCT, intra-prediction, inter-prediction, deblocking filter, frame buffer]
18
Transactor Mapping Process

Transactors can be independently designed and tested, which simplifies reuse
Channels can be automatically generated and mapped to shared memory, FIFO, user-level message passing, etc.
Transactor description can be naturally refined into Bluespec RTL code to support hardware microarchitecture exploration
  (Arvind's talk next)
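
The idea that one channel can be mapped to different implementations can be sketched as two classes sharing the same `put`/`get` interface. This is only an illustrative sketch, not the ARMO tooling: the class names, the `roundtrip` helper, and the use of a fixed-size ring buffer as a stand-in for a shared-memory mapping are all assumptions made for the example.

```python
from collections import deque


class FifoChannel:
    """Channel mapped to a plain FIFO queue (hypothetical name)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def can_put(self):
        return len(self.items) < self.capacity

    def can_get(self):
        return len(self.items) > 0

    def put(self, item):
        if not self.can_put():
            raise RuntimeError("channel full")
        self.items.append(item)

    def get(self):
        if not self.can_get():
            raise RuntimeError("channel empty")
        return self.items.popleft()


class RingBufferChannel:
    """Same interface, modeled as a fixed buffer with head index and count,
    standing in for a shared-memory mapping (an assumption of this sketch)."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0      # index of the oldest item
        self.count = 0     # number of items currently stored

    def can_put(self):
        return self.count < len(self.buf)

    def can_get(self):
        return self.count > 0

    def put(self, item):
        if not self.can_put():
            raise RuntimeError("channel full")
        self.buf[(self.head + self.count) % len(self.buf)] = item
        self.count += 1

    def get(self):
        if not self.can_get():
            raise RuntimeError("channel empty")
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return item


def roundtrip(channel):
    """Producer/consumer code written once against the channel interface."""
    received = []
    for value in (1, 2, 3):
        channel.put(value)
    received.append(channel.get())
    for value in (4, 5):           # forces wrap-around in a capacity-4 ring buffer
        channel.put(value)
    while channel.can_get():
        received.append(channel.get())
    return received
```

Calling `roundtrip(FifoChannel(4))` and `roundtrip(RingBufferChannel(4))` runs the same producer/consumer code over both mappings and yields the items in the same FIFO order, which is the property that lets the mapping be chosen late.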

19
Transactor Anatomy

Transactor unit comprises:
  Architectural state (registers, RAMs)
  Input queues and output queues connected to other units
  Transactions (guarded atomic actions on state and queues)
  Scheduler (selects next ready transaction to run)

[Figure: transactor containing input queues, transactions, scheduler, and output queues]

Advantages
Handles non-deterministic inputs
Allows concurrent operations on mutable state
within unit
Natural representation for formal verification
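
The four parts of a transactor listed above can be sketched in a few lines. This is a minimal software model under stated assumptions, not the project's implementation: the `Transactor` class, its method names, and the `make_pair_summer` example are hypothetical, there is a single input and a single output queue, and the scheduler simply fires the first transaction whose guard holds.

```python
from collections import deque


class Transactor:
    """Architectural state + input/output queues + guarded transactions + scheduler."""

    def __init__(self, state):
        self.state = dict(state)   # architectural state (registers, RAMs)
        self.inq = deque()         # input queue
        self.outq = deque()        # output queue
        self.transactions = []     # list of (name, guard, action)

    def transaction(self, name, guard, action):
        """Register a guarded action; the action may run only when guard(self) is true."""
        self.transactions.append((name, guard, action))

    def step(self):
        """Scheduler: fire the first ready transaction, return its name or None."""
        for name, guard, action in self.transactions:
            if guard(self):
                action(self)       # runs to completion, so it is atomic in this model
                return name
        return None

    def run(self):
        """Fire transactions until none is ready; return the firing order."""
        fired = []
        while True:
            name = self.step()
            if name is None:
                return fired
            fired.append(name)


def make_pair_summer():
    """Example unit: sums inputs two at a time and emits each pair's total."""
    unit = Transactor({"acc": 0, "count": 0})

    def accumulate(u):
        u.state["acc"] += u.inq.popleft()
        u.state["count"] += 1

    def emit(u):
        u.outq.append(u.state["acc"])
        u.state["acc"] = 0
        u.state["count"] = 0

    unit.transaction("accumulate",
                     lambda u: bool(u.inq) and u.state["count"] < 2,
                     accumulate)
    unit.transaction("emit", lambda u: u.state["count"] == 2, emit)
    return unit
```

For example, after `unit = make_pair_summer()` and `unit.inq.extend([1, 2, 3, 4, 5])`, calling `unit.run()` leaves the pair totals in `unit.outq` and the unpaired fifth input held in `unit.state`. Because each guard reads only local state and queue ends, the unit behaves the same no matter when its inputs arrive.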

20
Computation Described as Transactor Network
[Figure: network of transactors. Global inter-unit communication via FIFO-buffered point-to-point channels; short-range local communication within each unit]

Decompose computation into a network of transactor units
  Units communicate only via buffered point-to-point channels
  All computation is only on local state and channel end-points
Representation encodes concurrency and locality
Semantics amenable to mixed hardware and software implementations
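
A small software model can illustrate a transactor network in which units touch only their own state and the ends of bounded point-to-point channels. The sketch below is an assumption-laden illustration rather than the ARMO design flow: `Channel`, `Source`, `Map`, `Sink`, `run_network`, and `demo` are hypothetical names, and a simple round-robin loop stands in for whatever hardware or software scheduling a real mapping would use.

```python
from collections import deque


class Channel:
    """Bounded FIFO connecting exactly one producer unit to one consumer unit."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.fifo = deque()

    def can_put(self):
        return len(self.fifo) < self.capacity

    def can_get(self):
        return len(self.fifo) > 0


class Source:
    def __init__(self, items, out):
        self.pending = deque(items)   # local state
        self.out = out                # channel end-point

    def step(self):
        if self.pending and self.out.can_put():
            self.out.fifo.append(self.pending.popleft())
            return True
        return False


class Map:
    def __init__(self, fn, inp, out):
        self.fn, self.inp, self.out = fn, inp, out

    def step(self):
        # Guard: an input is available and the output channel has room.
        if self.inp.can_get() and self.out.can_put():
            self.out.fifo.append(self.fn(self.inp.fifo.popleft()))
            return True
        return False


class Sink:
    def __init__(self, inp):
        self.inp = inp
        self.received = []            # local state

    def step(self):
        if self.inp.can_get():
            self.received.append(self.inp.fifo.popleft())
            return True
        return False


def run_network(units):
    """Visit units round-robin until no unit can fire."""
    progress = True
    while progress:
        progress = False
        for unit in units:
            if unit.step():
                progress = True


def demo(order=(0, 1, 2)):
    """Build source -> square -> sink and run it with the given visiting order."""
    a, b = Channel(), Channel()
    units = [Source([1, 2, 3, 4], a), Map(lambda x: x * x, a, b), Sink(b)]
    run_network([units[i] for i in order])
    return units[2].received
```

Running `demo()` and `demo(order=(2, 1, 0))` delivers the same sequence to the sink, since FIFO channels preserve order regardless of which unit is visited first. That independence from scheduling order is what makes the same description a candidate for either hardware or software mapping.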

21
Desired System Design Flow
Transactor Description
22
Status

Initial decomposition of large applications into transactor networks
  H.264 video decoder (see poster)
  802.11a wireless ethernet transmitter/receiver (see poster)
Transactor implementations on programmable processors
  Complete H.264 decoder mapped to OMAP 2420 ARM core
  Components being mapped to OMAP DSP
Transactor implementations in Bluespec RTL
  802.11a mapped into Bluespec for design space exploration (Arvind's talk next)
  H.264 component mapping in progress
Transactor-level power characterization
  Real hardware power measurement setup based on SPUD OMAP 2420 development board (see poster)

23
Plan of Work

Continue to port more applications into transactor model
  Evaluate/enhance transactor approach
  Provides input for design tool flow
Develop tools for high-level design-space exploration
Develop full-system power-performance simulation of entire cellphone based on transactor model
  Models for transactors mapped to software on processor cores
  Models for inter-transactor channels mapped to on-chip network and/or shared memory buffers
  Models for external memories (DRAM, FLASH)
  Models for other I/O
  Link to VLSI tools to include custom logic models
  Support near-realtime simulation on RAMP FPGA-based emulation platform
Longer range: New field-programmable transactor array chip architecture
  Design processor cores and on-chip network to execute transactor model directly
  Goal is to achieve energy and performance close to custom logic but with software programmability