ARMO: Synthesizable Modeling of Mobile Phones to Facilitate Architectural Studies of Performance and Energy-Efficiency

1
ARMO: Synthesizable Modeling of Mobile Phones to
Facilitate Architectural Studies of Performance
and Energy-Efficiency
  • Nokia-MIT Meeting
  • Helsinki, Finland
  • June 5, 2006

2
Nokia-MIT Architecture Team
  • MIT
  • Faculty: Arvind, Krste Asanovic, Anantha
    Chandrakasan
  • Students: Nirav Dave, (Alfred) Man Cheuk Ng,
    ChunChieh V Lin, Daniel Finchelstein, Steve
    Gerding, Jae Lee, Christopher Batten, Michal
    Karczmarek, Ken Barr
  • Nokia
  • Gopal Raghavan, Soracha Nananukul, Jamey Hicks,
    John Ankcorn

3
Project Overview
  • Goal: Provide 10x performance improvement with
    same power envelope and same semiconductor
    technology
  • Approach: New design, simulation, and synthesis
    methodologies; new circuit techniques
  • Specify components in a form (transactors) that
    supports flexible partitioning between software
    and hardware for effective design space
    exploration.
  • Run full-system simulations with components at
    multiple levels of abstraction for early stage
    performance and power analysis
  • Ultra-low-power DSP technologies using
    sub-threshold logic (see poster)
  • Application Target: Next-generation cellphone
    running next-generation applications
  • We expect many exciting new applications
    (including those developed in the Nokia-MIT
    collaboration) to require unprecedented
    performance from a handheld device

4
Example Design Scenario
  • Build an H.264 decoder for a digital video
    receiver that runs in 100 mW

5
Engineering Design Problem
[Figure: New App. Processor, with blocks labeled ARM, DSP, GPU, and IVA]
6
Multicore Processors
IBM/Sony Cell Processor
  • Advantages
  • Easier to scale hardware design as complexity is
    contained within processors
  • Develop complex applications using just software
    tools
  • A few minor problems?
  • Power: up to 1000X worse than customized standard
    cells
  • Performance: up to 100X worse
  • Area: up to 100X greater
  • And, do we really know how to program these?

7
Another popular platform vision: Field-Programmable Gate Arrays
  • Programmable logic
  • Dramatically reduce the cost of chip design
    errors, spin-a-day
  • Remove the mask costs from each design
  • A few minor problems? (From Kuon & Rose,
    FPGA 2006)
  • Switching power: around 12X worse than standard
    cells
  • Performance: 3-4X worse than standard cells
  • Area: 20-40X greater than standard cells
  • And, requires tremendous low-level design effort

8
Standard Cells versus Hand-Crafted Design
  • Even standard-cell methodology leaves a lot of
    capability on the table, versus hand-crafted
    design (draw each transistor by hand)
  • Power: 3-10X worse than hand-crafted
  • Speed: 3-8X worse
  • Area: 3-15X worse
  • BUT hand-crafted needs 10-20X design effort
  • IBM/Sony Cell microprocessor had $400M
    development budget but hit 4GHz in 90nm
    technology
  • Standard-cell chip costs $20M to develop and hits
    400MHz in 90nm technology

9
Power-Flexibility Conflict
[Figure: power efficiency (32b GOPS/Watt, log scale from 0.001 to 1000) versus feature size (2 to 0.07 µm). Labeled design styles: Hand-Crafted Design; Standard-Cell Based Design; Coarse-Grain Reconfigurable; Field-Programmable Gate Array; Application-Specific Processor, DSP; General Purpose Processor. Group labels: Hardwired data paths; Reconfigurable Logic; Instruction Set Processors. Source: T. Claasen (ISSCC99)]
10
Our Vision of Future System-on-Chip (SoC): Highly
structured but heterogeneous
  • Near full-custom quality application-specific
    hardware
  • On-chip memory banks
  • Structured on-chip networks, high-bandwidth
    low-power global communications
  • General-purpose processors
  • Application-specific processors
11
Keys to Future Efficient SoCs
  • Parallelism: Meet throughput goals using multiple
    parallel execution units to allow system to run
    with lower clock rate (1/2 clock rate → 1/4 energy)
  • Locality: Avoid global communication to reduce
    latency and to reduce wire switching power
  • Specialization: Use customized logic or new
    instructions to improve efficiency of application
    hot-spots (10-1000x improvement over
    general-purpose processing for certain tasks)
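
The "1/2 clock rate, 1/4 energy" figure can be reproduced with a first-order CMOS model. The sketch below assumes dynamic switching energy per operation of C·V² and that supply voltage can be scaled linearly with clock frequency. Both are simplifying assumptions that are not stated on the slides, and the function names and the 400 MHz / 1.0 V baseline are illustrative only.

```python
def energy_per_op(capacitance, voltage):
    """First-order dynamic switching energy per operation: E = C * V^2."""
    return capacitance * voltage ** 2


def parallel_design(n_units, base_freq, base_voltage, capacitance=1.0):
    """Model n_units parallel units, each clocked at base_freq / n_units.

    Assumption (not from the slides): supply voltage scales linearly with
    clock frequency, so each unit runs at base_voltage / n_units.
    Capacitance is in arbitrary units, so only ratios are meaningful.
    """
    freq = base_freq / n_units
    voltage = base_voltage / n_units
    throughput = n_units * freq                    # operations per second
    energy = energy_per_op(capacitance, voltage)   # energy per operation
    power = throughput * energy                    # energy per second
    return throughput, energy, power


base = parallel_design(1, base_freq=400e6, base_voltage=1.0)
dual = parallel_design(2, base_freq=400e6, base_voltage=1.0)

throughput_ratio = dual[0] / base[0]
energy_ratio = dual[1] / base[1]
print(f"throughput ratio: {throughput_ratio:.2f}")
print(f"energy/op ratio:  {energy_ratio:.2f}")
```

In practice the voltage cannot be lowered indefinitely, and leakage and the area cost of the extra units are ignored here, so the model only illustrates the direction of the trade-off.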

12
Design Representation
Partitioned, Concurrent Design Representation
  • How to represent partitioned, concurrent design?
  • How do pieces communicate and synchronize?
  • Does representation partition design in way that
    supports mapping to heterogeneous components?
  • How well can individual units be refined to
    efficient hardware and/or software
    implementations?

13
Partitioning and Mapping Today?
  • Design Representation
  • Design entered via a variety of languages,
    formalisms, and tool environments
  • Implementation
  • Design manually partitioned, different languages
    and tools for each target
  • Verification and Evaluation
  • System verification and evaluation difficult,
    error-prone, and slow

[Figure: design entry per target (C program; Verilog/VHDL RTL code; Verilog/VHDL RTL code; C program and assembly) and the corresponding tools (general-purpose compiler and OS; logic synthesis and FPGA tools; logic synthesis and physical design; DSP compiler and assembler)]

15
Standard Cell Tool Flow
[Figure: standard-cell tool flow with front-end and back-end stages. Labeled items: RTL, RTL DRC, RTL/Gate Simulation, RTL Floorplanning, Equivalence Checking, Physical Synthesis, Static Timing, Wire Delay, Reliability, Placement, Route, Leakage Power, Critical parameter variation, ATPG/Test Vectors, Layout, Wire Coupling, Test Program, Signal Integrity, Extraction, LVS/DRC, Signal Integrity Analysis]
Limited design points are considered in the front end because implementing each design point can cause 10s, 100s, or 1000s of iterations in the back end.
  • Difficult to complete even one correct
    implementation
  • Minimal design space exploration is practical
  • Designers do not exploit current fabrication
    capabilities
  • E.g., a typical ASIC in 130nm has a 200MHz clock
    rate (commercial microprocessors run at over 2GHz)

16
ARMO Approach: Transactors
  • Describe computation as a network of transactors
    (transactional actors)
  • Single formal description used for design
    capture, verification, power-performance
    simulation, and hardware/software synthesis

17
H.264 Top Level Transactor Decomposition
[Figure: H.264 decoder drawn as transactors connected by channels. Labeled transactors: front-end parser, control, CABAC, CAVLC, IQ, IDCT, intra-prediction, inter-prediction, deblocking filter, frame buffer]
18
Transactor Mapping Process
  • Transactors can be independently designed and
    tested, which simplifies reuse
  • Channels can be automatically generated and
    mapped to shared memory, FIFO, user-level message
    passing, etc.
  • Transactor description can be naturally refined
    into Bluespec RTL code to support hardware
    microarchitecture exploration
  • Arvind's talk next
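
As a rough illustration of channels being remapped without touching transactor code, the sketch below writes one stage against a minimal channel interface and runs it over two interchangeable backings. The class names, the `send`/`can_recv`/`recv` interface, and the use of `queue.Queue` as a stand-in for message passing are all assumptions made for this example; the slides do not specify an API, and the project generates such channels automatically rather than by hand.

```python
from collections import deque
import queue


class FifoChannel:
    """Channel mapped to a simple in-process FIFO buffer."""

    def __init__(self):
        self._buf = deque()

    def send(self, item):
        self._buf.append(item)

    def can_recv(self):
        return len(self._buf) > 0

    def recv(self):
        return self._buf.popleft()


class MessageChannel:
    """Channel mapped to a thread-safe queue (stand-in for message passing)."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, item):
        self._q.put(item)

    def can_recv(self):
        return not self._q.empty()

    def recv(self):
        return self._q.get_nowait()


def scale_stage(inp, out, factor):
    """A tiny transactor body: drain the input channel, scale, and forward."""
    while inp.can_recv():
        out.send(inp.recv() * factor)


def run(channel_cls):
    """Run the same stage over whichever channel implementation is given."""
    inp, out = channel_cls(), channel_cls()
    for sample in [1, 2, 3]:
        inp.send(sample)
    scale_stage(inp, out, factor=10)
    results = []
    while out.can_recv():
        results.append(out.recv())
    return results


fifo_results = run(FifoChannel)
message_results = run(MessageChannel)
print(fifo_results, message_results)
```

The stage never names a concrete channel type, which is the property that lets the mapping decision be deferred.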

19
Transactor Anatomy
  • Transactor unit comprises
  • Architectural state (registers, RAMs)
  • Input queues and output queues connected to other
    units
  • Transactions (guarded atomic actions on state and
    queues)
  • Scheduler (selects next ready transaction to run)

[Figure: a transactor containing transactions and a scheduler, fed by input queues and driving output queues]
  • Advantages
  • Handles non-deterministic inputs
  • Allows concurrent operations on mutable state
    within unit
  • Natural representation for formal verification

20
Computation Described as Transactor Network
[Figure: network of transactors. Global inter-unit communication is via FIFO-buffered point-to-point channels; short-range local communication stays within a unit]
  • Decompose computation into network of transactor
    units
  • Units communicate only via buffered
    point-to-point channels
  • All computation only on local state and channel
    end-points
  • Representation encodes concurrency and locality
  • Semantics amenable to mixed hardware and software
    implementations

21
Desired System Design Flow
[Figure: desired design flow, starting from the Transactor Description]
22
Status
  • Initial decomposition of large applications into
    transactor networks
  • H.264 video decoder (see poster)
  • 802.11a wireless Ethernet transmitter/receiver
    (see poster)
  • Transactor implementations on programmable
    processors
  • Complete H.264 decoder mapped to OMAP 2420 ARM
    core
  • Components being mapped to OMAP DSP
  • Transactor implementations in Bluespec RTL
  • 802.11a mapped into Bluespec for design space
    exploration (Arvind's talk next)
  • H.264 component mapping in progress
  • Transactor-level power characterization
  • Real hardware power measurement setup based on
    SPUD OMAP 2420 development board (see poster)

23
Plan of Work
  • Continue to port more applications into
    transactor model
  • Evaluate/enhance transactor approach
  • Provides input for design tool flow
  • Develop tools for high-level design-space
    exploration
  • Develop full-system power-performance simulation
    of entire cellphone based on transactor model
  • Models for transactors mapped to software on
    processor cores
  • Models for inter-transactor channels mapped to
    on-chip network and/or shared memory buffers
  • Models for external memories (DRAM, FLASH)
  • Models for other I/O
  • Link to VLSI tools to include custom logic models
  • Support near-realtime simulation on RAMP
    FPGA-based emulation platform
  • Longer range: New field-programmable transactor
    array chip architecture
  • Design processor cores and on-chip network to
    execute transactor model directly
  • Goal is to achieve energy and performance close
    to custom logic but with software programmability