1
RAMP Design Infrastructure
  • Krste Asanovic
  • krste@mit.edu
  • MIT Computer Science and Artificial Intelligence
    Laboratory
  • http://cag.csail.mit.edu/scale
  • Embedded RAMP Workshop, BWRC
  • August 23, 2006

2
RAMP Approach
  • Detailed target-cycle-accurate emulation of
    proposed machine, NOT running applications as
    fast as possible on the underlying platform
  • But must run applications fast enough (~100MHz)
    to allow software development
  • Initially, should boot and run standard software
    (OS + applications unchanged)
  • Challenges
  • Accurate target-cycle emulation
  • Efficient use of FPGA resources
  • Providing reproducibility, debugging, monitoring
  • Managing design complexity with multiple
    contributing authors
  • Providing flexibility for rapid architectural
    exploration
  • Approach
  • Generate a distributed cycle-accurate hardware
    event simulator from transactor model

3
RAMP Design Framework Overview
With Greg Gibeling, Andrew Schultz, UCB
  • Target System: the machine being emulated
  • Describe structure as transactor netlist in RAMP
    Description Language (RDL)
  • Describe behavior of each leaf unit in favorite
    language (Verilog, VHDL, Bluespec, C/C++, Java)
  • Host Platforms: systems that run the emulation or
    simulation
    simulation
  • Can have part of target mapped to FPGA emulation
    and part mapped to software simulation

4
Units and Channels in RAMP
[Figure: Sending Unit -> Port -> Channel -> Port -> Receiving Unit]
  • Units
  • Large pieces of functionality, >10,000 gates
    (e.g., CPU + L1)
  • Leaf units implemented in a host language
    (e.g., Verilog, C)
  • Channels
  • Unidirectional
  • Point-to-point
  • FIFO semantics
  • Unknown latency and buffering (fixed when system
    instantiated)
  • Implementation generated automatically by RDL
    compiler

5
RAMP Channels Generated Automatically During
System Instantiation
  • Channel parameters for timing-accurate
    simulations given in RAMP description file
  • Bitwidth (in bits per target clock cycle)
  • Latency (in target clock cycles)
  • Buffering (in either fragments or messages)
  • Fragments (one target clock cycle's worth of
    data)
  • Smaller than messages
  • Convey the simulation time through idles
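The fragment mechanism above can be sketched in software. Below is a minimal Python model (the names and structure are my own, not RDL's) of a channel that accepts one fragment per target clock cycle, delivers it after a fixed latency, and uses explicit idle fragments so the receiver always knows the target time:

```python
from collections import deque

IDLE = object()  # explicit idle fragment: carries target time, no data

class Channel:
    """Unidirectional point-to-point FIFO with fixed target-cycle latency."""
    def __init__(self, latency, buffering):
        self.pipe = deque([IDLE] * latency)  # fragments in flight
        self.buf = deque()                   # receiver-side buffer
        self.buffering = buffering           # buffer capacity in fragments

    def target_cycle(self, fragment=IDLE):
        """Advance one target clock cycle: accept at most one fragment,
        deliver the fragment sent `latency` cycles ago."""
        self.pipe.append(fragment)
        arrived = self.pipe.popleft()
        if arrived is not IDLE:
            assert len(self.buf) < self.buffering, "receiver buffer overflow"
            self.buf.append(arrived)

    def pop(self):
        """Receiver pops the oldest delivered fragment, or None."""
        return self.buf.popleft() if self.buf else None
```

With latency 2, a fragment sent on cycle N becomes visible to the receiver only after cycle N+2; meanwhile the idles keep the two ends' notion of target time in lockstep.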

[Figure: channel annotated with Bitwidth (32b per target clock cycle), Latency, and Buffering]
6
Mapping Target Units to Host Platform
  • Inside edge, free from host implementation
    dependencies
  • Needs language-specific version of interface
    (e.g., Verilog, Bluespec, C)
  • Outside edge, implementation dependent
  • Deals with physical links
  • RDL compiler generates the wrapper and all of the
    links
  • Allows plugins to extend to new host languages or
    new link types

7
Targets Mapped Across Hardware and Software Host
Platforms
  • Cross-platform
  • Units implemented in many languages
  • Library units for I/O
  • Links implement channels
  • Links
  • Can be mapped to anything that transmits data
    (e.g., FPGA wires, high-speed serial links,
    Ethernet)

[Figure: target channels mapped onto host links: Link I (Channels E, F), Link J (Channel F), Link K (Channel E) over TCP/IP, Link L (Channel H), and a library Debug link; links attach at each unit's Outside Edge]
8
Virtualization to Improve FPGA Resource Usage
  • RAMP allows units to run at varying target-host
    clock ratios to optimize area and overall
    performance without changing cycle-accurate
    accounting
  • Example 1: Multiported register file
  • E.g., Sun Niagara has 3 read ports and 2 write
    ports to 6KB of register storage
  • If RTL mapped directly, requires 48K flip-flops
  • Slow cycle time, large area
  • If mapped into block RAMs (one read + one write
    per cycle), takes 3 host cycles and 3x2KB block
    RAMs
  • Faster cycle time (3X) and far fewer resources
  • Example 2: Large L2/L3 caches
  • Current FPGAs only have 1MB of on-chip SRAM
  • Use on-chip SRAM to build cache of active piece
    of L2/L3 cache, stall target cycle if access
    misses and fetch data from off-chip DRAM
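The register-file example can be illustrated behaviorally. The sketch below (hypothetical names; the exact host-cycle schedule is simplified) emulates a 3-read/2-write target register file on three 1R/1W block-RAM copies, spending three host cycles per target cycle while leaving the target-cycle semantics unchanged:

```python
class VirtualizedRegfile:
    """Behavioral sketch: emulate a 3-read/2-write target register file
    on three 1R/1W block-RAM copies. One target cycle costs three host
    cycles; every write is broadcast so all copies stay identical."""
    def __init__(self, nregs):
        self.copies = [[0] * nregs for _ in range(3)]  # three RAM copies

    def target_cycle(self, raddrs, writes):
        """raddrs: three read addresses; writes: up to two (addr, value)
        pairs. Reads see the state from before this target cycle."""
        assert len(raddrs) == 3 and len(writes) <= 2
        # host cycles 0..2: one read port used per host cycle,
        # each read served by a different RAM copy
        rvals = [self.copies[i][raddrs[i]] for i in range(3)]
        # writes broadcast to every copy (one write port per host cycle)
        for addr, val in writes:
            for copy in self.copies:
                copy[addr] = val
        return rvals
```

The target machine still sees one register file with 3R/2W ports per target cycle; only the host pays the 3x time multiplexing, which is exactly the area/speed trade the slide describes.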

9
Debugging and Monitoring Support
  • Channel model + target time model supports
  • Monitoring
  • All communication over channels can be examined
    and controlled
  • Single-stepping by cycle or by transaction
  • Target time can be paused or slowed down
  • Simulation steering
  • Inject messages into channels
  • Mixed-mode emulation/simulation
  • Can move some units into software simulation
  • Cross-platform communication hidden by RDL
    compiler (RDLC)
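A software sketch of this channel-level monitoring (again with hypothetical names, not the actual RDLC interface) shows how taps, pausing, and message injection fall out of routing all communication through channels:

```python
from collections import deque

class MonitoredChannel:
    """FIFO channel whose traffic can be observed, paused, and steered."""
    def __init__(self):
        self.queue = deque()
        self.taps = []          # observer callbacks, e.g. for tracing
        self.paused = False     # single-stepping: freeze target time

    def push(self, msg):
        for tap in self.taps:   # every message is visible to monitors
            tap(msg)
        self.queue.append(msg)

    def inject(self, msg):
        """Simulation steering: insert a message the sender never sent."""
        self.queue.append(msg)

    def pop(self):
        if self.paused or not self.queue:
            return None         # receiver stalls while the channel is paused
        return self.queue.popleft()
```

Because units never communicate except through such channels, instrumenting the channels instruments the whole system.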

10
Related Approaches
  • FPGA-Based Approaches
  • Quickturn, Axis, IKOS, Tharas
  • FPGA- or special-processor based gate-level
    hardware emulators
  • Slow clock rate (1MHz vs. RAMP 100MHz)
  • Limited memory capacity (few GB vs. RAMP 256GB)
  • RPM at USC in early 1990s
  • Up to only 8 processors, only memory controller
    in configurable logic
  • Other approaches
  • Software Simulators
  • Clusters (standard microprocessors)
  • PlanetLab (distributed environment)
  • Wisconsin Wind Tunnel (used CM-5 to simulate
    shared memory)
  • All suffer from some combination of
  • Slowness, inaccuracy, target inflexibility,
    scalability, unbalanced computation-communication
    ratio, ..

11
RAMP White Structure
[Figure: node structure: CPU + L1 + Coherence (ISA-independent), optional L2 Coherence, Coherence Engine handling non-target accesses, and Router to other nodes]
  • Multiple different ISAs will eventually be
    supported
  • L2 optional
  • Target router topology independent of host link
    topology
  • RAMP White uses scalable directory-based
    coherence protocol
  • Host DRAM used to support host emulation (e.g.,
    L2 cache image) and tracing, as well as target
    memory

12
RAMP for MP-SoC Emulation
Standard TI OMAP 2420 design
13
Backup
14
Computing Devices Then
  • EDSAC, University of Cambridge, UK, 1949

15
Computing Devices Now
Sensor Nets
Cameras
Games
Set-top boxes
Media Players
Laptops
Servers
Robots
Smart phones
Routers
Automobiles
Supercomputers
16
Requirements Converging and Growing
  • Traditional general-purpose computing
  • Focus on programming effort to implement large
    and extensible feature set
  • Traditional embedded computing
  • Focus on resource constraints (cost, execution
    time, power, memory size, ...) to implement a
    fixed function
  • Current and future computing platforms
  • Large and growing feature set and resource
    constraints (e.g., web browsers on cellphones,
    power consumption of server farms)
  • But also, new concerns
  • Reliability (hardware and software errors)
  • Security
  • Manageability (labor costs)

17
Uniprocessor Performance (SPECint)
3X gap from historical growth
From Hennessy and Patterson, Computer
Architecture: A Quantitative Approach, 4th
edition, 2006
=> All major manufacturers moving to multicore
architectures
  • General-purpose uniprocessors have stopped
    historic performance scaling
  • Power consumption
  • Wire delays
  • DRAM access latency
  • Diminishing returns of more instruction-level
    parallelism

18
Custom Chip Design Cost Growing
=> Fewer chips, increasingly programmable to
support wider range of applications
  • Development cost rising rapidly because of
    growing design effort
  • Logic complexity and new physical design
    challenges (wire delay, switching and leakage
    power, coupling, inductance, variability, ...)
  • New ASIC development with automated design tools:
    $10-30M (<400MHz @ 90nm)
  • Assume 10% R&D cost, 10% market share => $1-3B
    market
  • Development cost much higher for hand-crafted
    layout, e.g., IBM Cell microprocessor >$400M
    (4GHz in 90nm)

19
Convergence of Platforms
  • Only way to meet system feature set, cost, power,
    and performance requirements is by programming a
    processor array
  • Multiple parallel general-purpose processors
    (GPPs)
  • Multiple application-specific processors (ASPs)

"The Processor is the new Transistor" [Rowen]
20
New Abstraction Stack Needed
  • Challenge: Desperate need to improve the state of
    the art of parallel computing for complex
    applications
  • Opportunity: Everything is open to change
  • Programming languages
  • Operating systems
  • Instruction set architecture (ISA)
  • Microarchitectures
  • How do we work across traditional abstraction
    boundaries?

21
Stratification of Research Communities
Application
Algorithm
Software Community: Hardware cannot be changed!
Programming Language
Operating System
Instruction Set Architecture (ISA)
Hardware community: Software cannot be changed!
Microarchitecture
Gates/Register-Transfer Level (RTL)
Circuits
Devices
  • Problem is not just one of mindset
  • Software developers not interested unless
    hardware available
  • software simulations too slow, 10-100 kHz for
    detailed models of one CPU
  • software simulations not credible
  • But takes 5 years to complete prototype hardware
    system!
  • Then in a few months of software development, all
    mistakes become clear

22
RAMP Build Research MPP from FPGAs
  • As ~25 CPUs will fit in a Field Programmable Gate
    Array (FPGA), build a 1000-CPU system from ~40
    FPGAs?
  • 16 32-bit simple "soft core" RISCs at 150MHz in
    2004 (Virtex-II)
  • FPGA generations every 1.5 yrs => 2X CPUs, 1.2X
    clock rate
  • HW research community does logic design ("gate
    shareware") to create out-of-the-box MPP
  • E.g., 1000-processor, standard-ISA
    binary-compatible, 64-bit, cache-coherent
    supercomputer @ ~200 MHz/CPU in 2007
  • Multi-University Collaboration
  • RAMPants Arvind (MIT), Krste Asanovic (MIT),
    Derek Chiou (Texas), James Hoe (CMU), Christos
    Kozyrakis (Stanford), Shih-Lien Lu (Intel),
    Mark Oskin (Washington), David Patterson (UCB),
    Jan Rabaey (UCB), and John Wawrzynek (UCB)

23
RAMP Goals
  • Provide credible prototypes with sufficient
    performance to support co-development of software
    and hardware ideas
  • Turn-around new hardware ideas in minutes or
    hours
  • Support reproducible comparison of ideas across
    different groups
  • Architects distribute usable hardware designs by
    FTP, improve visibility to industry

24
RAMP-1 Hardware
  • BEE2: Berkeley Emulation Engine 2
  • By John Wawrzynek and Bob Brodersen with students
    Chen Chang and Pierre Droz
  • Completed Dec. 2004 (14x17 inch 22-layer PCB)

1.5W/computer, 5 cu. in./computer,
$100/computer
Board: 5 Virtex-II FPGAs, 18 banks DDR2-400
memory, 20 10GigE connectors
25
Transactors
  • A transactor (transactional actor) is an abstract
    unit of computation, which is easy to understand
    and verify, but which can also be automatically
    translated into high quality hardware or software
    implementations

26
Original Transactor Motivation
Application
Algorithm
Programming Language
Operating System
[Figure: Scale Vector-Thread Processor: 128 threads/core, 1M gates, 17mm2, 400MHz, 0.18um; IEEE Micro Top Picks, 2004]
Instruction Set Architecture (ISA)
Transactors/Microarchitecture (UTL)
Gates/Register-Transfer Level (RTL)
Circuits
Devices
  • Design chip at microarchitecture level rather
    than at RTL level
  • Abstract away pipeline depth and communication
    latencies
  • Separate global communication from local
    computation
  • Avoid over-specification of behavior,
    particularly local pipelining/scheduling
  • Encode best-practice in concurrency management

27
Transactor Anatomy
  • Transactor unit comprises:
  • Architectural state (registers + RAMs)
  • Input queues and output queues connected to other
    units
  • Transactions (guarded atomic actions on state and
    queues)
  • Scheduler (selects next ready transaction to run)

[Figure: transactor unit: input queues feed a scheduler and transactions that update state and push to output queues]
  • Advantages
  • Handles non-deterministic inputs
  • Allows concurrent operations on mutable state
    within unit
  • Natural representation for formal verification
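The anatomy above maps naturally onto a small software skeleton. This Python sketch (the names are illustrative, not from any RAMP tool) represents transactions as (guard, action) pairs and uses a fixed-priority scheduler:

```python
from collections import deque

class Transactor:
    """Sketch of a transactor: architectural state, input/output queues,
    and guarded atomic transactions picked by a priority scheduler."""
    def __init__(self, state=None):
        self.state = state if state is not None else {}
        self.inq = {}            # name -> deque (input queues)
        self.outq = {}           # name -> deque (output queues)
        self.transactions = []   # (guard, action), highest priority first

    def step(self):
        """Scheduler: fire the first ready transaction, atomically.
        An action may pop at most one record per input queue and push
        at most one record per output queue."""
        for guard, action in self.transactions:
            if guard(self):      # guard reads state and queue heads only
                action(self)
                return True
        return False             # nothing ready this cycle
```

A unit that doubles every incoming record, for instance, registers one transaction whose guard checks that its input queue is non-empty and whose action does the pop, compute, and push as one atomic step.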

28
Transactor Networks
[Figure: network of transactor units; global inter-unit communication via FIFO-buffered point-to-point channels, short-range local communication within each unit]
  • Decompose system into network of transactor units
  • Decoupling global communication and local
    computation
  • Only communication between units is via buffered
    point-to-point channels
  • All computation only on local state and channel
    end-points

29
Message Queues or Channels
  • Queues decouple units' execution and require
    units to use latency-insensitive protocols
    [Carloni et al., ICCAD '99]
  • Queues are point-to-point channels only
  • No fanout; a unit must replicate messages on
    multiple queues
  • No buses in a transactor design (though
    implementation may use them)
  • Transactions can only pop head of input queues
    and push at most one element onto each output
    queue
  • Avoids exposing size of buffers in queues
  • Also avoids synchronization inherent in waiting
    for multiple elements

30
Transactions
  • Transaction is a guarded atomic action on local
    state and input and output queues
  • Guard is a predicate that specifies when
    transaction can execute
  • Predicate is over architectural state and heads
    of input queues
  • Implicit conditions on input queues (data
    available) and output queues (space available)
    that transaction accesses
  • Transaction can only pop up to one record from an
    input queue and push up to one record on each
    output queue
transaction route(input int32 in,
                  output int32 out0,
                  output int32 out1)
  when (routable(in)) {
    if (route_func(in) == 0)
      out0 = in;
    else
      out1 = in;
  }

transaction route_kill(input int32 in)
  when (!routable(in)) {
    bad_packets++;
  }

31
Scheduler
  • Scheduling function decides on transaction
    priority based on local state and state of input
    queues
  • Simplest scheduler picks among ready transactions
    in a fixed priority order
  • Transactions may have additional predicates which
    indicate when they can fire
  • E.g., implicit condition on all necessary output
    queues being ready

unit route_stage(input  int32 in0,   // First input channel.
                 input  int32 in1,   // Second input channel.
                 output int32 out0,  // First output channel.
                 output int32 out1)  // Second output channel.
{
  int32 bad_packets;
  int1  last;        // Fair scheduler state.

  schedule reset {
    bad_packets = 0; last = 0;
  }
  route_kill(in0);
  route_kill(in1);
  schedule round_robin(last) {
    (0) route(in0, out0, out1);
    (1) route(in1, out0, out1);
  }
}
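For reference, here is a minimal executable rendering of the route_stage scheduling in Python (a behavioral sketch; routable and route_func are stand-ins for the application-defined functions): route_kill has priority, and the two inputs are then served round-robin using the fairness bit:

```python
from collections import deque

def routable(pkt):      # stand-in: the real predicate is app-defined
    return pkt >= 0

def route_func(pkt):    # stand-in routing function: even values -> out0
    return pkt % 2

def route_stage_cycle(state, in0, in1, out0, out1):
    """One scheduling step: route_kill transactions fire first, then the
    two inputs are served round-robin via state['last'] (fairness bit)."""
    for q in (in0, in1):                  # route_kill(in0), route_kill(in1)
        if q and not routable(q[0]):
            q.popleft()
            state["bad_packets"] += 1
            return
    # round_robin(last): prefer the input NOT served last time
    order = (in0, in1) if state["last"] == 1 else (in1, in0)
    for q in order:
        if q:                             # implicit guard: data available
            pkt = q.popleft()
            (out0 if route_func(pkt) == 0 else out1).append(pkt)
            state["last"] = 0 if q is in0 else 1
            return
```

Each call fires at most one transaction, mirroring the scheduler's pick of a single ready guarded action per step.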
32
Raise Abstraction Level for Communication
RTL Model: Cycles and Wires
  • Designer allocates signals to wires and
    orchestrates cycle-by-cycle communication across
    chip
  • Global and local wires specified identically
Transactors: Messages and Queues
  • All global communication uses latency-insensitive
    messages on buffered point-point channels
  • Global wires separated from local intra-unit
    wires

[Figure: messages A1, A2, B1, B2 crossing the chip over pipelined channels, one per cycle]

Problems in RTL Implementation vs. Transactor Communications:
  • Long signal paths may need more pipelining to hit
    frequency goal, requiring manual RTL changes
    vs. latency-insensitive model allows automatic
    insertion of pipeline registers to meet frequency
    goals
  • Repeaters used to reduce latency burn leakage
    power
    vs. can trade increased end-to-end latency for
    reduced repeater power
  • Neighbor wire coupling may reduce speed or inject
    errors, requiring manual rework
    vs. optimized signaling on known long wires
    (e.g., dual-data rate for high throughput,
    low-swing for low power, shields to cut noise)
  • Dedicated wires for each signal cause wiring
    congestion and waste repeater power because many
    wires are mostly idle
    vs. multiplexed channels reduce congestion and
    save repeater power (can use on-chip network)
  • Error detection and correction circuitry cannot
    be added automatically, requiring manual RTL
    redesign
    vs. error correction/retry inserted automatically
    to cover communication soft errors
33
Raise Abstraction Level for Computation
RTL Model: Manual Concurrency Management
  • Designer has to divide application operations
    into pieces that fit within a clock cycle, then
    develop control logic to manage concurrent
    execution of many overlapping operations
  • Single application operation manually divided
    across multiple pipeline stages, then interleaved
    with other operations:
    If (condA1) Astage1 else if (condB1) Bstage1
    If (condA2) Astage2 else if (condB2) Bstage2
  • Dependencies between concurrently executing
    operations managed manually
  • Input and output communication rates and flow
    control protocol manually built into code
Transactor Model: Synthesis from Guarded Atomic Actions
  • Designer describes each atomic transaction in
    isolation, together with priority for scheduling
    transactions:
    Transaction A: If (condA) ...
    Transaction B: If (condB) ...
    Schedule: A > B
  • Tools synthesize pipelined transactor
    implementation including all control logic to
    manage dependencies between operations and flow
    control of communications
  • No pipeline registers or other internal
    bookkeeping state is exposed in specification
  • Communication flow control automatically
    generated from transactions' use of input and
    output queues
34
Design Template for Transactor
[Figure: pipelined transactor template: scheduler feeding pipeline stages that read Arch. State 1 and write Arch. State 2]
  • Scheduler only fires transaction when it can
    complete without stalls
  • Avoids driving heavily loaded stall signals
  • Architectural state (and outputs) only written in
    one stage of pipeline, use bypass/interlocks to
    read in earlier stages
  • Simplifies hazard detection/prevention
  • Have different transaction types access expensive
    units (RAM read ports, shifters, multiply units)
    in same pipeline stage to reduce area

35
Transactor VLSI Design Flow
[Figure: design flow starting from Specification]
36
System Design Flow
[Figure: design flow starting from Transactor Code]
37
Related Models
  • CSP/Occam
  • Rendezvous communications expose system latencies
    in design
  • No mutable shared state within a unit
  • Kahn Process Networks (and simpler SDF models)
  • Do not support non-deterministic inputs
  • Sequential execution within unit
  • Latency-Insensitive Design [Carloni et al.]
  • Channels are similar to transactor channels
  • Units described as stallable RTL
  • TRS/Bluespec [Arvind & Hoe]
  • Uses guarded atomic actions at RTL level (single
    cycle transactions)
  • Microarchitectural state is explicit
  • No unit-level discipline enforced

38
RAMP Implementation Plans
Name            Goal                Target   CPUs                                       Details
Red (Stanford)  Get Started         1H06     8 PowerPC 32b hard cores                   Transactional memory SMP
Blue (Cal)      Scale               2H06     ~1000 32b soft (MicroBlaze)                Cluster, MPI
White (All)     Full Features       1H07?    128? soft 64b, multiple commercial ISAs    CC-NUMA, shared address, deterministic, debug/monitor
2.0             3rd party sells it  2H07?    4X CPUs of '04 FPGA                        New '06 FPGA, new board
39
Summary
  • All computing systems will use many concurrent
    processors (1,000s of processors/chip)
  • Unlike previously, this is not just a prediction;
    it is already happening
  • We desperately need a new stack of system
    abstractions to manage complexity of concurrent
    system design
  • RAMP project building an emulator "watering hole"
    to bring everyone together to help make rapid
    progress
  • architects, OS, programming language, compilers,
    algorithms, application developers, ...
  • Transactors provide a unifying model for
    describing complex concurrent hardware and
    software systems
  • Complex digital applications
  • The RAMP target hardware itself