Title: RAMP Design Infrastructure
1. RAMP Design Infrastructure
- Krste Asanovic
- krste@mit.edu
- MIT Computer Science and Artificial Intelligence Laboratory - http://cag.csail.mit.edu/scale
- Embedded RAMP Workshop, BWRC
- August 23, 2006
2. RAMP Approach
- Detailed target-cycle accurate emulation of proposed machine, NOT run applications as fast as possible on underlying platform
- But must run applications fast enough (100MHz) to allow software development
- Initially, should boot and run standard software (OS + applications unchanged)
- Challenges
- Accurate target-cycle emulation
- Efficient use of FPGA resources
- Providing reproducibility, debugging, monitoring
- Managing design complexity with multiple contributing authors
- Providing flexibility for rapid architectural exploration
- Approach
- Generate a distributed cycle-accurate hardware event simulator from transactor model
3. RAMP Design Framework Overview
With Greg Gibeling, Andrew Schultz, UCB
- Target System: the machine being emulated
- Describe structure as transactor netlist in RAMP Description Language (RDL)
- Describe behavior of each leaf unit in favorite language (Verilog, VHDL, Bluespec, C/C++, Java)
- Host Platforms: systems that run the emulation or simulation
- Can have part of target mapped to FPGA emulation and part mapped to software simulation
4. Units and Channels in RAMP
[Figure: Sending Unit → Port → Channel → Port → Receiving Unit]
- Units
- Large pieces of functionality, >10,000 gates (e.g., CPU + L1)
- Leaf units implemented in a host language (e.g., Verilog, C)
- Channels
- Unidirectional
- Point-to-point
- FIFO semantics
- Unknown latency and buffering (fixed when system instantiated)
- Implementation generated automatically by RDL compiler
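The channel discipline above (unidirectional, point-to-point, FIFO semantics, latency and buffering fixed when the system is instantiated) can be sketched in software. This is a hypothetical Python model for illustration only, not the RDL-generated implementation:

```python
from collections import deque

class Channel:
    """Sketch of a unidirectional point-to-point FIFO channel.

    Latency and buffering are fixed at instantiation, mirroring the
    RDL-generated channels described above (class name is invented).
    """
    def __init__(self, latency, buffering):
        self.latency = latency        # delivery delay, in target cycles
        self.buffering = buffering    # maximum in-flight messages
        self.queue = deque()          # (deliverable_cycle, message) pairs
        self.now = 0                  # current target clock cycle

    def push(self, msg):
        if len(self.queue) >= self.buffering:
            raise RuntimeError("channel full: sender must stall")
        self.queue.append((self.now + self.latency, msg))

    def pop(self):
        # FIFO: only the head is visible, and only once latency elapses.
        if self.queue and self.queue[0][0] <= self.now:
            return self.queue.popleft()[1]
        return None                   # nothing deliverable this cycle

    def tick(self):
        self.now += 1                 # advance target time one cycle

c = Channel(latency=2, buffering=4)
c.push("hello")
c.tick(); assert c.pop() is None      # latency not yet elapsed
c.tick(); print(c.pop())              # → hello
```

The bounded `buffering` check is what lets the compiler-generated flow control back-pressure a sender without exposing the buffer size to the unit's logic.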
5. RAMP Channels Generated Automatically During System Instantiation
- Channel parameters for timing-accurate simulations given in RAMP description file
- Bitwidth (in bits per target clock cycle)
- Latency (in target clock cycles)
- Buffering (in either fragments or messages)
- Fragments (one target clock cycle's worth of data)
- Smaller than messages
- Convey the simulation time through idles
[Figure: a channel annotated with its bitwidth (32b), latency, and buffering]
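A rough Python sketch of the fragment accounting described above (the function names and the `IDLE` marker are invented for illustration): a message wider than the channel bitwidth is split into per-cycle fragments, and idle fragments carry no data but still advance the receiver's notion of target time:

```python
# Hypothetical sketch: fragments are one target clock cycle's worth of
# data, sized by the channel bitwidth; idles convey simulation time.

def fragment(message_bits, bitwidth=32):
    """Split a message (a bit-string) into per-cycle fragments."""
    return [message_bits[i:i + bitwidth]
            for i in range(0, len(message_bits), bitwidth)]

def cycles_to_send(message_bits, bitwidth=32):
    # One fragment per target clock cycle.
    return len(fragment(message_bits, bitwidth))

IDLE = None  # an idle fragment: no data, but one target cycle of time

def receive(fragments):
    """Advance target time one cycle per fragment; keep data fragments."""
    t, data = 0, []
    for frag in fragments:
        t += 1                    # every fragment conveys one target cycle
        if frag is not IDLE:
            data.append(frag)
    return t, data

msg = "1" * 96                    # a 96-bit message
print(cycles_to_send(msg))        # → 3 target cycles on a 32-bit channel
print(receive(["a", IDLE, IDLE, "b"]))   # → (4, ['a', 'b'])
```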
6. Mapping Target Units to Host Platform
- Inside edge: free from host implementation dependencies
- Needs language-specific version of interface (e.g., Verilog, Bluespec, C)
- Outside edge: implementation dependent
- Deals with physical links
- RDL compiler generates the wrapper and all of the links
- Allows plugins to extend to new host languages or new link types
7. Targets Mapped Across Hardware and Software Host Platforms
- Cross-platform
- Units implemented in many languages
- Library units for I/O
- Links implement channels
- Links
- Can be mapped to anything that transmits data (e.g., FPGA wires, high-speed serial links, Ethernet)
[Figure: units on several host platforms joined at their outside edges by links I, J (debug), K, and L, carrying channels E, F, and H over media including TCP/IP, with library units]
8. Virtualization to Improve FPGA Resource Usage
- RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance without changing cycle-accurate accounting
- Example 1: Multiported register file
- Example: Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
- If RTL mapped directly, requires 48K flip-flops
- Slow cycle time, large area
- If mapping into block RAMs (one read/one write per cycle), takes 3 host cycles and 3x2KB block RAMs
- Faster cycle time (3X) and far less resources
- Example 2: Large L2/L3 caches
- Current FPGAs only have 1MB of on-chip SRAM
- Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
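The register-file example can be illustrated with a toy model (a hypothetical class, not the actual FPGA mapping): three reads and two writes are serviced over three host cycles on a RAM with one read and one write port per host cycle, while all reads still observe the state from the start of the target cycle:

```python
class VirtualizedRegfile:
    """Sketch: a 3-read / 2-write target register file emulated on a
    block RAM with one read port per host cycle, so one target cycle
    costs three host cycles (per the slide's arithmetic)."""
    def __init__(self, nregs):
        self.ram = [0] * nregs
        self.host_cycles = 0

    def target_cycle(self, reads, writes):
        """reads: up to 3 addresses; writes: up to 2 (addr, value)."""
        assert len(reads) <= 3 and len(writes) <= 2
        values = []
        for i in range(3):                    # three host cycles
            if i < len(reads):
                values.append(self.ram[reads[i]])   # one read per cycle
            self.host_cycles += 1
        for addr, val in writes:
            # Applied after the reads so every read sees the state from
            # the start of the target cycle (a real mapping would drive
            # the BRAM write port in parallel with the reads).
            self.ram[addr] = val
        return values

rf = VirtualizedRegfile(64)
print(rf.target_cycle(reads=[0, 1, 2], writes=[(1, 42), (2, 7)]))  # → [0, 0, 0]
print(rf.target_cycle(reads=[1, 2], writes=[]))                    # → [42, 7]
print(rf.host_cycles)           # → 6 host cycles for 2 target cycles
```

The point is the accounting: the unit runs at a 3:1 host-to-target clock ratio, yet the cycle-accurate target behavior is unchanged.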
9. Debugging and Monitoring Support
- Channel model + target time model supports
- Monitoring
- All communication over channels can be examined and controlled
- Single-stepping by cycle or by transaction
- Target time can be paused or slowed down
- Simulation steering
- Inject messages into channels
- Mixed-mode emulation/simulation
- Can move some units into software simulation
- Cross-platform communication hidden by RDL compiler (RDLC)
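Because all inter-unit traffic flows over channels, monitoring and steering reduce to wrapping the channel itself. A minimal illustrative sketch (invented names, not the RDLC interface): every send and receive is logged, and the debugger can inject messages without touching any unit:

```python
class MonitoredChannel:
    """Sketch: observe, log, and inject messages on a channel."""
    def __init__(self):
        self.queue = []
        self.log = []       # complete trace of traffic on this channel

    def push(self, msg, injected=False):
        self.log.append(("inject" if injected else "send", msg))
        self.queue.append(msg)

    def pop(self):
        if not self.queue:
            return None
        msg = self.queue.pop(0)
        self.log.append(("recv", msg))
        return msg

ch = MonitoredChannel()
ch.push("pkt0")
ch.push("debug-probe", injected=True)   # simulation steering
ch.pop()
print(ch.log)   # → [('send', 'pkt0'), ('inject', 'debug-probe'), ('recv', 'pkt0')]
```

Pausing target time is the degenerate case: simply stop delivering fragments, and every unit stalls on its implicit queue conditions.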
10. Related Approaches
- FPGA-Based Approaches
- Quickturn, Axis, IKOS, Thara
- FPGA- or special-processor based gate-level hardware emulators
- Slow clock rate (1MHz vs. RAMP 100MHz)
- Limited memory capacity (few GB vs. RAMP 256GB)
- RPM at USC in early 1990s
- Up to only 8 processors, only memory controller in configurable logic
- Other approaches
- Software Simulators
- Clusters (standard microprocessors)
- PlanetLab (distributed environment)
- Wisconsin Wind Tunnel (used CM-5 to simulate shared memory)
- All suffer from some combination of
- Slowness, inaccuracy, target inflexibility, scalability, unbalanced computation-communication ratio, ...
11. RAMP White Structure
- Multiple different ISAs will eventually be supported
- Target router topology independent of host link topology
- RAMP White uses scalable directory-based coherence protocol
- Host DRAM used to support host emulation (e.g., L2 cache image) and tracing, as well as target memory
[Figure: a node with CPU + L1 coherence units (ISA independent), an L2 coherence engine handling non-target accesses, and a router to other nodes]
12. RAMP for MP-SoC Emulation
Standard TI OMAP 2420 design
13. Backup
14. Computing Devices Then
- EDSAC, University of Cambridge, UK, 1949
15. Computing Devices Now
Sensor Nets
Cameras
Games
Set-top boxes
Media Players
Laptops
Servers
Robots
Smart phones
Routers
Automobiles
Supercomputers
16. Requirements Converging and Growing
- Traditional general-purpose computing
- Focus on programming effort to implement large and extensible feature set
- Traditional embedded computing
- Focus on resource constraints (cost, execution time, power, memory size, ...) to implement a fixed function
- Current and future computing platforms
- Large and growing feature set and resource constraints (e.g., web browsers on cellphones, power consumption of server farms)
- But also, new concerns
- Reliability (hardware and software errors)
- Security
- Manageability (labor costs)
17. Uniprocessor Performance (SPECint)
[Figure: 3X gap from historical growth; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
⇒ All major manufacturers moving to multicore architectures
- General-purpose uniprocessors have stopped historic performance scaling
- Power consumption
- Wire delays
- DRAM access latency
- Diminishing returns of more instruction-level parallelism
18. Custom Chip Design Cost Growing
⇒ Fewer chips, increasingly programmable to support wider range of applications
- Development cost rising rapidly because of growing design effort
- Logic complexity and new physical design challenges (wire delay, switching and leakage power, coupling, inductance, variability, ...)
- New ASIC development with automated design tools: $10-30M (<400MHz @ 90nm)
- Assume 10% R&D cost, 10% market share ⇒ $1-3B market
- Development cost much higher for hand-crafted layout, e.g., IBM Cell microprocessor >$400M (4GHz in 90nm)
19. Convergence of Platforms
- Only way to meet system feature set, cost, power, and performance requirements is by programming a processor array
- Multiple parallel general-purpose processors (GPPs)
- Multiple application-specific processors (ASPs)
"The Processor is the new Transistor" [Rowen]
20. New Abstraction Stack Needed
- Challenge: Desperate need to improve the state of the art of parallel computing for complex applications
- Opportunity: Everything is open to change
- Programming languages
- Operating systems
- Instruction set architecture (ISA)
- Microarchitectures
- How do we work across traditional abstraction boundaries?
21. Stratification of Research Communities
Application
Algorithm
Programming Language
Operating System
Instruction Set Architecture (ISA)
Microarchitecture
Gates/Register-Transfer Level (RTL)
Circuits
Devices
- Software community: "Hardware cannot be changed!"
- Hardware community: "Software cannot be changed!"
- Problem is not just one of mindset
- Software developers not interested unless hardware available
- Software simulations too slow, 10-100 kHz for detailed models of one CPU
- Software simulations not credible
- But takes 5 years to complete prototype hardware system!
- Then in a few months of software development, all mistakes become clear
22. RAMP: Build Research MPP from FPGAs
- As ≈25 CPUs will fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ≈40 FPGAs?
- 16 32-bit simple "soft core" RISC at 150MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs ⇒ 2X CPUs, ≈1.2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box MPP
- E.g., 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ ≈200 MHz/CPU in 2007
- Multi-University Collaboration
- RAMPants: Arvind (MIT), Krste Asanovic (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (UCB), Jan Rabaey (UCB), and John Wawrzynek (UCB)
23. RAMP Goals
- Provide credible prototypes with sufficient performance to support co-development of software and hardware ideas
- Turn around new hardware ideas in minutes or hours
- Support reproducible comparison of ideas across different groups
- Architects distribute usable hardware designs by FTP, improve visibility to industry
24. RAMP-1 Hardware
- BEE2: Berkeley Emulation Engine 2
- By John Wawrzynek and Bob Brodersen with students Chen Chang and Pierre Droz
- Completed Dec. 2004 (14x17 inch 22-layer PCB): 1.5W/computer, 5 cu. in./computer, $100/computer
- Board: 5 Virtex II FPGAs, 18 banks DDR2-400 memory, 20 10GigE conn.
25. Transactors
- A transactor (transactional actor) is an abstract unit of computation, which is easy to understand and verify, but which can also be automatically translated into high-quality hardware or software implementations
26. Original Transactor Motivation
[Figure: abstraction stack from Application down to Devices, with Transactors/Microarchitecture (UTL) between ISA and Gates/RTL; example: Scale Vector-Thread Processor, 128 threads/core, 1M gates, 17mm2, 400MHz, 0.18um (IEEE Micro, Top Picks, 2004)]
- Design chip at microarchitecture level rather than at RTL level
- Abstract away pipeline depth and communication latencies
- Separate global communication from local computation
- Avoid over-specification of behavior, particularly local pipelining/scheduling
- Encode best practice in concurrency management
27. Transactor Anatomy
- Transactor unit comprises
- Architectural state (registers + RAMs)
- Input queues and output queues connected to other units
- Transactions (guarded atomic actions on state and queues)
- Scheduler (selects next ready transaction to run)
[Figure: transactor with input queues feeding transactions under a scheduler, writing output queues]
- Advantages
- Handles non-deterministic inputs
- Allows concurrent operations on mutable state within unit
- Natural representation for formal verification
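The anatomy above maps naturally onto code. A minimal Python sketch (hypothetical, not a RAMP artifact) with architectural state, input/output queues, one guarded atomic action, and a fixed-priority scheduler that fires the first ready transaction:

```python
from collections import deque

class Transactor:
    """Sketch of the anatomy above: architectural state, input/output
    queues, guarded atomic transactions, and a priority scheduler."""
    def __init__(self):
        self.count = 0                    # architectural state
        self.inq = deque()                # input queue
        self.outq = deque()               # output queue
        # Transactions as (guard, action) pairs, in priority order.
        self.transactions = [
            (lambda: bool(self.inq), self.forward),
        ]

    def forward(self):
        # Atomic action: pop one input, push one output, update state.
        self.outq.append(self.inq.popleft())
        self.count += 1

    def step(self):
        """Scheduler: fire the first transaction whose guard holds."""
        for guard, action in self.transactions:
            if guard():
                action()
                return True
        return False                      # no transaction ready

t = Transactor()
t.inq.extend([1, 2])
while t.step():
    pass
print(list(t.outq), t.count)             # → [1, 2] 2
```

The guard on the input queue is the "implicit condition" the slides mention: a transaction is simply not ready until its data is available.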
28. Transactor Networks
- Decompose system into network of transactor units
- Decoupling global communication and local computation
- Only communication between units via buffered point-point channels
- All computation only on local state and channel end-points
[Figure: transactors connected by FIFO-buffered point-point channels for global inter-unit communication; short-range local communication stays within each unit]
29. Message Queues or Channels
- Queues decouple units' execution and require units to use latency-insensitive protocols [Carloni et al., CAV'99]
- Queues are point-to-point channels only
- No fanout; a unit must replicate messages on multiple queues
- No buses in a transactor design (though implementation may use them)
- Transactions can only pop head of input queues and push at most one element onto each output queue
- Avoids exposing size of buffers in queues
- Also avoids synchronization inherent in waiting for multiple elements
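Since channels have no fanout, replication is explicit in the sending unit. A tiny sketch of that discipline (the `broadcast` helper is invented; the slides define no such primitive):

```python
from collections import deque

# Channels are point-to-point, so a transactor that must broadcast
# replicates the message onto each outgoing queue itself: at most one
# push per output queue per transaction, as the rules above require.
def broadcast(msg, out_queues):
    for q in out_queues:
        q.append(msg)            # one push on each output queue

a, b = deque(), deque()
broadcast("flit", [a, b])
print(list(a), list(b))          # → ['flit'] ['flit']
```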
30. Transactions
- Transaction is a guarded atomic action on local state and input and output queues
- Guard is a predicate that specifies when transaction can execute
- Predicate is over architectural state and heads of input queues
- Implicit conditions on input queues (data available) and output queues (space available) that transaction accesses
- Transaction can only pop up to one record from an input queue and push up to one record on each output queue
transaction route(input int32 in,
                  output int32 out0,
                  output int32 out1)
{
  when (routable(in))
    if (route_func(in) == 0)
      out0 = in;
    else
      out1 = in;
}

transaction route_kill(input int32 in)
{
  when (!routable(in))
    bad_packets++;
}
31. Scheduler
- Scheduling function decides on transaction priority based on local state and state of input queues
- Simplest scheduler picks among ready transactions in a fixed priority order
- Transactions may have additional predicates which indicate when they can fire
- E.g., implicit condition on all necessary output queues being ready
unit route_stage(input int32 in0,   // First input channel.
                 input int32 in1,   // Second input channel.
                 output int32 out0, // First output channel.
                 output int32 out1) // Second output channel.
{
  int32 bad_packets;
  int1 last;                        // Fair scheduler state.
  schedule reset {
    bad_packets = 0; last = 0;
  }
  route_kill(in0);
  route_kill(in1);
  schedule round_robin(last) {
    (0) route(in0, out0, out1);
    (1) route(in1, out0, out1);
  }
}
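The `round_robin(last)` schedule in the listing above can be sketched as a plain function (this is an assumed interpretation of its fairness rule: serve the next ready transaction after the one served last, wrapping around):

```python
def round_robin_schedule(last, ready):
    """Fair scheduler sketch: `last` is the index served most recently;
    pick the next ready index after it, wrapping around; None if no
    transaction is ready this cycle."""
    n = len(ready)
    for off in range(1, n + 1):
        i = (last + off) % n
        if ready[i]:
            return i
    return None

# Two input channels, both always ready: the scheduler alternates.
last, picks = 0, []
for _ in range(4):
    i = round_robin_schedule(last, [True, True])
    picks.append(i)
    last = i
print(picks)            # → [1, 0, 1, 0]
```

Fairness matters here because a fixed-priority scheduler would starve `in1` whenever `in0` stays ready.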
32. Raise Abstraction Level for Communication
- RTL Model: Cycles and Wires
- Designer allocates signals to wires and orchestrates cycle-by-cycle communication across chip
- Global and local wires specified identically
- Transactors: Messages and Queues
- All global communication uses latency-insensitive messages on buffered point-point channels
- Global wires separated from local intra-unit wires

Problems in RTL Implementation → Transactor Communications
- Long signal paths may need more pipelining to hit frequency goal, requiring manual RTL changes → Latency-insensitive model allows automatic insertion of pipeline registers to meet frequency goals
- Repeaters used to reduce latency burn leakage power → Can trade increased end-to-end latency for reduced repeater power
- Neighbor wire coupling may reduce speed or inject errors, requiring manual rework → Use optimized signaling on known long wires, e.g., dual-data rate for high throughput, low-swing for low power, shields to cut noise
- Dedicated wires for each signal cause wiring congestion and waste repeater power because many wires are mostly idle → Multiplexed channels reduce congestion and save repeater power; can use on-chip network
- Error detection and correction circuitry cannot be added automatically, requiring manual RTL redesign → Can automatically insert error correction/retry to cover communication soft errors
33. Raise Abstraction Level for Computation
- RTL Model: Manual Concurrency Management
- Designer has to divide application operations into pieces that fit within a clock cycle, then develop control logic to manage concurrent execution of many overlapping operations
- Transactor Model: Synthesis from Guarded Atomic Actions
- Designer describes each atomic transaction in isolation, together with priority for scheduling transactions
- Tools synthesize pipelined transactor implementation including all control logic to manage dependencies between operations and flow control of communications

RTL Model → Transactor Model
- Single application operation manually divided across multiple pipeline stages, then interleaved with other operations per clock (e.g., "if (condA1) Astage1 else if (condB1) Bstage1" in stage 1, "if (condA2) Astage2 else if (condB2) Bstage2" in stage 2) → Each application operation described as an independent transaction ("Transaction A: if (condA) ...", "Transaction B: if (condB) ..."); schedule gives desired priority for multiple enabled transactions ("Schedule: A > B")
- Dependencies between concurrently executing operations managed manually → No pipeline registers or other internal bookkeeping state exposed in specification
- Input and output communication rates and flow control protocol manually built into code → Communication flow control automatically generated from transactions' use of input and output queues
34. Design Template for Transactor
[Figure: pipelined transactor with a scheduler feeding stages that read and write architectural state]
- Scheduler only fires transaction when it can complete without stalls
- Avoids driving heavily loaded stall signals
- Architectural state (and outputs) only written in one stage of pipeline; use bypass/interlocks to read in earlier stages
- Simplifies hazard detection/prevention
- Have different transaction types access expensive units (RAM read ports, shifters, multiply units) in same pipeline stage to reduce area
35. Transactor VLSI Design Flow
[Figure: VLSI design flow starting from a specification]
36. System Design Flow
[Figure: system design flow starting from transactor code]
37. Related Models
- CSP/Occam
- Rendezvous communications expose system latencies in design
- No mutable shared state within a unit
- Kahn Process Networks (and simpler SDF models)
- Do not support non-deterministic inputs
- Sequential execution within unit
- Latency-Insensitive Design [Carloni et al.]
- Channels are similar to transactor channels
- Units described as stallable RTL
- TRS/Bluespec [Arvind & Hoe]
- Uses guarded atomic actions at RTL level (single-cycle transactions)
- Microarchitectural state is explicit
- No unit-level discipline enforced
38. RAMP Implementation Plans

Name           | Goal               | Target | CPUs                                     | Details
Red (Stanford) | Get Started        | 1H06   | 8 PowerPC 32b hard cores                 | Transactional memory SMP
Blue (Cal)     | Scale              | 2H06   | ≈1000 32b soft (Microblaze)              | Cluster, MPI
White (All)    | Full Features      | 1H07?  | 128? soft 64b, multiple commercial ISAs  | CC-NUMA, shared address, deterministic, debug/monitor
2.0            | 3rd party sells it | 2H07?  | 4X CPUs of '04 FPGA                      | New '06 FPGA, new board
39. Summary
- All computing systems will use many concurrent processors (1,000s of processors/chip)
- Unlike previously, this is not just a prediction; it is already happening
- We desperately need a new stack of system abstractions to manage complexity of concurrent system design
- RAMP project is building an emulator "watering hole" to bring everyone together to help make rapid progress
- Architects, OS, programming language, compilers, algorithms, application developers, ...
- Transactors provide a unifying model for describing complex concurrent hardware and software systems
- Complex digital applications
- The RAMP target hardware itself