Title: RAMP Design Infrastructure
1. RAMP Design Infrastructure
- Krste Asanovic
- krste@mit.edu
- MIT Computer Science and Artificial Intelligence Laboratory - http://cag.csail.mit.edu/scale
- Embedded RAMP Workshop, BWRC
- August 23, 2006
2. RAMP Approach
- Detailed target-cycle accurate emulation of proposed machine, NOT run applications as fast as possible on underlying platform
- But must run applications fast enough (100MHz) to allow software development
- Initially, should boot and run standard software (OS + applications unchanged)
- Challenges
- Accurate target-cycle emulation
- Efficient use of FPGA resources
- Providing reproducibility, debugging, monitoring
- Managing design complexity with multiple contributing authors
- Providing flexibility for rapid architectural exploration
- Approach
- Generate a distributed cycle-accurate hardware event simulator from transactor model
3. RAMP Design Framework Overview
With Greg Gibeling, Andrew Schultz, UCB
- Target System: the machine being emulated
- Describe structure as transactor netlist in RAMP Description Language (RDL)
- Describe behavior of each leaf unit in favorite language (Verilog, VHDL, Bluespec, C/C++, Java)
- Host Platforms: systems that run the emulation or simulation
- Can have part of target mapped to FPGA emulation and part mapped to software simulation
4. Units and Channels in RAMP
[Figure: Sending Unit → Port → Channel → Port → Receiving Unit]
- Units
- Large pieces of functionality, >10,000 gates (e.g., CPU + L1)
- Leaf units implemented in a host language (e.g., Verilog, C)
- Channels
- Unidirectional
- Point-to-point
- FIFO semantics
- Unknown latency and buffering (fixed when system instantiated)
- Implementation generated automatically by RDL compiler
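The channel discipline above (unidirectional, point-to-point, FIFO semantics, latency and buffering fixed when the system is instantiated) can be sketched in software. This is a hypothetical Python model for illustration only, not the RDL-generated implementation:

```python
from collections import deque

class Channel:
    """Sketch of a unidirectional point-to-point FIFO channel.

    Latency and buffering are fixed at instantiation, mirroring the
    RDL-generated channels described above (class name is invented).
    """
    def __init__(self, latency, buffering):
        self.latency = latency        # delivery delay, in target cycles
        self.buffering = buffering    # maximum in-flight messages
        self.queue = deque()          # (deliverable_cycle, message) pairs
        self.now = 0                  # current target clock cycle

    def push(self, msg):
        if len(self.queue) >= self.buffering:
            raise RuntimeError("channel full: sender must stall")
        self.queue.append((self.now + self.latency, msg))

    def pop(self):
        # FIFO: only the head is visible, and only once latency elapses.
        if self.queue and self.queue[0][0] <= self.now:
            return self.queue.popleft()[1]
        return None                   # nothing deliverable this cycle

    def tick(self):
        self.now += 1                 # advance target time one cycle

c = Channel(latency=2, buffering=4)
c.push("hello")
c.tick(); assert c.pop() is None      # latency not yet elapsed
c.tick(); print(c.pop())              # → hello
```

The bounded `buffering` check is what lets the compiler-generated flow control back-pressure a sender without exposing the buffer size to the unit's logic.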
5. RAMP Channels Generated Automatically During System Instantiation
- Channel parameters for timing-accurate simulations given in RAMP description file
- Bitwidth (in bits per target clock cycle)
- Latency (in target clock cycles)
- Buffering (in either fragments or messages)
- Fragments (one target clock cycle's worth of data)
- Smaller than messages
- Convey the simulation time through idles
[Figure: a channel annotated with its bitwidth (32b), latency, and buffering]
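A rough Python sketch of the fragment accounting described above (the function names and the `IDLE` marker are invented for illustration): a message wider than the channel bitwidth is split into per-cycle fragments, and idle fragments carry no data but still advance the receiver's notion of target time:

```python
# Hypothetical sketch: fragments are one target clock cycle's worth of
# data, sized by the channel bitwidth; idles convey simulation time.

def fragment(message_bits, bitwidth=32):
    """Split a message (a bit-string) into per-cycle fragments."""
    return [message_bits[i:i + bitwidth]
            for i in range(0, len(message_bits), bitwidth)]

def cycles_to_send(message_bits, bitwidth=32):
    # One fragment per target clock cycle.
    return len(fragment(message_bits, bitwidth))

IDLE = None  # an idle fragment: no data, but one target cycle of time

def receive(fragments):
    """Advance target time one cycle per fragment; keep data fragments."""
    t, data = 0, []
    for frag in fragments:
        t += 1                    # every fragment conveys one target cycle
        if frag is not IDLE:
            data.append(frag)
    return t, data

msg = "1" * 96                    # a 96-bit message
print(cycles_to_send(msg))        # → 3 target cycles on a 32-bit channel
print(receive(["a", IDLE, IDLE, "b"]))   # → (4, ['a', 'b'])
```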
6. Mapping Target Units to Host Platform
- Inside edge: free from host implementation dependencies
- Needs language-specific version of interface (e.g., Verilog, Bluespec, C)
- Outside edge: implementation dependent
- Deals with physical links
- RDL compiler generates the wrapper and all of the links
- Allows plugins to extend to new host languages or new link types
7. Targets Mapped Across Hardware and Software Host Platforms
- Cross-platform
- Units implemented in many languages
- Library units for I/O
- Links implement channels
- Links
- Can be mapped to anything that transmits data (e.g., FPGA wires, high-speed serial links, Ethernet)
[Figure: units on several host platforms joined at their outside edges by links I, J (debug), K, and L, carrying channels E, F, and H over media including TCP/IP, with library units]
8. Virtualization to Improve FPGA Resource Usage
- RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance without changing cycle-accurate accounting
- Example 1: Multiported register file
- Example: Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
- If RTL mapped directly, requires 48K flip-flops
- Slow cycle time, large area
- If mapping into block RAMs (one read/one write per cycle), takes 3 host cycles and 3x2KB block RAMs
- Faster cycle time (3X) and far less resources
- Example 2: Large L2/L3 caches
- Current FPGAs only have 1MB of on-chip SRAM
- Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
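The register-file example can be illustrated with a toy model (a hypothetical class, not the actual FPGA mapping): three reads and two writes are serviced over three host cycles on a RAM with one read and one write port per host cycle, while all reads still observe the state from the start of the target cycle:

```python
class VirtualizedRegfile:
    """Sketch: a 3-read / 2-write target register file emulated on a
    block RAM with one read port per host cycle, so one target cycle
    costs three host cycles (per the slide's arithmetic)."""
    def __init__(self, nregs):
        self.ram = [0] * nregs
        self.host_cycles = 0

    def target_cycle(self, reads, writes):
        """reads: up to 3 addresses; writes: up to 2 (addr, value)."""
        assert len(reads) <= 3 and len(writes) <= 2
        values = []
        for i in range(3):                    # three host cycles
            if i < len(reads):
                values.append(self.ram[reads[i]])   # one read per cycle
            self.host_cycles += 1
        for addr, val in writes:
            # Applied after the reads so every read sees the state from
            # the start of the target cycle (a real mapping would drive
            # the BRAM write port in parallel with the reads).
            self.ram[addr] = val
        return values

rf = VirtualizedRegfile(64)
print(rf.target_cycle(reads=[0, 1, 2], writes=[(1, 42), (2, 7)]))  # → [0, 0, 0]
print(rf.target_cycle(reads=[1, 2], writes=[]))                    # → [42, 7]
print(rf.host_cycles)           # → 6 host cycles for 2 target cycles
```

The point is the accounting: the unit runs at a 3:1 host-to-target clock ratio, yet the cycle-accurate target behavior is unchanged.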
9. Debugging and Monitoring Support
- Channel model + target time model supports
- Monitoring
- All communication over channels can be examined and controlled
- Single-stepping by cycle or by transaction
- Target time can be paused or slowed down
- Simulation steering
- Inject messages into channels
- Mixed-mode emulation/simulation
- Can move some units into software simulation
- Cross-platform communication hidden by RDL compiler (RDLC)
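Because all inter-unit traffic flows over channels, monitoring and steering reduce to wrapping the channel itself. A minimal illustrative sketch (invented names, not the RDLC interface): every send and receive is logged, and the debugger can inject messages without touching any unit:

```python
class MonitoredChannel:
    """Sketch: observe, log, and inject messages on a channel."""
    def __init__(self):
        self.queue = []
        self.log = []       # complete trace of traffic on this channel

    def push(self, msg, injected=False):
        self.log.append(("inject" if injected else "send", msg))
        self.queue.append(msg)

    def pop(self):
        if not self.queue:
            return None
        msg = self.queue.pop(0)
        self.log.append(("recv", msg))
        return msg

ch = MonitoredChannel()
ch.push("pkt0")
ch.push("debug-probe", injected=True)   # simulation steering
ch.pop()
print(ch.log)   # → [('send', 'pkt0'), ('inject', 'debug-probe'), ('recv', 'pkt0')]
```

Pausing target time is the degenerate case: simply stop delivering fragments, and every unit stalls on its implicit queue conditions.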
10. Related Approaches
- FPGA-Based Approaches
- Quickturn, Axis, IKOS, Thara
- FPGA- or special-processor based gate-level hardware emulators
- Slow clock rate (1MHz vs. RAMP 100MHz)
- Limited memory capacity (few GB vs. RAMP 256GB)
- RPM at USC in early 1990s
- Up to only 8 processors, only memory controller in configurable logic
- Other approaches
- Software Simulators
- Clusters (standard microprocessors)
- PlanetLab (distributed environment)
- Wisconsin Wind Tunnel (used CM-5 to simulate shared memory)
- All suffer from some combination of
- Slowness, inaccuracy, target inflexibility, scalability, unbalanced computation-communication ratio, ...
11. RAMP White Structure
- Multiple different ISAs will eventually be supported
- Target router topology independent of host link topology
- RAMP White uses scalable directory-based coherence protocol
- Host DRAM used to support host emulation (e.g., L2 cache image) and tracing, as well as target memory
[Figure: a node with CPU + L1 coherence units (ISA independent), an L2 coherence engine handling non-target accesses, and a router to other nodes]
12. RAMP for MP-SoC Emulation
Standard TI OMAP 2420 design
13. Backup
14. Computing Devices Then
- EDSAC, University of Cambridge, UK, 1949
15. Computing Devices Now
Sensor Nets
Cameras
Games
Set-top boxes
Media Players
Laptops
Servers
Robots
Smart phones
Routers
Automobiles
Supercomputers
16. Requirements Converging and Growing
- Traditional general-purpose computing
- Focus on programming effort to implement large and extensible feature set
- Traditional embedded computing
- Focus on resource constraints (cost, execution time, power, memory size, ...) to implement a fixed function
- Current and future computing platforms
- Large and growing feature set and resource constraints (e.g., web browsers on cellphones, power consumption of server farms)
- But also, new concerns
- Reliability (hardware and software errors)
- Security
- Manageability (labor costs)
17. Uniprocessor Performance (SPECint)
[Figure: 3X gap from historical growth; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
⇒ All major manufacturers moving to multicore architectures
- General-purpose uniprocessors have stopped historic performance scaling
- Power consumption
- Wire delays
- DRAM access latency
- Diminishing returns of more instruction-level parallelism
18. Custom Chip Design Cost Growing
⇒ Fewer chips, increasingly programmable to support wider range of applications
- Development cost rising rapidly because of growing design effort
- Logic complexity and new physical design challenges (wire delay, switching and leakage power, coupling, inductance, variability, ...)
- New ASIC development with automated design tools: $10-30M (<400MHz @ 90nm)
- Assume 10% R&D cost, 10% market share ⇒ $1-3B market
- Development cost much higher for hand-crafted layout, e.g., IBM Cell microprocessor >$400M (4GHz in 90nm)
19. Convergence of Platforms
- Only way to meet system feature set, cost, power, and performance requirements is by programming a processor array
- Multiple parallel general-purpose processors (GPPs)
- Multiple application-specific processors (ASPs)
"The Processor is the new Transistor" [Rowen]
20. New Abstraction Stack Needed
- Challenge: Desperate need to improve the state of the art of parallel computing for complex applications
- Opportunity: Everything is open to change
- Programming languages
- Operating systems
- Instruction set architecture (ISA)
- Microarchitectures
- How do we work across traditional abstraction boundaries?
21. Stratification of Research Communities
Application
Algorithm
Programming Language
Operating System
Instruction Set Architecture (ISA)
Microarchitecture
Gates/Register-Transfer Level (RTL)
Circuits
Devices
- Software community: "Hardware cannot be changed!"
- Hardware community: "Software cannot be changed!"
- Problem is not just one of mindset
- Software developers not interested unless hardware available
- Software simulations too slow, 10-100 kHz for detailed models of one CPU
- Software simulations not credible
- But takes 5 years to complete prototype hardware system!
- Then in a few months of software development, all mistakes become clear
22. RAMP: Build Research MPP from FPGAs
- As ≈25 CPUs will fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ≈40 FPGAs?
- 16 32-bit simple "soft core" RISC at 150MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs ⇒ 2X CPUs, ≈1.2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box MPP
- E.g., 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ ≈200 MHz/CPU in 2007
- Multi-University Collaboration
- RAMPants: Arvind (MIT), Krste Asanovic (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (UCB), Jan Rabaey (UCB), and John Wawrzynek (UCB)
23. RAMP Goals
- Provide credible prototypes with sufficient performance to support co-development of software and hardware ideas
- Turn around new hardware ideas in minutes or hours
- Support reproducible comparison of ideas across different groups
- Architects distribute usable hardware designs by FTP, improve visibility to industry
24. RAMP-1 Hardware
- BEE2: Berkeley Emulation Engine 2
- By John Wawrzynek and Bob Brodersen with students Chen Chang and Pierre Droz
- Completed Dec. 2004 (14x17 inch 22-layer PCB): 1.5W/computer, 5 cu. in./computer, $100/computer
- Board: 5 Virtex II FPGAs, 18 banks DDR2-400 memory, 20 10GigE conn.
25. Transactors
- A transactor (transactional actor) is an abstract unit of computation, which is easy to understand and verify, but which can also be automatically translated into high-quality hardware or software implementations
26. Original Transactor Motivation
[Figure: abstraction stack from Application down to Devices, with Transactors/Microarchitecture (UTL) between ISA and Gates/RTL; example: Scale Vector-Thread Processor, 128 threads/core, 1M gates, 17mm2, 400MHz, 0.18um (IEEE Micro, Top Picks, 2004)]
- Design chip at microarchitecture level rather than at RTL level
- Abstract away pipeline depth and communication latencies
- Separate global communication from local computation
- Avoid over-specification of behavior, particularly local pipelining/scheduling
- Encode best practice in concurrency management
27. Transactor Anatomy
- Transactor unit comprises
- Architectural state (registers + RAMs)
- Input queues and output queues connected to other units
- Transactions (guarded atomic actions on state and queues)
- Scheduler (selects next ready transaction to run)
[Figure: transactor with input queues feeding transactions under a scheduler, writing output queues]
- Advantages
- Handles non-deterministic inputs
- Allows concurrent operations on mutable state within unit
- Natural representation for formal verification
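The anatomy above maps naturally onto code. A minimal Python sketch (hypothetical, not a RAMP artifact) with architectural state, input/output queues, one guarded atomic action, and a fixed-priority scheduler that fires the first ready transaction:

```python
from collections import deque

class Transactor:
    """Sketch of the anatomy above: architectural state, input/output
    queues, guarded atomic transactions, and a priority scheduler."""
    def __init__(self):
        self.count = 0                    # architectural state
        self.inq = deque()                # input queue
        self.outq = deque()               # output queue
        # Transactions as (guard, action) pairs, in priority order.
        self.transactions = [
            (lambda: bool(self.inq), self.forward),
        ]

    def forward(self):
        # Atomic action: pop one input, push one output, update state.
        self.outq.append(self.inq.popleft())
        self.count += 1

    def step(self):
        """Scheduler: fire the first transaction whose guard holds."""
        for guard, action in self.transactions:
            if guard():
                action()
                return True
        return False                      # no transaction ready

t = Transactor()
t.inq.extend([1, 2])
while t.step():
    pass
print(list(t.outq), t.count)             # → [1, 2] 2
```

The guard on the input queue is the "implicit condition" the slides mention: a transaction is simply not ready until its data is available.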
28. Transactor Networks
- Decompose system into network of transactor units
- Decoupling global communication and local computation
- Only communication between units via buffered point-point channels
- All computation only on local state and channel end-points
[Figure: transactors connected by FIFO-buffered point-point channels for global inter-unit communication; short-range local communication stays within each unit]
29. Message Queues or Channels
- Queues decouple units' execution and require units to use latency-insensitive protocols [Carloni et al., CAV'99]
- Queues are point-to-point channels only
- No fanout; a unit must replicate messages on multiple queues
- No buses in a transactor design (though implementation may use them)
- Transactions can only pop head of input queues and push at most one element onto each output queue
- Avoids exposing size of buffers in queues
- Also avoids synchronization inherent in waiting for multiple elements
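Since channels have no fanout, replication is explicit in the sending unit. A tiny sketch of that discipline (the `broadcast` helper is invented; the slides define no such primitive):

```python
from collections import deque

# Channels are point-to-point, so a transactor that must broadcast
# replicates the message onto each outgoing queue itself: at most one
# push per output queue per transaction, as the rules above require.
def broadcast(msg, out_queues):
    for q in out_queues:
        q.append(msg)            # one push on each output queue

a, b = deque(), deque()
broadcast("flit", [a, b])
print(list(a), list(b))          # → ['flit'] ['flit']
```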
30. Transactions
- Transaction is a guarded atomic action on local state and input and output queues
- Guard is a predicate that specifies when transaction can execute
- Predicate is over architectural state and heads of input queues
- Implicit conditions on input queues (data available) and output queues (space available) that transaction accesses
- Transaction can only pop up to one record from an input queue and push up to one record on each output queue
transaction route(input int32 in,
                  output int32 out0,
                  output int32 out1)
{
  when (routable(in))
    if (route_func(in) == 0)
      out0 = in;
    else
      out1 = in;
}

transaction route_kill(input int32 in)
{
  when (!routable(in))
    bad_packets++;
}
31. Scheduler
- Scheduling function decides on transaction priority based on local state and state of input queues
- Simplest scheduler picks among ready transactions in a fixed priority order
- Transactions may have additional predicates which indicate when they can fire
- E.g., implicit condition on all necessary output queues being ready
unit route_stage(input int32 in0,   // First input channel.
                 input int32 in1,   // Second input channel.
                 output int32 out0, // First output channel.
                 output int32 out1) // Second output channel.
{
  int32 bad_packets;
  int1 last;                        // Fair scheduler state.
  schedule reset {
    bad_packets = 0; last = 0;
  }
  route_kill(in0);
  route_kill(in1);
  schedule round_robin(last) {
    (0) route(in0, out0, out1);
    (1) route(in1, out0, out1);
  }
}
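The `round_robin(last)` schedule in the listing above can be sketched as a plain function (this is an assumed interpretation of its fairness rule: serve the next ready transaction after the one served last, wrapping around):

```python
def round_robin_schedule(last, ready):
    """Fair scheduler sketch: `last` is the index served most recently;
    pick the next ready index after it, wrapping around; None if no
    transaction is ready this cycle."""
    n = len(ready)
    for off in range(1, n + 1):
        i = (last + off) % n
        if ready[i]:
            return i
    return None

# Two input channels, both always ready: the scheduler alternates.
last, picks = 0, []
for _ in range(4):
    i = round_robin_schedule(last, [True, True])
    picks.append(i)
    last = i
print(picks)            # → [1, 0, 1, 0]
```

Fairness matters here because a fixed-priority scheduler would starve `in1` whenever `in0` stays ready.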
32. Raise Abstraction Level for Communication
- RTL Model: Cycles and Wires
- Designer allocates signals to wires and orchestrates cycle-by-cycle communication across chip
- Global and local wires specified identically
- Transactors: Messages and Queues
- All global communication uses latency-insensitive messages on buffered point-point channels
- Global wires separated from local intra-unit wires

Problems in RTL Implementation → Transactor Communications
- Long signal paths may need more pipelining to hit frequency goal, requiring manual RTL changes → Latency-insensitive model allows automatic insertion of pipeline registers to meet frequency goals
- Repeaters used to reduce latency burn leakage power → Can trade increased end-to-end latency for reduced repeater power
- Neighbor wire coupling may reduce speed or inject errors, requiring manual rework → Use optimized signaling on known long wires, e.g., dual-data rate for high throughput, low-swing for low power, shields to cut noise
- Dedicated wires for each signal cause wiring congestion and waste repeater power because many wires are mostly idle → Multiplexed channels reduce congestion and save repeater power; can use on-chip network
- Error detection and correction circuitry cannot be added automatically, requiring manual RTL redesign → Can automatically insert error correction/retry to cover communication soft errors
33. Raise Abstraction Level for Computation
- RTL Model: Manual Concurrency Management
- Designer has to divide application operations into pieces that fit within a clock cycle, then develop control logic to manage concurrent execution of many overlapping operations
- Transactor Model: Synthesis from Guarded Atomic Actions
- Designer describes each atomic transaction in isolation, together with priority for scheduling transactions
- Tools synthesize pipelined transactor implementation including all control logic to manage dependencies between operations and flow control of communications

RTL Model → Transactor Model
- Single application operation manually divided across multiple pipeline stages, then interleaved with other operations per clock (e.g., "if (condA1) Astage1 else if (condB1) Bstage1" in stage 1, "if (condA2) Astage2 else if (condB2) Bstage2" in stage 2) → Each application operation described as an independent transaction ("Transaction A: if (condA) ...", "Transaction B: if (condB) ..."); schedule gives desired priority for multiple enabled transactions ("Schedule: A > B")
- Dependencies between concurrently executing operations managed manually → No pipeline registers or other internal bookkeeping state exposed in specification
- Input and output communication rates and flow control protocol manually built into code → Communication flow control automatically generated from transactions' use of input and output queues
34. Design Template for Transactor
[Figure: pipelined transactor with a scheduler feeding stages that read and write architectural state]
- Scheduler only fires transaction when it can complete without stalls
- Avoids driving heavily loaded stall signals
- Architectural state (and outputs) only written in one stage of pipeline; use bypass/interlocks to read in earlier stages
- Simplifies hazard detection/prevention
- Have different transaction types access expensive units (RAM read ports, shifters, multiply units) in same pipeline stage to reduce area
35. Transactor VLSI Design Flow
[Figure: VLSI design flow starting from a specification]
36. System Design Flow
[Figure: system design flow starting from transactor code]
37. Related Models
- CSP/Occam
- Rendezvous communications expose system latencies in design
- No mutable shared state within a unit
- Kahn Process Networks (and simpler SDF models)
- Do not support non-deterministic inputs
- Sequential execution within unit
- Latency-Insensitive Design [Carloni et al.]
- Channels are similar to transactor channels
- Units described as stallable RTL
- TRS/Bluespec [Arvind & Hoe]
- Uses guarded atomic actions at RTL level (single-cycle transactions)
- Microarchitectural state is explicit
- No unit-level discipline enforced
38. RAMP Implementation Plans

Name           | Goal               | Target | CPUs                                     | Details
Red (Stanford) | Get Started        | 1H06   | 8 PowerPC 32b hard cores                 | Transactional memory SMP
Blue (Cal)     | Scale              | 2H06   | ≈1000 32b soft (Microblaze)              | Cluster, MPI
White (All)    | Full Features      | 1H07?  | 128? soft 64b, multiple commercial ISAs  | CC-NUMA, shared address, deterministic, debug/monitor
2.0            | 3rd party sells it | 2H07?  | 4X CPUs of '04 FPGA                      | New '06 FPGA, new board
39. Summary
- All computing systems will use many concurrent processors (1,000s of processors/chip)
- Unlike previously, this is not just a prediction; it is already happening
- We desperately need a new stack of system abstractions to manage complexity of concurrent system design
- RAMP project is building an emulator "watering hole" to bring everyone together to help make rapid progress
- Architects, OS, programming language, compilers, algorithms, application developers, ...
- Transactors provide a unifying model for describing complex concurrent hardware and software systems
- Complex digital applications
- The RAMP target hardware itself