Title: MICRO-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors
1. MICRO-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors
- Michael K. Chen
- Erik Johnson
- Roy Ju
- {michael.k.chen, erik.j.johnson, roy.ju}_at_intel.com
- Corporate Technology Group, Intel Corp.
- December 5, 2004
2. Agenda
- Project Overview
- Domain-specific Language
- High-level Optimizations
- Code Generation and Optimizations
- Performance Characterization
- Runtime Adaptation
- Summary
3. Project Overview
- Part of the Shangri-La Tutorial presented at MICRO-37, December 5, 2004
4. Outline
- Problem Statement
- Overview of Shangri-la System
- Status and Teams
5. The Problem
Packet processing applications
- State-of-the-art
  - Hand-tuned code for maximal performance
  - → but often error-prone and not scalable
  - Static resource allocation often tailored to one particular workload
  - → not flexible to varying workloads and hardware
6. Shangri-La Overview
- Mission: research an industry-leading programming environment for packet processing on Intel chip multiprocessor (CMP) silicon
- Challenges
  - Hide architectural details from programmers
  - Automate allocation of system resources
  - Adapt resource allocation to match dynamic traffic conditions
  - Achieve performance comparable to hand-tuned systems
- Technology
  - Language: enable portable packet processing applications
  - Compiler: automate code partitioning and optimizations
  - Run-time system: adapt to dynamic workloads
7. Architectural Features of the Intel IXP Processor
- Heterogeneous, multi-core
  - Intel XScale processor (control) and MicroEngines (data)
- Memory hierarchy
  - Local memory (LM) distributed on MEs
  - No HW cache
  - Scratch, SRAM, DRAM shared
  - Long memory latency
- MicroEngine
  - Single issue, with deferred slots
  - Lightweight HW multi-threading
  - Event signals to synchronize threads
  - Multiple register banks, with constraints on operands in instructions
  - Limited code store
8. Packet Processing Applications
- Types of apps
  - IPv4 forwarding, L3-Switch, MPLS (Multi-Protocol Label Switching), NAT (Network Address Translation), Firewall, QoS (Quality of Service)
- Characteristics of packet processing apps
  - Performance metric: throughput (vs. latency)
  - Mostly memory bound
  - Large numbers of packets without locality
  - Smaller instruction footprint
  - Execution paths tend to be predictable
9. Anatomy of Shangri-La
(Diagram: the Shangri-La toolchain, shown alongside a general-purpose compiler and language(s))
- Baker programming language: modular language (with C-like syntax) to express applications as a dataflow graph
- Baker compiler front-end
- Profiler: extract run-time characteristics by executing the application
- Pi compiler (inter-procedural opt., loop/memory opt.): compiler optimizations for pipeline construction and data structure mapping/caching
- Aggregate compiler (global optimizations, code generation): code generation and optimization for heterogeneous cores
- Run-time system (execution environment): dynamically adapt mapping to match traffic fluctuations
10. Baker Language
- Familiar to embedded systems programmers
  - Syntactically feels like C
  - Simplifies the development of packet-processing applications
- Hides architectural details
  - Single level of memory
  - Implicit threading model
  - Modular programming and encapsulation
- Domain-specific
  - Data-flow model
  - Actors and interconnects (PPFs and channels)
  - Built-in types, e.g. packet
- Enables the compiler to generate efficient code on target CMP hardware
11. Shangri-La Example
Modular, simple description:

    module l3_switch {
      module eth_rx, eth_tx;   // Built-in PPFs
      module l2_clsfr, eth_encap_mod, l3_fwdr, l2_bridge;
      wiring {
        eth_rx.eth0 -> l3_switch.l2_clsfr.input_chnl;
        l3_fwdr.input_chnl <- l2_clsfr.l3_fwrd_chnl;
        ... ...
      }
    }

Looks like C:

    int l3_switch.l2_clsfr.process(ether_packet_t in_pkt) {
      ...
      if (fwd) {
        p = packet_decap(in_pkt);
        channel_put(l3_forward_chnl, p);
      } else {
        channel_put(l2_bridge_chnl, in_pkt);
      }
    }
(Diagram: the L3 Switch dataflow graph: Rx → L2 Cls → {L3 Fwdr, L2 Bridge} → Eth Encap → Tx)
- IR run on IR-simulator
- Stimulated by packet trace
- Statistics stored in IR
12. Compiler and Optimizations
- Perform program and data partitioning
  - Cluster multiple finer-grained components into larger aggregates
  - Balance between replication and pipelining
  - Automatic data mapping onto the memory hierarchy
- Optimizations and code generation
  - Code generation for heterogeneous processing cores
  - Global machine-independent optimizations
  - Optimizations for the memory hierarchy
  - Machine-dependent code generation and optimizations
13. Shangri-La Example
- Internal channels converted to function calls
- Each aggregate given a main with a while(1) loop
- Aggregates are groups of PPFs; critical-path PPFs are placed in the same aggregate
(Diagram: the L3 Switch PPFs partitioned into executable binaries for the Intel XScale processor and the MEs)
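The while(1) dispatch loop and the channel-to-call conversion described above can be sketched in plain C. All names, the ring-buffer channel, and the PPF bodies are illustrative stand-ins, not Shangri-La's actual generated code, and the loop is bounded here so the sketch terminates:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical channel: a tiny ring buffer standing in for a real
 * channel implementation (capacity assumed never exceeded here). */
#define CHAN_CAP 8
typedef struct { int buf[CHAN_CAP]; int head, tail; } channel_t;

static int channel_get(channel_t *c, int *pkt) {
    if (c->head == c->tail) return 0;            /* channel empty */
    *pkt = c->buf[c->head % CHAN_CAP]; c->head++;
    return 1;
}
static void channel_put(channel_t *c, int pkt) {
    c->buf[c->tail % CHAN_CAP] = pkt; c->tail++;
}

/* Two PPFs clustered into one aggregate become plain function calls. */
static int l2_cls(int pkt)  { return pkt + 1; }   /* stand-in work */
static int l3_fwdr(int pkt) { return pkt * 2; }   /* stand-in work */

/* The aggregate's generated main: the real loop is while(1); this
 * sketch drains the input channel and returns the packet count. */
int aggregate_main(channel_t *in, channel_t *out) {
    int pkt, processed = 0;
    while (channel_get(in, &pkt)) {   /* real code: while (1) + block */
        pkt = l2_cls(pkt);            /* internal channel -> call */
        pkt = l3_fwdr(pkt);
        channel_put(out, pkt);
        processed++;
    }
    return processed;
}
```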
14. Run-time Adaptation
- Workloads fluctuate over time
  - Usually over-provision to handle the worst case
- Adapt to workload
  - Change mapping to increase performance when needed
  - Power down unneeded processors
- Adaptation requirements
  - Hardware-independent abstraction
  - Querying of resource utilization
15. Shangri-La Example
- Automatically map aggregates to processing units
- Automatically remap at runtime
(Diagram: the L3 Switch aggregates mapped, and remapped, across the Intel XScale core and MEv2 1-8)
16. Project Status
- Project started in Q1 2003
- Collaboration among Intel, the Chinese Academy of Sciences, and UT-Austin
- Compiler based on the Open Research Compiler (ORC)
- A completed prototype system achieving the maximal packet forwarding rate on a number of applications
- Research project to transfer technology to product groups
17. Acknowledgements
- Communication Technology Lab, Intel
  - Erik Johnson, Jamie Jason, Aaron Kunze, Steve Goglin, Arun Raghunath, Vinod Balakrishnan, Robert Odell
- Microprocessor Technology Lab, Intel
  - Xiao Feng Li, Lixia Liu, Jason Lin, Mike Chen, Roy Ju, Astrid Wang, Kaiyu Chen, Subramanian Ramaswamy
- Institute of Computing Technology, Chinese Academy of Sciences
  - Zhaoqing Zhang, Ruiqi Lian, Chengyong Wu, Junchao Zhang, Jiajun Wu, HanDong Ye, Tao Liu, Bin Bao, Wei Tang, Feng Zhou
- University of Texas at Austin
  - Harrick Vin, Jayaram Mudigonda, Taylor Riche, Ravi Kokku
18. The Baker Language
19. Baker Overview and Goals
- Baker is C with data-flow and packet processing extensions
- Goal 1: Enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor)
  - Encourage interesting, complex application development
  - Be familiar to embedded systems programmers: should start with C
- Goal 2: Enable good execution performance
  - Scalable performance across new versions of large-scale CMPs
  - Expose compiler and run-time system to optimization opportunities
  - Don't constrain the compiler's ability to place code or data; don't preclude run-time adaptation
20. Outline
- The Baker Approach
- Hardware abstractions and models
- Domain-specific constructs
- Standard C language feature reductions
- Results
- Future Research
- Summary
21. Baker's Hardware Models
- Memory
- Concurrency, i.e., cores and threads
- I/O, e.g., receive and transmit
22. A Single-level Memory Model
- Baker exposes a single-level, shared memory model, like C
- Makes programming easier
  - Variable declaration and use, malloc/free work just like C
- Enables compiler freedom in optimizing data placement
  - Move the most-accessed data structures (or parts of structures) to the fastest memory
- Enables the compiler to move code to any core
  - Code not tied to a particular core's physical memory model
23. Implicitly Threaded Concurrency
- Baker exposes a multithreaded concurrency model
  - Programmer knows code may execute concurrently
  - Programmer does not
    - Know the number of cores
    - Explicitly create or destroy threads
  - A consequence: programmers must protect shared memory with locks
- Enables compiler and run-time system to optimize execution
  - Can create an application pipeline and balance it
  - Can optimize locks based on which processors access a lock
24. Example of Implicit Threading
Code shown for illustrative purposes only and
should not be considered valid.
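The slide's own code does not survive in this text, so here is a hedged C/pthreads sketch of the model, with all names invented: the PPF body is written as ordinary sequential code plus a lock around shared state, while thread creation belongs entirely to the (here simulated) run-time system, never to the Baker programmer:

```c
#include <pthread.h>
#include <assert.h>

/* Shared PPF state: in Baker the programmer must guard it with a
 * lock because the run-time may run the PPF on many threads. */
static long pkt_count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Body of a hypothetical PPF: written as if single-threaded,
 * except for the explicit lock around the shared counter. */
static void *ppf_body(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&count_lock);
        pkt_count++;                      /* shared-state update */
        pthread_mutex_unlock(&count_lock);
    }
    return NULL;
}

/* Stand-in for the run-time system: the thread count (up to 16
 * here) is the system's choice, invisible to the PPF author. */
long run_ppf_on(int nthreads) {
    pthread_t t[16];
    pkt_count = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, ppf_body, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return pkt_count;
}
```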
25. I/O As A Driver Model
- RX and TX require hardware knowledge
  - E.g., PHYs and MACs, RBUFs, TBUFs, flow-control hardware
- Difficult to abstract this hardware using common C constructs
- Solution
  - Don't write these in Baker
  - Written once, in assembly, by the system vendors for each board
  - Baker developers use receive and transmit code like a device driver
26. Exposing Domain Features
- Hiding hardware features can drastically decrease performance
- Baker exposes application domain features to compensate
  - Tailor the compiler and run-time system optimizations to the domain
  - Programmer is forced to help the compiler and run-time system find parallelism
    - But in a natural way
- Two types of domain features
  - Data-flow abstractions
  - Packet processing abstractions
27. Data Flow Overview
- A data flow is a directed graph
  - Graph nodes are called actors (or packet-processing functions, PPFs, in Baker) and represent the computation
  - Graph edges are called channels and move data between actors
- Data flow is a natural fit for the packet processing domain
28. Data Flow: PPFs and Channels
- PPFs (or actors)
  - Implicitly concurrent
  - Stateful
  - Support multiple inputs and outputs
  - No assumptions about a steady rate of packet consumption
- Channels
  - Queue-like properties
    - Asynchronous, unidirectional, typed, reliable
  - Active and passive varieties
  - Can be replaced with function calls
  - Run-time system can choose an optimal implementation
    - E.g., scratch rings vs. next-neighbor rings
29. Packet Processing Features
- Packets and meta-data as first-class objects
- Packets
  - Programmer accesses packet data through a special pointer type; all packet accesses go through these pointers
  - Allows the compiler to coalesce reads/writes, avoid head and tail manipulation, etc.
- Meta-data
  - Storage associated and carried with a packet
    - E.g., input port, output port, etc.
  - Accessed via the packet's pointer
  - Useful to programmers to carry per-packet state passed between actors
  - Language ensures that meta-data is created before it is used
30. Example Application
Code shown for illustrative purposes only and
should not be considered valid.
31. Reduce Language Features
- By removing some features of C, the compiler is able to make more optimizations
- Type-safe pointers
  - Compiler is able to do much better alias analysis
  - Networking code typically does not use tricky pointer manipulations
- Some features needed to be removed to avoid large overheads on the MicroEngines
- Recursion
  - No natural stack on the MicroEngine, so the compiler has to implement one
  - Eliminating recursion simplifies stack analysis
- Function pointers
  - Removed for similar reasons as recursion
  - Unfortunately, network programmers actually use them a great deal
32. Results
- Source lines of code measured using sloccount
  - Does not do complexity analysis; does not handle assembly code

These tests and ratings are measured using specific computer systems and/or components and reflect the size of the indicated code as measured by those tests. Any difference in system hardware or software design or configuration may affect actual sizes.
33. Future Research
- Existing languages expose packets as completely independent; however, flows are a more appropriate independence class for data in this domain
- How should flows of packets be represented in a language, and how should we optimize around them?
  - Automated ordering
  - Flow-data locality improvements
  - Flow-lock elision
34. Summary
- Goals
  - Enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor)
  - Enable good execution performance
- Approach
  - Hide hardware details
    - Single memory, implicit threading, RX/TX as drivers
  - Expose domain-specific constructs
    - Data-flow, packets, meta-data
  - Reduce C features
    - Type-safe pointers, recursion, function pointers
35. High-Level Optimizations
36. Shangri-La Compiler Overview
- Convert Baker program into compiler intermediate
representation (IR)
- Derive run-time characteristics by simulating
application
- Compiler optimizations for pipeline construction
and data structure mapping/caching
- Code generation and optimization for
heterogeneous cores
- Load application and perform dynamic resource
linking
37. Profiling Overview
- Simulation of high-level IR
  - Developed a custom IR interpreter
  - Different from the traditional 2-pass profiling
  - Profiling information guides optimizations in later phases
- Stimulated using user-supplied packet traces
- Information collected
  - Execution frequency
  - Communication
  - Memory access statistics
38. Pi Compiler Details
- Performs most high-level optimizations
  - Mapping PPFs to heterogeneous cores
  - Assigning memory levels to global data structures
  - Performing inter-procedural analysis for the optimizations needing its support
- Guided by profiling results
39. Supporting Language Features with Compiler Optimizations
- Modular, dataflow language → automatic program partitioning
- Packet abstraction model → packet handling optimizations
- Flat memory hierarchy → automatic memory mapping
40. Key Compiler Technologies
- Automatic program partitioning to heterogeneous cores
- Packet handling optimizations
  - Packet access combining
  - Static offset and alignment resolution
  - Packet primitive removal
- Partitioned memory hierarchy optimizations
  - Memory mapping
  - Delayed-update software-controlled caches
  - Program stack layout optimization
41. Partitioning Across Heterogeneous Cores
- Partition across the Intel XScale core and multiple MEs
- Partitioning considerations
  - Identifying control and data planes
  - Minimizing inter-processor communication costs
    - Account for dynamic characteristics using profiling results
  - Satisfying code size constraints
- Different memory addresses seen by different cores
  - Insert address translations
  - Minimize insertions and their impact on performance
42. Inputs Into the Partitioning Algorithm
- Throughput-driven cost model
  - Eliminates latency from consideration
  - Expresses the goal appropriately for the domain
- Relevant profiling statistics
  - PPF execution time
  - Global data access frequency
  - Channel utilization
- Possible partitioning strategies
  - Pipelining the application across cores
  - Replicating the application across cores
43. Partitioning Algorithm
(Diagram: the Pi compiler partitioning flow over the L3 Switch example: Rx, Tx, L2 Cls, L3 Fwdr, Eth Encap, L2 Bridge)
- Intra-PPF IPA, with code size and execution time estimates feeding the memory mapper
- Aggregate formation: merge the PPFs with the highest communication cost
- Intra-aggregate IPA, then aggregate dump
- Duplicate the aggregate with the lowest throughput
- Duplicate the entire pipeline on available MEs
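A toy version of the aggregate-formation step above, assuming an invented cost model and code-size budget (not the compiler's actual heuristics): greedily merge the pair of PPFs joined by the highest-traffic channel while the merged aggregate still fits a per-ME code-store budget:

```c
#include <assert.h>

#define NPPF 4

/* Per-PPF code size (words) and channel traffic between PPFs;
 * all numbers are invented for illustration. */
static int code_size[NPPF] = { 300, 200, 250, 150 };
static int traffic[NPPF][NPPF] = {
    { 0, 90, 10,  0 },
    { 90, 0, 40,  5 },
    { 10, 40, 0, 70 },
    { 0,  5, 70,  0 },
};

int agg[NPPF];   /* agg[i] = aggregate that PPF i belongs to */

static int agg_size(int a) {
    int s = 0;
    for (int i = 0; i < NPPF; i++) if (agg[i] == a) s += code_size[i];
    return s;
}

/* Greedy aggregate formation under a per-ME code-store budget. */
void form_aggregates(int budget) {
    for (int i = 0; i < NPPF; i++) agg[i] = i;  /* one PPF per aggregate */
    for (;;) {
        int bi = -1, bj = -1, best = 0;
        /* find the heaviest channel crossing aggregate boundaries */
        for (int i = 0; i < NPPF; i++)
            for (int j = i + 1; j < NPPF; j++)
                if (agg[i] != agg[j] && traffic[i][j] > best &&
                    agg_size(agg[i]) + agg_size(agg[j]) <= budget) {
                    best = traffic[i][j]; bi = i; bj = j;
                }
        if (bi < 0) break;                       /* no feasible merge left */
        int from = agg[bj], to = agg[bi];
        for (int k = 0; k < NPPF; k++) if (agg[k] == from) agg[k] = to;
    }
}
```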
44. Packet Access Combining
- Basic packet accesses are powerful
  - Support for language features
  - Naïve mapping results in at least one memory access per packet access
- Combine multiple packet / metadata accesses
  - L3-Switch has 24 packet accesses per packet on the critical path
  - Take advantage of the IXP's wide DRAM access instruction
  - Buffer values in local memory or transfer registers
45. Packet Access Combining Example
Before combining:

    t1 = pkt->ttl   (off=64b, sz=8b)
    t2 = pkt->prot  (off=72b, sz=8b)

After combining:

    b  = read pkt (off=64b, sz=16b)
    t1 = ( b >> 8 ) & 0xff
    t2 = b & 0xff

- Analysis overview
  - Isolate packet accesses
  - Perform checks to guarantee packet accesses are combined safely
  - Validate range and size of the combined memory access
  - Replace combined accesses with accesses to/from local memory / transfer registers
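The transformation can be mimicked in plain C: two narrow field reads are replaced by one wide read plus register extractions, mirroring the slide's b = read(off=64b, sz=16b) example. The packet bytes here are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Toy packet: ttl at bit offset 64 (byte 8) and prot at bit
 * offset 72 (byte 9), as in the slide's example. */
static const uint8_t pkt[20] = {
    [8] = 64,   /* TTL */
    [9] = 6,    /* protocol, 6 = TCP */
};

/* Naive mapping: one (simulated) memory access per packet access. */
static void read_fields_naive(uint8_t *ttl, uint8_t *prot) {
    *ttl  = pkt[8];
    *prot = pkt[9];
}

/* Combined: one wide 16-bit read, fields extracted in registers. */
static void read_fields_combined(uint8_t *ttl, uint8_t *prot) {
    uint16_t b = (uint16_t)((pkt[8] << 8) | pkt[9]);  /* one wide read */
    *ttl  = (b >> 8) & 0xff;
    *prot = b & 0xff;
}
```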
46. Static Offset and Alignment Resolution (SOAR)

    packet_encap: offset( src_ip ) = 26B
    packet_decap: offset( src_ip ) = ???

- Generic packet accesses
  - Can handle arbitrary layering of protocols and arbitrary field offsets
  - Clearly simplifies the programmer's task
- But dynamic offset and alignment determination add significant overheads
  - Dynamic offset handling adds 20 instructions per packet access
  - Dynamic alignment adds several instructions per packet access
47. Static Offset and Alignment Resolution (SOAR)
- Statically resolved packet field alignment eliminates a few instructions
- Statically resolved packet field offset and alignment can be accessed with a few instructions
- Implemented using a custom dataflow analysis
(Diagram: the L3 Switch PPF graph between Rx and Tx, with modules l3_switch.m (l2_cls.p), l3_fwdr.m (l3_cls.p, options_processor.p, icmp_processor.p, arp.p, lpm_lookup.p), eth_encap.m (encap.p), and l2_bridge.m (bridge.p); edges annotated with encapsulation changes (Eth → IP, IP → Eth, Eth → Arp, New ICMP → IP, Copy IP → ICMP → IP, Copy Eth) and resolution counts 18/18, 3/3, 2/2, and 1/1 resolved)
48. Eliminate Unnecessary Packet Primitives in Code
- Eliminate unnecessary packet_encap and packet_decap primitives
  - Balanced packet_encap and packet_decap in the same aggregate can be eliminated because they have no external effect
  - Works in conjunction with SOAR analysis results
- Convert metadata accesses into local memory accesses when all uses are within the same aggregate
  - Private uses of metadata have no external effect
  - Metadata accesses are composed of 1 SRAM access and 20 instructions
  - Candidate accesses can be identified with def-use analysis
49. Global Data Memory Mapping
- Collect dynamic access frequencies to shared global data structures
- Map data structures to appropriate memory levels
  - Map small, frequently accessed data structures to scratch memory
  - Otherwise, place in SRAM
- Pointers may point to objects in different levels of memory
  - Perform congruence analysis to allocate such objects to a common memory level
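A minimal sketch of such a mapper, with invented thresholds standing in for the real profile-driven policy (the actual compiler's cutoffs and cost model are not stated in the slides):

```c
#include <assert.h>

/* Illustrative-only memory mapper: small + hot -> Scratch,
 * otherwise SRAM. All thresholds are invented. */
typedef enum { MEM_SCRATCH, MEM_SRAM } mem_level_t;

typedef struct {
    const char *name;
    int  size_bytes;   /* static size of the global */
    long accesses;     /* profiled access count */
} global_t;

mem_level_t map_global(const global_t *g, int scratch_left) {
    const int  SMALL_BYTES  = 1024;   /* assumed "small" cutoff */
    const long HOT_ACCESSES = 10000;  /* assumed "frequent" cutoff */
    if (g->size_bytes <= SMALL_BYTES &&
        g->accesses   >= HOT_ACCESSES &&
        g->size_bytes <= scratch_left)
        return MEM_SCRATCH;
    return MEM_SRAM;   /* default: larger but plentiful */
}
```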
50. Delayed-Update Software-Controlled Caches
- Cache unprotected global data structures
  - Since these structures are not protected by locks, assume that they can tolerate delayed update
  - Delayed update results in some mishandled packets, tolerable for network applications
- Identify caching candidates automatically from profiling statistics
  - Frequently read in the packet processing core
  - Infrequently written by control and initialization routines
  - High predicted hit rate derived from profiling
- Good candidates
  - Configuration globals: MAC table, classification table
  - Lookup tables
51. Caching Route Lookups
- Packet forwarding routes are stored in trie tables
- Frequently executed path
  - Route lookups
- Infrequently executed path
  - Route updates
  - Updated with an atomic write
(Diagram: trie nodes indexed by 2-bit strides 00/01/10/11, holding entries a, b, c; a route update builds a new node and swaps it in with a single atomic write)
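A 2-bit-stride trie lookup in the spirit of the diagram; the table contents and the 4-bit address width are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal 2-bit-stride trie for longest-prefix route lookup. */
#define STRIDE 2
#define FANOUT (1 << STRIDE)

typedef struct node {
    struct node *child[FANOUT];
    int next_hop;               /* -1 if no route stored here */
} node_t;

/* Sample table: default route -> hop 0; prefix 01 -> hop 2. */
static node_t leaf      = { { 0 }, 2 };
static node_t root_node = { { 0, &leaf, 0, 0 }, 0 };

/* Look up the longest matching prefix for a 4-bit address,
 * consuming STRIDE bits per trie level. */
int lookup(const node_t *root, uint8_t addr4) {
    int best = -1;
    const node_t *n = root;
    for (int shift = 4 - STRIDE; n && shift >= 0; shift -= STRIDE) {
        if (n->next_hop >= 0) best = n->next_hop;   /* remember match */
        n = n->child[(addr4 >> shift) & (FANOUT - 1)];
    }
    if (n && n->next_hop >= 0) best = n->next_hop;  /* deepest node */
    return best;
}
```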
52. Delayed-Update Software-Controlled Caches
(Diagram: base vs. optimized access to shared data, with an infrequent write path and a frequent read path)
- Delayed-update coherency checks the home location only occasionally
- update_flag is set on any change to the cached variable
- Update check rate is set as a function of the tolerable error rate and the variable's expected load and store rates
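The check can be sketched as follows; CHECK_PERIOD and all names are invented, and a real implementation would derive the period from the tolerable error rate and the load/store rates as the slide describes:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of a delayed-update software-controlled cache: readers use
 * a local copy and re-check the shared "home" location only every
 * CHECK_PERIOD reads, so a write may be seen late but not lost. */
#define CHECK_PERIOD 16

static int  home_value  = 1;      /* shared data (simulated SRAM) */
static bool update_flag = false;  /* set by the infrequent write path */

/* Infrequent write path: update the home copy, raise the flag. */
void shared_write(int v) { home_value = v; update_flag = true; }

/* Frequent read path: mostly hits the local cached copy. */
int cached_read(void) {
    static int local_copy = 1;
    static int reads_since_check = 0;
    if (++reads_since_check >= CHECK_PERIOD) {   /* occasional check */
        reads_since_check = 0;
        if (update_flag) { local_copy = home_value; update_flag = false; }
    }
    return local_copy;   /* may be stale for < CHECK_PERIOD reads */
}
```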
53. Program Stack Layout Optimization
- Shangri-La's runtime model
  - Supports a calling convention
  - Stack holds PPFs' local variables and temporary spill locations
- Baker does not support recursion, so stacks can be assigned statically to different locations
- Want to assign disjoint stack frames to limited local memory
  - Stack is mapped to local memory and SRAM
  - Only 48 words / thread for the stack
54. Program Stack Layout Optimization
(Diagram: a 48-word local memory stack and SRAM overflow; frames shown: main() 16 words, PPF1 16 words, PPF2 32 words in local memory, PPF3 16 words in SRAM)
- PPFs higher in the call graph are assigned to local memory first
- Dispatch model ensures a relatively flat call graph
- If a PPF is called from two places, assign it to the minimum stack location that will not collide with live stack frames
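A toy layout pass under these rules; it models only the "frame starts where the caller's frame ends" placement, not the called-from-two-places refinement, and the sizes follow the slide's example (all names invented):

```c
#include <assert.h>

/* Static stack layout: each function's frame starts where its
 * caller's frame ends; frames that fit within the 48-word local
 * memory stay there, the rest spill to SRAM. Functions are indexed
 * so that every caller precedes its callees. */
#define LM_WORDS 48
#define NFUNC 4

/* 0 = main, 1 = PPF1, 2 = PPF2, 3 = PPF3; main calls PPF1 and PPF2,
 * PPF2 calls PPF3. caller[i] = -1 marks the root. */
static int caller[NFUNC]      = { -1, 0, 0, 2 };
static int frame_words[NFUNC] = { 16, 16, 32, 16 };

int start_off[NFUNC];   /* assigned start offset of each frame */
int in_lm[NFUNC];       /* 1 if the frame resides in local memory */

void layout_stacks(void) {
    for (int i = 0; i < NFUNC; i++) {
        int base = (caller[i] < 0) ? 0
                 : start_off[caller[i]] + frame_words[caller[i]];
        start_off[i] = base;
        in_lm[i] = (base + frame_words[i] <= LM_WORDS);  /* else SRAM */
    }
}
```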
55. Conclusions
- Proposed optimizations for generating code from high-level languages that is competitive with hand-tuned code
  - Memory-level optimizations
  - Program partitioning to heterogeneous cores
  - Optimizations to support packet abstractions
- Total system performance will be shown after we describe the code generation optimizations and the run-time system
56. Code Generation and Optimizations
57. Outline
- Compiler Flow
- Intel XScale Processor Code Generation
- MicroEngine Code Generation
58. Shangri-La Compiler Flow
59. Intel XScale Processor Code Generation
- Intel XScale processor
  - Runs configuration, management, control plane, and cold code
  - With an OS and virtually unlimited code store
  - Less performance critical
- Code generation
  - Shares the compilation path with the ME until WOPT
  - Regenerates C source code with a proper naming convention
  - Leverages an existing GCC compiler for the Intel XScale processor
- Issue with address translation
  - The Intel XScale processor uses virtual addresses; the ME uses physical-address memory types
  - Perform address translation only on the Intel XScale processor, for addresses exposed between the two types of cores
60. ME Code Generator
61. Register Allocation
- ME architectural constraints on assigning registers
  - Multiple register banks used in specific types of instructions
    - GPR banks, SRAM/DRAM transfer in/out banks, next-neighbor bank
  - Cannot use certain banks of registers for both A and B operands
    - E.g., the GPR A and GPR B banks
- ME register allocation framework
  - Step 1: identify candidate banks
    - For each TN (virtual register), identify all possible register banks at each occurrence according to the ME ISA
    - If there is at least one common register bank, follow conventional register allocation
62. Register Allocation (cont.)
- ME register allocation framework (cont.)
  - Step 2: resolve bank conflicts if no common bank exists
    - Locate conflicting edges
    - Partition the def-use graph
    - Add moves between sub-graphs
  - Step 3: allocate intra-set registers
    - Perform conventional register allocation, but observe the constraints on A and B operands
    - Add an edge between two source operands in the same instruction in the symbolic register conflict graph
    - Use different heuristics to balance the usage of the GPR A and B banks
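Step 1 above reduces to a set intersection, which can be sketched with bitmasks; the bank encoding here is invented for illustration, not the ME ISA's actual encoding:

```c
#include <assert.h>

/* Candidate-bank identification for one TN (virtual register):
 * intersect the banks permitted at every occurrence. */
enum {
    BANK_GPR_A    = 1 << 0,
    BANK_GPR_B    = 1 << 1,
    BANK_XFER_IN  = 1 << 2,
    BANK_XFER_OUT = 1 << 3,
    BANK_NN       = 1 << 4,
};

/* Returns the set of banks legal at all occurrences; 0 means no
 * common bank exists, i.e., Step 2 (conflict resolution) is needed. */
unsigned common_banks(const unsigned *allowed_at_occ, int n) {
    unsigned common = ~0u;
    for (int i = 0; i < n; i++)
        common &= allowed_at_occ[i];   /* intersect occurrence i */
    return common;
}
```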
63. Calling Convention and Stack
- Support a calling convention
  - Caller/callee-saved registers, parameter passing, etc.
  - Perform code generation, e.g. register allocation, within a function scope
  - Eases debugging and performance tuning, which can focus on changes only in the affected scope
- Support a calling stack despite no recursion
  - Stack frame for local vars, spilled parameters, register spills
  - Calling stack grows from LM to SRAM
    - Allocate disjoint stack frames on precious LM
  - Statically decide the memory level for a frame, for both performance and code size reasons
64. More Features in Code Generation and Optimizations
- Inter-procedural analysis and function inlining
- Global scalar optimization for register promotion
- Parameterized machine model for ease of porting
- Code size guard to throttle the aggressiveness of optimizations which increase code size
- Global instruction scheduling and latency hiding
- Bitwise optimizations
- Loop unrolling
65. The Run-time System
66. RTS Goals
- Adapt execution of the application to match the current workload
- Isolate the RTS user from hardware-specific features commonly needed for packet processing
67. Adaptation Opportunities
68. Outline
- RTS Design Overview
- Run-time Adaptation Mechanisms
- Binding
- Checkpointing
- State migration
- Run-time Adaptation Results
- Overheads and costs
- Benefits
- Future research
- Summary
69. RTS Theory of Operations
- System monitor: collects run-time statistics (queue depths) and triggers adaptation
- Resource planner and allocator: computes a new processor mapping based on global knowledge
- Resource Abstraction Layer (RAL): hides the implementation of processor resources
(Diagram: the run-time system consumes executable binaries, the hardware topology, and the traffic mix; the system monitor feeds queue depths and triggers to the resource planner and allocator, which maps aggregates A, B, and C onto the Intel XScale core and MEs through the RAL)
70. The Resource Abstraction Layer
- Three goals
  - Support adaptation: packet channels and locks
  - Allow common abstractions for the rest of the RTS code: processing units, network interfaces
  - Allow for portability of the compiler's code generator: data memory, packet memory, timers, hash, random
- Key lesson
  - Noble last goal, but the performance cost can be large
  - Focus on supporting adaptation
71. How the RAL Supports Adaptation
A MicroEngine-based example:
- RAL calls are initially undefined in the application .o file
- At run time, the RTS has the application .o file and the RAL .o file (RAL implementations 0-6)
- The linker adjusts jump targets using the import-variable mechanism
- The process is repeated after each adaptation
72. System Monitor and Resource Planner
- System monitor
  - Triggering policies
    - E.g., queue thresholds
- Resource planner and allocator
  - Mapping policies
    - Move code into/out of the fast path
    - Duplicate code within the fast path
73. Adaptation Mechanisms
- Binding
- Checkpointing
- State migration
74. Why Have Binding?
- Want to be able to use the fastest implementations of resources available
(Diagram: aggregates A and B remapped between the Intel XScale core and MEs; once A and B land on adjacent MEs, we can use NN rings and local locks)
75. Binding: The Value of Choosing the Right Resource
Performance tests and ratings are measured using
specific computer systems and/or components and
reflect the approximate performance of Intel
products as measured by those tests. Any
difference in system hardware or software design
or configuration may affect actual performance.
76. Binding: Compile-time or Not?
77. Checkpointing
- When migrating, the RTS follows a simple algorithm
  - Tell affected processing units to stop at the checkpoint location
  - Wait for each processing unit to reach the checkpoint location
  - Reload and run the processing units
78. Checkpointing (cont'd)
- Finding the best checkpoint is easier in packet processing than in general domains
- Leverage characteristics of data-flow applications
  - Typically implemented as a dispatch loop
  - Dispatch loop is executed at high frequency
  - Top of the dispatch loop has no stack information
- Since the compiler creates the dispatch loop, the compiler inserts checkpoints in the code
79. State Migration
- Once a processor has been checkpointed, state from old resources must be moved to new resources
  - E.g., packets sitting in previous packet channel implementations, cached data
- Solution
  - Copy packets in old channels to new channels
  - Flush any caches
80. Adaptation Results
- Adaptation costs (i.e., overheads)
  - Checkpointing
  - Loading
  - Binding
  - State migration (not covered)
  - Cumulative effects
- Adaptation benefits
- Experimental setup
  - Radisys, Inc. ENP2611
    - 1 x 600 MHz Intel IXP2400 processor
  - MontaVista Linux
  - Timer measurement accuracy: 0.53us

Third-party brands/names are the property of their respective owners.
81. Checkpointing Overhead
- Factors
  - Time to inform a processing unit to stop at the checkpoint
    - ME: 60us; Intel XScale core: 34us
  - Time to check if all threads have stopped
    - ME: 3us; Intel XScale core: 3us
  - Time to start a processing unit
    - ME: 0.036ms; Intel XScale core: 0.097ms (Linux kernel thread)
82. Loading Overhead
- Intel XScale core thread start time: 0.054ms
- Graph shows ME load times
83. Binding Overhead
- Intel XScale core binding
84. Cumulative Effects of Adaptation Overheads
- Not all adaptation time represents an inoperable system
  - Can leave some processors running while checkpointing others
85Adaptation Overhead Learnings
Summary
RTS Design
Adaptation Mechanisms
Adaptation Results (6/7)
- Overall adaptation time is:
- Linking time + (checkpointing and loading time) × number of cores
- Packet loss occurs during checkpointing and loading, but not during binding
- So, focus optimizations on starting, stopping, and loading
- Exchange time in loading for more time in linking
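The cost model in the first bullet can be sketched in a few lines. This is only an illustration: the linking and per-core times below are hypothetical placeholders; only the shape of the formula comes from the slide.

```python
# Sketch of the overall adaptation-time model:
#   linking_time + (checkpoint_time + load_time) * num_cores

def adaptation_time_ms(linking_ms, checkpoint_ms, load_ms, num_cores):
    # Packet loss occurs only while checkpointing and loading,
    # so those per-core terms are the ones worth optimizing.
    return linking_ms + (checkpoint_ms + load_ms) * num_cores

# Hypothetical figures: 1 ms linking, 0.060 ms checkpoint and
# 0.5 ms load per ME, 8 MEs.
total = adaptation_time_ms(linking_ms=1.0, checkpoint_ms=0.060,
                           load_ms=0.5, num_cores=8)
```

Note how the per-core terms dominate as the core count grows, which matches the advice to trade loading time for linking time.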
86 Theoretical Benefits of Adaptation
Adaptation Results (7/7)
- For more details see paper in HotNets-II
- http://nms.lcs.mit.edu/HotNets-II/papers/adaptation-case.pdf
87 Future Research
Summary (1/2)
- Gather experimental evidence of the benefits of adaptation
- Define and develop performance determinism in the face of adaptation
- Apply power scaling to adaptation mechanisms
- Enable commercial operating systems to coexist with adaptation
88 Summary
Summary (2/2)
- An adaptive run-time system provides benefits in performance, supported services, and power consumption
- The system can be built with a truly programmable large-scale chip multiprocessor; it requires:
- Checkpointing
- Binding
- State migration
- Adaptation costs come primarily from loading and checkpointing times, so optimize these
89 Shangri-la Performance Evaluation
- Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004
90 Evaluation Setup
Setup (1/3)
- Hardware
- Radisys ENP2611 evaluation board (3 x 1 Gbps optical ports)
- IXIA packet generator (2 x 1 Gbps optical ports)
- Currently only capable of generating 2 Gbps of traffic
- Benchmarks
- L3-Switch (3126 lines): L2 bridging and L3 forwarding
- Firewall (2784 lines): simple firewall using ordered rule-based classification
- MPLS (4331 lines): Multi-Protocol Label Switching (transit node)
- Packet traces
- L3-Switch and MPLS evaluated using NPF packet traces
- Firewall used a custom packet trace
Third party brands/names are property of their
respective owners
91 Mt. Hood Board
Setup (2/3)
- One Intel IXP2400
- Three 1Gbps optical ports
- 64MB DRAM
- 8MB SRAM
92 Test Development Environment
Setup (3/3)
- Linux host machine
- Provides power to the Radisys ENP2611 board via the PCI bus
- Compiles code for the MEs and the Intel XScale core
- Runs an NFS server
- Intel XScale core running Linux and the Shangri-la RTS
- Reads generated binaries from the host machine's NFS server and loads them onto the MEs
[Diagram: Linux host connected to the Radisys ENP2611 board over Ethernet and a serial cable; IXIA packet generator attached via 2 x 1 Gbps optical links]
93 Instruction and memory budgets at 2.5 Gb/s
Resource Budgets (1/5)
- Assumed memory access latency: ~100 cycles
- Scratch memory: 60 cycles
- SRAM: 90 cycles
- DRAM: 120 cycles
- Memory access budget refers only to the number of memory accesses that can be overlapped with computation
- Does not account for the bandwidth of SRAM/DRAM
94 Evaluating Intel IXP2400 Memory Bandwidth
Resource Budgets (2/5)
- Modified empty PPF connected to Rx and Tx
- Add a loop to access the chosen memory level n times
- Graph throughput of various configurations
- n = 1, 2, 4, ..., 1024
- Memory accessed: Scratch, SRAM, DRAM
- Results using minimum-sized 64B packets
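The shape of the resulting curves can be anticipated with a toy model: if the n accesses serialize, throughput falls roughly as 1/n. A sketch only; the per-packet overhead constant is hypothetical, and real throughput also depends on bandwidth and threading.

```python
# Toy model: forwarding rate when each packet performs n serialized
# accesses of the given latency on a 600 MHz ME.

def modeled_rate_mpps(n_accesses, latency_cycles,
                      clock_mhz=600, overhead_cycles=50):
    # Total cycles spent per packet, then convert to millions
    # of packets per second.
    cycles_per_pkt = overhead_cycles + n_accesses * latency_cycles
    return clock_mhz / cycles_per_pkt

# DRAM-like latency (120 cycles) for n = 1, 2, 4 accesses per packet.
rates = [modeled_rate_mpps(n, 120) for n in (1, 2, 4)]
```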
95 Scratch Memory Bandwidth
Resource Budgets (3/5)
- Significant difference in memory bandwidth
consumed according to access size
96 SRAM Memory Bandwidth
Resource Budgets (4/5)
- Behavior is similar to Scratch Memory
97 DRAM Memory Bandwidth
Resource Budgets (5/5)
- DRAM accesses significantly constrain forwarding
rate
98 L3-Switch
Benchmarks (1/4)
- Performs core router functionality
- Bridges packets not destined for this router
- Handles ARP packets for resolving Ethernet addresses
- Routes IP packets targeting this router
[Pipeline diagram between Rx and Tx; modules: l3_switch.m, l2_bridge.m, l3_fwdr.m, eth_encap.m; PPFs: l2_cls.p, bridge.p, l3_cls.p, lpm_lookup.p, arp.p, icmp_processor.p, options_processor.p, encap.p]
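The lpm_lookup.p stage performs a longest-prefix-match route lookup. A minimal sketch of the idea (the routing table is hypothetical, and real IXP code would use trie structures in SRAM/DRAM rather than a linear scan of a Python dict):

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop.
ROUTES = {
    "10.0.0.0/8":  "gw-a",
    "10.1.0.0/16": "gw-b",
    "10.1.2.0/24": "gw-c",
}

def lpm_lookup(dst_ip):
    """Return the next hop for the longest matching prefix, or None."""
    dst = ipaddress.ip_address(dst_ip)
    best, best_len = None, -1
    for prefix, hop in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if dst in net and net.prefixlen > best_len:
            best, best_len = hop, net.prefixlen
    return best
```

For "10.1.2.3" the /24 entry wins over the /16 and /8, which is exactly the longest-prefix-match rule.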
99 Firewall
Benchmarks (2/4)
- Filters out unwanted packets from a WAN (e.g., the Internet)
- Assigns flow IDs to packets according to a user-specified rule list using src IP, dst IP, src port, dst port, TOS, and protocol
- Drops packets for specified flow IDs
- Optimizes the assignment of flow IDs
- First tries to find the flow ID in a hash table, placed there by a previous packet with the same fields
- Otherwise, does a long search by testing the rules in order
[Pipeline diagram between Rx and Tx; modules: classifier.m, firewall.m; PPFs: hash_lookup.p, long_search.p, firewall.p]
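The two-level classification described above can be sketched as a hash-table fast path backed by an ordered rule scan. The rules and flow IDs here are hypothetical; only the fast-path/slow-path structure comes from the slide.

```python
# Fast path: cache keyed on the packet's classification fields
# (role of hash_lookup.p).
flow_cache = {}  # (src, dst, sport, dport, proto) -> flow_id

# Slow path: user-specified rules tested in order, first match wins
# (role of long_search.p). Hypothetical rule list.
RULES = [
    (lambda k: k[4] == "tcp" and k[3] == 23, 1),  # telnet -> flow 1
    (lambda k: k[4] == "udp", 2),                 # any UDP -> flow 2
    (lambda k: True, 0),                          # default flow
]

DROP_FLOWS = {1}  # flow IDs the firewall drops

def classify(key):
    if key in flow_cache:            # fast path hit
        return flow_cache[key]
    for match, flow_id in RULES:     # slow path: ordered scan
        if match(key):
            flow_cache[key] = flow_id
            return flow_id

def accept(key):
    return classify(key) not in DROP_FLOWS
```

Packets after the first in a flow hit the cache and skip the ordered scan entirely.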
100 Multi-Protocol Label Switching (MPLS)
Benchmarks (3/4)
- Routes packets using attached labels instead of IP addresses
- Reduces routing hardware requirements
- Facilitates high-level traffic management of user-defined packet streams
[Pipeline diagram between Rx and Tx; modules: mpls_app.m, mpls.m, l2_bridge.m, l3_fwdr.m, eth_encap.m; PPFs: l2_cls.p, bridge.p, ilm.p, ops.p, ftn.p, arp.p, encap.p]
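At a transit node the core operation is a label swap driven by the incoming label map (ilm.p). A sketch with a hypothetical label table:

```python
# Hypothetical incoming label map: in label -> (out label, out port).
ILM = {
    100: (200, 1),
    101: (201, 2),
}

def transit_forward(label_stack, ilm=ILM):
    """Swap the top label and return (new_stack, out_port).
    Returns None when the packet needs slow-path handling
    (empty stack or unknown label)."""
    if not label_stack or label_stack[0] not in ilm:
        return None
    out_label, port = ilm[label_stack[0]]
    return ([out_label] + label_stack[1:], port)
```

Only the top of the label stack is touched; labels below it (stacked labels) pass through unchanged, which is why label stacking defeats SOAR on this benchmark.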
101 Other Network Benchmarks
Benchmarks (4/4)
- Network address translation (NAT)
- Allows multiple LAN hosts to connect to a WAN (e.g., the Internet) through one IP address
- Achieved by remapping LAN IPs and ports
- WAN hosts only see the NAT router
- Manages a table mapping active connections between the LAN and WAN
- Quality of Service (QoS)
- Allows partitioning of available bandwidth among user-specified traffic streams
- Packet streams are throttled by intentionally dropping packets
- Header compression
- Reduces the size of transmitted packet headers
- Since many fields are similar for packets in the same flow, achieves compression by transmitting only the differences
- Various security features
- Encryption / decryption
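The NAT remapping described above can be sketched as a pair of tables: outbound flows get a fresh WAN port, and the reverse table maps replies back to the LAN host. All addresses and port ranges here are hypothetical.

```python
WAN_IP = "203.0.113.1"  # the one address WAN hosts see

class Nat:
    def __init__(self):
        self.next_port = 40000
        self.out = {}   # (lan_ip, lan_port) -> wan_port
        self.back = {}  # wan_port -> (lan_ip, lan_port)

    def outbound(self, lan_ip, lan_port):
        # Reuse the existing mapping for an active connection,
        # otherwise allocate the next free WAN port.
        key = (lan_ip, lan_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return WAN_IP, self.out[key]

    def inbound(self, wan_port):
        # Map a reply back to the LAN host; None means drop.
        return self.back.get(wan_port)
```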
102 Dynamic Memory Accesses
Results (1/4)
SWC = software cache; PHR = pkt handling removal; SOAR = static offset alignment; PAC = pkt access combining; -O2 = inline pkt handling; -O1 = typical scalar opts; BASE = no opts
- Table shows the average per-packet access count
- PAC significantly reduces packet memory accesses
- -O1 enables the pipeline to fit on one ME
- SWC and PHR also contribute to reduced memory accesses
103 L3-Switch Forwarding Rate
Results (2/4)
Top-end performance is still constrained by memory bandwidth; PHR and SWC alleviate this somewhat
- PHR and SOAR have the most impact on forwarding rate
- PAC and SOAR have the most impact on L3-Switch forwarding rate
- Forwarding rate of minimum-sized packets (64B)
- Reduced memory access and instruction count both
improve forwarding rate
104 Firewall Forwarding Rate
Results (3/4)
Per-ME performance improvement dominated by PAC
105 MPLS Forwarding Rate
Results (4/4)
SOAR does not help this application due to
stacking of MPLS labels
- Results are for MPLS transit only (internal routers of an MPLS domain)
- Similar performance characteristics to L3-Switch
106 Conclusions
- Demonstrated performance from a high-level language comparable to hand-tuned code
- Memory-level optimizations
- Program partitioning to heterogeneous cores
- Optimizations to support packet abstractions
- Language features are more attractive as users can enjoy ease of programming without sacrificing performance
- Modular program design
- Packet model supporting encapsulation, metadata, and bit-level accesses
- Flat memory model
- Able to achieve 2 Gbps on L3-Switch, Firewall, and MPLS Transit
107 Summary
- Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004
108 Summary - Baker
- Goals
- Enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor)
- Enable good execution performance
- Approach
- Hide hardware details
- Expose domain-specific constructs
- Reduced C features
109 Summary - Compiler Optimizations
- Demonstrated performance from a high-level language comparable to hand-tuned code
- Memory-level optimizations
- Program partitioning to heterogeneous cores
- Optimizations to support packet abstractions
- Machine-specific optimizations
- Able to achieve maximal packet forwarding rates (2 Gbps) on L3-Switch, Firewall, and MPLS Transit
110 Summary - Runtime Adaptation
- An adaptive system is important for packet processing to adapt to varying workloads dynamically
- Benefits in performance, services, and power consumption
- The system can be built with a truly programmable large-scale chip multiprocessor; it requires:
- Checkpointing
- Binding
- State migration
- Adaptation costs come primarily from loading and checkpointing times, so optimize these
111 Key Learnings
- High-level language features ease programming for complex multi-core network processors
- Effective compiler optimizations are able to achieve performance comparable to hand-tuned systems
- Architecture-specific, domain-specific, and general optimizations are all critical to obtaining high performance
- Ease of programming and performance can coexist
- Runtime adaptation is a key feature of future network systems
- The system can be built with a large-scale CMP
- Many learnings are applicable to general CMP systems