Title: MICRO-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors
1. MICRO-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors
- Michael K. Chen
- Erik Johnson
- Roy Ju
- {michael.k.chen, erik.j.johnson, roy.ju}_at_intel.com
- Corporate Technology Group, Intel Corp.
- December 5, 2004
2. Agenda
- Project Overview
- Domain-specific Language
- High-level Optimizations
- Code Generation and Optimizations
- Performance Characterization
- Runtime Adaptation
- Summary
3. Project Overview
- Part of the Shangri-La Tutorial presented at MICRO-37, December 5, 2004
4. Outline
- Problem Statement
- Overview of Shangri-la System
- Status and Teams
5. The Problem
Packet processing applications
- State-of-the-art
  - Hand-tuned code for maximal performance
  - → but often error-prone and not scalable
  - Static resource allocation often tailored to one particular workload
  - → not flexible to varying workloads and hardware
6. Shangri-La Overview
- Mission: research an industry-leading programming environment for packet processing on Intel chip multiprocessor (CMP) silicon
- Challenges
  - Hide architectural details from programmers
  - Automate allocation of system resources
  - Adapt resource allocation to match dynamic traffic conditions
  - Achieve performance comparable to hand-tuned systems
- Technology
  - Language: enable portable packet processing applications
  - Compiler: automate code partitioning and optimizations
  - Run-time system: adapt to dynamic workloads
7. Architectural Features of the Intel IXP Processor
- Heterogeneous, multi-core
  - Intel XScale processor (control) and MicroEngines (data)
- Memory hierarchy
  - Local memory (LM) distributed on MEs
  - No HW cache
  - Scratch, SRAM, DRAM shared
  - Long memory latency
- MicroEngine
  - Single issue, with deferred slots
  - Lightweight HW multi-threading
  - Event signals to synchronize threads
  - Multiple register banks, with constraints on operands in instructions
  - Limited code store
8. Packet Processing Applications
- Types of apps
  - IPv4 forwarding, L3-Switch, MPLS (Multi-Protocol Label Switching), NAT (Network Address Translation), Firewall, QoS (Quality of Service)
- Characteristics of packet processing apps
  - Performance metric: throughput (vs. latency)
  - Mostly memory bound
  - Large numbers of packets without locality
  - Smaller instruction footprint
  - Execution paths tend to be predictable
9. Anatomy of Shangri-La
(Diagram: the Shangri-La toolchain, shown alongside a general-purpose compiler and language(s))
- Baker programming language: modular language (with C-like syntax) to express applications as a dataflow graph
- Baker compiler front-end
- Profiler: extract run-time characteristics by executing the application
- Pi compiler (inter-procedural opt., loop/memory opt.): compiler optimizations for pipeline construction and data structure mapping/caching
- Aggregate compiler (global optimizations, code generation): code generation and optimization for heterogeneous cores
- Run-time system (execution environment): dynamically adapt mapping to match traffic fluctuations
10. Baker Language
- Familiar to embedded systems programmers
  - Syntactically feels like C
  - Simplifies the development of packet-processing applications
- Hides architectural details
  - Single level of memory
  - Implicit threading model
  - Modular programming and encapsulation
- Domain-specific
  - Data-flow model
  - Actors and interconnects (PPFs and channels)
  - Built-in types, e.g. packet
- Enables the compiler to generate efficient code on target CMP hardware
11. Shangri-La Example
Modular, simple description:

    module l3_switch {
      module eth_rx, eth_tx;   // Built-in PPFs
      module l2_clsfr, eth_encap_mod, l3_fwdr, l2_bridge;
      wiring {
        eth_rx.eth0 -> l3_switch.l2_clsfr.input_chnl;
        l3_fwdr.input_chnl <- l2_clsfr.l3_fwrd_chnl;
        ... ...
      }
    }

Looks like C:

    int l3_switch.l2_clsfr.process(ether_packet_t in_pkt) {
      ...
      if (fwd) {
        p = packet_decap(in_pkt);
        channel_put(l3_forward_chnl, p);
      } else {
        channel_put(l2_bridge_chnl, in_pkt);
      }
    }
(Diagram: the L3 Switch dataflow graph: Rx → L2 Cls → {L3 Fwdr, L2 Bridge} → Eth Encap → Tx)
- IR run on IR-simulator
- Stimulated by packet trace
- Statistics stored in IR
12. Compiler and Optimizations
- Perform program and data partitioning
  - Cluster multiple finer-grained components into larger aggregates
  - Balance between replication and pipelining
  - Automatic data mapping onto the memory hierarchy
- Optimizations and code generation
  - Code generation for heterogeneous processing cores
  - Global machine-independent optimizations
  - Optimizations for the memory hierarchy
  - Machine-dependent code generation and optimizations
13. Shangri-La Example
- Internal channels converted to function calls
- Each aggregate given a main with a while(1) loop
- Aggregates are groups of PPFs; critical-path PPFs are placed in the same aggregate
(Diagram: the L3 Switch PPFs partitioned into executable binaries for the Intel XScale processor and the MEs)
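The while(1) dispatch loop and the channel-to-call conversion described above can be sketched in plain C. All names, the ring-buffer channel, and the PPF bodies are illustrative stand-ins, not Shangri-La's actual generated code, and the loop is bounded here so the sketch terminates:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical channel: a tiny ring buffer standing in for a real
 * channel implementation (capacity assumed never exceeded here). */
#define CHAN_CAP 8
typedef struct { int buf[CHAN_CAP]; int head, tail; } channel_t;

static int channel_get(channel_t *c, int *pkt) {
    if (c->head == c->tail) return 0;            /* channel empty */
    *pkt = c->buf[c->head % CHAN_CAP]; c->head++;
    return 1;
}
static void channel_put(channel_t *c, int pkt) {
    c->buf[c->tail % CHAN_CAP] = pkt; c->tail++;
}

/* Two PPFs clustered into one aggregate become plain function calls. */
static int l2_cls(int pkt)  { return pkt + 1; }   /* stand-in work */
static int l3_fwdr(int pkt) { return pkt * 2; }   /* stand-in work */

/* The aggregate's generated main: the real loop is while(1); this
 * sketch drains the input channel and returns the packet count. */
int aggregate_main(channel_t *in, channel_t *out) {
    int pkt, processed = 0;
    while (channel_get(in, &pkt)) {   /* real code: while (1) + block */
        pkt = l2_cls(pkt);            /* internal channel -> call */
        pkt = l3_fwdr(pkt);
        channel_put(out, pkt);
        processed++;
    }
    return processed;
}
```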
14. Run-time Adaptation
- Workloads fluctuate over time
  - Usually over-provision to handle the worst case
- Adapt to workload
  - Change mapping to increase performance when needed
  - Power down unneeded processors
- Adaptation requirements
  - Hardware-independent abstraction
  - Querying of resource utilization
15. Shangri-La Example
- Automatically map aggregates to processing units
- Automatically remap at runtime
(Diagram: the L3 Switch aggregates mapped, and remapped, across the Intel XScale core and MEv2 1-8)
16. Project Status
- Project started in Q1 2003
- Collaboration among Intel, the Chinese Academy of Sciences, and UT-Austin
- Compiler based on the Open Research Compiler (ORC)
- A completed prototype system achieving the maximal packet forwarding rate on a number of applications
- Research project to transfer technology to product groups
17. Acknowledgements
- Communication Technology Lab, Intel
  - Erik Johnson, Jamie Jason, Aaron Kunze, Steve Goglin, Arun Raghunath, Vinod Balakrishnan, Robert Odell
- Microprocessor Technology Lab, Intel
  - Xiao Feng Li, Lixia Liu, Jason Lin, Mike Chen, Roy Ju, Astrid Wang, Kaiyu Chen, Subramanian Ramaswamy
- Institute of Computing Technology, Chinese Academy of Sciences
  - Zhaoqing Zhang, Ruiqi Lian, Chengyong Wu, Junchao Zhang, Jiajun Wu, HanDong Ye, Tao Liu, Bin Bao, Wei Tang, Feng Zhou
- University of Texas at Austin
  - Harrick Vin, Jayaram Mudigonda, Taylor Riche, Ravi Kokku
18. The Baker Language
19. Baker Overview and Goals
- Baker is C with data-flow and packet processing extensions
- Goal 1: Enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor)
  - Encourage interesting, complex application development
  - Be familiar to embedded systems programmers: should start with C
- Goal 2: Enable good execution performance
  - Scalable performance across new versions of large-scale CMPs
  - Expose compiler and run-time system to optimization opportunities
  - Don't constrain the compiler's ability to place code or data; don't preclude run-time adaptation
20. Outline
- The Baker Approach
- Hardware abstractions and models
- Domain-specific constructs
- Standard C language feature reductions
- Results
- Future Research
- Summary
21. Baker's Hardware Models
- Memory
- Concurrency, i.e., cores and threads
- I/O, e.g., receive and transmit
22. A Single-level Memory Model
- Baker exposes a single-level, shared memory model, like C
- Makes programming easier
  - Variable declaration and use, malloc/free work just like C
- Enables compiler freedom in optimizing data placement
  - Move the most-accessed data structures (or parts of structures) to the fastest memory
- Enables the compiler to move code to any core
  - Code not tied to a particular core's physical memory model
23. Implicitly Threaded Concurrency
- Baker exposes a multithreaded concurrency model
  - Programmer knows code may execute concurrently
  - Programmer does not
    - Know the number of cores
    - Explicitly create or destroy threads
  - A consequence: programmers must protect shared memory with locks
- Enables compiler and run-time system to optimize execution
  - Can create an application pipeline and balance it
  - Can optimize locks based on which processors access a lock
24. Example of Implicit Threading
Code shown for illustrative purposes only and
should not be considered valid.
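The slide's own code does not survive in this text, so here is a hedged C/pthreads sketch of the model, with all names invented: the PPF body is written as ordinary sequential code plus a lock around shared state, while thread creation belongs entirely to the (here simulated) run-time system, never to the Baker programmer:

```c
#include <pthread.h>
#include <assert.h>

/* Shared PPF state: in Baker the programmer must guard it with a
 * lock because the run-time may run the PPF on many threads. */
static long pkt_count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Body of a hypothetical PPF: written as if single-threaded,
 * except for the explicit lock around the shared counter. */
static void *ppf_body(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&count_lock);
        pkt_count++;                      /* shared-state update */
        pthread_mutex_unlock(&count_lock);
    }
    return NULL;
}

/* Stand-in for the run-time system: the thread count (up to 16
 * here) is the system's choice, invisible to the PPF author. */
long run_ppf_on(int nthreads) {
    pthread_t t[16];
    pkt_count = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, ppf_body, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return pkt_count;
}
```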
25. I/O As A Driver Model
- RX and TX require hardware knowledge
  - E.g., PHYs and MACs, RBUFs, TBUFs, flow-control hardware
- Difficult to abstract this hardware using common C constructs
- Solution
  - Don't write these in Baker
  - Written once, in assembly, by the system vendors for each board
  - Baker developers use receive and transmit code like a device driver
26. Exposing Domain Features
- Hiding hardware features can drastically decrease performance
- Baker exposes application domain features to compensate
  - Tailor the compiler and run-time system optimizations to the domain
  - Programmer is forced to help the compiler and run-time system find parallelism
    - But in a natural way
- Two types of domain features
  - Data-flow abstractions
  - Packet processing abstractions
27. Data Flow Overview
- A data flow is a directed graph
  - Graph nodes are called actors (or packet-processing functions, PPFs, in Baker) and represent the computation
  - Graph edges are called channels and move data between actors
- Data flow is a natural fit for the packet processing domain
28. Data Flow: PPFs and Channels
- PPFs (or actors)
  - Implicitly concurrent
  - Stateful
  - Support multiple inputs and outputs
  - No assumptions about a steady rate of packet consumption
- Channels
  - Queue-like properties
    - Asynchronous, unidirectional, typed, reliable
  - Active and passive varieties
  - Can be replaced with function calls
  - Run-time system can choose an optimal implementation
    - E.g., scratch rings vs. next-neighbor rings
29. Packet Processing Features
- Packets and meta-data as first-class objects
- Packets
  - Programmer accesses packet data through a special pointer type; all packet accesses go through these pointers
  - Allows the compiler to coalesce reads/writes, avoid head and tail manipulation, etc.
- Meta-data
  - Storage associated and carried with a packet
    - E.g., input port, output port, etc.
  - Accessed via the packet's pointer
  - Useful to programmers to carry per-packet state passed between actors
  - Language ensures that meta-data is created before it is used
30. Example Application
Code shown for illustrative purposes only and
should not be considered valid.
31. Reduce Language Features
- By removing some features of C, the compiler is able to make more optimizations
- Type-safe pointers
  - Compiler is able to do much better alias analysis
  - Networking code typically does not use tricky pointer manipulations
- Some features needed to be removed to avoid large overheads on the MicroEngines
- Recursion
  - No natural stack on the MicroEngine, so the compiler has to implement one
  - Eliminating recursion simplifies stack analysis
- Function pointers
  - Removed for similar reasons as recursion
  - Unfortunately, network programmers actually use them a great deal
32. Results
- Source lines of code measured using sloccount
  - Does not do complexity analysis; does not handle assembly code

These tests and ratings are measured using specific computer systems and/or components and reflect the size of the indicated code as measured by those tests. Any difference in system hardware or software design or configuration may affect actual sizes.
33. Future Research
- Existing languages expose packets as completely independent; however, flows are a more appropriate independence class for data in this domain
- How should flows of packets be represented in a language, and how should we optimize around them?
  - Automated ordering
  - Flow-data locality improvements
  - Flow-lock elision
34. Summary
- Goals
  - Enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor)
  - Enable good execution performance
- Approach
  - Hide hardware details
    - Single memory, implicit threading, RX/TX as drivers
  - Expose domain-specific constructs
    - Data-flow, packets, meta-data
  - Reduce C features
    - Type-safe pointers, recursion, function pointers
35. High-Level Optimizations
36. Shangri-La Compiler Overview
- Convert Baker program into compiler intermediate
representation (IR)
- Derive run-time characteristics by simulating
application
- Compiler optimizations for pipeline construction
and data structure mapping/caching
- Code generation and optimization for
heterogeneous cores
- Load application and perform dynamic resource
linking
37. Profiling Overview
- Simulation of high-level IR
  - Developed a custom IR interpreter
  - Different from the traditional 2-pass profiling
  - Profiling information guides optimizations in later phases
- Stimulated using user-supplied packet traces
- Information collected
  - Execution frequency
  - Communication
  - Memory access statistics
38. Pi Compiler Details
- Performs most high-level optimizations
  - Mapping PPFs to heterogeneous cores
  - Assigning memory levels to global data structures
  - Performing inter-procedural analysis for the optimizations needing its support
- Guided by profiling results
39. Supporting Language Features with Compiler Optimizations
- Modular, dataflow language → automatic program partitioning
- Packet abstraction model → packet handling optimizations
- Flat memory hierarchy → automatic memory mapping
40. Key Compiler Technologies
- Automatic program partitioning to heterogeneous cores
- Packet handling optimizations
  - Packet access combining
  - Static offset and alignment resolution
  - Packet primitive removal
- Partitioned memory hierarchy optimizations
  - Memory mapping
  - Delayed-update software-controlled caches
  - Program stack layout optimization
41. Partitioning Across Heterogeneous Cores
- Partition across the Intel XScale core and multiple MEs
- Partitioning considerations
  - Identifying control and data planes
  - Minimizing inter-processor communication costs
    - Account for dynamic characteristics using profiling results
  - Satisfying code size constraints
- Different memory addresses seen by different cores
  - Insert address translations
  - Minimize insertions and their impact on performance
42. Inputs Into the Partitioning Algorithm
- Throughput-driven cost model
  - Eliminates latency from consideration
  - Expresses the goal appropriately for the domain
- Relevant profiling statistics
  - PPF execution time
  - Global data access frequency
  - Channel utilization
- Possible partitioning strategies
  - Pipelining the application across cores
  - Replicating the application across cores
43. Partitioning Algorithm
(Diagram: the Pi compiler partitioning flow over the L3 Switch example: Rx, Tx, L2 Cls, L3 Fwdr, Eth Encap, L2 Bridge)
- Intra-PPF IPA, with code size and execution time estimates feeding the memory mapper
- Aggregate formation: merge the PPFs with the highest communication cost
- Intra-aggregate IPA, then aggregate dump
- Duplicate the aggregate with the lowest throughput
- Duplicate the entire pipeline on available MEs
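A toy version of the aggregate-formation step above, assuming an invented cost model and code-size budget (not the compiler's actual heuristics): greedily merge the pair of PPFs joined by the highest-traffic channel while the merged aggregate still fits a per-ME code-store budget:

```c
#include <assert.h>

#define NPPF 4

/* Per-PPF code size (words) and channel traffic between PPFs;
 * all numbers are invented for illustration. */
static int code_size[NPPF] = { 300, 200, 250, 150 };
static int traffic[NPPF][NPPF] = {
    { 0, 90, 10,  0 },
    { 90, 0, 40,  5 },
    { 10, 40, 0, 70 },
    { 0,  5, 70,  0 },
};

int agg[NPPF];   /* agg[i] = aggregate that PPF i belongs to */

static int agg_size(int a) {
    int s = 0;
    for (int i = 0; i < NPPF; i++) if (agg[i] == a) s += code_size[i];
    return s;
}

/* Greedy aggregate formation under a per-ME code-store budget. */
void form_aggregates(int budget) {
    for (int i = 0; i < NPPF; i++) agg[i] = i;  /* one PPF per aggregate */
    for (;;) {
        int bi = -1, bj = -1, best = 0;
        /* find the heaviest channel crossing aggregate boundaries */
        for (int i = 0; i < NPPF; i++)
            for (int j = i + 1; j < NPPF; j++)
                if (agg[i] != agg[j] && traffic[i][j] > best &&
                    agg_size(agg[i]) + agg_size(agg[j]) <= budget) {
                    best = traffic[i][j]; bi = i; bj = j;
                }
        if (bi < 0) break;                       /* no feasible merge left */
        int from = agg[bj], to = agg[bi];
        for (int k = 0; k < NPPF; k++) if (agg[k] == from) agg[k] = to;
    }
}
```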
44. Packet Access Combining
- Basic packet accesses are powerful
  - Support for language features
  - Naïve mapping results in at least one memory access per packet access
- Combine multiple packet / metadata accesses
  - L3-Switch has 24 packet accesses per packet on the critical path
  - Take advantage of the IXP's wide DRAM access instruction
  - Buffer values in local memory or transfer registers
45. Packet Access Combining Example
Before combining:

    t1 = pkt->ttl   (off=64b, sz=8b)
    t2 = pkt->prot  (off=72b, sz=8b)

After combining:

    b  = read pkt (off=64b, sz=16b)
    t1 = ( b >> 8 ) & 0xff
    t2 = b & 0xff

- Analysis overview
  - Isolate packet accesses
  - Perform checks to guarantee packet accesses are combined safely
  - Validate range and size of the combined memory access
  - Replace combined accesses with accesses to/from local memory / transfer registers
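The transformation can be mimicked in plain C: two narrow field reads are replaced by one wide read plus register extractions, mirroring the slide's b = read(off=64b, sz=16b) example. The packet bytes here are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Toy packet: ttl at bit offset 64 (byte 8) and prot at bit
 * offset 72 (byte 9), as in the slide's example. */
static const uint8_t pkt[20] = {
    [8] = 64,   /* TTL */
    [9] = 6,    /* protocol, 6 = TCP */
};

/* Naive mapping: one (simulated) memory access per packet access. */
static void read_fields_naive(uint8_t *ttl, uint8_t *prot) {
    *ttl  = pkt[8];
    *prot = pkt[9];
}

/* Combined: one wide 16-bit read, fields extracted in registers. */
static void read_fields_combined(uint8_t *ttl, uint8_t *prot) {
    uint16_t b = (uint16_t)((pkt[8] << 8) | pkt[9]);  /* one wide read */
    *ttl  = (b >> 8) & 0xff;
    *prot = b & 0xff;
}
```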
46. Static Offset and Alignment Resolution (SOAR)

    packet_encap: offset( src_ip ) = 26B
    packet_decap: offset( src_ip ) = ???

- Generic packet accesses
  - Can handle arbitrary layering of protocols and arbitrary field offsets
  - Clearly simplifies the programmer's task
- But dynamic offset and alignment determination add significant overheads
  - Dynamic offset handling adds 20 instructions per packet access
  - Dynamic alignment adds several instructions per packet access
47. Static Offset and Alignment Resolution (SOAR)
- Statically resolved packet field alignment eliminates a few instructions
- Statically resolved packet field offset and alignment can be accessed with a few instructions
- Implemented using a custom dataflow analysis
(Diagram: the L3 Switch PPF graph between Rx and Tx, with modules l3_switch.m (l2_cls.p), l3_fwdr.m (l3_cls.p, options_processor.p, icmp_processor.p, arp.p, lpm_lookup.p), eth_encap.m (encap.p), and l2_bridge.m (bridge.p); edges annotated with encapsulation changes (Eth → IP, IP → Eth, Eth → Arp, New ICMP → IP, Copy IP → ICMP → IP, Copy Eth) and resolution counts 18/18, 3/3, 2/2, and 1/1 resolved)
48. Eliminate Unnecessary Packet Primitives in Code
- Eliminate unnecessary packet_encap and packet_decap primitives
  - Balanced packet_encap and packet_decap in the same aggregate can be eliminated because they have no external effect
  - Works in conjunction with SOAR analysis results
- Convert metadata accesses into local memory accesses when all uses are within the same aggregate
  - Private uses of metadata have no external effect
  - Metadata accesses are composed of 1 SRAM access and 20 instructions
  - Candidate accesses can be identified with def-use analysis
49. Global Data Memory Mapping
- Collect dynamic access frequencies to shared global data structures
- Map data structures to appropriate memory levels
  - Map small, frequently accessed data structures to scratch memory
  - Otherwise, place in SRAM
- Pointers may point to objects in different levels of memory
  - Perform congruence analysis to allocate such objects to a common memory level
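A minimal sketch of such a mapper, with invented thresholds standing in for the real profile-driven policy (the actual compiler's cutoffs and cost model are not stated in the slides):

```c
#include <assert.h>

/* Illustrative-only memory mapper: small + hot -> Scratch,
 * otherwise SRAM. All thresholds are invented. */
typedef enum { MEM_SCRATCH, MEM_SRAM } mem_level_t;

typedef struct {
    const char *name;
    int  size_bytes;   /* static size of the global */
    long accesses;     /* profiled access count */
} global_t;

mem_level_t map_global(const global_t *g, int scratch_left) {
    const int  SMALL_BYTES  = 1024;   /* assumed "small" cutoff */
    const long HOT_ACCESSES = 10000;  /* assumed "frequent" cutoff */
    if (g->size_bytes <= SMALL_BYTES &&
        g->accesses   >= HOT_ACCESSES &&
        g->size_bytes <= scratch_left)
        return MEM_SCRATCH;
    return MEM_SRAM;   /* default: larger but plentiful */
}
```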
50. Delayed-Update Software-Controlled Caches
- Cache unprotected global data structures
  - Since these structures are not protected by locks, assume that they can tolerate delayed update
  - Delayed update results in some mishandled packets, tolerable for network applications
- Identify caching candidates automatically from profiling statistics
  - Frequently read in the packet processing core
  - Infrequently written by control and initialization routines
  - High predicted hit rate derived from profiling
- Good candidates
  - Configuration globals: MAC table, classification table
  - Lookup tables
51. Caching Route Lookups
- Packet forwarding routes are stored in trie tables
- Frequently executed path
  - Route lookups
- Infrequently executed path
  - Route updates
  - Updated with an atomic write
(Diagram: trie nodes indexed by 2-bit strides 00/01/10/11, holding entries a, b, c; a route update builds a new node and swaps it in with a single atomic write)
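A 2-bit-stride trie lookup in the spirit of the diagram; the table contents and the 4-bit address width are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal 2-bit-stride trie for longest-prefix route lookup. */
#define STRIDE 2
#define FANOUT (1 << STRIDE)

typedef struct node {
    struct node *child[FANOUT];
    int next_hop;               /* -1 if no route stored here */
} node_t;

/* Sample table: default route -> hop 0; prefix 01 -> hop 2. */
static node_t leaf      = { { 0 }, 2 };
static node_t root_node = { { 0, &leaf, 0, 0 }, 0 };

/* Look up the longest matching prefix for a 4-bit address,
 * consuming STRIDE bits per trie level. */
int lookup(const node_t *root, uint8_t addr4) {
    int best = -1;
    const node_t *n = root;
    for (int shift = 4 - STRIDE; n && shift >= 0; shift -= STRIDE) {
        if (n->next_hop >= 0) best = n->next_hop;   /* remember match */
        n = n->child[(addr4 >> shift) & (FANOUT - 1)];
    }
    if (n && n->next_hop >= 0) best = n->next_hop;  /* deepest node */
    return best;
}
```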
52. Delayed-Update Software-Controlled Caches
(Diagram: base vs. optimized access to shared data, with an infrequent write path and a frequent read path)
- Delayed-update coherency checks the home location only occasionally
- update_flag is set on any change to the cached variable
- Update check rate is set as a function of the tolerable error rate and the variable's expected load and store rates
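The check can be sketched as follows; CHECK_PERIOD and all names are invented, and a real implementation would derive the period from the tolerable error rate and the load/store rates as the slide describes:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of a delayed-update software-controlled cache: readers use
 * a local copy and re-check the shared "home" location only every
 * CHECK_PERIOD reads, so a write may be seen late but not lost. */
#define CHECK_PERIOD 16

static int  home_value  = 1;      /* shared data (simulated SRAM) */
static bool update_flag = false;  /* set by the infrequent write path */

/* Infrequent write path: update the home copy, raise the flag. */
void shared_write(int v) { home_value = v; update_flag = true; }

/* Frequent read path: mostly hits the local cached copy. */
int cached_read(void) {
    static int local_copy = 1;
    static int reads_since_check = 0;
    if (++reads_since_check >= CHECK_PERIOD) {   /* occasional check */
        reads_since_check = 0;
        if (update_flag) { local_copy = home_value; update_flag = false; }
    }
    return local_copy;   /* may be stale for < CHECK_PERIOD reads */
}
```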
53. Program Stack Layout Optimization
- Shangri-La's runtime model
  - Supports a calling convention
  - Stack holds PPFs' local variables and temporary spill locations
- Baker does not support recursion, so stacks can be assigned statically to different locations
- Want to assign disjoint stack frames to limited local memory
  - Stack is mapped to local memory and SRAM
  - Only 48 words / thread for the stack
54. Program Stack Layout Optimization
(Diagram: a 48-word local memory stack and SRAM overflow; frames shown: main() 16 words, PPF1 16 words, PPF2 32 words in local memory, PPF3 16 words in SRAM)
- PPFs higher in the call graph are assigned to local memory first
- Dispatch model ensures a relatively flat call graph
- If a PPF is called from two places, assign it to the minimum stack location that will not collide with live stack frames
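A toy layout pass under these rules; it models only the "frame starts where the caller's frame ends" placement, not the called-from-two-places refinement, and the sizes follow the slide's example (all names invented):

```c
#include <assert.h>

/* Static stack layout: each function's frame starts where its
 * caller's frame ends; frames that fit within the 48-word local
 * memory stay there, the rest spill to SRAM. Functions are indexed
 * so that every caller precedes its callees. */
#define LM_WORDS 48
#define NFUNC 4

/* 0 = main, 1 = PPF1, 2 = PPF2, 3 = PPF3; main calls PPF1 and PPF2,
 * PPF2 calls PPF3. caller[i] = -1 marks the root. */
static int caller[NFUNC]      = { -1, 0, 0, 2 };
static int frame_words[NFUNC] = { 16, 16, 32, 16 };

int start_off[NFUNC];   /* assigned start offset of each frame */
int in_lm[NFUNC];       /* 1 if the frame resides in local memory */

void layout_stacks(void) {
    for (int i = 0; i < NFUNC; i++) {
        int base = (caller[i] < 0) ? 0
                 : start_off[caller[i]] + frame_words[caller[i]];
        start_off[i] = base;
        in_lm[i] = (base + frame_words[i] <= LM_WORDS);  /* else SRAM */
    }
}
```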
55. Conclusions
- Proposed optimizations for generating code from high-level languages that is competitive with hand-tuned code
  - Memory-level optimizations
  - Program partitioning to heterogeneous cores
  - Optimizations to support packet abstractions
- Total system performance will be shown after we describe the code generation optimizations and the run-time system
56. Code Generation and Optimizations
57. Outline
- Compiler Flow
- Intel XScale Processor Code Generation
- MicroEngine Code Generation
58. Shangri-La Compiler Flow
59. Intel XScale Processor Code Generation
- Intel XScale processor
  - Runs configuration, management, control plane, and cold code
  - With an OS and virtually unlimited code store
  - Less performance critical
- Code generation
  - Shares the compilation path with the ME until WOPT
  - Regenerates C source code with a proper naming convention
  - Leverages an existing GCC compiler for the Intel XScale processor
- Issue with address translation
  - The Intel XScale processor uses virtual addresses; the ME uses physical-address memory types
  - Perform address translation only on the Intel XScale processor, for addresses exposed between the two types of cores
60. ME Code Generator
61. Register Allocation
- ME architectural constraints on assigning registers
  - Multiple register banks used in specific types of instructions
    - GPR banks, SRAM/DRAM transfer in/out banks, next-neighbor bank
  - Cannot use certain banks of registers for both A and B operands
    - E.g., the GPR A and GPR B banks
- ME register allocation framework
  - Step 1: identify candidate banks
    - For each TN (virtual register), identify all possible register banks at each occurrence according to the ME ISA
    - If there is at least one common register bank, follow conventional register allocation
62. Register Allocation (cont.)
- ME register allocation framework (cont.)
  - Step 2: resolve bank conflicts if no common bank exists
    - Locate conflicting edges
    - Partition the def-use graph
    - Add moves between sub-graphs
  - Step 3: allocate intra-set registers
    - Perform conventional register allocation, but observe the constraints on A and B operands
    - Add an edge between two source operands in the same instruction in the symbolic register conflict graph
    - Use different heuristics to balance the usage of the GPR A and B banks
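Step 1 above reduces to a set intersection, which can be sketched with bitmasks; the bank encoding here is invented for illustration, not the ME ISA's actual encoding:

```c
#include <assert.h>

/* Candidate-bank identification for one TN (virtual register):
 * intersect the banks permitted at every occurrence. */
enum {
    BANK_GPR_A    = 1 << 0,
    BANK_GPR_B    = 1 << 1,
    BANK_XFER_IN  = 1 << 2,
    BANK_XFER_OUT = 1 << 3,
    BANK_NN       = 1 << 4,
};

/* Returns the set of banks legal at all occurrences; 0 means no
 * common bank exists, i.e., Step 2 (conflict resolution) is needed. */
unsigned common_banks(const unsigned *allowed_at_occ, int n) {
    unsigned common = ~0u;
    for (int i = 0; i < n; i++)
        common &= allowed_at_occ[i];   /* intersect occurrence i */
    return common;
}
```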
63. Calling Convention and Stack
- Support a calling convention
  - Caller/callee-saved registers, parameter passing, etc.
  - Perform code generation, e.g. register allocation, within a function scope
  - Eases debugging and performance tuning, which can focus on changes only in the affected scope
- Support a calling stack despite no recursion
  - Stack frame for local vars, spilled parameters, register spills
  - Calling stack grows from LM to SRAM
    - Allocate disjoint stack frames on precious LM
  - Statically decide the memory level for a frame, for both performance and code size reasons
64. More Features in Code Generation and Optimizations
- Inter-procedural analysis and function inlining
- Global scalar optimization for register promotion
- Parameterized machine model for ease of porting
- Code size guard to throttle the aggressiveness of optimizations which increase code size
- Global instruction scheduling and latency hiding
- Bitwise optimizations
- Loop unrolling
65. The Run-time System
66. RTS Goals
- Adapt execution of the application to match the current workload
- Isolate the RTS user from hardware-specific features commonly needed for packet processing
67. Adaptation Opportunities
68. Outline
- RTS Design Overview
- Run-time Adaptation Mechanisms
- Binding
- Checkpointing
- State migration
- Run-time Adaptation Results
- Overheads and costs
- Benefits
- Future research
- Summary
69. RTS Theory of Operations
- System monitor: collects run-time statistics (queue depths) and triggers adaptation
- Resource planner and allocator: computes a new processor mapping based on global knowledge
- Resource Abstraction Layer (RAL): hides the implementation of processor resources
(Diagram: the run-time system consumes executable binaries, the hardware topology, and the traffic mix; the system monitor feeds queue depths and triggers to the resource planner and allocator, which maps aggregates A, B, and C onto the Intel XScale core and MEs through the RAL)
70. The Resource Abstraction Layer
- Three goals
  - Support adaptation: packet channels and locks
  - Allow common abstractions for the rest of the RTS code: processing units, network interfaces
  - Allow for portability of the compiler's code generator: data memory, packet memory, timers, hash, random
- Key lesson
  - Noble last goal, but the performance cost can be large
  - Focus on supporting adaptation
71. How the RAL Supports Adaptation
A MicroEngine-based example:
- RAL calls are initially undefined in the application .o file
- At run time, the RTS has the application .o file and the RAL .o file (RAL implementations 0-6)
- The linker adjusts jump targets using the import-variable mechanism
- The process is repeated after each adaptation
72. System Monitor and Resource Planner
- System monitor
  - Triggering policies
    - E.g., queue thresholds
- Resource planner and allocator
  - Mapping policies
    - Move code into/out of the fast path
    - Duplicate code within the fast path
73. Adaptation Mechanisms
- Binding
- Checkpointing
- State migration
74. Why Have Binding?
- Want to be able to use the fastest implementations of resources available
(Diagram: aggregates A and B remapped between the Intel XScale core and MEs; once A and B land on adjacent MEs, we can use NN rings and local locks)
75. Binding: The Value of Choosing the Right Resource
Performance tests and ratings are measured using
specific computer systems and/or components and
reflect the approximate performance of Intel
products as measured by those tests. Any
difference in system hardware or software design
or configuration may affect actual performance.
76. Binding: Compile-time or Not?
77. Checkpointing
- When migrating, the RTS follows a simple algorithm
  - Tell affected processing units to stop at the checkpoint location
  - Wait for each processing unit to reach the checkpoint location
  - Reload and run the processing units
78. Checkpointing (cont'd)
- Finding the best checkpoint is easier in packet processing than in general domains
- Leverage characteristics of data-flow applications
  - Typically implemented as a dispatch loop
  - Dispatch loop is executed at high frequency
  - Top of the dispatch loop has no stack information
- Since the compiler creates the dispatch loop, the compiler inserts checkpoints in the code
79. State Migration
- Once a processor has been checkpointed, state from old resources must be moved to new resources
  - E.g., packets sitting in previous packet channel implementations, cached data
- Solution
  - Copy packets in old channels to new channels
  - Flush any caches
80. Adaptation Results
- Adaptation costs (i.e., overheads)
  - Checkpointing
  - Loading
  - Binding
  - State migration (not covered)
  - Cumulative effects
- Adaptation benefits
- Experimental setup
  - Radisys, Inc. ENP2611
    - 1 x 600 MHz Intel IXP2400 processor
  - MontaVista Linux
  - Timer measurement accuracy: 0.53us

Third-party brands/names are the property of their respective owners.
81. Checkpointing Overhead
- Factors
  - Time to inform a processing unit to stop at the checkpoint
    - ME: 60us; Intel XScale core: 34us
  - Time to check if all threads have stopped
    - ME: 3us; Intel XScale core: 3us
  - Time to start a processing unit
    - ME: 0.036ms; Intel XScale core: 0.097ms (Linux kernel thread)
82. Loading Overhead
- Intel XScale core thread start time: 0.054ms
- Graph shows ME load times
83. Binding Overhead
- Intel XScale core binding
84. Cumulative Effects of Adaptation Overheads
- Not all adaptation time represents an inoperable system
  - Can leave some processors running while checkpointing others
85Adaptation Overhead Learnings
Summary
RTS Design
Adaptation Mechanisms
Adaptation Results (6/7)
- Overall adaptation time is:
- Linking time + (checkpointing and loading time) × number of cores
- Packet loss occurs during checkpointing and loading, but not during binding
- So, focus optimizations on starting, stopping, and loading
- Exchange time in loading for more time in linking
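The cost model in the first bullet can be sketched in a few lines. This is only an illustration: the linking and per-core times below are hypothetical placeholders; only the shape of the formula comes from the slide.

```python
# Sketch of the overall adaptation-time model:
#   linking_time + (checkpoint_time + load_time) * num_cores

def adaptation_time_ms(linking_ms, checkpoint_ms, load_ms, num_cores):
    # Packet loss occurs only while checkpointing and loading,
    # so those per-core terms are the ones worth optimizing.
    return linking_ms + (checkpoint_ms + load_ms) * num_cores

# Hypothetical figures: 1 ms linking, 0.060 ms checkpoint and
# 0.5 ms load per ME, 8 MEs.
total = adaptation_time_ms(linking_ms=1.0, checkpoint_ms=0.060,
                           load_ms=0.5, num_cores=8)
```

Note how the per-core terms dominate as the core count grows, which matches the advice to trade loading time for linking time.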
86 Theoretical Benefits of Adaptation
Adaptation Results (7/7)
- For more details see paper in HotNets-II
- http://nms.lcs.mit.edu/HotNets-II/papers/adaptation-case.pdf
87 Future Research
Summary (1/2)
- Gather experimental evidence of the benefits of adaptation
- Define and develop performance determinism in the face of adaptation
- Apply power scaling to adaptation mechanisms
- Enable commercial operating systems to coexist with adaptation
88 Summary
Summary (2/2)
- An adaptive run-time system provides benefits in performance, supported services, and power consumption
- The system can be built with a truly programmable large-scale chip multiprocessor; it requires:
- Checkpointing
- Binding
- State migration
- Adaptation costs come primarily from loading and checkpointing times, so optimize these
89 Shangri-la Performance Evaluation
- Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004
90 Evaluation Setup
Setup (1/3)
- Hardware
- Radisys ENP2611 evaluation board (3 x 1 Gbps optical ports)
- IXIA packet generator (2 x 1 Gbps optical ports)
- Currently only capable of generating 2 Gbps of traffic
- Benchmarks
- L3-Switch (3126 lines): L2 bridging and L3 forwarding
- Firewall (2784 lines): simple firewall using ordered rule-based classification
- MPLS (4331 lines): Multi-Protocol Label Switching (transit node)
- Packet traces
- L3-Switch and MPLS evaluated using NPF packet traces
- Firewall used a custom packet trace
Third party brands/names are property of their
respective owners
91 Mt. Hood Board
Setup (2/3)
- One Intel IXP2400
- Three 1Gbps optical ports
- 64MB DRAM
- 8MB SRAM
92 Test Development Environment
Setup (3/3)
- Linux host machine
- Provides power to the Radisys ENP2611 board via the PCI bus
- Compiles code for the MEs and the Intel XScale core
- Runs an NFS server
- Intel XScale core running Linux and the Shangri-la RTS
- Reads generated binaries from the host machine's NFS server and loads them onto the MEs
[Diagram: Linux host connected to the Radisys ENP2611 board over Ethernet and a serial cable; IXIA packet generator attached via 2 x 1 Gbps optical links]
93 Instruction and memory budgets at 2.5 Gb/s
Resource Budgets (1/5)
- Assumed memory access latency: ~100 cycles
- Scratch memory: 60 cycles
- SRAM: 90 cycles
- DRAM: 120 cycles
- Memory access budget refers only to the number of memory accesses that can be overlapped with computation
- Does not account for the bandwidth of SRAM/DRAM
94 Evaluating Intel IXP2400 Memory Bandwidth
Resource Budgets (2/5)
- Modified empty PPF connected to Rx and Tx
- Add a loop to access the chosen memory level n times
- Graph throughput of various configurations
- n = 1, 2, 4, ..., 1024
- Memory accessed: Scratch, SRAM, DRAM
- Results using minimum-sized 64B packets
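The shape of the resulting curves can be anticipated with a toy model: if the n accesses serialize, throughput falls roughly as 1/n. A sketch only; the per-packet overhead constant is hypothetical, and real throughput also depends on bandwidth and threading.

```python
# Toy model: forwarding rate when each packet performs n serialized
# accesses of the given latency on a 600 MHz ME.

def modeled_rate_mpps(n_accesses, latency_cycles,
                      clock_mhz=600, overhead_cycles=50):
    # Total cycles spent per packet, then convert to millions
    # of packets per second.
    cycles_per_pkt = overhead_cycles + n_accesses * latency_cycles
    return clock_mhz / cycles_per_pkt

# DRAM-like latency (120 cycles) for n = 1, 2, 4 accesses per packet.
rates = [modeled_rate_mpps(n, 120) for n in (1, 2, 4)]
```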
95 Scratch Memory Bandwidth
Resource Budgets (3/5)
- Significant difference in memory bandwidth
consumed according to access size
96 SRAM Memory Bandwidth
Resource Budgets (4/5)
- Behavior is similar to Scratch Memory
97 DRAM Memory Bandwidth
Resource Budgets (5/5)
- DRAM accesses significantly constrain forwarding
rate
98 L3-Switch
Benchmarks (1/4)
- Performs core router functionality
- Bridges packets not destined for this router
- Handles ARP packets for resolving Ethernet addresses
- Routes IP packets targeting this router
[Pipeline diagram between Rx and Tx; modules: l3_switch.m, l2_bridge.m, l3_fwdr.m, eth_encap.m; PPFs: l2_cls.p, bridge.p, l3_cls.p, lpm_lookup.p, arp.p, icmp_processor.p, options_processor.p, encap.p]
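The lpm_lookup.p stage performs a longest-prefix-match route lookup. A minimal sketch of the idea (the routing table is hypothetical, and real IXP code would use trie structures in SRAM/DRAM rather than a linear scan of a Python dict):

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop.
ROUTES = {
    "10.0.0.0/8":  "gw-a",
    "10.1.0.0/16": "gw-b",
    "10.1.2.0/24": "gw-c",
}

def lpm_lookup(dst_ip):
    """Return the next hop for the longest matching prefix, or None."""
    dst = ipaddress.ip_address(dst_ip)
    best, best_len = None, -1
    for prefix, hop in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if dst in net and net.prefixlen > best_len:
            best, best_len = hop, net.prefixlen
    return best
```

For "10.1.2.3" the /24 entry wins over the /16 and /8, which is exactly the longest-prefix-match rule.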
99 Firewall
Benchmarks (2/4)
- Filters out unwanted packets from a WAN (e.g., the Internet)
- Assigns flow IDs to packets according to a user-specified rule list using src IP, dst IP, src port, dst port, TOS, and protocol
- Drops packets for specified flow IDs
- Optimizes the assignment of flow IDs
- First tries to find the flow ID in a hash table, placed there by a previous packet with the same fields
- Otherwise, does a long search by testing the rules in order
[Pipeline diagram between Rx and Tx; modules: classifier.m, firewall.m; PPFs: hash_lookup.p, long_search.p, firewall.p]
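The two-level classification described above can be sketched as a hash-table fast path backed by an ordered rule scan. The rules and flow IDs here are hypothetical; only the fast-path/slow-path structure comes from the slide.

```python
# Fast path: cache keyed on the packet's classification fields
# (role of hash_lookup.p).
flow_cache = {}  # (src, dst, sport, dport, proto) -> flow_id

# Slow path: user-specified rules tested in order, first match wins
# (role of long_search.p). Hypothetical rule list.
RULES = [
    (lambda k: k[4] == "tcp" and k[3] == 23, 1),  # telnet -> flow 1
    (lambda k: k[4] == "udp", 2),                 # any UDP -> flow 2
    (lambda k: True, 0),                          # default flow
]

DROP_FLOWS = {1}  # flow IDs the firewall drops

def classify(key):
    if key in flow_cache:            # fast path hit
        return flow_cache[key]
    for match, flow_id in RULES:     # slow path: ordered scan
        if match(key):
            flow_cache[key] = flow_id
            return flow_id

def accept(key):
    return classify(key) not in DROP_FLOWS
```

Packets after the first in a flow hit the cache and skip the ordered scan entirely.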
100 Multi-Protocol Label Switching (MPLS)
Benchmarks (3/4)
- Routes packets using attached labels instead of IP addresses
- Reduces routing hardware requirements
- Facilitates high-level traffic management of user-defined packet streams
[Pipeline diagram between Rx and Tx; modules: mpls_app.m, mpls.m, l2_bridge.m, l3_fwdr.m, eth_encap.m; PPFs: l2_cls.p, bridge.p, ilm.p, ops.p, ftn.p, arp.p, encap.p]
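At a transit node the core operation is a label swap driven by the incoming label map (ilm.p). A sketch with a hypothetical label table:

```python
# Hypothetical incoming label map: in label -> (out label, out port).
ILM = {
    100: (200, 1),
    101: (201, 2),
}

def transit_forward(label_stack, ilm=ILM):
    """Swap the top label and return (new_stack, out_port).
    Returns None when the packet needs slow-path handling
    (empty stack or unknown label)."""
    if not label_stack or label_stack[0] not in ilm:
        return None
    out_label, port = ilm[label_stack[0]]
    return ([out_label] + label_stack[1:], port)
```

Only the top of the label stack is touched; labels below it (stacked labels) pass through unchanged, which is why label stacking defeats SOAR on this benchmark.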
101 Other Network Benchmarks
Benchmarks (4/4)
- Network address translation (NAT)
- Allows multiple LAN hosts to connect to a WAN (e.g., the Internet) through one IP address
- Achieved by remapping LAN IPs and ports
- WAN hosts only see the NAT router
- Manages a table mapping active connections between the LAN and WAN
- Quality of Service (QoS)
- Allows partitioning of available bandwidth among user-specified traffic streams
- Packet streams are throttled by intentionally dropping packets
- Header compression
- Reduces the size of transmitted packet headers
- Since many fields are similar for packets in the same flow, achieves compression by transmitting only the differences
- Various security features
- Encryption / decryption
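The NAT remapping described above can be sketched as a pair of tables: outbound flows get a fresh WAN port, and the reverse table maps replies back to the LAN host. All addresses and port ranges here are hypothetical.

```python
WAN_IP = "203.0.113.1"  # the one address WAN hosts see

class Nat:
    def __init__(self):
        self.next_port = 40000
        self.out = {}   # (lan_ip, lan_port) -> wan_port
        self.back = {}  # wan_port -> (lan_ip, lan_port)

    def outbound(self, lan_ip, lan_port):
        # Reuse the existing mapping for an active connection,
        # otherwise allocate the next free WAN port.
        key = (lan_ip, lan_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return WAN_IP, self.out[key]

    def inbound(self, wan_port):
        # Map a reply back to the LAN host; None means drop.
        return self.back.get(wan_port)
```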
102 Dynamic Memory Accesses
Results (1/4)
SWC = software cache; PHR = pkt handling removal; SOAR = static offset alignment; PAC = pkt access combining; -O2 = inline pkt handling; -O1 = typical scalar opts; BASE = no opts
- Table shows the average per-packet access count
- PAC significantly reduces packet memory accesses
- -O1 enables the pipeline to fit on one ME
- SWC and PHR also contribute to reduced memory accesses
103 L3-Switch Forwarding Rate
Results (2/4)
Top-end performance is still constrained by memory bandwidth; PHR and SWC alleviate this somewhat
- PHR and SOAR have the most impact on forwarding rate
- PAC and SOAR have the most impact on L3-Switch forwarding rate
- Forwarding rate of minimum-sized packets (64B)
- Reduced memory access and instruction count both
improve forwarding rate
104 Firewall Forwarding Rate
Results (3/4)
Per-ME performance improvement dominated by PAC
105 MPLS Forwarding Rate
Results (4/4)
SOAR does not help this application due to
stacking of MPLS labels
- Results are for MPLS transit only (internal routers of an MPLS domain)
- Similar performance characteristics to L3-Switch
106 Conclusions
- Demonstrated performance from a high-level language comparable to hand-tuned code
- Memory-level optimizations
- Program partitioning to heterogeneous cores
- Optimizations to support packet abstractions
- Language features are more attractive as users can enjoy ease of programming without sacrificing performance
- Modular program design
- Packet model supporting encapsulation, metadata, and bit-level accesses
- Flat memory model
- Able to achieve 2 Gbps on L3-Switch, Firewall, and MPLS Transit
107 Summary
- Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004
108 Summary - Baker
- Goals
- Enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor)
- Enable good execution performance
- Approach
- Hide hardware details
- Expose domain-specific constructs
- Reduced C features
109 Summary - Compiler Optimizations
- Demonstrated performance from a high-level language comparable to hand-tuned code
- Memory-level optimizations
- Program partitioning to heterogeneous cores
- Optimizations to support packet abstractions
- Machine-specific optimizations
- Able to achieve maximal packet forwarding rates (2 Gbps) on L3-Switch, Firewall, and MPLS Transit
110 Summary - Runtime Adaptation
- An adaptive system is important for packet processing to adapt to varying workloads dynamically
- Benefits in performance, services, and power consumption
- The system can be built with a truly programmable large-scale chip multiprocessor; it requires:
- Checkpointing
- Binding
- State migration
- Adaptation costs come primarily from loading and checkpointing times, so optimize these
111 Key Learnings
- High-level language features ease programming for complex multi-core network processors
- Effective compiler optimizations are able to achieve performance comparable to hand-tuned systems
- Architecture-specific, domain-specific, and general optimizations are all critical to obtaining high performance
- Ease of programming and performance can coexist
- Runtime adaptation is a key feature of future network systems
- The system can be built with a large-scale CMP
- Many learnings are applicable to general CMP systems