Micro-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors (PowerPoint transcript)
1
Micro-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors
  • Michael K. Chen
  • Erik Johnson
  • Roy Ju
  • {michael.k.chen, erik.j.johnson, roy.ju}@intel.com
  • Corporate Technology Group
  • Intel Corp.
  • December 5, 2004

2
Agenda
  • Project Overview
  • Domain-specific Language
  • High-level Optimizations
  • Code Generation and Optimizations
  • Performance Characterization
  • Runtime Adaptation
  • Summary

3
Project Overview
  • Part of the Shangri-la Tutorial presented at
    MICRO-37
  • December 5, 2004

4
Outline
  • Problem Statement
  • Overview of Shangri-la System
  • Status and Teams

5
The Problem
Packet processing applications
  • State-of-the-art: hand-tuned code for maximal performance
  • → but often error-prone and not scalable
  • Static resource allocation often tailored to one particular workload
  • → not flexible to varying workloads and hardware

6
Shangri-La Overview
  • Mission: research an industry-leading programming environment for packet
    processing on Intel chip multiprocessor (CMP) silicon
  • Challenges
  • Hide architectural details from programmers
  • Automate allocation of system resources
  • Adapt resource allocation to match dynamic traffic conditions
  • Achieve performance comparable to hand-tuned systems
  • Technology
  • Language: enable portable packet processing applications
  • Compiler: automate code partitioning and optimizations
  • Run-time System: adapt to dynamic workloads


7
Architectural Features of Intel IXP Processor
  • Heterogeneous multi-core
  • Intel XScale processor (control) and MicroEngines (data)
  • Memory hierarchy
  • Local memory (LM) distributed on MEs
  • No HW cache
  • Scratch, SRAM, DRAM shared
  • Long memory latency
  • MicroEngine
  • Single issue, with deferred slots
  • Lightweight HW multi-threading
  • Event signals to synchronize threads
  • Multiple register banks, with constraints on operands in instructions
  • Limited code store

8
Packet Processing Applications
  • Types of apps
  • IPv4 forwarding, L3-Switch, MPLS (Multi-Protocol
    Label Switch), NAT (Network Address Translation),
    Firewall, QoS (Quality of Service)
  • Characteristics of packet processing apps
  • Performance metric: throughput (vs. latency)
  • Mostly memory bound
  • Large number of packets, with little locality
  • Smaller instruction footprint
  • Execution paths tend to be predictable

9
Anatomy of Shangri-La
Each component plays the role of a general-purpose compiler stage:
  • Baker Programming Language (language): modular language (with C-like
    syntax) to express applications as a dataflow graph
  • Baker Compiler (front-end): converts Baker into the compiler IR
  • Profiler (profiling): extracts run-time characteristics by executing the
    application
  • Pi Compiler (inter-procedural, loop, and memory optimizations): compiler
    optimizations for pipeline construction and data structure mapping/caching
  • Aggregate Compiler (global optimizations, code generation): code
    generation and optimization for heterogeneous cores
  • Run-time System (execution environment): dynamically adapts the mapping
    to match traffic fluctuations
10
Baker Language
  • Familiar to embedded systems programmers
  • Syntactically feels like C
  • Simplifies the development of packet-processing
    applications
  • Hides architectural details
  • Single level of memory
  • Implicit threading model
  • Modular programming and encapsulation
  • Domain-specific
  • Data flow model
  • Actors and interconnects (PPFs and channels)
  • Built-in types, e.g. packet
  • Enables compiler to generate efficient code on
    target CMP hardware

11
Shangri-la Example
Modular, simple description (Baker):

  module l3_switch {
    module eth_rx, eth_tx                    // Built-in PPFs
    module l2_clsfr, eth_encap_mod, l3_fwdr, l2_bridge
    wiring {
      eth_rx.eth0 -> l3_switch.l2_clsfr.input_chnl
      l3_fwdr.input_chnl <- l2_clsfr.l3_fwrd_chnl
      ... ...
    }
  }

Looks like C:

  int l3_switch.l2_clsfr.process(ether_packet_t in_pkt) {
    ...
    if (fwd) {
      p = packet_decap(in_pkt);
      channel_put(l3_forward_chnl, p);
    } else {
      channel_put(l2_bridge_chnl, in_pkt);
    }
  }

  • Profiler
  • IR run on IR-simulator
  • Stimulated by packet trace
  • Statistics stored in IR

(Diagram: L3 Switch dataflow: RX → L2 Cls → {L3 Fwdr, L2 Bridge} → Eth Encap → TX)
12
Compiler and Optimizations
  • Perform program and data partitioning
  • Cluster multiple finer-grained components into larger aggregates
  • Balance between replication and pipelining
  • Automatic data mapping onto the memory hierarchy
  • Optimizations and code generation
  • Code generation for heterogeneous processing cores
  • Global machine-independent optimizations
  • Optimizations for the memory hierarchy
  • Machine-dependent code generation and optimizations

13
Shangri-la Example
  • Pipeline Compiler: aggregates PPFs; critical-path PPFs are placed in the
    same aggregate, and internal channels are converted to function calls
  • Aggregate Compiler: emits executable binaries; each aggregate is given a
    main with a while(1) loop
(Diagram: L3 Switch PPFs (RX, L2 Cls, L3 Fwdr, L2 Bridge, Eth Encap, TX) mapped onto the Intel XScale core and MEs)
14
Run-time Adaptation
  • Workloads fluctuate over time
  • Usually over-provisioned to handle the worst case
  • Adapt to workload
  • Change mapping to increase performance when needed
  • Power down unneeded processors
  • Adaptation requirements
  • Hardware-independent abstraction
  • Querying of resource utilization

15
Shangri-la Example
  • Run-time System
Automatically maps aggregates to processing units (the Intel XScale core and MEv2 1-8) and automatically remaps at runtime.
16
Project Status
  • Project started in Q1 2003
  • Collaboration among Intel, the Chinese Academy of Sciences, and UT-Austin
  • Compiler based on the Open Research Compiler (ORC)
  • A complete prototype system achieving the maximal packet forwarding rate
    on a number of applications
  • Research project to transfer technology to product groups

17
Acknowledgements
  • Communication Technology Lab, Intel
  • Erik Johnson, Jamie Jason, Aaron Kunze, Steve
    Goglin, Arun Raghunath, Vinod Balakrishnan,
    Robert Odell
  • Microprocessor Technology Lab, Intel
  • Xiao Feng Li, Lixia Liu, Jason Lin, Mike Chen,
    Roy Ju, Astrid Wang, Kaiyu Chen, Subramanian
    Ramaswamy
  • Institute of Computing Technology, Chinese
    Academy of Sciences
  • Zhaoqing Zhang, Ruiqi Lian, Chengyong Wu, Junchao
    Zhang, Jiajun Wu, HanDong Ye, Tao Liu, Bin Bao,
    Wei Tang, Feng Zhou
  • University of Texas at Austin
  • Harrick Vin, Jayaram Mudigonda, Taylor Riche,
    Ravi Kokku

18
The Baker Language
  • Part of the Shangri-la Tutorial presented at
    MICRO-37
  • December 5, 2004

19
Baker Overview and Goals
  • Baker is C with data-flow and packet processing extensions
  • Goal 1: Enable efficient expression of packet processing applications on
    large-scale chip-multiprocessors (e.g., Intel IXP2400 processor)
  • Encourage interesting, complex application development
  • Be familiar to embedded systems programmers, so start with C
  • Goal 2: Enable good execution performance
  • Scalable performance across new versions of large-scale CMPs
  • Expose optimization opportunities to the compiler and run-time system
  • Don't constrain the compiler's ability to place code or data; don't
    preclude run-time adaptation

20
Outline
  • The Baker Approach
  • Hardware abstractions and models
  • Domain-specific constructs
  • Standard C language feature reductions
  • Results
  • Future Research
  • Summary

21
Baker's Hardware Models
Hardware Model (1/5)
  • Memory
  • Concurrency, i.e., cores and threads
  • I/O, e.g., receive and transmit

22
A Single-level Memory Model
Hardware Model (2/5)
  • Baker exposes a single-level, shared memory
    model, like C
  • Makes programming easier
  • Variable declaration and use, malloc/free work
    just like C
  • Enables compiler freedom in optimizing data
    placement
  • Move most accessed data structures (or parts of
    structures) to fastest memory
  • Enables compiler to move code to any core
  • Code not tied to a particular core's physical memory model

23
Implicitly Threaded Concurrency
Hardware Model (3/5)
  • Baker exposes a multithreaded concurrency model
  • Programmer knows code may execute concurrently
  • Programmer does not
  • Know the number of cores
  • Explicitly create or destroy threads
  • A consequence: programmers must protect shared memory with locks
  • Enables compiler and run-time system to optimize
    execution
  • Can create an application pipeline and balance it
  • Can optimize locks based on which processors
    access a lock

24
Example of Implicit Threading
Hardware Model (4/5)
Code shown for illustrative purposes only and
should not be considered valid.
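The code for this slide is not preserved in the transcript. A minimal Baker-like sketch of the idea, assuming hypothetical names (stats_lock, pkt_count, out_chnl) and C-style lock primitives; the point is that a PPF body may be run on however many threads the system chooses, so shared state is guarded explicitly:

  // Hypothetical PPF body: the run-time may execute this concurrently
  // on many threads; the programmer never creates or destroys threads.
  int counter.process(packet_t *pkt) {
      lock(&stats_lock);            // shared state must be protected
      pkt_count++;                  // hypothetical per-application statistic
      unlock(&stats_lock);
      channel_put(out_chnl, pkt);   // pass the packet downstream
      return 0;
  }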
25
I/O As A Driver Model
Hardware Model (5/5)
  • RX and TX require hardware knowledge
  • E.g., PHYs and MACs, RBUFs, TBUFs, flow control
    hardware
  • Difficult to abstract this hardware using common
    C constructs
  • Solution
  • Don't write these in Baker
  • Written once in assembly by the system vendors
    for each board
  • Baker developers use receive and transmit code
    like a device driver

26
Exposing Domain Features
Domain Features (1/5)
  • Hiding hardware features can drastically decrease
    performance
  • Baker exposes application domain features to
    compensate
  • Tailor the compiler and run-time system
    optimizations to the domain
  • Programmer is forced to help the compiler and
    run-time system find parallelism
  • But in a natural way
  • Two types of domain features
  • Data-flow abstractions
  • Packet processing abstractions

27
Data Flow Overview
Domain Features (2/5)
  • A data flow is a directed graph
  • Graph nodes are called actors (or
    packet-processing functions, PPF, in Baker) and
    represent the computation
  • Graph edges are called channels and move data
    between actors
  • Data-flow is a natural fit for the packet
    processing domain

28
Data Flow PPFs and Channels
Domain Features (3/5)
  • PPFs (or Actors)
  • Implicitly Concurrent
  • Stateful
  • Support multiple inputs and outputs
  • No assumptions about a steady rate of packet
    consumption
  • Channels
  • Queue-like properties
  • Asynchronous, unidirectional, typed, reliable
  • Active and passive varieties
  • Can be replaced with a function call
  • Run-time system can choose an optimal implementation
  • E.g., scratch rings vs. next-neighbor rings

29
Packet Processing Features
Domain Features (4/5)
  • Packets and meta-data as first-class objects
  • Packets
  • Programmer accesses packet data through a special pointer type; all
    packet accesses go through these pointers
  • Allows the compiler to coalesce reads/writes, avoid head and tail
    manipulation, etc.
  • Meta-data
  • Storage associated and carried with a packet
  • E.g., input port, output port, etc.
  • Accessed via the packet's pointer
  • Useful to programmers to carry per-packet state passed between actors
  • Language ensures that meta-data is created before it is used

30
Example Application
Domain Features (5/5)
Code shown for illustrative purposes only and
should not be considered valid.
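This application code is likewise not preserved. A minimal Baker-like sketch, assuming hypothetical names (out_chnl, an outport meta-data field, route() and packet_drop() helpers); only channel_put, the packet-pointer access style, and the ttl field appear elsewhere in the tutorial:

  // Hypothetical forwarding PPF: reads a header field through the
  // packet pointer and records a decision in per-packet meta-data.
  int fwd.process(ip_packet_t *pkt) {
      if (pkt->ttl <= 1) {                     // packet access via pointer
          packet_drop(pkt);                    // hypothetical primitive
          return 0;
      }
      pkt->meta.outport = route(pkt->dst_ip);  // meta-data travels with pkt
      channel_put(out_chnl, pkt);
      return 0;
  }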
31
Reduce Language Features
C Restrictions (1/1)
  • By removing some features of C, the compiler is able to perform more
    optimizations
  • Type-safe pointers
  • Compiler is able to do much better alias analysis
  • Networking code typically does not use tricky pointer manipulations
  • Some features needed to be removed to avoid large overheads on the
    microengines
  • Recursion
  • No natural stack on the microengine, so the compiler has to implement one
  • Eliminating recursion simplifies stack analysis
  • Function pointers
  • Removed for similar reasons as recursion
  • Unfortunately, network programmers actually use them a great deal

32
Results
Results Summary (1/3)
  • Source-lines of code measured using sloccount
  • Does not do complexity analysis, does not handle
    assembly code

These tests and ratings are measured using
specific computer systems and/or components and
reflect the size of the indicated code as
measured by those tests.  Any difference in
system hardware or software design or
configuration may affect actual sizes.
33
Future Research
Results Summary (2/3)
  • Existing languages expose packets as completely independent; however,
    flows are a more appropriate independence class for data in this domain
  • How should flows of packets be represented in a language, and how should
    the system optimize around them?
  • Automated ordering
  • Flow-data locality improvements
  • Flow-lock elision

34
Summary
Results Summary (3/3)
  • Goals
  • Enable efficient expression of packet processing
    applications on large-scale chip-multiprocessors
    (e.g., Intel IXP2400 processor)
  • Enable good execution performance
  • Approach
  • Hide hardware details
  • Single memory, implicit threading, RX/TX as
    drivers
  • Expose domain-specific constructs
  • Data-flow, packets, meta-data
  • Reduce C Features
    Type-safe pointers, recursion, function pointers

35
High-Level Optimizations
  • Part of the Shangri-la Tutorial presented at
    MICRO-37
  • December 5, 2004

36
Shangri-La Compiler Overview
  • Convert Baker program into compiler intermediate
    representation (IR)
  • Derive run-time characteristics by simulating
    application
  • Compiler optimizations for pipeline construction
    and data structure mapping/caching
  • Code generation and optimization for
    heterogeneous cores
  • Load application and perform dynamic resource
    linking

37
Profiling Overview
  • Simulation of high-level IR
  • Developed a custom IR interpreter
  • Different from the traditional 2-pass profiling
  • Profiling information guides optimizations in
    later phases
  • Stimulated using user-supplied packet traces
  • Information collected
  • Execution frequency
  • Communication
  • Memory access statistics

38
Pi Compiler Details
  • Performs most high-level optimizations
  • Maps PPFs to heterogeneous cores
  • Assigns memory levels to global data structures
  • Performs inter-procedural analysis for optimizations that need it
  • Guided by profiling results

39
Supporting Language Features with Compiler Optimizations
  • Modular, dataflow language → automatic program partitioning
  • Packet abstraction model → packet handling optimizations
  • Flat memory hierarchy → automatic memory mapping

40
Key Compiler Technologies
  • Automatic program partitioning to heterogeneous
    cores
  • Packet handling optimizations
  • Packet access combining
  • Static offset and alignment resolution
  • Packet primitive removal
  • Partitioned memory hierarchy optimizations
  • Memory mapping
  • Delayed-update software-controlled caches
  • Program stack layout optimization

41
Partitioning Across Heterogeneous Cores
Automatic Program Partitioning (1/3)
  • Partition across the Intel XScale core and multiple MEs
  • Partitioning considerations
  • Identifying control and data planes
  • Minimizing inter-processor communication costs
  • Accounting for dynamic characteristics using profiling results
  • Satisfying code size constraints
  • Different memory addresses seen by different cores
  • Insert address translations
  • Minimize insertions and their impact on performance

42
Inputs Into Partitioning Algorithm
Automatic Program Partitioning (2/3)
  • Throughput-driven cost model
  • Eliminates latency from consideration
  • Expresses goal appropriately for domain
  • Relevant profiling statistics
  • PPF execution time
  • Global data access frequency
  • Channel utilization
  • Possible partitioning strategies
  • Pipelining application across cores
  • Replicating application across cores

43
Partitioning Algorithm
Automatic Program Partitioning (3/3)
Pi Compiler flow:
  • Intra-PPF IPA produces code size and execution time estimates
  • Memory mapper assigns data to memory levels
  • Aggregate formation: merge the PPFs with the highest communication cost
    (e.g., the L3 Switch PPFs Rx, L2 Cls, L3 Fwdr, L2 Bridge, Eth Encap, Tx)
  • Intra-aggregate IPA, then aggregate dump
  • Duplicate the aggregate with the lowest throughput
  • Duplicate the entire pipeline on available MEs
A sketch of this loop follows.
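A rough sketch of the partitioning loop implied by this flow, assuming hypothetical helper names (find_max_comm_cost_pair, fits_code_store, find_min_throughput_aggregate); the real Pi compiler pass is guided by the profiling statistics listed above:

  // Greedy aggregate formation, then replication (sketch).
  void partition(app_t *app, int num_mes) {
      while (num_aggregates(app) > num_mes) {
          agg_pair_t p = find_max_comm_cost_pair(app);  // busiest channel
          merge(p);                   // internal channel becomes a call
      }
      while (free_mes(app) > 0) {
          agg_t *a = find_min_throughput_aggregate(app); // bottleneck stage
          if (!fits_code_store(a))
              break;
          replicate(a);               // duplicate bottleneck on a spare ME
      }
  }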
44
Packet Access Combining
Packet Handling Optimizations (1/5)
  • Basic packet accesses are powerful
  • Support for language features
  • Naïve mapping results in at least one memory access per packet access
  • Combine multiple packet accesses / metadata accesses
  • L3-Switch has 24 packet accesses per packet on the critical path
  • Take advantage of the IXP's wide DRAM access instruction
  • Buffer values in local memory or transfer registers

45
Packet Access Combining Example
Packet Handling Optimizations (2/5)
Before combining (one memory access each):

  t1 = pkt->ttl;    // off = 64b, sz = 8b
  t2 = pkt->prot;   // off = 72b, sz = 8b

After combining (a single 16-bit read):

  b  = read pkt (off = 64b, sz = 16b);
  t1 = (b >> 8) & 0xff;
  t2 = b & 0xff;
  • Analysis overview
  • Isolate packet accesses
  • Perform checks to guarantee the packet accesses can be combined safely
  • Validate range and size of the combined memory access
  • Replace the combined accesses with accesses to/from Local Memory /
    transfer registers

46
Static Offset and Alignment Resolution (SOAR)
Packet Handling Optimizations (3/5)
  packet_encap: offset(src_ip) = 26B
  packet_decap: offset(src_ip) = ???

  • Generic packet accesses
  • Can handle arbitrary layering of protocols and arbitrary field offsets
  • Clearly simplifies the programmer's tasks
  • But dynamic offset and alignment determination add significant overheads
  • Dynamic offset handling adds 20 instructions per packet access
  • Dynamic alignment adds several instructions per packet access

47
Static Offset and Alignment Resolution (SOAR)
Packet Handling Optimizations (4/5)
  • Statically resolving packet field alignment eliminates a few instructions
  • A packet field with statically resolved offset and alignment can be
    accessed with a few instructions
  • Implemented using a custom dataflow analysis

(Diagram: the L3-Switch PPF graph (Rx, l2_cls.p, l3_switch.m, l3_fwdr.m, lpm_lookup.p, options_processor.p, icmp_processor.p, arp.p, l3_cls.p, eth_encap.m, encap.p, l2_bridge.m, bridge.p, Tx) annotated with SOAR results: 18/18, 3/3, 2/2, and 1/1 accesses resolved at the encapsulation transitions Eth → IP, IP → Eth, Eth → ARP, new ICMP → IP, copy IP → ICMP → IP, and copy Eth.)
48
Eliminate Unnecessary Packet Primitives in Code
Packet Handling Optimizations (5/5)
  • Eliminate unnecessary packet_encap and packet_decap primitives
  • Balanced packet_encap and packet_decap in the same aggregate can be
    eliminated because they have no external effect
  • Works in conjunction with SOAR analysis results
  • Convert metadata accesses into local memory accesses when all uses are
    within the same aggregate
  • Private uses of metadata have no external effect
  • Metadata accesses are composed of 1 SRAM access and 20 instructions
  • Candidate accesses can be identified with def-use analysis

49
Global Data Memory Mapping
Memory Hierarchy Optimizations (1/6)
  • Collect dynamic access frequencies to shared
    global data structures
  • Map data structures to appropriate memory levels
  • Map small, frequently accessed data structures to
    Scratch Memory
  • Otherwise, place in SRAM
  • Pointers may point to objects in different levels
    of memory
  • Perform congruence analysis to allocate such
    objects to a common memory level

50
Delayed-Update Software-Controlled Caches
Memory Hierarchy Optimizations (2/6)
  • Cache unprotected global data structures
  • Since these structures are not protected by locks, assume that they can
    tolerate delayed update
  • Delayed update results in some mishandled packets, tolerable for network
    applications
  • Identify caching candidates automatically from profiling statistics
  • Frequently read in the packet-processing core
  • Infrequently written, by control and initialization routines
  • High predicted hit rate derived from profiling
  • Good candidates
  • Configuration globals: MAC table, classification table
  • Lookup tables

51
Caching Route Lookups
Memory Hierarchy Optimizations (3/6)
  • Packet forwarding routes are stored in trie
    tables
  • Frequently executed path
  • Route lookups
  • Infrequently executed path
  • Route update
  • Updated with an atomic write

(Diagram: a two-level trie indexed by address bits 00/01/10/11 holding routes a, b, c; a route update atomically replaces one entry, e.g. installing c where b was.)
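A sketch of a lookup over a trie of this shape, assuming a hypothetical node layout (two address bits per level); the slides specify only that lookups walk the trie on the frequent path and that updates are installed with an atomic write:

  // Hypothetical trie node: four children indexed by two address bits.
  typedef struct trie_node {
      struct trie_node *child[4];
      route_t          *route;       // non-NULL if a prefix ends here
  } trie_node_t;

  route_t *lpm_lookup(trie_node_t *root, uint32_t ip) {
      route_t *best = NULL;
      int shift = 30;
      while (root != NULL) {
          if (root->route)
              best = root->route;    // remember the longest match so far
          if (shift < 0)
              break;
          root = root->child[(ip >> shift) & 0x3];
          shift -= 2;
      }
      return best;
      // Route update: build the new entry off to the side, then swap it
      // in with a single atomic write, so readers never see a torn table.
  }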
52
Delayed-update software-controlled caches
Memory Hierarchy Optimizations (4/6)
(Diagram: the base access reads the shared data's home location directly; the optimized access reads a local cached copy on the frequent read path, while the infrequent write path sets update_flag.)
  • Delayed-update coherency checks the home location only occasionally
  • update_flag is set on any change to the cached variable
  • Update check rate is set as a function of the tolerable error rate and
    the variable's expected load and store rates (see the sketch below)
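A sketch of the optimized read path, assuming hypothetical structure and field names (swc_entry_t, check_interval); the slide fixes only the occasional home-location check gated by update_flag:

  // Delayed-update software-controlled cache read (sketch).
  value_t swc_read(swc_entry_t *e) {
      if (++e->reads_since_check >= e->check_interval) {
          e->reads_since_check = 0;
          if (*e->home_update_flag) {    // a writer touched the home copy
              e->cached = *e->home;      // refresh the local copy
              *e->home_update_flag = 0;
          }
      }
      return e->cached;   // possibly stale between checks, by design
  }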

53
Program stack layout optimization
Memory Hierarchy Optimizations (5/6)
  • Shangri-la's runtime model
  • Supports a calling convention
  • Stack holds PPFs' local variables and temporary spill locations
  • Baker does not support recursion, so stack frames can be assigned
    statically to different locations
  • Want to assign disjoint stack frames to the limited Local Memory
  • Stack is mapped to Local Memory and SRAM
  • Only 48 words per thread for the stack

54
Program stack layout optimization
Memory Hierarchy Optimizations (6/6)
(Diagram: stack frames main() (16 words), PPF1 (16 words), PPF2 (32 words), PPF3 (16 words); the 48-word Local Memory holds the frames nearest the top of the call graph, and PPF3 spills to SRAM.)
  • PPFs higher in call graph assigned to Local
    Memory first
  • Dispatch model ensures relatively flat call graph
  • If PPF is called from two places, assign to
    minimum stack location that will not collide with
    live stack frames
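A sketch of this placement policy, assuming hypothetical structures (frames visited in call-graph order, a per-PPF frame_words size); the slide fixes only the policy itself: frames higher in the call graph get Local Memory first, and a PPF called from several places gets the lowest offset that cannot collide with any live caller frame:

  // Static stack frame placement (sketch); 48 LM words per thread.
  #define LM_WORDS 48
  void place_frames(ppf_t **topo, int n) {
      for (int i = 0; i < n; i++) {        // callers visited before callees
          ppf_t *p = topo[i];
          int off = max_live_caller_frame_end(p);   // avoid live frames
          if (off + p->frame_words <= LM_WORDS)
              p->base = lm_addr(off);      // fits in Local Memory
          else
              p->base = sram_alloc(p->frame_words); // spill to SRAM
      }
  }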

55
Conclusions
  • Proposed optimizations that generate code from a high-level language
    competitive with hand-tuned code
  • Memory-level optimizations
  • Program partitioning to heterogeneous cores
  • Optimizations to support packet abstractions
  • Total system performance will be shown after we describe the code
    generation optimizations and the run-time system

56
Code Generation and Optimizations
  • Part of the Shangri-la Tutorial presented at
    MICRO-37
  • December 5, 2004

57
Outline
  • Compiler Flow
  • Intel XScale Processor Code Generation
  • MicroEngine Code Generation

58
Shangri-la Compiler Flow
59
Intel XScale Processor Code Generation
  • Intel XScale processor
  • Runs configuration, management, control plane, and cold code
  • Has an OS and virtually unlimited code store
  • Less performance critical
  • Code generation
  • Shares the compilation path with the ME until WOPT
  • Regenerates C source code with a proper naming convention
  • Leverages an existing GCC compiler for the Intel XScale processor
  • Issue: address translation
  • The Intel XScale processor uses virtual addresses; the ME uses physical
    addresses per memory type
  • Perform address translation only on the Intel XScale processor, for
    addresses exposed between the two types of cores
60
ME Code Generator
61
Register Allocation
  • ME architectural constraints on assigning registers
  • Multiple register banks used in specific types of instructions
  • GPR banks, SRAM/DRAM Transfer In/Out banks, Next Neighbor bank
  • Cannot use certain banks of registers for both A and B operands
  • E.g., GPR A and GPR B banks
  • ME register allocation framework
  • Step 1: identify candidate banks
  • For each TN (virtual register), identify all possible register banks at
    each occurrence according to the ME ISA
  • If there is at least one common register bank, follow conventional
    register allocation

62
Register Allocation (cont.)
  • ME register allocation framework (cont.)
  • Step 2: resolve bank conflicts if no common bank exists
  • Locate conflicting edges
  • Partition the def-use graph
  • Add moves between sub-graphs
  • Step 3: allocate intra-set registers
  • Perform conventional register allocation, but observe the constraints on
    A and B operands
  • Add an edge between two source operands in the same instruction in the
    symbolic register conflict graph
  • Use different heuristics to balance the usage of the GPR A and B banks
    (see the sketch below)
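A sketch of Step 1's candidate-bank computation, assuming hypothetical types (a bitmask with one bit per bank, occurrence records on each TN); Steps 2 and 3 then operate on the result:

  // Step 1 (sketch): intersect legal banks over all occurrences of a TN.
  typedef unsigned bankmask_t;     // one bit per ME register bank

  bankmask_t candidate_banks(tn_t *tn) {
      bankmask_t m = ALL_BANKS;
      for (occ_t *o = tn->occurrences; o != NULL; o = o->next)
          m &= legal_banks(o->opcode, o->operand_pos);  // from the ME ISA
      return m;   // empty mask => Step 2 must split the def-use graph
  }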

63
Calling Convention and Stack
  • Support a calling convention
  • Caller-/callee-saved registers, parameter passing, etc.
  • Perform code generation, e.g. register allocation, within a function
    scope
  • Eases debugging and performance tuning by keeping changes within the
    affected scope
  • Support a call stack despite no recursion
  • Stack frame for local vars, spilled parameters, register spills
  • Call stack grows from LM to SRAM
  • Allocate disjoint stack frames in precious LM
  • Statically decide the memory level for a frame, for both performance and
    code size reasons

64
More Features in Code Generation and Optimizations
  • Inter-procedural analysis and function inlining
  • Global scalar optimization for register promotion
  • Parameterized machine model for ease of porting
  • Code size guard to throttle the aggressiveness of
    optimizations which increase code size
  • Global instruction scheduling and latency hiding
  • Bitwise optimizations
  • Loop unrolling

65
The Run-time System
  • Part of the Shangri-la Tutorial presented at
    MICRO-37
  • December 5, 2004

66
RTS Goals
Motivation (1/2)
  • Adapt execution of the application to match the
    current workload
  • Isolate the RTS user from hardware-specific
    features commonly needed for packet processing

67
Adaptation Opportunities
Motivation (2/2)
68
Outline
  • RTS Design Overview
  • Run-time Adaptation Mechanisms
  • Binding
  • Checkpointing
  • State migration
  • Run-time Adaptation Results
  • Overheads and costs
  • Benefits
  • Future research
  • Summary

69
RTS Theory of Operations
RTS Design (1/4)
  • System monitor collects run-time statistics (queue depths) and triggers
    adaptation
  • Resource planner and allocator computes a new processor mapping based on
    global knowledge (topology, traffic mix, current resource mapping)
  • Resource Abstraction Layer (RAL) hides the implementation of processor
    resources
(Diagram: the run-time system on the Intel XScale core; queue-depth triggers flow from the System Monitor to the Resource Planner & Allocator, which remaps aggregates A, B, C across the MEs and loads the executable binaries.)
70
The Resource Abstraction Layer
RTS Design (2/4)
  • Three goals
  • Support adaptation: packet channels and locks
  • Allow common abstractions for the rest of the RTS code: processing
    units, network interfaces
  • Allow for portability of the compiler's code generator: data memory,
    packet memory, timers, hash, random
  • Key lesson
  • The last goal is noble, but its performance cost can be large
  • Focus on supporting adaptation

71
How RAL Supports Adaptation
RTS Design (3/4)
A microengine-based example:
  • RAL calls are initially undefined in the application .o file
  • At run time, the RTS has the application .o file and the RAL .o file
    (RAL implementations 0-6)
  • The linker adjusts jump targets using the import-variable mechanism
  • The process is repeated after each adaptation
72
System Monitor and Resource Planner
RTS Design (4/4)
  • System Monitor
  • Triggering policies
  • E.g., queue thresholds
  • Resource planner and allocator
  • Mapping policies
  • Move code into/out of fast path
  • Duplicate code within the fast path

73
Adaptation Mechanisms
Adaptation Mechs (1/7)
  • Binding
  • Checkpointing
  • State migration

74
Why Have Binding?
Adaptation Mechs (2/7)
(Diagram: two mappings of PPFs A and B across the MEs and the Intel XScale core; once A and B are co-located, NN rings and local locks can be used.)
Want to be able to use the fastest implementations of resources available.
75
Binding The Value of Choosing the Right Resource
Adaptation Mechs (3/7)
Performance tests and ratings are measured using
specific computer systems and/or components and
reflect the approximate performance of Intel
products as measured by those tests.  Any
difference in system hardware or software design
or configuration may affect actual performance.
76
Binding Compile-time or Not?
Adaptation Mechs (4/7)
77
Checkpointing
Adaptation Mechs (5/7)
  • When migrating, the RTS follows a simple algorithm (sketched below)
  • Tell the affected processing units to stop at the checkpoint location
  • Wait for the processing units to reach the checkpoint location
  • Reload and run the processing units
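A sketch of that sequence, assuming hypothetical RTS helper names (signal_stop, at_checkpoint, load_image, migrate_state); the state-migration step itself is described two slides later:

  // Migration driver (sketch), run on the Intel XScale core.
  void migrate(pu_t *pus, int n, mapping_t *new_map) {
      for (int i = 0; i < n; i++)
          signal_stop(&pus[i]);             // stop at the next checkpoint
      for (int i = 0; i < n; i++)
          while (!at_checkpoint(&pus[i]))
              ;                             // wait for quiescence
      migrate_state(new_map);               // move channel contents, caches
      for (int i = 0; i < n; i++) {
          load_image(&pus[i], new_map);     // reload with the new binding
          start(&pus[i]);
      }
  }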

78
Checkpointing contd
Adaptation Mechs (6/7)
  • Finding the best checkpoint is easier in packet processing than in
    general domains
  • Leverage characteristics of data-flow applications
  • Typically implemented as a dispatch loop
  • Dispatch loop is executed at high frequency
  • Top of the dispatch loop has no stack information
  • Since the compiler creates the dispatch loop, the compiler inserts
    checkpoints in the code

79
State Migration
Adaptation Mechs (7/7)
  • Once a processor has been checkpointed, state held in old resources must
    be moved to the new resources
  • E.g., packets sitting in the previous packet channel implementations,
    cached data
  • Solution (sketched below)
  • Copy packets in old channels to the new channels
  • Flush any caches
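The migrate_state helper assumed in the earlier sketch might look like this, assuming a hypothetical non-blocking dequeue (chan_get_nb) alongside the channel_put seen earlier; only the copy-and-flush steps come from the slide:

  // State migration (sketch): drain old channels, then flush caches.
  void migrate_state(mapping_t *map) {
      for (int i = 0; i < map->nchannels; i++) {
          packet_t *p;
          while (chan_get_nb(map->old_ch[i], &p))  // non-blocking dequeue
              channel_put(map->new_ch[i], p);      // re-enqueue in new impl
      }
      swc_flush_all();   // invalidate delayed-update software caches
  }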

80
Adaptation Results
Adaptation Results (1/7)
  • Adaptation costs (i.e., overheads)
  • Checkpointing
  • Loading
  • Binding
  • State migration (not covered)
  • Cumulative effects
  • Adaptation benefits
  • Experimental setup
  • Radisys, Inc. ENP2611
  • 1 × 600MHz Intel IXP2400 processor
  • MontaVista Linux
  • Timer measurement accuracy: 0.53µs

Third party brands/names are property of their
respective owners
81
Checkpointing Overhead
Adaptation Results (2/7)
  • Factors
  • Time to inform a processing unit to stop at the checkpoint
  • ME: 60µs; Intel XScale core: 34µs
  • Time to check whether all threads have stopped
  • ME: 3µs; Intel XScale core: 3µs
  • Time to start a processing unit
  • ME: 0.036ms; Intel XScale core: 0.097ms (Linux kernel thread)
82
Loading Overhead
Adaptation Results (3/7)
  • Intel XScale core thread start time: 0.054ms
  • Graph shows ME load times

83
Binding Overhead
Adaptation Results (4/7)
  • ME binding
  • Intel XScale core binding

84
Cumulative Effects of Adaptation Overheads
Adaptation Results (5/7)
  • Not all adaptation time represents an inoperable
    system
  • Can leave some processors running while
    checkpointing others

85
Adaptation Overhead Learnings
Adaptation Results (6/7)
  • Overall adaptation time ≈ linking time + (checkpointing and loading time
    × number of cores)
  • Packet loss occurs during checkpointing and loading, but not during
    binding
  • So, focus optimizations on starting, stopping, and loading
  • Exchange time in loading for more time in linking

86
Theoretical Benefits of Adaptation
Adaptation Results (7/7)
  • For more details see the paper in HotNets-II
  • http://nms.lcs.mit.edu/HotNets-II/papers/adaptation-case.pdf

87
Future Research
Summary (1/2)
  • Gather experimental benefits of adaptation
  • Define and develop performance determinism in the face of adaptation
  • Apply power scaling to the adaptation mechanisms
  • Make adaptation co-exist with commercial operating systems

88
Summary
Summary (2/2)
  • An adaptive run-time system provides benefits in
  • Performance, supported services, and power consumption
  • The system can be built with a truly programmable large-scale chip
    multiprocessor; it requires
  • Checkpointing
  • Binding
  • State migration
  • Adaptation costs come primarily from loading and checkpointing times;
    optimize these

89
Shangri-la Performance Evaluation
  • Part of the Shangri-la Tutorial presented at
    MICRO-37
  • December 5, 2004

90
Evaluation Setup
Setup (1/3)
  • Hardware
  • Radisys ENP2611 evaluation board (3 × 1Gbps optical ports)
  • IXIA packet generator (2 × 1Gbps optical ports)
  • Currently only capable of generating 2Gbps of traffic
  • Benchmarks
  • L3-Switch (3126 lines): L2 bridging and L3 forwarding
  • Firewall (2784 lines): simple firewall using ordered rule-based
    classification
  • MPLS (4331 lines): Multi-Protocol Label Switching (transit node)
  • Packet traces
  • L3-Switch and MPLS evaluated using NPF packet traces
  • Firewall used a custom packet trace

91
Mt. Hood Board
Setup (2/3)
  • One Intel IXP2400
  • Three 1Gbps optical ports
  • 64MB DRAM
  • 8MB SRAM

92
Test Development Environment
Setup (3/3)
  • Linux host machine
  • Provides power to the Radisys ENP2611 board via the PCI bus
  • Compiles code for the MEs and the Intel XScale core
  • Runs an NFS server
  • Intel XScale core runs Linux and the Shangri-la RTS
  • Reads generated binaries from the host machine's NFS server and loads
    them onto the MEs

(Diagram: Linux host connected to the Radisys ENP2611 over Ethernet and a serial cable; the IXIA packet generator is attached over 2 × 1Gbps optical links.)
93
Instruction and memory budgets at 2.5Gb/s
Resource Budgets (1/5)
  • Assumed memory access latency: ~100 cycles
  • Scratch Memory: 60 cycles
  • SRAM: 90 cycles
  • DRAM: 120 cycles
  • The memory access budget refers only to the number of memory accesses
    that can be overlapped with computation
  • Does not account for SRAM/DRAM bandwidth

94
Evaluating Intel IXP2400 Memory Bandwidth
Resource Budgets (2/5)
  • Modified an empty PPF connected to Rx and Tx (see the sketch below)
  • Added a loop to access the chosen memory level n times
  • Graph throughput of various configurations
  • n = 1, 2, 4, ..., 1024
  • Memory accessed: SCRATCH, SRAM, DRAM
  • Results use minimum-sized 64B packets
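A sketch of that probe PPF, assuming hypothetical names (N, STRIDE, a mem_read over the chosen level); the methodology is exactly the loop described above:

  // Bandwidth probe (sketch): forward every packet, touching the
  // chosen memory level N times per packet.
  int probe.process(packet_t *pkt) {
      volatile uint32_t sink;
      for (int i = 0; i < N; i++)
          sink = mem_read(level_base + i * STRIDE);  // SCRATCH/SRAM/DRAM
      (void)sink;
      channel_put(tx_chnl, pkt);
      return 0;
  }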

95
Scratch Memory Bandwidth
Resource Budgets (3/5)
  • Significant difference in memory bandwidth
    consumed according to access size

96
SRAM Memory Bandwidth
Resource Budgets (4/5)
  • Behavior is similar to Scratch Memory

97
DRAM Memory Bandwidth
Resource Budgets (5/5)
  • DRAM accesses significantly constrain forwarding
    rate

98
L3-Switch
Benchmarks (1/4)
  • Performs core router functionality
  • Bridge packets not destined for this router
  • Handle ARP packets for resolving Ethernet
    addresses
  • Route IP packets targeting this router

(Diagram: the L3-Switch PPF graph: Rx and Tx around l2_cls.p, l3_switch.m, l3_fwdr.m, lpm_lookup.p, options_processor.p, icmp_processor.p, arp.p, l3_cls.p, eth_encap.m, encap.p, l2_bridge.m, and bridge.p.)
99
Firewall
Benchmarks (2/4)
  • Filters out unwanted packets from a WAN (e.g., the Internet)
  • Assigns flow IDs to packets according to a user-specified rule list,
    using src IP, dst IP, src port, dst port, TOS, and protocol
  • Drops packets for specified flow IDs
  • Optimizes the assignment of flow IDs (see the sketch below)
  • Try to find the flow ID in the hash table, placed there by a previous
    packet with the same fields
  • Otherwise, do a long search by testing the rules in order

(Diagram: the Firewall PPF graph: Rx → classifier.m (hash_lookup.p, long_search.p) → firewall.m (firewall.p) → Tx.)
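A sketch of that classification fast path, assuming hypothetical helper names (hash_lookup, hash_insert, rule_match over a 5-tuple-plus-TOS key); the slide fixes only the hash-then-ordered-scan structure:

  // Flow ID assignment (sketch): a hash hit skips the ordered rule scan.
  int classify(flow_key_t *key, rule_t *rules, int nrules) {
      int id;
      if (hash_lookup(flow_table, key, &id))
          return id;                       // fast path: flow seen before
      for (int i = 0; i < nrules; i++)     // long search, in rule order
          if (rule_match(&rules[i], key)) {
              hash_insert(flow_table, key, rules[i].flow_id);
              return rules[i].flow_id;
          }
      return DEFAULT_FLOW_ID;              // hypothetical default rule
  }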
100
Multi-Protocol Label Switching (MPLS)
Benchmarks (3/4)
  • Route packets using attached labels instead of IP
    addresses
  • Reduces routing hardware requirements
  • Facilitates high-level traffic management on
    user-defined packet streams

(Diagram: the MPLS PPF graph: Rx and Tx around l2_cls.p, mpls_app.m, mpls.m (ilm.p, ops.p, ftn.p; MPLS → MPLS transitions), l3_fwdr.m (arp.p), eth_encap.m (encap.p), and l2_bridge.m (bridge.p).)
101
Other Network Benchmarks
Benchmarks (4/4)
  • Network address translation (NAT)
  • Allows multiple LAN hosts to connect to a WAN (e.g., the Internet)
    through one IP address
  • Achieved by remapping LAN IPs and ports
  • WAN hosts only see the NAT router
  • Manages a table mapping active connections between the LAN and the WAN
  • Quality of Service (QoS)
  • Allows partitioning of available bandwidth among user-specified traffic
    streams
  • Packet streams are throttled by intentionally dropping packets
  • Header compression
  • Reduces the size of transmitted packet headers
  • Since many fields are similar for packets in the same flow, achieves
    compression by transmitting only the differences
  • Various security features
  • Encryption / decryption

102
Dynamic Memory Accesses
Results (1/4)
Legend: SWC = software cache; PHR = packet handling removal; SOAR = static offset and alignment resolution; PAC = packet access combining; -O2 = inlined packet handling; -O1 = typical scalar opts; BASE = no opts
  • Table shows average per-packet access count
  • PAC significantly reduces packet memory accesses
  • -O1 enables pipeline to fit on one ME
  • SWC and PHR also contribute to reduced memory
    accesses

103
L3-Switch Forwarding Rate
Results (2/4)
  • Forwarding rate of minimum-sized packets (64B)
  • Reduced memory access and instruction counts both improve the forwarding
    rate
  • PAC and SOAR have the most impact on the L3-Switch forwarding rate
  • Top-end performance is still constrained by memory bandwidth; PHR and
    SWC alleviate this somewhat

104
Firewall Forwarding Rate
Results (3/4)
Per-ME performance improvement dominated by PAC
105
MPLS Forwarding Rate
Results (4/4)
SOAR does not help this application due to
stacking of MPLS labels
  • Results are for MPLS transit only (internal routers of an MPLS domain)
  • Similar performance characteristics to L3-Switch

106
Conclusions
  • Demonstrated performance from a high-level language comparable to
    hand-tuned code
  • Memory-level optimizations
  • Program partitioning to heterogeneous cores
  • Optimizations to support packet abstractions
  • Language features are more attractive when users can enjoy ease of
    programming without sacrificing performance
  • Modular program design
  • Packet model supporting encapsulation, metadata, and bit-level accesses
  • Flat memory model
  • Able to achieve 2Gbps on L3-Switch, Firewall, and MPLS Transit

107
Summary
  • Part of the Shangri-la Tutorial presented at
    MICRO-37
  • December 5, 2004

108
Summary - Baker
  • Goals
  • Enable efficient expression of packet processing
    applications on large-scale chip-multiprocessors
    (e.g., Intel IXP2400 processor)
  • Enable good execution performance
  • Approach
  • Hide hardware details
  • Expose domain-specific constructs
  • Reduce C Features

109
Summary - Compiler Optimizations
  • Demonstrated performance from a high-level language comparable to
    hand-tuned code
  • Memory-level optimizations
  • Program partitioning to heterogeneous cores
  • Optimizations to support packet abstractions
  • Machine-specific optimizations
  • Able to achieve maximal packet forwarding rates (2Gbps) on L3-Switch,
    Firewall, and MPLS Transit

110
Summary - Runtime Adaptation
  • An adaptive system is important for packet processing, to adapt to
    varying workloads dynamically
  • Benefits in performance, services, and power consumption
  • The system can be built with a truly programmable large-scale chip
    multiprocessor; it requires
  • Checkpointing
  • Binding
  • State migration
  • Adaptation costs come primarily from loading and checkpointing times;
    optimize these

111
Key Learning
  • High-level language features ease programming for complex multi-core
    network processors
  • Effective compiler optimizations are able to achieve performance
    comparable to hand-tuned systems
  • Architecture-specific, domain-specific, and general optimizations are
    all critical to obtaining high performance
  • Ease of programming and performance can co-exist
  • Runtime adaptation is a key feature for future network systems
  • The system can be built with a large-scale CMP
  • Many learnings are applicable to general CMP systems