Title: Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming
Authors: Michael K. Chen, Xiao Feng Li, Ruiqi Lian, Jason H. Lin, Lixia Liu, Tao Liu, Roy Ju
1. Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming
Michael K. Chen, Xiao Feng Li, Ruiqi Lian, Jason H. Lin, Lixia Liu, Tao Liu, Roy Ju
- Discussion Prepared by Jennifer Chiang
2. Shangri-La: Some Insight
- A synonym for paradise
- Legendary place from James Hilton's novel Lost Horizon
- Goal: achieve a perfect compiler
3. Introduction
- Problem: programming network processors is challenging.
  - Tight memory access and instruction budgets to sustain high line rates.
  - Traditionally done in hand-coded assembly.
- Solution: researchers have recently proposed high-level programming languages for packet processing.
- Challenge: can these languages be compiled into code that is competitive with hand-tuned assembly?
4. Shangri-La Compiler from a 10,000-Foot View
- Consists of a programming language, a compiler, and a runtime system targeting the Intel IXP multi-core network processor.
- Accepts packet programs written in Baker.
- Maximizes processor utilization
  - Hot code paths are mapped across processing elements.
- No hardware caches on the target.
  - Delayed-update software-controlled caches for frequently accessed data.
- Packet handling optimizations
  - Reduce per-packet memory access and instruction counts.
- Custom stack model
  - Maps stack frames to the fastest levels of the target processor's memory hierarchy.
5. Baker Programming Language
- Baker programs are structured as a dataflow of packets from Rx to Tx.
- Module: container for holding related PPFs, wirings, support code, and shared data.
- PPF (packet processing function)
  - C-like code that performs the actual packet processing.
  - Holds temporary local state and accesses global data structures.
- CC (communication channel)
  - Input and output channel endpoints of PPFs are wired together.
  - Asynchronous FIFO-ordered queues.
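The dataflow structure described on this slide can be illustrated with a minimal sketch. This is Python, not Baker syntax: the `Channel` class, the `run_ppf` helper, and the two example PPFs are hypothetical names chosen for illustration, and packets are modeled as plain dictionaries.

```python
from collections import deque


class Channel:
    """FIFO queue standing in for a communication channel (CC)."""

    def __init__(self):
        self._queue = deque()

    def put(self, packet):
        self._queue.append(packet)

    def get(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


def run_ppf(ppf, source, sink):
    """Drain `source`, apply `ppf` to each packet, forward survivors to `sink`."""
    while len(source):
        result = ppf(source.get())
        if result is not None:  # None means the PPF dropped the packet
            sink.put(result)


def drop_expired(packet):
    """Hypothetical PPF: drop packets whose TTL would reach zero."""
    return packet if packet["ttl"] > 1 else None


def decrement_ttl(packet):
    """Hypothetical PPF: return a copy of the packet with TTL reduced by one."""
    return {**packet, "ttl": packet["ttl"] - 1}


def run_example(ttls):
    """Wire Rx -> drop_expired -> decrement_ttl -> Tx and return the TTLs at Tx."""
    rx, middle, tx = Channel(), Channel(), Channel()
    for ttl in ttls:
        rx.put({"ttl": ttl})
    run_ppf(drop_expired, rx, middle)
    run_ppf(decrement_ttl, middle, tx)
    return [tx.get()["ttl"] for _ in range(len(tx))]
```

Calling `run_example([64, 1, 5])` pushes three packets through the two wired PPFs; the packet with TTL 1 is dropped and the others arrive at Tx in FIFO order. In the real system the PPFs run concurrently on separate processing elements; this sketch runs them one after another only to keep the wiring visible.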
6. Baker Program Example
[Figure: example Baker program, with the Module, PPF, and CC elements labeled]
7. Packet Support
- Specify protocols using Baker's protocol construct.
- Metadata: used to store state associated with a packet, but not contained in the packet.
  - Useful for passing state about a packet from one PPF to a later PPF.
- Packet_handle: used to manipulate packets.
[Figure: packet layout showing Data, Metadata, and Packet_handle]
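As a rough illustration of how metadata carries state between PPFs, the sketch below models a packet handle as a Python dataclass. The `PacketHandle` class and both functions are assumptions made for this example and are not Baker's actual API; the only external fact used is that byte 9 of an IPv4 header holds the protocol number (6 for TCP).

```python
from dataclasses import dataclass, field

IPV4_PROTOCOL_OFFSET = 9  # byte offset of the protocol field in an IPv4 header
TCP_PROTOCOL_NUMBER = 6


@dataclass
class PacketHandle:
    """Hypothetical handle: packet bytes plus per-packet metadata kept outside them."""
    data: bytes
    metadata: dict = field(default_factory=dict)


def classify(handle):
    """First PPF: inspect the header once and record the result as metadata."""
    handle.metadata["is_tcp"] = handle.data[IPV4_PROTOCOL_OFFSET] == TCP_PROTOCOL_NUMBER
    return handle


def choose_queue(handle):
    """Later PPF: reuse the stored metadata instead of re-reading packet data."""
    return "tcp_queue" if handle.metadata["is_tcp"] else "default_queue"
```

The second function never touches `handle.data`, which is the point of metadata: state computed by one PPF travels with the packet without being stored in it.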
8. IXP2400 Network Processor
- Intel XScale core: processes control packets, executes noncritical application code, and handles initialization and management of the network processor.
- 8 MEs (microengines): lightweight, multi-threaded pipelined processors running a special ISA designed for processing packets.
- 4 levels of memory: Local Memory, Scratch Memory, SRAM, DRAM
[Figure: IXP2400 block diagram showing DRAM, SRAM, XScale Core, Scratch Memory, and Local Memory]
9. Compiler Details
10. Aggregation
- Throughput model: t ∝ (n / p) × k
  - t: throughput
  - n: number of MEs
  - p: total number of pipeline stages
  - k: throughput of the pipeline stage with the lowest throughput
- Latency of a packet through the system can be tolerated, but minimum forwarding rates must be guaranteed.
- To maximize throughput, the compiler pipelines code or duplicates it across multiple processing elements.
- Techniques: pipelining, merging, duplication
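The throughput model on this slide can be evaluated directly to compare candidate mappings. The sketch below is an illustration only: the function name is hypothetical, the proportionality constant is assumed to be 1, and the stage rates in the usage note are made-up numbers, not measurements.

```python
def estimated_throughput(num_mes, num_stages, slowest_stage_rate):
    """Evaluate (n / p) * k from the slide's model, taking the constant factor as 1.

    num_mes            -- n, number of microengines available
    num_stages         -- p, total number of pipeline stages
    slowest_stage_rate -- k, throughput of the slowest pipeline stage
    """
    if num_mes <= 0 or num_stages <= 0:
        raise ValueError("num_mes and num_stages must be positive")
    return (num_mes / num_stages) * slowest_stage_rate


def best_mapping(num_mes, candidates):
    """Pick the (name, num_stages, slowest_stage_rate) candidate with the highest estimate."""
    return max(candidates, key=lambda c: estimated_throughput(num_mes, c[1], c[2]))
```

For example, with 8 MEs, a 4-stage pipeline whose slowest stage has rate 1.0 scores (8 / 4) × 1.0, while a single merged stage with rate 0.3 duplicated on every ME scores (8 / 1) × 0.3. Under these assumed rates `best_mapping` would prefer the merged, duplicated mapping, which shows why the compiler weighs pipelining against merging and duplication rather than always pipelining.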
11. Delayed-Update Software-Controlled Caching
- Caching candidates: frequently read data structures with high hit rates that are infrequently written.
- Updates to these structures rely only on the coherency of a single atomic write to guarantee correctness.
- Reduces the frequency and cost of coherency checks.
- Penalty for late updates: packet delivery errors
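A minimal sketch of the delayed-update idea follows, assuming a version counter as the single atomically written value and a fixed check interval. Both choices, and all class and attribute names, are assumptions for illustration; the slides do not say how the compiler picks the check rate.

```python
class BackingTable:
    """Shared table in slow memory; each update bumps `version`."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.version = 0

    def update(self, key, value):
        self.entries[key] = value
        self.version += 1  # stands in for the single atomic write


class DelayedUpdateCache:
    """Software cache that checks coherency only every `check_interval` lookups."""

    def __init__(self, table, check_interval):
        self.table = table
        self.check_interval = check_interval
        self.seen_version = table.version
        self.lines = {}
        self.lookups_since_check = 0
        self.coherency_checks = 0

    def lookup(self, key):
        self.lookups_since_check += 1
        if self.lookups_since_check >= self.check_interval:
            self.lookups_since_check = 0
            self.coherency_checks += 1
            if self.table.version != self.seen_version:
                self.lines.clear()  # flush stale entries
                self.seen_version = self.table.version
        if key not in self.lines:  # miss: fetch from the slow table
            self.lines[key] = self.table.entries[key]
        return self.lines[key]
```

Between checks, a lookup can return a stale value after the table has been updated. That window corresponds to the penalty noted on the slide: a few packets may be handled with outdated data in exchange for far fewer coherency checks.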
12. PAC
- PAC: packet access combining
- Packet data is always stored in DRAM.
- If every packet access is mapped to a DRAM access, packet forwarding rates are quickly limited by DRAM bandwidth.
- In the code generation stage of the compiler, multiple protocol field accesses are combined into a single wide DRAM access.
- The same can be done for SRAM metadata accesses.
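The effect of combining accesses can be sketched by counting reads against a simulated memory. Everything here is an illustrative assumption: `CountingDram` is a stand-in for real DRAM, the field table uses standard IPv4 header offsets, and the real compiler performs this transformation at code generation time rather than at run time.

```python
class CountingDram:
    """Simulated DRAM that counts how many read operations are issued."""

    def __init__(self, data):
        self.data = bytes(data)
        self.accesses = 0

    def read(self, offset, length):
        self.accesses += 1
        return self.data[offset:offset + length]


# Field name -> (byte offset, byte length) within an IPv4 header.
IPV4_FIELDS = {
    "ttl": (8, 1),
    "protocol": (9, 1),
    "src": (12, 4),
    "dst": (16, 4),
}


def read_fields_separately(dram, fields):
    """Uncombined form: one memory access per protocol field."""
    return {name: dram.read(off, size) for name, (off, size) in fields.items()}


def read_fields_combined(dram, fields):
    """Combined form: one wide access covering all fields, then local slicing."""
    start = min(off for off, _ in fields.values())
    end = max(off + size for off, size in fields.values())
    wide = dram.read(start, end - start)
    return {
        name: wide[off - start:off - start + size]
        for name, (off, size) in fields.items()
    }
```

Both functions return the same field values, but the separate version issues four reads while the combined version issues one. The combined read also fetches the bytes between fields that are not needed, so the trade only pays off when the fields are close together.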
13. Stack Layout Optimization
- Goal: allocate as many stack frames as possible to the limited amount of fast memory.
- The stack can grow into SRAM, but SRAM has high latency and hurts performance.
- Assign Local Memory to procedures higher in the program call graph.
- Assign SRAM only when Local Memory is completely exhausted.
- Utilize physical and virtual stack pointers.
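One way to picture the placement policy is to compute each procedure's worst-case stack offset from the call graph and keep a frame in Local Memory only if it still fits. This is a simplified sketch under stated assumptions: the call graph is acyclic, the worst-case-offset rule is an interpretation chosen for this example rather than the compiler's documented algorithm, and the procedure names, frame sizes, and memory budget in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def place_frames(call_graph, frame_sizes, local_memory_bytes):
    """Return {procedure: "local" or "sram"} for an acyclic call graph.

    call_graph         -- maps each caller to a list of its callees
    frame_sizes        -- maps every procedure to its frame size in bytes
    local_memory_bytes -- bytes of fast Local Memory reserved for the stack
    """
    callers = {proc: set() for proc in frame_sizes}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(caller)

    start = {}      # worst-case stack offset at which each frame begins
    placement = {}
    # static_order() yields callers before their callees.
    for proc in TopologicalSorter(callers).static_order():
        start[proc] = max(
            (start[c] + frame_sizes[c] for c in callers[proc]), default=0
        )
        fits = start[proc] + frame_sizes[proc] <= local_memory_bytes
        placement[proc] = "local" if fits else "sram"
    return placement
```

With a call graph `{"main": ["classify", "forward"], "classify": ["lookup"]}`, frame sizes of 32, 48, 16, and 64 bytes for `main`, `classify`, `forward`, and `lookup`, and a 128-byte budget, the three procedures nearest the root stay in Local Memory and only the deepest one, `lookup`, spills to SRAM.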
14. Experimental Results
- 3 benchmarks: L3-Switch, Firewall, MPLS
- The significant impact of PAC is evident in the large reduction in packet handling SRAM and DRAM accesses.
- Code generated by Shangri-La for all 3 benchmarks achieved 100% forwarding rates at 2.5 Gbps, which meets the design spec of the IXP2400.
- The same throughput target is achieved by hand-coded assembly written specifically for these processors.
15. Conclusions
- Shangri-La provides a complete framework for aggressively compiling network programs.
- Reduces both instruction and memory access counts.
- Achieved the goal of a 100% packet forwarding rate at 2.5 Gbps.