1. for Network Processors
Martin Labrecque, Gregory Steffan
University of Toronto
NPC Conference, October 3rd 2006, Tokyo, Japan
2. Importance of Network Processing
- Packets require processing
- Processing increasingly advanced
- VoIP / Digital TV
- XML e-commerce
- Personalized content
3. What is a Network Processor?
- Software programmable IC
- Processes streaming data
- Complex parallel architecture
- On-chip resources ⇒ low-level programming
⇒ Programming NPs is complex and expensive
4. How to Effectively Exploit an NP?
- Scale the app. to more processing elements
- Fixed architecture or lack of compiler support
- Nepal, NP-Click and Nepsim
- Need real parallel communicating programs
- NetBench, NPBench, CommBench
- CRC, MD5, RED are micro-kernels
- Mold the app. to find per-application scaling limits
- Shangri-La (Chen): packet processing functions
- Intel (Li, Dai): partitioning, multithreading
⇒ Need flexibility to explore the architecture space
⇒ Need an automated framework to program NPs
5. Our Objectives
- Investigate issues in compiling multithreaded apps
- From a high-level language, automatically:
- Transform code
- Allocate memory, signals, threads
- Compiler scales throughput on NP hardware
- Evaluate in a parameterized NP environment
⇒ Applications drive the NP architecture design
6. System Overview / Talk Outline
7. Programming Model
- Task graphs from the Click Modular Router
- Advantages:
- Compose an app. by connecting elements
- Results in a sequential task graph
- Easier analysis than loosely written C
- Click is widely accepted (StepNP, CUSP)
⇒ Allows for memory and task management
8. Execution of Task Graph
- Task graph pipelines work across packets
- Our apps have unordered dependences between packets
- Conditional processing ⇒ variable latencies
- Assignment of tasks to PEs is complex
9. System Overview / Talk Outline
10. Managing Tasks
- Scale throughput to a larger number of processors
- Evaluated transformations:
- Replication
- Splitting
- Early signalling
- Speculation
11. 1) Replication
- Replicas of a task can execute concurrently
- Distribute the work of packets across replicas
- Assumptions on ordered/unordered dependences
⇒ Replication is key to introducing parallelism
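As a minimal sketch of distributing the work of packets across replicas (the thread pool, process_packet body, and counter are our illustration, not the talk's compiler output):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Shared counter standing in for real per-packet work.
std::atomic<int> packets_done{0};

// Hypothetical task body applied to every packet.
void process_packet(int /*packet_id*/) {
    packets_done.fetch_add(1, std::memory_order_relaxed);
}

// Run 'replicas' concurrent copies of the same task; packet i is
// handled by replica i % replicas, so the replicas split the stream.
void run_replicated(int replicas, int num_packets) {
    std::vector<std::thread> pool;
    for (int r = 0; r < replicas; ++r)
        pool.emplace_back([=] {
            for (int p = r; p < num_packets; p += replicas)
                process_packet(p);
        });
    for (auto& t : pool) t.join();
}
```

With unordered dependences between packets, any replica can take any packet; an ordered variant would need the synchronization discussed next.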
12. Synchronization
- Requires dependence identification
- Managed automatically by the compiler

    class PacketCounter {
      int packet_cnt;
      PacketCounter(packet p) {
        packet_cnt = packet_cnt + 1;
      }
    }
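A sketch of the synchronization the compiler could insert for an element like PacketCounter, assuming one lock per element instance (the mutex placement and method names are our assumption, not the talk's actual codegen):

```cpp
#include <cassert>
#include <mutex>
#include <thread>

class PacketCounter {
public:
    int packet_cnt = 0;
    void handle_packet() {
        std::lock_guard<std::mutex> guard(lock_);  // compiler-inserted
        packet_cnt = packet_cnt + 1;               // original element body
    }
private:
    std::mutex lock_;                              // compiler-allocated
};
```

The element author writes only the unsynchronized body; the dependence on packet_cnt is what the compiler must identify across replicas.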
13. 2) Early Task Signalling
[Animation over slides 13–15: the sequential task graph A → B → C → D is reordered to A → C → D; since B only has a dependence with D, D waits for B's signal and resumes once B triggers it]
- Execute work units as early as possible
- Requires dependence profiling
- Announce/wait-for/trigger scheme
⇒ Reduces packet processing latency
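The announce/wait-for/trigger scheme can be sketched with a one-shot signal (the Signal class and its method names are our assumption, not the talk's API): only the consumer that actually depends on the producer blocks, while independent tasks run ahead.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-shot signal: a producer task announces completion of the work a
// consumer depends on; only that consumer waits.
class Signal {
public:
    void announce() {                        // producer side (task B)
        std::lock_guard<std::mutex> g(m_);
        ready_ = true;
        cv_.notify_all();
    }
    void wait_for() {                        // consumer side (task D)
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return ready_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool ready_ = false;
};
```

In the A/B/C/D example, C carries no such signal against B at all, which is what dependence profiling establishes.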
16. 3) Task Splitting
- Divide a task into sub-tasks
- Allows scheduling of splits on different PEs
- Parallel overlap of task splits not supported
- Challenges addressed:
- How much splitting
- Which tasks
- Where to split
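One way to picture a split (the checksum task and split point are our example, not from the talk): each sub-task can be mapped to a different PE; for a single packet the splits still run in sequence, but different packets can occupy different splits at the same time.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct Packet {
    std::vector<long> words;
    long acc = 0;  // state carried from one split to the next
};

// First half of the original task.
void split_a(Packet& p) {
    p.acc = std::accumulate(p.words.begin(),
                            p.words.begin() + p.words.size() / 2, 0L);
}

// Second half, resuming where split_a left off.
void split_b(Packet& p) {
    p.acc += std::accumulate(p.words.begin() + p.words.size() / 2,
                             p.words.end(), 0L);
}
```

The "where to split" challenge corresponds to choosing the cut so the carried state (here, just acc) stays small.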
17. 4) Speculation
- Entry into a synchronized section triggers local buffering of all writes
- A violation rolls back to the section entry
- Challenges addressed:
- Rollback mechanism
- Commit order
- Context eviction
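A minimal sketch of the write buffering described above (the buffer layout and method names are ours, not the talk's mechanism): writes inside the section go to a local buffer, a violation discards it, and a successful exit commits it to shared state.

```cpp
#include <cassert>
#include <map>

class SpeculativeSection {
public:
    explicit SpeculativeSection(std::map<int, long>& shared)
        : shared_(shared) {}
    void write(int addr, long value) { buffer_[addr] = value; }  // buffered
    long read(int addr) {
        auto it = buffer_.find(addr);        // see our own writes first
        return it != buffer_.end() ? it->second : shared_[addr];
    }
    void commit() {                          // no violation: publish writes
        for (auto& kv : buffer_) shared_[kv.first] = kv.second;
        buffer_.clear();
    }
    void rollback() { buffer_.clear(); }     // violation: back to entry
private:
    std::map<int, long>& shared_;
    std::map<int, long> buffer_;
};
```

Commit order and context eviction then become policies over when commit() may run and when a buffered context may be dropped.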
18. System Overview / Talk Outline
19. Managing Threads
- Task mapping
- Thread scheduling
- Thread context switching on a PE
- Priority system for requests on shared busses
- Thread preemption
⇒ Run-time techniques to maximize NP utilization
20. System Overview / Talk Outline
21. Managing Memory
- Dependence identification ⇒ synchronization
- Batching:
- Group memory accesses into a single request
- Perform this request ahead of time
- Forwarding:
- Forward memory buffers between tasks
- Decreases off-chip memory traffic
⇒ Programming model eases automated parallelization
⇒ Compiler can target wide memory busses
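Batching can be sketched as follows (the Memory struct and request counter are our illustration): a task's known addresses travel in one wide request issued ahead of time, instead of one off-chip transaction per word.

```cpp
#include <cassert>
#include <vector>

struct Memory {
    std::vector<long> words;   // backing store
    int requests = 0;          // off-chip transactions issued

    // One word per request: the unbatched baseline.
    long read(int addr) { ++requests; return words[addr]; }

    // Batched: all addresses are gathered into a single wide request.
    std::vector<long> batched_read(const std::vector<int>& addrs) {
        ++requests;
        std::vector<long> out;
        for (int a : addrs) out.push_back(words[a]);
        return out;
    }
};
```

Issuing the batch ahead of time overlaps the single round trip with computation; the cost, noted in the supplementary slides, is burstier traffic on shared busses.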
22. IXP2800
23. Separation of Memory Types
[Diagram: tasks 1..N each access the packet stream via push/pull(), a persistent heap, instructions, static and dynamic data structures, and a per-task execution context (stack, temporary heap)]
⇒ Use memory types to manage dependences
⇒ Use memory types to ease memory mapping
24. System Overview / Talk Outline
25. Simulated Architecture
- RISC, single-CPI processing elements
- Non-blocking memory instructions
26. Methodology
- Vary the number of PEs ⇒ vary the computation/memory bandwidth ratio
- Combine task/thread/memory management configurations
27. Packet Input Rate
⇒ Measure throughput at saturation
⇒ Adapt the configuration for representative measurements
28. Simulation
- Intel IXP processor family architectural parameters
- NLANR packet traces
- Header processing applications:
- Standards-compliant router
- Network Address Translation (adapted from Click)
29. Effect of Replication
⇒ What is limiting scaling?
30. Bottleneck Identification
- Imperfect overlap of latencies with computation
- Idle CPU time
- On-chip components ⇒ bottleneck
31. Effect of Speculation
⇒ Speculation alleviates the synchronization bottleneck
32. Splitting
- Independent scheduling of task splits ⇒ flexible mapping
- Splitting improves load balance and throughput
33. Conclusion
- 4 task transformations:
- Replication is key to parallelism
- Early signalling reduces processing latency
- Splitting improves load balance
- Speculation alleviates the synchronization bottleneck
- Most powerful combination: replication and speculation
⇒ Compiler techniques for scalable resource and task management
⇒ Use NP hardware in conjunction with the programming model
⇒ Methodology and simulation infrastructure to evaluate:
- parallel applications
- multi-PE NPs
34. Future Work
- Micro-architecture of processing engines
- Deeper task transformations
- Parallel overlap of task splits for one packet
- Data layout
- Experiment on real hardware (ASIC/FPGA)
- Integrate heterogeneous processing elements
35. Domo Arigato Gozaimashita (Thank You Very Much)
36. Supplementary Slides
37. Validation
- Generated sequential traces from IXP1200 applications using Nepsim
- Re-played them inside our simulator
38. Contributions
- Automatically scale the throughput of an application to the underlying NP architecture
- Network processing task transformations
- Compilation techniques
- Methodology for measuring and evaluating:
- parallel applications
- multi-PE NPs
- NPIRE infrastructure: an integrated environment for NP research
39. Compiler infrastructure
40. Context switching
41. NP Architecture models
42. Locality transformations
⇒ Locality transforms improve throughput
⇒ Batching incurs traffic burstiness on shared busses
43. Task transformations
44. Thread management
- Context switching
- Priority system on shared busses
- Preemption