for Network Processors - PowerPoint PPT Presentation

1
for Network Processors
Martin Labrecque Gregory Steffan University of
Toronto NPC Conference, October 3rd 2006 Tokyo,
Japan
2
Importance of Network Processing
  • Packets require processing
  • Processing increasingly advanced
  • VoIP / Digital TV
  • XML e-commerce
  • Personalized content

3
What is a Network Processor?
  • Software programmable IC
  • Processes streaming data
  • Complex parallel architecture
  • On-chip resources → low-level programming

→ Programming NPs is complex and expensive
4
How to Effectively Exploit an NP?
  • Scale app. to more processing elements
  • Fixed architecture or lack of compiler support
  • Nepal, NP-Click and Nepsim
  • Need real parallel communicating programs
  • NetBench, NPBench, CommBench
  • CRC, MD5, RED are micro-kernels
  • Mold app. to find per-application scaling limits
  • Shangri-la Chen, packet processing functions
  • Intel Li, Dai, partitioning, multithreading

→ Need flexibility to explore the architecture space
→ Need an automated framework to program NPs
5
Our Objectives
  • Investigate issues in compiling multithreaded
    apps
  • From a high-level language, automatically
  • Transform code
  • Allocate memory, signals, threads
  • Compiler scales throughput on NP hardware
  • Evaluate in parameterized NP environment

→ Applications drive the NP architecture design
6
System Overview / Talk Outline
7
Programming Model
  • Task graphs from the Click Modular Router
  • Advantages
  • Compose app. by connecting elements
  • Results in a sequential task graph
  • Easier analysis than loosely written C
  • Click widely accepted (StepNP, CUSP)

→ Allows for memory and task management
8
Execution of Task Graph
  • Task graph pipelines work across packets
  • Our apps have unordered dependences between packets
  • Conditional processing → variable latencies
  • Assignment of tasks to PEs is complex

9
System Overview / Talk Outline
10
Managing Tasks
  • Scale throughput to a larger number of processors
  • Evaluated transformations
  • Replication
  • Splitting
  • Early signalling
  • Speculation

11
1) Replication
  • Replicas of a task can execute concurrently
  • Distribute work of packets on replicas
  • Assumptions on ordered/unordered dependences

→ Replication is key to introducing parallelism
12
Synchronization
  • Requires dependence identification
  • Managed automatically by the compiler

class PacketCounter {
  int packet_cnt;
  PacketCounter(packet p) {
    ...
    packet_cnt = packet_cnt + 1;
    ...
  }
};
13
2) Early Task Signalling
  • Normal
  • execution

A
B
C
D
  • Execute work units as early as possible
  • Requires dependence profiling
  • Announce/wait for/trigger scheme

→ Reduces packet processing latency
14
2) Early Task Signalling
  • Normal
  • execution

A
B
C
D
  • Case of early signalling

A
C
D
B
B only has a dependence with D
  • Execute work units as early as possible
  • Requires dependence profiling
  • Announce/wait for/trigger scheme

→ Reduces packet processing latency
15
2) Early Task Signalling
  • Normal
  • execution

A
B
C
D
  • Case of early signalling

A
C
D
Wait
resume
B
B only has a dependence with D
  • Execute work units as early as possible
  • Requires dependence profiling
  • Announce/wait for/trigger scheme

→ Reduces packet processing latency
16
3) Task Splitting
  • Divide a task into sub-tasks
  • Allows scheduling splits on different PEs
  • Parallel overlap of task splits not supported
  • Challenges addressed
  • how much splitting
  • which tasks
  • where to split

17
4) Speculation
  • Entry in a synchronized section triggers local
    buffering of all writes
  • A violation rolls back to the section entry
  • Challenges addressed
  • Rollback mechanism
  • Commit order
  • Context eviction

18
System Overview / Talk Outline
19
Managing Threads
  • Task mapping
  • Thread scheduling
  • Thread context switching on PE
  • Priority system of requests on shared busses
  • Thread preemption

→ Run-time techniques to maximize NP utilization
20
System Overview / Talk Outline
21
Managing Memory
  • Dependence identification → synchronization
  • Batching
  • Group memory accesses in a single request
  • Perform this request ahead of time
  • Forwarding
  • Forward memory buffers between tasks
  • Decreases off-chip memory traffic

→ Prog. model eases automated parallelisation
→ Compiler can target wide memory busses
22
IXP2800
23
Separation of Memory Types
[Diagram: per-task memory separation, Task 1 ... Task N]
  • Instructions
  • Persistent heap
  • Static and dynamic data structures
  • Execution context: stack and temporary heap
  • Packet stream exchanged between tasks via push/pull()
→ Use memory types to manage dependences
→ Use memory types to ease memory mapping
24
System Overview / Talk Outline
25
Simulated Architecture
  • RISC, single CPI processing elements
  • Non-blocking memory instructions

26
Methodology
  • Vary the number of PEs → the computation/memory bandwidth
  • Combine task/thread/memory management config.

27
Packet Input Rate
  • → Measure throughput at saturation
  • → Adapt config. for representative measurements

28
Simulation
  • Intel IXP processor family architectural
    parameters
  • NLANR packet traces
  • Header processing applications
  • Standards compliant Router
  • Network Address Translation (adapted from Click)

29
Effect of Replication
  • → What is limiting scaling?

30
Bottleneck Identification
  • Imperfect overlap of latencies with computation
  • Idle CPU time
  • On-chip components → bottleneck

31
Effect of Speculation
  • → Speculation alleviates the synchronization
    bottleneck

32
Splitting
  • Independent schedule of task splits → mapping
  • Splitting improves load balance and throughput

33
Conclusion
  • 4 task transformations
  • Replication key to parallelism
  • Early signaling reduces processing latency
  • Splitting improves load balance
  • Speculation alleviates synchronization
    bottleneck
  • Most powerful combination
  • Replication and speculation
  • → Compiler techniques enable scalable resource and task mgmt.
  • → Use NP HW in conjunction with the programming model
  • → Methodology and simulation infrastructure to evaluate
  • - parallel applications
  • - multi-PE NPs

34
Future Work
  • Micro-architecture of processing engines
  • Deeper task transformations
  • Parallel overlap of task splits for one packet
  • Data layout
  • Experiment on real hardware (ASIC/FPGA)
  • Integrate heterogeneous processing elements

35
Domo Arigato Gozaimashita Thank you
36
Supplementary Slides
37
Validation
  • Generated sequential traces from IXP1200
    applications using Nepsim
  • Re-played them inside our simulator

38
Contributions
  • Automatically scale the throughput of an
    application to the underlying NP architecture
  • Network processing task transformations
  • Compilation techniques
  • Methodology for measuring and evaluating
  • parallel applications
  • multi-PE NPs
  • NPIRE infrastructure
  • an integrated environment for NP research.

39
Compiler infrastructure
40
Context switching
41
NP Architecture models
42
Locality transformations
  • → Locality transforms improve throughput

→ Batching incurs traffic burstiness on shared busses
43
Task transformations
44
Thread management
  • Mapping and Scheduling
  • Context switching
  • Priority system on shared busses
  • Preemption