1. for Network Processors
Martin Labrecque, Gregory Steffan
University of Toronto
NPC Conference, October 3rd 2006, Tokyo, Japan
2. Importance of Network Processing
- Packets require processing
- Processing increasingly advanced
- VoIP / Digital TV
- XML e-commerce
- Personalized content
3. What is a Network Processor?
- Software programmable IC
- Processes streaming data
- Complex parallel architecture
- On-chip resources ⇒ low-level programming
⇒ Programming NPs is complex and expensive
4. How to Effectively Exploit an NP?
- Scale the app. to more processing elements
- Fixed architecture or lack of compiler support
- Nepal, NP-Click and Nepsim
- Need real parallel communicating programs
- NetBench, NPBench, CommBench
- CRC, MD5, RED are micro-kernels
- Mold the app. to find per-application scaling limits
- Shangri-La (Chen): packet processing functions
- Intel (Li, Dai): partitioning, multithreading
⇒ Need flexibility to explore the architecture space
⇒ Need an automated framework to program NPs
5. Our Objectives
- Investigate issues in compiling multithreaded apps
- From a high-level language, automatically:
- Transform code
- Allocate memory, signals, threads
- Compiler scales throughput on NP hardware
- Evaluate in a parameterized NP environment
⇒ Applications drive the NP architecture design
6. System Overview / Talk Outline
7. Programming Model
- Task graphs from the Click Modular Router
- Advantages:
- Compose an app. by connecting elements
- Results in a sequential task graph
- Easier analysis than loosely written C
- Click is widely accepted (StepNP, CUSP)
⇒ Allows for memory and task management
8. Execution of Task Graph
- Task graph pipelines work across packets
- Our apps have unordered dependences between packets
- Conditional processing ⇒ variable latencies
- Assignment of tasks to PEs is complex
9. System Overview / Talk Outline
10. Managing Tasks
- Scale throughput to a larger number of processors
- Evaluated transformations:
- Replication
- Splitting
- Early signalling
- Speculation
11. 1) Replication
- Replicas of a task can execute concurrently
- Distribute the work of packets across replicas
- Assumptions on ordered/unordered dependences
⇒ Replication is key to introducing parallelism
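As a minimal sketch of distributing the work of packets across replicas (the thread pool, process_packet body, and counter are our illustration, not the talk's compiler output):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Shared counter standing in for real per-packet work.
std::atomic<int> packets_done{0};

// Hypothetical task body applied to every packet.
void process_packet(int /*packet_id*/) {
    packets_done.fetch_add(1, std::memory_order_relaxed);
}

// Run 'replicas' concurrent copies of the same task; packet i is
// handled by replica i % replicas, so the replicas split the stream.
void run_replicated(int replicas, int num_packets) {
    std::vector<std::thread> pool;
    for (int r = 0; r < replicas; ++r)
        pool.emplace_back([=] {
            for (int p = r; p < num_packets; p += replicas)
                process_packet(p);
        });
    for (auto& t : pool) t.join();
}
```

With unordered dependences between packets, any replica can take any packet; an ordered variant would need the synchronization discussed next.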
12. Synchronization
- Requires dependence identification
- Managed automatically by the compiler

    class PacketCounter {
      int packet_cnt;
      PacketCounter(packet p) {
        packet_cnt = packet_cnt + 1;
      }
    }
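A sketch of the synchronization the compiler could insert for an element like PacketCounter, assuming one lock per element instance (the mutex placement and method names are our assumption, not the talk's actual codegen):

```cpp
#include <cassert>
#include <mutex>
#include <thread>

class PacketCounter {
public:
    int packet_cnt = 0;
    void handle_packet() {
        std::lock_guard<std::mutex> guard(lock_);  // compiler-inserted
        packet_cnt = packet_cnt + 1;               // original element body
    }
private:
    std::mutex lock_;                              // compiler-allocated
};
```

The element author writes only the unsynchronized body; the dependence on packet_cnt is what the compiler must identify across replicas.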
13. 2) Early Task Signalling
[Animation over slides 13–15: the sequential task graph A → B → C → D is reordered to A → C → D; since B only has a dependence with D, D waits for B's signal and resumes once B triggers it]
- Execute work units as early as possible
- Requires dependence profiling
- Announce/wait-for/trigger scheme
⇒ Reduces packet processing latency
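The announce/wait-for/trigger scheme can be sketched with a one-shot signal (the Signal class and its method names are our assumption, not the talk's API): only the consumer that actually depends on the producer blocks, while independent tasks run ahead.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-shot signal: a producer task announces completion of the work a
// consumer depends on; only that consumer waits.
class Signal {
public:
    void announce() {                        // producer side (task B)
        std::lock_guard<std::mutex> g(m_);
        ready_ = true;
        cv_.notify_all();
    }
    void wait_for() {                        // consumer side (task D)
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return ready_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool ready_ = false;
};
```

In the A/B/C/D example, C carries no such signal against B at all, which is what dependence profiling establishes.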
16. 3) Task Splitting
- Divide a task into sub-tasks
- Allows scheduling of splits on different PEs
- Parallel overlap of task splits not supported
- Challenges addressed:
- How much splitting
- Which tasks
- Where to split
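One way to picture a split (the checksum task and split point are our example, not from the talk): each sub-task can be mapped to a different PE; for a single packet the splits still run in sequence, but different packets can occupy different splits at the same time.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct Packet {
    std::vector<long> words;
    long acc = 0;  // state carried from one split to the next
};

// First half of the original task.
void split_a(Packet& p) {
    p.acc = std::accumulate(p.words.begin(),
                            p.words.begin() + p.words.size() / 2, 0L);
}

// Second half, resuming where split_a left off.
void split_b(Packet& p) {
    p.acc += std::accumulate(p.words.begin() + p.words.size() / 2,
                             p.words.end(), 0L);
}
```

The "where to split" challenge corresponds to choosing the cut so the carried state (here, just acc) stays small.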
17. 4) Speculation
- Entry into a synchronized section triggers local buffering of all writes
- A violation rolls back to the section entry
- Challenges addressed:
- Rollback mechanism
- Commit order
- Context eviction
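A minimal sketch of the write buffering described above (the buffer layout and method names are ours, not the talk's mechanism): writes inside the section go to a local buffer, a violation discards it, and a successful exit commits it to shared state.

```cpp
#include <cassert>
#include <map>

class SpeculativeSection {
public:
    explicit SpeculativeSection(std::map<int, long>& shared)
        : shared_(shared) {}
    void write(int addr, long value) { buffer_[addr] = value; }  // buffered
    long read(int addr) {
        auto it = buffer_.find(addr);        // see our own writes first
        return it != buffer_.end() ? it->second : shared_[addr];
    }
    void commit() {                          // no violation: publish writes
        for (auto& kv : buffer_) shared_[kv.first] = kv.second;
        buffer_.clear();
    }
    void rollback() { buffer_.clear(); }     // violation: back to entry
private:
    std::map<int, long>& shared_;
    std::map<int, long> buffer_;
};
```

Commit order and context eviction then become policies over when commit() may run and when a buffered context may be dropped.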
18. System Overview / Talk Outline
19. Managing Threads
- Task mapping
- Thread scheduling
- Thread context switching on a PE
- Priority system for requests on shared busses
- Thread preemption
⇒ Run-time techniques to maximize NP utilization
20. System Overview / Talk Outline
21. Managing Memory
- Dependence identification ⇒ synchronization
- Batching:
- Group memory accesses into a single request
- Perform this request ahead of time
- Forwarding:
- Forward memory buffers between tasks
- Decreases off-chip memory traffic
⇒ Programming model eases automated parallelization
⇒ Compiler can target wide memory busses
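Batching can be sketched as follows (the Memory struct and request counter are our illustration): a task's known addresses travel in one wide request issued ahead of time, instead of one off-chip transaction per word.

```cpp
#include <cassert>
#include <vector>

struct Memory {
    std::vector<long> words;   // backing store
    int requests = 0;          // off-chip transactions issued

    // One word per request: the unbatched baseline.
    long read(int addr) { ++requests; return words[addr]; }

    // Batched: all addresses are gathered into a single wide request.
    std::vector<long> batched_read(const std::vector<int>& addrs) {
        ++requests;
        std::vector<long> out;
        for (int a : addrs) out.push_back(words[a]);
        return out;
    }
};
```

Issuing the batch ahead of time overlaps the single round trip with computation; the cost, noted in the supplementary slides, is burstier traffic on shared busses.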
22. IXP2800
23. Separation of Memory Types
[Diagram: tasks 1..N each access the packet stream via push/pull(), a persistent heap, instructions, static and dynamic data structures, and a per-task execution context (stack, temporary heap)]
⇒ Use memory types to manage dependences
⇒ Use memory types to ease memory mapping
24. System Overview / Talk Outline
25. Simulated Architecture
- RISC, single-CPI processing elements
- Non-blocking memory instructions
26. Methodology
- Vary the number of PEs ⇒ vary the computation/memory bandwidth ratio
- Combine task/thread/memory management configurations
27. Packet Input Rate
⇒ Measure throughput at saturation
⇒ Adapt the configuration for representative measurements
28. Simulation
- Intel IXP processor family architectural parameters
- NLANR packet traces
- Header processing applications:
- Standards-compliant router
- Network Address Translation (adapted from Click)
29. Effect of Replication
⇒ What is limiting scaling?
30. Bottleneck Identification
- Imperfect overlap of latencies with computation
- Idle CPU time
- On-chip components ⇒ bottleneck
31. Effect of Speculation
⇒ Speculation alleviates the synchronization bottleneck
32. Splitting
- Independent scheduling of task splits ⇒ flexible mapping
- Splitting improves load balance and throughput
33. Conclusion
- 4 task transformations:
- Replication is key to parallelism
- Early signalling reduces processing latency
- Splitting improves load balance
- Speculation alleviates the synchronization bottleneck
- Most powerful combination: replication and speculation
⇒ Compiler techniques for scalable resource and task management
⇒ Use NP hardware in conjunction with the programming model
⇒ Methodology and simulation infrastructure to evaluate:
- parallel applications
- multi-PE NPs
34. Future Work
- Micro-architecture of processing engines
- Deeper task transformations
- Parallel overlap of task splits for one packet
- Data layout
- Experiment on real hardware (ASIC/FPGA)
- Integrate heterogeneous processing elements
35. Domo Arigato Gozaimashita (Thank You Very Much)
36. Supplementary Slides
37. Validation
- Generated sequential traces from IXP1200 applications using Nepsim
- Re-played them inside our simulator
38. Contributions
- Automatically scale the throughput of an application to the underlying NP architecture
- Network processing task transformations
- Compilation techniques
- Methodology for measuring and evaluating:
- parallel applications
- multi-PE NPs
- NPIRE infrastructure: an integrated environment for NP research
39. Compiler infrastructure
40. Context switching
41. NP Architecture models
42. Locality transformations
⇒ Locality transforms improve throughput
⇒ Batching incurs traffic burstiness on shared busses
43. Task transformations
44. Thread management
- Context switching
- Priority system on shared busses
- Preemption