Title: A Distributed Scalable Multithreaded Network Processor
1. A Distributed Scalable Multi-threaded Network Processor
- Behnam Robatmili
- University of Tehran
2. Outline
- Introduction
- Design Principle
- Architecture
  - Thread Unit
  - Decode Issue
  - Functional Units
  - Switches
- Software Model
- Conclusions
3. Introduction
Network processing is pushed to its limits by:
- Higher line speeds
- More complicated and flexible protocols
4. The Best Solution
- Application-Specific Instruction-set Processors (ASIPs), or Special-Purpose Processors (SPPs), called network processors.
- This means using special-purpose processors that are optimized for specific tasks.
- The best-known examples are DSPs and I/O processors.
5. Comparison between different implementation methods
6. Comparison between different implementation technologies
7. Design Principle
- Designing a processor consists of designing its instruction set and its architecture (the data path and the controller).
- Both the software and the hardware properties of such a device are very important.
- One must first study the processing services required in networking.
8. Instruction Set Issues
- Pattern Matching: prefix matching or exact matching, which is used frequently in lookup, classification, and most other parts of routers.
- Lookup: hash tables and tree- or trie-based lookup, which are needed in forwarding engines and many other parts.
- Computation: calculating CRCs and checksums, which is needed for the lower-layer processing that must be done efficiently on every input packet.
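As a rough illustration of the pattern-matching/lookup and computation services above, the sketch below shows a longest-prefix-match lookup over a binary (one bit per level) trie and a 16-bit one's-complement checksum in the style of RFC 1071. It is a minimal software model, not how a network processor would implement these in hardware; the class and function names (`PrefixTrie`, `internet_checksum`) and the next-hop labels are hypothetical.

```python
import ipaddress
from typing import Optional


class TrieNode:
    """One node of a binary trie over IPv4 address bits (hypothetical name)."""
    __slots__ = ("children", "next_hop")

    def __init__(self) -> None:
        self.children = [None, None]              # child for bit 0 and bit 1
        self.next_hop: Optional[str] = None       # set if a prefix ends here


class PrefixTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.IPv4Network(prefix)
        addr = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1          # walk from the most significant bit
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, address: str) -> Optional[str]:
        addr = int(ipaddress.IPv4Address(address))
        node, best = self.root, self.root.next_hop  # root holds a /0 default route
        for i in range(32):
            node = node.children[(addr >> (31 - i)) & 1]
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop              # remember the longest match so far
        return best


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # add the next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF
```

The trie walks at most 32 levels per lookup, which is why hardware designs often add multibit strides or dedicated lookup instructions; the checksum loop is the kind of tight arithmetic the "Computation" item asks the instruction set to accelerate.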
9. Instruction Set Issues (2)
- Data Manipulation: bit-wise and byte-wise access to data for extracting protocol fields. It is used in parsing packets and implementing protocols.
- Queuing: packet queuing and Quality of Service. Input or output packets must be queued before they receive their required services.
- Control Processing: managing tables (route tables or lookup tables) and using timers for implementing protocols and QoS.
- Segmentation and Reassembly: often a large packet must be broken into smaller pieces of data called cells (like ATM cells), and processing must then be performed on these cells. In many routers these actions may be needed in the fast path, before the switch node.
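To make the data-manipulation and segmentation items concrete, here is a minimal sketch that pulls a few IPv4 header fields out with byte and bit operations, then splits a packet into fixed 48-byte cell payloads (the ATM cell payload size) and reassembles it. It simplifies heavily: real AAL5 segmentation adds a trailer carrying the length and a CRC-32, whereas here the original length is passed in separately. Function names are hypothetical.

```python
import struct

CELL_PAYLOAD = 48  # ATM cells carry a 48-byte payload (plus a 5-byte header)


def parse_ipv4_fields(header: bytes) -> dict:
    """Extract a few IPv4 header fields with byte- and bit-wise operations."""
    ver_ihl, _tos, total_len = struct.unpack_from("!BBH", header, 0)
    ttl, proto = struct.unpack_from("!BB", header, 8)
    src, dst = struct.unpack_from("!4s4s", header, 12)
    return {
        "version": ver_ihl >> 4,       # high nibble of byte 0
        "ihl": ver_ihl & 0x0F,         # low nibble, counted in 32-bit words
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }


def segment(packet: bytes) -> list:
    """Split a packet into 48-byte cell payloads, zero-padding the last one."""
    cells = []
    for off in range(0, len(packet), CELL_PAYLOAD):
        chunk = packet[off:off + CELL_PAYLOAD]
        cells.append(chunk.ljust(CELL_PAYLOAD, b"\x00"))
    return cells


def reassemble(cells: list, length: int) -> bytes:
    """Concatenate cell payloads and strip the padding (length known here)."""
    return b"".join(cells)[:length]
```

For example, a 100-byte packet becomes three cells (48 + 48 + 4 bytes of data, the last padded to 48), and `reassemble(cells, 100)` recovers the original bytes.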
10. Intel IXP Additional Instructions
- Simple RISC cores augmented with:
  - FIND_BSET: finds the first bit set in any 16-bit register field.
  - HASHx_64: performs a 64-bit hash.
  - CTX_ARB: swaps out the current context and wakes it up on the events given as arguments.
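The following sketch models what an instruction like FIND_BSET computes, to show why it is worth a dedicated instruction (for example, scanning a 16-bit bitmap of non-empty queues in one step). It is an assumption-laden software model, not Intel's definition: it assumes a 32-bit register, a `field` selector choosing the low or high 16-bit half, and that "first" means the lowest-order set bit. The function name `find_bset` is hypothetical.

```python
from typing import Optional


def find_bset(reg: int, field: int) -> Optional[int]:
    """Hypothetical model of FIND_BSET on a 32-bit register.

    field selects the 16-bit half to search: 0 = bits 0-15, 1 = bits 16-31.
    Returns the position (0-15) of the lowest-order set bit in that field,
    or None if the field is all zeros.
    """
    if field not in (0, 1):
        raise ValueError("field must be 0 or 1")
    value = (reg >> (16 * field)) & 0xFFFF
    if value == 0:
        return None
    # value & -value isolates the lowest set bit; bit_length() is its index + 1
    return (value & -value).bit_length() - 1
```

In software this takes a handful of operations; a single-cycle instruction removes that cost from every scheduling decision in the fast path.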
11. Miscellaneous Example Instructions
- Port-oriented bit-wise move instructions, like move port1_at_bit1 port2_at_bit2
- Block transfer instructions to solve the problems associated with DMA
- Compute and jump, like subb r1 1h L1
- FSM support and thread-switch support
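The slide names these instructions without giving their semantics, so the sketch below is only one plausible reading, labeled as such. It assumes that the bit-wise move copies a single bit from a source port/register position into a destination bit position, and that `subb r1 1h L1` means "subtract 1h from r1 and jump to L1 unless the result is zero" (a decrement-and-branch loop instruction) on an assumed 8-bit register. The names `move_bit` and `run_subb_loop` are hypothetical.

```python
def move_bit(src: int, src_bit: int, dst: int, dst_bit: int) -> int:
    """Copy bit src_bit of src into bit dst_bit of dst; return the new dst."""
    bit = (src >> src_bit) & 1
    return (dst & ~(1 << dst_bit)) | (bit << dst_bit)


def run_subb_loop(r1: int) -> int:
    """Assumed reading of 'subb r1 1h L1' as decrement-and-branch-if-not-zero.

    Returns how many times the loop body at label L1 executes.
    """
    iterations = 0
    while True:
        iterations += 1           # loop body at L1
        r1 = (r1 - 1) & 0xFF      # subtract 1h, assuming an 8-bit register
        if r1 == 0:               # result zero: fall through
            break
        # result non-zero: jump back to L1
    return iterations
```

Under this reading, one instruction replaces the usual subtract, compare, and branch sequence in a counted loop, and `move_bit` replaces a shift, mask, and OR sequence.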
12. Architecture Issues
- GOLDEN RULE
  - In a packet processing system, most of the processing performed on a packet is independent of other packets.
- NP architectures can exploit parallelism at the instruction level (ILP) and at the packet level (PLP).
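The golden rule is what makes packet-level parallelism possible: if per-packet work touches no shared state, packets can be handed to any worker in any order. The sketch below shows that structure with a thread pool; `process_packet` and the packet dictionaries are hypothetical stand-ins. Note that in CPython threads do not give true CPU parallelism for pure-Python work, so this illustrates the programming model rather than the speedup a network processor obtains.

```python
from concurrent.futures import ThreadPoolExecutor


def process_packet(packet: dict) -> dict:
    """Per-packet work that reads and writes only this packet (no shared state)."""
    return {"id": packet["id"], "ttl": packet["ttl"] - 1}


def process_all(packets: list, workers: int = 4) -> list:
    # Packets are independent, so workers may process them in any order;
    # Executor.map still returns the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_packet, packets))
```

The result is identical to processing the packets one at a time, which is exactly the property that lets hardware spread independent packets across threads or cores.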
13. Architecture Example (Agere)
14. Architecture Example (TOP Core)
15. Architecture Example (Intel IXP)
16. Design Architecture
A Simultaneous Multithreading (SMT) Processor
17. Architecture
- In this processor:
  - The Thread Unit holds a separate register file, program counter, stack, and status register for each thread.
  - The instruction cache and functional units are shared between hardware threads.
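The split between per-thread state and shared resources can be modeled in a few lines. In this sketch each `ThreadContext` owns its registers, PC, stack, and status word, while all threads share one table of functional units; each cycle up to `issue_width` instructions are issued from different threads, with the starting thread rotated for fairness. The instruction format `(op, dst, src_reg, immediate)`, the 8-register file, the issue width of 2, and all names are assumptions for illustration, not the processor's actual parameters.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """Per-thread state held in the Thread Unit (one copy per hardware thread)."""
    tid: int
    program: list                 # list of (op, dst, src_reg, immediate) tuples
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 8)
    stack: list = field(default_factory=list)
    status: int = 0

    def done(self) -> bool:
        return self.pc >= len(self.program)


# Shared functional units: every thread issues to the same units.
SHARED_UNITS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}


def run_smt(threads: list, issue_width: int = 2) -> int:
    """Issue up to issue_width instructions per cycle, each from a different thread."""
    cycles = 0
    while not all(t.done() for t in threads):
        ready = [t for t in threads if not t.done()]
        k = cycles % len(ready)                   # rotate priority each cycle
        for t in (ready[k:] + ready[:k])[:issue_width]:
            op, dst, src, imm = t.program[t.pc]
            t.regs[dst] = SHARED_UNITS[op](t.regs[src], imm)  # private registers
            t.pc += 1                                         # private PC
        cycles += 1
    return cycles
```

With three threads of two instructions each and an issue width of 2, the six instructions complete in three cycles, and no thread can see another thread's registers.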
18. Thread Unit
Exploits instruction-level parallelism within each thread and holds separate state for each thread.
19. Decode Issue
20. Functional Units
21. Dispatcher / Collector
- A simple switch fabric to exploit thread-level parallelism (TLP) among the threads in the system.
- Accesses and controls the functional units.
- Unique input frames:
  - Dispatcher: Instruction Frame
  - Collector: Result Frame
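One way to picture the dispatcher and collector is as a switch that routes tagged frames: an instruction frame carries the issuing thread's ID and destination register, and the matching result frame carries the same tags back so the collector can write into the right thread's register file. The sketch below assumes that frame layout; the field names, the two example units, and `dispatch_and_collect` are hypothetical, since the slides only name the frame types.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionFrame:
    tid: int      # thread that issued the instruction
    dst: int      # destination register in that thread's register file
    unit: str     # functional unit the dispatcher routes to
    a: int
    b: int


@dataclass(frozen=True)
class ResultFrame:
    tid: int
    dst: int
    value: int


UNITS = {"alu": lambda a, b: a + b, "shift": lambda a, b: a << b}


def dispatch_and_collect(frames, register_files):
    """Route instruction frames to units, then write result frames back."""
    queues = {name: deque() for name in UNITS}
    for f in frames:                      # dispatcher: switch on the unit field
        queues[f.unit].append(f)
    results = []
    for name, q in queues.items():        # each unit drains its own input queue
        while q:
            f = q.popleft()
            results.append(ResultFrame(f.tid, f.dst, UNITS[name](f.a, f.b)))
    for r in results:                     # collector: tid tag selects the thread
        register_files[r.tid][r.dst] = r.value
    return results
```

The thread ID and destination register travel with every frame, which is the extra information the architecture has to send and receive in exchange for sharing the functional units.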
22. Processor Stages
23. Software Model
Different packet processing operations run in different static threads. These threads communicate with each other using Queue Units.
24. Sample Scenario
- T1 reads packets from the NIC, stores them in the data memory, and writes to Q1.
- T2 dequeues from Q1, performs an IP lookup on the packet, and puts it in Q2.
- T3 dequeues from Q2 and sends the packet to the switch fabric interface (IOU).
- T4 reads packets from the fabric, stores them in the packet storage memory, and writes to Q3.
- T5 dequeues from Q3, applies QoS, and, if the packet should be sent, sends it to the NIC.
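The scenario maps naturally onto static threads connected by FIFO queues standing in for the Queue Units. The sketch below is a software model under several assumptions: the NIC input is a list of packet dictionaries, the switch fabric is a queue that simply loops packets back to T4, the routing table `ROUTES` is hypothetical, and T5's "QoS" is a toy policy that drops unroutable packets and packets with TTL of 1 or less. A sentinel object shuts each stage down in order.

```python
import queue
import threading

STOP = object()  # sentinel passed down the pipeline to stop each thread

ROUTES = {"10.0.0.1": 1, "10.0.0.2": 2}  # hypothetical lookup table: dst -> port


def run_scenario(nic_rx):
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    fabric = queue.Queue()   # stands in for the switch fabric interface (IOU)
    nic_tx = []              # packets finally sent out through the NIC

    def t1():  # read packets from the NIC, store them, write to Q1
        for pkt in nic_rx:
            q1.put(dict(pkt))
        q1.put(STOP)

    def t2():  # dequeue Q1, perform the IP lookup, put in Q2
        while (pkt := q1.get()) is not STOP:
            pkt["port"] = ROUTES.get(pkt["dst"])
            q2.put(pkt)
        q2.put(STOP)

    def t3():  # dequeue Q2, send to the switch fabric
        while (pkt := q2.get()) is not STOP:
            fabric.put(pkt)
        fabric.put(STOP)

    def t4():  # read from the fabric, store, write to Q3
        while (pkt := fabric.get()) is not STOP:
            q3.put(pkt)
        q3.put(STOP)

    def t5():  # dequeue Q3, apply a toy QoS policy, send to the NIC if allowed
        while (pkt := q3.get()) is not STOP:
            if pkt["port"] is not None and pkt["ttl"] > 1:
                nic_tx.append(pkt)

    threads = [threading.Thread(target=f) for f in (t1, t2, t3, t4, t5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return nic_tx
```

Because each queue has exactly one producer and one consumer, packet order is preserved end to end, and each stage can run on its own hardware thread without locks beyond those inside the queues.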
25. Conclusions
- Advantages
  - Flexible
  - Distributed
  - Performance
- Disadvantages
  - Additional information must be sent and received