Title: Block Design Review: PlanetLab Line Card Header Format
David M. Zar, dzar_at_wustl.edu, http://www.arl.wustl.edu/projects/techX
Revision History
- 10/31/06 (DMZ)
  - Initial Draft
- 11/04/06 (DMZ)
  - Updates for performance issues
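Line Card Centric Overview
[Figure: line card block diagram around the SWITCH. Ingress path: Phy Int Rx -> Key Extract -> Lookup -> Hdr Format -> Port Splitter -> QM/Schd -> Switch Tx. Egress path: Switch Rx -> Key Extract -> Lookup -> Hdr Format -> Port Splitter -> QM/Schd -> Phy Int Tx.]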
- Port Splitter (Ingress and Egress)
  - Accepts packets on a NN ring
  - Based on the physical destination port number
    - 0-4 go to QM1 on a scratch ring
    - 5-9 go to QM2 on a scratch ring
  - Measured delay is about 120 cycles, including memory latency
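The port-to-queue-manager split described above can be sketched in a few lines. This is a minimal Python illustration rather than the microengine code; the function name `select_qm` and the string labels it returns are assumptions made for the example.

```python
def select_qm(dest_port: int) -> str:
    """Pick the queue manager for a physical destination port.

    Ports 0-4 map to QM1 and ports 5-9 map to QM2, as listed above.
    The name select_qm and the string labels are hypothetical.
    """
    if not 0 <= dest_port <= 9:
        raise ValueError(f"physical port out of range: {dest_port}")
    return "QM1" if dest_port <= 4 else "QM2"


if __name__ == "__main__":
    for port in range(10):
        print(port, select_qm(port))
```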
Ingress Header Format
- Microengine Usage
  - One microengine
  - Eight identical threads
  - NN ring input from Lookup
  - NN ring output to Port Splitter
- Main functions
  - Using data from Lookup, modify packet header in DRAM for proper routing to PE
    - Destination MAC address
      - First five bytes are same as source MAC address
    - Source MAC address
      - Address of this LC
    - VLAN tag
  - Adjust pre-queue stats counters
  - Format input data for QM
    - QID
    - Port Number
    - Ethernet Frame Length
LC Ingress Functional Blocks
[Figure: ingress functional blocks (Phy Int Rx, Key Extract, Lookup, Hdr Format, Switch Tx), showing the possible input packet formats and the output packet format.]
MAC Address and VLAN Tag (Ingress)
- The source MAC address is fixed and set at boot time (_WU_get_mac_address)
- The destination MAC address will only differ in the last byte and this byte is obtained from the Lookup data.
- The VLAN tag is obtained from the Lookup data.
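As an illustration of the header rewrite described above, the sketch below assembles an 18-byte VLAN Ethernet header from the fixed source MAC, the destination byte from Lookup, and the VLAN tag. It is a Python model, not the microengine code: `build_vlan_header` and its parameter names are hypothetical, and the IPv4 EtherType default is an assumption.

```python
import struct

VLAN_TPID = 0x8100  # IEEE 802.1Q tag protocol identifier


def build_vlan_header(src_mac: bytes, dst_last_byte: int, vlan_tci: int,
                      ethertype: int = 0x0800) -> bytes:
    """Return the 18-byte VLAN Ethernet header for an ingress packet.

    Hypothetical helper: the real work is done by _WU_write_vlan_header
    in microcode; this only models the resulting byte layout.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    # Destination MAC: first five bytes of the source MAC, last byte from Lookup.
    dst_mac = src_mac[:5] + bytes([dst_last_byte & 0xFF])
    # TPID, VLAN tag and EtherType are each 16 bits, network byte order.
    return dst_mac + src_mac + struct.pack("!HHH", VLAN_TPID, vlan_tci, ethertype)


if __name__ == "__main__":
    header = build_vlan_header(bytes.fromhex("010203040a0b"), 0x02, 0x0002)
    print(header.hex())
```

With a source MAC of 01:02:03:04:0a:0b, a Lookup byte of 0x02 and a VLAN tag of 0x0002, the result is `010203040a02010203040a0b810000020800`, which matches the first 18 bytes of the output frame in the Ingress Validation example.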
Stats/Counters (Ingress/Egress)
- The Stats Index is obtained from the Lookup Data
- The pre-queue packet and byte counters are updated (_WU_update_counters)
  - Packet counter is incremented (atomic SRAM)
  - Byte count is incremented by the number of bytes in the entire Ethernet frame (_WU_get_enet_frame_length).
    - Frame_length = IP_pkt_len + 18
    - 18 is the VLAN Ethernet header length
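The counter update can be modeled as below. This is a sketch under assumptions: the counters live in a Python dict keyed by stats index instead of SRAM, and the field names `pkts` and `bytes` are invented for the example.

```python
VLAN_ENET_HEADER_LEN = 18  # 6 dst MAC + 6 src MAC + 4 VLAN tag + 2 EtherType


def enet_frame_length(ip_pkt_len: int) -> int:
    """Frame length = IP packet length + VLAN Ethernet header length."""
    return ip_pkt_len + VLAN_ENET_HEADER_LEN


def update_counters(counters: dict, stats_index: int, ip_pkt_len: int) -> None:
    """Bump the pre-queue packet and byte counters for one packet.

    The dict stands in for the SRAM counter structure (an assumption).
    """
    entry = counters.setdefault(stats_index, {"pkts": 0, "bytes": 0})
    entry["pkts"] += 1
    entry["bytes"] += enet_frame_length(ip_pkt_len)


if __name__ == "__main__":
    counters = {}
    update_counters(counters, stats_index=3, ip_pkt_len=56)
    print(counters)
```

For a 56-byte IP packet the byte counter grows by 74, the full VLAN Ethernet frame length.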
QM Data Formatting (Ingress and Egress)
- QID is extracted from Lookup data
- Port number is extracted from Lookup data
- Total Ethernet frame length is passed to QM
- Stats index is passed on for post-queue counters
Ingress HF Block Diagram
[Figure: per-thread flow of the ingress Hdr Format block. Stages shown: dl_source(), NN Dequeue, _WU_get_enet_frame_length, _WU_write_vlan_header (DRAM: 4-5 4B writes; 26 cycles), _WU_update_counters (SRAM: 1 read, 1 write; 10 cycles), _WU_update_buffer_descriptor (SRAM: 3 writes; 12 cycles), NN Enqueue, dl_sink(), with "init signal", "wait for prev ctx" and "signal next ctx" hand-offs between threads. Other cycle counts in the figure: 10, 2, 17, 5, 1, 16.]
Total cycles: 33 + 66 = 99
Budget: 1400 MHz / (10 Gb/s / (8 * 90)) = 100.8, i.e. about 100 cycles
Measured latency: 745
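The budget arithmetic can be checked with a short calculation. The sketch below assumes the budget line reads as a 1400 MHz clock divided by the packet rate of a 10 Gb/s link carrying 90-byte packets (8 * 90 bits each); the function name `cycle_budget` is hypothetical, and the stage list is simply the nine cycle counts from the ingress diagram.

```python
def cycle_budget(clock_hz: float, line_rate_bps: float, pkt_bytes: int) -> float:
    """Cycles available per packet at a given line rate and packet size."""
    packets_per_second = line_rate_bps / (8 * pkt_bytes)
    return clock_hz / packets_per_second


# Per-stage cycle counts as they appear in the ingress block diagram.
INGRESS_STAGE_CYCLES = [10, 2, 17, 26, 5, 10, 1, 16, 12]

if __name__ == "__main__":
    budget = cycle_budget(1400e6, 10e9, 90)
    used = sum(INGRESS_STAGE_CYCLES)
    print(f"budget ~ {budget:.1f} cycles, used {used} cycles")
```

The stage counts sum to 99 cycles against a budget of about 100.8, which is why the headroom is so thin.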
Ingress Validation
- Send in non-tunneled packets and check output packets to see they are our internal, tunneled, packets.
  - Worked during development but not tested in integrated system at this point.
- Send in tunneled packets and check output packets to see they are our internal, tunneled, packets.
  - Example input:
    01020304 05060708 090a0b0c 81000aaa 08004500
    00380000 0000ff11 3a61c0a8 0001c0a8 00020001
    00010024 ffbd4500 001c0000 0000ff11 3a7dc0a8
    0001c0a8 00020001 00020008 7e87 6d7ed5be
    (the trailing 6d7ed5be is the CRC that's stripped by RX)
  - Output:
    01020304 0a020102 03040a0b 81000002 08004500
    00380000 0000ff11 3a61c0a8 0001c0a8 00020001
    00010024 ffbd4500 001c0000 0000ff11 3a7dc0a8
    0001c0a8 00020001 00020008 7e87
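A validation check of this kind can be automated. The sketch below parses the example output frame shown above and confirms the properties the Hdr Format block is meant to guarantee: matching first five MAC bytes, a VLAN tag, and a frame length equal to the IP length plus 18. The helper name `parse_vlan_frame` and its dictionary keys are assumptions for illustration.

```python
import struct

# Example output frame copied from the validation slide.
OUTPUT_FRAME_HEX = (
    "01020304 0a020102 03040a0b 81000002 08004500 "
    "00380000 0000ff11 3a61c0a8 0001c0a8 00020001 "
    "00010024 ffbd4500 001c0000 0000ff11 3a7dc0a8 "
    "0001c0a8 00020001 00020008 7e87"
)


def parse_vlan_frame(frame: bytes) -> dict:
    """Split a VLAN-tagged Ethernet frame carrying IPv4 into header fields."""
    tpid, tci, ethertype = struct.unpack("!HHH", frame[12:18])
    # IPv4 total length sits at bytes 2-3 of the IP header (frame offset 20).
    (ip_total_len,) = struct.unpack("!H", frame[20:22])
    return {
        "dst_mac": frame[0:6],
        "src_mac": frame[6:12],
        "tpid": tpid,
        "tci": tci,
        "ethertype": ethertype,
        "ip_total_len": ip_total_len,
    }


if __name__ == "__main__":
    frame = bytes.fromhex(OUTPUT_FRAME_HEX)
    info = parse_vlan_frame(frame)
    assert info["dst_mac"][:5] == info["src_mac"][:5]
    assert info["tpid"] == 0x8100
    assert len(frame) == info["ip_total_len"] + 18
    print("frame OK:", info["dst_mac"].hex(), info["src_mac"].hex(), hex(info["tci"]))
```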
Egress Header Format
- Microengine Usage
  - One microengine
  - Eight identical threads
  - NN ring input from Lookup
  - NN ring output to Port Splitter
- Main functions
  - Using data from Lookup, modify packet header in DRAM for proper routing to Switch
    - Destination MAC address
      - First five bytes are same as source MAC address
      - Destination MAC address is looked up based on IP address from lookup
    - Source MAC address
      - Address of this LC
    - VLAN tag
  - Adjust pre-queue stats counters
  - Format input data for QM
    - QID
    - Port Number
    - Ethernet Frame Length
LC Egress Functional Blocks
[Figure: egress functional blocks (SWITCH, Switch Rx, Key Extract, Lookup, Hdr Format, Phy Int Tx), showing the input packet format and the output packet format.]
MAC Address and VLAN Tag (Egress)
- The source MAC address is fixed and set at boot time (_WU_get_mac_address)
- The destination MAC address will only differ in the last nibble and this nibble is obtained from the Lookup data.
  - _WU_ip_lookup will take 32 bits from the destination IP address and use the local CAM to obtain the least significant 4 bits of the MAC address.
  - The CAM state bits are used for this, which is why only 4 bits of data are returned
- The VLAN tag is obtained from the Lookup data.
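The egress destination MAC derivation can be modeled as a table lookup followed by a nibble substitution. In this sketch a Python dict stands in for the local CAM, `egress_dst_mac` is a hypothetical name, and a CAM miss simply raises `KeyError` because miss handling is not described here.

```python
import ipaddress


def egress_dst_mac(src_mac: bytes, dst_ip: str, cam: dict) -> bytes:
    """Derive the destination MAC from the source MAC and a 4-bit CAM result.

    cam maps a 32-bit destination IP (as int) to the low nibble of the MAC;
    the dict is a stand-in for the microengine CAM and its state bits.
    """
    key = int(ipaddress.IPv4Address(dst_ip))
    nibble = cam[key] & 0xF  # only 4 bits of data come back from the CAM
    last_byte = (src_mac[5] & 0xF0) | nibble
    return src_mac[:5] + bytes([last_byte])


if __name__ == "__main__":
    cam = {int(ipaddress.IPv4Address("192.168.0.2")): 0x2}
    mac = egress_dst_mac(bytes.fromhex("010203040a0b"), "192.168.0.2", cam)
    print(mac.hex())
```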
Egress HF Block Diagram
[Figure: per-thread flow of the egress Hdr Format block. Stages shown: dl_source(), NN Dequeue, _WU_get_enet_frame_length, _WU_ip_lookup, _WU_write_vlan_header (DRAM: 1 4B read, 4 4B writes; 32 cycles), _WU_update_counters (SRAM: 1 add, 1 incr; 6 cycles), _WU_update_buffer_descriptor (SRAM: 3 writes; 10 cycles), NN Enqueue, dl_sink(), with "init signal", "wait for prev ctx" and "signal next ctx" hand-offs between threads. Other cycle counts in the figure: 10, 1, 2, 1, 2, 1.]
Total cycles: 65
Measured latency: 660
Egress Validation
- Send in our internal, tunneled packets and check output packets to see they are our valid IP, tunneled, packets.
  - For the PlanetLab demo, there are no non-tunneled output packets
- Check packet and byte counters for valid updates
- Check CAM for proper initialization (data watch)
HF Initialization (Ingress/Egress)
- All memory locations defined in dl_system.h
  - Base address for HF
    - LCI/E_HF_SRAM_INIT_BASE
      - MAC_ADDR_HI32
      - MAC_ADDR_LO16
  - Pre-Queue Counters
    - LCI/E_LU_COUNTERS_SRAM_INIT_BASE
    - LCI/E_LU_PRE_Q_PKT_CNT_OFFSET: offset into counters structure for packet counter
    - LCI/E_LU_PRE_Q_BYTE_CNT_OFFSET: offset into counters structure for byte counter
- Thread 0 waits for signal from rx
- For Egress, the CAM is filled (_WU_hfe_initialize_ip_lookup) with data from LCE_HF_SRAM_INIT_BASE + 8
  - each entry is 64 bits: cam_entry (32b), RSVD (28b), MAC_DEST (4b)
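The 64-bit CAM initialization entry can be packed and unpacked as shown below. This sketch assumes the fields are laid out most significant first in the order listed (cam_entry in the top 32 bits, MAC_DEST in the bottom 4) and stored big-endian; neither the bit order nor the byte order is stated in the slides, and the function names are hypothetical.

```python
import struct


def pack_cam_entry(cam_entry: int, mac_dest: int) -> bytes:
    """Pack one init entry: cam_entry (32b) | RSVD (28b, zero) | MAC_DEST (4b).

    Field order and big-endian storage are assumptions for illustration.
    """
    word = ((cam_entry & 0xFFFFFFFF) << 32) | (mac_dest & 0xF)
    return struct.pack("!Q", word)


def unpack_cam_entry(raw: bytes) -> tuple:
    """Return (cam_entry, mac_dest) from an 8-byte init entry."""
    (word,) = struct.unpack("!Q", raw)
    return word >> 32, word & 0xF


if __name__ == "__main__":
    raw = pack_cam_entry(0xC0A80002, 0x2)  # 192.168.0.2 -> MAC nibble 2
    print(raw.hex(), unpack_cam_entry(raw))
```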
File Locations (Ingress and Egress)
- Main code
  - Applications/LC_Ingress/src/hdr_format/PL/hdr_format.uc
  - Applications/LC_Egress/src/hdr_format/PL/hdr_format.uc
- Library
  - library/DataPlane/hdr_format_util.uc
Required Includes (Ingress and Egress)
- Files
  - build/PL/dispatch_loop/dl_system.h
    - memory locations
  - IXA_SDK_4.0/src/library/microblocks_library/
    - dl_meta for metadata macros
  - IXA_SDK_4.0/src/library/dataplane_library/
    - dram for DRAM read/write macros
    - sram for SRAM read/write/add/incr macros
    - xbuf for transfer buffer macros
Performance Issues
Ingress Performance Anomalies
Ingress Anomalies (Explanation)
- The SRAM Controllers have a command FIFO
- These bus arbiters are shared across all memory interfaces
Ingress/Egress SRAM Issues
- It seems that using atomic ADD/INCR instructions is expensive at the SRAM controller
  - If I remove them and read the SRAM, add myself, write the SRAM, this is quicker and consumes less of the SRAM controller time and, thus, the command queue never backs up.
  - In this new design, there are more instructions executed, but there may be a few I could eliminate with some optimizing of code.
  - No stalling in the WU microblocks (well, QM does, and RX and TX still do, but these look normal).
Ingress/Egress Performance
- 99 CPU cycles
- 745 cycles latency
- Expected performance
  - Should have no trouble going at 10 Gb/s but does
- Simulated performance (as of 11/06/2006)
  - 10 Gb/s
  - With all other microengines in place (i.e. real simulation)
Ingress/Egress Future Work
- Determine source of I/O stalls
- Update Stubs projects for validation of Ingress/Egress blocks (done for Ingress)
- Extend both blocks for all possible packet formats
  - Ingress inputs
  - Egress outputs
- Possible instruction optimization to give a little headroom (99 cycles out of 100).
Currently, the design will not work for standard IPv4 packets; PlanetLab VLAN packets are OK.