QM Performance Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
QM Performance Analysis
John DeHart
2
ONL NP Router
[Block diagram of the ONL NP Router. Pipeline blocks: Rx (2 MEs), Mux (1 ME), Parse/Lookup/Copy (3 MEs, with TCAM), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), plus FreeList Mgr (1 ME), Stats (1 ME), five plugins (Plugin0-Plugin4), and the xScale core (3 rings?, with associated-data ZBT-SRAM and exception/error paths). Blocks are connected by NN rings, scratch rings, and small (512W) and large (64KW) SRAM rings; plugin system update requests and plugin control messages flow to and from the xScale. Legend, as extracted: New; Mostly Unchanged; Needs A Lot Of Mod. (Rx, Mux, HF, Copy, Plugins, Tx); Needs Some Mod. (Tx, QM, Parse, Plugin, xScale).]
3
Performance
  • What is our performance target?
  • To hit a 5 Gb/s rate:
  • Minimum Ethernet frame on the wire is 76B (64B frame + 12B inter-frame spacing)
  • 5 Gb/sec × 1B/8b × 1 packet/76B = 8.22 Mpkt/sec
  • IXP ME processing
  • 1.4 GHz clock rate
  • 1.4 Gcycle/sec × 1 sec/8.22 Mpkt = 170.3 cycles per packet (see the sketch after this list)
  • Compute budget (MEs × 170)
  • 1 ME = 170 cycles
  • 2 MEs = 340 cycles
  • 3 MEs = 510 cycles
  • 4 MEs = 680 cycles
  • Latency budget (threads × 170)
  • 1 ME (8 threads) = 1360 cycles
  • 2 MEs (16 threads) = 2720 cycles
  • 3 MEs (24 threads) = 4080 cycles
  • 4 MEs (32 threads) = 5440 cycles
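
As a sanity check on the arithmetic above, here is a minimal sketch in C. The 5 Gb/s, 76B, and 1.4 GHz constants come from this slide; the helper function and its name are illustrative, not from the original code:

    #include <stdio.h>

    /* Per-packet ME cycle budget for a given line rate and wire frame
       size: cycles/pkt = clock_hz / (rate_bps / 8 / frame_bytes). */
    static double cycles_per_packet(double rate_bps, double frame_bytes,
                                    double clock_hz)
    {
        double pkts_per_sec = rate_bps / 8.0 / frame_bytes; /* 8.22 Mpkt/s */
        return clock_hz / pkts_per_sec;                     /* ~170.3      */
    }

    int main(void)
    {
        double budget = cycles_per_packet(5e9, 76.0, 1.4e9);
        printf("%.1f cycles/packet\n", budget);             /* ~170.2      */
        for (int mes = 1; mes <= 4; mes++)
            printf("%d ME(s): compute %d, latency %d cycles\n",
                   mes, mes * 170, mes * 8 * 170);          /* 8 thr/ME    */
        return 0;
    }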

4
QM Performance
  • 1 ME using 7 threads
  • Threads each run once per iteration
  • The Enqueue thread and the Dequeue threads run in parallel
  • Their latencies can overlap
  • The Freelist management thread runs in isolation from the other threads
  • 1 Enqueue thread
  • Processes a batch of 5 packets per iteration
  • 5 Dequeue threads
  • Each processes 1 packet per iteration
  • 1 Freelist management thread
  • Maintains freelist state once per iteration
  • Each iteration can enqueue and dequeue 5 packets
  • Total latency budget for an iteration: 5 × 170 cycles = 850 cycles
  • This is the sum of:
  • The latency of the Freelist management thread
  • The combined latency of the Enqueue thread and the Dequeue threads
  • Compute budget (sketched in code below):
  • (FL_cpu / 5) + DQ_cpu + (ENQ_cpu / 5) < 170 cycles
  • Current (June 2007) BEST CASE estimates (all queues already loaded)
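
A minimal sketch of that budget check in C, assuming FL_cpu and ENQ_cpu are per-iteration cycle counts amortized over the 5-packet batch and DQ_cpu is a per-packet cost; the sample cycle counts are hypothetical, not the June 2007 estimates:

    #include <stdio.h>

    /* Per-packet compute cost of one QM iteration (5 packets), per the
       budget formula on this slide: FL and ENQ each run once per
       iteration, so their costs are spread over 5 packets; each DQ
       thread handles one packet. */
    static double qm_cycles_per_pkt(double fl_cpu, double dq_cpu,
                                    double enq_cpu)
    {
        return fl_cpu / 5.0 + dq_cpu + enq_cpu / 5.0;
    }

    int main(void)
    {
        double c = qm_cycles_per_pkt(150.0, 90.0, 300.0); /* made up    */
        printf("%.1f cycles/pkt (budget: 170)\n", c);     /* 180: over  */
        return 0;
    }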

5
QM Performance Improvements
  • These are simple improvements that might save us tens of cycles each.
  • Change the way we read the scratch ring on input to Enqueue.
  • Currently we do this (each scratch[get] reads the input data for 1 pkt):

      .xfer_order rdata_a
      .xfer_order rdata_b
      .xfer_order rdata_c
      .xfer_order rdata_d
      .xfer_order rdata_e
      scratch[get, rdata_a0, 0, ring, 3], sig_done[sram_sig0]
      scratch[get, rdata_b0, 0, ring, 3], sig_done[sram_sig1]
      scratch[get, rdata_c0, 0, ring, 3], sig_done[sram_sig2]
      scratch[get, rdata_d0, 0, ring, 3], sig_done[sram_sig3]
      scratch[get, rdata_e0, 0, ring, 3], sig_done[sram_sig4]

  • The fifth scratch[get] always causes a stall, since the command FIFO on the ME is only 4 deep.
  • When it stalls, it also causes an abort of that instruction and the following 2 instructions.
  • Total of 15 cycles consumed: 3 by the abort and 12 by the stall.
  • This seems more efficient:

      .xfer_order rdata_a, rdata_b
      .xfer_order rdata_c, rdata_d
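
The transcript truncates the improved version here. Presumably the reads are paired so that at most three scratch[get] commands are issued per batch, which fits within the 4-deep command FIFO. A hypothetical completion, continuing the microcode above (the register names, signal names, and reference count of 6 are guesses, not from the slide):

      ; hypothetical completion: pull 2 pkts (6 words) per command so
      ; only 3 commands are outstanding and the 4-deep cmd FIFO never
      ; overflows
      scratch[get, rdata_a0, 0, ring, 6], sig_done[sram_sig0] ; pkts 1-2
      scratch[get, rdata_c0, 0, ring, 6], sig_done[sram_sig1] ; pkts 3-4
      scratch[get, rdata_e0, 0, ring, 3], sig_done[sram_sig2] ; pkt 5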

6
QM Snapshots
  • Breakpoint set at the start of the maintain_fl() macro in the FL management thread
  • All queues should already be loaded
  • Run for one iteration
  • ENQ processes 5 pkts
  • Each of the 5 DQ threads processes 1 pkt
  • Rx reports 10 packets received and Tx reports 5 packets transmitted

7
QM Snapshots
  • Same setup and results as the previous snapshot (slide 6).

8
200-Byte Eth Frames
  • With 200-byte Ethernet frames and 5 ports sending at full rate
  • Dequeue cannot keep up: after about 1030 packets we start discarding in Enqueue because the queues are full (queue thresholds were set to 0xfff)
  • Port rates were set to 0x1000 (greater than 1 Gb/s); see the rate arithmetic sketched below
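
For context, the slide-3 arithmetic applied to these frame sizes, as a small C sketch (the 12B inter-frame spacing and the 5 × 1 Gb/s aggregate rate are carried over from earlier slides, so the numbers are approximate):

    #include <stdio.h>

    int main(void)
    {
        const double rate_bps = 5e9;    /* 5 ports x 1 Gb/s */
        const double clock_hz = 1.4e9;  /* ME clock rate    */
        const double frames[] = { 200.0, 400.0 };
        for (int i = 0; i < 2; i++) {
            double wire   = frames[i] + 12.0;       /* + IFS        */
            double pps    = rate_bps / 8.0 / wire;  /* packets/sec  */
            double budget = clock_hz / pps;         /* cycles/pkt   */
            printf("%.0fB frames: %.2f Mpkt/s, %.0f cycles/pkt\n",
                   frames[i], pps / 1e6, budget);
        }
        return 0;
    }

This prints roughly 2.95 Mpkt/s (475 cycles/pkt) for 200B frames and 1.52 Mpkt/s (923 cycles/pkt) for 400B frames, well below the 8.22 Mpkt/s minimum-frame rate.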

9
400-Byte Eth Frames
  • With 400-byte Ethernet frames and 5 ports sending at full rate
  • Queues eventually build up.
  • I suspect there is an inherent problem in the way dequeue works that keeps it from being able to keep up.
  • Tx is flow controlling the dequeue engines in this case.
  • This seems to be what is causing the queues to build up.

10
More snapshots (June 13, 2007)
11
More snapshots (June 13, 2007)
12
More snapshots (June 13, 2007)
13
More snapshots (June 13, 2007)
14
More snapshots (June 13-15, 2007)
  • QM Totals
  • Enqueue
  • WORST CASE: Every pkt causes a queue to be evicted by Enqueue and a new one loaded.
  • BEST CASE: Queues are always already loaded; nothing gets evicted. (The sketch below illustrates the difference.)
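
To make the two cases concrete, a minimal C sketch of a direct-mapped queue-descriptor cache, assuming the QM caches descriptors on-chip and falls back to SRAM on a miss (the cache size, descriptor fields, and mapping are illustrative, not the actual QM design):

    #include <stdio.h>

    #define NUM_CACHED 16

    struct qdesc { unsigned qid, head, tail, count; };
    static struct qdesc cache[NUM_CACHED];
    static unsigned sram_ops;   /* counts simulated SRAM accesses */

    static void sram_writeback(const struct qdesc *d) { (void)d; sram_ops++; }
    static void sram_load(struct qdesc *d, unsigned qid)
    { d->qid = qid; d->head = d->tail = d->count = 0; sram_ops++; }

    static struct qdesc *get_queue(unsigned qid)
    {
        struct qdesc *slot = &cache[qid % NUM_CACHED]; /* direct-mapped */
        if (slot->qid != qid) {       /* miss: evict old, load new     */
            sram_writeback(slot);
            sram_load(slot, qid);
        }
        return slot;                  /* hit: no SRAM traffic          */
    }

    int main(void)
    {
        /* BEST CASE: 5 pkts to one queue -> 1 miss, then hits.        */
        for (int i = 0; i < 5; i++) get_queue(7);
        printf("best case:  %u SRAM ops\n", sram_ops);
        sram_ops = 0;
        /* WORST CASE: 5 pkts to queues that collide in the cache.     */
        for (int i = 0; i < 5; i++) get_queue(7 + i * NUM_CACHED);
        printf("worst case: %u SRAM ops\n", sram_ops);
        return 0;
    }

In the best case every packet after the first hits the cached descriptor; in the worst case every packet forces a write-back plus a load, i.e. two SRAM operations per enqueued packet.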