QM Performance Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
QM Performance Analysis
John DeHart
2
ONL NP Router
[Block diagram of the ONL NP Router. Pipeline blocks: Rx (2 MEs), Mux (1 ME), Parse/Lookup/Copy (3 MEs, with TCAM), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), plus FreeList Mgr (1 ME), Stats (1 ME), five plugins (Plugin0-Plugin4), and the xScale core (3 rings?, with associated-data ZBT-SRAM and exception/error paths). Blocks are connected by NN rings, scratch rings, and small (512W) and large (64KW) SRAM rings; plugin system update requests and plugin control messages flow to and from the xScale. Legend, as extracted: New; Mostly Unchanged; Needs A Lot Of Mod. (Rx, Mux, HF, Copy, Plugins, Tx); Needs Some Mod. (Tx, QM, Parse, Plugin, xScale).]
3
Performance
  • What is our performance target?
  • To hit a 5 Gb/s rate:
  • Minimum Ethernet frame on the wire is 76B (64B frame + 12B inter-frame spacing)
  • 5 Gb/sec × 1B/8b × 1 packet/76B = 8.22 Mpkt/sec
  • IXP ME processing
  • 1.4 GHz clock rate
  • 1.4 Gcycle/sec × 1 sec/8.22 Mpkt = 170.3 cycles per packet (see the sketch after this list)
  • Compute budget (MEs × 170)
  • 1 ME = 170 cycles
  • 2 MEs = 340 cycles
  • 3 MEs = 510 cycles
  • 4 MEs = 680 cycles
  • Latency budget (threads × 170)
  • 1 ME (8 threads) = 1360 cycles
  • 2 MEs (16 threads) = 2720 cycles
  • 3 MEs (24 threads) = 4080 cycles
  • 4 MEs (32 threads) = 5440 cycles
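
As a sanity check on the arithmetic above, here is a minimal sketch in C. The 5 Gb/s, 76B, and 1.4 GHz constants come from this slide; the helper function and its name are illustrative, not from the original code:

    #include <stdio.h>

    /* Per-packet ME cycle budget for a given line rate and wire frame
       size: cycles/pkt = clock_hz / (rate_bps / 8 / frame_bytes). */
    static double cycles_per_packet(double rate_bps, double frame_bytes,
                                    double clock_hz)
    {
        double pkts_per_sec = rate_bps / 8.0 / frame_bytes; /* 8.22 Mpkt/s */
        return clock_hz / pkts_per_sec;                     /* ~170.3      */
    }

    int main(void)
    {
        double budget = cycles_per_packet(5e9, 76.0, 1.4e9);
        printf("%.1f cycles/packet\n", budget);             /* ~170.2      */
        for (int mes = 1; mes <= 4; mes++)
            printf("%d ME(s): compute %d, latency %d cycles\n",
                   mes, mes * 170, mes * 8 * 170);          /* 8 thr/ME    */
        return 0;
    }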

4
QM Performance
  • 1 ME using 7 threads
  • Threads each run once per iteration
  • The Enqueue thread and the Dequeue threads run in parallel
  • Their latencies can overlap
  • The Freelist management thread runs in isolation from the other threads
  • 1 Enqueue thread
  • Processes a batch of 5 packets per iteration
  • 5 Dequeue threads
  • Each processes 1 packet per iteration
  • 1 Freelist management thread
  • Maintains freelist state once per iteration
  • Each iteration can enqueue and dequeue 5 packets
  • Total latency budget for an iteration: 5 × 170 cycles = 850 cycles
  • This is the sum of:
  • The latency of the Freelist management thread
  • The combined latency of the Enqueue thread and the Dequeue threads
  • Compute budget (sketched in code below):
  • (FL_cpu / 5) + DQ_cpu + (ENQ_cpu / 5) < 170 cycles
  • Current (June 2007) BEST CASE estimates (all queues already loaded)
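
A minimal sketch of that budget check in C, assuming FL_cpu and ENQ_cpu are per-iteration cycle counts amortized over the 5-packet batch and DQ_cpu is a per-packet cost; the sample cycle counts are hypothetical, not the June 2007 estimates:

    #include <stdio.h>

    /* Per-packet compute cost of one QM iteration (5 packets), per the
       budget formula on this slide: FL and ENQ each run once per
       iteration, so their costs are spread over 5 packets; each DQ
       thread handles one packet. */
    static double qm_cycles_per_pkt(double fl_cpu, double dq_cpu,
                                    double enq_cpu)
    {
        return fl_cpu / 5.0 + dq_cpu + enq_cpu / 5.0;
    }

    int main(void)
    {
        double c = qm_cycles_per_pkt(150.0, 90.0, 300.0); /* made up    */
        printf("%.1f cycles/pkt (budget: 170)\n", c);     /* 180: over  */
        return 0;
    }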

5
QM Performance Improvements
  • These are simple improvements that might save us tens of cycles each.
  • Change the way we read the scratch ring on input to Enqueue.
  • Currently we do this (each scratch[get] reads the input data for 1 pkt):

      .xfer_order rdata_a
      .xfer_order rdata_b
      .xfer_order rdata_c
      .xfer_order rdata_d
      .xfer_order rdata_e
      scratch[get, rdata_a0, 0, ring, 3], sig_done[sram_sig0]
      scratch[get, rdata_b0, 0, ring, 3], sig_done[sram_sig1]
      scratch[get, rdata_c0, 0, ring, 3], sig_done[sram_sig2]
      scratch[get, rdata_d0, 0, ring, 3], sig_done[sram_sig3]
      scratch[get, rdata_e0, 0, ring, 3], sig_done[sram_sig4]

  • The fifth scratch[get] always causes a stall, since the command FIFO on the ME is only 4 deep.
  • When it stalls, it also causes an abort of that instruction and the following 2 instructions.
  • Total of 15 cycles consumed: 3 by the abort and 12 by the stall.
  • This seems more efficient:

      .xfer_order rdata_a, rdata_b
      .xfer_order rdata_c, rdata_d
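
The transcript truncates the improved version here. Presumably the reads are paired so that at most three scratch[get] commands are issued per batch, which fits within the 4-deep command FIFO. A hypothetical completion, continuing the microcode above (the register names, signal names, and reference count of 6 are guesses, not from the slide):

      ; hypothetical completion: pull 2 pkts (6 words) per command so
      ; only 3 commands are outstanding and the 4-deep cmd FIFO never
      ; overflows
      scratch[get, rdata_a0, 0, ring, 6], sig_done[sram_sig0] ; pkts 1-2
      scratch[get, rdata_c0, 0, ring, 6], sig_done[sram_sig1] ; pkts 3-4
      scratch[get, rdata_e0, 0, ring, 3], sig_done[sram_sig2] ; pkt 5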

6
QM Snapshots
  • Breakpoint set at the start of the maintain_fl() macro in the FL management thread
  • All queues should already be loaded
  • Run for one iteration
  • ENQ processes 5 pkts
  • Each of the 5 DQ threads processes 1 pkt
  • Rx reports 10 packets received and Tx reports 5 packets transmitted

7
QM Snapshots
  • Same setup and results as the previous snapshot (slide 6).

8
200-Byte Eth Frames
  • With 200-byte Ethernet frames and 5 ports sending at full rate
  • Dequeue cannot keep up: after about 1030 packets we start discarding in Enqueue because the queues are full (queue thresholds were set to 0xfff)
  • Port rates were set to 0x1000 (greater than 1 Gb/s); see the rate arithmetic sketched below
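
For context, the slide-3 arithmetic applied to these frame sizes, as a small C sketch (the 12B inter-frame spacing and the 5 × 1 Gb/s aggregate rate are carried over from earlier slides, so the numbers are approximate):

    #include <stdio.h>

    int main(void)
    {
        const double rate_bps = 5e9;    /* 5 ports x 1 Gb/s */
        const double clock_hz = 1.4e9;  /* ME clock rate    */
        const double frames[] = { 200.0, 400.0 };
        for (int i = 0; i < 2; i++) {
            double wire   = frames[i] + 12.0;       /* + IFS        */
            double pps    = rate_bps / 8.0 / wire;  /* packets/sec  */
            double budget = clock_hz / pps;         /* cycles/pkt   */
            printf("%.0fB frames: %.2f Mpkt/s, %.0f cycles/pkt\n",
                   frames[i], pps / 1e6, budget);
        }
        return 0;
    }

This prints roughly 2.95 Mpkt/s (475 cycles/pkt) for 200B frames and 1.52 Mpkt/s (923 cycles/pkt) for 400B frames, well below the 8.22 Mpkt/s minimum-frame rate.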

9
400-Byte Eth Frames
  • With 400-byte Ethernet frames and 5 ports sending at full rate
  • Queues eventually build up.
  • I suspect there is an inherent problem in the way dequeue works that keeps it from being able to keep up.
  • Tx is flow controlling the dequeue engines in this case.
  • This seems to be what is causing the queues to build up.

10
More snapshots (June 13, 2007)
11
More snapshots (June 13, 2007)
12
More snapshots (June 13, 2007)
13
More snapshots (June 13, 2007)
14
More snapshots (June 13-15, 2007)
  • QM Totals
  • Enqueue
  • WORST CASE: Every pkt causes a queue to be evicted by Enqueue and a new one loaded.
  • BEST CASE: Queues are always already loaded; nothing gets evicted. (The sketch below illustrates the difference.)
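
To make the two cases concrete, a minimal C sketch of a direct-mapped queue-descriptor cache, assuming the QM caches descriptors on-chip and falls back to SRAM on a miss (the cache size, descriptor fields, and mapping are illustrative, not the actual QM design):

    #include <stdio.h>

    #define NUM_CACHED 16

    struct qdesc { unsigned qid, head, tail, count; };
    static struct qdesc cache[NUM_CACHED];
    static unsigned sram_ops;   /* counts simulated SRAM accesses */

    static void sram_writeback(const struct qdesc *d) { (void)d; sram_ops++; }
    static void sram_load(struct qdesc *d, unsigned qid)
    { d->qid = qid; d->head = d->tail = d->count = 0; sram_ops++; }

    static struct qdesc *get_queue(unsigned qid)
    {
        struct qdesc *slot = &cache[qid % NUM_CACHED]; /* direct-mapped */
        if (slot->qid != qid) {       /* miss: evict old, load new     */
            sram_writeback(slot);
            sram_load(slot, qid);
        }
        return slot;                  /* hit: no SRAM traffic          */
    }

    int main(void)
    {
        /* BEST CASE: 5 pkts to one queue -> 1 miss, then hits.        */
        for (int i = 0; i < 5; i++) get_queue(7);
        printf("best case:  %u SRAM ops\n", sram_ops);
        sram_ops = 0;
        /* WORST CASE: 5 pkts to queues that collide in the cache.     */
        for (int i = 0; i < 5; i++) get_queue(7 + i * NUM_CACHED);
        printf("worst case: %u SRAM ops\n", sram_ops);
        return 0;
    }

In the best case every packet after the first hits the cached descriptor; in the worst case every packet forces a write-back plus a load, i.e. two SRAM operations per enqueued packet.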