Title: APIC Stalling problem Notes
1. APIC Stalling Problem Notes
- with additional notes on Interrupts
- John DeHart
- Washington University
- jdd_at_arl.wustl.edu
- http://www.arl.wustl.edu/jdd
2. Issue
- There seems to be a bug in the system
- Right now it only shows up on SPC-II
- In the past, it has shown up on SPC-I
- But these could be similar symptoms of different problems.
- No recollection of it ever showing up on end hosts.
- All these different systems have different timing
- We ran into this problem in preparing for and doing the WU 150th anniversary demo.
- Fred is having this problem in his kernel testing.
- JohnD is having this problem in his final SPC-II performance testing.
3. Issue (continued)
- Symptoms
- Transmit queue stalls for paced connections
- Resuming the connection as BE (then Paced) clears the queue most of the time
- sometimes it then stalls again and eventually we cannot resume it.
- Also stalls for BE connections
- Resuming as BE gets the data flowing again
- actually, resuming ANOTHER channel causes the stalled channel to resume
- This seems to imply a possible global pacer problem?
- When it stalls and the APIC runs out of descriptors, we do get an ERROR interrupt for the out-of-descriptors state.
- This seems to imply that the APIC and ICU are in a state such that they can generate an APIC interrupt to the CPU.
- If the APIC had generated an interrupt that had been lost, the APIC and/or ICU would probably not be in a state that would allow another APIC interrupt to reach the CPU.
- Seems to be traffic rate related
- as the traffic rate approaches the limit of what we can process, the problem is more likely to show itself
- Seems to be SPC related
- some SPCs show the problem more readily than others
4. Tools Assembled
- Monitoring GUI
- PCI Bus analyzer
- setup for it is saved in
- jddlap/C/SPC_II/PCI_Traces/SPC_II_PCI_Setup.stp
- /project/arl/jdd/SPC_II/PCI_Traces/SPC_II_PCI_Setup.stp
- SPCWatch
- using APIC control cells, dumps portions of memory and APIC registers without going through the kernel. Going through the kernel sometimes changes the state of the current problem by resuming a stalled xmit connection. Depending on how much memory is being dumped, this may take a long time (16 bytes of memory per APIC control cell).
- scripts using it are in
- /d/jdd/wu_arl/HARDWARE_TESTS/SPC_TEST_PCI/
- dumpAllMSRDescs: dumps ALL 64K APIC Descriptors
- dumpMSRrxDescs: dumps all 8K Rx descriptors
- dumpMSRtxDescs: dumps all 8K Tx descriptors
- getTxConnAndChanStatusRegs: retrieves the Connection and Channel Status regs for a Tx chan.
- sencmd
- SPC/usr/local/bin/datatest
- SPC/usr/local/bin/readCounts
- Jammer
5. Useful Notes
- APIC Sync Bits
- 0: DONE_VALIDLINK (APIC is done, belongs to the Driver now)
- 1: DONE_INVALIDLINK (should never happen for Tx)
- 2: NOT_READY (Belongs to the Driver)
- 3: READY (Belongs to the APIC)
- Kernel modified to support PCI Bus Analyzer
- Bus Analyzer requires line card to be removed from SPC-II
- SPC-II with no line card does not get a grant for sending data to the line card, so the FPGA Fifos fill up and drop cells. Ick.
- Kernel modified to send external data to switch
- with this we can also monitor the output rate of the external data VCs.
- APIC Descriptor Address Ranges
- Index 0: Addr 0x1d17000 (Invalid descriptor)
- Index 1 - 8192: Addr 0x1d17010 - 0x1d37000 (Rx Descs)
- Index 8193 - 16384: Addr 0x1d37010 - 0x1d57000 (Tx Descs)
- APIC Registers of interest
- 0x518: Interrupt Acknowledge Register
- 0x530: Notification Register
- 0xD500CH08: TX Channel status register
- 0xD500CHF0: TX Channel BE Resume register
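The sync-bit values and the descriptor address ranges above can be captured in a few lines of C, which may help when reading a descriptor dump by hand. This is only a sketch: the 16-byte descriptor size is inferred from the 0x10 step between index 0 and index 1, and the helper names (desc_addr, desc_kind, driver_owns) are hypothetical, not taken from the APIC driver.

```c
#include <assert.h>
#include <stdint.h>

/* Sync-bit values as listed on the slide. */
enum apic_sync {
    DONE_VALIDLINK   = 0, /* APIC is done; belongs to the driver now */
    DONE_INVALIDLINK = 1, /* should never happen for Tx */
    NOT_READY        = 2, /* belongs to the driver */
    READY            = 3  /* belongs to the APIC */
};

/* Values taken from the address-range list on the slide. */
#define DESC_BASE 0x1d17000u /* address of index 0, the invalid descriptor */
#define DESC_SIZE 16u        /* assumption: inferred from the 0x10 address step */
#define RX_FIRST  1u
#define RX_LAST   8192u
#define TX_FIRST  8193u
#define TX_LAST   16384u

/* Hypothetical helper: memory address of descriptor `index`. */
static uint32_t desc_addr(uint32_t index)
{
    assert(index < 65536u); /* the dump scripts mention 64K descriptors in total */
    return DESC_BASE + index * DESC_SIZE;
}

/* Hypothetical helper: 'R' for Rx, 'T' for Tx, 'I' for the invalid
 * descriptor 0, '?' for anything the slide does not classify. */
static char desc_kind(uint32_t index)
{
    if (index == 0) return 'I';
    if (index >= RX_FIRST && index <= RX_LAST) return 'R';
    if (index >= TX_FIRST && index <= TX_LAST) return 'T';
    return '?';
}

/* Hypothetical helper: true for the two states the slide assigns to the driver. */
static int driver_owns(enum apic_sync s)
{
    return s == DONE_VALIDLINK || s == NOT_READY;
}
```

With these definitions, desc_addr(8192) gives 0x1d37000 and desc_addr(16384) gives 0x1d57000, matching the last Rx and last Tx addresses listed.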
6. Questions
- Does the driver handle multiple buffers chained together on receive properly?
- It is possible for the last cell of a packet to get dropped, making the packet look like a long packet spanning multiple buffers.
- Are there any buffer start address concerns?
- old notes on an APIC bug which caused us to align buffers on 48 and 56 byte boundaries
- This is the RX Sync bug (July 1999 Kits slides) which locks up the APIC and needs a reset to get going again. This does not sound like what is happening to us now.
- although, could this be what eventually happens after a few resumes when the SPC locks up?
7. Issue (continued)
- Suspects
- Lost interrupt
- APIC Hardware bug
- interrupt handling
- timing between two instances of the INTR signal being asserted.
- descriptor handling
- pacer
- flow control
- other?
- APIC driver bug
- interrupt handling
- descriptor handling
- other?
- NetBSD Interrupt handling bug
- SPC-II FPGA flow control bug
8. Issue (continued)
- Plan of Attack
- Analyze apic driver code
- compare MSR vs. end host driver code.
- Get details of descriptor chain when it stalls
- dump APIC descriptor chain as it exists in memory
- dump APIC current descriptor chain register for stalled channel
- monitor interrupt counts on SPC-II and compare to packet counts
- vmstat -i
- Note what IRQs are assigned to what at boot time.
- Turn off SPC-II FPGA flow control to APIC
- change VHDL
- rebuild bitfile
- re-program SPC-II FPGA
- retest
9. Additional Issue/Symptom
- We sometimes get into a state where
- We send an MSR command/control cell to a port
- The APIC does not register a cell arrival.
- Neither the OPP transmit cell counter nor the OPP drop cell counter on that port increments.
- Suspect APIC or FPGA flow control issue
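10. SPC II FPGA Architecture
[Figure: SPC-II clock domains. Block diagram of the SPC-II FPGA and its neighbours. Recoverable labels: APIC (PCI Bus Port, Port 0, Port 1, Reset), OSC, Switch, FPX, LC, SPC-II FPGA; blocks numbered 1 to 6; single-letter labels A, B, C, D, E, G, H; path widths 16, 32 and 16/32; VPI/VCI annotations "VPI01", "VPI01 VCI 38", "VPI00" and "64<VCI<127???".]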
11. SPC FPGA Fifos
- FIFO 1: Large Sync Fifo, 512 Words, 36 cells
- FIFO 2: Large Async Fifo, 512 Words, 36 cells
- FIFO 3: Tiny Sync Fifo, 64 Words, 4 cells
- FIFO 4: Tiny Sync Fifo, 64 Words, 4 cells
- FIFO 5: Medium Async Fifo, 128 Words, 9 cells
- FIFO 6: Medium Sync Fifo, 128 Words, 9 cells
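The word and cell counts above are consistent with each cell occupying 14 FIFO words, with any partial cell discarded. The slide does not state that figure; it is an assumption chosen here because it reproduces all three listed capacities. A small sketch, with a hypothetical helper name:

```c
#include <assert.h>

/* Assumption: 14 words per cell. Not stated on the slide; chosen because
 * it reproduces the 36, 9 and 4 cell capacities listed there. */
#define WORDS_PER_CELL 14u

/* Hypothetical helper: number of whole cells that fit in a FIFO. */
static unsigned fifo_cells(unsigned fifo_words)
{
    assert(fifo_words > 0);
    return fifo_words / WORDS_PER_CELL; /* integer division drops the partial cell */
}
```

Under that assumption fifo_cells(512), fifo_cells(128) and fifo_cells(64) give 36, 9 and 4, the capacities listed for the large, medium and tiny FIFOs.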
12. Flow Control Test 1
- Send data from Switch to SPC-II
- transit through APIC from Port 1 to Port 0
- SPC-II is reset, no kernel running
- No data crossing PCI bus
- No descriptors/buffers used
- Overload 16-bit APIC interface
- Send 1.2 Gb/s
- 982 Mb/s goes through APIC
- 220 Mb/s is dropped in OPP CS0 buffer
- Turn data on/off repeatedly
- no stall/hang-up
- when data is turned back on it continues to transit the APIC
13. Flow Control Test 2A (AAL5Generator)
- Send data from Switch to SPC-II
- Load kernel (JDD's BE Debug Kernel) and process packets
- Configure switch and routes so that two input ports (P1, P5) get a copy of the traffic to be routed.
- Configure the routes of the two input ports (P1, P5) so that they route the traffic to Egress port 0
- Overload APIC processing in Kernel on Port 0
- send 60 Mb/s at each input port
- using AAL5Generator: smooth pacing at batch (8) of cells level
- total of 120 Mb/s at output port
- pkt sz 1500 bytes (
- Kernel error messages
- RX CID (65 and 69) out of descriptors
- indicates we are sending more data at the kernel than it can handle
- Bad CRC
- indicates cells are being dropped somewhere
- either APIC or SPC-II FPGA.
- Probably the APIC; if it were the FPGA, it would flow control the switch
- but we may not be sending enough for FC to back up all the way through the OPP buffer.
- But no cells are dropped in OPP
- indicates SPC-II FPGA is not flow controlling the switch
14. Dips in output are due to Kernel printing error msgs
15. Flow Control Test 2B (sendpkts)
- Send data from Switch to SPC-II
- Load kernel (JDD's BE Debug Kernel) and process packets
- Configure switch and routes so that two input ports (P1, P5) get a copy of the traffic to be routed.
- Configure the routes of the two input ports (P1, P5) so that they route the traffic to Egress port 0
- Overload APIC processing in Kernel on Port 0
- send 60 Mb/s at each input port
- using sendpkts: sends batches of packets
- total of 120 Mb/s at output port
- pkt sz 1500 bytes (
- Kernel error messages
- RX CID (65 and 69) out of descriptors
- indicates we are sending more data at the kernel than it can handle
- Bad CRC
- indicates cells are being dropped somewhere
- either APIC or SPC-II FPGA.
- Probably the APIC; if it were the FPGA, it would flow control the switch
- but we may not be sending enough for FC to back up all the way through the OPP buffer.
- But no cells are dropped in OPP
- indicates SPC-II FPGA is not flow controlling the switch
16. Fred's Kernel: sendpkts B 40 p 10 a 20 c -S
17. Fred's Kernel: sendpkts B 80 p 10 a 20 c -S
18. Analysis of previous screen dump
- P0 has stopped sending any pkts out to the link
- P0 has stopped back pressuring the switch
- APIC interrupts still being generated and counted by kernel (vmstat -i)
- APIC still counting cells arriving
- APIC NOT counting cells on PCI bus
- APIC thinks it is getting cells and generating interrupts.
- What does the kernel think in this state?
- channel is suspended and needs resuming...
- when resumed, things start working again.
- This is probably the Ready descriptor error
- So in this state the APIC is out of descriptors and all of its cell buffers are probably full.
- Is it just continually generating ERROR interrupts?
- And discarding every cell it receives (after counting it)?
19. JDD's version of Fred's Kernel: sendpkts c v S a 20 x 8000
BE Resume of P0 channel 80 resumes data output.
20. Analysis of previous screen dump
- After resuming BE twice (each worked), the third time it stalled the kernel had crashed.
- panic: kernel assertion 0 failed: apic.c, line 1045
- This assert is checking that a TX descriptor being allocated from the free list has SYNC bits set to NOT_READY.
- need to repeat the test with proper debug turned on so we can see what descriptor it is and what the sync bits are actually set to.
- Repeated: after 8 successful BE resumes
- Port 0 (APIC/Crit) msr_apic_txdesc_alloc: Desc->MatchFlags != DESC_SYNC_NOT_READY!, offset 15851 sync 0
- panic: kernel assertion 0 failed: file ../../../../dev/ic/apic.c line 1045
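The check that fires here can be pictured as follows. This is a minimal sketch of a free-list allocation that verifies the sync bits before handing out a Tx descriptor; the structure layout and the function name txdesc_alloc are hypothetical and are not the code in apic.c. Instead of panicking, the sketch reports the mismatch so the caller can log the descriptor and its sync value, which is the extra debug information the notes ask for.

```c
#include <assert.h>
#include <stddef.h>

#define SYNC_DONE_VALIDLINK 0u /* APIC is done with it */
#define SYNC_NOT_READY      2u /* belongs to the driver */
#define SYNC_READY          3u /* belongs to the APIC */

/* Hypothetical descriptor: only the fields this sketch needs. */
struct txdesc {
    unsigned sync;       /* APIC sync bits */
    struct txdesc *next; /* free-list link */
};

/* Hypothetical allocator: pop the head of the free list.
 * Returns NULL if the list is empty. *ok is set to 0 when the popped
 * descriptor is not NOT_READY, i.e. the condition the assertion guards. */
static struct txdesc *txdesc_alloc(struct txdesc **freelist, int *ok)
{
    struct txdesc *d;

    assert(freelist != NULL && ok != NULL);
    *ok = 1;
    d = *freelist;
    if (d == NULL)
        return NULL;
    *freelist = d->next;
    d->next = NULL;
    if (d->sync != SYNC_NOT_READY)
        *ok = 0; /* e.g. sync 0 (DONE_VALIDLINK), as in the message above */
    return d;
}
```

A descriptor that reaches the free list with sync 0 would trip this check, which is the pattern the "sync 0" in the message suggests.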
21. APIC errors detected
- APIC errors that occurred during different runs.
- ------------------------------------------------------
- Port -1 (Ctl/Info) msr_process_ctlcell: cmd 0x1, ver 0, seq 0, len 4, flags 0x9
- Port 0 (APIC/Error) apic_intr: Unexpected RX Error on CID 65, chanstatus 0x07
- apic0: Descriptor Error: Match incorrect (not 0xcafe) 0x07
- ------------------------------------------------------
- Port 0 (APIC/Crit) msr_free_txdescs: Invalid tx desc index (current 14250 or next 128)
- panic: kernel assertion "((((txindx) >= ((0x00000001 + 8192 - 1) + 1)) && ((txindx) <= (((0x00000001 + 8192 - 1) + 1) + 8192 - 1))) && (((nextindx) >= ((0x00000001 + 8192 - 1) + 1)) && ((nextindx) <= (((0x00000001 + 8192 - 1) + 1) + 8192 - 1))))" failed: file "../../../../dev/ic/apic.c", line 2332
- Stopped at 0xf018ff8c: leave
- db>
- ------------------------------------------------------
22. State when we set debug and get stats causes the xmit channel to come alive again!
23. Fred's Kernel: sendpkts B 80 p 10 a 20 c S
sendpkts has stopped sending data???
24. Fred's Kernel: sendpkts c v S a 20 x 8000
25. Flow Control Test 3
- Send data from Switch to SPC-II
- Load kernel and process packets
- Configure classifier and data pkts so they are dropped
- i.e. no route for destination address.
- Overload APIC processing in Kernel
- Turn data on/off repeatedly
26. SPC-I System FPGA
- Supported
- Four interrupts supported and statically assigned
- PIT (IRQ 0)
- APIC (IRQ 5)
- COM1 (IRQ 4)
- COM2 (IRQ 3)
- Static fully-nested interrupt priority structure.
- Specific End of Interrupt is the only EOI mode supported
- Not Supported
- Special Mask Mode
- Automatic End of Interrupt (AUTO_EOI_1, AUTO_EOI_2)
- Special Fully Nested Mode
27. SPC-II Interrupts
- Supported by a real Southbridge/ICU
- FPGA provides flow control
- but with the traffic patterns and rates we are using there should be no flow control asserted.
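28. Hardware Interrupt Structure (Ignoring Bus)
[Figure: block diagram of three stages: CPU, ICU and APIC. INTR and ACK signals are drawn between the APIC and the ICU and between the ICU and the CPU; MASK/UNMASK is drawn between the CPU and the ICU.]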
29. Overview of what happens
- APIC generates INTR to ICU
- APIC will not generate another INTR until ACKed
- ICU pushes INTR (IRQ) onto Bus
- ICU will only send higher priority interrupts
- CPU gets INTR
- MASK IRQ in ICU
- ICU will not send this IRQ again
- ACK IRQ in ICU
- Allows lower priority interrupts from ICU
- Check priority and hold if lower than current
- Call APIC intr handler
- ACK Intr in APIC
- APIC can generate another INTR to ICU
- Intr processing
- process all packets that have been received
- put packets being forwarded on transmit queue and resume transmit queue if needed
- Return
- UNMASK IRQ in ICU
- ICU can send us this IRQ again
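The sequence above can be modelled as a small state machine, which makes it easier to reason about where an interrupt could be held or lost. This is a sketch under simplifying assumptions: a single IRQ, no priorities, and hypothetical names throughout (intr_model, apic_raise, icu_deliver, cpu_entry, handler_ack_apic, cpu_exit). It models only the steps listed on the slide, not the real hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state for one IRQ line. */
struct intr_model {
    bool apic_intr;      /* APIC has raised INTR and has not been ACKed yet */
    bool icu_masked;     /* this IRQ is masked in the ICU */
    bool icu_in_service; /* ICU delivered the IRQ and has not seen its ACK */
};

/* APIC tries to raise INTR; refused while an earlier INTR is un-ACKed. */
static bool apic_raise(struct intr_model *m)
{
    if (m->apic_intr)
        return false;
    m->apic_intr = true;
    return true;
}

/* ICU forwards the IRQ to the CPU if nothing blocks it. */
static bool icu_deliver(struct intr_model *m)
{
    if (!m->apic_intr || m->icu_masked || m->icu_in_service)
        return false;
    m->icu_in_service = true;
    return true;
}

/* CPU entry path: MASK the IRQ in the ICU, then ACK it to the ICU. */
static void cpu_entry(struct intr_model *m)
{
    assert(m->icu_in_service);
    m->icu_masked = true;
    m->icu_in_service = false;
}

/* APIC handler: ACK the interrupt in the APIC itself. */
static void handler_ack_apic(struct intr_model *m)
{
    m->apic_intr = false;
}

/* CPU exit path: UNMASK the IRQ in the ICU. */
static void cpu_exit(struct intr_model *m)
{
    m->icu_masked = false;
}
```

In this model an INTR raised after the handler has ACKed the APIC but before the UNMASK is held at the ICU and delivered once cpu_exit runs. Whether the real APIC and ICU behave that way in that window is one of the open questions in these notes.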
30. sys/arch/i386/isa/vector.s
- #include "opt_ddb.h"
- #include <i386/isa/icu.h>
- #include <dev/isa/isareg.h>
- #define ICU_HARDWARE_MASK
- #define IRQ_BIT(irq_num) (1 << ((irq_num) % 8))
- #define IRQ_BYTE(irq_num) ((irq_num) / 8)
- #ifdef ICU_SPECIAL_MASK_MODE // SPC System FPGA does not support SMM
- #define ACK1(irq_num)
- #define ACK2(irq_num) \
- movb $(0x60|IRQ_SLAVE),%al /* specific EOI for IRQ2 */ ;\
- outb %al,$IO_ICU1
- #define MASK(irq_num, icu)
- #define UNMASK(irq_num, icu) \
- movb $(0x60|(irq_num%8)),%al /* specific EOI */ ;\
- outb %al,$icu
31. sys/arch/i386/isa/vector.s
- #else /* I.E. NOT ICU_SPECIAL_MASK_MODE */
- #ifndef AUTO_EOI_1
- #define ACK1(irq_num) \
- movb $(0x60|(irq_num%8)),%al /* specific EOI */ ;\
- outb %al,$IO_ICU1
- #else
- #define ACK1(irq_num)
- #endif
- #ifndef AUTO_EOI_2
- #define ACK2(irq_num) \
- movb $(0x60|(irq_num%8)),%al /* specific EOI */ ;\
- outb %al,$IO_ICU2 /* do the second ICU first */ ;\
- movb $(0x60|IRQ_SLAVE),%al /* specific EOI for IRQ2 */ ;\
- outb %al,$IO_ICU1
- #else
- #define ACK2(irq_num)
- #endif
32. sys/arch/i386/isa/vector.s
- #ifdef ICU_HARDWARE_MASK
- #define MASK(irq_num, icu) \
- movb _C_LABEL(imen) + IRQ_BYTE(irq_num),%al /* imen: interrupt mask enable (2 bytes) */
- orb $IRQ_BIT(irq_num),%al /* mask our irq (put a 1 in its place) */
- movb %al,_C_LABEL(imen) + IRQ_BYTE(irq_num)
- FASTER_NOP
- outb %al,$(icu+1) /* write it to the ICU */
- #define UNMASK(irq_num, icu)
- cli
- movb _C_LABEL(imen) + IRQ_BYTE(irq_num),%al
- andb $~IRQ_BIT(irq_num),%al
- movb %al,_C_LABEL(imen) + IRQ_BYTE(irq_num)
- FASTER_NOP
- outb %al,$(icu+1)
- sti
- #else /* ICU_HARDWARE_MASK */
- #define MASK(irq_num, icu)
- #define UNMASK(irq_num, icu)
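For readers less used to the assembly, the hardware MASK and UNMASK macros amount to a read-modify-write of the software mask copy imen followed by a write of that byte to the ICU mask port. The C sketch below shows the same bit manipulation. It assumes IRQ_BIT uses a modulo-8 shift and IRQ_BYTE a divide by 8; the array icu_mask_port is a hypothetical stand-in for the outb to port icu+1, and the all-masked initial value of imen is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define IRQ_BIT(irq_num)  (1 << ((irq_num) % 8))
#define IRQ_BYTE(irq_num) ((irq_num) / 8)

/* Software copy of the two ICU mask registers; a 1 bit means "masked".
 * Assumption for the sketch: everything starts masked. */
static uint8_t imen[2] = { 0xff, 0xff };

/* Hypothetical stand-in for the outb to the mask port of ICU1 / ICU2. */
static uint8_t icu_mask_port[2];

/* MASK: put a 1 in this IRQ's place and write the byte to the ICU. */
static void mask_irq(int irq)
{
    assert(irq >= 0 && irq < 16);
    imen[IRQ_BYTE(irq)] |= (uint8_t)IRQ_BIT(irq);
    icu_mask_port[IRQ_BYTE(irq)] = imen[IRQ_BYTE(irq)];
}

/* UNMASK: clear this IRQ's bit and write the byte to the ICU.
 * The assembly brackets this sequence with cli/sti. */
static void unmask_irq(int irq)
{
    assert(irq >= 0 && irq < 16);
    imen[IRQ_BYTE(irq)] &= (uint8_t)~IRQ_BIT(irq);
    icu_mask_port[IRQ_BYTE(irq)] = imen[IRQ_BYTE(irq)];
}
```

For the APIC on IRQ 5, unmask_irq(5) clears bit 5 of the first mask byte (0xff becomes 0xdf) and mask_irq(5) sets it again.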
33. sys/arch/i386/isa/vector.s
- #ifdef __ELF__
- #define XINTR(irq_num) Xintr/**/irq_num
- #define XHOLD(irq_num) Xhold/**/irq_num
- #define XSTRAY(irq_num) Xstray/**/irq_num
- #else
- #define XINTR(irq_num) _Xintr/**/irq_num
- #define XHOLD(irq_num) _Xhold/**/irq_num
- #define XSTRAY(irq_num) _Xstray/**/irq_num
- #endif
34. sys/arch/i386/isa/vector.s
- /* Beginning of INTR Macro */
- #define INTR(irq_num, icu, ack)
- IDTVEC(resume/**/irq_num)
- cli
- jmp 1f
- IDTVEC(recurse/**/irq_num)
- pushfl
- pushl %cs
- pushl %esi
- cli
Block the CPU from accepting any more interrupts.
35. sys/arch/i386/isa/vector.s
- XINTR(irq_num):
- pushl $0 /* dummy error code */
- pushl $T_ASTFLT /* trap # for doing ASTs */
- INTRENTRY
- MAKE_FRAME
- MASK(irq_num, icu) /* mask it in hardware */
- ack(irq_num) /* and allow other intrs */
- incl MY_COUNT+V_INTR /* statistical info */
ICU will not send us any more of this IRQ.
ACK this IRQ to the ICU. Allows it to generate other interrupts. Without this the ICU would only generate higher priority interrupts.
When an interrupt occurs the CPU will clear the interrupt enable bit (equivalent of cli). An iret restores the bit.
36. sys/arch/i386/isa/vector.s
- testb $IRQ_BIT(irq_num),_C_LABEL(cpl) + IRQ_BYTE(irq_num)
- jnz XHOLD(irq_num) /* currently masked; hold it */
- 1: movl _C_LABEL(cpl),%eax /* cpl to restore on exit */
- pushl %eax
- orl _C_LABEL(intrmask) + (irq_num) * 4,%eax
- movl %eax,_C_LABEL(cpl) /* add in this intr's mask */
- sti /* safe to take intrs now */
Pre-computed masks for each IRQ:
IRQ 0: 0xe0000021
IRQ 3: 0xe0000039
IRQ 4: 0xe0000039
IRQ 5: 0xc0000020
[Figure: bit diagram of a mask; bits 5 4 3 2 1 0 are labelled by irq]
Add IRQ bit to ipending
In Kernel interrupt mask
Allow CPU to accept more interrupts.
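The hold-or-run decision in this part of the macro can be written out in C using the pre-computed masks quoted above. This is a sketch only: intr_enter is a hypothetical name, cpl and ipending are modelled as plain 32-bit variables, and the intrmask entries for IRQs not listed on the slide are left at zero, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NIRQ 16

/* Pre-computed masks quoted on the slide; unlisted IRQs are left 0 here. */
static const uint32_t intrmask[NIRQ] = {
    [0] = 0xe0000021u, [3] = 0xe0000039u,
    [4] = 0xe0000039u, [5] = 0xc0000020u,
};

static uint32_t cpl;      /* current priority level: set bits are blocked */
static uint32_t ipending; /* interrupts held while blocked */

/* Hypothetical C rendering of the entry path. Returns 1 if the handler
 * may run now (cpl is raised and the old level stored in *saved), or 0
 * if the IRQ is currently blocked and was recorded in ipending. */
static int intr_enter(int irq, uint32_t *saved)
{
    assert(irq >= 0 && irq < NIRQ && saved != NULL);
    if (cpl & (1u << irq)) {   /* testb ...cpl; jnz XHOLD */
        ipending |= 1u << irq; /* what the hold path does */
        return 0;
    }
    *saved = cpl;              /* cpl to restore on exit */
    cpl |= intrmask[irq];      /* add in this intr's mask */
    return 1;
}
```

With these masks, an APIC interrupt (IRQ 5) that arrives while the IRQ 5 handler is running is held in ipending, while the clock (IRQ 0) can still interrupt it because bit 0 is not set in 0xc0000020.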
37. sys/arch/i386/isa/vector.s
- movl _C_LABEL(intrhand) + (irq_num) * 4,%ebx /* head of chain */
- testl %ebx,%ebx
- jz XSTRAY(irq_num) /* no handlers; we're stray */
- STRAY_INITIALIZE /* nobody claimed it yet */
- incl _C_LABEL(intrcnt) + (4*(irq_num)) /* XXX */
38. sys/arch/i386/isa/vector.s
- 7: movl IH_ARG(%ebx),%eax /* get handler arg */
- testl %eax,%eax
- jnz 4f
- movl %esp,%eax /* 0 means frame pointer */
- 4: pushl %eax
- call IH_FUN(%ebx) /* call it */
- addl $4,%esp /* toss the arg */
- STRAY_INTEGRATE /* maybe he claimed it */
- incl IH_COUNT(%ebx) /* count the intrs */
- movl IH_NEXT(%ebx),%ebx /* next handler in chain */
- testl %ebx,%ebx
- jnz 7b
- STRAY_TEST /* see if it's a stray */
- 5: UNMASK(irq_num, icu) /* unmask it in hardware */
- jmp _C_LABEL(Xdoreti) /* lower spl and do ASTs */
Call NetBSD Interrupt Handler
Locate a handler for this IRQ
ICU is now able to send us another interrupt for this IRQ
Return from Interrupt: resume other interrupts, check for pending interrupts, restore stack, iret
39. sys/arch/i386/isa/vector.s
- IDTVEC(stray/**/irq_num)
- pushl $irq_num
- call _C_LABEL(isa_strayintr)
- addl $4,%esp
- incl _C_LABEL(strayintrcnt) + (4*(irq_num))
- jmp 5b
- IDTVEC(hold/**/irq_num) // XHOLD()
- orb $IRQ_BIT(irq_num),_C_LABEL(ipending) + IRQ_BYTE(irq_num)
- INTRFASTEXIT
- /* End of INTR Macro */
40. sys/arch/i386/isa/vector.s
- INTR(0, IO_ICU1, ACK1) /* Clock interrupt */
- INTR(1, IO_ICU1, ACK1)
- INTR(2, IO_ICU1, ACK1)
- INTR(3, IO_ICU1, ACK1) /* COM 2 Interrupt */
- INTR(4, IO_ICU1, ACK1) /* COM 1 Interrupt */
- INTR(5, IO_ICU1, ACK1) /* APIC Interrupt */
- INTR(6, IO_ICU1, ACK1)
- INTR(7, IO_ICU1, ACK1)
- INTR(8, IO_ICU2, ACK2)
- INTR(9, IO_ICU2, ACK2)
- INTR(10, IO_ICU2, ACK2)
- INTR(11, IO_ICU2, ACK2)
- INTR(12, IO_ICU2, ACK2)
- INTR(13, IO_ICU2, ACK2)
- INTR(14, IO_ICU2, ACK2)
- INTR(15, IO_ICU2, ACK2)
41. sys/arch/i386/isa/vector.s
- /* Add a mask to cpl, and return the old value of cpl. */
- static __inline int
- splraise(ncpl)
- register int ncpl;
- {
- register int ocpl = cpl;
- cpl = ocpl | ncpl;
- return (ocpl);
- }
- /* Restore a value to cpl (unmasking interrupts).
- If any unmasked interrupts are pending,
- call Xspllower() to process them. */
- static __inline void
- splx(ncpl)
- register int ncpl;
- {
- cpl = ncpl;
- if (ipending & ~ncpl)
- Xspllower();
- }
/* Same as splx(), but we return the old value of spl, for the benefit of some splsoftclock() callers. */
static __inline int
spllower(ncpl)
register int ncpl;
{
register int ocpl = cpl;
cpl = ncpl;
if (ipending & ~ncpl)
Xspllower();
return (ocpl);
}
Call Xspllower if there is something pending that is higher priority than our new cpl
42. sys/arch/i386/isa/icu.s spllower()
- IDTVEC(spllower) // Xspllower()
- pushl %ebx
- pushl %esi
- pushl %edi
- movl _C_LABEL(cpl),%ebx # save priority
- movl $1f,%esi # address to resume loop at
- 1: movl %ebx,%eax
- notl %eax
- andl _C_LABEL(ipending),%eax
- jz 2f
- bsfl %eax,%eax
- btrl %eax,_C_LABEL(ipending)
- jnc 1b
- jmp *_C_LABEL(Xrecurse)(,%eax,4)
- 2: popl %edi
- popl %esi
- popl %ebx
- ret
Is there a pending interrupt that is high enough priority?
If yes, then restart it?
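The replay loop in Xspllower can be paraphrased in C as follows. This is a simplified sketch, not the kernel code: recurse() is a hypothetical stand-in for the jump through the Xrecurse table and only logs the IRQ number, the loop re-reads cpl on each pass instead of keeping the saved value in a register, and splx_sketch is a hypothetical name for the splx logic.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t cpl;      /* blocked interrupt bits */
static uint32_t ipending; /* held interrupt bits */
static int replayed[32];  /* hypothetical log of replayed IRQs */
static int nreplayed;

/* Hypothetical stand-in for jumping to the Xrecurse entry of `irq`. */
static void recurse(int irq)
{
    assert(nreplayed < 32);
    replayed[nreplayed++] = irq;
}

/* Replay every pending interrupt that cpl no longer blocks, lowest
 * bit number first (as bsfl does), clearing each pending bit (btrl). */
static void spllower_replay(void)
{
    uint32_t runnable;

    while ((runnable = ipending & ~cpl) != 0) {
        int irq = 0;
        while (((runnable >> irq) & 1u) == 0) /* find lowest set bit */
            irq++;
        ipending &= ~(1u << irq);
        recurse(irq);
    }
}

/* splx as described: install the new level, then replay what it unblocks. */
static void splx_sketch(uint32_t ncpl)
{
    cpl = ncpl;
    if (ipending & ~ncpl)
        spllower_replay();
}
```

For example, with IRQ 3 and IRQ 5 both held, lowering cpl to a level that still blocks IRQ 5 replays only IRQ 3; IRQ 5 stays in ipending until a later splx unblocks it.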