Title: MBUF Problems and solutions on VxWorks
1 MBUF Problems and solutions on VxWorks
- Dave Thompson and a cast of many.
2 MBUF Problems
- This is usually how it lands in my inbox:
- On Tue, 2003-05-06 at 20:38, Kay-Uwe Kasemir wrote:
  > Hi:
  >
  > Neither ics-accl-srv1 nor the CA gateway were able to get to dtl-hprf-ioc3.
  >
  > Via "cu", the IOC looked fine except for error messages
  > (CA_TCP) CAS Client accept error was "S_errno_ENOBUFS"
  > (CA_online) ../online_notify.c CA beacon error was "S_errno_ENOBUFS"
- This has been a problem since before our front end commissioning, even though we are using PowerPC IOCs and a fully switched, full-duplex, 100 Mbps Cisco-based network infrastructure.
- The error is coming from the Channel Access server (see the sketch below).
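ENOBUFS means the network stack could not allocate a buffer (an mbuf/cluster) for the operation. As a hedged illustration of where such a message comes from (this is not the actual CA server source; acceptLoop and listenFd are made up for the example):

  #include <vxWorks.h>
  #include <sockLib.h>
  #include <errnoLib.h>
  #include <errno.h>
  #include <stdio.h>

  /* Illustrative sketch: an accept() loop surfaces S_errno_ENOBUFS
   * when the interface's mbuf/cluster pool is exhausted. */
  void acceptLoop (int listenFd)
      {
      for (;;)
          {
          int clientFd = accept (listenFd, NULL, NULL);
          if (clientFd == ERROR)
              {
              if (errnoGet () == S_errno_ENOBUFS)
                  printf ("CAS: Client accept error was \"S_errno_ENOBUFS\"\n");
              continue;   /* a real server would log and retry */
              }
          /* ... hand clientFd off to a per-client task ... */
          }
      }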
3 Contributing Circumstances
- (According to Jeff Hill)
- The total number of connected clients is high.
- The server's sustained (data) production rate is higher than the clients' sustained consumption rate.
- Clients subscribe for monitor events but do not call ca_pend_event() or ca_poll() to process their CA input queue (see the sketch after this list).
- The server does not get a chance to run.
- The server has multiple stale connections.
- And also, probably:
- tNetTask does not get to run.
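As a hedged illustration of the monitor-subscription point above, here is a minimal CA client sketch (using the standard cadef.h API from EPICS base; the PV name is a placeholder) that services its input queue correctly:

  #include <stdio.h>
  #include <cadef.h>

  /* Placeholder channel name; substitute a real process variable. */
  #define PV_NAME "some:pv:name"

  static void onEvent (struct event_handler_args args)
      {
      if (args.status == ECA_NORMAL)
          printf ("%s = %g\n", ca_name (args.chid), *(const double *) args.dbr);
      }

  int main (void)
      {
      chid chan;
      SEVCHK (ca_context_create (ca_disable_preemptive_callback), "context");
      SEVCHK (ca_create_channel (PV_NAME, NULL, NULL, 0, &chan), "channel");
      SEVCHK (ca_pend_io (5.0), "connect");
      SEVCHK (ca_create_subscription (DBR_DOUBLE, 1, chan, DBE_VALUE,
                                      onEvent, NULL, NULL), "subscribe");
      for (;;)
          ca_pend_event (1.0);   /* drain the CA input queue; ECA_TIMEOUT
                                    on return is the normal case */
      }

A client that subscribes but never reaches the ca_pend_event() loop leaves its data queued on the server side, which is one way the server's buffer pools get exhausted.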
4 Contributing Circumstances
- SNS now has a number of different IOCs:
- 21 VxWorks IOCs
- 21 +/- Windows IOCs
- 1 Linux IOC
- 4 OPIs in the control room and many others on site
- Servers running CA clients like the archiver
- Users remotely logged in running edm via ssh's X tunnel
- CA Gateway
- Other IP clients and services running on vxWorks and servers
- Other IP applications running on IOCs, such as log tasks, etherIP, and serial devices running over IP
5 Our experience to date
- At SNS we have seen all of the contributing circumstances that Jeff mentions.
- At BNL, Larry Hoff saw the problem on an IOC where the network tasks were being starved.
- Many of our IOCs have heavy connection loads.
- There are some CA client and Java CA client applications which need to be checked.
- IOCs get hard reboots to fix problems and thus leave stale connections.
- Other network problems have existed and been fixed, including CA gateway loopback.
6 Late breaking
- Jeff Hill was at ORNL last week.
- One of the things he suspected was that noise on the Ethernet wiring causes the link to re-negotiate speed and full/half duplex operation.
- He confirmed that the combination of the MV2100 and the Cisco switches is prone to frequent auto-negotiation, shutting down Ethernet I/O on the IOC.
- This is not JUST a boot-up problem.
7 What is an mbuf anyway?
VxWorks uses this structure to avoid calls to the heap functions malloc() and free() from within the network driver.
- mBlks are the nodes that make up a linked list of clusters.
- The clusters store the data while it is in the network stack.
- There is a fixed number of clusters of differing sizes.
- Since a given cluster block can exist on more than one list, you need 2X as many mBlks as clusters (see the sketch below).
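A simplified sketch of the two structures (illustrative only; the real definitions live in VxWorks' netBufLib.h and carry more fields):

  /* Illustrative sketch, not the actual netBufLib.h definitions. */
  typedef struct clBlk
      {
      char *pBuf;      /* fixed-size cluster holding the packet data      */
      int   size;      /* one of the pool sizes: 64, 128, ..., 8192 bytes */
      int   refCnt;    /* a cluster may be referenced from more than one  */
                       /* list; this is why you need 2X as many mBlks     */
      } CL_BLK;

  typedef struct mBlk
      {
      struct mBlk *mNext;      /* next node in this packet's chain        */
      struct mBlk *mNextPkt;   /* next packet on the queue                */
      char        *mData;      /* start of valid data inside the cluster  */
      int          mLen;       /* number of valid bytes                   */
      CL_BLK      *pClBlk;     /* cluster block this node references      */
      } M_BLK;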
8 Mbuf and cluster pools
- Each network interface has its own mbuf pool.
- netStackDataPoolShow() (aka mbufShow)
- The system has a separate mbuf/cluster pool used for routing, socket information, and the ARP table.
- netStackSysPoolShow()
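Both routines can be invoked from the target shell to check pool health, for example:

  -> netStackDataPoolShow       /* per-interface data pool (aka mbufShow) */
  -> netStackSysPoolShow        /* system pool: routes, sockets, ARP table */

Sample output from the first call is shown on the next slide.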
9 Output from mbufShow

  number of mbufs: 400
  number of times failed to find space: 0
  number of times waited for space: 0
  number of times drained protocols for space: 0

  size    clusters   free    usage
  ----------------------------------
  64      200        199     1746
  128     400        400     190088
  256     80         80      337
  512     80         80      0
  1024    50         50      1
  2048    50         50      0
  4096    50         50      0
  8192    50         50      0

Callouts on the slide:
- High turnover rate (the 128-byte row, usage 190088)
- Added at SNS
- This one is mis-configured. Why?
10 Our Default Net Pool Sizes

You should add these lines to config.h or maybe configAll.h:

  #define NUM_64        100    /* no. 64 byte clusters */
  #define NUM_128       200
  #define NUM_256       40     /* no. 256 byte clusters */
  #define NUM_512       40     /* no. 512 byte clusters */
  #define NUM_1024      25     /* no. 1024 byte clusters */
  #define NUM_2048      25     /* no. 2048 byte clusters */
  /* NUM_4096 and NUM_8192 must also be defined; their values are
     not shown here. */
  #define NUM_CL_BLKS   (NUM_64 + NUM_128 + NUM_256 + \
                         NUM_512 + NUM_1024 + NUM_2048 + \
                         NUM_4096 + NUM_8192)
  #define NUM_NET_MBLKS (2*(NUM_CL_BLKS))

These will override the definitions in usrNetwork.c.
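As a worked check of the 2X mBlk rule, and assuming NUM_4096 and NUM_8192 are both defined as 50 (the cluster counts visible in the mbufShow output earlier; their #defines are not shown on this slide), the totals work out to:

  NUM_CL_BLKS   = 100 + 200 + 40 + 40 + 25 + 25 + 50 + 50 = 530
  NUM_NET_MBLKS = 2 * 530 = 1060

Since a cluster block can sit on more than one list at a time, the stack needs up to two mBlk nodes for every cluster, hence the factor of two.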
11 What we are doing at SNS
- We are using a kernel addition that provides for setting the network stack pool sizes on the boot line.
- 4X the vxWorks default sizes are working well.
- We see high use rates for the 128-byte clusters, so that allocation is set extra high.
- Use huge numbers only if trying to diagnose a problem such as a resource leak.
- Configuring the network interfaces to disable auto-negotiation of speed and full duplex.
- Code for the kernel addition is available at http://ics-web1.sns.ornl.gov/EPICS-S2003