Title: Network Monitoring in the BaBar Experiment
1. Network Monitoring in the BaBar Experiment
S. Luitz, D. Millsom, D. Salomoni
2. Summary
- The BaBar Data Acquisition Network
- A Typical Scenario...
- Traffic Monitoring and Recording
- Traffic Dump Analysis Tools
- Real-Time Analysis of Traffic
- Conclusions and Outlook
3. The BaBar Data Acquisition Network (1)
- ca. 200 VME single-board computers (VxWorks), 100 Mbit/s full-duplex Ethernet
- 78 Sun Ultra 5 "farm" workstations for Level-3 trigger and fast monitoring, 2 x 100 Mbit/s full-duplex Ethernet each ("dual homed")
- 5 Sun Ultra 60 application servers (e.g. run control), 100 Mbit/s full-duplex Ethernet
- 15 Sun Ultra 5 display console machines, 10 or 100 Mbit/s Ethernet
4. The BaBar Data Acquisition Network (2)
- 1 Sun E 450 (4 CPU, 780 Gbyte RAID) central boot/NFS/database/data buffer server, 2 x 1 Gbit/s Ethernet
- various development and user workstations
- 3 Cisco Cat 5500 switches
- 2 VLANs / IP subnets
  - dedicated real-time DAQ network (35-40 MByte/s)
  - general purpose / data transfer network
6. A Typical Scenario
- Problem
  - Shift crew reports "Run control server problem ca. 45 min ago at 23:50"
  - A look at the system logs shows NFS timeouts at 23:08 but no network-related events (like spanning tree reconfigurations)
  - Central network monitoring shows "normal" traffic
- What was going on? Did someone/something overload the NFS server? Database access? ...?
- Server-based performance monitoring is very poor!
- Wouldn't it be nice to be able to have a close look at the network traffic around 23:05?
7. Traffic Monitoring and Recording (1)
- We can! Even with free software tools!
- Configure switch to forward all traffic in the BaBar general-purpose VLAN/subnet to a monitoring port (SPAN)
- Standard protocol analyzers no good: small buffers, what to trigger on?
- Sun E 250 with 72 Gbyte disk and Gigabit Ethernet as traffic recorder and protocol analyzer
- Record packet headers into "circular" disk buffer
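One way to picture the "circular" disk buffer is a directory of dump files from which the oldest files are deleted once a size budget is exceeded. The sketch below is only an illustration under that assumption; the slides do not say how the buffer was actually implemented, and the function name and file layout are hypothetical.

```python
import os

def prune_oldest(directory, max_bytes):
    """Delete the oldest files in `directory` until the total size fits in `max_bytes`.

    Returns the names of the deleted files, oldest first.
    """
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            entries.append((os.path.getmtime(path), name, os.path.getsize(path)))
    entries.sort()  # oldest modification time first, ties broken by name

    total = sum(size for _, _, size in entries)
    deleted = []
    for _, name, size in entries:
        if total <= max_bytes:
            break  # budget satisfied, keep everything that is left
        os.remove(os.path.join(directory, name))
        total -= size
        deleted.append(name)
    return deleted
```

Run periodically next to the capture process, a routine like this keeps the most recent traffic on disk and silently drops the oldest, which is the behaviour a circular buffer needs.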
8. Traffic Monitoring and Recording (2)
- Use tcpdump (ftp://ftp.ee.lbl.gov) to capture packet headers and write them to files
- In our environment
  - We can't monitor the real-time network; switch backplane capacity could be exceeded at peak
  - We have 3 switches; however, at present we only monitor the switch where the file server is connected
  - Typical captured data rate during normal operation: 4 Gbytes / hour
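The capture rate and the recorder's disk size together determine how far back in time the buffer reaches. A rough estimate, assuming (this is not stated in the slides) that the whole 72 Gbyte disk is available for dump files:

```python
def retention_hours(disk_gbytes, capture_gbytes_per_hour):
    """Hours of traffic history a circular buffer of the given size can hold."""
    return disk_gbytes / capture_gbytes_per_hour

# 72 Gbyte disk and 4 Gbytes/hour are the figures quoted in the slides;
# dedicating the entire disk to the buffer is an assumption.
print(retention_hours(72, 4))  # 18.0
```

Under that assumption, roughly 18 hours of packet headers stay available, comfortably more than the 45 minutes needed in the scenario described earlier.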
9. Analysis Tools (1)
- How to look at Gigabytes of recorded network data?
  - Use tcpdump to filter dump file (e.g. "host bbr-srv02 and host bbr-srv03 and port nfs") into a smaller file
  - Use tcpslice (ftp://ftp.ee.lbl.gov) to isolate time intervals from the dump files
  - Use tcptrace to automatically analyze TCP connections and plot throughput graphs (http://jarok.cs.ohiou.edu/software/tcptrace/tcptrace.html)
  - Look at low rate events directly with tcpdump
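To make the time-slicing step concrete, the following sketch extracts a time interval from a dump held in memory. It illustrates the idea behind isolating an interval and is not how tcpslice itself is implemented. It assumes the classic libpcap file layout (24-byte global header, then records with a 16-byte header carrying seconds, microseconds, captured length and original length); the function name is hypothetical.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic libpcap magic, microsecond timestamps

def slice_pcap(data, start, end):
    """Return pcap bytes holding only records with start <= timestamp < end.

    `data` is the content of a classic libpcap file; `start` and `end`
    are Unix times in seconds.
    """
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")

    out = [data[:24]]  # copy the global header unchanged
    offset = 24
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        record_end = offset + 16 + incl_len
        if record_end > len(data):
            break  # truncated final record, stop here
        if start <= ts_sec + ts_usec / 1e6 < end:
            out.append(data[offset:record_end])
        offset = record_end
    return b"".join(out)
```

For multi-gigabyte dumps a real tool would stream the file rather than load it into memory; the in-memory version is kept here only for brevity.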
10. Analysis Tools (2)
- Sample tcptrace output for a connection (NFS)

    TCP connection 4:
        host g:        BBR-SRV03.SLAC.Stanford.EDU:32769
        host h:        BBR-SRV02.SLAC.Stanford.EDU:2049    <- NFS port on server
        complete conn: yes
        first packet:  Fri Jan 28 23:24:35.019938 2000
        last packet:   Fri Jan 28 23:24:35.027876 2000
        elapsed time:  0:00:00.007938
        total packets: 11
        filename:      srv02srv03.dump

                           g->h          h->g
        total packets:     6             5
        ack pkts sent:     5             5
        pure acks sent:    3             2
        unique bytes sent: 44            28
        actual data pkts:  1             1
        actual data bytes: 44            28
        data xmit time:    0.000 secs    0.000 secs
        idletime max:      4.4 ms        4.1 ms
        throughput:        5543 Bps      3527 Bps

Not much happened!
Much more info available, edited to fit ...
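The throughput figures in the sample are consistent with dividing the unique bytes sent by the connection's elapsed time (last packet time minus first packet time). The short check below reproduces them under that assumption; the function name is hypothetical.

```python
from datetime import datetime

def throughput_bps(first_packet, last_packet, unique_bytes):
    """Bytes per second over the connection lifetime (last minus first packet)."""
    fmt = "%H:%M:%S.%f"
    elapsed = datetime.strptime(last_packet, fmt) - datetime.strptime(first_packet, fmt)
    return unique_bytes / elapsed.total_seconds()

# Times and byte counts taken from the sample connection.
first, last = "23:24:35.019938", "23:24:35.027876"
print(round(throughput_bps(first, last, 44)))  # 5543
print(round(throughput_bps(first, last, 28)))  # 3527
```

With only 44 and 28 bytes exchanged in about 8 ms, the per-connection rate says little by itself, which is why the averaged plots are more useful for spotting load problems.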
11. Analysis Tools (3)
Throughput between two hosts. Yellow dots: instantaneous rate; quantization due to time resolution of packet time (GBit!). Red line: averaged rates.
12. Analysis Tools (4)
- The network dump can e.g. answer the following questions (and many more)
  - Who (UID, GID) has read the 25 Gbyte data file over NFS?
  - Were NFS timeouts correlated to a high NFS transaction volume/rate?
  - Which hosts were accessing the file server?
  - Do we have hosts/software with configuration problems? (Wrong subnet masks, applications using incorrect subnet broadcast addresses)
- However, the analysis of the files is complicated; we'd like to have better tools!
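As an illustration of the configuration question, the sketch below flags destination addresses that look like the broadcast address of a network with the wrong prefix length, which is what a host with a wrong subnet mask would send to. The subnet, the addresses and the function name are hypothetical examples, not taken from the BaBar network.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")

def misdirected_broadcasts(destinations, subnet, candidate_prefixes=(8, 16, 24)):
    """Flag destinations that are the broadcast address of a wrongly sized network.

    Returns (address, prefix_length) pairs, where prefix_length is the
    mask a misconfigured sender was probably using.
    """
    net = ipaddress.ip_network(subnet)
    flagged = []
    for dst in destinations:
        addr = ipaddress.ip_address(dst)
        if addr == net.broadcast_address or addr == LIMITED_BROADCAST:
            continue  # correct subnet broadcast or limited broadcast
        for prefix in candidate_prefixes:
            if prefix == net.prefixlen:
                continue
            guess = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
            if addr == guess.broadcast_address and guess.overlaps(net):
                flagged.append((str(addr), prefix))
                break
    return flagged

# Hypothetical /22 subnet whose correct broadcast address is 172.16.7.255.
observed = ["172.16.5.20", "172.16.7.255", "172.16.5.255", "172.16.255.255"]
print(misdirected_broadcasts(observed, "172.16.4.0/22"))
# [('172.16.5.255', 24), ('172.16.255.255', 16)]
```

A flagged address is only a hint: inside a /22, an address such as 172.16.5.255 can also be a legitimate host, so the source of the packets would still need to be checked by hand.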
13. Real-Time Analysis of Traffic
- A very interesting and promising free tool is NTOP (www.ntop.org)
- Captures packets, analyzes the protocol headers in real-time and dynamically generates web pages, e.g.
  - Protocols and their distribution
  - Hosts, host info, data sources and destinations
  - Throughput graphs
  - Traffic matrix
- Still in development, not perfectly stable yet
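For readers unfamiliar with the terms, a traffic matrix and a protocol distribution can both be derived from per-packet records by simple aggregation. The sketch below shows the idea on hypothetical (source, destination, protocol, bytes) tuples; it is not a description of how NTOP works internally.

```python
from collections import defaultdict

def traffic_matrix(records):
    """Sum bytes per (source, destination) pair."""
    matrix = defaultdict(int)
    for src, dst, _proto, nbytes in records:
        matrix[(src, dst)] += nbytes
    return dict(matrix)

def protocol_distribution(records):
    """Fraction of the total bytes carried by each protocol."""
    totals = defaultdict(int)
    for _src, _dst, proto, nbytes in records:
        totals[proto] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {proto: n / grand_total for proto, n in totals.items()}

# Hypothetical records: (source host, destination host, protocol, bytes).
records = [
    ("host-a", "server", "nfs", 300),
    ("host-b", "server", "dns", 100),
    ("host-a", "server", "nfs", 600),
]
print(traffic_matrix(records))
print(protocol_distribution(records))
```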
14. Real-Time Monitoring
NTOP example
15. Conclusions and Outlook
- Network traffic recording and analysis
  - is feasible (with some restrictions) even in high-performance switched network environments
  - looking forward to the next generation of switches and workstations capable of monitoring at gigabit speeds
  - has proven very helpful in understanding host and network performance problems and in troubleshooting the computing infrastructure
- Powerful free software tools are available
  - but multiple programs, command line based, make analysis of network traffic log files quite a complicated procedure
- The ultimate tool would be a PAW-like program for networks which allows filtering and plotting with a simple command language