Title: Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips
1 Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips
Caitie McCaffrey, Yemi Adesanya, August 2006
2
- The SLAC Computing Services Group is dedicated to providing leadership and support in computing and communications to the laboratory as a whole, and to physics research in particular
- Major Concerns
- Power consumption
- Cooling
- Monitoring
3 What Is My Computer Doing???
- I/O Rate
- CPU usage
- Memory Usage
- Temperature
- Fan Speed
- Load
Monitoring Software
- Low overhead
- Scalable
- Low impact on individual machines
4
- Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids
- Scalable: overhead increases with the number of clusters, not the number of nodes
- Works on multiple operating systems
- Stores data in a Round Robin Database (RRD)
- Measures metrics like CPU usage, load, I/O rate,
and memory usage
GMOND, GMETAD, GMETRIC
5 Ganglia Architecture
http://www.slac.stanford.edu/comp/unix/ganglia/index.html
[Diagram: Ganglia architecture]
- Updates RRD, polls clusters periodically
- Cluster One (machines A, B, C): all machines know state of entire cluster
- Cluster Two (machines 1, 2, 3, 4): machines 1 and 3 know state of entire cluster
7 GMETRIC
- Lets users publish their own metrics, extending the core set monitored by the gmond daemon
- Each metric has a
- Name
- Value
- Type
- Units
- gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius
- Useful because it lets us be more machine specific: we can monitor temperature and fan speed
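As a rough sketch of how a script might publish such a metric, the snippet below builds the gmetric command line from the four fields listed above (name, value, type, units). It assumes the gmetric binary is installed and on PATH. The helper names `gmetric_command` and `publish` are hypothetical, and the config path is simply the one shown on the slide.

```python
import subprocess

def gmetric_command(name, value, mtype, units, conf="/var/ganglia/gmond.conf"):
    """Build the argument list for one gmetric call (nothing is executed here)."""
    return [
        "gmetric",
        f"--conf={conf}",   # gmond configuration file, path taken from the slide
        "-n", name,         # metric name
        "-v", str(value),   # metric value
        "-t", mtype,        # metric type, e.g. uint8
        "-u", units,        # units label, e.g. Celsius
    ]

def publish(name, value, mtype, units):
    """Run gmetric once; assumes the gmetric binary is on PATH."""
    subprocess.run(gmetric_command(name, value, mtype, units), check=True)

if __name__ == "__main__":
    # Only print the command, so the sketch runs on machines without Ganglia.
    print(" ".join(gmetric_command("CPUTemp1", 75, "uint8", "Celsius")))
```

Keeping the command construction separate from the call that runs it makes the script easy to check on a machine that has no Ganglia installation.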
8 A little bit on hardware
- Noma: batch machines
- Tyan Thunder LE-T motherboard
- Winbond w83782d (lm_sensors compatible)
- 2 Pentium III processors
- Why is temperature important?
- Chip specifications give a temperature range
- Behavior is unpredictable outside the temperature range
- Clues to weird machine behavior
- Pentiums have a max temp of 77-82 C
9 What's a Noma?
NOMA
- Horse from Noma County, Japan
- Smallest native Japanese pony: 10.1-10.3 hands
- Super rare: 27 pure-blood Nomas left (1988)
Some more machines
DON
COB
TORI
ORLOV
MORAB
10
caitiem@noma0449 sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1:    +1.48 V  (min =  +4.08 V, max =  +4.08 V)
VCore 2:    +1.26 V  (min =  +4.08 V, max =  +4.08 V)
+3.3V:      +3.37 V  (min =  +2.97 V, max =  +3.63 V)
+5V:        +4.97 V  (min =  +4.50 V, max =  +5.48 V)
+12V:      +12.08 V  (min = +10.79 V, max = +13.11 V)
-12V:       -1.03 V  (min = -13.21 V, max = -10.90 V)
-5V:        +2.84 V  (min =  -5.51 V, max =  -4.51 V)
V5SB:       +5.12 V  (min =  +4.50 V, max =  +5.48 V)
VBat:       +3.34 V  (min =  +2.70 V, max =  +3.29 V)
fan1:      8231 RPM  (min = 3000 RPM, div = 2)
fan2:      8333 RPM  (min = 3000 RPM, div = 2)
fan3:         0 RPM  (min = 3000 RPM, div = 2)
temp1:       +77°C   (limit = +60°C)  sensor = thermistor  ALARM
temp2:     +65.0°C   (limit = +60°C, hysteresis = +50°C)  sensor = thermistor
11 Perl
- Fills the gap between low-level languages like C and C++ and high-level languages like shell
- Mostly fast
- Basically unlimited
- Good for working with text
- Portable
- Regular Expressions
- /temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/
- matches
- temp1:       +77°C   (limit = +60°C)  sensor = thermistor
- temp2:     +65.0°C   (limit = +60°C, hysteresis = +50°C)  sensor = thermistor
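To make the text-matching step concrete, here is a minimal sketch of pulling temperatures and fan speeds out of `sensors` output. It is written in Python rather than Perl purely for illustration, and it is not the script from the talk. The patterns are assumptions: the colon and the leading "+" are treated as optional so that slightly different output formats still match. The names `parse_sensors`, `TEMP_RE`, `FAN_RE`, and `SAMPLE` are hypothetical.

```python
import re

# Illustrative patterns, not the original Perl ones: the colon and the
# leading "+" are optional so both "temp1: +77°C" and "temp1 77C" match.
TEMP_RE = re.compile(r"temp([0-9]):?\s+\+?([0-9]+\.?[0-9]*)")
FAN_RE = re.compile(r"fan([0-9]):?\s+([0-9]+)\s*RPM")

def parse_sensors(text):
    """Return ({sensor number: degrees C}, {fan number: RPM}) found in text."""
    temps = {int(m.group(1)): float(m.group(2)) for m in TEMP_RE.finditer(text)}
    fans = {int(m.group(1)): int(m.group(2)) for m in FAN_RE.finditer(text)}
    return temps, fans

# Sample lines with the values shown on the noma0449 slide.
SAMPLE = """\
fan1:      8231 RPM  (min = 3000 RPM, div = 2)
fan2:      8333 RPM  (min = 3000 RPM, div = 2)
fan3:         0 RPM  (min = 3000 RPM, div = 2)
temp1:       +77°C   (limit = +60°C)  sensor = thermistor  ALARM
temp2:     +65.0°C   (limit = +60°C, hysteresis = +50°C)  sensor = thermistor
"""

if __name__ == "__main__":
    temps, fans = parse_sensors(SAMPLE)
    print(temps)
    print(fans)
```

Only the first number after each label is captured, so the limit and hysteresis values in parentheses are ignored.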
12 Sample Time - Decreasing
- Time interval: 12.15 minutes
- Fri Aug 11 03:04:05 PDT 2006
- FanSpeed1: 8035
- FanSpeed2: 7941
- Temp 1: 77
- Change: 0
- Temp 2: 64.0
- Change: 0
- Temp 3: 64.0
- Change: 1
- Time interval: 9.8415 minutes
- Fri Aug 11 03:16:15 PDT 2006
Want sample time to decrease faster when temperatures are changing faster
NewTime = OldTime * Decrement ^ (Change / Trigger)
If NewTime < MinTime then NewTime = MinTime
- Parameters
- Trigger 0.5 degrees
- Decrement 0.9
- MaxTime 15 minutes
- MinTime 1 minute
NewTime = 12.15 * 0.9 ^ (1 / 0.5) = 12.15 * 0.81 = 9.8415
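The shrinking rule can be sketched in a few lines. This is an illustration under the parameters listed on the slide (Trigger 0.5 degrees, Decrement 0.9, MinTime 1 minute), not the original script. The function name `shorten_interval` and its argument names are hypothetical.

```python
def shorten_interval(old_time, change, trigger=0.5, decrement=0.9, min_time=1.0):
    """Shrink the sampling interval (minutes) in proportion to the temperature change.

    new = old * decrement ** (change / trigger), never below min_time.
    """
    new_time = old_time * decrement ** (change / trigger)
    return max(new_time, min_time)

if __name__ == "__main__":
    # Worked example from the slide: a 1 degree change at a 12.15 minute interval.
    print(f"{shorten_interval(12.15, 1):.4f}")  # 9.8415
```

A change of zero leaves the interval untouched, since the exponent is then 0, while a large change drives the interval down to the 1 minute floor.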
13 Sample Time - Increasing
- Time interval: 12.15 minutes
- Fri Aug 11 08:25:18 PDT 2006
- Found FanSpeed1: 8035
- Found FanSpeed2: 7941
- Temp 1: 77
- Change: 0
- Temp 2: 64.0
- Change: 0
- Temp 3: 64.0
- Change: 0
- Time interval: 13.5 minutes
- Fri Aug 11 08:37:28 PDT 2006
Want sample time to increase when temperature is changing slowly or not at all
If we increase by large amounts we could miss valuable data
NewTime = OldTime / Decrement
- Parameters
- Trigger 0.5 degrees
- Decrement 0.9
- MaxTime 15 minutes
- MinTime 1 minute
NewTime = 12.15 / 0.9 = 13.5
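The growth rule is the mirror image of the shrinking one. The sketch below assumes the interval is capped at MaxTime (15 minutes). The slides list MaxTime as a parameter but do not spell out the cap, so that part is an assumption made by symmetry with the MinTime check. The name `lengthen_interval` is hypothetical.

```python
def lengthen_interval(old_time, decrement=0.9, max_time=15.0):
    """Grow the sampling interval (minutes) by one small step.

    new = old / decrement, capped at max_time (the cap is an assumption).
    """
    return min(old_time / decrement, max_time)

if __name__ == "__main__":
    interval = 12.15
    for _ in range(3):
        interval = lengthen_interval(interval)
        print(f"{interval:.2f}")
```

Dividing by 0.9 grows the interval by about 11 percent per step, which keeps the increase gradual enough that a sudden temperature change is not missed for long.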
14
[Figures: noma0450, noma0449]
15
- Currently up and running on two Nomas
- Noma0449
- Noma0450
- Will be installed on all Nomas
- Can be used on any Ganglia monitored machine with
a compatible Winbond chip
Acknowledgements
Many thanks to the DOE, the SCCS systems group, and especially Yemi Adesanya, John Goebel, and Karl Amrhein for all their help throughout the summer.
16 Smartmontools for SCSI devices
- Command: smartctl -l error /dev/sda
Error counter log:
         Errors Corrected     Total      Total      Correction    Gigabytes      Total
           delay:            rereads/    errors     algorithm     processed      uncorrected
         minor |  major      rewrites    corrected  invocations   [10^9 bytes]   errors
read:    234237        0            0    234237     234237        605.516        0
write:        0        0            0         0          0        1457.589       0
Non-medium error count: 0
http://smartmontools.sourceforge.net/smartmontools_scsi.html
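One way a script could read these counters is sketched below. It assumes the read and write rows each carry seven numeric columns in the order shown in the log above. The field names in `FIELDS` and the function name `parse_error_log` are hypothetical labels chosen for this example, not identifiers from smartmontools.

```python
import re

# Hypothetical names for the seven columns, in the order they appear in the log.
FIELDS = (
    "minor_delay", "major_delay", "rereads_rewrites", "total_corrected",
    "algorithm_invocations", "gigabytes_processed", "total_uncorrected",
)

# A row is "read" or "write", an optional colon, then seven numbers.
ROW_RE = re.compile(r"\b(read|write):?\s+((?:[0-9.]+\s+){6}[0-9.]+)")

def parse_error_log(text):
    """Return {"read": {...}, "write": {...}} with one entry per column."""
    rows = {}
    for m in ROW_RE.finditer(text):
        values = m.group(2).split()
        rows[m.group(1)] = {
            # Gigabytes processed is fractional; every other column is a count.
            field: (float(v) if field == "gigabytes_processed" else int(v))
            for field, v in zip(FIELDS, values)
        }
    return rows

# Rows copied from the slide.
SAMPLE_LOG = """\
read:    234237        0            0    234237     234237        605.516        0
write:        0        0            0         0          0        1457.589       0
Non-medium error count: 0
"""

if __name__ == "__main__":
    for op, counters in parse_error_log(SAMPLE_LOG).items():
        print(op, counters)
```

Because the pattern requires seven numbers after the label, the header lines (which contain words such as "rereads/") are skipped.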
17 Corrected Errors
- Minor / Fast
- Correction algorithm works successfully
- No delay to reading later sectors
- These are ok
- Major / Slow
- Correction algorithm works successfully
- Delay in reading later sectors
- Not so good
- Uncorrected Errors
- Correction algorithm fails
- Very Bad
18 Other Information
- Total rereads/rewrites: errors corrected by applying retries
- Total errors corrected: number of all correctable errors
- Correction algorithm invocations: number of times the correction algorithm is used
- Gigabytes processed: number of bytes successfully and unsuccessfully read or written
19
This indicates there might be a problem
This should be a flag as well
This is OK: it's correcting the errors and not losing any time doing so
20 errorsWatch
- Monitors
- Read Uncorrected Errors
- Read Delayed Errors
- Read No Delay Errors
- Write Uncorrected Errors
- Write Delayed Errors
- Write No Delay Errors
- Total Uncorrected Errors
- Total Delayed Errors
Machines: Noma, Don, Tori, Cob, Morab, Orlov
Collects Data Once a Day
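As a sketch of how the eight monitored values could be derived from one day's counters, the function below maps the read and write rows onto the metric names listed above. Two things here are assumptions rather than facts from the slides: "No Delay" is taken to mean the minor (fast) column and "Delayed" the major (slow) column, and each total is taken to be the sum of its read and write counterparts. The function name `errors_watch_metrics` and the dictionary keys are hypothetical.

```python
def errors_watch_metrics(read, write):
    """Map per-operation counters onto the eight errorsWatch metric names.

    read and write are dicts with the keys "minor_delay", "major_delay",
    and "total_uncorrected" (hypothetical key names).
    """
    metrics = {}
    for label, row in (("Read", read), ("Write", write)):
        metrics[f"{label} Uncorrected Errors"] = row["total_uncorrected"]
        metrics[f"{label} Delayed Errors"] = row["major_delay"]      # assumed mapping
        metrics[f"{label} No Delay Errors"] = row["minor_delay"]     # assumed mapping
    # Totals as read + write sums: an assumption, not stated on the slides.
    metrics["Total Uncorrected Errors"] = (
        read["total_uncorrected"] + write["total_uncorrected"]
    )
    metrics["Total Delayed Errors"] = read["major_delay"] + write["major_delay"]
    return metrics

if __name__ == "__main__":
    # Counter values taken from the smartctl example earlier in the talk.
    read = {"minor_delay": 234237, "major_delay": 0, "total_uncorrected": 0}
    write = {"minor_delay": 0, "major_delay": 0, "total_uncorrected": 0}
    for name, value in errors_watch_metrics(read, write).items():
        print(name, value)
```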