Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips

Transcript and Presenter's Notes

1
Monitoring Temperature and Fan Speed Using
Ganglia and Winbond Chips
Caitie McCaffrey, Yemi Adesanya August 2006
2
  • The SLAC Computing Services Group is
    dedicated to providing leadership and support in
    computing and communications to the laboratory as
    a whole, and to physics research, in particular
  • Major Concerns
  • Power consumption
  • Cooling
  • Monitoring

3
What Is My Computer Doing???
  • I/O Rate
  • CPU usage
  • Memory Usage
  • Temperature
  • Fan Speed
  • Load

Monitoring software should be:
  • Low overhead
  • Scalable
  • Low impact on individual machines
4
  • Ganglia is a scalable distributed monitoring
    system for high-performance computing systems
    such as clusters and Grids
  • Scalable: overhead grows with the number of clusters, not the number of nodes
  • Works on multiple operating systems
  • Stores data in a Round Robin Database (RRD)
  • Measures metrics like CPU usage, load, I/O rate,
    and memory usage

GMOND, GMETAD, GMETRIC
5
Ganglia Architecture
http://www.slac.stanford.edu/comp/unix/ganglia/index.html
[Diagram labels] Updates RRD, polls clusters periodically
Cluster One (machines A, B, C): all machines know the state of the entire cluster
Cluster Two (machines 1, 2, 3, 4): machines 1 and 3 know the state of the entire cluster
6
(No Transcript)
7
GMETRIC
  • Lets users publish additional metrics beyond the core set monitored by the gmond daemon
  • Name
  • Value
  • Type
  • Units
  • gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius
  • Useful because it lets us be more machine-specific: we can monitor temperature and fan speed
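As a rough illustration of how a script might hand a reading to gmetric, here is a minimal Python sketch. The helper names `gmetric_command` and `publish` are hypothetical, and the default configuration path is simply the one shown on the slide; the option letters (-n, -v, -t, -u) are gmetric's name, value, type, and units options.

```python
import subprocess

def gmetric_command(name, value, metric_type, units,
                    conf="/var/ganglia/gmond.conf"):
    """Build the argument list for a single gmetric call (hypothetical helper)."""
    return [
        "gmetric",
        f"--conf={conf}",      # gmond configuration file, path taken from the slide
        "-n", str(name),       # metric name
        "-v", str(value),      # metric value
        "-t", str(metric_type),  # metric type, e.g. uint8
        "-u", str(units),      # units label shown in Ganglia
    ]

def publish(name, value, metric_type, units):
    """Run gmetric; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(gmetric_command(name, value, metric_type, units), check=True)

if __name__ == "__main__":
    # Only print the command here, so the sketch runs on machines without Ganglia.
    print(" ".join(gmetric_command("CPUTemp1", 75, "uint8", "Celsius")))
```

Passing the arguments as a list avoids shell quoting problems if a metric name or unit ever contains spaces.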

8
A little bit on hardware
  • Noma - batch machines
  • Tyan Thunder LE-T motherboard
  • Winbond w83782d (lm_sensors compatible)
  • 2 Pentium III processors
  • Why is temperature important?
  • Chip specifications give a temperature range
  • Behavior is unpredictable outside that range
  • Clues to weird machine behavior
  • Pentiums have a maximum temperature of 77-82 C

Tyan Thunder LE-T
9
What's a Noma?
NOMA
  • Horse from Noma County, Japan
  • Smallest native Japanese pony: 10.1-10.3 hands
  • Super rare: 27 pure-blood Nomas left (1988)

Some more machines
DON
COB
TORI
ORLOV
MORAB
10
  • caitiem@noma0449$ sensors
  • w83782d-i2c-0-29
  • Adapter: SMBus PIIX4 adapter at 0580
  • Algorithm: Non-I2C SMBus adapter
  • VCore 1: +1.48 V (min = +4.08 V, max = +4.08 V)
  • VCore 2: +1.26 V (min = +4.08 V, max = +4.08 V)
  • +3.3V: +3.37 V (min = +2.97 V, max = +3.63 V)
  • +5V: +4.97 V (min = +4.50 V, max = +5.48 V)
  • +12V: +12.08 V (min = +10.79 V, max = +13.11 V)
  • -12V: -1.03 V (min = -13.21 V, max = -10.90 V)
  • -5V: +2.84 V (min = -5.51 V, max = -4.51 V)
  • V5SB: +5.12 V (min = +4.50 V, max = +5.48 V)
  • VBat: +3.34 V (min = +2.70 V, max = +3.29 V)
  • fan1: 8231 RPM (min = 3000 RPM, div = 2)
  • fan2: 8333 RPM (min = 3000 RPM, div = 2)
  • fan3: 0 RPM (min = 3000 RPM, div = 2)
  • temp1: +77C (limit = +60C) sensor = thermistor ALARM
  • temp2: +65.0C (limit = +60C, hysteresis = +50C) sensor = thermistor

11
Perl
  • Fills the gap between low-level languages like C and C++ and high-level languages like shell
  • Mostly fast
  • Basically unlimited
  • Good for working with text
  • Portable
  • Regular Expressions
  • /temp([0-9]):\s*\+([0-9\.0-9]*)/
  • matches
  • temp1: +77C (limit = +60C) sensor = thermistor
  • temp2: +65.0C (limit = +60C, hysteresis = +50C) sensor = thermistor
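The same kind of match can be sketched in Python. The pattern below is an assumption rather than the script's own expression: it is deliberately tolerant, accepting an optional colon after the label and an optional plus sign before the reading. The function name `parse_temps` is hypothetical.

```python
import re

# Assumed, tolerant pattern: sensor index, optional ":", optional "+", then the reading.
TEMP_RE = re.compile(r"temp([0-9]):?\s*\+?([0-9]+(?:\.[0-9]+)?)")

def parse_temps(sensors_output):
    """Map each temperature sensor index to its reading in degrees Celsius."""
    temps = {}
    for line in sensors_output.splitlines():
        match = TEMP_RE.search(line)
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps

sample = """\
fan1: 8231 RPM (min = 3000 RPM, div = 2)
temp1 77C (limit 60C) sensor thermistor
temp2: +65.0C (limit = +60C, hysteresis = +50C) sensor = thermistor
"""

print(parse_temps(sample))  # {1: 77.0, 2: 65.0}
```

Only the first number after the `tempN` label is captured, so the limit and hysteresis values on the same line are ignored.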

12
Sample Time - Decreasing
  • Time interval: 12.15 minutes
  • Fri Aug 11 03:04:05 PDT 2006
  • FanSpeed1: 8035
  • FanSpeed2: 7941
  • Temp 1: 77
  • Change: 0
  • Temp 2: 64.0
  • Change: 0
  • Temp 3: 64.0
  • Change: 1
  • Time interval: 9.8415 minutes
  • Fri Aug 11 03:16:15 PDT 2006

The sample time should decrease faster when temperatures are changing faster:
NewTime = OldTime * Decrement^(Change / Trigger)
If NewTime < MinTime, then NewTime = MinTime
  • Parameters
  • Trigger 0.5 degrees
  • Decrement 0.9
  • MaxTime 15 minutes
  • MinTime 1 minute

NewTime = 12.15 * 0.9^(1 / 0.5) = 9.8415
13
Sample Time Increasing
  • Time interval: 12.15 minutes
  • Fri Aug 11 08:25:18 PDT 2006
  • Found FanSpeed1: 8035
  • Found FanSpeed2: 7941
  • Temp 1: 77
  • Change: 0
  • Temp 2: 64.0
  • Change: 0
  • Temp 3: 64.0
  • Change: 0
  • Time interval: 13.5 minutes
  • Fri Aug 11 08:37:28 PDT 2006

The sample time should increase when the temperature is changing slowly or not at all.
If we increase it by large amounts, we could miss valuable data.
NewTime = OldTime / Decrement
  • Parameters
  • Trigger 0.5 degrees
  • Decrement 0.9
  • MaxTime 15 minutes
  • MinTime 1 minute

NewTime = 12.15 / 0.9 = 13.5
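The two update rules can be combined in a short Python sketch. Two points are assumptions, not taken from the slides: the interval shrinks whenever the change reaches the trigger and grows otherwise, and the change passed in is the largest change seen across the sensors. The function name `next_interval` is hypothetical; the parameter values are the ones listed above.

```python
TRIGGER = 0.5     # degrees
DECREMENT = 0.9
MAX_TIME = 15.0   # minutes
MIN_TIME = 1.0    # minutes

def next_interval(old_time, change):
    """Return the next sample interval in minutes.

    Assumption: a change at or above TRIGGER shrinks the interval,
    anything smaller lets it grow again.
    """
    change = abs(change)
    if change >= TRIGGER:
        # Shrinks faster for larger changes: Decrement^(Change / Trigger)
        new_time = old_time * DECREMENT ** (change / TRIGGER)
    else:
        # Grows slowly so that little data is missed
        new_time = old_time / DECREMENT
    # Keep the result between MinTime and MaxTime
    return min(MAX_TIME, max(MIN_TIME, new_time))

print(round(next_interval(12.15, 1), 4))  # 9.8415
print(round(next_interval(12.15, 0), 4))  # 13.5
```

Clamping at both ends means a long quiet period settles at 15 minutes, while a rapid temperature swing never drives the interval below 1 minute.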
14
noma0450
noma0449
15
  • Up and running on two Nomas currently
  • Noma0449
  • Noma0450
  • Will be installed on all Nomas
  • Can be used on any Ganglia-monitored machine with
    a compatible Winbond chip

Acknowledgements
Many thanks to the DOE, the SCCS systems group, and
especially Yemi Adesanya, John Goebel, and Karl
Amrhein for all their help throughout the summer.
16
Smartmontools for SCSI devices
  • Command: smartctl -l error /dev/sda

Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0

Non-medium error count: 0
http://smartmontools.sourceforge.net/smartmontools_scsi.html
17
Corrected Errors
  • Minor/ Fast
  • Correction algorithm works successfully
  • No delay to reading later sectors
  • These are ok
  • Major / Slow
  • Correction algorithm works successfully
  • Delay in reading later sectors
  • Not so good
  • Uncorrected Errors
  • Correction algorithm fails
  • Very Bad

18
Other Information
  • Total rereads/rewrites: errors corrected by applying retries
  • Total errors corrected: number of all correctable errors
  • Correction algorithm invocations: number of times the algorithm is used
  • Gigabytes processed: number of bytes successfully and unsuccessfully read or written

19
This indicates there might be a problem
This should be a flag as well
This is OK: it's correcting the errors and not
losing any time doing so
20
errorsWatch
  • Monitors
  • Read Uncorrected Errors
  • Read Delayed Errors
  • Read No Delay Errors
  • Write Uncorrected Errors
  • Write Delayed Errors
  • Write No Delay Errors
  • Total Uncorrected Errors
  • Total Delayed Errors

Machines: Noma, Don, Tori, Cob, Morab, Orlov
Collects data once a day
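To show how the monitored values relate to the smartctl error counter log, here is a hedged Python sketch. It assumes the "read:" and "write:" rows each carry seven columns in the order shown in the log (minor, major, rereads/rewrites, total corrected, invocations, gigabytes, uncorrected), and it maps minor to "No Delay" and major to "Delayed". The function and field names are hypothetical and are not taken from errorsWatch itself.

```python
FIELDS = ("no_delay", "delayed", "rereads_rewrites", "total_corrected",
          "algorithm_invocations", "gigabytes_processed", "uncorrected")

def parse_error_counters(log_text):
    """Extract the read and write rows from 'smartctl -l error' SCSI output."""
    counters = {}
    for line in log_text.splitlines():
        label, sep, rest = line.strip().partition(":")
        values = rest.split()
        if sep and label in ("read", "write") and len(values) == len(FIELDS):
            counters[label] = {
                name: float(v) if name == "gigabytes_processed" else int(v)
                for name, v in zip(FIELDS, values)
            }
    return counters

def errors_watch_metrics(counters):
    """Build the eight monitored values; totals are read plus write (assumption)."""
    metrics = {}
    for label in ("Read", "Write"):
        row = counters.get(label.lower(), {})
        metrics[f"{label} Uncorrected Errors"] = row.get("uncorrected", 0)
        metrics[f"{label} Delayed Errors"] = row.get("delayed", 0)
        metrics[f"{label} No Delay Errors"] = row.get("no_delay", 0)
    for kind in ("Uncorrected", "Delayed"):
        metrics[f"Total {kind} Errors"] = (
            metrics[f"Read {kind} Errors"] + metrics[f"Write {kind} Errors"])
    return metrics

sample_log = """\
Error counter log:
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
Non-medium error count:        0
"""

metrics = errors_watch_metrics(parse_error_counters(sample_log))
print(metrics["Read No Delay Errors"], metrics["Total Uncorrected Errors"])  # 234237 0
```

Each resulting value could then be published once a day as its own metric.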