I control all my hosts with a script running on a server machine. I've recently added a lot of functionality to the script in several areas.
I've been fine-tuning parts of it for a week or two and the final version was launched yesterday. Today I found that it performed as expected in a hardware failure situation that occurred earlier in the day.
I use lots of old (2001-2003 vintage) hard disks in my machines. Because of their age (and their hot, 24/7 life) you might expect regular failures. What is surprising is that an HD failure, for me, is quite an uncommon event. That may be linked to the fact that the drives rarely stop/start. Another factor is that more than half of my drives are of one particular model (Seagate ST3200-14A 20GB). These seem to want to go on forever.
I just had an HD failure. It wasn't one of those Seagates but rather a 20GB Western Digital. The failure was picked up by my control script as it happened, so as soon as I arrived on site and switched on the monitor for the server machine, the notification was immediately visible on the screen. It had happened a couple of hours earlier. The script produces an ongoing list of the last octets of the IP addresses of all the machines it visits. This list is continually updated on the screen as each host is monitored.
Any particular octet can be colour-coded before printing to indicate various conditions and/or failure modes. Ordinarily, the ongoing record is just a list of these octets in black on the white screen background. For things needing urgent attention, the octet is printed in bold red on black, which makes it highly visible. There are about 10 different things that are deemed to need urgent attention, so once I notice one on the screen, I need to consult a log file to get the full details of the problem. Here is a snippet from the log file at the time of the HD failure. The times are local (UTC+10). The top left-hand corner of each host's data gives the time, IP address, date and hostname for the machine.
07:31:49: 192.168.0.89   New Tasks  : FGRP= 3  BRP6= 3  TOTAL= 6  -- this loop only
30-Jan-16 i3_3240-03     Sub-Totals : FGRP= 3  BRP6= 3  TOTAL= 6  -- after last loop (3.5hrs)
                         Accum Tasks: FGRP= 8  BRP6= 15 TOTAL= 23 -- after loop 5
                         EAH Files  : None DNL but 1 SKP --> LATeah0153E.dat
                         GPU Tasks  : Actual/Min/Deg=6/4/ 70C  Curr/Est RAC=92761/90000  Ratio=1.031 - >0.9 so OK!

07:35:18: 192.168.0.90   When testing if 'sensors' is installed, got -> /usr/bin/xauth: error in lock /usr/bin/sensors
30-Jan-16 i3_4130-02     Host problem - ssh: Invalid return from command.
                         An integer was expected but got ->/usr/bin/xauth: error in locking authority file /home/gary/.Xauthority <- from date +%s ######
                         Fatal error trying to get the current unix time
There are entries for two hosts. The first entry (last octet .89) is for a host with no problem - it shows the normal output for new tasks requested and for data files downloaded or skipped (supplied from the cache instead of being downloaded). The last line is for GPU tasks returned. The first group of three numbers represents tasks returned (actual) in a particular period, the absolute minimum this should be before issuing a warning, and the GPU temperature at the time. The second group is the current and theoretical RAC values. The script is happy as long as the ratio of the two exceeds 0.9. So this is just a normal, routine log entry.
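For anyone wondering, the ratio test itself is nothing fancy. Something along these lines does the job (a simplified sketch only - the variable names and the use of awk here are illustrative, not the actual script code):

#!/bin/bash
# Simplified sketch of the RAC sanity check - example values, not real code.
curr_rac=92761        # current RAC reported for the host
est_rac=90000         # theoretical/expected RAC for this class of host

# bash has no floating point arithmetic, so use awk for the 0.9 threshold test.
ratio=$(awk -v c="$curr_rac" -v e="$est_rac" 'BEGIN { printf "%.3f", c / e }')
if awk -v r="$ratio" 'BEGIN { exit !(r > 0.9) }'; then
    echo "Curr/Est RAC=${curr_rac}/${est_rac} Ratio=${ratio} - >0.9 so OK!"
else
    echo "Ratio=${ratio} is below 0.9 - flag this host for attention"
fi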
The second entry (last octet is .90) is for the host with the disk failure. Since all communication with the worker hosts is done using ssh, I've gone to a bit of extra effort to intercept what comes back for each ssh connection. This is done inside a bash function which runs ssh under a timeout and parses the return before deciding if there's a problem or not. In this particular case, a temperature measurement was about to be performed and a preliminary check was being made to test that the program required to read the value was installed on the host. What came back (the xauth error message) was not what was expected so the attempt to measure the temperature was abandoned. This is not regarded as a 'fatal' error and would otherwise have been flagged as a warning and the script would have moved on.
However, the next bit is regarded as a fatal error. A date command was being run to get the unix time on the host and the bash function was expecting the integer representation of unix time but got an error message instead. At this point the logic in the function declares a fatal error and all further attempts to deal with the host are stopped. The script just puts appropriate entries in various log files and on the console screen before moving to the next host.
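In outline, that wrapper amounts to something like this (a much-simplified sketch - the function name, the 15-second timeout and the return conventions are illustrative, not the actual code):

# Simplified sketch: run ssh under a timeout and validate the reply.
# remote_int HOST CMD  -- expects CMD to print a single integer.
remote_int() {
    local host="$1" cmd="$2" reply
    reply=$(timeout 15 ssh -o BatchMode=yes "$host" "$cmd" 2>&1)

    # Anything that isn't a plain integer (e.g. an xauth error message)
    # means the host has a problem, so signal failure to the caller.
    if [[ "$reply" =~ ^[0-9]+$ ]]; then
        printf '%s\n' "$reply"
        return 0
    fi
    printf 'Invalid return from command: %s\n' "$reply" >&2
    return 1
}

# For example, fetching the remote unix time as in the log entry above:
if ! now=$(remote_int 192.168.0.90 'date +%s'); then
    echo "Fatal error trying to get the current unix time" >&2
fi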
I've configured all my hosts to be able to measure CPU core temperatures and GPU temperatures and have the values logged every time the control script pays a visit. Here are excerpts from the two separate logs being kept. These both cover the period of the problem on host .90 and also show the host back in action with a replacement hard disk. The first table is for CPU core temperatures.
....
80,NVI,  2-69/70,  2-71/71,  2-70/72,  2-71/71,  2-72/73,  2-71/72
81,NVI,  2-61/65,  2-57/62,  2-59/64,  2-59/65,  2-60/63,  2-60/65
82,NVI,  2-63/64,  2-62/64,  2-62/63,  2-62/63,  2-62/64,  2-62/65
83,NVI,  2-63/65,  2-62/63,  2-63/63,  2-63/63,  2-64/64,  2-62/64
84,NVI,  2-63/67,  2-63/66,  2-64/66,  2-64/66,  2-63/65,  2-63/68
85,AMD,  2-78/79,  2-79/79,  2-79/79,  2-79/79,  2-80/81,  2-79/80
86,AMD,  2-67/72,  2-64/70,  2-65/69,  2-67/71,  2-65/71,  2-65/71
87,AMD,  2-56/59,  2-55/59,  2-54/61,  2-55/59,  2-54/59,  2-52/60
88,AMD,  2-67/67,  2-64/65,  2-64/64,  2-66/67,  2-64/64,  2-64/65
89,AMD,  2-61/65,  2-61/65,  2-62/64,  2-62/64,  2-64/67,  2-64/68
90,AMD,  2-68/70,  2-67/68,  2-66/68,  XXXXX,    XXXXX,    2-69/71
92,AMD,  4-72/79,  4-72/82,  4-69/77,  4-72/82,  4-71/81,  4-70/78
....
I've included a number of hosts apart from the problem one, with a mixture of GPU types. The first two entries on a line are the octet and the GPU type. After that, there are comma-separated data groups that are approximately 3.5 hours apart (7 per 24 hours). In each group, the three numbers represent the number of physical CPU cores, Core Temp Min, and Core Temp Max. All the dual-core hosts in the above list are i3s with HT enabled - 4 virtual cores. The only true quad core is .92, and you can see a big difference between Min and Max values because only two of its cores are running CPU tasks.
If a temperature cannot be measured for any reason, particular strings get inserted to indicate the reason. XXXXX means the host had a fatal error of some sort. The next table shows GPU temperatures for the same group of hosts.
....
80,NVI, 67C, 67C, 67C, 69C, 68C, 68C, 67C, 67C, 67C, 67C, 68C
81,NVI, 69C, 69C, 69C, 71C, 69C, 70C, 69C, 69C, 69C, 69C, 70C
82,NVI, 72C, 72C, 72C, 73C, 73C, 73C, 73C, 73C, 73C, 73C, 73C
83,NVI, 68C, 68C, 68C, 70C, 69C, 69C, 69C, 68C, 69C, 68C, 69C
84,NVI, 72C, 72C, 72C, 73C, 72C, 72C, 72C, 72C, 72C, 72C, 72C
85,AMD, 72C, 72C, 73C, 74C, 73C, 73C, 73C, 73C, 73C, 73C, 73C
86,AMD, 69C, 69C, 68C, 69C, 69C, 69C, 69C, 69C, 70C, 68C, 69C
87,AMD, 71C, 71C, 71C, 73C, 72C, 73C, 73C, 72C, 72C, 72C, 73C
88,AMD, 71C, 71C, 71C, 72C, 71C, 72C, 71C, 72C, 72C, 71C, 72C
89,AMD, 70C, 70C, 70C, 71C, 70C, 71C, 70C, 70C, 70C, 73C, 73C
90,AMD, 69C, 69C, 69C, 69C, 69C, 69C, 69C, 69C, XXX, XXX, 70C
92,AMD, 66C, 66C, 67C, 68C, 67C, 66C, 65C, 66C, 67C, 67C, 68C
....
Once again, the last octet and the GPU type are listed first, with temperature values approximately 3.5 hrs apart. As time progresses, new columns are added on the right until the length of all lines gets close to the right-hand margin of the screen. At that point the log is rotated: the just-completed file is given an index number and saved, and a new blank file with just the octet and GPU type is started. This is all handled automatically by the script.
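The rotation logic boils down to something like the following (a rough sketch only - the margin width, the file naming and the field handling are assumptions, not the script's actual values):

# Rough sketch of the width-based log rotation - assumed values throughout.
LOG=gpu_temps.log
MARGIN=130          # assumed 'right hand margin' threshold in characters

# Length of the longest line currently in the log.
longest=$(awk '{ if (length($0) > m) m = length($0) } END { print m+0 }' "$LOG")

if (( longest >= MARGIN )); then
    # Save the completed file under the next free index number.
    idx=1
    while [[ -e "${LOG}.${idx}" ]]; do idx=$((idx + 1)); done
    mv "$LOG" "${LOG}.${idx}"

    # Start a fresh log containing only the octet and GPU type columns.
    awk -F', *' '{ print $1 "," $2 }' "${LOG}.${idx}" > "$LOG"
fi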
Once I discovered the hard disk failure, it took just over 2 hours to restore the machine. The replacement is one of the Seagate 20GB units dated April 2001. It's one of my pile of old disks and it's running just fine. I did try to 'rescue' the previous disk contents but it was a hopeless cause. When attempting to fire up the machine, there was just a series of loud clicks, and although I could boot off a live USB, the bad disk didn't register in the BIOS or get detected by the live OS once booted. So no hope of recovering the BOINC tree.
So, I installed the OS on the replacement disk and installed all the updates, other utilities, graphics drivers, etc. that would be needed. I populated a template BOINC tree with all the current files needed and set up a template client_state.xml. I inserted the necessary details (host ID, rpc_seqno) in the state file to allow the new installation to be recognised as the one on the failed drive. I started BOINC with NNT (No New Tasks) set and all was normal. I forced an update and the scheduler proceeded to send me all the 'lost tasks' in batches of twelve. Five batches were needed to get them all - 59 tasks in total. The machine is back to normal with the usual crunch times and is responding normally to the control script. I'm very happy about that. Gotta love how easy it is to do this under Linux!! :-) ;-) :-).
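For anyone wanting to do the same thing, the seeding step is essentially just patching two tags in the template state file before the client's first contact with the scheduler. A rough illustration (the paths, numbers and URL below are placeholders, not my real values, and it assumes only one project is attached in the template):

# Placeholder values - adjust for your own installation.
BOINC_DIR=/var/lib/boinc
OLD_HOSTID=1234567        # host ID of the failed machine
OLD_RPC_SEQNO=4321        # last known rpc_seqno for that host

# Patch the identity tags in the template client_state.xml.
sed -i "s|<hostid>.*</hostid>|<hostid>${OLD_HOSTID}</hostid>|" \
    "${BOINC_DIR}/client_state.xml"
sed -i "s|<rpc_seqno>.*</rpc_seqno>|<rpc_seqno>${OLD_RPC_SEQNO}</rpc_seqno>|" \
    "${BOINC_DIR}/client_state.xml"

# Start the BOINC client as usual, set No New Tasks, then force an update
# so the scheduler resends the lost tasks.
boinccmd --project https://einsteinathome.org/ nomorework
boinccmd --project https://einsteinathome.org/ update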
Cheers,
Gary.
In business parlance, "sweating the assets". I just looked at an old disk "powered on 6.3 years" and thought - it's just a toddler!
Interesting how there was no warning... working perfectly, then "clunk". I guess the system logs are lost (you don't syslog to a central log server?)
I don't know if these old disks support SMART, but if you have a script you could add something from smartmontools that might give you predictive warnings about drive error rates. I use the HDD temperature utility hddtemp to get a temperature feed; this is more to get a case temperature.
Would be interesting to see some SMART data on these really long running drives.
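Something like the following is the sort of check I had in mind (the device path, the attributes picked and the use of hddtemp are just examples):

# Example SMART check that could be folded into the control script (run as root).
DISK=/dev/sda

# Overall health verdict (PASSED/FAILED).
smartctl -H "$DISK" | grep -i 'overall-health'

# Raw values of a few attributes that tend to precede failures.
smartctl -A "$DISK" | awk '
    $2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
        print $2, "raw =", $NF
    }'

# Drive temperature via hddtemp (separate package).
hddtemp "$DISK"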
You mention arriving onsite. Do you have machines in multiple locations that you monitor remotely? I presume a VPN or similar link to the location(s).
I was thinking of putting some machines in a remote location that has cheap rent, but then there are the hassles of getting to them when there is a failure that requires being there.
BOINC blog
NB: I started this reply a fair while ago but got caught up with more pressing stuff. I've just now found time to finish it. Sorry for the delay.
Syslogs are all kept on the individual machines. I've found the basic hardware to be quite reliable and have never felt the urge to pay much attention to these logs. Over the years my most pressing need was to identify hosts where BOINC/apps were putting enough stress on machines to cause misbehaviour. I feel I'm getting close to a very useful system now.
Each time the script visits a host, it checks basic stuff like response to pings, BOINC's status, etc., before compiling any performance stats. If the machine is not alive and well, it fails one of these preliminary checks. There would be disk reads going on during these checks - the utils used have to be loaded from somewhere - maybe they were cached?? The fact that the detected problem arose when looking for the 'sensors' utility using 'which sensors' - with the expected responses being '/usr/bin/sensors' or 'which: no sensors in ...' followed by the full $PATH - seems to indicate that the disk failed precisely when the various directories in $PATH were being read (the path is quite extensive :-) ).
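In essence, that preliminary check boils down to something like this (a simplified reconstruction for illustration, not the actual function):

# Simplified reconstruction of the 'is sensors installed?' check.
host=192.168.0.90
reply=$(timeout 15 ssh -o BatchMode=yes "$host" 'which sensors' 2>&1)

case "$reply" in
    /usr/bin/sensors)
        installed=yes ;;
    "which: no sensors in "*)
        installed=no ;;
    *)
        # Anything else (like the xauth error above) is unexpected:
        # skip the temperature reading and log a warning instead.
        echo "When testing if 'sensors' is installed, got -> $reply" >&2
        installed=unknown ;;
esac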
Later, when I arrived on the scene and hooked up the peripherals, moving the mouse brought up a normal-looking screen. The clock was ticking over every second and everything looked good. I tried to launch a konsole window and that failed. This was the only obvious indication of a problem. There were no undue disk noises. So, I tried to reboot through the menu - nothing. Eventually, I was forced into a hard reset, and during the POST the disk started making loud clicking noises when the BIOS was trying to detect it. I hooked up a live USB and it booted even though the HD was still clicking away. Once the live system was up, the clicking stopped, but when I tried to check disk partitions the disk itself couldn't be detected. I figured it was time to stop mucking around and install a replacement drive.
So, yes, I do believe the exact moment when something went 'clunk' was pretty much captured. Seems rather amazing that it was so sudden.
I guess whilst I'm in the 'adding functionality' mood, these would be pretty logical next steps. I've grown into the habit of being concerned about temperatures because I've seen a lot more fan failures than disk failures. I've been very lucky to have seen so few disk failures.
The ST3200-14A drives all came from a series of auction purchases in 2006/2007. I bought 120 machines, all the same, that had come out of the offices of a big corporation that was in the process of replacing all its computers. The old (2002/2003) machines were being offered in lots of 10 at a series of weekly auctions. I bought the first lot of 10 for $120 and, since they had Windows stickers on the cases, I loaded Windows on each one. They had Tualatin Celeron 1300 CPUs, which I found I could overclock to around 1550MHz on average. They could outperform the early P4s (1700MHz Willamette) and used less electricity.
As some of the other old-timers around here would tell you, I became quite a fan of the Tualatin P3 processor. The Celeron version was just as good for crunching as the Pentium version. These were the last really good Celerons. The Northwood Celeron was a real dog for crunching. The other good thing that happened was when I switched these machines to Linux in 2007/2008. They showed a significant performance improvement under Linux. I posted about it here and followed up with more results here. As a result of these findings, I bit the bullet and converted the vast bulk of my machines to Linux.
Each time I made a generational change in the farm, I retained the case, disk, CDROM (now long departed - I've used a live USB for nearly 3 years now) and PSU. For the Tualatin P3 -> Core 2 Quad transition, I did buy a big batch of 300W SFX 80+ PSUs (OEM SeaSonic) to replace the 175W SFX Deltas that powered the P3s. I've been using the Deltas to power the more recent Ivy Bridge and Haswell boxes driving many of my GPUs. I managed to pick up a batch of 55 of the 300W units for $13 each just before I started to upgrade to Q6600 and then Q8400 quads. I'm sorry I didn't buy more at the time :-). They've been efficient and reliable PSUs. They've run for 6 years and are now near EOL. They are starting to show swollen caps and I've been bringing them back to life by recapping them. This works well but it's a rather tedious task as some of the caps are quite hard to get at in the tight SFX format. I need to find another job lot - maybe 400W and 80+ gold - to replace them. I can use standard ATX rather than SFX. Should be a bit easier to find.
When I acquired all the boxes with these drives, I just started loading Windows on each one. I remember seeing occasional warnings from SMART but didn't have any problems installing and running BOINC. So I got in the habit of disabling SMART in the BIOS to stop being bothered by such messages. I figured that BOINC crashing or tasks not validating would be enough warning to look into possible disk issues. Since all hosts are dedicated to crunching under BOINC, I just tend to run disks until they break, since it's very easy to recover from a disk failure.
I had a look in the repo and smartmontools is there. I installed it on a machine with a ST3200-14A drive dated Feb 2003 and the utility runs fine. I re-enabled SMART with it and it produces the following output, most of which is not readily decipherable by me (yet) :-). I'll need to do some research about what it all means.
=== START OF INFORMATION SECTION ===
Model Family: Seagate UX
Device Model: ST320014A
Serial Number: 5JZCY6DV
Firmware Version: 3.07
User Capacity: 20,020,396,032 bytes [20.0 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-6 (minor revision not indicated)
Local Time is: Sat Feb 6 18:54:03 2016 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 420) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 20) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 094 083 025 Pre-fail Always - 182286684
3 Spin_Up_Time 0x0003 098 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 195
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 086 060 030 Pre-fail Always - 4741141744
9 Power_On_Hours 0x0032 072 072 000 Old_age Always - 24802
10 Spin_Retry_Count 0x0012 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 287
194 Temperature_Celsius 0x0022 038 052 000 Old_age Always - 38
195 Hardware_ECC_Recovered 0x001a 100 253 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I had produced some output more than two days ago but ran it again just now to get a fresh copy. Out of interest, I ran both outputs through diff to see what had changed in a couple of days. Here are the diffs - not much changed.
The machine with this disk has been running continuously so the time difference between the two sets is over 53 hours. The power_on_hours has only increased by 4. Does that mean the OS is putting the drive into a state where it's not being counted as on? I'm pretty sure the drives are always spinning. I reckon I'd notice if they only spun for 4 hours out of 53 :-). All my cases are open and I often check disk temperature by hand and I always feel (and hear) the vibrations. 24802 hours is 2.83 years. That drive has been in a powered on system for a lot longer than 2.83 years :-).
Maybe the 2.83 years was the accumulated time to the point where I disabled SMART in the BIOS?? I really need to do some research about this. I really know nothing about how it works at this point.
Cheers,
Gary.
I have a few machines at home but the bulk of the fleet is in a specially designed 'crunching lab' in a commercial building about 20 mins from home. I have never gone to the trouble of setting up any sort of link between the two locations. They have quite separate internet connections. The premises were constructed almost 4 years ago and the bulk of the space is rented out to two separate businesses. My lab is a small third tenancy. My Super fund owns the building so I'm quite happy to pay the rent :-).
Because of the number of hosts I have, I needed to do something like this. One reason I don't remote monitor from home is that (knowing me), if I sensed a problem I'd probably not be able to resist going to 'sort it out', then and there. When I go home, I want to switch off and do other things and I don't worry about what the farm is up to. Apart from occasional hardware failures, the farm runs itself most of the time and if something has crashed overnight, there will be a nice easy to digest report about it for me to look into in the morning.
Cheers,
Gary.
Thanks Gary - interesting. I suspect the mobo BIOS setting just turns on red warning LEDs etc. rather than turning off the SMART subsystem.
If the drive is being turned off using something like APM then
# hdparm -I /dev/sda
should show it, I would think.
It also has a power-saving mode capability according to smartmontools (it supports a sleep mode according to Seagate).
More likely, the SMART stats are a little misleading. The wiki has a theme - use the manufacturer's interpretations as they all differ! I noticed a comment on there about old Power_On_Hours values wrapping to zero.
I'm 90% certain most drives store the SMART logs/data on the disk itself (tracks -1, -2, etc.), so running smartctl on that dead drive will probably just give a single data bit - clunk.
Hi,
I try to keep my CPU running under 60C.
Isn't 70C too hot?
I have cats - so I get cat-hair clog problems.
And I can tell when I really need to get out the vacuum cleaner
by looking at the temps.
Thanks for the script info.
Jay