Quote:
I'd consider it rather normal for a computer to run for a certain time w/o problems and then suddenly all kinds of weird things happen ....
Thanks very much for taking the trouble to respond in detail like this.
I'm pretty much aware of hardware issues in relation to overclocking and I'm maintaining in excess of 100 machines, many of which are overclocked and many of which do have issues from time to time. The major issue is not really the overclocking but more so things like fan failures, PSUs drifting out of spec or failing completely, faulty caps on motherboards, flakey RAM, bad sectors on HDs, etc, which all seem to have pretty much the same rate of failure even when the machine isn't overclocked. I repair motherboards with swollen caps. I have several examples where the failure occurred a couple of years ago and, after replacing the caps, the boards have run overclocked ever since without further incident.
Of course, maintaining and extending the life of old hardware can only ever be justified when one is feeding one's addiction ... errr ... following one's hobby :), and certainly is not economically viable in a production environment.
You might be interested in a followup of the other message where I was reporting the string of signal 11 failures with 4.27 on the machine whose onboard network seemed to have suddenly developed issues with the hub to which it was attached. That machine has now returned 3 successful results whilst attached to the different hub. This was the only change I made.
You might also recall that I wasn't able to boot that machine fully to the normal KDE desktop. It kept dropping to a console login screen. On further investigation, the reason was that even though / was mounted rw, no temp files could be written to /tmp and so X couldn't start. Even as root, I couldn't touch a file anywhere on /. mtab said that / was indeed mounted rw so I tried a "mount -o remount /" to see what would happen. Part of the message I got back said that the media was write protected.
So why would a HD partition suddenly think it's write protected? It's not as if there is a little slider like on a floppy??? There are two Linux partitions (also two Windows partitions) on the disk, one mounted on / and the other on /home. /home is fine. Any clues on how I can convince the other partition that it really isn't write protected?
In the meantime, BOINC (which lives on /home) is doing just fine but I imagine I'll probably want to fix a write protected /dev/hda6 at some point :).
So in summary, in this particular case, the string of signal 11s seems to be the result of network failure when connected to the old hub. The machine is running an older version of BOINC. Could that have anything to do with signal 11s in version 4.27?
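One thing that may be worth checking here: mtab is an ordinary file on /, so if / has dropped to read-only it cannot be updated and may keep saying rw, whereas /proc/mounts is the kernel's own view. Below is a minimal sketch of comparing against it, assuming the usual whitespace-separated /proc/mounts layout. The sample text, including the /dev/hda7 device for /home, is made up for illustration, and nothing in this thread confirms that this is the actual cause.

```python
def mount_options(mounts_text):
    """Map each mount point to its list of options.

    Later lines overwrite earlier ones, since a later mount on the
    same mount point shadows the earlier one.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, _fstype, options = fields[:4]
        result[mount_point] = options.split(",")
    return result


def is_read_only(mounts_text, mount_point):
    """True if the mount point is listed with the 'ro' option."""
    return "ro" in mount_options(mounts_text).get(mount_point, [])


# Hypothetical sample, NOT taken from the machine discussed above.
SAMPLE = """\
rootfs / rootfs rw 0 0
/dev/hda6 / ext3 ro 0 0
/dev/hda7 /home ext3 rw 0 0
"""

print("/ read-only:", is_read_only(SAMPLE, "/"))
print("/home read-only:", is_read_only(SAMPLE, "/home"))
```

On a live box the same functions could be fed `open("/proc/mounts").read()` instead of the sample text.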
Quote:
So why would a HD partition suddenly think it's write protected? It's not as if there is a little slider like on a floppy??? There are two Linux partitions (also two Windows partitions) on the disk, one mounted on / and the other on /home. /home is fine. Any clues on how I can convince the other partition that it really isn't write protected?
FWIW, and this is very phenomenological - I've found MS Windows ( nearly all varieties ) to be a real fiddler with the MBR on disks - such as to produce difficulties with other OS partitions. It would seem that MS Win can't confine its errors to its neck of the woods, and/or assumes it's the only OS on the disk! An examination via say Partition Magic ( using the boot floppies ), if you have it, is often very helpful here ....
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
... I've found MS Windows ( nearly all varieties ) to be a real fiddler with the MBR on disks ....
Amen to that brother .... :).
I have a strict policy I always follow when I'm setting up a newly acquired box - as this one was a few months ago when I picked it up, with a bunch of identical ones, at auction.
* Boot from an Ultimate Boot CD for Windows and do a surface scan of the HDD
* If the HD is good then fine but if there are very localised bad sectors (possibly from a head crash) note their location for mapping around purposes if possible during partitioning. I probably have about 15 machines with disks like this, some up to 18 months old with no further issues.
* If I'm planning to install Linux, first install XP SP2 to a minimal NTFS partition (or two), leaving the balance of the disk vacant.
* Boot the PCLinuxOS 2007 live CD and start the HD install procedure.
* It will find the NTFS partition(s) and will create fstab entries for them so that they can be mounted from Linux.
* Allow the install to carve up the unpartitioned space into suitable root, swap and /home partitions and then complete the install including grub for dual booting.
* Reboot Linux from the HD, dump my BOINC template in /home/gary/BOINC and fire up BOINC - basically set and forget. The template contains everything that's needed to control BOINC as a daemon with copies of account files, project executables, skygrids etc, so that as soon as BOINC fires up it grabs itself a hostid and starts downloading whatever data files are assigned to it.
So Windows always gets installed first and doesn't usually get rebooted. I do have a couple of machines that do dual boot from time to time but not this one. I'm pretty sure it wouldn't have run Windows since BOINC was set up under Linux, so I don't think I can blame Windows for scrambling something. If I get sufficiently annoyed I'll probably store the BOINC folder somewhere and reinstall Linux. It would be nice however to actually find the cause of the problem.
Quote:"The task names end in "S5R3b", and the data files in "S5R3". The Delay Bound ("deadline") has been increased to 18 days."
My Opteron box has crunched one of those WUs in 60676,52 s using 4.27.
Tullio
Also, thanks Bernd/Bruce/et al...for increasing the deadlines slightly. I think it will make things go a little smoother until the official Win/Linux apps have SSE (and even x87?) support... I promise to not even reply to threads about Einstein "hogging my CPU"... ;-) If people don't understand the purpose of High Priority / EDF, I'll leave that to someone else to try to explain to them...
Quote:
Can't say I have noticed much in the way of a speed increase.
That's because you are not allowing for the cyclic nature of crunch times and haven't noticed the particular significance of the sequence numbers you have been assigned.
Quote:
My first did go under 30,000 seconds for the first time
Because it has a sequence number right near the trough in the cycle.
Quote:
but the next two are back the same as before over 41,000 seconds.
Actually the next three if you check again. They have sequence numbers of 224, 225, 234. Your frequency is 745.20 which means that the period of the cycle is 114.4. There is a cycle peak therefore at 229 so you can see that by pure bad luck your three 41K results bracket the peak very closely. Each peak of the cycle represents the slowest possible crunch time for your computer.
Quote:
So after 3 results it is no faster on an AMD Opteron 285 running Fedora Core 3.
You can't really say that unless you do a proper analysis. Go find Mike Hewson's marvelous Ready Reckoner if you want an easy way to do the analysis. Try reading this thread for some background and then look for Mike's posts.
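To make the arithmetic above concrete, here is a rough sketch of where the three sequence numbers sit relative to the peak. It assumes that peaks fall at whole multiples of the period (which matches the peak quoted at about 229 for a period of 114.4); the function names are just for illustration and this is not the Ready Reckoner.

```python
PERIOD = 114.4  # cycle period quoted above for frequency 745.20


def nearest_peak(seq, period=PERIOD):
    """Closest cycle peak, assuming peaks at integer multiples of the period."""
    return round(seq / period) * period


def phase_offset(seq, period=PERIOD):
    """Signed distance from the nearest peak, as a fraction of one period."""
    return (seq - nearest_peak(seq, period)) / period


for seq in (224, 225, 234):
    print(f"seq {seq}: nearest peak {nearest_peak(seq):.1f}, "
          f"offset {phase_offset(seq):+.1%} of a period")
```

All three land within about 5% of a period from the peak, which is why they all come in near the slowest crunch time.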
OK Gary, I follow this,
I have processed many more work units and have noticed a speed improvement, but no idea of percentage. Work units in the trough are now getting 27,000 to 30,000 seconds where I was not dropping below 31,000 seconds before on 4.14.
I have not done any analysis as per Mike's guide as yet.
This 4.27 application is now on my other two Linux machines as well.
Just an aside to this improved application: WU 91935877 ran for 37,913.32 seconds, which I expected from this WU as it starts to scale up toward the peak from out of the trough.
But WU 91937267, which is the next WU in the frequency run and therefore should take virtually the same amount of time, has in fact taken 62,138.42 seconds.
Can someone explain this please?
Host is an AMD Opteron 285 @ 2.6 GHz; nothing changed from one result to the other, no reboots, no updates, no changes.
Quote:
Just an aside to this improved application is WU 91935877 which ran for 37,913.32 seconds, this I expected from this WU as it starts to scale up toward the Peak from out of the trough.
But WU 91937267, which is the next WU in the frequency run and therefore should take virtually the same amount of time, has in fact taken 62,138.42 seconds.
Can someone explain this please?
There are now three close sequence numbers, the third having a runtime of 39,038 which is in line with the original 37,913. The 62,138 should really be around 38,500 to fit the sequence properly.
If you examine the stderr.out for all three tasks you can notice several things:-
* None of the three were stopped and restarted at any stage so all three must have remained in memory when the CPU was switched to other projects.
* All three have wall clock elapsed times considerably in excess of recorded CPU times (101,880, 93,600, 72,000 respectively) indicating that other projects were sharing the CPUs from time to time.
* All three have quite similar numbers of skypoints per checkpoint - look at each line in stderr.out and see how many numbers there are before a 'c'.
* On this basis it is hard to say that one could have had almost double the CPU time. You should see only about half the number of skypoints per checkpoint if that were true.
* One conclusion might be a possible hiccup or bug in the measuring of CPU time for the anomalous task. The wall clock time for that task (93600) is not out of line with the average for the other two "normal" results.
I'd consider upgrading BOINC to the latest recommended version.
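For anyone who wants to see the comparison laid out, here is a small sketch using only the figures quoted in this thread. The pairing of wall clock times to tasks assumes they are listed in sequence order, and the labels are made up for readability.

```python
# (label, CPU seconds, wall clock seconds) as quoted in this thread
tasks = [
    ("first", 37913.32, 101880),
    ("anomalous", 62138.42, 93600),
    ("third", 39038.0, 72000),
]

# Fraction of elapsed time each task actually spent on the CPU
for label, cpu, wall in tasks:
    print(f"{label:>9}: CPU/wall = {cpu / wall:.2f}")

# Compare the anomalous task against the average of the two normal ones
normal = [(cpu, wall) for label, cpu, wall in tasks if label != "anomalous"]
avg_normal_cpu = sum(cpu for cpu, _ in normal) / len(normal)
avg_normal_wall = sum(wall for _, wall in normal) / len(normal)

_, anomalous_cpu, anomalous_wall = tasks[1]
print(f"CPU time vs normal average:  {anomalous_cpu / avg_normal_cpu:.2f}x")
print(f"wall time vs normal average: {anomalous_wall / avg_normal_wall:.2f}x")
```

The recorded CPU time is roughly 1.6 times the normal average while the wall clock time is within about 8% of it, which is the mismatch described in the list above.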
First and only error with 4.27 so far:
2008-02-04 17:47:25 [Einstein@Home] Reason: Unrecoverable error for result h1_0763.10_S5R2__167_S5R3a_0 (process exited with code 38 (0x26, -218))
2008-02-04 17:47:25 [Einstein@Home] Computation for task h1_0763.10_S5R2__167_S5R3a_0 finished
2008-02-04 17:47:25 [Einstein@Home] Output file h1_0763.10_S5R2__167_S5R3a_0_0 for task h1_0763.10_S5R2__167_S5R3a_0 absent
cu,
Michael
RE: RE: "The task names
Excellent! I hope there will be some more so that we can check/recalibrate the "cycle model" with those new units.
CU
H-B
The long running task had a "No heartbeat from core client for 30 sec - exiting", tho, meaning that the science app shut down gracefully and restarted later because it didn't hear the regular "heartbeat" from the BOINC core client.
This can happen when there's excessive load on the system, or when an abrupt change in the system's real-time clock suggests to the app that more than 30 seconds have passed (even though this is not the case).
Anyway, I wonder whether this might be a contributing factor??
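To illustrate the clock-jump case, here is a toy watchdog. It is purely a hypothetical sketch and not BOINC's actual code; only the 30 second limit comes from the message above, and the class and variable names are invented. Timestamps are passed in by hand so the effect of a clock step is easy to see.

```python
class HeartbeatWatchdog:
    """Toy timeout check: flags a missing heartbeat after `timeout` seconds."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.last_beat = None

    def beat(self, now):
        """Record that a heartbeat arrived at time `now`."""
        self.last_beat = now

    def expired(self, now):
        """True if more than `timeout` seconds appear to have passed."""
        return self.last_beat is not None and (now - self.last_beat) > self.timeout


wd = HeartbeatWatchdog()
wd.beat(now=1000.0)

# One real second later, clock untouched: no timeout.
normal = wd.expired(now=1001.0)

# Same real moment, but the clock was stepped forward by 60 s:
# the watchdog believes 61 s have passed and gives up.
jumped = wd.expired(now=1001.0 + 60.0)

print("normal check expired:", normal)
print("after clock jump expired:", jumped)
```

In Python, a check like this would normally read `time.monotonic()`, which is not affected by adjustments to the system clock.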
CU
Bikeman