Yet another "output file absent" problem

don
Joined: 25 Aug 08
Posts: 4
Credit: 348750219
RAC: 308315
Topic 194224

I've been watching the similar threads... none seem to be like this. BOINC starts the job OK. About 7 minutes in it creates a "cpt" (checkpoint) file - OK. Anywhere from 30 min to 9 hrs later I get an "output file nnn_0 for nnn absent" (nnn being the current, now dead, job). SETI runs just fine. It happens at any time of the day. I don't suspect a conflict with the anti-virus (Comodo) or Windows Defender, as both skip the whole BOINC sub-directory. Active-X is version 10. There are no errors, other than the above, in the stderrdae, stdoutgui, and stdoutdae except for SETI comm fails, when they compress.

I forgot to mention: when it runs OK, it runs for days/weeks, until it doesn't! Then it won't complete, no matter what I do, for days/weeks until it starts again!! AAARGH! I've reset the project and tested the memory. It appears to write to the disk just fine. I don't have a temperature problem. For 23 hours a day it's the only thing running, and it usually croaks when I'm not on. Is there possibly a data-pattern-specific bug in the code?

I'm running:
OS Name Microsoft Windows XP Professional
Version 5.1.2600 Service Pack 3 Build 2600
System Manufacturer P4i6G
System Model P4i65G
System Type X86-based PC (Single processor, 2 stacks)
Processor x86 Family 15 Model 3 Stepping 3 GenuineIntel ~2999 Mhz

BOINC has been 3 different versions, now 6.4.7. Screensaver is turned off.

I have snapshots of the std.. and cpt files and directory listings of the slot if required.
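For anyone wanting to check whether the failures really are spread across the day, here is a minimal sketch that tallies "output file ... absent" messages by hour from saved log text. The message wording is taken from the description above; the optional word "task" and the HH:MM:SS timestamp on each line are assumptions about the log format, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape, based on the wording quoted in this thread:
#   "... output file <name>_0 for <name> absent"
# The optional word "task" and the HH:MM:SS timestamp are assumptions.
ABSENT_RE = re.compile(r"output file (\S+) for (?:task )?(\S+) absent", re.IGNORECASE)
TIME_RE = re.compile(r"\b(\d{1,2}):\d{2}:\d{2}\b")


def find_absent_outputs(lines):
    """Return (hour_or_None, task_name) for each 'output file ... absent' line."""
    hits = []
    for line in lines:
        match = ABSENT_RE.search(line)
        if not match:
            continue
        time_match = TIME_RE.search(line)
        hour = int(time_match.group(1)) if time_match else None
        hits.append((hour, match.group(2)))
    return hits


def failures_by_hour(hits):
    """Count failures per hour of day, skipping lines without a timestamp."""
    return Counter(hour for hour, _ in hits if hour is not None)


if __name__ == "__main__":
    # Made-up sample lines; replace with the contents of a saved log file.
    sample = [
        "25-Aug-2008 03:15:02 [Einstein@Home] Output file abc_0 for task abc absent",
        "25-Aug-2008 03:20:00 [SETI@home] Computation finished",
    ]
    for hour, count in sorted(failures_by_hour(find_absent_outputs(sample)).items()):
        print(f"{hour:02d}:00  {count} failure(s)")
```

If the counts cluster around particular hours, that might point at something scheduled on the machine; an even spread is more consistent with a random hardware fault.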

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 119076764577
RAC: 24073178

Quote:
I've been watching the similar threads... none seem to be like this. BOINC starts the job OK. About 7 minutes in it creates a "cpt" (checkpoint) file - OK. Anywhere from 30 min to 9 hrs later I get an "output file nnn_0 for nnn absent". (nnn being the current, now dead, job) SETI runs just fine ....


I run about 170 machines and I've seen numerous examples of problems very similar to this, over quite a considerable time. The science app is highly optimised and is particularly sensitive to hardware instabilities, so it's not surprising that it affects E@H and not Seti in your case. I have a number of machines that run both projects and, in my case, it's always the E@H task that fails rather than Seti. Seti still seems to be able to run on quite dodgy hardware.

Here is a list of causes of this type of problem that I've identified, roughly in order of frequency:-

    * Excessive heat (I have to run some machines in a high ambient environment)
    * CPU fan not running at full speed (dry bearings)
    * Flaky motherboard (check for swollen capacitors)
    * Flaky PSU (check for swollen capacitors)
    * Unstable overclock
    * Flaky RAM
    * Other random hardware issue

Perhaps, in much earlier days, some of these failures were due to software bugs. That may even still be possible but whenever I now see the problem, I can always find a hardware issue to explain it. The problem invariably disappears when I fix the hardware. As an example I have done about 20 separate motherboard repairs (replacing one or more obviously swollen caps) and in all cases the machines were put back into successful production. As I run a lot of 2001 - 2004 vintage machines, the swollen caps issue is not at all surprising.
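As a rough illustration of why unstable hardware shows up in a heavily optimised numerical app, here is a minimal sketch of a repeatability check: it runs the same fixed floating-point workload several times and reports any run whose result differs from the first. This is only a sketch under stated assumptions. The function names and workload are hypothetical, it loads the CPU far less than the science app or a dedicated stress tester would, and a clean result does not prove the hardware is sound.

```python
import hashlib
import math


def workload_digest(n=200_000):
    """Run a fixed floating-point workload and return a SHA-256 digest of the result."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += math.sin(i) * math.sqrt(i)
    # The same inputs should give a bit-identical result on stable hardware.
    return hashlib.sha256(repr(acc).encode("ascii")).hexdigest()


def count_mismatches(rounds=20, n=200_000):
    """Repeat the workload and count runs whose digest differs from the first run."""
    reference = workload_digest(n)
    return sum(1 for _ in range(rounds - 1) if workload_digest(n) != reference)


if __name__ == "__main__":
    bad = count_mismatches()
    if bad:
        print(f"{bad} run(s) gave a different result: suspect the hardware")
    else:
        print("All runs agreed (this does not rule out a hardware fault)")
```

Any mismatch at all means identical calculations produced different answers, which is the kind of fault that would also corrupt a long-running task.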

Cheers,
Gary.

don
Joined: 25 Aug 08
Posts: 4
Credit: 348750219
RAC: 308315

Message 90830 in response to message 90829

Thanks Gary

This motherboard is actually a replacement for one I had that had the capacitor problem. I guess I'll have to have a closer look at the capacitors and the fans. Maybe I'll just change the capacitors out on spec. I have found cases where they don't look bad but are. Won't do any harm.

Thanks for the quick reply.
