Need help with compute errors

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7273741725
RAC: 1851247

RE: I stopped running my

Message 89859 in response to message 89858

Quote:

I stopped running my AMD's at Einstein, but let my Q6600 keep on chuggin. I just checked back in on these boards and found we're running a new S5 app. (64b linux). I checked on my results and found that 1 out of 3 (roughly) of the S5 results returned by this machine have computation errors, but all the following pages of S4 wus have no errors.

I know, the new S5 must be more efficient, creating more heat and thereby causing the errors, and turning back the OC might fix it, and that's what I'm going to try, but I found it interesting anyway.

tony

The new app does not need to run hotter to fail at a different clock speed; it just needs to be different.

And this effect can be different not only for different processor architectures, but also for individual samples of a particular processor design.

I'm a retired microprocessor design engineer, by the way.

mikey
Joined: 22 Jan 05
Posts: 12780
Credit: 1868831936
RAC: 1864998

RE: @Bikeman - My problems

Message 89860 in response to message 89853

Quote:

@Bikeman - My problems at Rosetta are unlike and unrelated to those here at E@H. Too complicated to explain here but they are not compute errors per se. My Dell computer is locked and cannot be overclocked. Hard disk is a few years old and could be fading, though I don't know how I would recognize that short of a disk crash.

@Gundolf - Not blaming Boinc for anything, simply stating that if I can't successfully run E@H and Rosetta without grief then I will jettison Boinc which is the substrate on which E@H and Rosetta are run.

@Jord - I had compute errors back in December and rebooting seemed to get things back on track. On Jan 7th windows reported a "dirty" disk and automatically ran chkdsk on restart. There was a hung index entry in the file structure. Ran chkdsk a second time and no additional errors were reported. But I've encountered E@H task compute errors since Jan 7th; yet another chkdsk did not indicate anything wrong with disk, no bad sectors etc.

@Dagorath - computer not overclocked, resides in unheated room, gets cleaned twice yearly. Never find dirt when I do clean it.

I've been using this computer on EAH tasks for 3 years now so either my computer is crapping out or something is wrong with the more recent apps and data files. Explain why I find this error "WARNING: Fixing yLower (-1181 -> 0) [HoughMap.c 771]" in many of the faulty tasks, or why the app couldn't parse the skygrid, or why the access violations -- some of this after detaching and re-attaching with all new files. The last errors encountered were not the same data pack I had previously; encountered similar errors with both packs. Not all tasks on a given data pack were faulty. I'm not capable of debugging this myself so I have no recourse except to abandon ship. I'm not one to settle for a huge error rate. Maybe try again someday when I get a newer computer. This is disheartening, when the appeal fades it tends to fade forever.

One other thing to try is upgrading to the latest version of a program to see whether the errors persist. In this case, upgrade Boinc to the latest released version, which I think is 6.4.5. That rules out an outdated version as the source of your problems. Unfortunately, troubleshooting can take a long time and still turn up nothing, even once everything is working correctly again.

You could also have a memory chip going bad. Unlikely, but possible. MS has a free memory tester you can download: http://www.softpedia.com/get/Tweak/Memory-Tweak/Microsoft-Windows-Memory-Diagnostic.shtml

As for a hard drive tester, they are manufacturer specific, although the old IBM one seemed to work on any drive.

Es99
Joined: 9 Sep 05
Posts: 763
Credit: 394750
RAC: 0

Could someone take a look at

Could someone take a look at my last few results and tell me why I keep getting errors?

Physics is for gurls!

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 4

RE: Could someone take a

Message 89862 in response to message 89861

Quote:
Could someone take a look at my last few results and tell me why I keep getting errors?

too many normally harmless exit(s)

Are you using the CPU throttle option in BOINC? If you are, put it back to 100% (no throttle).

Else, see this FAQ for many other options.

Es99
Joined: 9 Sep 05
Posts: 763
Credit: 394750
RAC: 0

RE: Are you using the CPU

Message 89863 in response to message 89862

Quote:

Are you using the CPU throttle option in BOINC? If you are, put it back to 100% (no throttle).


The waahh?

Physics is for gurls!

Dagorath
Joined: 22 Apr 06
Posts: 146
Credit: 226423
RAC: 0

RE: RE: Are you using

Message 89864 in response to message 89863

Quote:
Quote:

Are you using the CPU throttle option in BOINC? If you are, put it back to 100% (no throttle).

The waahh?

In your local preferences, on the Processor Usage tab, there is a setting titled "Use at most __ % CPU time". That's the throttle setting. Make sure it says 100%.
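If you would rather check the setting in a file than in the dialog, here is a rough sketch. It assumes the local preferences are stored as XML with a `cpu_usage_limit` element holding the percentage; I believe that is what BOINC writes to `global_prefs_override.xml`, but treat both the file name and the element name as assumptions and verify them against your own install. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def cpu_throttle_percent(prefs_xml: str, default: float = 100.0) -> float:
    """Return the CPU time limit (percent) found in a preferences document.

    Assumption: the limit lives in an element named <cpu_usage_limit>.
    If the element is missing or empty, fall back to `default` (no throttle).
    """
    root = ET.fromstring(prefs_xml)
    node = root.find(".//cpu_usage_limit")
    if node is None or not (node.text or "").strip():
        return default
    return float(node.text)


def is_throttled(prefs_xml: str) -> bool:
    """True when the configured limit is below 100%."""
    return cpu_throttle_percent(prefs_xml) < 100.0


# Hypothetical sample contents, not copied from a real machine.
sample = """<global_preferences>
  <cpu_usage_limit>60.000000</cpu_usage_limit>
</global_preferences>"""

if __name__ == "__main__":
    print("throttled" if is_throttled(sample) else "not throttled")
```

To use it on a real machine, read the preferences file into a string and pass that to `is_throttled` in place of `sample`.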

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289204
RAC: 0

As the original poster I

As the original poster I would like to get back to my issue with compute errors.

I ran memtest86 without any reported memory errors. I ran two threads of prime95 for 24 hours testing first the cpu/fpu and again testing memory, without any reported errors.

And last, I have run chkdsk 3 times without evidence of any physical disk problems. I have also checked the "smartdrive" data against an archive database of 2300 other hard disks like mine, with these results:
[pre]
Attribute                            Current  Raw         Overall
Raw Read Error Rate                  65       90641131    Good
Spin Up Time                         98       0           Good
Start/Stop Count                     100      0           Very good
Reallocated Sector Count             100      0           Very good
Seek Error Rate                      85       365244101   Very good
Power On Hours Count                 63       32810       Watch
Spin Retry Count                     100      0           Very good
Power Cycle Count                    100      690         Very good
Hardware ECC Recovered               65       90641131    Good
Current Pending Sector               100      0           Very good
Offline Uncorrectable Sector Count   100      0           Very good
Ultra DMA CRC Error Rate             200      0           Very good
Write Error Rate                     100      0           Very good
TA Increase Count                    100      0           Very good

Warning: Power On Hours Count is below the average limits (84-100).
This is due to my computer running 24/7 for 3 years doing EaH.
The other 2300 computers were not powered full time.

Average disk temp for the archive was 38C while mine was 37C.
Concluded Overall fitness/performance of 97%
[/pre]
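The listing above can also be checked mechanically rather than by eye. This is a minimal sketch, assuming every row has the form "attribute name, current value, raw value, rating" and that the only ratings are the three that appear in this report (Very good, Good, Watch). The function name and the sample rows are just for illustration; this is not how the disk tool itself works.

```python
import re

# One row: name, current value, raw value, rating (assumed layout).
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<current>\d+)\s+(?P<raw>\d+)\s+"
    r"(?P<overall>Very good|Good|Watch)$"
)


def attributes_to_watch(report: str) -> list:
    """Return the names of attributes whose rating is 'Watch'."""
    flagged = []
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("overall") == "Watch":
            flagged.append(match.group("name"))
    return flagged


# A few rows taken from the report in this post.
report = """Attribute Current Raw Overall
Raw Read Error Rate 65 90641131 Good
Reallocated Sector Count 100 0 Very good
Power On Hours Count 63 32810 Watch
Ultra DMA CRC Error Rate 200 0 Very good"""

if __name__ == "__main__":
    print(attributes_to_watch(report))
```

Lines that do not fit the assumed layout, such as the header row and the free-text notes, are simply skipped.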
I cannot find any evidence that my computer is presently failing in any way, including in non-boinc applications. The output of the failed EaH tasks, however, (IMO) points rationally to EaH files or software rather than to my particular system (so I conclude). I will note that both in December and January, around the time of the compute errors, Norton AV removed a mass-mailing virus called W32.Mytob@mm. That may or may not be strictly coincidental, and I'm not going to make a connection between the virus and the EaH task failures.

I will try again to run tasks with S5R5 and hope for better results.

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35825044
RAC: 0

go in to your antivirus app

Message 89866 in response to message 89865

go in to your antivirus app and disable scanning of the boinc data directory. in a recent upgrade they moved this directory; check the first few lines in the "messages" tab, where you should see something like..

1/20/2009 4:56:01 PM||Data directory: C:\Documents and Settings\All Users\Application Data\BOINC

reason: some av apps "lock" files when they scan them. if a file is locked then boinc can't write to it. you may be surprised just how much disk activity is going on that the blinking red disk light doesn't show. i used a program called filemon from sysinternals to see what was going on on my disk. run it and see for yourself.
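if you have the messages saved as text, the directory can be pulled out with a few lines of script. a small sketch, assuming the line always contains the literal text "Data directory:" followed by the path, as in the example above; the function name is made up for illustration.

```python
import re
from typing import Optional

# Assumed format: anything, then "Data directory:", then the path to end of line.
DATA_DIR = re.compile(r"Data directory:\s*(?P<path>.+?)\s*$")


def data_directory(messages: str) -> Optional[str]:
    """Return the path from the first 'Data directory:' line, or None if absent."""
    for line in messages.splitlines():
        match = DATA_DIR.search(line)
        if match:
            return match.group("path")
    return None


# The example line quoted in this post.
log = r"1/20/2009 4:56:01 PM||Data directory: C:\Documents and Settings\All Users\Application Data\BOINC"

if __name__ == "__main__":
    print(data_directory(log))
```

the path it returns is the folder to add to the antivirus exclusion list.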

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289204
RAC: 0

RE: go in to your antivirus

Message 89867 in response to message 89866

Quote:

go in to your antivirus app and disable scanning of the boinc data directory. in a recent upgrade they moved this directory; check the first few lines in the "messages" tab, where you should see something like..

1/20/2009 4:56:01 PM||Data directory: C:\Documents and Settings\All Users\Application Data\BOINC

reason: some av apps "lock" files when they scan them. if a file is locked then boinc can't write to it. you may be surprised just how much disk activity is going on that the blinking red disk light doesn't show. i used a program called filemon from sysinternals to see what was going on on my disk. run it and see for yourself.


Ah, I was waiting for someone to jump on this bandwagon and correctly so. But alas, my boinc directory and subfolders are already excluded from AV scans. I have tested Norton AV with Boinc excluded and included in scans and over the past 3 years I detected no difference in the operation or stability of Boinc. Excluding even Boinc from a scan makes me nervous though since this is 99% of my internet traffic.

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: Excluding even Boinc

Message 89868 in response to message 89867

Quote:
Excluding even Boinc from a scan makes me nervous though since this is 99% of my internet traffic.


Nobody tells you to stop scanning your Internet traffic, just the directory ;-)

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)
