Validate error - What this really means!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2944527614
RAC: 693846

Agreed, all GTX 6xx and 7xx cards are reported with an unknown speed and number of cores. The cards were manufactured after this app was built, and the app can't predict the future.

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 8

Claggy and *Richard,

You beat me to it!

I was just looking around my own NVIDIA hosts and saw that my 660 Ti reports the same missing information but does not produce errors (bad WU aside), and I realised my error :) Apologies to John Jamulla for my useless input!

But John, are you able to test the 770 in another machine?

*edit.

John Jamulla
Joined: 26 Feb 05
Posts: 32
Credit: 1163190011
RAC: 483386

Hi - just getting back to this and confirming again that all my CUDA tasks are invalid.

I haven't tried it in another machine yet.

Should that be my next step?
I'm always nervous about upgrading the NVIDIA driver, but I could also try that.

John Jamulla
Joined: 26 Feb 05
Posts: 32
Credit: 1163190011
RAC: 483386

Hi - just getting back to this and confirming again that all my CUDA tasks are invalid on this GTX 770. I don't understand why they "complete" without errors, though.

I haven't tried it in another machine yet.
Should that be my next step?

If it fails there too, what do I do? Send it back under warranty or something?

I'm always nervous about upgrading the NVIDIA driver, but I could also try that.

John Jamulla
Joined: 26 Feb 05
Posts: 32
Credit: 1163190011
RAC: 483386

One thing I noticed between the one good CUDA task and a bad one:
There's some weirdness in the reported amount of global memory used. The good one shows 2 MB, the "bad" one shows some huge, "weird" amount.

Any idea what this tells me? Is there a bug in the driver, or maybe in the CUDA code?
I am currently running 5 tasks at once on this GPU. Maybe I should try going back to 1 and see what happens?

Any ideas are appreciated. Maybe there's just a memory problem on the board?

good one: http://einsteinathome.org/task/460356043
...
[18:56:07][5700][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 1146 MB (903 MB free / 2049 MB total) -> Used by this application (assuming a single GPU task): 2 MB
[18:56:07][5700][INFO ] Using CUDA device #0 "GeForce GTX 770" (0 CUDA cores / 0.00 GFLOPS)
[18:56:07][5700][INFO ] Version of installed CUDA driver: 6000
[18:56:07][5700][INFO ] Version of CUDA driver API used: 3020
...

bad one: http://einsteinathome.org/task/460356104
...
[20:53:18][5536][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 898 MB (1151 MB free / 2049 MB total) -> Used by this application (assuming a single GPU task): 4294967048 MB
[20:53:18][5536][INFO ] Using CUDA device #0 "GeForce GTX 770" (0 CUDA cores / 0.00 GFLOPS)
[20:53:18][5536][INFO ] Version of installed CUDA driver: 6000
[20:53:18][5536][INFO ] Version of CUDA driver API used: 3020
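
A note on the huge figure in the bad task's log: 4294967048 is exactly 2^32 - 248, which is what an unsigned 32-bit integer shows when it is asked to hold -248. The sketch below only checks that arithmetic in Python. It is not the Einstein@Home application's code, and the idea that the app keeps this value in an unsigned 32-bit integer is an assumption based on the number alone. The helper names are made up for illustration.

```python
def as_uint32(value: int) -> int:
    """Wrap a Python int the way an unsigned 32-bit integer would store it."""
    return value & 0xFFFFFFFF


def as_int32(raw: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit value."""
    raw &= 0xFFFFFFFF
    return raw - 2**32 if raw >= 2**31 else raw


reported_mb = 4294967048  # "Used by this application" from the bad task's log

# Read back as a signed number, the reported value is a small negative one.
print(as_int32(reported_mb))

# Going the other way: storing -248 in an unsigned 32-bit integer
# reproduces the figure shown in the log.
print(as_uint32(-248))
```

If that assumption holds, the "used by this application" figure simply went slightly negative and wrapped around, possibly because another of the five concurrent tasks released memory between two measurements. On its own, then, the number may not point to faulty memory on the card.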

CElliott
Joined: 9 Feb 05
Posts: 28
Credit: 989366625
RAC: 158738

Several WUs my computer has processed for Einstein@Home have had validate errors for no apparent reason. Richard Haselgrove posted the note quoted below, which identifies a bug in BOINC that could be the source of validate errors. Could this bug apply to Einstein@Home? Haselgrove's note has three attachments (mostly logs) that are not included here. Please respond if you would like them posted or sent somewhere.

"User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in identifying task failures at Milkyway.
At my suggestion, he installed Windows client v7.6.2, and the attached message log extracts show the enhanced output that helped identify the CMS-dev problem.
In both cases, the task under scrutiny
(1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0, http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
(2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0, http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
was declared 'Validate error', and the section is empty. In the special case of Milkyway@Home, these two observations are linked, because the science result is returned in stderr, not a separate upload file.
Also in both cases, the log contains
[slot] failed to remove file slots/x/stderr.txt: unlink() failed

between 'handle_exited_app()' and 'Computation for task ... finished '
It appears that there is a race condition, whereby BOINC tries (and fails) to delete stderr.txt before the operating system has released the write lock. This (I'm presuming) also explains why the file appears empty when read off the disk for incorporation into the client_state structure in memory, prior to reporting the completed task to the project.
In order to preserve the scientific result at Milkyway (and debug and other useful information at other projects), the client should not initiate 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has been released.

Log 1 also shows that the additional safeguards on cleaning out slots are working properly: if both handle_exited_app() and get_free_slot() fail to delete the file, the next task isn't started in the not-empty slot (11), but in slot 14 instead. And when slot 11 is tested again at the next get_free_slot(), the delete succeeds and the now-empty slot is reused."
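
The race described in the quoted note can be illustrated with a small sketch: instead of giving up on the first failed delete, retry for a short while so the operating system has time to release the file. This is only an illustration in Python, not BOINC's actual client code. The function name `remove_with_retry` and its parameters are made up for the example.

```python
import os
import time


def remove_with_retry(path, attempts=5, delay=0.2):
    """Try to delete `path`, retrying if the OS still refuses.

    Returns True once the file is gone, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            # Already gone: nothing left to clean up.
            return True
        except OSError:
            # For example, the file is still locked by the exiting process.
            # Wait a little and try again, unless this was the last attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

A caller would only treat the slot as reusable when the function returns True. The same wait-and-retry idea would presumably apply to reading stderr.txt before its contents are reported, though how the BOINC developers choose to fix it is up to them.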

Keith Myers
Joined: 11 Feb 11
Posts: 4955
Credit: 18613591355
RAC: 5699086

Quote:
Several WUs my computer has processed for Einstein@Home have had validate errors for no apparent reason. Richard Haselgrove posted the note quoted below, which identifies a bug in BOINC that could be the source of validate errors. Could this bug apply to Einstein@Home? Haselgrove's note has three attachments (mostly logs) that are not included here. Please respond if you would like them posted or sent somewhere.

If you want to follow along with the nitty-gritty of the problem, the place is this SETI topic. To answer your question: yes, it likely does have something to do with the validate errors here at Einstein. There is a flaw in the underlying BOINC code logic regarding just when the results of a project's finished task are made available. It can lead to anything from missing results, as in MilkyWay's blank stderr.txt, to a truncated stderr.txt at SETI or here at Einstein, and in fact it can affect any project's science result.

The BOINC devs are aware of the problem and seem to have a handle on what needs to be done. There is still discussion on whether to tackle it with a two-pronged approach, fixing the server-side and client-side code at the same time, or to apply the client-side fix in whatever way is easiest and follow up with the server-side fix at a later date.

As an aside, the BOINC devs have fully understood the "abandoned tasks" problem and already have a proposed code fix submitted to the BOINC development committee. I am grateful to have been able to work with the devs to understand the problem, and I am hopeful that a solution will be made available quickly. I have been fighting to get this problem recognized for half a year, but progress is finally being made.

 

krankie
Joined: 14 Feb 16
Posts: 3
Credit: 943193
RAC: 0

I suspect a large group of work units on Arecibo Binary Radio Pulsar are being fed duff data as they all seem to be returning nothing except Validate Errors.

It seems to be those beginning p2030.20151015.G187.80-00.20.N.b1s0g0.00000

Here are a few examples, some close together, some further apart:

https://einsteinathome.org/workunit/241054328
https://einsteinathome.org/workunit/241054429
https://einsteinathome.org/workunit/241054531
..
..
..
https://einsteinathome.org/workunit/241055334
https://einsteinathome.org/workunit/241055336
..
..
https://einsteinathome.org/workunit/241056336

Pick any work unit in this group where two or more results are in, and you will find they all have Validate Errors.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116968244881
RAC: 36793338

Thanks for the report.

I've passed it on to the Devs.

Cheers,
Gary.

Happyl
Joined: 6 Jul 05
Posts: 8
Credit: 4616006
RAC: 0

Same here:
https://einsteinathome.org/workunit/241560137

And two more on the same machine. Now testing on Linux. It's an APU, dunno if that's important.
