Validate error - What this really means!

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2959636118

RAC: 705113

26 Jul 2014 14:31:55 UTC

Message 107900 in response to message 107899

Agreed, all GTX 6xx and 7xx cards are reported with an unknown speed and number of cores. The cards were manufactured after this app was built, and the app can't predict the future.
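
A minimal sketch of how a log line like "0 CUDA cores / 0.00 GFLOPS" can come about. It assumes the app derives the core count from a table of cores per multiprocessor, keyed by compute capability and frozen at build time. The table, the function name and the zero fallback are hypothetical and not taken from the Einstein@Home source.

```python
# Hypothetical lookup table, frozen when the app was built.
# Keys are compute capability (major, minor); values are CUDA cores
# per multiprocessor for the architectures known at that time.
CORES_PER_SM_AT_BUILD_TIME = {
    (1, 0): 8, (1, 1): 8, (1, 2): 8, (1, 3): 8,
    (2, 0): 32, (2, 1): 48,
}

def cuda_cores(compute_capability, multiprocessors):
    """Return the total core count, or 0 if the architecture is unknown."""
    per_sm = CORES_PER_SM_AT_BUILD_TIME.get(compute_capability, 0)
    return per_sm * multiprocessors

# A compute capability 2.1 card with 8 multiprocessors is in the table.
print(cuda_cores((2, 1), 8))
# A compute capability 3.0 card (GTX 6xx/7xx generation) is not, so it reports 0.
print(cuda_cores((3, 0), 8))
```

Under that assumption, the zero is only a display issue: the card still runs the kernels, the app just cannot describe it.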

Gavin

Joined: 21 Sep 10

Posts: 191

Credit: 40644337738

RAC: 1

26 Jul 2014 14:41:25 UTC

Message 107901 in response to message 107899

Claggy and *Richard,

You beat me to it!

I was just looking around my own NVIDIA hosts and saw that my 660 Ti gives the same missing information but does not produce errors (bad WU aside), and realised my error :) Apologies to John Jamulla for my useless input!

But John, are you able to prove the 770 in another machine?

*edit.

John Jamulla

Joined: 26 Feb 05

Posts: 32

Credit: 1174435259

RAC: 550821

19 Oct 2014 10:22:12 UTC

Message 107902 in response to message 107901

Hi - just getting back to this and confirming again that all my CUDA tasks are invalid.

So far I haven't tried it in another machine.

Should that be my next step?
I'm always nervous upgrading the NVIDIA driver, but I could also try that.

John Jamulla

Joined: 26 Feb 05

Posts: 32

Credit: 1174435259

RAC: 550821

19 Oct 2014 10:23:30 UTC

Message 107903 in response to message 107901

Hi - just getting back to this and confirming again that all my CUDA tasks are invalid on this GTX 770. I don't understand why they "complete" without errors though.

So far I haven't tried it in another machine.
Should that be my next step?

If it fails, what do I do? Send it back under warranty or something?

I'm always nervous upgrading the NVIDIA driver, but I could also try that.

John Jamulla

Joined: 26 Feb 05

Posts: 32

Credit: 1174435259

RAC: 550821

19 Oct 2014 10:38:16 UTC

Message 107904 in response to message 107903

One thing I noticed between the 1 good CUDA task and a bad one:
There's some weirdness in the reported amount of global memory used. The good one shows 2 MB used by the application, the "bad" one shows some huge "weird" amount.

Any idea what this tells me? Is there a bug in the driver, or maybe in the CUDA code?
I am currently running 5 tasks at once on this GPU. Maybe I should try going back to 1 and see what happens?

Any ideas are appreciated. Maybe there's just a memory problem on the board?

good one: http://einsteinathome.org/task/460356043
...
[18:56:07][5700][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 1146 MB (903 MB free / 2049 MB total) -> Used by this application (assuming a single GPU task): 2 MB
[18:56:07][5700][INFO ] Using CUDA device #0 "GeForce GTX 770" (0 CUDA cores / 0.00 GFLOPS)
[18:56:07][5700][INFO ] Version of installed CUDA driver: 6000
[18:56:07][5700][INFO ] Version of CUDA driver API used: 3020
...

bad one: http://einsteinathome.org/task/460356104
...
[20:53:18][5536][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 898 MB (1151 MB free / 2049 MB total) -> Used by this application (assuming a single GPU task): 4294967048 MB
[20:53:18][5536][INFO ] Using CUDA device #0 "GeForce GTX 770" (0 CUDA cores / 0.00 GFLOPS)
[20:53:18][5536][INFO ] Version of installed CUDA driver: 6000
[20:53:18][5536][INFO ] Version of CUDA driver API used: 3020
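
One possible reading of the huge figure, offered as a guess rather than a diagnosis: 4294967048 is just under 2^32, which is what a small negative number looks like when printed as an unsigned 32-bit integer. Assuming the app computes its own usage as the difference between two free-memory readings (not confirmed from the app source), another task releasing memory between the two readings would make that difference negative. With 5 tasks sharing the GPU, that seems plausible. A quick check of the arithmetic, using a hypothetical helper:

```python
def as_signed_32(value):
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value

# "Used by this application" as printed in the bad task's log, in MB.
reported_mb = 4294967048
print(as_signed_32(reported_mb))  # -248

# The good task's value is small and positive, so it is unchanged.
print(as_signed_32(2))  # 2
```

Incidentally, 248 MB is also the difference between the free memory shown in the two logs (1151 MB versus 903 MB). If this reading is right, the odd number is more likely a reporting artifact of running several tasks at once than evidence of faulty memory, and it would not by itself explain the invalid results.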

CElliott

Joined: 9 Feb 05

Posts: 28

Credit: 1005153082

RAC: 849281

10 Jul 2015 9:48:22 UTC

Message 107905

Several WUs my computer has processed for Einstein@Home have had validate errors for no apparent reason. Richard Haselgrove posted the note quoted below, which identifies a bug in BOINC that could be the source of validate errors. Could this bug apply to Einstein@Home? Haselgrove's note has three attachments (mostly logs) that are not included here. Please respond if you would like them posted or sent somewhere.

"User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in identifying task failures at Milkyway.
At my suggestion, he installed Windows client v7.6.2, and the attached message log extracts show the enhanced output that helped identify the CMS-dev problem.
In both cases, the task under scrutiny
(1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0, http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
(2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0, http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
was declared 'Validate error', and the stderr section is empty. In the special case of Milkyway@Home, these two observations are linked, because the science result is returned in stderr, not a separate upload file.
Also in both cases, the log contains
[slot] failed to remove file slots/x/stderr.txt: unlink() failed

between 'handle_exited_app()' and 'Computation for task ... finished'.
It appears that there is a race condition, whereby BOINC tries (and fails) to delete stderr.txt before the operating system has released the write lock. This (I'm presuming) also explains why the file appears empty when read off the disk for incorporation into the client_state structure in memory, prior to reporting the completed task to the project.
In order to preserve the scientific result at Milkyway (and debug and other useful information at other projects), the client should not initiate 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has been released.

Log 1 also shows that the additional safeguards on cleaning out slots are working properly: if both handle_exited_app() and get_free_slot() fail to delete the file, the next task isn't started in the not-empty slot (11), but in slot 14 instead. And when slot 11 is tested again at the next get_free_slot(), the delete succeeds and the now-empty slot is reused."
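
The kind of safeguard described in the note can be sketched roughly as follows. This is an illustration in Python, not BOINC's actual client code (which is C++), and the thread does not say which fix the developers chose. The function name, retry count and delay are hypothetical. The idea is to retry the delete while the operating system may still hold the file, and to report failure so the caller can skip the slot rather than reuse it.

```python
import os
import time

def remove_with_retry(path, attempts=5, delay_s=0.2):
    """Try to delete path, retrying while the OS still holds it.

    Returns True if the file is gone afterwards, False otherwise.
    """
    for attempt in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            # Already gone, which is the state we wanted.
            return True
        except PermissionError:
            # On Windows, a file still open in another process
            # typically fails this way; wait and try again.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False
```

A caller would treat a False return the way the note describes for slot 11: leave the slot alone for now and test it again later.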

Keith Myers

Joined: 11 Feb 11

Posts: 4965

Credit: 18756464027

RAC: 7161019

12 Jul 2015 18:42:12 UTC

Message 107906 in response to message 107905

Quote:

Several WUs my computer has processed for Einstein@Home have had validate errors for no apparent reason. Richard Haselgrove posted the note quoted below, which identifies a bug in BOINC that could be the source of validate errors. Could this bug apply to Einstein@Home? Haselgrove's note has three attachments (mostly logs) that are not included here. Please respond if you would like them posted or sent somewhere.

If you want to follow along with the nitty-gritty of the problem, the place is this SETI topic. To answer your question: yes, it likely does have something to do with the validate errors here at Einstein. There is a flaw in the underlying BOINC code logic regarding just when the results of a project's finished task are made available. It can lead to missing results, as with MilkyWay's blank stderr.txt, or to a truncated stderr.txt at SETI or here at Einstein, or in fact to damage to any project's science result.

The BOINC devs are aware of the problem and seem to have a handle on what needs to be done. There is still discussion on whether to tackle it with a two-pronged approach, changing the server-side and client-side code at the same time, or to apply the client-side fix in whichever way is easiest and follow up with the server-side fix at a later date.

As an aside, the BOINC devs have fully understood the "abandoned tasks" problem and already have a proposed code fix submitted to the BOINC development committee. I am grateful to have been able to work with the devs to understand the problem, and am hopeful that a solution is quickly made available. I have been fighting to get this problem recognized for half a year, and finally progress is being made.

krankie

Joined: 14 Feb 16

Posts: 3

Credit: 943193

RAC: 0

2 Mar 2016 18:47:03 UTC

Message 107907 in response to message 107906

I suspect a large group of work units on Arecibo Binary Radio Pulsar are being fed duff data as they all seem to be returning nothing except Validate Errors.

It seems to be those beginning p2030.20151015.G187.80-00.20.N.b1s0g0.00000

Here are a few examples, some close together, some further apart

https://einsteinathome.org/workunit/241054328
https://einsteinathome.org/workunit/241054429
https://einsteinathome.org/workunit/241054531
..
..
..
https://einsteinathome.org/workunit/241055334
https://einsteinathome.org/workunit/241055336
..
..
https://einsteinathome.org/workunit/241056336

Pick any work unit in this group where two or more results are in, and they all have validate errors.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117715085690

RAC: 34994128

2 Mar 2016 20:56:52 UTC

Message 107909 in response to message 107907

Thanks for the report.

I've passed it on to the Devs.

Cheers,
Gary.

Happyl

Joined: 6 Jul 05

Posts: 8

Credit: 4616006

RAC: 0

9 Mar 2016 18:28:12 UTC

Message 107910

Same here:
https://einsteinathome.org/workunit/241560137

And two more on the same machine. Now testing on Linux. It's an APU, dunno if that's important.
