Linux CUDA validation errors

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

It is more than frustrating, it is discouraging, that I get such a long list of invalid results. :(

I really don't know what to do.

What does the exit code 'Validate error (32:00100000)' mean?

Even some pending results get this code once the wingman returns their result.

Data from nvidia-smi -q:
[pre]==============NVSMI LOG==============

Timestamp : Fri Mar 18 08:14:42 2011

Driver Version : 270.26

Attached GPUs : 1

GPU 0:4:0
    Product Name : GeForce GTX 460
    Display Mode : N/A
    Persistence Mode : N/A
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : N/A
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Capping Object : N/A
    PCI
        Bus : 4
        Device : 0
        Domain : 0
        Function : 0
        Link Gen : 1
        Link Speed : 0
        Device Id : E2210DE
        Bus Id : 0:4:0
    Fan Speed : 40 %
    Memory Usage
        Total : 1023 Mb
        Used : 917 Mb
        Free : 105 Mb
[/pre]

Temperature is about 51-52 °C.

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Update.
Nothing has changed.

Old configuration

New configuration

What else I did:

* Replaced the CAT5 cables between the hosts and the switch with CAT6 cables. No change.

* Connected the Phenom AND the i7 directly to the DHCP server in my Internet router. No change.

* Bought two 10/100 MBit PCI cards with pretty old chipsets and installed them in the Phenom and the i7. No change.

By now it is such a terrible waste of energy that I am starting to think about the consequences. A 50% success rate over about 14 h is equivalent to 1.4 kWh wasted, if I assume that the GTX 570 draws 200 W. That is more than 100 € per year of my own money for nothing.
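
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the 14 h are GPU run time per day and that half of that time produces invalid results, and it uses an electricity price of 0.20 €/kWh, which is a hypothetical figure not given in the post.

```python
# Rough check of the wasted-energy estimate.
# Assumptions (not stated in the post): the 14 h are per day,
# and electricity costs 0.20 EUR/kWh (hypothetical price).
GPU_POWER_W = 200         # power draw quoted in the post
HOURS_PER_DAY = 14        # run time quoted in the post
INVALID_FRACTION = 0.5    # 50% of results are invalid
PRICE_EUR_PER_KWH = 0.20  # assumed price


def wasted_kwh_per_day(power_w, hours, invalid_fraction):
    """Energy spent on invalid results per day, in kWh."""
    return power_w * hours * invalid_fraction / 1000.0


def wasted_eur_per_year(kwh_per_day, price_eur_per_kwh, days=365):
    """Yearly cost of the energy spent on invalid results."""
    return kwh_per_day * days * price_eur_per_kwh


daily = wasted_kwh_per_day(GPU_POWER_W, HOURS_PER_DAY, INVALID_FRACTION)
yearly = wasted_eur_per_year(daily, PRICE_EUR_PER_KWH)
print(f"{daily:.1f} kWh/day, about {yearly:.0f} EUR/year")
```

With these assumed inputs the result is 1.4 kWh per day and a little over 100 € per year, which matches the figures in the post; a different price or duty cycle shifts the total proportionally.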

[edit]
Links to the two identical hosts with Linux and Windows showing the error page.

RAMA
Joined: 5 May 05
Posts: 18
Credit: 657880205
RAC: 0

Did it also happen when running just 1 task?
Maybe there is a memory conflict on the graphics card when running several WUs with the Linux driver?

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Quote:
Did it also happen when running just 1 task?
Maybe there is a memory conflict on the graphics card when running several WUs with the Linux driver?


The Phenom with the GTX 460 is running 3 tasks without X11, and today I tried running just one task on the i7, with no change.
I also tried several reboots and even complete shutdowns. I will try Milkyway@home to check whether the problems persist.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

Quote:
What does the exit code 'Validate error (32:00100000)' mean?

It means that a "result file has too few or too many rows".

The result file PM0109_00491.dm_292_1_0 consists of two (at first glance identical) blocks, each of which would probably make a correct result file, but not both together.

I'm still not sure what the actual problem is, but I doubt that it's CUDA-related.

AFAIK the app writes the result file as a whole. If it did the same calculation twice, it would simply overwrite the existing result file, not append to it.

Are you aware of problems uploading the BRP results? Anything about retries in the client logs? I'll see if I can find some time to dig through the upload handler logs.

BM

Richard de Lhorbe
Joined: 15 Dec 05
Posts: 46
Credit: 9518539545
RAC: 800348

I posted a similar problem in the sticky thread that is for validation issues, but I am not sure if anyone in-the-know has seen it yet. I seem to be having a similar problem except with an error code of

Validate error (2:00000010)

I would like to know what that implies. A couple of weeks ago this computer was crunching CUDA workunits just fine; then, a few days ago, anywhere from one in three to as many as half of them suddenly started getting validation errors. No drivers and no hardware have been changed; many of the CUDA workunits have simply become invalid.
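
One observation about the two codes quoted in this thread: the part after the colon is the 8-bit binary form of the number before it (32 is 00100000, 2 is 00000010). That suggests, but does not prove, that the code is a bit mask with one flag per failed check. Below is a minimal Python sketch under that assumption. The function name is made up, and the only meaning confirmed in this thread is that of 32 ("result file has too few or too many rows"); what 2 stands for is not stated here.

```python
import re


def parse_validate_error(text):
    """Split a code like 'Validate error (32:00100000)' into its parts.

    Returns (value, bits, set_bit_positions). Treating the value as a
    bit mask is an assumption based on the two codes seen in this thread.
    """
    m = re.search(r"\((\d+):([01]+)\)", text)
    if m is None:
        raise ValueError(f"no validate error code in {text!r}")
    value = int(m.group(1))
    bits = m.group(2)
    # Sanity check: the binary field should encode the decimal field.
    if int(bits, 2) != value:
        raise ValueError("binary field does not match decimal field")
    # Bit positions counted from the least significant bit (position 0).
    positions = [i for i in range(len(bits)) if value & (1 << i)]
    return value, bits, positions


for code in ("Validate error (32:00100000)", "Validate error (2:00000010)"):
    print(parse_validate_error(code))
```

If the bit-mask reading is right, the two codes would correspond to two different checks (bit 5 and bit 1), so the two problems need not share a cause.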

More info at Message 112129

Regards
Richard

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Quote:
Quote:
What does the exit code 'Validate error (32:00100000)' mean?

It means that a "result file has too few or too many rows".

The result file PM0109_00491.dm_292_1_0 consists of two (at first glance identical) blocks, each of which would probably make a correct result file, but not both together.

I'm still not sure what the actual problem is, but I doubt that it's CUDA-related.

AFAIK the app writes the result file as a whole. If it did the same calculation twice, it would simply overwrite the existing result file, not append to it.

Are you aware of problems uploading the BRP results? Anything about retries in the client logs? I'll see if I can find some time to dig through the upload handler logs.

BM


I suspended network activity as Bikeman suggested and copied the content of some result files into OOo Calc. Each block had exactly 100 lines.
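
The same check can be done without a spreadsheet. The sketch below assumes only what is described in this thread: a broken result file is made of two identical blocks of lines, one after the other. The function names and the sample data are hypothetical, and the real BRP result format is not modelled.

```python
def duplicated_block_length(lines):
    """Return the block length if `lines` is one block repeated twice
    back to back, otherwise None."""
    n = len(lines)
    if n == 0 or n % 2 != 0:
        return None
    half = n // 2
    return half if lines[:half] == lines[half:] else None


def check_result_file(path):
    """Run the check on a file; the content is treated as opaque text."""
    with open(path, "r", errors="replace") as f:
        return duplicated_block_length(f.read().splitlines())


# Made-up sample: a 100-line block written twice, as in the broken files.
block = [f"candidate {i}" for i in range(100)]
print(duplicated_block_length(block + block))  # 100
print(duplicated_block_length(block))          # None
```

Running `check_result_file` on a result file while network activity is suspended would show whether the duplication is already present on disk before the upload starts.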
I get some "transient upload errors" with BRP3 tasks AND h1-tasks, but all h1-tasks are valid. Example:

[pre]16-May-2011 12:45:10 [Einstein@Home] Temporarily failed upload of PM0097_034D1.dm_320_2_0: transient upload error
16-May-2011 12:45:10 [Einstein@Home] Backing off 1 min 0 sec on upload of PM0097_034D1.dm_320_2_0
16-May-2011 12:45:10 [Einstein@Home] Temporarily failed upload of PM0097_034D1.dm_320_2_1: transient upload error
16-May-2011 12:45:10 [Einstein@Home] Backing off 1 min 0 sec on upload of PM0097_034D1.dm_320_2_1
16-May-2011 12:45:10 [Einstein@Home] Started upload of PM0097_034D1.dm_320_2_2
16-May-2011 12:45:10 [Einstein@Home] Started upload of PM0097_034D1.dm_320_2_3
16-May-2011 12:45:17 [Einstein@Home] Temporarily failed upload of PM0097_034D1.dm_320_2_2: transient upload error
16-May-2011 12:45:17 [Einstein@Home] Backing off 1 min 0 sec on upload of PM0097_034D1.dm_320_2_2
16-May-2011 12:45:17 [Einstein@Home] Temporarily failed upload of PM0097_034D1.dm_320_2_3: transient upload error
16-May-2011 12:45:17 [Einstein@Home] Backing off 1 min 0 sec on upload of PM0097_034D1.dm_320_2_3
16-May-2011 12:46:17 [Einstein@Home] Started upload of PM0097_034D1.dm_320_2_0
16-May-2011 12:46:17 [Einstein@Home] Started upload of PM0097_034D1.dm_320_2_1
16-May-2011 12:46:20 [Einstein@Home] Finished upload of PM0097_034D1.dm_320_2_0
16-May-2011 12:46:20 [Einstein@Home] Finished upload of PM0097_034D1.dm_320_2_1
16-May-2011 12:46:20 [Einstein@Home] Started upload of PM0097_034D1.dm_320_2_2
16-May-2011 12:46:20 [Einstein@Home] Started upload of PM0097_034D1.dm_320_2_3
16-May-2011 12:46:26 [Einstein@Home] Finished upload of PM0097_034D1.dm_320_2_2
16-May-2011 12:46:26 [Einstein@Home] Finished upload of PM0097_034D1.dm_320_2_3[/pre]

[pre]16-May-2011 11:15:22 [Einstein@Home] Started upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0
16-May-2011 11:15:25 [Einstein@Home] Temporarily failed upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0: transient upload error
16-May-2011 11:15:25 [Einstein@Home] Backing off 1 min 0 sec on upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0
16-May-2011 11:16:25 [Einstein@Home] Started upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0
16-May-2011 11:16:28 [Einstein@Home] Finished upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0[/pre]
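
To see how often this happens, the client log can be summarised per file. This is a minimal sketch that assumes the message wording shown in the excerpts above; the function name is hypothetical, and other client versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns taken from the log excerpts in this post.
FAILED = re.compile(r"Temporarily failed upload of (\S+): transient upload error")
FINISHED = re.compile(r"Finished upload of (\S+)")


def summarise_uploads(log_lines):
    """Count transient failures per file and collect files that finished."""
    failures = Counter()
    finished = set()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            failures[m.group(1)] += 1
            continue
        m = FINISHED.search(line)
        if m:
            finished.add(m.group(1))
    return failures, finished


sample = """\
16-May-2011 11:15:22 [Einstein@Home] Started upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0
16-May-2011 11:15:25 [Einstein@Home] Temporarily failed upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0: transient upload error
16-May-2011 11:15:25 [Einstein@Home] Backing off 1 min 0 sec on upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0
16-May-2011 11:16:25 [Einstein@Home] Started upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0
16-May-2011 11:16:28 [Einstein@Home] Finished upload of h1_1487.60_S5R4__97_S5GC1HFa_0_0
""".splitlines()

failures, finished = summarise_uploads(sample)
for name, count in failures.items():
    status = "finished" if name in finished else "not finished"
    print(f"{name}: {count} transient error(s), {status}")
```

Comparing the list of files with transient errors against the list of results that later fail validation would show whether retried uploads and invalid BRP3 results coincide.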

I do not get these errors with Win 7 or with the Phenom. That is why I added these 10/100 MBit NICs: to find out whether it is a network problem with the new mainboard for the i7 2600K. There is another weird thing: the number of bytes BOINC has uploaded is always _more_ than the file size! But this doesn't affect the validity of the h1 tasks.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

Thanks.

The problem you reported shows up in many of the BRP3 "validate error" results, not only yours. The pattern currently seems to indicate that it is limited to Linux, although I still don't know whether the problem is in the app, the client or even on the server side (file upload handler). Possibly it is a combination of more than one of these. Anyway, this is not limited to your machine.

Thanks for reporting.

Sorry it took me so long to get to this. My primary focus is the GW app, getting S6Bucket out the door was my highest priority, and I wasn't paying much attention to BRP until this was done.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

This looks related to a bug in the file upload handler that was fixed in February this year, although I don't (yet) understand why it only occurs with Linux Core Clients and the BRP App. Anyway, we'll update the FUHs on both machines.

BM

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Quote:

Sorry it took me so long to get to this. My primary focus is the GW app, getting S6Bucket out the door was my highest priority, and I wasn't paying much attention to BRP until this was done.

BM


No problem, Bernd. I can completely understand your primary objectives.
