This looks related to a bug in the file upload handler that was fixed in February this year, although I don't (yet) understand why it only occurs with Linux Core Clients and the BRP App. Anyway, we'll update the FUHs on both machines.
BM
Replacing the file upload handlers will take a few hours, but should be finished by 16:00 UTC (18:00 CEST). To avoid further validation errors from upload retries I suggest you suspend your network connection until then.
BM
My C/C++ knowledge is close to zero, but shouldn't there be a flag/constant/property in line 161 like "O_OVERWRITE"?
In case the Client is uploading one or more files of the BRP tasks with the wrong file size (in case the manager is showing the truth), the FUH might send it to the scheduler and it is marked "invalid". The normal behavior should be that the FUH waits for the next try of the Client to upload the file(s). Just a guess. ;)
It's still a miracle to me why the Phenom in the new configuration is running without errors.
Btw. no CUDA errors with Milkyway, Primegrid and GPUGRID on the i7 so far.
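On the O_OVERWRITE question above: POSIX has no such flag; the usual choices when reopening a file are O_TRUNC (discard what is already there) or O_APPEND (resume at the current end of the file). The sketch below is only a hypothetical illustration of how an upload handler might reopen a partially transferred file on a retry; it is not the actual Einstein@Home FUH code, and the resume parameter is invented for illustration.

// Hypothetical sketch, not the real file upload handler.
#include <fcntl.h>
#include <sys/stat.h>

// Reopen the target file for an upload retry. 'resume' decides whether to
// append to the bytes already received or to truncate and start over.
int open_upload_target(const char* path, bool resume) {
    int flags = O_WRONLY | O_CREAT | (resume ? O_APPEND : O_TRUNC);
    return open(path, flags, S_IRUSR | S_IWUSR | S_IRGRP);  // mode 0640
}

Whether the real handler appends or truncates on a retry is exactly the kind of detail a retry-related bug could hinge on; this only shows which flags exist.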
Quote:
In case the Client is uploading one or more files of the BRP tasks with the wrong file size (in case the manager is showing the truth), the FUH might send it to the scheduler and it is marked "invalid". The normal behavior should be that the FUH waits for the next try of the Client to upload the file(s). Just a guess. ;)
It's a bit more complicated.
The FUH doesn't talk to the "scheduler"; it doesn't even talk to the DB.
The client tries to upload the files until it succeeds or times out. This is a communication only between the Client and the FUH. Then, on the next scheduler contact, triggered by the need to fetch work or by an approaching deadline, the Client reports the results it uploaded, either as "success" or as "upload error" (or another "client error" if applicable). An "error" result is marked invalid by the scheduler immediately.
When a quorum for a workunit is reached (on E@H: when two successful results have been reported), the validator examines the results of that workunit. First it checks each result individually for syntactic properties, e.g. the number of lines. If a result fails this check, it is marked "validate error" (outcome) and "invalid" (validate state). The results that "survive" this check are then compared. If two are found that "agree", these are marked valid, credit is granted, and one is chosen as the "canonical result". The remaining results (that don't "agree") are marked "invalid".
If the "quorum is lost" during validation, i.e. there are not two results "surviving" the syntax check and "agreeing", the validator triggers another result for this workunit to be generated and sent out. It examines the results of this workunit again in the same way when the additional result is reported.
Quote:
It's still a miracle to me why the Phenom in the new configuration is running without errors.
Btw. no CUDA errors with Milkyway, Primegrid and GPUGRID on the i7 so far.
There might be another problem hiding behind the upload problem, but I can't see it yet.
BM
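For readers who prefer code to prose, here is a minimal sketch of the two-stage validation described above: a per-result syntax check, then pairwise comparison until two results agree and one becomes the canonical result. The data structure, the expected line count and the tolerance are assumptions made up for illustration; this is not the actual Einstein@Home validator.

// Illustrative sketch of the validation flow described above; not the
// actual Einstein@Home validator.
#include <cmath>
#include <cstddef>
#include <vector>

struct Result {
    int id = 0;
    std::vector<double> values;   // numbers parsed from the result file
    bool syntax_ok = false;       // passed the per-result check
    bool valid = false;           // final validate state
};

// Stage 1: per-result syntactic check, e.g. the expected number of lines.
bool check_syntax(const Result& r, std::size_t expected_lines) {
    return r.values.size() == expected_lines;
}

// Stage 2: do two surviving results "agree" within a tolerance?
bool results_agree(const Result& a, const Result& b, double tol) {
    if (a.values.size() != b.values.size()) return false;
    for (std::size_t i = 0; i < a.values.size(); ++i)
        if (std::fabs(a.values[i] - b.values[i]) > tol) return false;
    return true;
}

// Find two agreeing results, mark them valid and return one as the
// "canonical result"; everything else stays invalid. Returns -1 if the
// quorum is lost, in which case another result would be generated.
int find_canonical(std::vector<Result>& results,
                   std::size_t expected_lines, double tol) {
    for (Result& r : results)
        r.syntax_ok = check_syntax(r, expected_lines);
    for (std::size_t i = 0; i < results.size(); ++i) {
        if (!results[i].syntax_ok) continue;
        for (std::size_t j = i + 1; j < results.size(); ++j) {
            if (!results[j].syntax_ok) continue;
            if (results_agree(results[i], results[j], tol)) {
                results[i].valid = results[j].valid = true;
                return results[i].id;   // canonical result
            }
        }
    }
    return -1;   // quorum lost
}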
Both file upload handlers (BRP & GW) have been updated.
I'll take another look tomorrow at the validate errors we still get from BRP. If all works ok, they should be down to below 1/3 of what we have now.
BM
The BRP validate error rate is certainly dropping, though not as fast as I hoped (down by 1/3 so far).
Ziegenmelker, does your problem persist?
From the configurations you posted I'd suspect your network problems to be a compatibility issue between the Linux driver, the interface (of the Phenom?) and the switch. A common issue is "jumbo frames": large IP frames that the Linux driver may send but the switch can't handle. You may want to fiddle with the MTU size settings.
BM
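If you want to see what the Linux driver is actually using before changing anything, the small Linux-only sketch below reads an interface's MTU via the SIOCGIFMTU ioctl (the same number that ip link or ifconfig report). The interface name eth0 is just an assumed default. Anything above 1500 means jumbo frames are in play; a PPPoE DSL line typically leaves 1492 bytes (1500 minus the 8-byte PPPoE header), so 1492 is a sensible value to test there.

// Linux-only sketch: print the MTU of a network interface.
#include <cstdio>
#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char** argv) {
    const char* ifname = (argc > 1) ? argv[1] : "eth0";  // assumed default
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { std::perror("socket"); return 1; }

    struct ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) { std::perror("SIOCGIFMTU"); close(fd); return 1; }
    std::printf("%s MTU: %d\n", ifname, ifr.ifr_mtu);  // > 1500 means jumbo frames
    close(fd);
    return 0;
}

Changing the value would be something like "ip link set dev eth0 mtu 1492", or the equivalent in the distribution's network settings.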
Bernd, I still have a lot of Primegrid and Milkyway tasks. Some Primegrid tasks take about 200h, and I did a CUDA monster task (~25h) at GPUGRID without error (2 files uploaded without 'transient upload error'), but today I did 5 BRP tasks on the i7 and no validate errors so far:
3 valid and 2 pending, but some of them got that mysterious 'transient upload error'. I will report if any errors show up. Atm it looks good.
About the MTU settings: I left them alone, cause there are only a few preset options, e.g. 1500 (Ethernet, DSL broadband), but I can add any value and there are options for ethtool. What MTU setting would you suggest? In the past I tried 1500, but it made no difference.
The switch should not be the problem, cause Win 7 crunches the BRP tasks without any error, or do you think Windows uses a different packet size?
But if I'm right, DSL is limited to 1492 anyway, so this might be a good value to test? I will try it tomorrow.
Sorry for my late reply.
My invalid CUDA error rate for May 19th, which I think was the first full day with the reloaded FUH, was no different: 11 invalids out of about 30 CUDA workunits completed (it has been between 10 and 15 per day recently, so 11 is about average). Today, May 20th, I have only 5 invalids, with about three more possible calculations before the date changes over to the 21st. So, an improvement, but still errors compared to two weeks ago, when I was generating zero errors.
I have been getting a different error code than ziegenmelker does, shown as Validate error (2:00000010); no one has yet said what this code represents.
Regards
Richard
Quote:
Validate error (2:00000010); no one has yet said what this code represents.
It means that these result files contain something other than just the printed representation of finite floating-point numbers, probably NaNs ("not a number") or "INF" (infinity). This points to real calculation errors, not network problems.
BM
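To illustrate the kind of check implied here, the sketch below scans a result file and flags any token that does not parse as a finite floating-point number (NaN, inf, or other garbage). The file name and the whitespace-separated format are assumptions; this is not the project's validator, just a way to see the symptom Bernd describes.

// Illustrative only: flag non-finite values (NaN/INF) in a whitespace-
// separated result file.
#include <cmath>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

int main(int argc, char** argv) {
    const char* path = (argc > 1) ? argv[1] : "result.out";  // assumed file name
    std::ifstream in(path);
    if (!in) { std::cerr << "cannot open " << path << '\n'; return 1; }

    std::string token;
    bool clean = true;
    while (in >> token) {
        try {
            double v = std::stod(token);      // throws on non-numeric text
            if (!std::isfinite(v)) {          // catches NaN and +/-inf
                std::cout << "non-finite value: " << token << '\n';
                clean = false;
            }
        } catch (const std::invalid_argument&) {
            std::cout << "non-numeric token: " << token << '\n';
            clean = false;
        } catch (const std::out_of_range&) {
            std::cout << "out-of-range value: " << token << '\n';
            clean = false;
        }
    }
    return clean ? 0 : 2;   // a non-zero result here is what a "validate error" flags
}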
Quote:
The switch should not be the problem, cause Win 7 crunches the BRP tasks without any error, or do you think Windows uses a different packet size?
This is my suspicion, judging from the symptoms.
Something else to try would be to connect the computers directly, without a switch, just a cable.
BM
Actually, I have just noticed that the situation is far, far worse. While it is true that of the approximately 30 CUDA workunits per day that this one computer uploads, about one third are immediately tagged as invalid (error code 2:00000010), it turns out that of the remaining two thirds, ALL are eventually marked invalid as well, so every single CUDA workunit I have processed over the past week and a half has been a waste of time. And this same computer, same BOINC version, same OS, same Nvidia card and same driver was 100% successful up until two weeks ago!
Once the CUDA workunits are eventually tagged as invalid, the error code seems to change to Validate error (8:00001000); again, I have no idea what this one means. Comments please! At least I now know why my daily credit continues to collapse and won't recover...
Regards
Richard