Linux CUDA validation errors

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

Replacing the file upload

Replacing the file upload handlers will take a few hours, but should be finished by 16:00 UTC (18:00 CEST). To avoid further validation errors from upload retries, I suggest you suspend your network connection until then.

BM


M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: This looks related to a

Quote:

This looks related to a bug in the file upload handler that has been fixed in February this year, although I don't (yet) understand why it only occurs with Linux Core Clients and the BRP App. Anyway, we'll update the FUHs on both machines.

BM


My C/C++ knowledge is close to zero, but shouldn't there be a flag/constant/property in line 161 like "O_OVERWRITE"?
If the Client uploads one or more files of a BRP task with the wrong file size (assuming the manager is showing the truth), the FUH might send them to the scheduler, and the result is marked "invalid". The normal behavior should be that the FUH waits for the Client's next attempt to upload the file(s). Just a guess. ;)
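
For reference, POSIX has no "O_OVERWRITE" flag; the closest thing is O_TRUNC, which empties an existing file when it is opened for writing. The sketch below is not the FUH code (which is not shown in this thread); it only illustrates, under the assumption that a retry rewrites the file from the start, how a missing O_TRUNC can leave a file with the wrong size. The function name `write_upload` and the file name are made up for the example.

```python
import os
import tempfile

def write_upload(path: str, data: bytes, truncate: bool) -> int:
    """Write data at the start of path and return the resulting file size."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC  # discard any previous content when opening
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "result_0")  # hypothetical file name
    first_size = write_upload(target, b"x" * 100, truncate=True)   # first attempt
    stale_size = write_upload(target, b"y" * 40, truncate=False)   # retry without O_TRUNC
    clean_size = write_upload(target, b"y" * 40, truncate=True)    # retry with O_TRUNC
```

Without O_TRUNC the second write only overwrites the first 40 bytes, so the file keeps its old length and still ends with stale data from the first attempt.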

It's still a miracle to me why the Phenom in the new configuration is running without errors.

Btw. No CUDA errors with Milkyway, Primegrid and GPUGRID with the i7 so far.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

RE: In case the Client is

Quote:
If the Client uploads one or more files of a BRP task with the wrong file size (assuming the manager is showing the truth), the FUH might send them to the scheduler, and the result is marked "invalid". The normal behavior should be that the FUH waits for the Client's next attempt to upload the file(s). Just a guess. ;)

It's a bit more complicated.

The FUH doesn't talk to the "scheduler", it doesn't even talk to the DB.

The client tries to upload the files until it succeeds or times out. This is a communication only between the Client and the FUH. Then, on the next scheduler contact triggered by the need to fetch work or approaching the deadline, the Client reports the results it uploaded, either as "success" or as "upload error" (or other "client error" if applicable). An "error" result is marked invalid by the scheduler immediately.

When a quorum for a workunit is reached (on E@H, once two successful results have been reported), the validator examines the results of the workunit. First it checks each result individually for syntactic properties, e.g. the number of lines. If a result fails the check, it is marked "validate error" (outcome) and "invalid" (validate state). The results that "survive" this check are then compared. If two of them "agree", these are marked valid, credit is granted, and one is chosen as the "canonical result". The remaining results (that don't "agree") are marked "invalid".

If the "quorum is lost" during validation, i.e. there are not two results that both "survive" the syntax check and "agree", the validator triggers the generation of another result for this workunit, which is then sent out. When the additional result is reported, it examines the results of this workunit again in the same way.
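
The two validation stages described above can be sketched roughly as follows. This is an illustration only, not the actual validator code: the `Result` class, the function names and the exact-match `same_lines` comparison are assumptions made for the example (a real comparison of floating-point output would presumably allow some tolerance).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Result:
    name: str
    lines: List[str]
    outcome: str = "success"
    validate_state: str = "init"

def validate_workunit(results: List[Result],
                      expected_lines: int,
                      agree: Callable[[Result, Result], bool],
                      min_quorum: int = 2) -> Optional[Result]:
    """Return the canonical result, or None if the quorum is lost."""
    # Stage 1: check each result individually (here: only the line count).
    survivors = []
    for r in results:
        if len(r.lines) == expected_lines:
            survivors.append(r)
        else:
            r.outcome = "validate error"
            r.validate_state = "invalid"
    # Stage 2: look for enough surviving results that agree with each other.
    for candidate in survivors:
        agreeing = [r for r in survivors if agree(candidate, r)]
        if len(agreeing) >= min_quorum:
            for r in survivors:
                r.validate_state = "valid" if agree(candidate, r) else "invalid"
            return candidate  # chosen as the canonical result
    return None  # quorum lost: another result would have to be sent out

def same_lines(a: Result, b: Result) -> bool:
    # Placeholder comparison: exact match of all lines (an assumption).
    return a.lines == b.lines

a = Result("a", ["1.0", "2.0"])
b = Result("b", ["1.0", "2.0"])
c = Result("c", ["1.0"])  # wrong number of lines, fails the syntax check
canonical = validate_workunit([a, b, c], expected_lines=2, agree=same_lines)

d = Result("d", ["1.0", "2.0"])
e = Result("e", ["1.0", "3.0"])  # passes the syntax check but disagrees with d
lost = validate_workunit([d, e], expected_lines=2, agree=same_lines)
```

In the first call `a` becomes the canonical result and `c` is marked "validate error"; in the second call no two results agree, so the function returns None and the workunit would need one more result.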

Quote:

It's still a miracle to me why the Phenom in the new configuration is running without errors.

Btw. No CUDA errors with Milkyway, Primegrid and GPUGRID with the i7 so far.

There might be another problem hiding behind the upload problem, but that I can't see yet.

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

Both file upload handlers

Both file upload handlers (BRP & GW) have been updated.

I'll take another look tomorrow at the validate errors we still get from BRP. If all works OK, the rate should be below 1/3 of what we have now.

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

The BRP validate error rate

The BRP validate error rate is certainly dropping, though not as fast as I hoped (down by 1/3 so far).

Ziegenmelker, does your problem persist?

From the configurations you posted I'd suspect your network problems to be a compatibility issue between the Linux driver, the interface (of the Phenom?) and the switch. A common issue is "jumbo frames": oversized Ethernet frames that the Linux driver may send but the switch can't handle. You may want to experiment with the MTU size settings.
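
One way to test an MTU is to send pings with the "don't fragment" bit set (on Linux, `ping -M do -s <payload> <host>`) and see which payload sizes still get through. The payload that exactly fills one frame is the MTU minus the IPv4 and ICMP headers. The small sketch below just does that arithmetic; it assumes IPv4 without header options, and the helper name is made up.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo header

def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits into one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER

# 1492: typical for PPPoE/DSL, 1500: standard Ethernet, 9000: common jumbo size
payloads = {mtu: max_ping_payload(mtu) for mtu in (1492, 1500, 9000)}
```

If a payload of 1472 bytes passes but anything larger fails with the "don't fragment" bit set, the path MTU is 1500 and no device on the way is accepting jumbo frames.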

BM


M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Bernd, I still have a lot of

Bernd, I still have a lot of Primegrid and Milkyway tasks. Some Primegrid tasks take about 200h, and I ran a CUDA monster task (~25h) at GPUGRID without error (2 files uploaded without 'transient upload error'). Today I ran 5 BRP tasks on the i7, with no validate errors so far:
3 valid and 2 pending, although some of them got that mysterious 'transient upload error'. I will report if any errors show up. At the moment it looks good.

About the MTU settings: I left them alone, because there are only a few presets, e.g. 1500 (Ethernet, DSL broadband), although I can enter any value and there are also options for Ethtool. What MTU setting would you suggest? In the past I tried 1500, but it made no difference.
The switch should not be the problem, because Win 7 crunches the BRP tasks without any error, or do you think Windows uses a different packet size?
But if I'm right, DSL is limited to 1492 anyway, so this might be a good value to test? I will try it tomorrow.

Sorry for my late reply.

Richard de Lhorbe
Joined: 15 Dec 05
Posts: 46
Credit: 9518538727
RAC: 800706

My invalid CUDA error rate

My invalid CUDA error rate for May 19th, which I think was the first full day with the reloaded FUH, was no different: 11 invalids out of about 30 CUDA workunits completed (it has been between 10 and 15 per day recently, so 11 is about average). Today, May 20th, I have only 5 invalids, with about three more possible calculations before the date changes over to the 21st. So, an improvement, but still errors, compared to two weeks ago when I was generating zero errors.

I have been getting a different error code than ziegenmelker does: it is shown as Validate error (2:00000010). No one has yet said what this code represents.

Regards
Richard

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

RE: Validate error

Quote:
Validate error (2:00000010), no one has yet said what this code represents.

It means that these result files contain something other than the printed representation of finite floating-point numbers, probably NaNs ("not a number") or "INF" (infinity). This points to real calculation errors, not network problems.
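
A check of this kind can be sketched in a few lines. This is only an illustration, not the validator's actual code: it assumes the result file is plain text with whitespace-separated numbers (the real BRP result format is not described in this thread), and the function name is made up.

```python
import math

def non_finite_tokens(text: str) -> list:
    """Return every token that is not a finite floating-point number."""
    bad = []
    for token in text.split():
        try:
            value = float(token)
        except ValueError:
            bad.append(token)  # not a number at all
            continue
        if not math.isfinite(value):
            bad.append(token)  # parses as a float, but is NaN or infinity
    return bad

suspect = non_finite_tokens("1.5 -2e3 nan INF abc")
clean = non_finite_tokens("0.25 3.0\n4e-2")
```

Here `suspect` contains the tokens "nan", "INF" and "abc", while `clean` is empty, so only the first text would be rejected.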

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4313
Credit: 250741417
RAC: 34819

RE: The switch should not

Quote:
The switch should not be the problem, because Win 7 crunches the BRP tasks without any error, or do you think Windows uses a different packet size?

This is my suspicion, judging from the symptoms.

Something else to try would be to connect the computers directly, without a switch, just a cable.

BM


Richard de Lhorbe
Joined: 15 Dec 05
Posts: 46
Credit: 9518538727
RAC: 800706

Actually, I have just noticed

Actually, I have just noticed that the situation is far, far worse. While it is true that out of approximately 30 CUDA workunits per day that this one computer uploads, about one third are immediately tagged as invalid (error code 2:00000010), it turns out that ALL of the remaining two thirds are eventually marked as invalid too, so every single CUDA workunit I have processed over the past week and a half has been a waste of time. And this same computer, with the same BOINC version, same OS, same Nvidia card and same driver, was 100% successful up until two weeks ago! Once the CUDA workunits are eventually tagged as invalid, the error code seems to change to Validate error (8:00001000); again, I have no idea what this one means. Comments please! At least I now know why my daily credit continues to collapse and won't recover ...

Regards
Richard
