This looks related to a bug in the file upload handler that was fixed in February this year, although I don't (yet) understand why it only occurs with Linux Core Clients and the BRP App. Anyway, we'll update the FUHs on both machines.
BM
Replacing the file upload handlers will take a few hours, but should be finished by 16:00 UTC (18:00 CEST). To avoid further validation errors from upload retries I suggest you suspend your network connection until then.
BM
My C/C++ knowledge is close to zero, but shouldn't there be a flag/constant/property in line 161 like "O_OVERWRITE"?
In case the Client is uploading one or more files of the BRP tasks with the wrong file size (in case the manager is showing the truth), the FUH might send it to the scheduler and it is marked "invalid". The normal behavior should be that the FUH waits for the next try of the Client to upload the file(s). Just a guess. ;)
It's still a miracle to me why the Phenom in the new configuration is running without errors.
Btw. no CUDA errors with Milkyway, Primegrid and GPUGRID on the i7 so far.
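On the O_OVERWRITE question above: POSIX has no such flag; the usual choices when reopening a file are O_TRUNC (discard what is already there) or O_APPEND (resume at the current end of the file). The sketch below is only a hypothetical illustration of how an upload handler might reopen a partially transferred file on a retry; it is not the actual Einstein@Home FUH code, and the resume parameter is invented for illustration.

// Hypothetical sketch, not the real file upload handler.
#include <fcntl.h>
#include <sys/stat.h>

// Reopen the target file for an upload retry. 'resume' decides whether to
// append to the bytes already received or to truncate and start over.
int open_upload_target(const char* path, bool resume) {
    int flags = O_WRONLY | O_CREAT | (resume ? O_APPEND : O_TRUNC);
    return open(path, flags, S_IRUSR | S_IWUSR | S_IRGRP);  // mode 0640
}

Whether the real handler appends or truncates on a retry is exactly the kind of detail a retry-related bug could hinge on; this only shows which flags exist.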
Quote:
In case the Client is uploading one or more files of the BRP tasks with the wrong file size (in case the manager is showing the truth), the FUH might send it to the scheduler and it is marked "invalid". The normal behavior should be that the FUH waits for the next try of the Client to upload the file(s). Just a guess. ;)
It's a bit more complicated.
The FUH doesn't talk to the "scheduler"; it doesn't even talk to the DB.
The client tries to upload the files until it succeeds or times out. This is a communication only between the Client and the FUH. Then, on the next scheduler contact, triggered by the need to fetch work or by an approaching deadline, the Client reports the results it uploaded, either as "success" or as "upload error" (or another "client error" if applicable). An "error" result is marked invalid by the scheduler immediately.
When a quorum for a workunit is reached (on E@H: when two successful results have been reported), the validator examines the results of that workunit. First it checks each result individually for syntactic properties, e.g. the number of lines. If a result fails this check, it is marked "validate error" (outcome) and "invalid" (validate state). The results that "survive" this check are then compared. If two are found that "agree", these are marked valid, credit is granted, and one is chosen as the "canonical result". The remaining results (that don't "agree") are marked "invalid".
If the "quorum is lost" during validation, i.e. there are not two results "surviving" the syntax check and "agreeing", the validator triggers another result for this workunit to be generated and sent out. It examines the results of this workunit again in the same way when the additional result is reported.
Quote:
It's still a miracle to me why the Phenom in the new configuration is running without errors.
Btw. no CUDA errors with Milkyway, Primegrid and GPUGRID on the i7 so far.
There might be another problem hiding behind the upload problem, but I can't see it yet.
BM
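For readers who prefer code to prose, here is a minimal sketch of the two-stage validation described above: a per-result syntax check, then pairwise comparison until two results agree and one becomes the canonical result. The data structure, the expected line count and the tolerance are assumptions made up for illustration; this is not the actual Einstein@Home validator.

// Illustrative sketch of the validation flow described above; not the
// actual Einstein@Home validator.
#include <cmath>
#include <cstddef>
#include <vector>

struct Result {
    int id = 0;
    std::vector<double> values;   // numbers parsed from the result file
    bool syntax_ok = false;       // passed the per-result check
    bool valid = false;           // final validate state
};

// Stage 1: per-result syntactic check, e.g. the expected number of lines.
bool check_syntax(const Result& r, std::size_t expected_lines) {
    return r.values.size() == expected_lines;
}

// Stage 2: do two surviving results "agree" within a tolerance?
bool results_agree(const Result& a, const Result& b, double tol) {
    if (a.values.size() != b.values.size()) return false;
    for (std::size_t i = 0; i < a.values.size(); ++i)
        if (std::fabs(a.values[i] - b.values[i]) > tol) return false;
    return true;
}

// Find two agreeing results, mark them valid and return one as the
// "canonical result"; everything else stays invalid. Returns -1 if the
// quorum is lost, in which case another result would be generated.
int find_canonical(std::vector<Result>& results,
                   std::size_t expected_lines, double tol) {
    for (Result& r : results)
        r.syntax_ok = check_syntax(r, expected_lines);
    for (std::size_t i = 0; i < results.size(); ++i) {
        if (!results[i].syntax_ok) continue;
        for (std::size_t j = i + 1; j < results.size(); ++j) {
            if (!results[j].syntax_ok) continue;
            if (results_agree(results[i], results[j], tol)) {
                results[i].valid = results[j].valid = true;
                return results[i].id;   // canonical result
            }
        }
    }
    return -1;   // quorum lost
}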
Both file upload handlers (BRP & GW) have been updated.
I'll take another look tomorrow at the validate errors we still get from BRP. If all works ok, they should be down to below 1/3 of what we have now.
BM
The BRP validate error rate is certainly dropping, though not as fast as I hoped (down by 1/3 so far).
Ziegenmelker, does your problem persist?
From the configurations you posted I'd suspect your network problems to be a compatibility issue between the Linux driver, the interface (of the Phenom?) and the switch. A common issue is "jumbo frames": large IP frames that the Linux driver may send but the switch can't handle. You may want to fiddle with the MTU size settings.
BM
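If you want to see what the Linux driver is actually using before changing anything, the small Linux-only sketch below reads an interface's MTU via the SIOCGIFMTU ioctl (the same number that ip link or ifconfig report). The interface name eth0 is just an assumed default. Anything above 1500 means jumbo frames are in play; a PPPoE DSL line typically leaves 1492 bytes (1500 minus the 8-byte PPPoE header), so 1492 is a sensible value to test there.

// Linux-only sketch: print the MTU of a network interface.
#include <cstdio>
#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char** argv) {
    const char* ifname = (argc > 1) ? argv[1] : "eth0";  // assumed default
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { std::perror("socket"); return 1; }

    struct ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) { std::perror("SIOCGIFMTU"); close(fd); return 1; }
    std::printf("%s MTU: %d\n", ifname, ifr.ifr_mtu);  // > 1500 means jumbo frames
    close(fd);
    return 0;
}

Changing the value would be something like "ip link set dev eth0 mtu 1492", or the equivalent in the distribution's network settings.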
Bernd, I still have a lot of Primegrid and Milkyway tasks. Some Primegrid tasks take about 200h, and I did a CUDA monster task (~25h) at GPUGRID without error (2 files uploaded without 'transient upload error'), but today I did 5 BRP tasks on the i7 and no validate errors so far:
3 valid and 2 pending, but some of them got that mysterious 'transient upload error'. I will report if any errors show up. Atm it looks good.
About the MTU settings: I left them alone, cause there are only a few preset options, e.g. 1500 (Ethernet, DSL broadband), but I can add any value and there are options for ethtool. What MTU setting would you suggest? In the past I tried 1500, but it made no difference.
The switch should not be the problem, cause Win 7 crunches the BRP tasks without any error, or do you think Windows uses a different packet size?
But if I'm right, DSL is limited to 1492 anyway, so this might be a good value to test? I will try it tomorrow.
Sorry for my late reply.
My invalid CUDA error rate for May 19th, which I think was the first full day with the reloaded FUH, was no different: 11 invalids out of about 30 CUDA workunits completed (it has been between 10 and 15 per day recently, so 11 is about average). Today, May 20th, I have only 5 invalids, with about three more possible calculations before the date changes over to the 21st. So, an improvement, but still errors compared to two weeks ago, when I was generating zero errors.
I have been getting a different error code than ziegenmelker does, shown as Validate error (2:00000010); no one has yet said what this code represents.
Regards
Richard
Quote:
Validate error (2:00000010); no one has yet said what this code represents.
It means that these result files contain something other than just the printed representation of finite floating-point numbers, probably NaNs ("not a number") or "INF" (infinity). This points to real calculation errors, not network problems.
BM
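To illustrate the kind of check implied here, the sketch below scans a result file and flags any token that does not parse as a finite floating-point number (NaN, inf, or other garbage). The file name and the whitespace-separated format are assumptions; this is not the project's validator, just a way to see the symptom Bernd describes.

// Illustrative only: flag non-finite values (NaN/INF) in a whitespace-
// separated result file.
#include <cmath>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

int main(int argc, char** argv) {
    const char* path = (argc > 1) ? argv[1] : "result.out";  // assumed file name
    std::ifstream in(path);
    if (!in) { std::cerr << "cannot open " << path << '\n'; return 1; }

    std::string token;
    bool clean = true;
    while (in >> token) {
        try {
            double v = std::stod(token);      // throws on non-numeric text
            if (!std::isfinite(v)) {          // catches NaN and +/-inf
                std::cout << "non-finite value: " << token << '\n';
                clean = false;
            }
        } catch (const std::invalid_argument&) {
            std::cout << "non-numeric token: " << token << '\n';
            clean = false;
        } catch (const std::out_of_range&) {
            std::cout << "out-of-range value: " << token << '\n';
            clean = false;
        }
    }
    return clean ? 0 : 2;   // a non-zero result here is what a "validate error" flags
}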
Quote:
The switch should not be the problem, cause Win 7 crunches the BRP tasks without any error, or do you think Windows uses a different packet size?
This is my suspicion, judging from the symptoms.
Something else to try would be to connect the computers directly, without a switch, just a cable.
BM
Actually, I have just noticed that the situation is far, far worse. While it is true that of the approximately 30 CUDA workunits per day that this one computer uploads, about one third are immediately tagged as invalid (error code 2:00000010), it turns out that of the remaining two thirds, ALL are eventually marked invalid as well, so every single CUDA workunit I have processed over the past week and a half has been a waste of time. And this same computer, same BOINC version, same OS, same Nvidia card and same driver was 100% successful up until two weeks ago!
Once the CUDA workunits are eventually tagged as invalid, the error code seems to change to Validate error (8:00001000); again, I have no idea what this one means. Comments please! At least I now know why my daily credit continues to collapse and won't recover...
Regards
Richard