Validate error - What this really means!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118369478855
RAC: 25530870
Topic 196089

Regular readers of this forum will probably be aware of the long running "Validate errors [Pls post here]" sticky thread - now closed. I've decided to replace that thread with this new one so please don't post there any more. I'm going to start off with a full explanation of what a 'Validate error' really is. I would ask that people who wish to post in this thread at least make the effort to read this information and then check that the error they intend to post about is really a 'Validate error' rather than any other type of error that has caused a result to be marked as 'Invalid'. If you view your tasks list on the website and see a status of anything other than exactly "Validate error" you shouldn't be posting here. I will quite likely just delete posts that ignore this request.

So what exactly is a 'Validate error'? It's simply an otherwise successfully completed result that the validator daemon immediately recognises as having something wrong, without even needing to compare it with any other result in the workunit quorum.* For example, result files need to conform to certain specifications, such as number of columns, number of rows, elements being numerical, elements lying within certain value ranges, etc. Any result that fails the format specification will be marked as a 'Validate error' without further ado.

* EDIT: The validator doesn't get called until there is something to validate - ie usually two results to check against each other. Before actually checking, the validator will do a 'sanity check' of each result and it's at that point it will create a 'Validate error' if it doesn't like what it finds. END OF EDIT:
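
To make the idea of a 'sanity check' concrete, here is a minimal sketch in Python. It is not the project's validator code. It assumes a result file of whitespace-separated text (the real FGRP1 result files are not plain text), and the function name, its parameters and the shape message are all made up for illustration. Only the two messages about numbers and ranges are taken from what the validator actually reports.

```python
# Hypothetical sketch of a per-result sanity check; NOT the real validator.
NOT_A_NUMBER = "result file has entries that aren't numbers"
OUT_OF_RANGE = "a number is out of valid range for this result"
BAD_SHAPE = "result file has the wrong number of rows or columns"  # invented message


def sanity_check(lines, n_columns, low, high, max_rows):
    """Return the list of reasons a result fails; an empty list means it passes."""
    # Assumption: one row per non-blank line, fields separated by whitespace.
    rows = [line.split() for line in lines if line.strip()]

    bad_shape = (
        not rows
        or len(rows) > max_rows
        or any(len(row) != n_columns for row in rows)
    )
    not_numeric = False
    out_of_range = False

    for row in rows:
        for field in row:
            try:
                value = float(field)
            except ValueError:
                not_numeric = True
                continue
            if not (low <= value <= high):
                out_of_range = True

    # Report reasons in a fixed order so the output is predictable.
    reasons = []
    if bad_shape:
        reasons.append(BAD_SHAPE)
    if not_numeric:
        reasons.append(NOT_A_NUMBER)
    if out_of_range:
        reasons.append(OUT_OF_RANGE)
    return reasons
```

A result with any reasons in the list would be marked 'Validate error' on its own, before any comparison with the other results in the quorum.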

In the past, bugs in the validator code itself and various other server-side 'problems' of one form or another have caused the validator to mark results as 'Validate error' incorrectly. If you care to carefully peruse the full long running thread, you will see some examples of these from some time ago. I seem to remember an example where non-ASCII characters in a user name caused the validator to choke on the result. Of course, it's possible for some further problem like this to be found but that is said to be rather unlikely as the validator is now considered to be mature and relatively stable.

I am informed that these days, a lot of 'validate errors' are a sign of problems at the client side. Such problems might be caused by user correctable things like excessive overclocking, overheating, faulty PSUs, swollen capacitors, faulty RAM modules, faulty GPUs, etc. If you start to see a lot of 'Validate errors' for the first time, please stress test your hardware and check obvious things like turning off overclocking, checking heat sinks for blockage and cooling fans for 'dry' bearings, etc.

My own belief is that there is possibly still something to be found with the validation of FGRP1 tasks on certain platforms. Here are some stats that I find quite interesting. The rate of Validate errors in the last week for FGRP1 tasks is highly platform dependent:

  * Windows/x86 - 0.25% ie 648 errors out of 254931 results
  * Linux/x86 - 6.4% ie 4384 errors out of 68525 results
  * Mac OS X/Intel - 8.4% ie 3399 errors out of 40333 results
  * Unknown platform - 4.4% ie 119 errors out of 2691 results
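
For anyone who wants to check the arithmetic, the percentages are just errors divided by results. Here is a small Python sketch using the counts quoted above; the dictionary and function names are only for illustration.

```python
# Validate error counts for FGRP1 tasks as quoted in this post: (errors, results).
fgrp1_counts = {
    "Windows/x86": (648, 254931),
    "Linux/x86": (4384, 68525),
    "Mac OS X/Intel": (3399, 40333),
    "Unknown platform": (119, 2691),
}


def error_rate_percent(errors, results):
    """Validate errors as a percentage of all results returned."""
    return 100.0 * errors / results


for platform, (errors, results) in fgrp1_counts.items():
    print(f"{platform}: {error_rate_percent(errors, results):.2f}%")
```

On these numbers the Linux/x86 rate comes out at roughly 25 times the Windows/x86 rate.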

I have a number of systems running OS X and the rate of 'Validate errors' with FGRP1 tasks has been quite concerning. I just set up a brand new system and it got 7 errors in the first two days of operation, all on FGRP1 tasks. It's possible for me to actually see the reason that the validator decided to mark them as 'Validate error'. In all 7 cases, the reason was given as
- result file has entries that aren't numbers
The result files aren't plain text so it's not possible to just browse them to find what the validator is upset about. Bernd has told me that he might be able to give me a tool to pursue this further so I'll post here again if something more can be found.

Before finally posting this message, I've just had a look at the latest results for my new host and a further 'Validate error' has now appeared. This time there are two reasons given by the validator

- result file has entries that aren't numbers
- a number is out of valid range for this result


It's fairly easy for me to get the error message from the validator daemon for 'Validate errors' like this. All I need is the resultID (the TaskID and NOT the Workunit ID) as a number, NOT a link. I'll grab a few IDs from the old thread and see what type of messages turn up.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118369478855
RAC: 25530870

I've had a look at the validate errors on one of joe areeda's hosts (this list of errors) and all 4 of them currently showing have the following error message from the validator

- result file has entries that aren't numbers
- a number is out of valid range for this result


At this point, I'll wait to see if Bernd can give me some sort of tool or technique to unscramble result files and so check if the contents of these files can give further clues. It's also possible that the large difference in the error rates of other platforms when compared to Windows might help to justify the admins in spending some time investigating this.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118369478855
RAC: 25530870

Until quite recently, I've had no hosts with a CUDA capable GPU and in my preferences I've deliberately selected NOT to run CPU apps for tasks where GPU versions are available. So, until now, I've not had anything to do with the particular peculiarities of running BRP tasks at all. I've recently purchased a GTX550Ti just to have a bit of a play. It's been running now for a couple of weeks in a quad core host where all cores are running E@H CPU tasks and I've not seen any errors at all and a search just now shows there are none in the online database.

I've seen some comments about Validate errors for BRP4 tasks so I thought I'd look at error rates per platform in the same way as I did for FGRP1 tasks. Because of the different plan classes, there are quite a few more entries in the table of error rates but on the whole the rates for Validate errors are quite low when compared to the same rates for FGRP1 tasks. Here are some example error rate percentages:

  * Windows/x86 - BRP3SSE - 0.187%
  * Windows/x86 - BRP3cuda32 - 0.799%
  * Mac OS X on Intel - BRP3SSE - 0.026%
  * Mac OS X on Intel - BRP3cuda32OSX - 0.498%
  * Linux/x86 - BRP3SSE - 0.072%
  * Linux/x86 - BRP3cuda32nv270 - 1.10%

On the whole, these seem quite low, with a bit of an increase when a GPU is involved in the crunching. That's not surprising when you consider what can go wrong with GPUs, particularly when people start pushing them.

So, when thinking about the FGRP1 data I included in the first post, it seems to me like the much higher error rates on OS X or Linux really do need investigating.

Cheers,
Gary.

Nigel Garvey
Joined: 4 Oct 10
Posts: 51
Credit: 35247185
RAC: 86045

Thanks for pursuing this, Gary.

Quote:
So what exactly is a 'Validate error'. It's simply an otherwise successfully completed result that the validator daemon is immediately recognising as having something wrong without even needing to compare with any other result in the workunit quorum.

But presumably the validator doesn't actually look at it until there are other results to compare, since that's when the errors seem to appear.

To start the new collection, here's a "Validate error" Gamma-ray pulsar task (Mac platform) from last week, which I was saving up to post in the other thread:

259742414

The majority of my GRP tasks do validate, but the ones which don't are still annoying as each represents a few hours' processor time which could have been used more effectively. If you do manage to find the cause of this elusive problem and it's subsequently fixed, that would be brilliant.

NG

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Quote:

I've had a look at the validate errors on one of joe areeda's hosts (this list of errors) and all 4 of them currently showing have the following error message from the validator

- result file has entries that aren't numbers
- a number is out of valid range for this result

At this point, I'll wait to see if Bernd can give me some sort of tool or technique to unscramble result files and so check if the contents of these files can give further clues. It's also possible that the large difference in the error rates of other platforms when compared to Windows might help to justify the admins in spending some time investigating this.


Thanks Gary,

Is it possible for me to see the error messages from the validator?

For the record that machine is not overclocked and the validation error rate is pretty low. I'll count the errors vs tasks as soon as I get a chance.

Joe

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118369478855
RAC: 25530870

Quote:
... But presumably the validator doesn't actually look at it until there are other results to compare, since that's when the errors seem to appear.


Yes, I should have explained it better. I'll fix that in the original post. Thanks!

Quote:

To start the new collection, here's a "Validate error" Gamma-ray pulsar task (Mac platform) from last week, which I was saving up to post in the other thread:

259742414


The validator gave the following reasons for marking it as a validate error:

- result file has entries that aren't numbers
- a number is out of valid range for this result


As you can see, it's the same set of reasons as for Joe Areeda and for the last one of mine.

Quote:
The majority of my GRP tasks do validate, but the ones which don't are still annoying as each represents a few hours' processor time which could have been used more effectively. If you do manage to find the cause of this elusive problem and it's subsequently fixed, that would be brilliant.

My sentiments too! Hopefully we might get a resolution shortly.

Cheers,
Gary.

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

The history on that computer (it's an AMD Phenom II x6 with an nVidia 240 running Ubuntu 11.04, no overclocking):

4 validate errors on Gamma-ray pulsar search #1 v0.23
Total tasks:
Gamma-ray pulsar search #1 v0.23: 83 pending, 61 valid, 36 valid Gamma-ray pulsar

So about 11% (4/36) of my Gamma Pulsar are validate errors.

I do have almost a page of computing errors but that was my fault for updating the nVidia drivers without stopping BOINC.

If it would help, I'm more than willing to run a debug version with breakpoints or an instrumented version.

Joe

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Another computer (I7 with GTX 560 running Ubuntu 11.10, also dual boots Win 7 Pro but the following are all Ubuntu)

16 validate errors on GPS v 0.23, 355 valid tasks, 106 valid GPS v0.23

Validate errors = 15% (16/106) pretty close to the other computer.

Joe

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 251651711
RAC: 35676

Quote:
... it's AMD Phenom II x6 with nVidia 240 running Ubuntu 11.04 ...

32 or 64 Bit?

My current suspicion is that these validate errors do happen 'preferably' on 64Bit machines, either Linux or recent Mac OS versions.

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118369478855
RAC: 25530870

Quote:
Is it possible for me to see the error messages from the validator?


Only through an Admin or a Mod.

I've taken on the job of running the queries and trying to keep participants informed. I'll try to keep a close watch on this.

Cheers,
Gary.

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Quote:
Quote:
... it's AMD Phenom II x6 with nVidia 240 running Ubuntu 11.04 ...

32 or 64 Bit?

My current suspicion is that these validate errors do happen 'preferably' on 64Bit machines, either Linux or recent Mac OS versions.

BM


Bernd

Both machines I counted the errors for are 64 bit Linux. The AMD based one is computer #3805542 and the I7 is #4237123.

I am decent at C/C++ and have some time, so I'm happy to help find a case that repeatably fails. I will need some help getting started.

Joe
