This one has wasted a lot of time. So far there have been 10 validate errors for this one work unit, and it has been sent out 2 more times! Something is not right.
Task ID = 278234847
This is obviously due to a problem with the data being crunched. These occur occasionally but unfortunately can't be predicted. The Devs rely on user reports like yours, so thank you. There are previously reported cases in this very thread - like this one, for example. I try to notice such reports and send the details to the Devs.
OK. I think I understand. Like this?
....
I've left out your list since those validate errors are nothing like the one above reported by Betreger. Apart from the fact that it's a different app (BRP4), the real issue is that all tasks in the quorum end up with a validate error. This is most likely a problem with the task data.
I've looked at several in your list and it doesn't seem to be a problem with the data, since there aren't multiple validate errors in each quorum. In most cases, each quorum contains only a single validate error, coming from your host.
I suspect you may need to investigate what is happening on your host. If you read the opening post in this thread, the indications are that there is an issue that causes FGRP tasks on Mac OS X and Linux to fail with a validate error at somewhere around a 5% - 10% rate. However, in your case, the failure rate seems to be much higher than that, judging by the last couple of pages of your results list. Also, you should ask yourself why there is such a large difference between CPU time and Run time for all the E@H CPU apps running on your machine. It's not as if you are running CUDA tasks, which need a lot of CPU support and so might be stealing CPU cycles from CPU tasks and causing the Run time to blow out.
BOINC sees your host as having 8 cores, so I wonder if it has anything to do with HT (Hyper-Threading). As an experiment, why don't you set the preference to use 50% of your CPUs and see what sort of difference that makes to run times and error rates? I suspect you might get quite an improvement. Maybe the machine is overheating and throttling itself in some way, and that may be causing the long run times.
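If you'd rather try the 50% setting locally instead of through the website preferences, one way (just a sketch - check the tag name against your BOINC version's documentation) is to drop a global_prefs_override.xml into the BOINC data directory and then restart the client so it picks up the override:

<global_preferences>
   <max_ncpus_pct>50.0</max_ncpus_pct>
</global_preferences>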
Cheers,
Gary.
Lot of people wasting time on this one: http://einsteinathome.org/workunit/117989028
Thanks, Gary, for looking at it. I appreciate your help.
I'd like to throw it back your way. Here's why:
Yes, I'm really pushing that machine. It's an iMac quad core that came from the factory configured with 8 logical cores - their idea, not mine. And I've got a LOT of stuff running on it, including several virtual machines. So both the CPU and the RAM are maxed out. Not efficient, I know, but the only way I can allocate resources the way I want to. (And it's likely not a thermal problem. Yes, it runs pretty hot, but I monitor that, and it's safely within limits.)
Anyway, the point is that I'm supporting lots of other BOINC projects and ALL the other Einstein applications in a very intense environment. Nevertheless, despite the heavy utilization on this machine, Gamma-ray pulsar search #1 v0.23 is the ONLY place I see errors. Nowhere else.
So - go figure. How can the problem be in my machine if it doesn't error anywhere else, including other Einstein work? And if it somehow is my machine's fault, then I submit that Gamma-ray pulsar search #1 v0.23 is not properly designed (too brittle), since everything else runs fine.
Just trying to help. My fix is easy: stay away from Gamma-ray pulsar search #1 v0.23. Is that what you want? I would think not, since this can happen again...
--Bill
As mentioned in a previous message, I have reported this to the Devs. I've been copied on an email exchange about this, and the current thinking seems to be that it might be RFI (radio frequency interference) in the original telescope data.
The problem seems mainly confined to WUs whose name is of the form p2030.20111018.G35.89*. This is the case for the examples recently posted here, and for your report as well. The most recent email I've received advises that this dataset has now been withdrawn. I imagine that the next time an affected host contacts the scheduler, any tasks for this data that have not been started will be aborted by the server. If you have such a task currently crunching, you should abort it to avoid wasting further time.
Cheers,
Gary.
"NAN": not a
)
"NAN": not a number
Floating-point-numbers on an x86-processor are represented by an e.g. 64-bit-
representation; not all of these representations are valid numbers. The
invalid representations are called NANs. A special NAN is returned as a
result of every invalid floating-poing-number-operation e.g.
( [+infinity] + [-infinity] ).
(see e.g. the manual of the 8087 numeric processor extension)
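For illustration, here is a small C sketch of my own (not taken from the application code) that produces such a NAN and shows why it upsets comparisons:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double pos_inf = INFINITY;            /* +infinity */
    double neg_inf = -INFINITY;           /* -infinity */
    double nan_val = pos_inf + neg_inf;   /* invalid operation -> NAN */

    printf("inf + (-inf) = %f\n", nan_val);             /* prints nan */
    printf("isnan() says:  %d\n", isnan(nan_val) != 0); /* 1: it is a NAN */
    printf("NAN == NAN:    %d\n", nan_val == nan_val);  /* 0: a NAN never compares equal, not even to itself */
    return 0;
}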
Hope this helps,
Sincerely
Thomas
Happy Easter, here is another one.
WUID = http://einsteinathome.org/task/281746462
another one: http://einsteinathome.org/task/282072586,
Why am I so lucky to get wingmen who create validate errors?
http://einsteinathome.org/task/282522470
Task ID = 280230426
I don't normally check these things (complete amateur), so I only just noticed there was an issue and decided to check the forum. I have no idea how many of these have failed on my computer. Based on this thread, however, I've turned off 'Gamma-ray pulsar search #1' in my account. Losing that many hours makes me sad.
Yet another OS X, here. Intel Core 2 Duo on a MacBook Pro. I'm not overclocking. The closest thing to special I'm doing is using the GPU. I have been pushing things the last few days (possible overheating), but I'm pretty sure I wasn't when this task was run. Everything else seems fine.
Thanks!
-M
The Gamma-ray pulsar search #1 v0.23 is the only app giving me validation problems; I've had about 60 of those on various machines over the last 30 days. But I have many more that just work, so I am not too concerned.
http://einsteinathome.org/account/tasks&offset=0&show_names=0&state=4
All machines run a regular, non-overclocked setup, all on 64-bit Linux.
I am not too surprised about platform differences. For example, if something is compared against a random distribution to decide whether it is special, and the code falls back to the OS' random generator (take that as a stand-in for any math function), the odd difference between platforms could easily creep in. And there are differences in how doubles are handled between 64-bit and 32-bit platforms, which may contribute too. Maybe those platform differences are even helpful for investigating what is happening, and for giving some extra confidence in the results that are flagged as "valid".
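To illustrate the 32-bit/64-bit point with a made-up example (nothing from the actual Einstein code): the same expression can round differently when intermediates are kept in 80-bit x87 registers (typical of older 32-bit x86 builds) than when they stay in 64-bit SSE registers (the default on x86-64):

#include <stdio.h>

int main(void)
{
    /* With plain 64-bit doubles (SSE, the x86-64 default), a + b rounds
       back to 1e16 and r comes out as 0.0.  If the compiler keeps the
       intermediate in an 80-bit x87 register (common on 32-bit x86
       without -mfpmath=sse), a + b holds 1e16 + 1 exactly and r is 1.0. */
    volatile double a = 1e16;
    volatile double b = 1.0;
    double r = (a + b) - a;
    printf("(1e16 + 1) - 1e16 = %g\n", r);
    return 0;
}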
What is unfair is that there is no credit for such benevolently produced invalid results. That is where the easiest bug to fix is, IMHO: the distinction between technical invalidity (as in cheaters - no credit, kick butt) and results one simply does not like (full credit). How to do that - no idea. Anyone willing to take some of their lifetime in hand and implement it? Then again, I am not so unhappy about people standing up and complaining when something doesn't work right, so maybe we should just leave it as it is?
Steffen