FGRP4 App version 1.14 vs 1.15, was: There's no CPU work available

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 987
Credit: 25171438
RAC: 0

Quote:

A pity resources are (need to be?) wasted like that though.


Well, far more resources would have been wasted had we not deployed the patch, as that would have meant a full reanalysis of a lot more tasks. FYI, the results returned so far aren't wrong, just not as good as they could have been. Fixing the bug now while trying to limit the negative effects of comparatively few technically "invalid" results seemed the best compromise considering all facts and alternatives, for all stakeholders, and that includes our volunteers.

Thanks,
Oliver

Einstein@Home Project

Jasper
Joined: 14 Feb 12
Posts: 63
Credit: 4032891
RAC: 0

That's what I was thinking: either it wasn't all that important and it would have been better to let the run finish, or it really needed to be done (I assumed this to hold true). However, in the latter case, I would suppose there is concern with the validity of 80%+ of this run already done. Did you have the opportunity to check how much impact it has? I mean, seeing an impact on only a couple of weeks' worth of work does not seem a good measure to me, but it is rather worrying.

I don't know about all results, but if every older WU is going to produce invalids when crunched with the newer 1.15 application, how reliable then are those 80%+ already done? Are there any older WUs left in pending state with the older application version that managed to get validated with 1.15? What about much older, already validated WUs? Wouldn't these, if checked, turn out to be invalids too?

Your initial reaction on Thursday looked rather like one of surprise to me: you sounded 100% sure that this could not happen: http://einsteinathome.org/node/198054&nowrap=true#144600

Quote:
Quote:

Another thing is that FGRP4 executables changed to version 1.15 after the short outage. Again, maybe related, maybe not.

Nope, that was just a scientific bug fix which also required to let the task pool to run dry.

Oliver


I had two invalids meanwhile (I expect more to come):
- one all 1.14: I don't remember ever seeing such a thing; the last one was completed on Thursday, October 1st, by someone else, on 1.14 as well;
- another one that waited forever for a wingman's result and was crunched again, twice, with 1.15.
I don't like to see that, I am just not used to it! I know, I'm only running everything on a single, older iMac now, but still, each one means trashing half a day of work, which, put in perspective, really is quite a waste for me. Others will likely care a lot less about that. However, I was really happily crunching away on Einstein, but that feeling has got a little dent at this point.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 987
Credit: 25171438
RAC: 0

Let's see...

Quote:
I would suppose there is concern with the validity of 80%+ of this run already done.


The WUs crunched so far aren't invalid but incomplete. The current plan is to analyze the remaining data locally. This is also why we needed to make the cut on a dataset boundary.

Quote:

but if every older WU is going to produce invalids when crunched with the newer 1.15 application


The WUs themselves are fine and can be analyzed with both app versions. They "just" won't validate across the two app versions.

Quote:
What about much older, already validated WUs? Wouldn't these, if checked, turn out to be invalids too?


See above. Validated WUs are still valid.

Quote:

However, I was really happily crunching away on Einstein, but that feeling has got a little dent at this point.


As I tried to explain, we know that some of your tasks won't validate and thus there is a certain waste, and we're sorry about that. Please keep in mind that this affects only a very limited number of volunteers: the only WUs affected are those that started with 1.14 and had one of their tasks error out (or be found invalid) after 1.15 got deployed. The current weighted total error rate for FGRP is ~3.5% (which includes the validation error discussed here), so that should give you an idea of the impact.
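
To make that condition concrete, here is a minimal Python sketch. The `Workunit` record, its field names and the deployment date passed in are all made up for illustration; none of this is the project's actual schema or validator code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Workunit:
    # Hypothetical record, not the project's real database layout.
    first_app_version: str  # app version of the original tasks, e.g. "1.14"
    # Dates on which one of the WU's tasks errored out or was found invalid.
    failure_dates: List[date] = field(default_factory=list)


def is_affected(wu: Workunit, deployed_1_15: date) -> bool:
    """True if the WU started on 1.14 and lost a task once 1.15 was live."""
    if wu.first_app_version != "1.14":
        return False  # WUs that started on 1.15 never get a 1.14 resend
    return any(d >= deployed_1_15 for d in wu.failure_dates)


# Placeholder deployment date, chosen only for this example.
deployed = date(2015, 10, 1)
print(is_affected(Workunit("1.14", [date(2015, 10, 3)]), deployed))
print(is_affected(Workunit("1.14", [date(2015, 9, 20)]), deployed))
```

Under these assumptions only the first WU counts as affected: its replacement task would have gone out as 1.15 and could not validate against the remaining 1.14 result.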

Regarding the 80%: we still have more data to crunch for FGRP, but that hasn't yet been enqueued into the pipeline, so the 80% figure, although technically correct, is not the whole picture.

If you don't want to risk any potentially (!) wasted cycles, you may of course opt out of FGRP for ten more days. By then all 1.14 tasks will have finished or timed out, so only 1.15 tasks will be in flight.

Anyhow, we should have announced the expected validation issue alongside the app deployment so that everyone who cares about it could have reacted accordingly. We failed to do that and thus are ready to take the heat. Again, sorry: even at an otherwise rather rock-solid project, mishaps and miscommunication can happen.

HTH,
Oliver

Einstein@Home Project

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0

...hm,

so far 15 WU's for the garbage can, and counting! That sucks me a lot, even if it only concerns some 3.5%. :-(

Greetings from the North

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 987
Credit: 25171438
RAC: 0

Quote:
That sucks me a lot

Please feel free to pause crunching FGRP until Oct. 15th. By then all 1.14 wingmen will have finished one way or another.

Best,
Oliver

Einstein@Home Project

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0

...joking!
That will not resolve the Problem for the 58 WU's finished and waiting for Validation! Added to the 15 WU's so far, that makes 73 WU's for /dev/null!

Greetings from the North

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 987
Credit: 25171438
RAC: 0

Quote:

That will not resolve the Problem for the 58 WU's finished and waiting for Validation!

All of them have 1.14 wingmen?

Anyhow, we're looking into re-validating such 1.15 tasks which would get you the deserved credit once two 1.15 validated later-on and produced the canonical result.

Stay tuned,
Oliver

Einstein@Home Project

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0

...it is mixed; I have pending 1.14 WU's with 1.15 wingmen and pending 1.15 with 1.14 wingmen.

BTW, the credit is not the problem; I just don't like to waste time and power. :-)

Greetings from the North

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5887
Credit: 119232066874
RAC: 25224786

Quote:
Quote:

That will not resolve the Problem for the 58 WU's finished and waiting for Validation!

All of them have 1.14 wingmen?


You guys shouldn't be wasting time on this.

The problem was for those 1.14 quorums where one task failed and there was a 1.15 resend. Two possibilities might arise. If the failed 1.14 'rises from the ashes' then the 1.15 resend will miss out. Otherwise, the 1.14/1.15 combo will fail validation and a further 1.15 resend will seal the fate of the poor old long suffering original 1.14 left standing.

There was never a problem for quorums with 1.15 original tasks (_0 or _1 extensions on the task name) since they cannot subsequently be paired with a 1.14 resend.

If DMMDL takes a look at the 1.15 tasks in the 58 'pendings' he mentions, he can assume that all of those with _0 or _1 (surely the majority) will validate at some point in the future. The only possible problem is for any that have a _2 or higher extension on the name. If there are any of these, they can ONLY fail validation if they actually are invalid, or if a previously missing 1.14 suddenly gets sent in now.

So of the 58, how many are 1.14s? Those are the ones likely to fail. Of the 1.15s, I would be very surprised if more than a couple miss out. Maybe DMMDL would like to count how many of the 1.15 pendings are _2 or above tasks with a 1.14 partner AND a further 1.14 task that hasn't actually failed and is just late. That is his maximum 'exposure' :-).
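
A rough Python sketch of that sorting rule, for anyone who wants to run it over their own pendings list. The task names below are made up, and the only thing assumed about real names is that they end in `_N`, as described above.

```python
def resend_index(task_name: str) -> int:
    """Return the trailing _N of a task name (0 and 1 are the originals)."""
    return int(task_name.rsplit("_", 1)[1])


def exposure(task_name: str, app_version: str) -> str:
    """Classify one pending task according to the reasoning above."""
    if app_version == "1.14":
        return "likely to fail"
    if resend_index(task_name) <= 1:
        # Original 1.15 tasks can never be paired with a 1.14 resend.
        return "safe"
    # A 1.15 resend only misses out if it really is invalid or if a
    # missing 1.14 result turns up after all.
    return "at risk if a late 1.14 arrives"


# Made-up (task name, app version) pairs, for illustration only.
pending = [("wu_a_0", "1.15"), ("wu_b_2", "1.15"), ("wu_c_1", "1.14")]
for name, version in pending:
    print(name, version, exposure(name, version))
```

Counting the entries in the last category gives the maximum 'exposure' among the 1.15 tasks.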

Of course, my thinking may be totally muddled so please correct me if I'm wrong.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5887
Credit: 119232066874
RAC: 25224786

Quote:
Anyhow, we're looking into re-validating such 1.15 tasks which would get you the deserved credit once two 1.15 validated later-on and produced the canonical result.


I'm not sure I fully understand this bit. A 1.15 task can only fail if it was paired with a 1.14 because of a timeout of the other 1.14. The 1.15 fails only if the timed-out 1.14 unexpectedly revives. Are you saying that any completed quorums that are 1.14, 1.14, 1.15, where the 1.15 has been excluded in this way, will be repeated and the failed 1.15 will then be matched against the new canonical result? Won't the failed 1.15 be gone before the new canonical result arrives? I guess you must be looking at retaining them all for however long it takes?

Seems like a lot of extra effort for you guys. Not wasting your time is more important than worrying about a few lost credits.

Cheers,
Gary.
