FGRP4 App version 1.14 vs 1.15, was: There's no CPU work available

chase1902

Joined: 13 Aug 11

Posts: 37

Credit: 1264094642

RAC: 0

Think I've been unlucky here

8 Oct 2015 22:56:37 UTC

Message 134264

Think I've been unlucky here, as I have got 20 invalids and possibly another 10 to 30 to come. Not that I am complaining. It's pretty obvious that there wasn't much of a way around it, unless you weren't going to use your CPU for a few weeks (or were going to use a backup project).
The only problem is that it takes longer to check whether the invalids are your computer having a wobbly or just the project.
Normally it's a quick glance at hopefully only one or two invalids, preferably no invalids or errors (it seems to have been a while since that).
It's a bit of a shock when you get a page full. I was thinking, damn, I thought I had fixed all the recent problems, which one is misbehaving now? But all O.K., as it does not look like there are any problems.

Not bothered about credits, they are just handy to see if a computer is doing well or not.

I don't see that this is any different to when your own computer has a fit and spits out a load of invalids or errors.

Now what I would like is one of those certificates, so if I could have some tasks which will prove fruitful that would be really nice, otherwise I shall continue as before crunching away.

John

archae86

Joined: 6 Dec 05

Posts: 3163

Credit: 7360421687

RAC: 2277436

chase1902 wrote:Think I've

8 Oct 2015 23:11:45 UTC

Message 134265 in response to message 134264

chase1902 wrote:

Think I've been unlucky here

handy to see if a computer is doing well or not.

I just looked at one of your computers, which only showed three invalids. But one of those appears not to have arisen from the matter under discussion.

WU 229053630 started life as a 1.15 unit. The first quorum failed to match, so a third 1.15 was distributed. The two in that quorum which were not your machine did match, so you were voted off the island.

Possibly that machine had an unhealthy moment, possibly both of the other machines somehow were unhealthy in a matched way, or just maybe the application has a wee bit of a problem. But in any case this particular case was not the 1.14 vs. 1.15 matter.

It does illustrate your point, that the presence of the 1.14/1.15 matter clouds up machine health assessment, now and for a while longer.

chase1902

Joined: 13 Aug 11

Posts: 37

Credit: 1264094642

RAC: 0

Yes that computer got a bit

8 Oct 2015 23:41:04 UTC

Message 134266

Yes, that computer got a bit touchy. It used to only run GPU tasks and was happy as Larry; I added some CPU tasks and it all went pear-shaped.
It wouldn't run any of S6 without loads of invalids.
I lowered the RAM speed a bit and it seems O.K. with the FGRP4, but I like to keep an eye on it in case it's something else.
I gave it a good clean and monitor the temperatures (which are always well within the norm, about 60/65C; the temp program also keeps max/min, and it has never gone over 65C).
I can't find anything else wrong with it, apart from it having its moments every now and then with CPU tasks and throwing out a few invalids.
Now my oldest computer, which you could forgive a few invalids, has managed to go through 2 CPU coolers (could be 3) and 2 power supplies, and runs fine.
The only reason I spotted the failing CPU cooler was that the run times shot up where it was throttling back so much. As it had a water-cooled cooler, there's nothing to see when they start going wrong. At least they changed it under warranty.

Jasper

Joined: 14 Feb 12

Posts: 63

Credit: 4032891

RAC: 0

RE: ... Are you saying

9 Oct 2015 6:07:45 UTC

Message 134267 in response to message 134263

Quote:

...
Are you saying that any completed quorums that are 1.14,1.14,1.15, where the 1.15 has been excluded in this way, will be repeated and the failed 1.15 will then be matched against the new canonical result?
...

I have seen one of those here: http://einsteinathome.org/workunit/228236227. When the wingman on 1.14 timed out, a first 1.15 was sent out. When that completed, a second one was sent. However, before that last one finished, the initial timed-out 1.14 completed as well, validating both 1.14 tasks and invalidating the 1.15 one. When the last issued 1.15 task finished, it was invalidated as well, so here you have two valid 1.14 and two invalid 1.15 tasks. In this particular case, I was one of the 1.15 volunteers.

As said by others, forget about credits, that's not the point and not worth the effort, at least AFAIC. I am merely a bit dismayed by the waste of time and useless power off the wall.

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 987

Credit: 25171438

RAC: 0

RE: Are you saying that any

9 Oct 2015 7:42:48 UTC

Message 134268 in response to message 134263

Quote:

Are you saying that any completed quorums that are 1.14,1.14,1.15, where the 1.15 has been excluded in this way, will be repeated and the failed 1.15 will then be matched against the new canonical result?

No. I was referring to the case 1.14/1.15/1.15/1.15 where the first 1.15 got marked as invalid (not inconclusive). If such a case exists...

Quote:

I guess you must be looking at retaining them all for however long it takes?

We retain all results right now anyway because of another unrelated test we want to run :-)

Best,
Oliver

Einstein@Home Project

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 987

Credit: 25171438

RAC: 0

RE: Now what I would like

9 Oct 2015 7:44:40 UTC

Message 134269 in response to message 134264

Quote:

Now what I would like is one of those certificates, so if I could have some tasks which will prove fruitful that would be really nice, otherwise I shall continue as before crunching away.

We're trying hard to pack enough discoveries for everyone into our data ;-) Thanks for trying hard to find 'em!

Oliver

Einstein@Home Project

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119232046874

RAC: 25228283

RE: No. I was referring to

9 Oct 2015 9:21:06 UTC

Message 134270 in response to message 134268

Quote:

No. I was referring to the case 1.14/1.15/1.15/1.15 where the first 1.15 got marked as invalid (not inconclusive). If such a case exists...

I don't believe such a case would be possible because the first 1.15 could not be marked as invalid unless it really was.

The initial quorum would be 2x1.14s. The first 1.15 would happen if the version transition had occurred and afterwards one of the 1.14s failed. So when that first 1.15 completed, both remaining tasks would become 'inconclusive' (since 1.14 and 1.15 don't match) and the 2nd 1.15 would be sent out. When it was returned, there would be a 1.14, a 1.15, and a 2nd 1.15 all being checked against each other. For a 3rd 1.15 to be needed, there would have to be no agreement between the current three. They would still all be 'inconclusive'.

When the 3rd 1.15 was returned, 4 results would be checked. If any two 1.15s did agree, the remaining 1.15 (along with the 1.14) would be marked invalid. The 1.15 marked as invalid would have deserved its status (presumably) :-). It really couldn't have been prematurely marked as invalid - just inconclusive right up to the final chop :-).

Unless I'm totally misunderstanding how validation works :-).
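
To make the sequence above concrete, here is a minimal sketch of that quorum logic in Python. It is only a toy model, not the project's actual validator: the `validate` function, the task labels and the output strings are all made up for illustration, and it assumes two results "agree" only when their outputs are identical and that two agreeing results are enough to pick the canonical one.

```python
from collections import Counter

MIN_QUORUM = 2  # assumption: two agreeing results are enough to validate


def validate(results, min_quorum=MIN_QUORUM):
    """Return {task: status} for a list of (task, output) pairs.

    Toy rule (an assumption): results agree only if their outputs are
    identical. A real validator compares result files with tolerances.
    """
    counts = Counter(output for _, output in results)
    agreed = [output for output, n in counts.items() if n >= min_quorum]
    if not agreed:
        # Nobody agrees yet, so every returned task stays inconclusive.
        return {task: "inconclusive" for task, _ in results}
    canonical = agreed[0]
    return {
        task: "valid" if output == canonical else "invalid"
        for task, output in results
    }


# One surviving 1.14 plus the first 1.15 resend: no match, both inconclusive.
results = [("1.14 original", "out-1.14"), ("1.15 first", "out-1.15")]
print(validate(results))

# A second 1.15 comes back but (hypothetically) disagrees with everything:
# still no agreement, so all three remain inconclusive.
results.append(("1.15 second", "out-1.15-glitch"))
print(validate(results))

# A third 1.15 matches the first one: those two become valid, and the
# 1.14 and the odd 1.15 are only now marked invalid.
results.append(("1.15 third", "out-1.15"))
print(validate(results))
```

The point of the sketch is that nothing is marked invalid until some pair agrees, which is why a 1.15 would sit at "inconclusive" right up to the final check.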

Cheers,
Gary.

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 987

Credit: 25171438

RAC: 0

Agreed. If we manually

9 Oct 2015 9:30:56 UTC

Message 134271 in response to message 134270

Agreed.

If we manually grant credit, then it would go to those 1.14ers who eventually ended up with 1.15 wingmen. While the latter will eventually find each other to be valid, the original 1.14 task will end up as invalid, despite being potentially valid from a 1.14 point of view...

Oliver

Einstein@Home Project

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119232046874

RAC: 25228283

Whatever :-). You were

9 Oct 2015 13:14:15 UTC

Message 134272 in response to message 134271

Whatever :-).

You were talking about 1.15s perhaps being marked invalid, which should only happen if an expired 1.14 comes back from the dead and spoils the party. Now you are talking about doing something about 1.14s which mostly would have been valid (if paired against another 1.14) but can't be if a couple of 1.15s happen to trump them.

This is all great, if you have the resources to sift through it all and work out how to compensate those affected for the loss. Please don't waste time if it's not a trivial exercise. The vast majority of people will get over the loss very quickly when it sinks in that this was just an unfortunate side effect arising from an unavoidable action that was better to be done sooner rather than later.

The biggest problem is that what was done was not announced in advance. If a small heads-up had been given, warning that there would be a bugfix new app version deployed at the end of the current data set which wouldn't validate with the previous app, at least those paying attention could have made a decision.

Any who were paranoid about wasting electricity on results that might fail could have been advised to set NNT (No New Tasks) and abort or complete the current cache as quickly as possible. After the changeover, they could allow new work but immediately abort any that were 1.15 resends, because of the risk they might be matched against a 1.14 quorum where the 'dead' 1.14 suddenly revived itself.

The other option to mention would be to stock up with 1.14 and then set NNT right at the onset of 1.15 and wait for the cache to drain (and some of the risk to pass) before getting any 1.15. You would always be at the mercy of your 1.14 wingman, so if they fail you will ultimately be trumped by two 1.15 resends. But if you have a large cache and are determined to escape from any such 'traps', you could theoretically delete your unstarted task if your wingman's one shows up as a failure. That way you could save yourself a wasted crunch and leave it to two 1.15s to do the job.

I very much doubt that there would be many people prepared to make the effort to think through and implement any such 'schemes', simply because they are complicated and require time and effort to track tasks and manual intervention to minimise waste. However you would get lots of kudos for keeping people informed about the problem. People will put up with all sorts of stuff if they are being informed and don't feel they are being taken for granted.

None of the above is in any way intended to be a criticism. The most important point I wanted to make was not to waste more time 'fixing' things if the 'fix' isn't trivial.

Cheers,
Gary.

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 987

Credit: 25171438

RAC: 0

I agree in every aspect. So

9 Oct 2015 13:52:44 UTC

Message 134273 in response to message 134272

I agree in every aspect. So just to repeat my earlier mea culpa, let me quote myself:

Quote:

Anyhow, we should have announced the expected validation issue alongside the app deployment such that everyone caring about that could have reacted accordingly. We failed to do that and thus are ready to take the heat. Again, sorry, even at an otherwise rather rock-solid project mishaps/miscommunication can happen.

A nice weekend to you all,

Oliver

Einstein@Home Project
