I think I've been unlucky here, as I have got 20 invalids and possibly another 10 to 30 to come. Not that I am complaining. It's pretty obvious that there wasn't much of a way around it, unless you weren't going to use your CPU for a few weeks (or were going to use a backup project).
The only problem is that it takes longer to check whether the invalids are your computer having a wobbly or just the project.
Normally it's a quick glance at, hopefully, only one or two invalids, preferably no invalids or errors (it seems to have been a while since that).
It's a bit of a shock when you get a page full. I was thinking, damn, I thought I had fixed all the recent problems, which one is misbehaving now? But all is O.K., as it does not look like there are any problems.
Not bothered about credits, they are just handy to see if a computer is doing well or not.
I don't see that this is any different to when your own computer has a fit and spits out a load of invalids or errors.
Now what I would like is one of those certificates, so if I could have some tasks which will prove fruitful, that would be really nice. Otherwise I shall continue as before, crunching away.
John
chase1902 wrote: Think I've been unlucky here ...
I just looked at one of your computers, which only showed three invalids. But one of those appears not to have arisen from the matter under discussion.
WU 229053630 started life as a 1.15 unit. The first quorum failed to match, so a third 1.15 was distributed. The two in that quorum which were not your machine did match, so you were voted off the island.
Possibly that machine had an unhealthy moment, possibly both of the other machines somehow were unhealthy in a matched way, or just maybe the application has a wee bit of a problem. But in any case this particular case was not the 1.14 vs. 1.15 matter.
It does illustrate your point, that the presence of the 1.14/1.15 matter clouds up machine health assessment, now and for a while longer.
Yes, that computer got a bit touchy. It used to only run GPU tasks and was happy as Larry; I added some CPU tasks and it all went pear-shaped.
It wouldn't run any of S6 without loads of invalids.
I lowered the RAM speed a bit and it seems O.K. with the FGRP4, but I like to keep an eye on it in case it's something else.
I gave it a good clean and I monitor the temperatures (which are always well within the norm, about 60/65C; the temperature program also keeps max/min, and it has never gone over 65C).
I can't find anything else wrong with it, apart from it having its moments every now and then with CPU tasks and throwing out a few invalids.
Now my oldest computer, which you could forgive a few invalids, has managed to go through 2 CPU coolers (could be 3) and 2 power supplies, and it runs fine.
The only reason I spotted the failing CPU cooler was that the run times shot up where it was throttling back so much. As it had a water-cooled cooler, there's nothing to see when they start going wrong. At least they changed it under warranty.
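Quote:
Are you saying that any completed quorums that are 1.14,1.14,1.15, where the 1.15 has been excluded in this way, will be repeated and the failed 1.15 will then be matched against the new canonical result?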
I have seen one of those here: http://einsteinathome.org/workunit/228236227. When the wingman on 1.14 timed out, a first 1.15 was sent out. When that completed, a second. However, before that last one finished, the initial timed out 1.14 completed as well, validating both 1.14 tasks and invalidating the 1.15 one. When the last issued 1.15 task finished, it was invalidated as well, so here you have two valid 1.14 and two invalid 1.15 WUs. In this particular case, I was one of the 1.15 volunteers.
As said by others, forget about credits; that's not the point and not worth the effort, at least AFAIC. I am merely a bit dismayed by the waste of time and the useless power off the wall.
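Quote:
Are you saying that any completed quorums that are 1.14,1.14,1.15, where the 1.15 has been excluded in this way, will be repeated and the failed 1.15 will then be matched against the new canonical result?
No. I was referring to the case 1.14/1.15/1.15/1.15 where the first 1.15 got marked as invalid (not inconclusive). If such a case exists...
Quote:
I guess you must be looking at retaining them all for however long it takes?
We retain all results right now anyway because of another unrelated test we want to run :-)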
Best,
Oliver
Einstein@Home Project
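Quote:
Now what I would like is one of those certificates, so if I could have some tasks which will prove fruitful that would be really nice, otherwise I shall continue as before crunching away.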
We're trying hard to pack enough discoveries for everyone into our data ;-) Thanks for trying hard to find 'em!
Oliver
Einstein@Home Project
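Quote:
No. I was referring to the case 1.14/1.15/1.15/1.15 where the first 1.15 got marked as invalid (not inconclusive). If such a case exists...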
I don't believe such a case would be possible because the first 1.15 could not be marked as invalid unless it really was.
The initial quorum would be 2x1.14s. The first 1.15 would happen if the version transition had occurred and afterwards one of the 1.14s failed. So when that first 1.15 completed, both remaining tasks would become 'inconclusive' (since 1.14 and 1.15 don't match) and the 2nd 1.15 would be sent out. When it was returned, there would be a 1.14, a 1.15, and a 2nd 1.15 all being checked against each other. For a 3rd 1.15 to be needed, there would have to be no agreement between the current three. They would still all be 'inconclusive'.
When the 3rd 1.15 was returned, 4 results would be checked. If any two 1.15s did agree, the remaining 1.15 (along with the 1.14) would be marked invalid. The 1.15 marked as invalid would have deserved its status (presumably) :-). It really couldn't have been prematurely marked as invalid - just inconclusive right up to the final chop :-).
Unless I'm totally misunderstanding how validation works :-).
Cheers,
Gary.
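One way to picture the quorum logic described in the post above is the following minimal Python sketch. It is only an illustration and not the project's actual validator: the `Result` class, the `agree` rule (two results match only if both hosts were healthy and both ran the same app version), the `validate` function, and the result names are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Result:
    name: str            # hypothetical label for a returned task
    app_version: str     # "1.14" or "1.15"
    healthy: bool = True  # False models a host that returned a bad result


def agree(a: Result, b: Result) -> bool:
    # Assumption: 1.14 and 1.15 never match each other, and an
    # unhealthy result matches nothing.
    return a.healthy and b.healthy and a.app_version == b.app_version


def validate(results: list[Result], min_quorum: int = 2) -> dict[str, str]:
    # Look for any agreeing pair; the first member of that pair
    # stands in for the canonical result.
    if len(results) >= min_quorum:
        for a, b in combinations(results, 2):
            if agree(a, b):
                return {
                    r.name: "valid" if agree(r, a) else "invalid"
                    for r in results
                }
    # No agreement yet: nothing is invalid, everything stays inconclusive.
    return {r.name: "inconclusive" for r in results}


# The surviving 1.14 plus two 1.15 resends, the first of which is bad.
returned = [
    Result("1.14 survivor", "1.14"),
    Result("1.15 first", "1.15", healthy=False),
    Result("1.15 second", "1.15"),
]
after_three = validate(returned)

# A third 1.15 comes back and agrees with the second one.
returned.append(Result("1.15 third", "1.15"))
after_four = validate(returned)

print(after_three)
print(after_four)
```

In this sketch the first 1.15 stays inconclusive until two other 1.15s agree, and only then is it marked invalid together with the 1.14. Under the same assumptions, a late 1.14 joining a 1.14/1.15 pair validates both 1.14s and invalidates the 1.15, which is the situation reported for workunit 228236227.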
Agreed.
If we manually grant credit, then it would be to those 1.14ers who eventually ended up with 1.15 wingmen. While the latter will eventually find each other to be valid, the original 1.14 task will end up as invalid, despite being potentially valid from a 1.14 point of view...
Oliver
Einstein@Home Project
Whatever :-).
You were talking about 1.15s perhaps being marked invalid which should only happen if an expired 1.14 comes back from the dead and spoils the party. Now you are talking about doing something about 1.14s which mostly would have been valid (if paired against another 1.14) but can't be if a couple of 1.15s happen to trump them.
This is all great, if you have the resources to sift through it all and work out how to compensate those affected for the loss. Please don't waste time if it's not a trivial exercise. The vast majority of people will get over the loss very quickly when it sinks in that this was just an unfortunate side effect arising from an unavoidable action that was better to be done sooner rather than later.
The biggest problem is that what was done was not announced in advance. If a small heads-up had been given, warning that there would be a bugfix new app version deployed at the end of the current data set which wouldn't validate with the previous app, at least those paying attention could have made a decision.
Any who were paranoid about wasting electricity on results that might fail could be advised to set NNT and abort or complete the current cache as quickly as possible. After the changeover, allow new work but immediately abort any that were 1.15 resends because of the risk they might be matched against a 1.14 quorum where the 'dead' 1.14 suddenly revived itself.
The other option to mention would be to stock up with 1.14 and then set NNT right at the onset of 1.15 and wait for the cache to drain (and some of the risk to pass) before getting any 1.15. You would always be at the mercy of your 1.14 wingman so if they fail you will ultimately be trumped by two 1.15 resends but if you have a large cache and are determined to escape from any such 'traps' you could theoretically delete your unstarted task if your wingman's one shows up as a failure. That way you could save yourself a wasted crunch and leave it to two 1.15s to do the job.
I very much doubt that there would be many people prepared to make the effort to think through and implement any such 'schemes', simply because they are complicated and require time and effort to track tasks and manual intervention to minimise waste. However you would get lots of kudos for keeping people informed about the problem. People will put up with all sorts of stuff if they are being informed and don't feel they are being taken for granted.
None of the above is in any way intended to be a criticism. The most important point I wanted to make was not to waste more time 'fixing' things if the 'fix' isn't trivial.
Cheers,
Gary.
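I agree in every aspect. So just to repeat my earlier mea culpa, let me quote myself:
Quote:
Anyhow, we should have announced the expected validation issue alongside the app deployment such that everyone caring about that could have reacted accordingly. We failed to do that and thus are ready to take the heat. Again, sorry, even at an otherwise rather rock-solid project mishaps/miscommunication can happen.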
A nice weekend to you all,
Oliver
Einstein@Home Project