Abort job when not needed

Mr Anderson
Joined: 28 Oct 17
Posts: 39
Credit: 150358722
RAC: 43754
Topic 215793

Several times I happened to click on a completed workunit only to find that two other computers had completed the job ahead of me thus making my contribution essentially a wasted effort. Wouldn't it be better in such a case that the computer is informed that the job is no longer needed (particularly if work hasn't yet started on a downloaded job) so it can "forget it" and get on with other things? An example of this is workunit 358436788 (which actually didn't validate) where two other PCs had finished the job two days before mine did.

Another thing that I would find useful would be the ability to move downloaded and even started jobs off to another computer in the same account. So, for example, if an account has computers A and B and A has several jobs downloaded and may have been working on some of them but the owner knows that the computer will not be used for a time, then it would be useful to be able to transfer them to B where work can continue there.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7234111127
RAC: 1189350

These sound like BOINC feature requests to me, not Einstein.  You might try posting on a BOINC forum.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117895084737
RAC: 34619323

Mr Anderson wrote:
Several times I happened to click on a completed workunit only to find that two other computers had completed the job ahead of me thus making my contribution essentially a wasted effort. Wouldn't it be better in such a case that the computer is informed that the job is no longer needed (particularly if work hasn't yet started on a downloaded job) so it can "forget it" and get on with other things? An example of this is workunit 358436788 (which actually didn't validate) where two other PCs had finished the job two days before mine did.

I hunted through your tasks list and found this completed workunit you mention.  This was not a "wasted effort" on the part of your machine and there is no way the result could have been any different.  You have misinterpreted what actually happened.  Here is the correct sequence of events. Check for yourself.

The first task in the quorum was issued on June 27 at 1:58:07 UTC.  It was the only one issued and it exceeded the deadline at 1:58:07 UTC on July 11 (14 days later).  As a result of that task expiring (and with no other task having been issued before the expiry), the scheduler decided to issue the two needed tasks almost simultaneously, and within seconds of the first task expiring - check for yourself.  Then the nasty bit happens :-).  The expired task gets completed the best part of a day later at 23:00:44 UTC.  The scheduler will always accept a late result IF the quorum is not already completed because there is no guarantee that the other two more recent tasks will actually get returned.  One of the two extra tasks had been returned so the 'expired' task was able to complete the quorum immediately when it arrived back.  Yours was the last to be returned and the scheduler will always wait for it (up to its own deadline).
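The timing in that sequence can be checked with a few lines of Python. This is only a sketch: the year is not stated in the post, so 2018 is assumed purely for illustration, the late return is assumed to fall on July 11, and `deadline` and `accept_late_result` are hypothetical helper names, not BOINC functions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the year is not given in the post; 2018 is used for illustration only.
ISSUED = datetime(2018, 6, 27, 1, 58, 7, tzinfo=timezone.utc)
DEADLINE_DAYS = 14  # from the post: the task expired 14 days after it was issued


def deadline(issued, days=DEADLINE_DAYS):
    """Return the moment at which a task issued at `issued` expires."""
    return issued + timedelta(days=days)


def accept_late_result(quorum_complete):
    """A late result is still accepted as long as the quorum is not yet complete."""
    return not quorum_complete


first_deadline = deadline(ISSUED)

# Assumption: the expired task came back on the same day it expired (July 11).
late_return = datetime(2018, 7, 11, 23, 0, 44, tzinfo=timezone.utc)
hours_late = (late_return - first_deadline).total_seconds() / 3600

print(first_deadline.isoformat())
print(f"returned about {hours_late:.0f} hours after the deadline")
```

Under those assumptions the first task came back roughly 21 hours late, which fits "the best part of a day", and it was accepted because the quorum was still open at that moment.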

If your result had contained valid information, it would have been accepted and credited as well.  Unfortunately, when the contents were inspected by the validator, the result didn't miss by just a little bit, otherwise it would have been marked as 'invalid'.  It failed a basic 'sanity' check without ever being directly compared to the other two.  That's what the term 'validate error' actually means.  There is a pinned thread explaining all this.

Usually, if there are a number of validate errors, it's a sign of a hardware problem with your machine.  In this case there are no other such examples in the whole of the current online database (just one bad result out of a total of 128 showing) so it's perhaps some random unfortunate event - possibly the result of a power fluctuation for example.

Mr Anderson wrote:
Another thing that I would find useful would be the ability to move downloaded and even started jobs off to another computer in the same account. So, for example, if an account has computers A and B and A has several jobs downloaded and may have been working on some of them but the owner knows that the computer will not be used for a time, then it would be useful to be able to transfer them to B where work can continue there.

Not only is this something that the BOINC Devs would need to incorporate into their software, but also think through the problems for the online database of the particular project.  If you were able to shift tasks between hosts, willy-nilly, how is the online database going to track the changes?  What happens if you make an inappropriate switch?  Maybe the recipient host doesn't have the correct hardware/drivers to process the added tasks.  Do you think it's a good idea for the user to make that sort of decision and be responsible for 'getting it right'?  There is already an appropriate mechanism for this situation.  It's called the abort button.  Just get rid of excess tasks on the host that can't crunch them and let other hosts request new work when they actually need it.  Much simpler for you and the project.

Cheers,
Gary.

Mr Anderson
Joined: 28 Oct 17
Posts: 39
Credit: 150358722
RAC: 43754

Thank you for your comments. Regarding the validate error perhaps I should have been clearer. I did not mean to say that the validate error was because of the sequence of events, i.e. arriving late to the party. I accepted that there was an error made by my machine but I was curious and when I clicked on the job, I found that the quorum had already been completed almost two days earlier. It just seemed pointless that my machine had continued on with the job because the result was not needed (irrespective of the bad result that it wound up producing).

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117895084737
RAC: 34619323

Project servers don't initiate contact with clients.   Firewalls at the client end should prevent that.  A server responds to requests from clients.   Whilst it would be possible to set up a system where the server would be able to determine which outstanding results are now redundant, the server would have to wait for a client to make contact first.

I believe such a system already exists within BOINC.  It would be up to each project as to whether they choose to adopt it or not.  I could be wrong but my understanding is that it adds a lot of load to the server/database so is likely not to be used, particularly for any project that is already struggling with load issues.

I also think that the existing system does not abort tasks that are already in progress.  It allows tasks that have not started, which are redundant, to be aborted.  However if the client isn't talking to the server first, the message won't be sent to the client.  So, in your case, it doesn't matter when the quorum was actually completed.  What could matter is when your client first made a request to the server after that event.  If, by that time, processing of the task had already started, it wouldn't have been aborted anyway.
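To make the rule concrete, here is a rough Python sketch of the decision as described above. It is not BOINC's actual code: `ClientTask`, `tasks_to_abort` and the workunit IDs are made up for illustration. The function stands for what a server could work out while answering a client's request, since it cannot contact the client on its own.

```python
from dataclasses import dataclass


@dataclass
class ClientTask:
    """A task the client reports holding when it contacts the server (hypothetical)."""
    name: str
    workunit_id: int
    started: bool


def tasks_to_abort(reported, completed_workunits):
    """Return the names of tasks the server could tell the client to drop.

    A task qualifies only if its workunit already has a complete quorum
    AND the client has not started processing it.
    """
    return [
        task.name
        for task in reported
        if task.workunit_id in completed_workunits and not task.started
    ]


# Hypothetical data: workunits 101 and 102 already have a complete quorum.
reported = [
    ClientTask("task_a", 101, started=False),  # redundant, not started
    ClientTask("task_b", 102, started=True),   # redundant, but already running
    ClientTask("task_c", 103, started=False),  # still needed
]
print(tasks_to_abort(reported, completed_workunits={101, 102}))
```

Only `task_a` is selected here. `task_b` is just as redundant, but it keeps running because it had already started by the time the client made contact.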

I believe I had noticed this system in operation at Einstein quite a long time ago.  I haven't seen any sign of it in more recent times.  I have fairly small work caches so results get turned over quite quickly.  This means I would be unlikely to see it on my machines.  People probably would tend to complain if tasks start disappearing from their machines.  I haven't seen any such comments in a long time.  I don't think the system of server initiated aborts is currently in use.  However I don't know for sure.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4314
Credit: 250788497
RAC: 34430

Indeed there exists an option in BOINC that allows tasks to be canceled on clients, meant for the case when the respective workunits have been canceled on the server. This is an option that we tried once, but we found that it stresses our DB too much. So we implemented something similar that requires some more manual work (editing regular expressions) instead of using the DB. However, as this requires manual work and it doesn't affect tasks that are already in progress, we only use this for its original purpose, i.e. when canceling (larger groups of) workunits.

Also, your suggested procedure would mean that people won't get "credit" for the computation they already spent on a task that turns out to be redundant, without anything they can do about it. I do know of a few participants who certainly won't like this.

BM

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3442116540
RAC: 4063354

BOINC is awful at sending out a 3rd task prior to the 2nd task's deadline. I've seen it at many projects. Starting up another project and setting NNT (No New Tasks) on an old one ends up pushing all the old tasks right out to the deadline, only to have time wasted as the server sent the 3rd task to someone else.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

mmonnin wrote:
BOINC is awful at sending out a 3rd task prior to the 2nd task's deadline. I've seen it at many projects. Starting up another project and setting NNT (No New Tasks) on an old one ends up pushing all the old tasks right out to the deadline, only to have time wasted as the server sent the 3rd task to someone else.

I've never seen BOINC do this! On the other hand, as soon as the deadline has expired for one of the initial tasks, the BOINC server does generate a 3rd task within seconds and sends it out as quickly as possible.

If you've seen, and have proof of, BOINC generating and sending out the 3rd task before one of the initial tasks has passed its deadline (or reported some kind of problem), then that should be considered a bug and should be reported!
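The expected behaviour can be written down as a small Python sketch. This is an illustration of the rule as described in this post, not the real scheduler logic; `needs_replacement` and the dates are hypothetical.

```python
from datetime import datetime, timezone


def needs_replacement(task_deadline, now, returned, reported_error=False):
    """Decide whether a replacement (3rd) task should be generated.

    Per the behaviour described above, a replacement is expected only when
    an initial task reported a problem, or when it has not been returned
    and its deadline has already passed.
    """
    if reported_error:
        return True
    return (not returned) and now > task_deadline


# Hypothetical deadline, chosen only for illustration.
task_deadline = datetime(2018, 7, 11, 1, 58, 7, tzinfo=timezone.utc)
one_second_before = datetime(2018, 7, 11, 1, 58, 6, tzinfo=timezone.utc)
one_second_after = datetime(2018, 7, 11, 1, 58, 8, tzinfo=timezone.utc)

print(needs_replacement(task_deadline, one_second_before, returned=False))
print(needs_replacement(task_deadline, one_second_after, returned=False))
```

A 3rd task showing up while the first call would still return `False` is the kind of case that would be worth reporting as a bug.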

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117895084737
RAC: 34619323

mmonnin wrote:
BOINC is awful at sending out a 3rd task prior to the 2nd task's deadline.

You seem to be implying that this happened in this particular case.  It certainly didn't.

I can only speak for what happens at Einstein.  Over the years, particularly the early days whilst trying to understand the ins and outs of locality scheduling, I spent quite a bit of time looking at lots of different work units and I never did see any example (in normal circumstances) of what you suggest.

There is a particular situation when the staff need to finish off the 'stragglers' for a particular run.  To achieve that they will increase the 'initial replication' so as to get the dregs finished as quickly as possible.  The hope is that someone will return one of the extra tasks quickly and the work units can be completed.  This can be done if there is a handful of results outstanding that are taking a long time to be returned.  It is by no means a regular practice.

 

Cheers,
Gary.

Mr Anderson
Joined: 28 Oct 17
Posts: 39
Credit: 150358722
RAC: 43754

Bernd Machenschalk wrote:
Also, your suggested procedure would mean that people won't get "credit" for the computation they already spent on a task that turns out to be redundant, without anything they can do about it. I do know of a few participants who certainly won't like this.

It's probably a matter of personal preference. Although I like to see my credit going up, I care much more about the actual contribution that I am making so if a computation is essentially redundant because it's already completed by someone else then that is of more concern to me. I'd prefer my computer to stop work on any job no matter where it would be in the computation in this case because making it carry on to the end is of no benefit to anyone. Of course this is all somewhat moot since we don't have this functionality but if we did perhaps it would be an option in the user preferences.

Edit: To be clearer, in my opinion I find the obsessing over credit to be rather juvenile, sort of like "you get 100 points for being a good boy and doing your homework". It is however a measure of the useful work that has been done and that is the reason why I like to see my credit going up, because I like to be making a positive contribution. So if my computer completes work that wasn't needed then I don't see the point in getting any credit for it because no useful work was done.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4314
Credit: 250788497
RAC: 34430

Mr Anderson wrote:
It's probably a matter of personal preference.

Indeed the importance of 'credit' for any given person is a matter of preference. 'Credit' (or 'cobblestones') was what made SETI@Home more successful than any other volunteer computing project that existed at that time. This is why it was carried over to BOINC. The competition that 'credit' induces is one more motivation, at least for some people, to donate more computing power, and that's good for us (as a project), so we will not neglect that (although I personally don't give a ***).

There are no means in BOINC to abort tasks from the server side once they are already in progress on a client. Implementing that would take a rather large amount of effort, since both client and server have to be changed, and would certainly raise the requirements for DB performance and network bandwidth for the projects. Sorry, I really don't think that it's worth that.

BM
