Result refused

tullio
tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0
Topic 195423

This is the first time a result of mine has been refused:
31-Oct-2010 03:48:24 [Einstein@Home] Message from server: Completed result h1_1237.25_S5R4__952_S5GC1a_0 refused: result already reported as success
31-Oct-2010 03:48:24 [Einstein@Home] Message from server: Resent lost task h1_1237.40_S5R4__961_S5GC1a_1

Tullio

Gundolf Jahn
Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Result refused

Quote:
This is the first time a result of mine has been refused


Yes, but it had been reported already. If you turn on the logging flag, you'll even see the acknowledgement:[pre]31/10/2010 01:23:05|Einstein@Home|Message from server: Completed result h1_1104.10_S5R4__236_S5GC1a_1 refused: result already reported as success
31/10/2010 01:23:05|Einstein@Home|Message from server: Resent lost task h1_1104.35_S5R4__52_S5GC1a_0
31/10/2010 01:23:05|Einstein@Home|Project requested delay of 60.000000 seconds
31/10/2010 01:23:05||[sched_op_debug] handle_scheduler_reply(): got ack for result h1_1104.10_S5R4__236_S5GC1a_1[/pre]
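If you want to see those extra [sched_op_debug] lines in your own log, here's a minimal sketch of turning the flag on via cc_config.xml, assuming the commands are run in the BOINC data directory (the flag and file name are standard BOINC; everything else is just an example):

[pre]# write a cc_config.xml containing the sched_op_debug log flag
# (this overwrites an existing file, so merge by hand if you already use one)
cat > cc_config.xml <<'EOF'
<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>
EOF
# make the running client re-read the config, no restart needed
boinccmd --read_cc_config[/pre]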

Computers are not everything in life. (Just a little joke.)

tullio
tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

RE: RE: This is the first

Quote:
Quote:
This is the first time a result of mine has been refused

Yes, but it had been reported already. If you turn on the logging flag, you'll even see the acknowledgement:[pre]31/10/2010 01:23:05|Einstein@Home|Message from server: Completed result h1_1104.10_S5R4__236_S5GC1a_1 refused: result already reported as success
31/10/2010 01:23:05|Einstein@Home|Message from server: Resent lost task h1_1104.35_S5R4__52_S5GC1a_0
31/10/2010 01:23:05|Einstein@Home|Project requested delay of 60.000000 seconds
31/10/2010 01:23:05||[sched_op_debug] handle_scheduler_reply(): got ack for result h1_1104.10_S5R4__236_S5GC1a_1[/pre]


OK. I am not complaining. But I could not report it because Einstein was down.
Tullio

dramas
dramas
Joined: 20 Oct 10
Posts: 27
Credit: 31719
RAC: 0

RE: This is the first time

Quote:

This is the first time a result of mine has been refused:
31-Oct-2010 03:48:24 [Einstein@Home] Message from server: Completed result h1_1237.25_S5R4__952_S5GC1a_0 refused: result already reported as success
31-Oct-2010 03:48:24 [Einstein@Home] Message from server: Resent lost task h1_1237.40_S5R4__961_S5GC1a_1

Tullio

I have had it happen too, but it turned out not to be a problem. Anyway, this is how it panned out here:

http://einsteinathome.org/node/195410

tnx

SATMGR
SATMGR
Joined: 29 Oct 10
Posts: 1
Credit: 37387553
RAC: 0

Why does EAH send more work

Why does EAH send more work units than I can do before the deadline?
The Messages tab suggests I abort some! This will eventually lead to refused credit (I refuse to abort) because EAH does not limit a day's work units to a reasonable level. This unintelligent method of (non-)WU management should be addressed by EAH.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 119041508176
RAC: 24803457

RE: Why does EAH send more

Quote:
Why does EAH send more work units than I can do before the deadline?
The Messages tab suggests I abort some! This will eventually lead to refused credit (I refuse to abort) because EAH does not limit a day's work units to a reasonable level. This unintelligent method of (non-)WU management should be addressed by EAH.


Unfortunately, this "unintelligent method ..." is something that is controlled by BOINC and is not something that the E@H Devs can do a lot about. I don't run any CUDA capable GPUs on Einstein myself so I have no direct experience of the problem but having looked through your tasks list, I can see that you do have a problem worthy of complaint. Unfortunately, it's the BOINC Devs who will need to solve the problem.

I believe the problem was caused by your BOINC client continually trying to get enough ABP2 tasks to keep your two GPUs fed. Even in normal times, the project has more gravity wave (GW) tasks than ABP tasks so when your client keeps asking, you are going to get an excess of GW tasks. There were only 2 CPU cores to do these tasks since the ABP tasks would have tied up the other 2 cores with the 2 GPUs. Now that ABP2 tasks are essentially finished for the time being (probably a week or more until Parkes work becomes available) I really don't know exactly what your BOINC client will do about the shortage of work for your GPUs. If left to its own devices, the client might be continually trying to get more (non-existent) ABP2 tasks and so end up with even more unwanted GW tasks.

If it were my host, I would do the following:-

* Reduce your cache size (temporarily) to say 1 day or less to encourage BOINC not to keep asking for work.
* Suspend your remaining ABP2 tasks so that all 4 cores are available to crunch the backlog of GW tasks. I'm guessing you've already done this since those tasks have stopped being returned and you seem to be returning GW tasks in batches of 4.
* Buy yourself a bit of leeway by immediately aborting enough GW tasks so that you aren't trying to crunch tasks that are already past deadline, or that will be by the time you get to start them. You don't need to be concerned about aborting excess tasks that would otherwise miss deadline anyway (see the boinccmd sketch after this list for a command-line way to suspend or abort tasks).
* If there are still GW tasks that are going more than say a few hours past deadline without being started, abort them rather than risk starting a task for which you may get no credit. A good trick to use for an overdue task is to check if the third task has actually been issued. Sometimes the extra task is not issued for several days so you may have enough time to get yours done and so save the resend to the third host. If it has been issued, check the reliability of the 3rd wingman (ie how many days on average for a return from that host). If the 3rd host seems quick and reliable, you should abort your task immediately, rather than risk a wasted effort.
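If you'd rather do the suspending and aborting from a command line than by clicking in BOINC Manager, here's a minimal sketch using boinccmd. The task names below are made-up placeholders and the URL is the project's master URL; substitute the names that --get_tasks shows for your own host.

[pre]# list the tasks currently on this host (names, states, deadlines)
boinccmd --get_tasks

# suspend a task (placeholder name) to free its core for the GW backlog
boinccmd --task http://einstein.phys.uwm.edu/ h1_1234.00_S5R4__example_0 suspend

# abort a task that has no realistic chance of making its deadline
boinccmd --task http://einstein.phys.uwm.edu/ h1_1234.00_S5R4__example_1 abort[/pre]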

Hope these thoughts are of some use to you.

Cheers,
Gary.

tolafoph
tolafoph
Joined: 14 Sep 07
Posts: 122
Credit: 74659937
RAC: 0

Hi, I've a CUDA card and

Hi,

I've a CUDA card and a dual-core and know the problem.
The CUDA tasks take under 3 h but the GW ones take around 6 h.

When I set the cache to more than 1 day, I always got a new GW WU whenever BOINC asked for a CUDA WU. So the ABP task finished again after 3 h while a GW WU was only at 50%, and the GW tasks kept piling up.
Usually I just aborted them.

mikey
mikey
Joined: 22 Jan 05
Posts: 12857
Credit: 1884347953
RAC: 281651

RE: Why does EAH send more

Quote:
Why does EAH send more work units than I can do before the deadline?
The Messages tab suggests I abort some! This will eventually lead to refused credit (I refuse to abort) because EAH does not limit a day's work units to a reasonable level. This unintelligent method of (non-)WU management should be addressed by EAH.

By aborting units you are just putting them back into the cache for the rest of us BUT by keeping them and crunching them even after their deadlines you are wasting crunching time! Boinc, the Server side, is designed to automatically resend out any unit that is not returned before its deadline. The ONLY way you will get credit for a unit after the deadline is by returning it BEFORE the new person does and then the new person gets no credit for crunching it!! That is a lose lose situation!! Just abort the units you can't finish and the Server will automatically send them out to someone else without having to wait until the deadline to do it.

There are ways you can trick Boinc into sending you less work, but the easiest way is just to adjust your cache. Did you know you can make cache adjustments in increments as small as 0.1 days?
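If you'd rather set the cache from the command line than through the preferences pages, here's a minimal sketch, assuming a global_prefs_override.xml in the BOINC data directory (the 0.1 figure is only the example from above):

[pre]# local override: shrink the work buffer ("cache") to 0.1 days
cat > global_prefs_override.xml <<'EOF'
<global_preferences>
  <work_buf_min_days>0.1</work_buf_min_days>
</global_preferences>
EOF
# tell the running client to pick up the override
boinccmd --read_global_prefs_override[/pre]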

mikey
mikey
Joined: 22 Jan 05
Posts: 12857
Credit: 1884347953
RAC: 281651

RE: Hi, I've a CUDA card

Quote:

Hi,

I've a CUDA card and a dual-core and know the problem.
The CUDA tasks take under 3 h but the GW ones take around 6 h.

When I set the cache to more than 1 day, I always got a new GW WU whenever BOINC asked for a CUDA WU. So the ABP task finished again after 3 h while a GW WU was only at 50%, and the GW tasks kept piling up.
Usually I just aborted them.

This will always be a problem until the Boinc programmers fix, or in my opinion separate, the GPU and CPU sides of the work-request code. There is a Boinc mailing list you can get on, but it is run by Dr. David Anderson of Seti, and since he is the creator and still the main programmer of Boinc he doesn't always take suggestions as suggestions! Sometimes he takes them as criticisms, but hey, without him there would be no Boinc!

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 3001481933
RAC: 699025

You could also use your

You could also use your project preferences to manage the cache.

If you set

Use CPU (Enforced by version 6.10+): no


when you have enough GW tasks on hand, the effect should be that no new GW tasks are allocated. (The wording doesn't make it clear, but this setting won't affect the processing of any tasks you already have: the purpose is to separate out the requests for CPU and - using the next preference down - GPU work).

Allowing new tasks to be allocated, but aborting them later, is a bad idea because of the bandwidth that can be wasted - both your bandwidth, and the project's - if the allocated work is in a new frequency band that requires data files not already on your computer.

Mind you, all of this is a bit moot this week, because all the tasks for the ABP2 run have already been issued, and the next run isn't ready to start yet.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 119041508176
RAC: 24803457

RE: ... The ONLY way you

Quote:
... The ONLY way you will get credit for a unit after the deadline is by returning it BEFORE the new person does ...


Correct.

Quote:
... and then the new person gets no credit for crunching it!!


Sorry, not correct. The new person will always get credit if they return the extra task within the deadline. There are actually situations where it is useful to crunch a 'deadline miss' task.

I had an unattended host recently that locked up without being noticed. When it was discovered and rebooted, its entire cache was something like 10-15+ days overdue. If I had simply just aborted all these (~100) tasks, the daily quota would have dropped to 1 per CPU core and the host would have had to wait the best part of another 24 hours to get a further daily quota. If you have received your daily quota, BOINC will put you on a countdown for the next day, unless you are able to manually update as soon as tasks finish and so start the quota recovery process. As I said, the host is normally unattended and I couldn't be around to do the manual update.

So I had a look through the cache of way overdue tasks and found several where the quorums were not completed, even after such a long time. There were some where the 3rd wingman had defaulted - in fact there was even one where there were several defaults. The key point was that these quorums were still active. So I decided to keep the most promising of these for my host to work on (one task per core) and aborted all the others, particularly those where the quorum was already completed. I was also able to download the one new task per core and the machine was off and crunching again with two tasks per core on board. Of course the next work request resulted in the long back-off being imposed but this is unavoidable.

To work around that back-off, I wrote a one line shell command to sleep for the required interval (until the first round of tasks would be finished) and then wake up and run boinccmd to update the project - the equivalent of clicking 'update' in BOINC Manager. This worked like a charm and these completely expired tasks were all reported and credited and were sufficient to get the daily quota back to a satisfactory state. When I checked a day or two later, I could see that two of the already assigned wingmen had also reported in after my host and had been awarded credit as well. So it was actually a win-win-win situation for my host, the other wingmen and the project. The project won because there were two of the 4 tasks my host completed that were not subsequently completed by the other wingmen in place at the time. If I had aborted those tasks, two further tasks would have ultimately been sent out.
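For anyone curious, the one-liner was along these lines (a sketch only: the sleep interval and the project URL are placeholders that would need adjusting for your own host):

[pre]# sleep until the first round of tasks should be done, then report them
# (21600 s = 6 hours is just an example interval)
sleep 21600 && boinccmd --project http://einstein.phys.uwm.edu/ update[/pre]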

You should keep in mind that there is always a substantial 'failure to satisfactorily return' rate among the tasks that are issued. I don't know the actual statistics but a figure of 10 - 20% or more wouldn't surprise me. Quite often you can spot when such a failure is about to occur - eg a host that normally returns promptly but is currently not returning at all. By following the link to the wingman's host you can check for this quite easily.

Cheers,
Gary.
