Ghost WU and resending lost results

paperdragon

Joined: 8 Mar 05

Posts: 16

Credit: 2255341

RAC: 0

All these close deadlines may

29 Jul 2005 19:44:43 UTC

Message 14761


All these close deadlines may not be the norm. Since this feature has just been turned on, it is resending all the stuff that has been sitting there for a while. In future, ghost work units should generally be resent well before the deadline, unless the host has been disconnected from the internet for some reason.

But I was thinking: if a host asks for xxxxx seconds of work, the server should resend any ghost units first and subtract their total time from the request. It should then send new work only if the ghost units' total time is less than what was requested.

For Example:
Request 20,000 seconds of work. Have 5,000 seconds of ghost units. You would only need to send 15,000 seconds of new work.
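
A minimal sketch of that arithmetic, assuming the server knows the estimated runtime of each ghost unit. The function and parameter names are hypothetical and are not taken from the BOINC scheduler code:

```python
def new_work_seconds(requested_seconds, ghost_unit_seconds):
    """Return how many seconds of new work to send after resending ghosts.

    requested_seconds: amount of work the host asked for.
    ghost_unit_seconds: estimated runtimes of the ghost units to resend.
    Both names are illustrative assumptions, not real scheduler fields.
    """
    ghost_total = sum(ghost_unit_seconds)
    # If the ghost units already cover the request, send no new work.
    return max(0, requested_seconds - ghost_total)


print(new_work_seconds(20_000, [5_000]))  # 15000
```

Clamping at zero covers the case where the ghost units alone exceed the request.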

You like Myst? Uru Live returns! www.urulive.com

Grenadier

Joined: 9 Feb 05

Posts: 14

Credit: 2823344

RAC: 0

The deduction of a resend

29 Jul 2005 19:48:50 UTC

Message 14762


The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days. Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

RE: The deduction of a

29 Jul 2005 22:29:31 UTC

Message 14763 in response to message 14762


Quote:

The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days.

This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.

Quote:

Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

I don't know how to make this determination.

However I have just made the following changes. IF
- Work within 25% of deadline (42 hours for Einstein@Home), OR
- Work no longer needed (Canonical result already exists), OR
- Work unit has error flag set (something wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.
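
The three conditions above can be sketched roughly as follows. This is only an illustration: the `LostResult` type, its field names, and the 7-day deadline window (inferred from 42 hours being 25% of the window) are assumptions, not the actual scheduler code:

```python
from dataclasses import dataclass

HOUR = 3600
EINSTEIN_DELAY_BOUND = 7 * 24 * HOUR  # assumed 7-day deadline window


@dataclass
class LostResult:
    """Hypothetical record for a result the host never received."""
    deadline: float                      # seconds since the epoch
    canonical_result_exists: bool = False
    error_flag: bool = False


def should_resend(result, now, delay_bound=EINSTEIN_DELAY_BOUND):
    """Return True to resend, False to mark the result as timed out."""
    time_left = result.deadline - now
    # "Within 25% of deadline": less than a quarter of the window remains
    # (42 hours when the window is 7 days).
    too_close = time_left < 0.25 * delay_bound
    return not (too_close
                or result.canonical_result_exists
                or result.error_flag)
```

For example, a result with 43 hours left and no flags set would be resent, while one with 41 hours left would be marked as timed out.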

I'll test this over the next few hours, and see if it has undesirable side effects.

Bruce

Director, Einstein@Home

BugG

Joined: 23 Feb 05

Posts: 8

Credit: 1016682

RAC: 0

Two WUs resent were added to

30 Jul 2005 1:20:15 UTC

Message 14764


Two resent WUs were added to the two WUs already in "Ready to run", which caused two problems:

1. The "No new work" setting for the project was ignored.
2. The deadline of the resent WUs is one day earlier than that of the WUs in "Ready to run." Four WUs in total must be finished within four days. (Besides EAH, I also participate in SAH and PAH on one PC.)

DeBugman Tokyo,Japan

S@NL - jurgenb

Joined: 20 Feb 05

Posts: 2

Credit: 12285

RAC: 0

RE: Please report good

30 Jul 2005 12:29:19 UTC

Message 14765


Quote:

Please report good and/or bad experiences with this feature in this thread.
Bruce

On July 26th, 2005, I posted a thread reporting 'Spooky WUs'.
These 16 missing 'Ghost WUs' were resent a couple of days later.
First of all, I never asked for 16 WUs: since I am running 5 different BOINC projects on my computer, "connect every X days" is set to 0.1 days.
Anyway, they were resent and I started running them. Endlessly:

I get the following error on a resent Ghost WU:
30/07/2005 13:30:27|Einstein@Home|Result l1_1480.5__1480.5_0.1_T00_S4lA_1 exited with zero status but no 'finished' file
30/07/2005 13:30:27|Einstein@Home|If this happens repeatedly you may need to reset the project.
30/07/2005 13:30:27||request_reschedule_cpus: process exited
This WU has been running for over 20 hours now (even though the BOINC Manager disagrees and claims that the CPU time is only 9 hours).

I am inclined to abort all 16 resent Ghost WUs.
Somehow this feels like a waste of time and effort.

Walt Gribben

Joined: 20 Feb 05

Posts: 219

Credit: 1645393

RAC: 0

RE: RE: Please report

30 Jul 2005 15:42:59 UTC

Message 14766 in response to message 14765


Quote:

Quote:

Please report good and/or bad experiences with this feature in this thread.
Bruce

On July 26th, 2005, I posted a thread reporting 'Spooky WUs'.
These 16 missing 'Ghost WUs' were resent a couple of days later.
First of all, I never asked for 16 WUs: since I am running 5 different BOINC projects on my computer, "connect every X days" is set to 0.1 days.
Anyway, they were resent and I started running them. Endlessly:

I get the following error on a resent Ghost WU:
30/07/2005 13:30:27|Einstein@Home|Result l1_1480.5__1480.5_0.1_T00_S4lA_1 exited with zero status but no 'finished' file
30/07/2005 13:30:27|Einstein@Home|If this happens repeatedly you may need to reset the project.
30/07/2005 13:30:27||request_reschedule_cpus: process exited
This WU has been running for over 20 hours now (even though the BOINC Manager disagrees and claims that the CPU time is only 9 hours).

I am inclined to abort all 16 resent Ghost WUs.
Somehow this feels like a waste of time and effort.

If the WU is causing problems, go ahead and abort it; that's one of the reasons the abort function was added to BoincManager. The same goes for the extra WUs that were downloaded: if you have too much work, abort the "extra" ones. They'll be reissued and someone else can process them. After aborting them, "update" the project so the status gets reported.

No idea why the running WU reports 9 hours after running for 20. It might be tied to the reason it's "exiting with no finished file". Before aborting the WU, check the stderr.txt file in the slots/n folder that E@H is running in; the reason the science app exits is written to that file. Usually you'll see something like "no heartbeat", meaning it lost communication with BOINC.
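
If there are several slot folders, a small script can do the looking. This is a rough sketch only: the helper name is made up, and you have to pass in the path to your own BOINC folder (the one that contains `slots`), since its location differs between installations:

```python
from pathlib import Path


def find_stderr_messages(boinc_dir, needle="no heartbeat"):
    """Map each slots/*/stderr.txt under boinc_dir to its lines containing needle.

    boinc_dir is whatever folder holds the "slots" directory on your machine;
    this helper is illustrative and not part of BOINC.
    """
    hits = {}
    for stderr_file in sorted(Path(boinc_dir).glob("slots/*/stderr.txt")):
        # errors="replace" keeps odd bytes in the log from stopping the scan.
        text = stderr_file.read_text(errors="replace")
        matching = [line for line in text.splitlines()
                    if needle.lower() in line.lower()]
        if matching:
            hits[str(stderr_file)] = matching
    return hits
```

Run it before aborting the WU, and use a different `needle` if the file shows some other message.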

Walt

John McLeod VII

Moderator

Joined: 10 Nov 04

Posts: 547

Credit: 632255

RAC: 0

RE: RE: The deduction of

31 Jul 2005 21:38:00 UTC

Message 14767 in response to message 14763


Quote:

Quote:
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days.

This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.
Quote:

Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

I don't know how to make this determination.

However I have just made the following changes. IF
- Work within 25% of deadline (42 hours for Einstein@Home), OR
- Work no longer needed (Canonical result already exists), OR
- Work unit has error flag set (something wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.

I'll test this over the next few hours, and see if it has undesirable side effects.

Bruce

This is actually in development (sort of) at the moment. Each WU carries enough information (its deadline and remaining runtime) to determine slack time. This should be used when sending any work, to make certain there is enough slack time before the deadline for the WU to have a chance of getting done.
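
As a rough illustration, slack time can be taken as the time left before the deadline minus the estimated remaining runtime. That definition and the function names below are assumptions made for this sketch, not the code under development, and the sketch ignores hosts that share time between projects or are not always on:

```python
def slack_seconds(deadline, now, remaining_runtime):
    """Time to spare before the deadline once the remaining work is done.

    All arguments are in seconds; deadline and now are on the same clock.
    """
    return (deadline - now) - remaining_runtime


def has_chance(deadline, now, remaining_runtime):
    """True if the WU could still finish before its deadline."""
    return slack_seconds(deadline, now, remaining_runtime) >= 0


# A WU due in 48 hours that needs 9 more hours has 39 hours of slack.
print(slack_seconds(48 * 3600, 0, 9 * 3600) / 3600)  # 39.0
```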

BOINC WIKI

Archangel

Joined: 25 Mar 05

Posts: 2

Credit: 83829

RAC: 0

got no work and every time i

1 Aug 2005 11:13:52 UTC

Message 14768


Got no work, and every time I try to update I get this group of messages:

01/08/2005 12:10:32|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:35|Einstein@Home|Unrecoverable error for result l1_1083.0__1083.1_0.1_T12_S4lA_2 (CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20))
01/08/2005 12:10:35||request_reschedule_cpus: start failed
01/08/2005 12:10:35||request_reschedule_cpus: process exit

Can anyone help?

I already tried resetting.

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

Based on the feedback in this

1 Aug 2005 12:57:41 UTC

Message 14769


Based on the feedback in this forum, I've made some additional modifications to the scheduler policy on resending lost workunits. Details may be found here: deadline_proposal.txt. This extends the deadlines (up to a total of an additional week) for machines that did not get the work when it was originally sent.

Bruce

Director, Einstein@Home

Walt Gribben

Joined: 20 Feb 05

Posts: 219

Credit: 1645393

RAC: 0

RE: got no work and every

1 Aug 2005 15:09:36 UTC

Message 14770 in response to message 14768


Quote:

Got no work, and every time I try to update I get this group of messages:

01/08/2005 12:10:32|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:35|Einstein@Home|Unrecoverable error for result l1_1083.0__1083.1_0.1_T12_S4lA_2 (CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20))
01/08/2005 12:10:35||request_reschedule_cpus: start failed
01/08/2005 12:10:35||request_reschedule_cpus: process exit

Can anyone help?

I already tried resetting.

Stop BOINC and restart it.

Did you get a download error earlier for the .exe file? There's a bug in 4.45 where it doesn't close the file after a temporary download error. Or maybe it says "transient"; either way, it retries the download after a minute. The retry opens a second instance of the file and closes that second instance. But since the first one is still open, Windows can't start the new process.

Windows will close the file when BOINC stops, so that should fix the problem.

Walt
