Ghost WU and resending lost results

paperdragon
Joined: 8 Mar 05
Posts: 16
Credit: 2255341
RAC: 0

All these close deadlines may not be the norm. Since this feature has just been turned on, it is resending all the stuff that has been sitting there for a while. In future the ghost work units should generally get resent well before the deadline, unless the host has been disconnected from the internet for some reason.

But I was thinking: if a host asks for xxxxx seconds of work, the server should resend any ghost units and subtract their run time from the request. It should then send new work only if the total time of the ghost units is less than what was requested.

For Example:
Request 20,000 seconds of work. Have 5,000 seconds of ghost units. You would only need to send 15,000 seconds of new work.
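
The arithmetic in this example can be written as a small Python sketch. It is an illustration of the proposal only, not the actual BOINC scheduler code; the function name `plan_work` and its arguments are hypothetical.

```python
def plan_work(requested_seconds, ghost_unit_seconds):
    """Split a work request into resent ghost work and new work.

    requested_seconds:  seconds of work the host asked for
    ghost_unit_seconds: estimated run time of each ghost unit, in seconds
    Returns (ghost_total, new_seconds). All ghost units are resent;
    new work only fills whatever part of the request is left over.
    """
    ghost_total = sum(ghost_unit_seconds)
    new_seconds = max(0, requested_seconds - ghost_total)
    return ghost_total, new_seconds


# The example from the post: 20,000 s requested, 5,000 s of ghost units.
print(plan_work(20_000, [3_000, 2_000]))  # (5000, 15000)
```

If the ghost units already cover the whole request, the sketch returns 0 seconds of new work, which matches the "only send new work if the ghost total is less than requested" rule.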

You like Myst? Uru Live returns! www.urulive.com

Grenadier
Joined: 9 Feb 05
Posts: 14
Credit: 2823344
RAC: 0

The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days. Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: The deduction of a

Message 14763 in response to message 14762

Quote:
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days.


This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.

Quote:

Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.


I don't know how to make this determination.

However I have just made the following changes. IF
- Work within 25% of deadline (42 hours for Einstein@Home), OR
- Work no longer needed (Canonical result already exists), OR
- Work unit has error flag set (something wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.
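
As a rough illustration of the rule above, here is a Python sketch. It is not the scheduler's real code: the `GhostResult` fields and the function name are hypothetical, and it assumes "within 25% of deadline" means that less than 25% of the original time allowance remains (42 hours of a 7-day allowance).

```python
from dataclasses import dataclass


@dataclass
class GhostResult:
    sent_time: float       # when the result was first sent (seconds)
    deadline: float        # report deadline (seconds, same clock)
    canonical_found: bool  # a canonical result already exists
    error_flag: bool       # the workunit has its error flag set


def should_resend(result, now, fraction=0.25):
    """Return True to resend the ghost result, False to mark it timed out."""
    allowance = result.deadline - result.sent_time
    remaining = result.deadline - now
    if remaining < fraction * allowance:  # too close to the deadline
        return False
    if result.canonical_found:            # work no longer needed
        return False
    if result.error_flag:                 # something is wrong with the WU
        return False
    return True
```

With a 7-day allowance, `fraction * allowance` is 0.25 * 168 hours = 42 hours, the figure quoted for Einstein@Home.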

I'll test this over the next few hours, and see if it has undesirable side effects.

Bruce

Director, Einstein@Home

BugG
Joined: 23 Feb 05
Posts: 8
Credit: 1016682
RAC: 0

Two resent WUs were added to the two WUs already in "Ready to run", which caused two problems for me:

1. The "No new work" setting for the project was ignored.
2. The deadline of the resent WUs is one day earlier than that of the WUs in "Ready to run." Four WUs in total must be finished within four days. (Besides EAH, I also participate in SAH and PAH on one PC.)

DeBugman Tokyo,Japan

S@NL - jurgenb
Joined: 20 Feb 05
Posts: 2
Credit: 12285
RAC: 0

RE: Please report good

Quote:

Please report good and/or bad experiences with this feature in this thread.
Bruce

On July 26th, 2005, I posted a thread to report 'Spooky WU's.'
These 16 missing 'Ghost WUs' were resent a couple of days later.
First of all, I never asked for 16 WUs: since I am running 5 different BOINC projects on my computer, "connect every X days" is set to 0.1 days.
Anyway, they were resent and I started running them. They run endlessly:

I get the following error on a resent Ghost WU:
30/07/2005 13:30:27|Einstein@Home|Result l1_1480.5__1480.5_0.1_T00_S4lA_1 exited with zero status but no 'finished' file
30/07/2005 13:30:27|Einstein@Home|If this happens repeatedly you may need to reset the project.
30/07/2005 13:30:27||request_reschedule_cpus: process exited
This WU has been running for over 20 hours now (even though the BOINC Manager disagrees and claims that CPU time is only 9 hours).

I am inclined to abort all 16 resent Ghost WUs.
Somehow this feels like a waste of time and effort.

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

RE: RE: Please report

Message 14766 in response to message 14765

Quote:
Quote:

Please report good and/or bad experiences with this feature in this thread.
Bruce

On July 26th, 2005, I posted a thread to report 'Spooky WU's.'
These 16 missing 'Ghost WUs' were resent a couple of days later.
First of all, I never asked for 16 WUs: since I am running 5 different BOINC projects on my computer, "connect every X days" is set to 0.1 days.
Anyway, they were resent and I started running them. They run endlessly:

I get the following error on a resent Ghost WU:
30/07/2005 13:30:27|Einstein@Home|Result l1_1480.5__1480.5_0.1_T00_S4lA_1 exited with zero status but no 'finished' file
30/07/2005 13:30:27|Einstein@Home|If this happens repeatedly you may need to reset the project.
30/07/2005 13:30:27||request_reschedule_cpus: process exited
This WU has been running for over 20 hours now (even though the BOINC Manager disagrees and claims that CPU time is only 9 hours).

I am inclined to abort all 16 resent Ghost WUs.
Somehow this feels like a waste of time and effort.

If the WU is causing problems, go ahead and abort it; that's one of the reasons the abort function was added to BoincManager. The same goes for the extra WUs that were downloaded: if you have too much work, abort the "extra" ones. They'll be reissued and someone else can process them. After aborting them, "update" the project so the status gets reported.

No idea why the running WU reports 9 hours after running for 20. It might be tied to the reason it's "exiting with no finished file". Check the stderr.txt file in the slots/n folder E@H is running in. Do that before aborting the WU, since the reason the science app exits is written to that file. Usually you'll see something like "no heartbeat", meaning it lost communication with BOINC.
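
If you would rather not hunt through the slot folders by hand, a short Python sketch like the one below can list every slots/n/stderr.txt and show the last lines of each. This is only a convenience script, not part of BOINC; the function names are made up, and you have to pass in the BOINC directory for your own installation.

```python
from pathlib import Path


def slot_stderr_files(boinc_dir):
    """Return the stderr.txt paths found in slots/<n> under boinc_dir."""
    return sorted(Path(boinc_dir).glob("slots/*/stderr.txt"))


def tail(path, lines=20):
    """Return the last `lines` lines of a text file as a list of strings."""
    text = Path(path).read_text(errors="replace")
    return text.splitlines()[-lines:]


def print_slot_errors(boinc_dir):
    """Print the tail of every slot's stderr.txt (call this yourself)."""
    for path in slot_stderr_files(boinc_dir):
        print(f"--- {path} ---")
        for line in tail(path):
            print(line)
```

Call `print_slot_errors(...)` with the folder that contains the `slots` directory, and look for messages such as "no heartbeat" near the end of the output.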

Walt

John McLeod VII
Moderator
Joined: 10 Nov 04
Posts: 547
Credit: 632255
RAC: 0

RE: RE: The deduction of

Message 14767 in response to message 14763

Quote:
Quote:
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days.

This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.
Quote:

Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.

I don't know how to make this determination.

However I have just made the following changes. IF
- Work within 25% of deadline (42 hours for Einstein@Home), OR
- Work no longer needed (Canonical result already exists), OR
- Work unit has error flag set (something wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.

I'll test this over the next few hours, and see if it has undesirable side effects.

Bruce


This is actually in development (sort of) at the moment. There is enough information (information about the deadline and remaining runtime) for each WU to determine slack time. This should be used when sending any work to make certain that there is enough slack time before the deadline for the WU to have a chance of getting it done.
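
To make the idea of slack time concrete, here is a minimal Python sketch. It is not the code under development: the function names are hypothetical, and the queue check assumes a single CPU that runs work in earliest-deadline-first order without interruption.

```python
def slack_time(deadline, now, remaining_runtime):
    """Seconds to spare if the WU started right now and ran without pause."""
    return (deadline - now) - remaining_runtime


def can_meet_deadlines(work, now):
    """Check a queue of (deadline, remaining_runtime) pairs.

    Assumes one CPU processing the WUs in earliest-deadline-first order.
    Returns False as soon as one WU would finish after its deadline.
    """
    elapsed = 0.0
    for deadline, runtime in sorted(work):
        elapsed += runtime
        if now + elapsed > deadline:
            return False
    return True
```

Before sending (or resending) a WU, a scheduler following this idea would add the candidate to the host's queue and send it only if `can_meet_deadlines` still holds.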

Archangel
Joined: 25 Mar 05
Posts: 2
Credit: 83829
RAC: 0

Got no work, and every time I try to update I get this group of messages:

01/08/2005 12:10:32|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:35|Einstein@Home|Unrecoverable error for result l1_1083.0__1083.1_0.1_T12_S4lA_2 (CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20))
01/08/2005 12:10:35||request_reschedule_cpus: start failed
01/08/2005 12:10:35||request_reschedule_cpus: process exit

Can anyone help?

I already tried resetting.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

Based on the feedback in this forum, I've made some additional modifications to the scheduler policy on resending lost workunits. Details may be found here: deadline_proposal.txt. This extends the deadlines (up to a total of an additional week) for machines that did not get the work when it was originally sent.

Bruce

Director, Einstein@Home

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

RE: got no work and every

Message 14770 in response to message 14768

Quote:

Got no work, and every time I try to update I get this group of messages:

01/08/2005 12:10:32|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:35|Einstein@Home|Unrecoverable error for result l1_1083.0__1083.1_0.1_T12_S4lA_2 (CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20))
01/08/2005 12:10:35||request_reschedule_cpus: start failed
01/08/2005 12:10:35||request_reschedule_cpus: process exit

Can anyone help?

I already tried resetting.

Stop BOINC and restart it.

Did you get a download error earlier for the .exe file? There's a bug in 4.45 where it doesn't close the file after a temporary download error. (The message may say "transient" instead; either way, it retries the download after a minute.) The retry opens a second instance of the file and closes that second instance. But since the first one is still open, Windows can't start the new process.

Windows will close the file when BOINC stops, which should fix the problem.

Walt
