All these close deadlines may
All these close deadlines may not be the norm. Since this feature has just been turned on, it is resending all the stuff that has been sitting there for a while. In future, the ghost work units should generally get resent well before the deadline, unless the host has been disconnected from the internet for some reason.
But I was thinking: if a host asks for xxxxx seconds of work, the server should resend any ghost units and subtract their total time from the request. It should then send new work only if the ghost units' total time is less than what was requested.
For example:
Request 20,000 seconds of work. Have 5,000 seconds of ghost units. You would only need to send 15,000 seconds of new work.
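The arithmetic in that example can be written down as a small sketch. This is only an illustration of the proposal, not the scheduler's real code; the function name `plan_work` and its arguments are made up for the example.

```python
def plan_work(requested_seconds, ghost_unit_seconds):
    """Split one work request into resent ghost work and new work.

    requested_seconds  -- seconds of work the host asked for
    ghost_unit_seconds -- estimated run time of each ghost unit (hypothetical input)
    Returns (resend_seconds, new_work_seconds).
    """
    # Ghost units are always resent, so their total counts against the request.
    resend_seconds = sum(ghost_unit_seconds)
    # New work is only sent for whatever part of the request is still unfilled.
    new_work_seconds = max(0, requested_seconds - resend_seconds)
    return resend_seconds, new_work_seconds


if __name__ == "__main__":
    # The case from the post: 20,000 s requested, 5,000 s of ghost units.
    print(plan_work(20_000, [3_000, 2_000]))
```

If the ghost units already cover the whole request, the new-work part simply comes out as zero.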
You like Myst? Uru Live returns! www.urulive.com
The deduction of a resend
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days. Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.
RE: The deduction of a
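Quote:
I'd also like to see the deadline recalculated from TODAY, for another 7 days.
This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.
Quote:
Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.
I don't know how to make this determination.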
However, I have just made the following changes. IF
- the work is within 25% of its deadline (42 hours for Einstein@Home), OR
- the work is no longer needed (a canonical result already exists), OR
- the work unit has its error flag set (something went wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.
I'll test this over the next few hours, and see if it has undesirable side effects.
Bruce
Director, Einstein@Home
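For readers who want the three conditions above in one place, here is a rough sketch of the decision. It is not the actual scheduler code. The `LostResult` record and the function names are hypothetical, and the 7-day deadline period is inferred from the "42 hours" figure (25% of 168 hours).

```python
from dataclasses import dataclass

HOUR = 3600
DEADLINE_PERIOD = 7 * 24 * HOUR  # assumed 7-day deadline (25% of it is 42 hours)
EXPIRE_FRACTION = 0.25           # "within 25% of deadline"


@dataclass
class LostResult:
    """Hypothetical view of a result the host never received."""
    deadline: float               # report deadline, seconds since the epoch
    canonical_result_exists: bool
    error_flag_set: bool


def should_expire(result, now):
    """True if the result should be marked timed out instead of resent."""
    too_close = (result.deadline - now) < EXPIRE_FRACTION * DEADLINE_PERIOD
    return too_close or result.canonical_result_exists or result.error_flag_set


def handle_lost_result(result, now):
    """Return 'expire' or 'resend' for one lost result."""
    return "expire" if should_expire(result, now) else "resend"
```

Any one of the three conditions is enough to expire the result; only a result that passes all three checks gets resent.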
Two WUs resent were added to
Two resent WUs were added to the two WUs already in "Ready to run", which caused two problems for me:
1. The "No new work" setting for the project was ignored.
2. The deadline of the resent WUs is one day earlier than that of the WUs in "Ready to run." Four WUs in total must be finished within four days. (Besides EAH, I also participate in SAH and PAH with one PC.)
DeBugman Tokyo,Japan
RE: Please report good
Quote:
Please report good and/or bad experiences with this feature in this thread.
On July 26th, 2005, I posted a thread to report about 'Spooky WUs.'
These 16 missing 'Ghost WUs' were resent a couple of days later.
First of all, I never asked for 16 WUs; since I am running 5 different BOINC projects on my computer, "connect every X days" is set to 0.1 days.
Anyway, I got them resent and started running them. Endlessly:
I get the following error on a resent Ghost WU:
30/07/2005 13:30:27|Einstein@Home|Result l1_1480.5__1480.5_0.1_T00_S4lA_1 exited with zero status but no 'finished' file
30/07/2005 13:30:27|Einstein@Home|If this happens repeatedly you may need to reset the project.
30/07/2005 13:30:27||request_reschedule_cpus: process exited
This WU has been running for over 20 hours now (even though the BOINC Manager contradicts this and claims that CPU time is only 9 hours).
I am inclined to abort all 16 resent Ghost WUs.
Somehow this feels like a waste of time and effort.
RE: RE: Please report
If the WU is causing problems, go ahead and abort it; that's one of the reasons the abort function was added to BoincManager. Same with the extra WUs that were downloaded: if you have too much work, abort the "extra" ones. They'll be reissued and someone else can process them. After aborting them, "update" the project so the status gets reported.
No idea why the running WU reports 9 hours after running for 20. It might be tied to the reason it's "exiting with no finished file". Before aborting the WU, check the stderr.txt file in the slots/n folder E@H is running in; the reason the science app exits is written to that file. Usually you'll see something like "no heartbeat", meaning it lost communication with BOINC.
Walt
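If there are several slot folders, a small script can do the looking. The sketch below scans every slots/n/stderr.txt under a BOINC directory for a phrase. The function name is made up, the BOINC directory is whatever your install uses, and the exact wording of the "no heartbeat" message may differ between versions, so treat the search phrase as an assumption.

```python
from pathlib import Path


def find_stderr_messages(boinc_dir, needle="no heartbeat"):
    """Return (file, line) pairs for stderr.txt lines that contain `needle`.

    boinc_dir -- the BOINC directory that holds the slots/ folder
    needle    -- phrase to look for, matched case-insensitively
    """
    hits = []
    # Each running task gets its own numbered folder: slots/0, slots/1, ...
    for stderr_file in sorted(Path(boinc_dir).glob("slots/*/stderr.txt")):
        text = stderr_file.read_text(errors="replace")
        for line in text.splitlines():
            if needle.lower() in line.lower():
                hits.append((stderr_file, line.strip()))
    return hits
```

Call it with the path to your own BOINC folder and print the returned pairs; an empty list just means the phrase was not found, not that nothing went wrong.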
RE: RE: The deduction of
This is actually in development (sort of) at the moment. There is enough information (the deadline and the remaining runtime) for each WU to determine slack time. This should be used when sending any work, to make certain that there is enough slack time before the deadline for the WU to have a chance of getting done.
BOINC WIKI
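One way to read "slack time" is the time left until the deadline minus the remaining runtime. That definition is an assumption for this sketch, and so are the function names; the real client also has to account for other projects sharing the CPU, which is ignored here.

```python
def slack_seconds(deadline, now, remaining_runtime):
    """Time to spare if the WU ran without interruption from `now` on.

    All three arguments are in seconds; a negative result means the WU
    cannot finish before its deadline.
    """
    return (deadline - now) - remaining_runtime


def has_enough_slack(deadline, now, remaining_runtime, min_slack=0.0):
    """True if the WU still has at least `min_slack` seconds to spare."""
    return slack_seconds(deadline, now, remaining_runtime) >= min_slack
```

A server-side check along these lines would skip sending (or resending) a WU whenever `has_enough_slack` comes back false.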
got no work and every time i
Got no work, and every time I try to update I get this group of messages:
01/08/2005 12:10:32|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:33|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:34|Einstein@Home|CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
01/08/2005 12:10:35|Einstein@Home|Unrecoverable error for result l1_1083.0__1083.1_0.1_T12_S4lA_2 (CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20))
01/08/2005 12:10:35||request_reschedule_cpus: start failed
01/08/2005 12:10:35||request_reschedule_cpus: process exit
Can anyone help?
I already tried resetting.
Based on the feedback in this
Based on the feedback in this forum, I've made some additional modifications to the scheduler policy on resending lost workunits. Details may be found here: deadline_proposal.txt. This extends the deadlines (up to a total of an additional week) for machines that did not get the work when it was originally sent.
Bruce
Director, Einstein@Home
RE: got no work and every
Stop BOINC and restart it.
Did you get a download error earlier for the .exe file? There's a bug in 4.45 where it doesn't close the file after a temporary download error. Or maybe it says "transient"; either way, it retries the download after a minute. The retry opens a second instance of the file and closes that second instance. But since the first one is still open, Windows can't start the new process.
Windows will close the file when BOINC stops; that should fix the problem.
Walt