Scheduler Bug with work requests for O2MD1 work - ### Staff Please Read ###

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117785141936
RAC: 34711840

Zalster wrote:

I'm seeing this when I try to find out why my computer won't download any new work units

[version] beta test app versions not allowed in project prefs.

I'm a bit confused. That first line would be a reason why you don't get GPU tasks (which are beta), but later on, in the bottom section of the log you posted, it says the opposite - host will accept beta work.  Is this all part of one log, or are there bits of two totally different logs joined together?

Also, the bottom section mentions 288 lost tasks.  What happened to cause the tasks to become lost?

From what I've seen and from what Richie mentions, even if you get the scheduler to resend the lost tasks, they will come as CPU tasks and not GPU tasks.  I don't imagine you would want that to happen.

When you tried to reset the project, what does "but no go" actually mean?  Was there some sort of response or just complete silence?  I don't know that we're going to get much satisfaction until one of the Devs gets to take a look at the whole business.  If you really get desperate, can you detach from the project and then reattach?

Sorry I can't offer anything more hopeful about how to resolve this.

EDIT:  I've just seen your further response which must have arrived whilst I was composing the above.

As predicted, you are in for 288 CPU task resends.  They will come in batches of 12 at a time with each new batch being triggered by a scheduler request which you can force with an 'update'.  To get rid of them you will need to abort them as they arrive.  If you don't physically abort them, the scheduler will be quite persistent in trying to give them back to you (eg after a further reset).
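
For anyone facing the same cleanup, here's one rough way it could be scripted - a minimal sketch only, not an official BOINC tool.  It assumes boinccmd is on the PATH, that PROJECT_URL matches whatever URL your client shows for the project, and that matching "O2MD1" in the task name is a good-enough filter for the resends (adjust all of that to suit your own setup):

# Minimal sketch: force scheduler requests and abort the resent CPU tasks as
# they arrive in batches.  Assumptions as noted in the text above.
import subprocess
import time

PROJECT_URL = "https://einsteinathome.org/"   # adjust to whatever your client uses

def boinccmd(*args):
    # thin wrapper around the standard boinccmd CLI
    return subprocess.run(["boinccmd", *args],
                          capture_output=True, text=True).stdout

def abort_resends():
    # boinccmd --get_tasks lists each task with a "name:" line
    for line in boinccmd("--get_tasks").splitlines():
        if line.strip().startswith("name:"):
            name = line.split("name:", 1)[1].strip()
            if "O2MD1" in name:                       # crude filter - adjust to suit
                boinccmd("--task", PROJECT_URL, name, "abort")

for _ in range(24):                   # 288 lost tasks / 12 per batch = 24 requests
    boinccmd("--project", PROJECT_URL, "update")      # force a scheduler request
    time.sleep(90)                    # give the request and downloads time to finish
    abort_resends()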

Cheers,
Gary.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

That was just 1 report. I cut it off because there are 288 critical messages and I didn't want to post all of those on here, so I took a few from the beginning and the end.

Keith mentioned they were retiring certain APPS.  He led me to believe that I still had those APPS on my computer when they were retired.  Any work units I had were targeted at those APPS, so when they were removed, the work units didn't have an APP to run on. As such, they "disappeared" after the removal of the APPS.

This is all speculation on my part. I would have thought the work units would change over to the new APP instead of "disappearing" and causing the error messages.  Who knows.  I've started to abort the CPU work units as this machine isn't set up for them.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117785141936
RAC: 34711840

Zalster wrote:
That was just 1 report. I cut it off because there are 288 critical messages and I didn't want to post all of those on here, so I took a few from the beginning and the end.

OK, I understand fully now.  Thanks for that.

Zalster wrote:
Keith mentioned they were retiring certain APPS.  He led me to believe that I still had those APPS on my computer when they were retired.  Any work units I had were targeted at those APPS, so when they were removed, the work units didn't have an APP to run on. As such, they "disappeared" after the removal of the APPS.

There have been three recent GW GPU versions.  The O2AS20-500 app (V1.09) was the first that started working properly and giving valid results.  It was announced that they were going to finish up that search almost immediately.  It doesn't appear to have been fully completed, but I think there might have been some urgency to get the "Multi-Directed" search underway.  My guess was that by directing the search towards known pulsars, there was a greater hope of detecting something than by just continuing to search the whole sky.

To do this, the new O2MD1 search was launched using the successful V1.09 app that had a further 'fix' applied and was rebranded as V1.10.  That ran for a little while until it was decided there needed to be another modification to the app to improve sensitivity.  So to achieve that, V2.00 and V2.01 apps were released, but still using the same large data files.  Unfortunately, there was no comment about what was to happen to the V1.10 tasks that were already out in the wild.

So, really, at that point there were 3 apps, still all 'current', with no official word about what was to happen with remaining tasks and any future resends for any of them.  Until all potential resends for failed tasks have been dealt with, you're not supposed to delete any app versions.

Logforme asked the question about what to do with those V1.10 tasks.  There has been no official answer and the tasks still remain current so it seems like they are expected to be crunched.  I tried to test that by causing a couple to become lost.  Since they got sent back as CPU tasks it seems like the project wants them to be crunched.  If you would rather run the new app (V2.x) at higher sensitivity, the only real course of action would be to abort those GPU tasks that were 'branded' as V1.10.  By doing that, they would be sent to some other sucker so, personally, I've been quite reluctant to abort anything.  I've crunched the ones I had on one machine, since it ran out of all GW stuff and couldn't get more.  I've saved them on a 2nd machine, while waiting for some sort of official comment about what the project wants to happen.  I'll crunch them soon if there continues to be no official comment about what should happen.

In your case, it looks like deleting the app that the 288 tasks depended on has caused the tasks to become 'lost'.  I wonder why the project just didn't cause a fresh copy of the app to be downloaded.  I seem to recall that deleting apps that are still current (ie. still listed in the state file) just causes a new copy to be downloaded.  Maybe a 'missing' app file just causes all tasks (<result> files) that depend on the app to be removed as well.  Then, a new copy of the app could be sent when the lost tasks were resent.
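
If anyone wants to check that sort of dependency before touching app files, something along these lines might help.  It's a rough sketch only, not a supported tool, and the tag names (<workunit>, <result>, <app_name>, <wu_name>) are from memory of the BOINC client_state.xml layout, so verify them against your own copy.  Run it against a copy of the file, with the client stopped:

# Rough sketch: list the tasks in client_state.xml that still depend on a
# given app, so you can see what would be orphaned before deleting anything.
import xml.etree.ElementTree as ET

STATE_FILE = "client_state.xml"       # work on a copy, client stopped
APP_NAME = "einstein_O2MD1"           # the app being retired

root = ET.parse(STATE_FILE).getroot()

# workunits that target the app in question
wus = {wu.findtext("name")
       for wu in root.iter("workunit")
       if wu.findtext("app_name") == APP_NAME}

# results (tasks) whose workunit is in that set
orphans = [r.findtext("name")
           for r in root.iter("result")
           if r.findtext("wu_name") in wus]

print(f"{len(orphans)} task(s) still reference {APP_NAME}:")
for name in orphans:
    print(" ", name)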

Zalster wrote:
This is all speculation on my part. I would have thought the work units would change over to the new APP instead of "disappearing" and causing the error messages.  Who knows.  I've started to abort the CPU work units as this machine isn't set up for them.

At some point as you continue aborting, your allowed daily quota will be gone so unless you have something else left over from EAH to crunch, you might have to sit in the 'sin bin' until a new day ticks over :-(.

Cheers,
Gary.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Finally got to the remote location. Had to remove the project, then reattach and download 12 work units at a time that were originally GPU but came back as CPU. Abort those, then report, detach from the project, reattach and repeat. Took a bit, but it finally cleared all of the "lost" tasks and now it's downloading fresh new GPU work units for the computer.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Ok, I was able to observe the sequence of events.

BOINC connects to the server and asks for work, then gets a response from the server saying there is no work available. However, the connection log shows a critical error message that there is no app available for the work it just sent.

So what is happening is that the BOINC manager asks for work and gets a "no work available" reply, but another part of the server records the work as having been sent without actually sending it. So nothing ever downloads, yet it is marked as if it did download.

Thus the "lost" work units.

So, until someone figures out what the server is doing, I'm not going to be able to do any more GPU GW.  Looks like Gamma-ray is going to be the work unit of choice as it ACTUALLY downloads to the machine.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Three hosts had run out of work today. Two of them were experiencing the same old stuff but one host shows this kind of critical error:

...

2019-10-22 15:51:31.1911 [PID=14548] Only one Beta app version result per WU (#423564096, re#608)
2019-10-22 15:51:31.2282 [PID=14548] [send] send_old_work() no feasible result younger than 238.2 hours and older than 168.0 hours
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] build_working_set_namelist(fasthost): pattern .* not found (empty directory)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [mixed] sending non-locality work second
2019-10-22 15:51:31.2487 [PID=14548] [send] [HOST#10685355] will accept beta work. Scanning for beta work.
2019-10-22 15:51:31.2635 [PID=14548] [debug] [HOST#10685355] MSG(high) No work sent
2019-10-22 15:51:31.2636 [PID=14548] [debug] [HOST#10685355] MSG(high) see scheduler log messages on https://einsteinathome.org/host/10685355/log
2019-10-22 15:51:31.2636 [PID=14548] Sending reply to [HOST#10685355]: 0 results, delay req 60.00

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Richie wrote:

Three hosts had run out of work today. Two of them were experiencing the same old stuff but one host shows this kind of critical error:

...

2019-10-22 15:51:31.1911 [PID=14548] Only one Beta app version result per WU (#423564096, re#608)
2019-10-22 15:51:31.2282 [PID=14548] [send] send_old_work() no feasible result younger than 238.2 hours and older than 168.0 hours
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] build_working_set_namelist(fasthost): pattern .* not found (empty directory)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2019-10-22 15:51:31.2288 [PID=14548] [mixed] sending non-locality work second
2019-10-22 15:51:31.2487 [PID=14548] [send] [HOST#10685355] will accept beta work. Scanning for beta work.
2019-10-22 15:51:31.2635 [PID=14548] [debug] [HOST#10685355] MSG(high) No work sent
2019-10-22 15:51:31.2636 [PID=14548] [debug] [HOST#10685355] MSG(high) see scheduler log messages on https://einsteinathome.org/host/10685355/log
2019-10-22 15:51:31.2636 [PID=14548] Sending reply to [HOST#10685355]: 0 results, delay req 60.00

Check your machines - I see at least 4 of your machines showing the same message.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Zalster wrote:
Check your machines - I see at least 4 of your machines showing the same message.

Yes, it seems none of my hosts is able to get GW GPU tasks at the moment. Some of them show that "pattern not found" error and some show the regular "WU# ... too old". Looks like this transforming disease has taken over all of them at some point today. A couple of machines still have a few tasks in queue but only for a few hours.

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Same here, insofar as I can figure out.  I am sure only that I can get no GPU work for my RX 570 (Win7 64-bit).

https://einsteinathome.org/host/12790950/log

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117785141936
RAC: 34711840

Jim1348 wrote:
... I can get no GPU work for my RX 570 (Win7 64-bit).

I'm guessing it is now a case of no further primary GPU tasks to distribute and maybe just some resends.  If you look at the server status page, there seem to be lots of tasks, but because of the limit of only one GPU task per quorum, perhaps those available tasks belong to workunits where a GPU task has already been allocated.  An excerpt from your host's log shows

[version] Best version of app einstein_O2MD1 is 2.01 ID 1203 GW-opencl-ati (251.25 GFLOPS)
[PID=8998 ] Only one Beta app version result per WU (#423425534, re#1)
[PID=8998 ] Only one Beta app version result per WU (#423225000, re#2)
[PID=8998 ] Only one Beta app version result per WU (#423225005, re#3)
[PID=8998 ] Only one Beta app version result per WU (#423481920, re#4)
[PID=8998 ] Only one Beta app version result per WU (#423482853, re#5)
.....
[PID=8998 ] Only one Beta app version result per WU (#423349256, re#117)
.....
[PID=8998 ] [CRITICAL] build_working_set_namelist(fasthost): pattern .* not found (empty directory)

I interpret the above as showing that the scheduler was unable to find any available workunits where a GPU task hadn't already been allocated, and so the build_working_set_namelist() function couldn't build a list of available tasks because the cupboard was bare :-).  There is no O2MD1 search progress block on the server status page, so we have no idea how many tasks there are in total or what the current progress is, but I think we may be close to finished.
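
Just to illustrate how I read that constraint, here is a toy sketch - this is not the actual Einstein@Home scheduler code, just a made-up mirror of the behaviour the log lines suggest: each workunit may carry at most one result assigned to a beta (GPU) app version, so once every remaining workunit already has its beta slot taken, the host's candidate list comes up empty and no work is sent.

# Purely illustrative sketch of the apparent scheduler behaviour (not real code).
def pick_sendable_workunits(candidate_wus, host_accepts_beta=True):
    sendable = []
    for wu in candidate_wus:
        if host_accepts_beta and wu.get("has_beta_result"):
            # matches the repeated "Only one Beta app version result per WU (#...)" lines
            continue
        sendable.append(wu)
    if not sendable:
        # matches the point where the log falls through to the empty working-set errors
        print("[CRITICAL] no feasible workunits for this host - no work sent")
    return sendable

# example: every candidate WU already has its one beta result allocated
print(pick_sendable_workunits([{"has_beta_result": True}] * 3))   # -> []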

I gave up on the hosts that had problems when I first documented what I was seeing at the start of this thread.  I moved on to 2 new hosts that were able to get plenty of work without the baggage of a previous history with O2MD1 tasks.  These two have been running fine ever since, with the frequency field in the task names marching ever upwards from about 150Hz to 395.75 for one and 399.50 for the other.  I'm guessing the limit may be 400Hz, in which case all currently available GPU tasks may now be pretty much allocated.

Both these hosts have stopped getting new tasks (one has actually received 2 recent resends).  They will both be out of work (and back on FGRPB1G) in about a day or so.  Both of these show the same sort of "Only one Beta app version result ..." as listed above, which leads me to guess we are out of available work.  Maybe the beta test stage will finish and a whole bunch of new stuff will suddenly appear.  It would be nice if somebody would occasionally give us a clue about what to expect next.

Cheers,
Gary.
