Scheduler exchange can cause client deferral for 24 hours

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118426881830
RAC: 25919356
Topic 217218

It doesn't happen frequently but I've seen it before and it happened to one of my hosts again last night.

I have several scripts that run on a central server machine that allow me to monitor the health of all hosts in my fleet.  These scripts regularly use ssh connections to all hosts on the LAN and use boinccmd to check client performance.  Two key parameters that are monitored are CPU clock ticks used in support of GPU tasks and last RPC time for contact with the project servers.  The first parameter can easily identify GPU crashes where the host itself continues to run normally.  The second one is useful in knowing that there are no network issues and that a host is regularly clearing completed tasks.

When I checked the overnight logs this morning, I noticed that a particular host had about 5 consecutive entries showing an ever-growing interval since its last scheduler RPC.  By default I allow a maximum interval of close to 2 hours before flagging a warning.  By the time I saw the log entries, the interval had grown to more than 6 hours.
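A minimal sketch of this kind of staleness check, not the actual monitoring scripts (those query boinccmd over ssh). As an assumption, it reads timestamps from client event-log lines in the format quoted later in this post, and the function names and the exact 2-hour limit are illustrative only.

```python
from datetime import datetime, timedelta

# Timestamp layout copied from the event-log lines quoted in this thread,
# e.g. "Sun 09 Dec 2018 10:25:06 PM EST". The trailing zone name is dropped
# because parsing zone abbreviations is platform dependent.
STAMP_FORMAT = "%a %d %b %Y %I:%M:%S %p"
DONE_MARKER = "Scheduler request completed"


def parse_stamp(line):
    """Return the naive local timestamp at the start of an event-log line."""
    stamp = line.split(" | ", 1)[0].strip()
    without_zone = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(without_zone, STAMP_FORMAT)


def last_contact(lines):
    """Return the time of the last completed scheduler request, or None."""
    latest = None
    for line in lines:
        if DONE_MARKER in line:
            latest = parse_stamp(line)
    return latest


def is_overdue(lines, now, limit=timedelta(hours=2)):
    """True when no scheduler request completed within `limit` before `now`."""
    latest = last_contact(lines)
    return latest is None or now - latest > limit
```

Calling `is_overdue(lines, datetime.now())` on the lines of a host's stdoutdae.txt would flag a host in the state described above. The timestamps are treated as naive local times, so the check assumes the monitoring machine and the host share a time zone.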

I opened BOINC Manager on the host in question and found that there was a continuing deferral of some 17.5 hours still to run.  I went back through the event log to find when the deferral had been initiated.  Here is a snippet from the log that shows what had happened.

Sun 09 Dec 2018 10:25:00 PM EST | Einstein@Home | <![CDATA[Sending scheduler request: To report completed tasks.]]>
Sun 09 Dec 2018 10:25:00 PM EST | Einstein@Home | <![CDATA[Reporting 5 completed tasks]]>
Sun 09 Dec 2018 10:25:00 PM EST | Einstein@Home | <![CDATA[Not requesting tasks: don't need (CPU: ; AMD/ATI GPU: job cache full)]]>
Sun 09 Dec 2018 10:25:06 PM EST | Einstein@Home | <![CDATA[Scheduler request completed]]>
Sun 09 Dec 2018 10:25:06 PM EST | Einstein@Home | <![CDATA[platform 'x86_64-pc-linux-gnu' not found]]>

At 10:25 PM local time (12:25 UTC) a routine report of completed work was initiated.  The request completed successfully, BUT you can see the unexpected final line in the snippet.  The scheduler decided that it no longer recognised the otherwise very standard 64-bit Linux platform and consequently put the host into a 24-hour backoff.
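For anyone who wants to look for the same symptom in their own logs, here is a rough sketch that scans event-log lines for the message shown in the snippet. The regular expression is modelled only on that single quoted line, and the function name is hypothetical.

```python
import re

# Pattern based on the one message quoted above:
#   platform 'x86_64-pc-linux-gnu' not found
PLATFORM_NOT_FOUND = re.compile(r"platform '([^']+)' not found")


def rejected_platforms(lines):
    """Yield (line_number, platform) for each 'platform not found' message."""
    for number, line in enumerate(lines, start=1):
        match = PLATFORM_NOT_FOUND.search(line)
        if match:
            yield number, match.group(1)
```

Because the pattern uses `search`, it matches whether or not the message is still wrapped in `<![CDATA[...]]>`.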

All I can think of is that the scheduler must check the list of allowed platforms, perhaps with some sort of timeout on retrieving that information, and if it can't get confirmation (perhaps through some sort of race condition) it just tells the client to bugger off for 24 hours.  Yeah, what a good idea :-).

I guess server congestion is probably a key factor, and the current batch of fast-finishing tasks is probably exacerbating the problem.  It doesn't really affect me because my scripts report this very promptly.  However, it would be nice if something could be done to prevent this from happening in the first place, for the benefit of others who aren't able to monitor so closely.

 

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2982030545
RAC: 753776

Gary, I saw your report in the boinccmd thread we're working on, and again here.

Could you be a little bit more explicit about how the actual 24-hour backoff is activated? If you roll the message log forward, is there a line like

10/12/2018 10:06:16 | Einstein@Home | Project requested delay of 60 seconds

or

10/12/2018 10:06:16 | Einstein@Home | [sched_op] Deferring communication for 00:01:00
10/12/2018 10:06:16 | Einstein@Home | [sched_op] Reason: requested by project

Or does the client choose to make the backoff by itself?
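A small sketch of how the two kinds of line quoted above could be picked out of a log and turned into a delay in seconds. It assumes only the two message shapes shown in this post (including the HH:MM:SS form of the deferral); any other wording the client may use is not covered, and the function name is hypothetical.

```python
import re

# Both patterns are taken from the example lines quoted above.
PROJECT_DELAY = re.compile(r"Project requested delay of (\d+) seconds")
DEFERRAL = re.compile(
    r"\[sched_op\] Deferring communication for (\d+):(\d{2}):(\d{2})"
)


def requested_delay_seconds(line):
    """Return the delay in seconds named in a log line, or None if absent."""
    match = PROJECT_DELAY.search(line)
    if match:
        return int(match.group(1))
    match = DEFERRAL.search(line)
    if match:
        hours, minutes, seconds = (int(part) for part in match.groups())
        return hours * 3600 + minutes * 60 + seconds
    return None
```

A log with neither kind of line after the failed request would be consistent with the client choosing the backoff by itself, although it would not prove it.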

And BTW - what tool are you using to collect those event log snippets? We need to remove the <![CDATA[ wrapper.

Why does a single-line 'code' snippet render differently here (in preview) from a two-line snippet?

Gary Roberts

Richard Haselgrove wrote:
Gary, I saw your report in the boinccmd thread we're working on, and again here.

I felt it was my civic duty to at least put it on the record here in case the Devs have time to investigate.

Richard Haselgrove wrote:

Could you be a little bit more explicit about how the actual 24-hour backoff is activated? If you roll the message log forward, is there a line like

10/12/2018 10:06:16 | Einstein@Home | Project requested delay of 60 seconds

or

10/12/2018 10:06:16 | Einstein@Home | [sched_op] Deferring communication for 00:01:00
10/12/2018 10:06:16 | Einstein@Home | [sched_op] Reason: requested by project

Or does the client choose to make the backoff by itself?

There was no relevant extra info.  The host in question was running V7.6.33 - my self compiled version thereof.  I have no reference point, except for what the older V7.2.42 produces, for knowing whether or not my build is producing all output that it should produce.  I've never been in the habit of customising output with cc_config.xml flags - the default output has always seemed sufficient for my purposes.

Immediately before the snip, there were entries reporting a task finishing and the uploading of results.  Immediately after the snip was the same stuff for the next task to finish.  There were no further reporting events, just a stack of uploads recorded and completed work waiting to be reported when I found the problem.  I can't say for sure whether the scheduler instructed the deferral or the client decided on its own to initiate it.  Surely the smoking gun is the scheduler disowning the platform?  Anyway, that was my assumption.  It can't be a good outcome if the scheduler suddenly complains about the platform for no apparent reason.

Your comments about the CDATA wrapper (whatever that is) lead me to speculate that perhaps the PCLOS specific --devel libs I used for building BOINC may be a bit different in some way and that this may affect the structure (and content) of the log output.  There are definitely no additional messages that give any detail whatsoever about the deferral and which bit of code decided to apply it.

I've built 7.6.33 a couple of times.  The very first time (nearly 2 years ago) was with gcc_4.9.3 (or something like that).  A couple of months later, the PCLOS build tools were upgraded (a gcc_7.x version if I remember correctly) and lots of stuff in the repo got rebuilt - including wxWidgets.  I first had a problem when I couldn't launch BOINC Manager on a new install because of an incompatible ABI between my older manager and the new wxWidgets build.  So I rebuilt BOINC with the new tools and made sure all machines running 7.6.33 were using that rebuilt version.  Later on, I also built 7.9.3 but have never deployed it.  I'll build again at some point but I guess I'll need to investigate why my event log messages are 'different' from what I should perhaps be getting.

Apart from using boinccmd in scripts, the only tool I use when viewing event logs is BOINC Manager, quite often over the LAN.  If I've actually hooked up peripherals to a host to run the local manager, it will always be the same version as the client.  Over the LAN, the manager will be V7.2.42 - what's on the server machine.  I've never noticed any real problems in viewing a 7.6.33 client with a 7.2.42 manager.

When I want to include a snip in a forum message, I use a terminal session over ssh to open the relevant stdoutdae.txt (or .old) file on the LAN machine with the 'less' utility, where I can easily search for what I want and then copy and paste it into the message composition window.  I avoid going to the physical machine wherever possible.  The room they are in is rather hot (and noisy) with high speed forced ventilation using outside air.  Quite OK in winter but hot in summer.  I consider throttling machines if it gets above 38C.  Today it's a pleasant 34C :-).

Richard Haselgrove wrote:
Why does a single-line 'code' snippet render differently here (in preview) from a two-line snippet?

Not only in preview but also in the posted version and I really have no idea why or how to change that.  I feel a bit sorry for people supporting Einstein as well as other traditional BOINC look and feel projects.  The differences between the two and then some of the odd local behaviour seem to make it a bit frustrating for them at times.  I've sort of adapted to here and forgotten about the frustrating bits since I don't have time to do much visiting of other projects.  The only other website I've started using lately is the BOINC one :-).  I believe there are some people doing really good work over there :-).

 

Cheers,
Gary.

Richard Haselgrove

OK, not to worry. The question for this board is certainly the disappearing

Sun 09 Dec 2018 10:25:06 PM EST | Einstein@Home | <![CDATA[platform 'x86_64-pc-linux-gnu' not found]]>

Quick wrap on the other points:

Oliver's helpful certificate expiry test yesterday confirmed that client-generated backoffs show up in the work I'm doing, so your 24-hour case should show as well.

<![CDATA[...]]>

is a protective mechanism that ensures that message contents aren't wrongly interpreted as formatting commands if they contain special characters. BOINC was changed quite recently to add the protection when the message is created by the client, and remove it again before it's displayed by the Manager, boinccmd, or any other interface tool. It sounds as if your newer client is adding the protection, but your older Manager doesn't know it needs to remove it. I'll try to work out the minimum Manager version you need to clean it up.
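Until the viewing tool is new enough to do this itself, the wrapper can be removed by hand before posting a snippet. A minimal sketch, assuming the wrapper always appears in the `<![CDATA[...]]>` form seen in the log lines quoted in this thread; the function name is made up for illustration.

```python
import re

# Matches one CDATA section and captures the text inside it.
CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def strip_cdata(text):
    """Replace every <![CDATA[...]]> section with the text it encloses."""
    return CDATA.sub(lambda match: match.group(1), text)
```

Lines that contain no wrapper pass through unchanged, so the function is safe to run over a whole log file.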

I'm sure the Drupal devs will get round to the layout bug eventually...

Richard Haselgrove

Richard Haselgrove wrote:

<![CDATA[...]]>

is a protective mechanism that ensures that message contents aren't wrongly interpreted as formatting commands if they contain special characters. BOINC was changed quite recently to add the protection when the message is created by the client, and remove it again before it's displayed by the Manager, boinccmd, or any other interface tool. It sounds as if your newer client is adding the protection, but your older Manager doesn't know it needs to remove it. I'll try to work out the minimum Manager version you need to clean it up.

Turns out this fix:

GUI RPC: enclose message bodies in CDATA to avoid XML parse errors for messages containing "<".

first appeared in v7.6.32 - so your v7.6.33 clients are just new enough to send it. But if you look at their event logs using their own Managers, they should be clean.
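To illustrate the parse error that the changelog entry refers to, here is a short sketch using Python's standard XML parser. The `<body>` element and the message text are stand-ins chosen for the example, not a reproduction of the actual GUI RPC reply.

```python
import xml.etree.ElementTree as ET


def parse_body(xml_text):
    """Return the text of the root element, or None if the XML is malformed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


# A bare "<" inside the message makes the document ill-formed.
raw = "<body>value < limit</body>"

# The same message inside a CDATA section parses cleanly.
wrapped = "<body><![CDATA[value < limit]]></body>"

print(parse_body(raw))      # the unprotected message fails to parse
print(parse_body(wrapped))  # the protected message comes back intact
```

A parser that understands CDATA returns the message without the wrapper, which is why a sufficiently new Manager shows clean text. A tool that treats the reply as plain text, as the older Manager appears to, shows the wrapper verbatim.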
