Improvements in the code of the clients

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117711402383
RAC: 35090657

I don't know of anyone else having this problem so my original thoughts may well be quite wrong.  The lack of other reports troubled me as well.

Since I posted my reply, I've been trying to remember more details of exactly what happened when I experienced the same sort of thing.  My memory doesn't work all that well these days :-).

I have this funny feeling that there may have been something else - something that might have caused the server to think that the tasks were missing required files or 'lost' in some way.  I think it was at a time when Bernd had disabled the 'resend lost tasks' mechanism for GW tasks - or something like that.  Instead of replacing what was lost, the server simply created some sort of time limit or deadline exceeded, but I can't recall the precise description.  It was quite a while ago.

Knowing your ability to delve deeply into mysterious happenings, I'm sure you'll track it down :-).

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2959539461
RAC: 704436

Well, I've completed my spreadsheet of server reports, and it's compatible with the hypothesis I want to test - later this afternoon, when I get back from lunch.

I've also thought of another possible test, which I'll try out on the second machine tomorrow - that one's due for security patch maintenance, when a long task from another project has finished.

Watch this space!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2959539461
RAC: 704436

OK everyone - hold on to your hats. I think I have an explanation.

First, some background. This is a reasonably powerful GPU machine (2x GTX 1660 Ti). Serious scientific GPU projects are getting thin on the ground (SETI dormant, GPUGrid erratic, and we won't talk about Collatz). So when WCG announced a GPU COVID-19 project, I jumped in - first as Beta, then, from early May, in full production.

Modern GPUs are fast, and there are a lot of hungry crunchers out there (like me!) looking for work. The Scripps Institute, who provide the science behind the WCG project, have limited capacity both to create new BOINC workunits and to analyse the results returned. So the WCG GPU workflow is rationed: a new batch of workunits is released every 30 minutes or so, and everyone plays 'catch as catch can'.

BOINC isn't designed to work well in that situation - half an hour with no successful work requests is quite enough to put the machine into extended backoff, so you miss the next batch. So the trick is:

1) force the machine to update every five minutes or so (that can be scripted; see the sketch below)
2) ensure that it's low on cache, so it will request work every time.
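
In case it helps anyone, here's a minimal sketch of the kind of thing I mean by "scripted" - not my actual script, and the project URL is just an example. It's a tiny C loop that calls boinccmd every five minutes to force a scheduler contact (it assumes boinccmd is on the PATH and can talk to the local client):

/* Minimal sketch only: force a scheduler RPC every five minutes.
   The project URL below is just an example - substitute your own. */
#include <stdlib.h>   /* system() */
#include <unistd.h>   /* sleep() */

int main(void)
{
    for (;;) {
        /* ask the local BOINC client to contact the project now,
           instead of waiting out its backoff */
        system("boinccmd --project https://www.worldcommunitygrid.org/ update");
        sleep(300);   /* five minutes between forced updates */
    }
    return 0;
}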

So, I've been requesting a large block of work (0.3 days, or 0.55 days) from Einstein two or three times a day, and then suspending some tasks. The machine can work through that block of Einstein work at its own speed, but divert to WCG whenever tasks are caught by the script. That's worked well for three months.

But it means that I periodically make big work requests at Einstein. As I've noted elsewhere (locality scheduler running very slow), a big request for GW work can often take longer to process than Linux is prepared to hold the RPC channel open. No problem: the client asks again a few minutes later, and gets the lost tasks resent, 12 at a time.

Until Thursday.

Following the wish expressed in this thread, we got a new app. This app - not unreasonably - has been marked as 'Beta' until it proves itself. And I think that's what's caused the problem. From my message log today:

Sat 07 Aug 2021 16:22:40 BST | Einstein@Home | [sched_op] NVIDIA GPU work request: 13126.52 seconds; 0.00 devices
Sat 07 Aug 2021 16:23:10 BST | Einstein@Home | Scheduler request completed: got 17 new tasks
Sat 07 Aug 2021 16:23:10 BST | Einstein@Home | [sched_op] estimated total NVIDIA GPU task duration: 13636 seconds
Sat 07 Aug 2021 16:31:21 BST | Einstein@Home | [sched_op] NVIDIA GPU work request: 7567.91 seconds; 0.00 devices
Sat 07 Aug 2021 16:31:48 BST | Einstein@Home | Scheduler request completed: got 10 new tasks
Sat 07 Aug 2021 16:31:48 BST | Einstein@Home | [sched_op] estimated total NVIDIA GPU task duration: 8024 seconds
Sat 07 Aug 2021 17:12:24 BST | Einstein@Home | [sched_op] NVIDIA GPU work request: 8312.02 seconds; 0.00 devices
Sat 07 Aug 2021 17:12:56 BST | Einstein@Home | Scheduler request completed: got 11 new tasks
Sat 07 Aug 2021 17:12:56 BST | Einstein@Home | [sched_op] estimated total NVIDIA GPU task duration: 8827 seconds
Sat 07 Aug 2021 17:24:42 BST | Einstein@Home | [sched_op] NVIDIA GPU work request: 31459.55 seconds; 0.00 devices
Sat 07 Aug 2021 17:25:49 BST | Einstein@Home | Scheduler request failed: Timeout was reached
Sat 07 Aug 2021 17:27:40 BST | Einstein@Home | Sending scheduler request: To fetch work.
Sat 07 Aug 2021 17:27:40 BST | Einstein@Home | [sched_op] NVIDIA GPU work request: 31779.94 seconds; 0.00 devices
Sat 07 Aug 2021 17:27:42 BST | Einstein@Home | Scheduler request completed: got 20 new tasks
Sat 07 Aug 2021 17:27:42 BST | Einstein@Home | Didn't resend lost task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1169_0 (expired) [repeated 35 times]
Sat 07 Aug 2021 17:27:42 BST | Einstein@Home | [sched_op] estimated total NVIDIA GPU task duration: 32525 seconds
Sat 07 Aug 2021 17:27:44 BST | Einstein@Home | Started download of templates_LATeah4011L03_1196_14225065.dat [and so on]

So, the small requests at the top completed fine - I've got the tasks in my cache, and they show as 'In progress' on the website.

But the big request at 17:24 was timed out after a minute, and didn't complete. The GW tasks allocated to that request were not resent; a different set of gamma-ray tasks was sent instead. During the three minutes between the two requests, those GW tasks showed green and 'In progress' on the website: after the second request, they were red and timed out at 16:27:40 UTC. My log is UTC+1, so the timing is exact.

My conclusion is that there's some logic in the scheduler - certainly the special locality scheduler we use here, possibly all BOINC schedulers - to the effect of:

"If a task goes missing,

a) if it's a production task, assume a comms glitch and resend it

b) if it's a Beta task, assume a bug: cancel it, and send something safer"

That logic may have been written up to 15 years ago, and everyone will have forgotten why it was written, or that it even exists. But we should be able to find it in code.
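
To make that hypothesis concrete, here's roughly how I imagine the branch might look - purely an illustrative sketch of the logic I'm describing, not actual BOINC or Einstein@Home server code:

/* Hypothetical sketch of the suspected scheduler behaviour --
   NOT real BOINC or Einstein@Home server code. */
#include <stdbool.h>
#include <stdio.h>

struct lost_task {
    const char *name;      /* task the server thinks the host still holds */
    bool issued_to_beta;   /* was it assigned to a Beta test app version? */
};

static void handle_lost_task(const struct lost_task *t)
{
    if (t->issued_to_beta) {
        /* assume the Beta app is suspect: expire the task, send other work */
        printf("Didn't resend lost task %s (expired)\n", t->name);
    } else {
        /* assume a comms glitch: resend the same task to the host */
        printf("Resending lost task %s\n", t->name);
    }
}

int main(void)
{
    const struct lost_task gw = {
        "h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1169_0", true
    };
    handle_lost_task(&gw);   /* reproduces the 'expired' line from my log */
    return 0;
}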

-------------------

If that conclusion is right, I'm probably the only one to have encountered it. A whole constellation of specific settings needs to coincide (fast machine, large work requests, short HTTP timeouts, Beta work accepted). I can adapt to that. But I found it an interesting cautionary tale - it might explain some odd events we've seen at other projects in the past.

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3960
Credit: 47071292642
RAC: 65414933

I'm not convinced that the new beta GW app has anything to do with the code changes described in this thread. There's less than 3 hours between Bernd saying he'll send it to the devs and a new app appearing. Finding the right bit of code to change, implementing it, testing, compiling, bug fixing, review, approval and deployment all undoubtedly take more than 3-4 hours.
 

Whatever changes have been included in the new beta app have more likely been in development for quite a while, and just happened to land shortly after Bernd's comment.
 

It's probably better if we stop discussing the new GW app in this thread, as it will give people the impression the two are connected when it's more likely they are not. It's going to cause more confusion and distraction than necessary. It might be better for a mod to split the GW beta app discussion off into a new thread.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117711402383
RAC: 35090657

Richard Haselgrove wrote:
.... As I've noted elsewhere (locality scheduler running very slow), a big request for GW work can often take longer to process than Linux is prepared to hold the RPC channel open. No problem: the client asks again a few minutes later, and gets the lost tasks resent, 12 at a time.

This bit triggered a memory.

There was a very similar situation back with a test version of an early GW GPU app.  Requesting even moderate numbers of test tasks took an inordinate amount of time to complete, and the client timing out and dropping the connection became a regular event, with the lost tasks then being resent (or in some cases dropped completely) by the scheduler.

I did a bit of a hunt and found this comment from Bernd, responding to a similar situation, where he stated:-

Quote:
I now changed the scheduler such that it will "expire" a task when there is no app version to process it.

Later in the same thread, he mentioned:-

Quote:
The more I think about it, the more it seems to me that expiring (i.e. dropping) "lost" tasks that can't be processed by the client is the right thing to do and was simply  neglected when the "resend lost tasks" feature was first implemented.

It seems that having a test app might be the trigger for the scheduler to expire, rather than resend, the lost tasks that its own slowness created.

I'm wondering if whatever local changes Bernd made to the server code at Einstein might be the cause of what you see happening now with the current test app, rather than something you might find by trawling through the official BOINC code.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2959539461
RAC: 704436

I think there's more to it than that. We used to see these extremely short 'deadline' values at SETI@Home (RIP), and puzzle over them. But they go back much further than Bernd's comment dated November 2019.

The fundamental design of BOINC dates back to a period when dial-up modems (56K?) were common, and some of its hard-coded timing constants date from that era. It's also a fundamental design decision that the client handles only one project RPC at a time: once a scheduler request has been launched at a project, no other project can be contacted until that one has replied or the connection has timed out. That time has to cover establishing the connection, transferring a potentially large request file, processing on the server, and receiving the reply file.

Bernd's comment also impinges on another common factor in BOINC programming: it responds to events, but doesn't consider the reason that led to the event. Bernd's comment refers to expiring a task when "there is no app version to process it". I had the app version: what I didn't have was the time. But both failure modes were treated the same.

There is a user-controlled option that comes into play here:

<http_transfer_timeout>seconds</http_transfer_timeout>

Abort HTTP transfers if idle for this many seconds; default 300.

I'm now on a 70 Mbit fibre connection, so file transfer times are negligible, and I'd changed that option to 60 seconds. But a fast connection doesn't reduce the server processing time at the other end. I've reverted that change, and I'll monitor the results.
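
For reference, on my machines that option lives in cc_config.xml in the BOINC data directory, inside the <options> section - a minimal example with the default value I've gone back to:

<cc_config>
   <options>
      <http_transfer_timeout>300</http_transfer_timeout>
   </options>
</cc_config>

The client picks the change up after a 'Read config files' from the Manager (or boinccmd --read_cc_config), or on restart.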

OK, moan over. I'll return you to your regularly scheduled programming.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250601258
RAC: 34581

1. The 1.01 GW (O3AS) App versions have other optimizations in the GPU code, not the one suggested here. They should, however, be significantly faster.

2. The 1.01 App versions are still in "Beta test". At E@H, results of Beta test app versions are always validated against / compared to those of established (official) app versions.

3. Richard, your problem seems to be on the server side and is still a mystery to me. It doesn't look related to the 1.01 app versions.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250601258
RAC: 34581

petri33 wrote:

Twiddle dee is defined as __constant float2[3][256] and it is accessed from a different location by each thread, resulting in serialized access on the nvidia hardware. SLOW!

Please replace the word __constant with global.

See: https://einsteinathome.org/fi/workunit/565663876

Regarding that: I'm not that tightly involved in GPU coding these days, maybe you could help me get up to speed:

- what is "twiddle dee" (except for AiW)? I suspect that's not related to our applications?

- there is no __constant in our FGRP code (as you can see for yourself, the OpenCL code is embedded in the app binary as text)

- There is only one __constant in the GW code: __constant REAL8 LAL_FACT_INV[21]. Do you expect replacing that to have a noticeable effect? If so, I could probably build an app for testing once the 1.01 is out of Beta test.

 

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2959539461
RAC: 704436

Bernd Machenschalk wrote:

3. Richard, your problem seems to be on the server side and is still a mystery to me. It doesn't look related to the 1.01 app versions.

tl;dr

My problem started immediately after the v1.01 app was deployed on the server.

I believe it arises when the server can't see the tasks it believes it has issued to the same host in the previous contact.

a) if the tasks were assigned to a production app (v1.00), they are resent
b) if the tasks were assigned to a Beta test app (v1.01), they are cancelled

The reason for non-acceptance by the client is not considered.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3960
Credit: 47071292642
RAC: 65414933

Bernd Machenschalk wrote:

petri33 wrote:

Twiddle dee is defined as __constant float2[3][256] and it is accessed from a different location by each thread, resulting in serialized access on the nvidia hardware. SLOW!

Please replace the word __constant with global.

See: https://einsteinathome.org/fi/workunit/565663876

Regarding that: I'm not that tightly involved in GPU coding these days, maybe you could help me get up to speed:

- what is "twiddle dee" (except for AiW)? I suspect that's not related to our applications?

- there is no __constant in our FGRP code (as you can see for yourself, the OpenCL code is embedded in the app binary as text)

- There is only one __constant in the GW code: __constant REAL8 LAL_FACT_INV[21]. Do you expect replacing that to have a noticeable effect? If so, I could probably build an app for testing once the 1.01 is out of Beta test.

 

I'll quote petri from another thread here: https://einsteinathome.org/content/will-seti-ever-return?page=1#comment-188110

 

petri33 wrote:

Twiddle_dee is a small table of constant values in the code that is used in FFT calculations (to bring out repeating patterns of signals).

 

On NVIDIA constant values placed in __constant memory area should be accessed simultaneously at one memory address by all GPU threads (2000-5000 of them).

 

Einstein software fetches the twiddle_dee values almost randomly. So a performance penalty occurs: The nearly instantaneous parallel fetch of the value is turned into a sequential one. Not one clock cycle, but thousands of clock cycles: all computing waits for the last read from memory to be able to continue.

 

The FIX is to define twiddle_dee (a small buffer of constants) to reside in global memory and thus not be fetched through the constant cache sequentially. The read from global memory can benefit from caching and nearby access of values fetched by some earlier read: a read of one address fills the cache with more values ahead. So: global memory can be served to the threads from 'random' addresses much faster.

 

so it probably depends on how you are populating the array and where you are fetching data from.

 

the change from "__constant float2 twiddle_dee" to "__global float2 twiddle_dee" in the FGRPB1G app (v1.20 nvidiaTV) has massive performance improvements for nvidia. the app processes literally 2x faster in some cases (like my 3080Ti). Turing cards usually see about a 65% speed improvement, Pascal 40-50%.

 

using a hex editor I can see twiddle_dee being defined as constant in all versions of the FGRPB1G apps as well as the O3AS GW app. it's definitely in there.
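
to illustrate the sort of change petri is describing, here's a stripped-down OpenCL sketch - not the actual Einstein kernels, just the access pattern in question. each work-item reads the twiddle table at its own index, which serialises the __constant broadcast path on nvidia, but can be served through the normal data caches when the same table sits in __global memory:

/* illustrative sketch only -- not the FGRPB1G or O3AS source.
   both kernels do the same work; only the address space of the
   twiddle table differs. */

__kernel void apply_twiddle_constant(__constant float2 *twiddle_dee,
                                     __global const int *idx,
                                     __global float2 *out)
{
    size_t gid = get_global_id(0);
    /* every work-item hits a different address: the nvidia constant
       cache broadcasts one address at a time, so these reads serialise */
    out[gid] = twiddle_dee[idx[gid] % 256];
}

__kernel void apply_twiddle_global(__global const float2 *twiddle_dee,
                                   __global const int *idx,
                                   __global float2 *out)
{
    size_t gid = get_global_id(0);
    /* same scattered reads, but now served via the ordinary L1/L2
       caches, so neighbouring values come along for free */
    out[gid] = twiddle_dee[idx[gid] % 256];
}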

