Upload trouble 12/29/18

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7277595049
RAC: 1947932
Topic 217742

All three of my hosts have started to build up upload failures.

I see a mix of 'transient HTTP error' and 'project communication failed' messages.

Such as these lines:

 

21546 Einstein@Home 12/29/2018 5:16:42 PM Started upload of LATeah2005L_412.0_0_0.0_21220_1_1
21547 Einstein@Home 12/29/2018 5:17:43 PM Temporarily failed upload of LATeah2005L_412.0_0_0.0_21220_1_0: transient HTTP error
21548 Einstein@Home 12/29/2018 5:17:43 PM Backing off 00:02:04 on upload of LATeah2005L_412.0_0_0.0_21220_1_0
21549 Einstein@Home 12/29/2018 5:17:43 PM Temporarily failed upload of LATeah2005L_412.0_0_0.0_21220_1_1: transient HTTP error
21550 Einstein@Home 12/29/2018 5:17:43 PM Backing off 00:03:48 on upload of LATeah2005L_412.0_0_0.0_21220_1_1
21551 Einstein@Home 12/29/2018 5:20:51 PM Started upload of LATeah2005L_420.0_0_0.0_321483_1_0
21552 Einstein@Home 12/29/2018 5:20:51 PM Started upload of LATeah2005L_420.0_0_0.0_321483_1_1
21553 12/29/2018 5:20:52 PM Project communication failed: attempting access to reference site

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1611846791
RAC: 708185

Yep, uploads are failing mightily 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118426888496
RAC: 25913864

Please try sending an email to eah_admin(at)einsteinathome.org.  I don't have email access right now.

The scheduler is responding normally so it must just be the upload server.  Once there are enough uploads backed up, the client will stop requesting work (I believe).  I think I remember seeing something like 8 uploads in progress as being the magic number to stop work requests.

 

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7277595049
RAC: 1947932

Gary Roberts wrote:
Please try sending an email to eah_admin(at)einsteinathome.org.

Done.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118426888496
RAC: 25913864

Out of necessity, I just learned something new (for me).

A monitoring script I run regularly reported a host with a crashed GPU (in amongst all the reports of last RPC contacts being too long ago) :-).  No big deal, it happens occasionally.  So I restarted the machine and noticed just 5 tasks stuck in uploads from the current problem.  However there were unusually few 'ready to start' tasks left in the cache (maybe 3 hrs' worth) so I decided to do a quick top-up whilst the number of uploads was relatively small.  The response I got was that there were already too many stuck uploads so no new work.

I still haven't worked out why there was so little work on board (there should have been a day's worth) but I decided to read the documentation with a view to seeing if the number of uploads that would prevent new work was configurable.  I found two possible tags - <max_file_xfers> (default=8) and <max_file_xfers_per_project> (default=2).

As time was quite short, I whipped up a cc_config.xml file with those two tags added with the values of 32 and 12 respectively.  After 're-reading config files' in BOINC Manager, I was immediately able to download a bunch of work to bring the cache up to over a day.
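For anyone wanting to try the same thing, a cc_config.xml along the lines Gary describes might look like this (the tag names and defaults are from the BOINC client configuration documentation; the values 32 and 12 are simply the ones he used):

```xml
<cc_config>
  <options>
    <!-- Overall simultaneous file transfer limit (BOINC default: 8) -->
    <max_file_xfers>32</max_file_xfers>
    <!-- Per-project simultaneous transfer limit (BOINC default: 2) -->
    <max_file_xfers_per_project>12</max_file_xfers_per_project>
  </options>
</cc_config>
```

The file goes in the BOINC data directory, and 're-read config files' in BOINC Manager should pick it up without a restart.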

I don't know which particular tag did the trick.  I suspect the first has to be bigger than the second but the second is probably the important one when just one project has upload problems.  I might take the opportunity to experiment with the numbers a bit more while the problem exists.

 

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2982030545
RAC: 753776

When emailing Bernd, it's helpful to do a little bit more diagnosis to help him find exactly which part of the system has broken, and in what way.

I'm getting

30/12/2018 09:05:29 | Einstein@Home | [http] [ID#811] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
30/12/2018 09:06:29 | Einstein@Home | [http] [ID#811] Received header from server: HTTP/1.1 504 Gateway Time-out

'file_upload_handler_medium' and '504 Gateway Time-out' are both exactly the same symptoms as the failures on 21 November and 21 December, when the upload server lost communications with what Bernd described as the web server. He was going to try and script an automated recovery, but I guess the holidays got in the way.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2982030545
RAC: 753776

Before anybody else tries Gary's workaround (I just did), the actual file upload limit is defined as

- client: define "too many uploads" (for work fetch) as 2 * max(ncpus, ngpus);

show this in the state displayed by <work_fetch_debug>

(from https://github.com/BOINC/boinc/commit/26114920fea508d44a8a0561afd71766799b4bf4)

With 45 uploads waiting on this machine, I haven't yet tried falsifying ncpus, but I might later....

Unfortunately, changing the file transfer limit doesn't hack it.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2982030545
RAC: 753776

On the other hand, setting <ncpus>24</ncpus> in cc_config.xml did work. I think this might be one where you have to do a full client restart, rather than just 'Read config files' - but report back if you find different.
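For reference, the file Richard describes would look something like this (24 is the value he mentions; as he notes, it may need a full client restart to take effect):

```xml
<cc_config>
  <options>
    <!-- Report 24 CPUs to the client so the work-fetch threshold
         of 2 * max(ncpus, ngpus) rises to 48 stuck uploads -->
    <ncpus>24</ncpus>
  </options>
</cc_config>
```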

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2982030545
RAC: 753776

Uploads are working again - you may need to retry one to kick-start the process.

Bernd says that this was an automated restart, put into place after the previous problems, so it's less important to report it manually if it happens again.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118426888496
RAC: 25913864

Richard Haselgrove wrote:

... the actual file upload limit is defined as

- client: define "too many uploads" (for work fetch) as 2 * max(ncpus, ngpus);

Thanks for that.  When I installed a cc_config.xml on that machine that had little work left, I grabbed an existing file I'd used to simulate 8 CPUs for the last time we had issues with very fast running tasks eating up the daily quota.  I just installed the extra two tags without removing the <ncpus> line.

It worked immediately with just 're-read config files' - no client restart needed.  It didn't occur to me that the success had anything to do with the <ncpus> line rather than either of the extra tags I'd added.  I was pretty desperate to try something quickly before any more upload failures occurred on that machine, so I wasn't thinking very clearly :-).

I'm not suggesting I would have made the connection, even with a lot more time to ponder it :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118426888496
RAC: 25913864

Richard Haselgrove wrote:
Uploads are working again - you may need to retry one to kick-start the process.

By the time uploads started working, I was already in bed.  With the multi-hour back-offs, and with no further work finishing that might trigger a new upload attempt, hosts that were 'out' stayed out a lot longer than necessary.

Quote:
Bernd says that this was an automated restart, put into place after the previous problems, so it's less important to report it manually if it happens again.

Why does a failure like this have to continue for so long if there is an 'automated restart' mechanism in place?

 

Cheers,
Gary.
