Uploads disabled

bk.newton09
Joined: 29 Jul 13
Posts: 11
Credit: 91411111
RAC: 0

I apologize for my earlier,

I apologize for my earlier, inaccurate comment. What has been delayed is not uploading but rather validations of completed tasks. Until this weekend I had one from late December still in that queue. Now the oldest pair are from 13-Jan, so not that long.

Again, apologies for the mistake.
Brian

Maximilian Mieth
Joined: 4 Oct 12
Posts: 130
Credit: 10279241
RAC: 4074

RE: I apologize for my

Quote:
I apologize for my earlier, inaccurate comment. What has been delayed is not uploading but rather validations of completed tasks. Until this weekend I had one from late December still in that queue. Now the oldest pair are from 13-Jan, so not that long.


In both cases (207114165 and 207582577) the reason was that your wingmen did not deliver on time, so the task had to be sent out again to a third cruncher. That is not related to the problem discussed in this thread.

AllparDave
Joined: 7 Jan 15
Posts: 8
Credit: 171011
RAC: 0

Just a quick note since I

Just a quick note since I posted about issues before: all resolved. Thanks and congrats, good job all.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250400550
RAC: 34744

Discussion of "report"

Discussion of "report" problem moved to a separate thread, as this had nothing to do with the upload issues / outage.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250400550
RAC: 34744

Here's a short history of

Here's a short history of everything:

- einstein4 was set up about a year ago to take over the handling of result files (upload, validation, assimilation, archival). It has sixteen 2 TB HDs in a RAID10 configuration, for maximum IOPS.

- At that time the FCGI version of BOINC's file upload handler was tested; however, under reasonable load it locked up with a "futex deadlock" in the kernel. We never had time to find out whether the root cause was in the file upload handler or in the Linux kernel that we were using back then. (Almost) all E@H machines @AEI run nginx as the web server for performance reasons. As nginx doesn't handle old-fashioned CGI, we configured fcgiwrap around the CGI version of the file upload handler.
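
Roughly, such a setup looks like this (a minimal sketch; the location path, socket name and binary path are illustrative, not the actual configuration): nginx forwards upload requests over FastCGI to an fcgiwrap socket, and fcgiwrap executes the plain CGI binary.

    # nginx server block (sketch; paths are illustrative)
    location /EinsteinAtHome_cgi/file_upload_handler {
        include        fastcgi_params;
        fastcgi_param  SCRIPT_FILENAME /opt/boinc/cgi-bin/file_upload_handler;  # the CGI binary
        fastcgi_pass   unix:/var/run/fcgiwrap.socket;                           # fcgiwrap listens here
    }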

- Since it was set up, einstein4 has been collecting the results of all E@H searches. Results got archived daily, but never actually got deleted on einstein4. When we started the S6BucketFU1UB search it became apparent that the results of that search would not fit in the remaining free space on the server. So in December we started a verification run of the archives of the old results of completed searches (FGRP3, S6BucketLVE, S6CasA), to run over the holidays, just to make sure that we could safely delete the original result files from the server.

- Around the holidays (before and after) we were quite busy supplying "work" for E@H, particularly for the CPUs. Due to an unexpected boost of computing power over the holidays (apparently right after Xmas) we ran out first of FGRP4 and then of S6BucketFU1UB work, so we fell back to enabling the BRP4 CPU app versions again. Unfortunately BRP4 produces significantly more result file volume than FGRP4. All of that meant that the data partition on einstein4 began filling up much faster than expected over the holidays.

- The (Nagios) monitoring configuration on einstein4 was identical to that of our other E@H servers @AEI. However, as we found out later, the disk volume monitoring was bound to device nodes, which occasionally differ between machines, depending e.g. on the configuration of the RAID controller and on whether the machine has an additional DOM for the OS. Therefore we didn't get an early warning when the data partition of einstein4 ran full.

- I was actually sitting at a computer at home when the filesystem ran full. I immediately stopped uploads, took another look at the archive verification and started to delete old result files (FGRP3, S6CasA) to free up some space again.

- Although we freed up 15% of the disk space, the filesystem performance was still pretty bad. We turned off basically everything else that would read or write to this filesystem, except for the file upload handler. It still didn't get any better. About 40% of all upload requests got through; 60% were rejected or timed out. Creating a single new file took 8 s.

- As described earlier, the root cause seemed to be the free inode management of XFS. This could only be changed by rebuilding the filesystem. The data fragmentation was negligible, and apparently there is no way of defragmenting the "directory" / inode btree.

- We don't have any "hot spare" servers @AEI. However, a handful of machines all have basically identical hardware (except for the RAID setup / disk sizes) and a uniform software configuration, so any one of them could take over the role of any other with rather little configuration work. We decided to shift the task of einstein3 (serving BRP4G Arecibo data files) over to einstein1, which was rather bored with serving BRP5, and to set up einstein3 as the new upload / result handling server.

- The first thing we actually fixed was the configuration of the disk monitoring, which is now based on mountpoints / directories instead of device nodes.
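
The change amounts to something like this (a sketch with made-up thresholds, device and mountpoint names; the standard check_disk plugin is assumed):

    # old: check bound to a device node, which is not the same on every machine
    check_disk -w 10% -c 5% -p /dev/sdb1
    # new: check bound to the mountpoint, which is identical on all hosts
    check_disk -w 10% -c 5% -p /data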

- To make use of the new "free inode btree" of XFS we needed to compile a recent kernel (and rebuild the filesystem). So in parallel all data present on einstein3 was shifted elsewhere (to a backup server or straight to einstein1) and the new kernel was compiled. We also built the latest version of the BOINC file upload handler (both CGI and FCGI).

- After that was done, the filesystem of einstein3 was rebuilt, using the options required for the "free inode btree". Then the data that needed to be on that machine (upload & download) was copied (back) from einstein4 and the einstein3 backup.
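
For reference, the free inode btree is a mkfs-time option (device name and mountpoint illustrative; needs a sufficiently recent xfsprogs and kernel):

    mkfs.xfs -m finobt=1 /dev/md0     # rebuild the filesystem with the free inode btree enabled
    xfs_info /data | grep finobt      # afterwards, verify that finobt=1 shows up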

- Copying data back and forth was slowed down by technical and human errors: network problems to the backup server, and a misunderstanding of rsync options (by default rsync overwrites newer files on the destination with older versions from the source; --update is _not_ part of the options bundled in -a).
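
To illustrate the rsync pitfall (directory names made up):

    rsync -a src/ dst/        # -a preserves attributes, but a newer file on dst/ still gets overwritten
    rsync -au src/ dst/       # -u / --update skips files that are newer on the destination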

- File upload was enabled. We gave the FCGI version another try, and so far it works quite well. Four instances are enough to - in the extreme case - max out the filesystem.
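
As a sketch of how such a FastCGI setup can be run (socket path, instance count and binary path are illustrative assumptions): spawn a fixed number of handler processes and point nginx at their socket.

    # start four FastCGI instances of the upload handler
    spawn-fcgi -F 4 -s /var/run/file_upload_handler.sock -- /opt/boinc/bin/file_upload_handler_fcgi

    # nginx: hand upload requests to those instances
    location /EinsteinAtHome_cgi/file_upload_handler {
        include       fastcgi_params;
        fastcgi_pass  unix:/var/run/file_upload_handler.sock;
    }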

- One daemon after the other was enabled again, and so, step by step, the main project Einstein@Home was brought back up.

- Only then we took care of our test project Albert@Home. The parts of that project that had previously been running on einstein4 also had to be moved to einstein3. However, there was much less data to be moved, so this went reasonably fast.

- We are currently running a full backup of the data of einstein4 to the backup server. For some reason that is still not fully understood, even reading from the filesystem seems pretty slow. We are transferring data at 12 MB/s peak, and the einstein4 filesystem seems to be 100% utilized. At this speed, backing up ~16 TB of data will take a couple of weeks.
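
As a rough sanity check of that estimate (assuming decimal units): 16 TB / 12 MB/s ≈ 1.3 × 10^6 s ≈ 15 days, so indeed a couple of weeks if the transfer rate doesn't improve.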

BM

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171376
RAC: 43

RE: - Only then we took

Quote:
- Only then we took care of our test project Albert@Home. The parts of that project that had previously been running on einstein4 also had to be moved to einstein3. However, there was much less data to be moved, so this went reasonably fast.

One addition to this: moving Albert@Home to another server in Hannover required a reconfiguration of the VPN tunnel to the main project server (albert) in Milwaukee. einstein4 still had a dedicated tunnel to albert, but since we had already moved to a centralized VPN gateway for einstein, we decided to use that one for albert as well. Therefore we had to reconfigure network routes, firewalls and monitoring configurations on various hosts, as well as albert's database setup.

Cheers,
Oliver

Einstein@Home Project

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315666989
RAC: 329359

The pictorial version

The pictorial version :

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Michael Hoffmann
Joined: 31 Oct 10
Posts: 32
Credit: 31031260
RAC: 0

RE: RE: time for the

Quote:
Quote:
time for the well-earned Feierabendbier ;)

Darn, forgot about this one! Oh well, there's still a lot to do...

I'd gladly send you a crate of your choice, as an expression of appreciation for the recent work. One needs soul food once in a while ;)

Om mani padme hum.

Tom*
Joined: 9 Oct 11
Posts: 54
Credit: 366729484
RAC: 0

Black Ice ???

Black Ice ???

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315666989
RAC: 329359

Black ice at the Battle of

Black ice at the Battle of Hastings? Yup, I'd buy that. :-)

Cheers, Mike.

(edit) I must apologise: I wasn't aware that the Bayeux Tapestries depicted horses' willies.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
