Upload failures (FGRPB1G) after moving the slot directory to a RAMdisk

bozz4science
bozz4science
Joined: 4 May 20
Posts: 15
Credit: 70003923
RAC: 2947
Topic 225340

Dear all,

lately I have been accumulating a lot of invalid tasks and have finally identified the culprit. Apparently, E@H can't find (possibly due to missing access rights) the output files of completed FGRPB1G WUs. The tasks were crunched on a system that had been crunching reliably before I moved the BOINC slot directory to a RAMdisk using ImDisk on Win10. As this is the only setting that changed recently, I reckon the move to the RAMdisk is very likely causing these upload failures.

Please see below for an example stderr file output: Task 1113890663

Excerpt:

read_checkpoint(): Couldn't open file 'LATeah3011L00_244.0_0_0.0_5223186_1_0.out.cpt': No such file or directory (2)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>LATeah3011L00_244.0_0_0.0_5223186_1_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>LATeah3011L00_244.0_0_0.0_5223186_1_1</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>

Is there any way of fixing this issue besides reverting to the previous setup, i.e. removing the RAMdisk and letting BOINC compute on my SSD?

Would highly appreciate any hint! Thx :)

bozz4science
bozz4science
Joined: 4 May 20
Posts: 15
Credit: 70003923
RAC: 2947

Still have this issue. The tool I am using is called ImDisk, which works wonderfully. The setup process is to rename the BOINC/slots folder to something else and create a new folder named slots in its place; that new folder is then linked to the RAM disk, and the contents of the original slot directory are carried over to it. I never had a problem running this setup with any project except E@H. I even removed the RAM disk several times and recreated it with different sizes, to rule out that it was set up too small, and I removed and reattached the project in the BOINC manager several times, all without success.

What really confuses me is that the O3 GW All Sky WUs work just fine. They finish, upload, and finally get validated and credited. All Gamma Ray #1 WUs compute successfully but somehow can't be uploaded. Is there somewhere I can check whether the output files get stuck, or whether they even exist in said directory?

Would love to hear your input on this!

Keith Myers
Keith Myers
Joined: 11 Feb 11
Posts: 5044
Credit: 19047822270
RAC: 6534459

Turn on http_xfer_debug in the Event Log Options.
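For reference, that flag can also be set persistently via cc_config.xml in the BOINC data directory, which is what the Event Log Options dialog writes out. A minimal fragment (file_xfer_debug added here as an optional extra):

```xml
<cc_config>
  <log_flags>
    <!-- log each HTTP file transfer the client starts and finishes -->
    <http_xfer_debug>1</http_xfer_debug>
    <!-- optional: also log file-level transfer decisions -->
    <file_xfer_debug>1</file_xfer_debug>
  </log_flags>
</cc_config>
```

The client re-reads this file from the manager's "Read config files" menu option, so no restart is needed.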

bozz4science
bozz4science
Joined: 4 May 20
Posts: 15
Credit: 70003923
RAC: 2947

Thanks for the advice. Will try this next!

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2992812800
RAC: 709466

BOINC normally does all read/write operations to the slot directory.

At the end of a task run, the result is uploaded from the project directory. Somehow, it has to move between the two directories.

I can think of three scenarios:

1) Output file in the slot directory is a soft link, data is written directly into the project directory by redirection.
2) Output file is written to the slot directory, copied when ready to the project directory.
3) Output file is written to the slot directory, renamed to an absolute path/name in the project directory.

1 and 2 should work under all circumstances.

3 will work under a normal BOINC installation - both slots and project directories are sub-folders of the same data directory, and are mounted on the same device. Renaming will work.

But 3 may fail in the arrangement you describe: you can't move a file by renaming it across devices.
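That cross-device failure can be sketched in a few lines. On POSIX systems, os.rename() raises EXDEV ("Invalid cross-device link") when source and destination live on different filesystems, and Windows behaves analogously; the usual workaround is a copy-then-delete fallback. This is a hypothetical illustration, not BOINC's actual code:

```python
import os
import shutil

def safe_move(src, dst):
    """Move a file, falling back to copy+delete when a plain
    rename fails because src and dst sit on different devices
    (e.g. a RAM-disk slots folder and an SSD projects folder)."""
    try:
        os.rename(src, dst)    # atomic, but same-device only
    except OSError:
        shutil.move(src, dst)  # copies across devices, then removes src
```

If BOINC's scenario 3 used a bare rename, this is exactly the step that would break once slots and projects are on different devices.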

You probably need to isolate a matched pair of <workunit> and <result> specifications from client_state.xml, and take them away to a clean room for examination. Look for a difference in the <file_ref> syntax between FGRPB1G tasks and O3 GW All Sky tasks.
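As a sketch of that "clean room" comparison: assuming the <result> blocks in client_state.xml contain <file_ref> children with a <file_name> tag (as in the snippets quoted later in this thread — check your own file, the exact layout is an assumption here), a few lines of Python can pull out the output file names per result for side-by-side inspection:

```python
import xml.etree.ElementTree as ET

def result_output_files(client_state_xml):
    """Map each result name to its declared output file names,
    given the text of a client_state.xml file. Assumes <result>
    elements hold <file_ref><file_name>...</file_name></file_ref>."""
    root = ET.fromstring(client_state_xml)
    out = {}
    for result in root.iter("result"):
        name = result.findtext("name")
        out[name] = [fr.findtext("file_name")
                     for fr in result.iter("file_ref")]
    return out
```

Running this over a copy of client_state.xml and diffing the FGRPB1G entries against the O3 GW All Sky entries would show any difference in the <file_ref> syntax.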

bozz4science
bozz4science
Joined: 4 May 20
Posts: 15
Credit: 70003923
RAC: 2947

That’s a whole lot of useful information. It’s always great to learn how BOINC really works. Keith’s advice prompted me to try again today, after many unsuccessful attempts since May, and running a test task with the new debug flag set in the manager somehow resulted in a successful upload. It’s now pending, so I will try to run a few more. Really weird, as this is the first FGRPB1G task that didn’t result in said error.
Thanks to the debug output I could see that the output file was written successfully and subsequently uploaded just fine.

Next, I will start digging into Richard’s advice should the issue persist. Thanks for sharing your thoughts! Very helpful stuff for investigating my issue further.

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2992812800
RAC: 709466

Decided to test my theory by running the experiment on my own machine...

For FGRPB1G tasks, there are two output file references, both of the form

    <file_ref>
        <file_name>LATeah4012L00_764.0_0_0.0_19134020_1_0</file_name>
    </file_ref>

The actual content of that file is

<soft_link>../../projects/einstein.phys.uwm.edu/LATeah4012L00_764.0_0_0.0_19134020_1_0</soft_link>

Nothing is written into that file until the end of the run, when

D:\BOINCdata\projects\einstein.phys.uwm.edu\LATeah4012L00_764.0_0_0.0_19134020_1_0

pops into existence, gets uploaded, and vanishes again. This test was carried out under Windows, and my entire BOINC data folder tree resides on a secondary drive D:

So, that's my case (1), and it may fail on your machine because the softlink is an address relative to the working (slot) directory.

For an O3AS task, the output file references are like

    <file_ref>
        <file_name>h1_0146.80_O3aC01Cl1In0__O3AS1_147.00Hz_418_1_0</file_name>
        <open_name>GCT.out</open_name>
    </file_ref>

    <file_ref>
        <file_name>h1_0146.80_O3aC01Cl1In0__O3AS1_147.00Hz_418_1_2</file_name>
        <open_name>GCT.timing</open_name>
        <optional/>
    </file_ref>

(there's a third, for stderr.gz, but it appears not to be used). The two I've quoted are again softlinks:

<soft_link>../../projects/einstein.phys.uwm.edu/h1_0146.80_O3aC01Cl1In0__O3AS1_147.00Hz_326_1_0</soft_link>
<soft_link>../../projects/einstein.phys.uwm.edu/h1_0146.80_O3aC01Cl1In0__O3AS1_147.00Hz_326_1_2</soft_link>

Again, those are relative paths, so they should behave the same. I don't know how the extra <open_name> parameter affects things, but it might force an implicit <copy_file/>
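As an illustration of how such a relative soft link gets resolved: the "link" in the slot directory is a tiny file containing a <soft_link> tag, and its path is interpreted relative to the slot directory the task runs in. A sketch (the helper name is made up, this is not BOINC's code):

```python
import os
import re

def resolve_soft_link(slot_dir, link_file):
    """Read a BOINC-style soft-link file from a slot directory and
    return the absolute path it points at. The file body looks like
    <soft_link>../../projects/.../output_name</soft_link>."""
    with open(os.path.join(slot_dir, link_file)) as f:
        m = re.search(r"<soft_link>(.*?)</soft_link>", f.read())
    # resolve the relative path against the slot directory
    return os.path.normpath(os.path.join(slot_dir, m.group(1)))
```

Since the resolution is purely path arithmetic, a ../../projects/... link written from a RAM-disk slots folder would point outside the RAM disk entirely, which is why the slot and project directories ending up on different devices matters.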

bozz4science
bozz4science
Joined: 4 May 20
Posts: 15
Credit: 70003923
RAC: 2947

Thanks, Richard, for running the detailed test and reporting your results back to me. I still don't quite get why the Gamma-ray pulsar search #1 GPU tasks are suddenly running successfully again on my setup, but maybe it somehow had to do with the resizing of the RAM disk. Everything else was unchanged since the error first popped up on my host back in May, and it kept recurring until a few days ago. Your technical explanation is definitely enlightening and a great starting point for investigating any future issues I might encounter running BOINC on a RAM disk.

Seeing that the O3AS and FGRPB1G applications handle result file generation and the upload process differently is a key indicator that this is indeed what initially caused the issue for me. I will keep running FGRPB1G tasks to see if I run into this problem again. Thanks for your great advice!

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118711913545
RAC: 20223040

bozz4science wrote:
.... my setup in running BOINC on a RAM disk.

I'm confused by this.  With just the slot directories in RAM, I wouldn't have thought you could say you're running the whole of BOINC in RAM.  Lots of files in the slots are links to the real files stored somewhere else on disk.  If stuff needs to be transferred to/from those files, won't the transfers be at disk speeds, not RAM speeds anyway?

Have you observed any measurable change in output by just putting the slots in RAM?  I imagine it would be tiny and therefore difficult to measure because crunch times do vary quite a bit from task to task, particularly with GW tasks.

Cheers,
Gary.

Ian&Steve C.
Ian&Steve C.
Joined: 19 Jan 20
Posts: 4080
Credit: 48686166255
RAC: 34856437

Gary Roberts wrote:

bozz4science wrote:
.... my setup in running BOINC on a RAM disk.

I'm confused by this.  With just the slot directories in RAM, I wouldn't have thought you could say you're running the whole of BOINC in RAM.  Lots of files in the slots are links to the real files stored somewhere else on disk.  If stuff needs to be transferred to/from those files, won't the transfers be at disk speeds, not RAM speeds anyway?

Have you observed any measurable change in output by just putting the slots in RAM?  I imagine it would be tiny and therefore difficult to measure because crunch times do vary quite a bit from task to task, particularly with GW tasks.


He's probably doing it for WCG OpenPandemics GPU work. Their app is super green and performs an insane number of disk writes. Several users switched to running the slots directory in RAM to mitigate the writes to disk (it works, I've done it), with the caveat that a sudden power loss will probably corrupt something. But if the devs won't change the app, users do this to preserve the life of their SSDs.


bozz4science
bozz4science
Joined: 4 May 20
Posts: 15
Credit: 70003923
RAC: 2947

You are both right! 

First of all, I do it mainly for the reason Ian pointed out. The OpenPandemics GPU version on WCG is not very efficient in terms of disk I/O and resulted in 3-5 TB (with a T) of data written to the disk per DAY. I only recognised that late in the stress test they rolled out a while back, and I decided to put the slots directory on a RAM disk because most of the I/O was happening within the slot directory. I have been running fine with that setup ever since. I didn't switch to the RAM disk approach for potential performance gains, nor have I ever systematically compared runtimes. My personal impression, however, is that the tasks run more smoothly, and my SSD is left alone and provides better responsiveness whenever I need it for demanding tasks during the day.

I accept the risk that some work would be lost if my computer were to crash suddenly, e.g. in a power outage, as Ian pointed out. And you are definitely right, Gary, that you can't call this approach running BOINC completely on a RAM disk, but with 32 GB of RAM I simply don't have the space to put it all in RAM and still leave a bit of free space on top.
