Compute Error: All-Sky Gravitational Wave search on O3 v1.07 (GW-opencl-nvidia-2)

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 380
Credit: 9704273455
RAC: 22665621
Topic 231044

I'm getting many computation errors.  Please see

     https://einsteinathome.org/task/1606103689     

It says 

   ABORT: XLAL call failed

which doesn't help me much.

Thanks

S-F-V

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3927
Credit: 45681612642
RAC: 63974379

in your stderr:

Quote:
ERROR: could not parse line 1114397 in skyGrid-file '../../projects/einstein.phys.uwm.edu/O3ASHF1_skygrid_1442Hz_m0.008.dat'



a problem reading the file. maybe random. maybe file corruption. maybe disk problems. looks like you had a lot of these errors a few days ago, with other errors showing "I/O error"

I would first try resetting the project to erase all project files and download new versions.

if it happens again, then maybe examine the disk for signs of failure or issues with the disk, and maybe replace it.

 

_________________________________________________________________________

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 380
Credit: 9704273455
RAC: 22665621

... will try and report back ...

many thanks for your reply

Have a nice Sunday!

S-F-V

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116960184977
RAC: 36755688

I had the exact same problem several weeks ago - a random line in a random skygrid file could (randomly) not be parsed causing all results that depended on the skygrid to fail, even though earlier results for the same skygrid were OK.  It happened multiple times with different lines being reported and a few different skygrids.  After trying quite a few things (disk checks, replacing files, etc.) I started checking the MD5 checksum of a skygrid against what is stored in the state file and found there was no problem with any file that I looked at.

Since these files are quite large, I wondered if the problem might be something to do with loading the whole file in memory and transmitting relevant parts of it over the PCIe bus to the GPU.  I don't know how this all works but I imagined the GPU might call for parts of the file quite regularly and that if the part being called happened to be stored in memory that (intermittently) was not reading correctly, maybe this would explain why things would work for a while and then randomly crash.

I had this thought after days of trying everything else.  The system RAM was 2x4GB sticks so I just replaced them both.  That was a few weeks ago and there hasn't been a single crash since.  If you haven't resolved your problem, replacing the RAM might be worth a try.

 

Cheers,
Gary.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 380
Credit: 9704273455
RAC: 22665621

Seems to have been a bad drive.

I will watch the behavior.

If it reappears, trying new or different memory sticks is next on my list.

Thanks.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 380
Credit: 9704273455
RAC: 22665621

Reset the project and also switched to a new NVMe drive. The error persists.

Will check more.

S-F-V

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 380
Credit: 9704273455
RAC: 22665621

I am at a loss.

All kinds of different errors, all leading up to the same XLAL error.

Parsing error, argument missing, ...

Replaced DIMMs.

Deleted BOINC + files completely and re-installed it.

Reduced GPU load.

Changed percent usage ...

No overheating, no overloading, just a boring setup.

Any ideas would be appreciated.

https://einsteinathome.org/task/1615028363

https://einsteinathome.org/task/1615026855

https://einsteinathome.org/task/1612790312

Thanks  S-F-V

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3927
Credit: 45681612642
RAC: 63974379

you have the same root cause error in all tasks (just different lines and skygrid files)

Quote:
ERROR: could not parse line 574942 in skyGrid-file '../../projects/einstein.phys.uwm.edu/O3ASHF1_skygrid_1470Hz_m0.008.dat'
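
To see at a glance whether the failures cluster on one skygrid file or one line, the quoted message can be pulled out of saved stderr text. This is a minimal Python sketch, assuming the message wording is exactly as quoted in this thread. The function names are made up for illustration and are not part of BOINC or the Einstein@Home application.

```python
import re
from collections import Counter

# Pattern built from the stderr excerpts quoted in this thread (an assumption:
# the wording is copied from the posts, not from the application's source).
SKYGRID_ERROR = re.compile(
    r"could not parse line (\d+) in skyGrid-file '([^']+)'"
)


def find_skygrid_errors(stderr_text):
    """Return (line_number, file_name) for every skygrid parse error found."""
    hits = []
    for match in SKYGRID_ERROR.finditer(stderr_text):
        line_number = int(match.group(1))
        # Keep only the file name, dropping the relative project path.
        file_name = match.group(2).rsplit("/", 1)[-1]
        hits.append((line_number, file_name))
    return hits


def count_by_file(hits):
    """Count how many parse errors were reported per skygrid file."""
    return Counter(file_name for _, file_name in hits)


if __name__ == "__main__":
    # The three messages quoted in this thread, pasted together as sample input.
    sample = (
        "ERROR: could not parse line 1114397 in skyGrid-file "
        "'../../projects/einstein.phys.uwm.edu/O3ASHF1_skygrid_1442Hz_m0.008.dat'\n"
        "ERROR: could not parse line 574942 in skyGrid-file "
        "'../../projects/einstein.phys.uwm.edu/O3ASHF1_skygrid_1470Hz_m0.008.dat'\n"
        "ERROR: could not parse line 5968662 in skyGrid-file "
        "'../../projects/einstein.phys.uwm.edu/O3ASHF1_skygrid_1464Hz_m0.008.dat'\n"
    )
    for line_number, file_name in find_skygrid_errors(sample):
        print(f"{file_name}: line {line_number}")
```

If the same file keeps failing on a different line each time, as in the tasks above, that points away from a damaged file and toward something intermittent in how it is read.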

_________________________________________________________________________

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116960184977
RAC: 36755688

I looked at the third link you provided and saw a task that had started successfully, had been stopped and restarted, and eventually gave the following error message on about the third startup:

ERROR: could not parse line 5968662 in skyGrid-file '../../projects/einstein.phys.uwm.edu/O3ASHF1_skygrid_1464Hz_m0.008.dat'

As I mentioned in my previous message, I had this exact same behaviour which was eventually resolved by replacing the RAM after confirming (using an md5sum check utility - multiple times) that the skygrid files involved did indeed always give the correct MD5 checksum values.  The correct values are stored in the state file.

If you have replaced your RAM, have you actually confirmed that the file on disk does always give the correct checksum?  Perhaps you have some intermittently bad sectors on disk and perhaps the errors only happen when there is a disk read of that part of the skygrid file that contains a flaky sector?  Because the example I looked at had quite a lot of successful computation prior to the error, it seems to suggest that the problem shows intermittently so perhaps you need to run multiple checksum scans to see if you get an occasional failure.
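
The repeated checksum scan suggested above could be scripted roughly as follows. This is only a sketch, under these assumptions: the expected MD5 value is copied by hand from the state file, the names `md5_of_file` and `repeated_check` are invented for this example, and 20 passes is an arbitrary default.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_check(path, expected_md5, passes=20):
    """Hash the file `passes` times; return (pass_number, digest) for each mismatch.

    `expected_md5` is assumed to be the value copied by hand from the state file.
    An empty list means every pass matched.
    """
    expected = expected_md5.strip().lower()
    mismatches = []
    for pass_number in range(1, passes + 1):
        actual = md5_of_file(path)
        if actual != expected:
            mismatches.append((pass_number, actual))
    return mismatches
```

A call such as `repeated_check(path_to_skygrid, expected_md5, passes=50)` returns an empty list when every pass matches. Note that after the first pass the operating system will usually serve the file from its cache in RAM rather than from disk, so an occasional mismatch does not by itself tell you whether the disk or the memory is at fault.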

Cheers,
Gary.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 380
Credit: 9704273455
RAC: 22665621

Finally solved the problem.

The used RAM I had swapped in was also faulty.

Bought a set of brand-new RAM sticks and all is fine.

Thanks for your help.

S-F-V
