More computation error problems

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0
Topic 222889

Hi guys,

Just got my computers switched back over to running exclusively Einstein and I'm getting computation errors on a lot of GW work like others recently.

Having read the other threads on this issue: my GPU drivers and Windows installs are up to date, and none of my GPUs has less than 6 GB of VRAM; utilization is about 3.5 GB on all 7 cards. They are all running 2 tasks each. Below is the task report for one of the failed GW tasks. I sampled a few of the failed tasks and they were all the same. I just need help interpreting the data, please! I have also posted the startup portion of the BOINC event log, in case that helps.

Name: h1_1591.20_O2C02Cl4In0__O2MDFV2h_VelaJr1_1592.10Hz_418_1
Workunit ID: 459260163
Created: 30 May 2020 11:46:30 UTC
Sent: 8 Jun 2020 0:35:11 UTC
Report deadline: 15 Jun 2020 0:35:11 UTC
Received: 8 Jun 2020 6:23:10 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 1024 (0x00000400) Unknown error code
Computer: 11777687
Run time (sec): 126.51
CPU time (sec): 118.13
Peak working set size (MB): 484.21
Peak swap size (MB): 1964.16
Peak disk usage (MB): 0.02
Validation state: Invalid
Granted credit: 0
Application: Gravitational Wave search O2 Multi-Directional GPU v2.07 (GW-opencl-nvidia) windows_x86_64


Stderr output

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 1024 (0x400)</message>
<stderr_txt>
putenv 'LAL_DEBUG_LEVEL=3'
2020-06-07 23:19:02.2123 (10960) [normal]: This program is published under the GNU General Public License, version 2
2020-06-07 23:19:02.2123 (10960) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2020-06-07 23:19:02.2123 (10960) [normal]: This Einstein@home App was built at: Dec 19 2019 12:14:49

2020-06-07 23:19:02.2123 (10960) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_O2MDF_2.07_windows_x86_64__GW-opencl-nvidia.exe'.
Activated exception handling...
[DEBUG} GPU type: 1
[DEBUG} got GPU info from BOINC
[DEBUG} got VendorID 4318
2020-06-07 23:19:02.2748 (10960) [debug]: BSGL output files
2020-06-07 23:19:02.2905 (10960) [debug]: Flags: LAL_DEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, X64, SSE, SSE2, GNUC X86 GNUX86
2020-06-07 23:19:02.2905 (10960) [debug]: Set up communication with graphics process.

DEPRECATION WARNING: program has invoked obsolete function XLALGetVersionString(). Please see XLALVCSInfoString() for information about a replacement.
Code-version: %% LAL: 6.19.2.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALPulsar: 1.17.1.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALApps: 6.23.0.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)

2020-06-07 23:19:02.9727 (10960) [normal]: Reading input data ... 2020-06-07 23:20:41.5491 (10960) [normal]: Search FstatMethod used: 'ResampOpenCL'
2020-06-07 23:20:41.5491 (10960) [normal]: Recalc FstatMethod used: 'DemodSSE'
2020-06-07 23:20:41.5491 (10960) [normal]: OpenCL Device used for Search/Recalc and/or semi coherent step: 'GeForce RTX 2060 (Platform: NVIDIA CUDA, global memory: 6144 MiB)'
2020-06-07 23:20:41.5647 (10960) [normal]: OpenCL version is used for the semi-coherent step!
2020-06-07 23:21:04.2476 (10960) [normal]: Number of segments: 17, total number of SFTs in segments: 10091
done.
% --- GPS reference time = 1177858472.0000 , GPS data mid time = 1177858472.0000
2020-06-07 23:21:04.3257 (10960) [normal]: dFreqStack = 4.035776e-007, df1dot = 2.558432e-012, df2dot = 1.356969e-018, df3dot = 0.000000e+000
% --- Setup, N = 17, T = 864000 s, Tobs = 19750204 s, gammaRefine = 31, gamma2Refine = 51, gamma3Refine = 1

DEPRECATION WARNING: program has invoked obsolete function InitDopplerSkyScan(). Please see XLALInitDopplerSkyScan() for information about a replacement.
2020-06-07 23:21:04.3413 (10960) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:49, sky:1/1, f1dot:1/49

0.% --- CG:2118404 FG:123892 f1dotmin_fg:-6.690155795097e-008 df1dot_fg:8.253006451613e-014 f2dotmin_fg:-6.651808823529e-019 df2dot_fg:2.660723529412e-020 f3dotmin_fg:0 df3dot_fg:1
XLAL Error - XLALComputeECLFFT_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:1248): Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE
XLAL Error - XLALComputeECLFFT_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:1248): Internal function call failed
XLAL Error - XLALComputeFaFb_Resamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:654): Check failed: (*fftfuncs->computefft_func) ( fftfuncs->fftplan, ws->TS_FFT, ((void *)0) ) == XLAL_SUCCESS
XLAL Error - XLALComputeFaFb_Resamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:654): Internal function call failed
XLAL Error - XLALComputeFstatResamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:441): Check failed: XLALComputeFaFb_Resamp_OpenCL ( resamp, ws, thisPoint, common->dFreq, numFreqBins, TimeSeriesX_SRC_a, TimeSeriesX_SRC_b ) == XLAL_SUCCESS
XLAL Error - XLALComputeFstatResamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:441): Internal function call failed
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:875): Check failed: (input->method_funcs.compute_func) ( *Fstats, common, input->method_data ) == XLAL_SUCCESS
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:875): Internal function call failed
MAIN: XLALComputeFstat() failed with errno=1024
2020-06-07 23:21:04.8413 (10960) [CRITICAL]: ERROR: MAIN() returned with error '1024'
2020-06-07 23:21:04.8413 (10960) [debug]: resultfile '../../projects/einstein.phys.uwm.edu/h1_1591.20_O2C02Cl4In0__O2MDFV2h_VelaJr1_1592.10Hz_418_1_0' (len 95), current config file: 0
Code-version: %% LAL: 6.19.2.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALPulsar: 1.17.1.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALApps: 6.23.0.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)

FPU status flags: COND_0 PRECISION
2020-06-07 23:21:04.8570 (10960) [debug]: worker done. return(1024) to caller
2020-06-07 23:21:04.8570 (10960) [normal]: done. calling boinc_finish(1024).
23:21:04 (10960): called boinc_finish

</stderr_txt>
]]>

6/20/2020 3:57:37 PM |  | cc_config.xml not found - using defaults
6/20/2020 3:57:37 PM |  | Starting BOINC client version 7.16.7 for windows_x86_64
6/20/2020 3:57:37 PM |  | Libraries: libcurl/7.47.1 OpenSSL/1.0.2s zlib/1.2.8
6/20/2020 3:57:37 PM |  | Data directory: C:\ProgramData\BOINC
6/20/2020 3:57:37 PM |  | Running under account trickster
6/20/2020 3:57:38 PM |  | CUDA: NVIDIA GPU 0: GeForce RTX 2080 Ti (driver version 446.14, CUDA version 11.0, compute capability 7.5, 4096MB, 3539MB available, 13448 GFLOPS peak)
6/20/2020 3:57:38 PM |  | CUDA: NVIDIA GPU 1: GeForce GTX 1660 (driver version 446.14, CUDA version 11.0, compute capability 7.5, 4096MB, 3555MB available, 5153 GFLOPS peak)
6/20/2020 3:57:38 PM |  | OpenCL: NVIDIA GPU 0: GeForce RTX 2080 Ti (driver version 446.14, device version OpenCL 1.2 CUDA, 11264MB, 3539MB available, 13448 GFLOPS peak)
6/20/2020 3:57:38 PM |  | OpenCL: NVIDIA GPU 1: GeForce GTX 1660 (driver version 446.14, device version OpenCL 1.2 CUDA, 6144MB, 3555MB available, 5153 GFLOPS peak)
6/20/2020 3:57:38 PM |  | Windows processor group 0: 12 processors
6/20/2020 3:57:38 PM |  | Host name: DESKTOP-BU27MB4
6/20/2020 3:57:38 PM |  | Processor: 12 GenuineIntel Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz [Family 6 Model 158 Stepping 10]
6/20/2020 3:57:38 PM |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx smx tm2 pbe fsgsbase bmi1 hle smep bmi2
6/20/2020 3:57:38 PM |  | OS: Microsoft Windows 10: Core x64 Edition, (10.00.18363.00)
6/20/2020 3:57:38 PM |  | Memory: 31.94 GB physical, 36.69 GB virtual
6/20/2020 3:57:38 PM |  | Disk: 465.16 GB total, 378.96 GB free
6/20/2020 3:57:38 PM |  | Local time is UTC -7 hours
6/20/2020 3:57:38 PM |  | No WSL found.
6/20/2020 3:57:38 PM |  | VirtualBox version: 6.0.14
6/20/2020 3:57:38 PM |  | General prefs: from http://boinc.bakerlab.org/rosetta/ (last modified 22-Apr-2020 00:52:41)
6/20/2020 3:57:38 PM |  | Host location: none
6/20/2020 3:57:38 PM |  | General prefs: using your defaults
6/20/2020 3:57:38 PM |  | Reading preferences override file
6/20/2020 3:57:38 PM |  | Preferences:
6/20/2020 3:57:38 PM |  | max memory usage when active: 24526.50 MB
6/20/2020 3:57:38 PM |  | max memory usage when idle: 24526.50 MB
6/20/2020 3:57:39 PM |  | max disk usage: 30.00 GB
6/20/2020 3:57:39 PM |  | (to change preferences, visit a project web site or select Preferences in the Manager)
6/20/2020 3:57:39 PM |  | Setting up project and slot directories
6/20/2020 3:57:39 PM |  | Checking active tasks
6/20/2020 3:57:39 PM | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 12782230; resource share 100
6/20/2020 3:57:39 PM | Rosetta@home | URL https://boinc.bakerlab.org/rosetta/; Computer ID 3710317; resource share 100
6/20/2020 3:57:39 PM |  | Setting up GUI RPC socket
6/20/2020 3:57:39 PM |  | Checking presence of 197 project files

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7346501687
RAC: 2209867

The CL_MEM_OBJECT_ALLOCATION_FAILURE entry in the stderr you posted is the standard signature of a GW task needing more GPU RAM than is available. On some card/driver/OS combinations this materially extends completion time (apparently some form of paging takes place between GPU RAM and system RAM), while on others the task simply errors out. Your setup appears to be in the second category.
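
If you want to check a batch of failed tasks for this signature without reading each stderr by eye, a minimal Python sketch could look like the following. The function name and the sample text are only illustrative (the sample abbreviates the long source path from the real stderr), and it assumes you have already saved the stderr text yourself.

```python
# Signature string taken from the stderr output quoted in this thread.
OOM_SIGNATURE = "CL_MEM_OBJECT_ALLOCATION_FAILURE"


def find_gpu_oom_lines(stderr_text):
    """Return the stderr lines that contain the OpenCL allocation-failure signature."""
    return [line for line in stderr_text.splitlines() if OOM_SIGNATURE in line]


# Abbreviated sample; the real stderr carries the full source path on each line.
sample = """\
% --- Cpt:0, total:49, sky:1/1, f1dot:1/49
XLAL Error - XLALComputeECLFFT_OpenCL (ComputeFstat_Resamp_OpenCL.c:1248): Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE
MAIN: XLALComputeFstat() failed with errno=1024
"""

hits = find_gpu_oom_lines(sample)
print(f"{len(hits)} line(s) with the out-of-GPU-memory signature")
```

A task whose stderr yields no hits probably failed for some other reason and deserves a separate look.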

You mention that you are running at 2X and that your GPU RAM is 6 GB or more. Yet the computer link for your post declares that at least one card on that host is a 4 GB card. A majority of currently issued Einstein GW GPU tasks will not fit in GPU RAM when run at 2X on a 4 GB card. A smaller but material group doesn't fit at 2X on a 6 GB card. The specific task for which you posted information has a DF (Delta Frequency, in this case 1592.10 - 1591.20) of 0.90, and thus is in the group with the highest GPU RAM requirements.
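
The DF arithmetic can be read straight off the task name. Here is a small Python sketch, assuming task names follow the layout seen in this thread (first frequency right after the detector prefix, second frequency just before "Hz"); the function name is made up for illustration.

```python
import re


def delta_frequency(task_name):
    """Return DF in Hz: the second frequency in the task name minus the first.

    Assumes the layout seen in this thread: '<detector>_<f1>_..._<f2>Hz_...'.
    """
    first = re.match(r"[^_]+_(\d+\.\d+)_", task_name)
    second = re.search(r"_(\d+\.\d+)Hz_", task_name)
    if not first or not second:
        raise ValueError(f"unrecognized task name: {task_name!r}")
    # Round to two decimals to hide floating-point noise in the subtraction.
    return round(float(second.group(1)) - float(first.group(1)), 2)


name = "h1_1591.20_O2C02Cl4In0__O2MDFV2h_VelaJr1_1592.10Hz_418_1"
print(delta_frequency(name))  # 0.9
```

Sorting your failed tasks by this value is a quick way to see whether the errors cluster at the high-DF end.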

I've just now made a post with a graph giving my observed GPU RAM reported utilization vs. DF and multiplicity.

I think that if you switch the cards currently generating this type of error from 2X to 1X operation, these specific errors will disappear.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1632545099
RAC: 584360

 "Yet the computer link for your post declares that at least one card on that host is a 4 Gb card."

AFAIK that is a bug in BOINC. It reads the card's memory in 32-bit mode, so it only sees 4 GB of memory.
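
The event log in the first post shows the mismatch directly: the CUDA line for the 2080 Ti reports 4096MB while the OpenCL line for the same card reports 11264MB. A rough Python sketch that pulls both numbers out of those two log lines follows; the regular expression is my own guess at the line layout, not anything BOINC documents. Note that 4096 MB is exactly what a 32-bit byte count tops out at, which fits the explanation above, though the log alone does not say where the cap is applied.

```python
import re

# Two GPU detection lines copied from the event log in this thread.
log_lines = [
    "6/20/2020 3:57:38 PM |  | CUDA: NVIDIA GPU 0: GeForce RTX 2080 Ti (driver version 446.14, CUDA version 11.0, compute capability 7.5, 4096MB, 3539MB available, 13448 GFLOPS peak)",
    "6/20/2020 3:57:38 PM |  | OpenCL: NVIDIA GPU 0: GeForce RTX 2080 Ti (driver version 446.14, device version OpenCL 1.2 CUDA, 11264MB, 3539MB available, 13448 GFLOPS peak)",
]

# Hypothetical pattern: API name, GPU index, model, then the total-memory field.
pattern = re.compile(
    r"(CUDA|OpenCL): NVIDIA GPU (\d+): (.+?) \(.*?(\d+)MB, \d+MB available"
)


def reported_memory(lines):
    """Map (gpu index, API) -> total memory in MB as printed by the client."""
    result = {}
    for line in lines:
        match = pattern.search(line)
        if match:
            api, index, _model, total_mb = match.groups()
            result[(int(index), api)] = int(total_mb)
    return result


mem = reported_memory(log_lines)
cap_mb = 2**32 // 2**20  # largest value a 32-bit byte count can express, in MB
for (index, api), total_mb in sorted(mem.items()):
    note = " (equals the 32-bit cap)" if total_mb == cap_mb else ""
    print(f"GPU {index} via {api}: {total_mb} MB{note}")
```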

I do know the apps can use all the memory. My 6 GB GTX 1660 Super has been running 2X for weeks without the out-of-memory problem, whereas my 3 GB GTX 1060 cards do have the problem, so they only do pulsars.

GPU-Z regularly shows almost 6 GB in use, though never quite reaching it, and I do all the GW work the project will send me.

Methinks something else is going on.

Were the drivers installed from the Nvidia site?

There have been problems in the past with getting the Nvidia drivers from MS.

When they were installed, did you choose the clean install option?

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

What a bummer. What is going on with Einstein/BOINC as of the last 6ish months? First I am unable to run Rosetta and Einstein concurrently without Rosetta hogging resources and now this issue. I have been running the two in peace for about 5 years with no issues! 

My smallest card is a GTX 1660 and it has 6 GB of VRAM. Archae86, you are right: I watched GPU-Z for several minutes and VRAM usage started at 3 GB, crept up to 6 GB, then jumped back down to 3 GB. What is weird is that the biggest card, a 2080 Ti with 11 GB of VRAM, is having tons of errored work units as well.

Archae86, I have just set my preferences to 1 instance per GPU, fingers crossed that that helps! I really don't want to go back to folding@home!

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7346501687
RAC: 2209867

th3tricky wrote:

Archae86, I have just set my preferences to 1 instance per GPU, fingers crossed that that helps! I really don't want to go back to folding@home!

Please let us know what happens next.  It will add to the picture.

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

I've got another question now: How long does it take BOINC to adjust after changing preferences online under my account? 

I will certainly update with the results.

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7346501687
RAC: 2209867

th3tricky wrote:
I've got another question now: How long does it take BOINC to adjust after changing preferences online under my account? 

It takes effect for all tasks of that type, including ones currently executing, immediately on the download of the next task of that application.  But not sooner, even if that is days.

This means you have to unsuspend all tasks, as otherwise tasks won't be requested.  Also, it means you must currently be allowed to get work, and not in the penalty box for too many recent errors or too many tasks already received today.

On the other hand, if you control multiplicity using an app_config.xml file, the change takes effect within seconds of your doing a "read config files".  Once you go down that road, the way back is not simple.  But as one who just started using app_config.xml for this purpose yesterday, I'll observe that it certainly has some real immediacy advantage.
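
For anyone who wants to try the app_config.xml route, here is a hedged Python sketch that generates a minimal file setting GW GPU tasks to 1X. The element names (app_config, app, name, gpu_versions, gpu_usage, cpu_usage) are the ones BOINC's app_config.xml format uses. The app name einstein_O2MDF is an assumption inferred from the executable name in the stderr earlier in this thread, so confirm it against your own client_state.xml before relying on it. The cpu_usage of 1.0 is likewise just a placeholder choice.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_usage, cpu_usage):
    """Build an app_config.xml tree that controls tasks per GPU for one app.

    gpu_usage is the fraction of a GPU each task claims: 1.0 means one task
    per GPU (1X), 0.5 means two tasks per GPU (2X).
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return root


# Assumption: app name inferred from 'einstein_O2MDF_2.07_windows_x86_64__...exe'.
root = build_app_config("einstein_O2MDF", gpu_usage=1.0, cpu_usage=1.0)
if hasattr(ET, "indent"):  # pretty-printing helper, available from Python 3.9
    ET.indent(root)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Save the printed text as app_config.xml in the Einstein project directory (on the host in this thread that would be under C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu), then do a "read config files" in the Manager.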

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

Hmm, maybe I'll wait another day to see if it changes. Has been 24 hours now and still 2 tasks per GPU running.

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7346501687
RAC: 2209867

If you have downloaded any new GW GPU tasks and the change has not taken effect, something is wrong, and it won't get better with waiting.

Here are some candidates; maybe someone else can chip in other possibilities to check:

1. I assume you are changing the requested GPU utilization factor of GW apps and have set it to 1 instead of 0.5 and saved the change.

But could it be that you changed it with the "preference set" scrolling item at the top of the preferences page set to a different location (of generic|home|school|work) than the one to which the host is actually assigned?

2. Could it be that you have an app_config.xml file installed in the Einstein directory under the BOINC directory, which could have settings which would take precedence over settings from the web page?

3. Could it be that you have so much work in queue ("in progress" as termed on the account page) that your host is not actually requesting any work?

4. If it is requesting work, do you have several types of work requests enabled, and it could be requesting work but not new GW GPU tasks?

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

Easy fix: I had my BOINC clients set to "no new tasks" while I sorted out the issue. One WU per GPU seems to be working for the time being. What is interesting is that right before I figured out why preferences weren't updating, I stopped generating bad WUs. I may play with the GPU utilization a bit and see if VRAM is really the culprit.

Thanks for the help thus far!