Consistent unknown error with GPU

Jim Remington
Joined: 14 Sep 21
Posts: 5
Credit: 526185101
RAC: 471478
Topic 226414

Hi, All:

All jobs that attempt to use the GTX-650 GPU on my eight-core Win10 machine fail immediately with "unknown error". CPU-only tasks work fine. I'm not able to interpret the log file, but if someone else can see what the problem is, I would love to know!

Recent example:

Task 1191193861

Name: h1_0323.80_O3aC01Cl1In0__O3AS1_324.00Hz_4747_1

Workunit ID: 586973982

Created: 14 Nov 2021 3:01:14 UTC

Sent: 14 Nov 2021 3:18:02 UTC

Report deadline: 21 Nov 2021 3:18:02 UTC

Received: 14 Nov 2021 8:17:38 UTC

Server state: Over

Outcome: Computation error

Client state: Compute error

Exit status: -1 (0xFFFFFFFF) Unknown error code

Computer: 12901980

Run time (sec): 37.07

CPU time (sec): 34.20

Peak working set size (MB): 270.64

Peak swap size (MB): 1047.95

Peak disk usage (MB): 0.01

Validation state: Invalid

Granted credit: 0

Application: Gravitational Wave search O3 All-Sky #1 v1.01 (GW-opencl-nvidia)
windows_x86_64


Stderr output

<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 4294967295 (0xffffffff)</message>
<stderr_txt>
putenv 'LAL_DEBUG_LEVEL=3'
2021-11-14 00:07:41.3030 (916) [normal]: This program is published under the GNU General Public License, version 2
2021-11-14 00:07:41.3059 (916) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2021-11-14 00:07:41.3089 (916) [normal]: This Einstein@home App was built at: Aug  5 2021 15:20:43

2021-11-14 00:07:41.3108 (916) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_O3AS_1.01_windows_x86_64__GW-opencl-nvidia.exe'.
Activated exception handling...
[DEBUG} GPU type: 1
[DEBUG} got GPU info from BOINC
[DEBUG} got VendorID 4318
2021-11-14 00:07:41.3635 (916) [debug]: Flags: LAL_DEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, X64, SSE, SSE2, GNUC X86 GNUX86
2021-11-14 00:07:41.3723 (916) [debug]: Set up communication with graphics process.
2021-11-14 00:07:41.3743 (916) [normal]: Parsed user input successfully

DEPRECATION WARNING: program has invoked obsolete function XLALGetVersionString(). Please see XLALVCSInfoString() for information about a replacement.
Code-version: %% LAL: 6.21.0.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)
%% LALPulsar: 1.18.2.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)
%% LALApps: 6.25.1.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)

2021-11-14 00:07:41.3782 (916) [normal]: Initialise compartments with freqWidth = 0.05 and candidates per compartment = 3000.
2021-11-14 00:07:42.3283 (916) [normal]: Reading input data ...
2021-11-14 00:07:42.3293 (916) [normal]: Loading SFTs matching '..\..\projects\einstein.phys.uwm.edu\h1_0323.80_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0323.80_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0324.00_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0324.00_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0324.20_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0324.20_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0324.40_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0324.40_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0324.60_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0324.60_O3aC01Cl1In0' into catalog ...2021-11-14 00:07:44.9833 (916) [normal]: done.
2021-11-14 00:07:44.9843 (916) [normal]: Validating SFTs (detectors: H1, L1, ) ... success.
2021-11-14 00:07:53.4428 (916) [normal]: Search FstatMethod used: 'ResampGPU'
2021-11-14 00:07:53.4428 (916) [normal]: Recalc FstatMethod used: 'DemodSSE'
2021-11-14 00:07:53.4438 (916) [normal]: GPU Device used for Search/Recalc and/or semi coherent step: 'NVIDIA GeForce GTX 650 ( Platform: NVIDIA CUDA )'
2021-11-14 00:07:53.4457 (916) [normal]: GPU Backend used for Search/Recalc and/or semi coherent step: 'OpenCL'
2021-11-14 00:07:53.4467 (916) [normal]: GPU version is used for the semi-coherent step!
2021-11-14 00:08:08.9044 (916) [normal]: Number of segments: 37, total number of SFTs in segments: 11745
2021-11-14 00:08:08.9229 (916) [normal]: Finished reading input data.
% --- GPS reference time = 1246070525.0000 , GPS data mid time = 1246070525.0000
2021-11-14 00:08:08.9239 (916) [normal]: dFreqStack = 2.000000e-006, df1dot = 1.500000e-010, df2dot = 0.000000e+000, df3dot = 0.000000e+000
% --- Setup, N = 37, T = 432000 s, Tobs = 15809012 s, gammaRefine = 250, gamma2Refine = 4653, gamma3Refine = 1

DEPRECATION WARNING: program has invoked obsolete function InitDopplerSkyScan(). Please see XLALInitDopplerSkyScan() for information about a replacement.
2021-11-14 00:08:15.0298 (916) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:2000, sky:1/100, f1dot:1/20

0.% --- CG:9272015 FG:250000 f1dotmin_fg:-2.717183860529e-009 df1dot_fg:5.97609561753e-013 f2dotmin_fg:0 df2dot_fg:0 f3dotmin_fg:0 df3dot_fg:1
XLAL Error - XLALOpenCLExecuteKernel (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/OpenCLUtils.c:506): Enqueue OpenCL kernel failed with OpenCL error: CL_MEM_OBJECT_ALLOCATION_FAILURE
XLAL Error - XLALOpenCLExecuteKernel (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/OpenCLUtils.c:506): Generic failure
XLAL Error - XLALSemiCohStep_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT_OpenCL.c:137): Check failed: XLALOpenCLExecuteKernel ( &(GCTOpenCLKernels.kernel_SemiCohStep), &size, 1 ) == XLAL_SUCCESS
XLAL Error - XLALSemiCohStep_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT_OpenCL.c:137): Internal function call failed: Generic failure
XLAL Error - XLALSemiCohStep_GPU (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:4396): Check failed: (*usefulparams->gct_gpu_funcs->SemiCohStep) ( coarsegrid, finegrid, stacks, NSegmentsInv, toplists_sortby, usefulparams->BSGLsetupGPU, usefulparams->computeBSGL, usefulparams->getMaxFperSeg, toplist1_last_entryGPU->data, toplist2_last_entryGPU->data, toplist3_last_entryGPU->data ) == XLAL_SUCCESS
XLAL Error - XLALSemiCohStep_GPU (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:4396): Internal function call failed: Generic failure
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:2214): Check failed: XLALSemiCohStep_GPU( &coarsegrid, &finegrid, nStacks, &usefulParams, NSegmentsInv, uvar->SortToplist, compartment, compartment2, compartment3) == XLAL_SUCCESS
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:2214): Internal function call failed: Generic failure
2021-11-14 00:08:15.6958 (916) [CRITICAL]: ERROR: MAIN() returned with error '-1'
Code-version: %% LAL: 6.21.0.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)
%% LALPulsar: 1.18.2.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)
%% LALApps: 6.25.1.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)

FPU status flags: PRECISION
2021-11-14 00:08:15.7113 (916) [debug]: worker done. return(-1) to caller
2021-11-14 00:08:15.7123 (916) [normal]: done. calling boinc_finish(-1).
00:08:15 (916): called boinc_finish

</stderr_txt>
]]>





Jim Remington

After reading through earlier forum posts, I now understand that the problem is likely due to too little GPU memory, and so I have disabled GW searches.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7204834931
RAC: 935942

The key entry in your stderr is probably this:

CL_MEM_OBJECT_ALLOCATION_FAILURE

If you do a search on the Einsteinathome.org site, you'll find that many people report seeing this message when the GPU they are providing lacks sufficient RAM for the task(s) sent to it.

Your system is reported as providing:

 GeForce GTX 650 (1024MB)

If you restrict the GPU to Gamma-Ray pulsar tasks, you may find that the card handles them more consistently than the Gravitational Wave tasks.
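For anyone facing a similarly long stderr dump, here is a minimal Python sketch for pulling out the relevant line. It assumes the log uses the same "OpenCL error: CL_..." wording as the task above. The function names are made up for illustration and the sample path is shortened. It also shows why the exit status appears both as -1 and as 4294967295: they are the same 32-bit value read as signed or unsigned.

```python
import re

# Matches OpenCL status names such as CL_MEM_OBJECT_ALLOCATION_FAILURE.
# Assumes the "OpenCL error: <NAME>" wording seen in the stderr above.
OPENCL_ERROR = re.compile(r"OpenCL error: (CL_[A-Z0-9_]+)")


def first_opencl_error(stderr_text):
    """Return the first OpenCL status name in the text, or None if absent."""
    match = OPENCL_ERROR.search(stderr_text)
    return match.group(1) if match else None


def as_signed_32(exit_code):
    """Reinterpret an unsigned 32-bit exit code as a signed value."""
    return exit_code - 2**32 if exit_code >= 2**31 else exit_code


# Two lines from the task's stderr, with the source path shortened here.
sample = (
    "XLAL Error - XLALOpenCLExecuteKernel (OpenCLUtils.c:506): "
    "Enqueue OpenCL kernel failed with OpenCL error: "
    "CL_MEM_OBJECT_ALLOCATION_FAILURE\n"
    "XLAL Error - XLALOpenCLExecuteKernel (OpenCLUtils.c:506): "
    "Generic failure\n"
)

print(first_opencl_error(sample))
print(as_signed_32(4294967295))
```

The later "Generic failure" lines in the log are the same error passed back up the call chain, so the first OpenCL status name is usually the one worth searching the forum for.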

 

Jim Remington

Thank you!
