!DEBUG, Deprecation Warnings, obsolete function, & sub 50% utilization

TribbleRED
Joined: 14 Oct 20
Posts: 3
Credit: 766441625
RAC: 508
Topic 225750

Hello folks,

Two systems, same problem. One node config below, a task it completed, & the task log:

Win10 Pro (10.0.19043)
Gigabyte x570 Aorus Xtreme
5950x
128GB (4x32GB) Trident Royal Z 3600 @ 16-22-22-42
1x Gigabyte RTX 3090 Gaming OC
3x Western Digital Black sn850 in RAID0 (AMD-RAID)
ALL drivers up-to-date
No Gigabyte settings software installed

Environment: Most of my nodes are in racks; the rest are my two personal machines. They run 24/7 and run only these applications, continuously (no start/stop action). Internet is constant. Racks are on UPS. Temps are high during the summer months but within operational limits. The personal machines spend VERY few CPU/GPU cycles on anything outside of BOINC while they run around the clock. The one in question is a new build entering the last stages of run-up before I deploy it.

Application: Gravitational Wave search O3 All-Sky #1 v1.00 (GW-opencl-nvidia)

Problem: The application does not utilize more than 42% (observed) of the available GPU core at any given time during a run.

Background: I noticed that a 3090 was taking longer to complete tasks than a 1660 Super. MSI Afterburner and HWiNFO both indicated less than 50% GPU utilization. ALL thermals (including memory junction temperature) were well within limits, and there is no indication of faulty hardware or a faulty hardware configuration; a myriad of other applications, both within and outside of BOINC/VBOX, back this up. I cannot identify the cause of the error indicators in the logs because I do not understand the nomenclature. On further investigation, the same thing also appears to be happening on my 1660 Super, which I have only just noticed because I had taken its run times for granted as "normal", and it may also be affecting my Tesla K80s. At the moment I have no further information on the 1660 Super beyond what is in the Einstein task logs.

The following is from Task 1146876861 - one of many identical tasks performed by the 3090 node outlined above

  • Stderr output

  • <core_client_version>7.16.11</core_client_version>
    <![CDATA[
    <stderr_txt>
    putenv 'LAL_DEBUG_LEVEL=3'
    2021-07-25 01:10:19.0381 (15308) [normal]: This program is published under the GNU General Public License, version 2
    2021-07-25 01:10:19.0461 (15308) [normal]: For details see http://einstein.phys.uwm.edu/license.php
    2021-07-25 01:10:19.0521 (15308) [normal]: This Einstein@home App was built at: Jun 28 2021 14:35:05
    
    

    2021-07-25 01:10:19.0571 (15308) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_O3AS_1.00_windows_x86_64__GW-opencl-nvidia.exe'.
    Activated exception handling...
    [DEBUG} GPU type: 1
    [DEBUG} got GPU info from BOINC
    [DEBUG} got VendorID 4318
    2021-07-25 01:10:19.2032 (15308) [debug]: Flags: LAL_DEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, X64, SSE, SSE2, GNUC X86 GNUX86
    2021-07-25 01:10:19.2132 (15308) [debug]: Set up communication with graphics process.
    2021-07-25 01:10:19.2162 (15308) [normal]: Parsed user input successfully

    DEPRECATION WARNING: program has invoked obsolete function XLALGetVersionString(). Please see XLALVCSInfoString() for information about a replacement.
    Code-version: %% LAL: 6.21.0.1 (CLEAN b2ca8445cafe6d2a9cd3c12385d5762781b57b6d)
    %% LALPulsar: 1.18.2.1 (CLEAN b2ca8445cafe6d2a9cd3c12385d5762781b57b6d)
    %% LALApps: 6.25.1.1 (CLEAN b2ca8445cafe6d2a9cd3c12385d5762781b57b6d)

    2021-07-25 01:10:19.2222 (15308) [normal]: Initialise compartments with freqWidth = 0.05 and candidates per compartment = 3000.
    2021-07-25 01:10:20.0639 (15308) [normal]: Reading input data ...
    2021-07-25 01:10:20.0639 (15308) [normal]: Loading SFTs matching '..\..\projects\einstein.phys.uwm.edu\h1_0453.20_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0453.20_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0453.40_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0453.40_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0453.60_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0453.60_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0453.80_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0453.80_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0454.00_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0454.00_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\h1_0454.20_O3aC01Cl1In0;..\..\projects\einstein.phys.uwm.edu\l1_0454.20_O3aC01Cl1In0' into catalog ...2021-07-25 01:10:24.0224 (15308) [normal]: done.
    2021-07-25 01:10:24.0234 (15308) [normal]: Validating SFTs (detectors: H1, L1, ) ... success.
    2021-07-25 01:10:34.0750 (15308) [normal]: Search FstatMethod used: 'ResampGPU'
    2021-07-25 01:10:34.0760 (15308) [normal]: Recalc FstatMethod used: 'DemodSSE'
    2021-07-25 01:10:34.0770 (15308) [normal]: GPU Device used for Search/Recalc and/or semi coherent step: 'NVIDIA GeForce RTX 3090 ( Platform: NVIDIA CUDA )'
    2021-07-25 01:10:34.0790 (15308) [normal]: GPU Backend used for Search/Recalc and/or semi coherent step: 'OpenCL'
    2021-07-25 01:10:34.0810 (15308) [normal]: GPU version is used for the semi-coherent step!
    2021-07-25 01:10:54.7658 (15308) [normal]: Number of segments: 37, total number of SFTs in segments: 11745
    2021-07-25 01:10:54.8179 (15308) [normal]: Finished reading input data.
    % --- GPS reference time = 1246070525.0000 , GPS data mid time = 1246070525.0000
    2021-07-25 01:10:54.8189 (15308) [normal]: dFreqStack = 2.000000e-006, df1dot = 1.500000e-010, df2dot = 0.000000e+000, df3dot = 0.000000e+000
    % --- Setup, N = 37, T = 432000 s, Tobs = 15809012 s, gammaRefine = 250, gamma2Refine = 4653, gamma3Refine = 1

    DEPRECATION WARNING: program has invoked obsolete function InitDopplerSkyScan(). Please see XLALInitDopplerSkyScan() for information about a replacement.
    2021-07-25 01:11:03.1881 (15308) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
    % --- Cpt:0, total:2000, sky:1/100, f1dot:1/20

    0.% --- CG:9272015 FG:250000 f1dotmin_fg:-2.717183860529e-009 df1dot_fg:5.97609561753e-013 f2dotmin_fg:0 df2dot_fg:0 f3dotmin_fg:0 df3dot_fg:1
    ...................
    1....................
    2....INFO: Major Windows version: 6
    c
    ................
    3....................
    4....................
    5....................
    6....................
    7....................
    8....................
    9....................
    10..c
    ..................
    11....................
    12....................
    13....................
    14....................
    15....................
    16....................
    17....................
    18....................
    19....................
    20....c
    ................
    21....................
    22....................
    23....................
    24....................
    25....................
    26....................
    27....................
    28....................
    29....................
    30....................
    31..............c
    ......
    32....................
    33....................
    34....................
    35....................
    36....................
    37....................
    38....................
    39....................
    40....................
    41....................
    42....................
    43............c
    ........
    44....................
    45....................
    46....................
    47....................
    48....................
    49....................
    50....................
    51....................
    52....................
    53....................
    54....................
    55...................c
    .
    56....................
    57....................
    58....................
    59....................
    60....................
    61....................
    62....................
    63....................
    64....................
    65....................
    66....................
    67....................
    68....................
    69....................c

    70....................
    71....................
    72....................
    73....................
    74....................
    75....................
    76....................
    77....................
    78....................
    79....................
    80....................
    81....................
    82....................
    83.c
    ...................
    84....................
    85....................
    86....................
    87....................
    88....................
    89....................
    90....................
    91....................
    92....................
    93....................
    94....................
    95....................
    96....c
    ................
    97....................
    98....................
    99....................
    2021-07-25 01:20:36.7945 (15308) [normal]: Finished main analysis.
    2021-07-25 01:20:36.7955 (15308) [normal]: Recalculating statistics for the final toplist...
    2021-07-25 01:34:37.6423 (15308) [normal]: Finished recalculating toplist statistics.
    2021-07-25 01:34:37.6433 (15308) [normal]: Finished with peak RAM usage: -1.0 MB on CPU 'AMD Ryzen 9 5950X 16-Core Processor 'The system cannot find the path specified.
    , peak VRAM usage: 873.8 MB on GPU Device: 'NVIDIA GeForce RTX 3090 ( Platform: NVIDIA CUDA )' with backend: 'OpenCL'.
    2021-07-25 01:34:37.6723 (15308) [debug]: Writing output ... done.

    DEPRECATION WARNING: program has invoked obsolete function FreeDopplerSkyScan(). Please see XLALDestroyDopplerSkyScan() for information about a replacement.
    Code-version: %% LAL: 6.21.0.1 (CLEAN b2ca8445cafe6d2a9cd3c12385d5762781b57b6d)
    %% LALPulsar: 1.18.2.1 (CLEAN b2ca8445cafe6d2a9cd3c12385d5762781b57b6d)
    %% LALApps: 6.25.1.1 (CLEAN b2ca8445cafe6d2a9cd3c12385d5762781b57b6d)

    FPU status flags: PRECISION
    2021-07-25 01:34:38.5571 (15308) [debug]: worker done. return(0) to caller
    2021-07-25 01:34:38.5581 (15308) [normal]: done. calling boinc_finish(0).
    01:34:38 (15308): called boinc_finish

    </stderr_txt>
    ]]>



Does the log above give any insight into the cause of the less-than-half GPU usage and, if so, what is the fix or workaround? Since this is happening on (at least) two of my nodes, it may very well be happening on my others that have not run Einstein recently enough for me to pull logs down through this portal. I'm guessing it's a config error on my part, but I don't want to change anything until I have some clear direction from someone who can translate the log and explain how it relates to these symptoms.

Please advise

Harri Liljeroos
Joined: 10 Dec 05
Posts: 4461
Credit: 3263363363
RAC: 1898060

I don't have an answer to your question, but have you tried to run two tasks at the same time on your GPU? That's what many heavy users are doing.
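For anyone who wants to try this, BOINC reads an app_config.xml file from the project directory (here that would be projects/einstein.phys.uwm.edu/, as seen in the log paths above). The following is only a sketch of how such a file could be generated. The app name "einstein_O3AS" is an assumption inferred from the executable name in the log, so check it against the client's own records before relying on it. A gpu_usage of 0.5 tells the client that each task needs half a GPU, so two tasks run at once.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, tasks_per_gpu: int, cpus_per_task: float = 1.0) -> str:
    """Return the text of a BOINC app_config.xml that runs several tasks per GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # Fraction of one GPU reserved per task: 0.5 means two tasks share a GPU.
    ET.SubElement(gpu_versions, "gpu_usage").text = str(1.0 / tasks_per_gpu)
    # CPU cores budgeted per GPU task (the GW app leans heavily on the CPU).
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpus_per_task)
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# ASSUMPTION: "einstein_O3AS" is guessed from the executable name in the log.
config_xml = build_app_config("einstein_O3AS", tasks_per_gpu=2)
print(config_xml)
```

After saving the output as app_config.xml, the client has to re-read its config files (or be restarted) before the change takes effect. Whether two tasks at once actually improve throughput on a given card is something only a test run will show.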

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118354668952
RAC: 25479064

TribbleRED wrote:
Does the aforementioned share any insight as to the cause of the less-than-half gpu usage and if so what is the fix or workaround?

I had a quick look through the full list of tasks for the machine with the RTX 3090.  There are no compute errors or invalid tasks and everything seems to be normal.  Everybody tends to see apparently low GPU utilisation with the Einstein GW GPU app.

There are significant portions of the code that can't be done efficiently on a GPU so those sections are offloaded to a CPU core.  If there isn't fast and timely availability of CPU resources when needed, the run time for the task can be severely impacted.  As long as you aren't running other compute apps for other BOINC projects on too many of your 32 threads, you should be OK.  If you are, you should temporarily suspend that, just to see if you get a significant improvement in elapsed times.  A few quick experiments should allow you to find an optimum mix.
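The timestamps in the stderr output of Task 1146876861 above give a rough idea of how the run time splits between stages. The sketch below parses three lines copied from that log and computes how long the main analysis and the toplist recalculation took. The log reports the recalculation method as 'DemodSSE'; reading that as a CPU-side (SSE) method is an assumption based on the name, not something the log states outright.

```python
from datetime import datetime

# Lines copied verbatim from the task log quoted earlier in this thread.
LOG_LINES = {
    "search_start": "2021-07-25 01:11:03.1881 (15308) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch",
    "search_end": "2021-07-25 01:20:36.7945 (15308) [normal]: Finished main analysis.",
    "recalc_end": "2021-07-25 01:34:37.6423 (15308) [normal]: Finished recalculating toplist statistics.",
}


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffff' timestamp of a log line."""
    return datetime.strptime(line[:24], "%Y-%m-%d %H:%M:%S.%f")


stamps = {name: parse_timestamp(line) for name, line in LOG_LINES.items()}

search_seconds = (stamps["search_end"] - stamps["search_start"]).total_seconds()
recalc_seconds = (stamps["recalc_end"] - stamps["search_end"]).total_seconds()
recalc_share = recalc_seconds / (search_seconds + recalc_seconds)

print(f"main analysis:  {search_seconds / 60:.1f} min")
print(f"recalculation:  {recalc_seconds / 60:.1f} min")
print(f"recalc share:   {recalc_share:.0%}")
```

For this task the main analysis takes about 9.6 minutes and the recalculation about 14 minutes, so more than half of the time after setup goes to the recalculation stage. That does not by itself explain a 42% ceiling during the main analysis, but it does suggest that the elapsed time depends heavily on how quickly a CPU core is available.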

I don't run nvidia GPUs so I'm only guessing at what a reasonable run time should be.  I saw times in the 20-30 min range which may be a bit on the slow side.

The deprecation warnings are quite normal - everybody who peruses their logs sees them.  The question has been asked but there has been no official explanation from the Staff.  A whole bunch of functions are supplied from the LIGO consortium.  The assumption is that the Einstein app is compatible with previous versions but not later ones - hence the warnings about deprecated versions.

Cheers,
Gary.

TribbleRED
Joined: 14 Oct 20
Posts: 3
Credit: 766441625
RAC: 508

I'll give that a go and see how much better, if at all, that will be.

TribbleRED
Joined: 14 Oct 20
Posts: 3
Credit: 766441625
RAC: 508

Gary Roberts wrote:

TribbleRED wrote:
Does the aforementioned share any insight as to the cause of the less-than-half gpu usage and if so what is the fix or workaround?

I had a quick look through the full list of tasks for the machine with the RTX 3090.  There are no compute errors or invalid tasks and everything seems to be normal.  Everybody tends to see apparently low GPU utilisation with the Einstein GW GPU app.

There are significant portions of the code that can't be done efficiently on a GPU so those sections are offloaded to a CPU core.  If there isn't fast and timely availability of CPU resources when needed, the run time for the task can be severely impacted.  As long as you aren't running other compute apps for other BOINC projects on too many of your 32 threads, you should be OK.  If you are, you should temporarily suspend that, just to see if you get a significant improvement in elapsed times.  A few quick experiments should allow you to find an optimum mix.

I don't run nvidia GPUs so I'm only guessing at what a reasonable run time should be.  I saw times in the 20-30 min range which may be a bit on the slow side.

The deprecation warnings are quite normal - everybody who peruses their logs sees them.  The question has been asked but there has been no official explanation from the Staff.  A whole bunch of functions are supplied from the LIGO consortium.  The assumption is that the Einstein app is compatible with previous versions but not later ones - hence the warnings about deprecated versions.

 

OK. Thanks for the info. I'll look into it further.

Burned
Joined: 25 Jun 21
Posts: 32
Credit: 388221900
RAC: 0

I noticed what I thought was an oddity when running GPU work and CPU-intensive work (both BOINC and Folding@Home) on Windows. I had to keep two entire cores (not threads) idle to get nominal throughput on the GPU. Perhaps there is some processor serialization issue.
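If someone wants to reproduce that on the 5950X from the first post, the idle cores can be expressed through BOINC's "Use at most N% of the CPUs" computing preference. A minimal sketch of the arithmetic follows, assuming 2 hardware threads per core (SMT enabled); the function name is made up for illustration.

```python
def max_cpu_percent(total_threads: int, threads_per_core: int, idle_cores: int) -> float:
    """Percentage of logical CPUs BOINC may use so that `idle_cores` whole cores stay free."""
    idle_threads = idle_cores * threads_per_core
    if idle_threads >= total_threads:
        raise ValueError("cannot idle every thread on the machine")
    return 100.0 * (total_threads - idle_threads) / total_threads


# 5950X: 16 cores / 32 threads; keep two whole cores (4 threads) idle.
percent = max_cpu_percent(total_threads=32, threads_per_core=2, idle_cores=2)
print(f"Use at most {percent}% of the CPUs")
```

BOINC counts logical CPUs, so the setting cannot guarantee that the idle threads belong to the same physical cores; that part is left to the Windows scheduler.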
