Dual AMD computes on first card only

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1693774207
RAC: 823040
Topic 218834

Hi everyone,

I have quite a strange problem on my main system. Einstein computes all WUs on only one of the AMD cards.

First its specs:

Ubuntu 18.04.2 (kernel 4.15, AMDGPU-PRO 19.10, BOINC 7.14.2) & Ubuntu 19.04 (kernel 5.0.0, OpenCL from AMDGPU-PRO 18.50, BOINC 7.14.2)

AMD R7 (undervolted, 120W power draw with CPU units)

BeQuiet Straight Power 11 650W (Gold/93%)

1st 16x PCIe Radeon RX580 (monitor attached, 82W doing 2x FGRP1B)

2nd 16x PCIe Radeon Vega 56 (180W PowerLimit)

Seems like enough power, wouldn't expect any issues on that end.

 

The RX580 was my main card during the last months; the 2nd PCIe slot hosted a GTX1060 until this week. Both cards were crunching along on Einstein in parallel (doing 2 WUs per card) until I took the GTX out.

I put the Vega 56 in, removed the Nvidia drivers and hoped I would now run 2 x 2 (0.5ngpu) FGRP1B ATI tasks on these two Radeon cards.

However, WUs are only processed on the VEGA.

Both cards are recognised by BOINC: 

Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 0: Radeon RX Vega (driver version 2841.4 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2841.4), 8176MB, 8176MB available, 11397 GFLOPS peak)
Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 1: Radeon RX 580 Series (driver version 2841.4, device version OpenCL 1.2 AMD-APP (2841.4), 7295MB, 7295MB available, 5161 GFLOPS peak)

<use_all_gpus>1</use_all_gpus> is set and acknowledged by BOINC:

Wed 08 May 2019 20:20:28 CEST | | Config: use all coprocessors
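For reference, this option lives in the <options> section of cc_config.xml in the BOINC data directory; a minimal fragment (restart the client, or use "Read config files" in the manager, after editing):

```xml
<cc_config>
  <options>
    <!-- Let the client use every usable GPU, not just the most capable one. -->
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
```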

Regardless of how many WUs I run in parallel (tested with 1 and 2), they all end up on the Vega. The RX580 shows no load and no increased temperature.

With ngpus 1.0 the BOINC client sends one WU to each GPU; in the manager this is shown in the status column as (device 0) & (device 1). The FGRPB1G app is correctly called by BOINC, once with --device 0 and once with --device 1:

root 28013 11934 14 23:13 pts/2 00:01:03 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile LATeah1049X.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 180.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile templates_LATeah1049X_0188_2669947.dat --debug 1 --debugCommandLineMangling --device 1

root 28592 11934 57 23:20 pts/2 00:00:05 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile LATeah1049X.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 180.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile templates_LATeah1049X_0188_2793903.dat --debug 1 --debugCommandLineMangling --device 0

However, lm-sensors, amdgpu-utils and the WU runtimes all indicate that both WUs are being run on the Vega, while the RX580 remains idle.

Quite a strange problem. I'm not sure at which level this goes wrong. Most likely not BOINC: it sends WUs to devices 0 and 1, as shown by the manager and by the FGRPB1G processes themselves. Is it the Einstein executable that ignores the device parameter (and runs everything on device 0), or is it somewhere in OpenCL, scheduling these tasks onto the more powerful card?

I have reproduced the problem on two independent installations of Ubuntu at different release levels.
I'm completely out of ideas...

Does anyone have an idea? Any insight into the FGRPB1G executables themselves? Can we somehow trace why/how they decide where they compute?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7229961520
RAC: 1155271

koschi wrote:
Any insight on the FGRP1B executable themselves, can we somehow trace why/how they decide on where they compute?

They don't.  The BOINC installation on your system does.

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1693774207
RAC: 823040

Well, BOINC already assigns each WU to a different device and starts both FGRPB1G processes with different --device parameters. So that information is passed to the Einstein executable. I don't see what BOINC itself could do better here?

I tried excluding the Vega with <ignore_ati_dev>1</ignore_ati_dev> (and also tried 0).

I also swapped the two cards between their slots, so the Vega is now the primary card. Nothing helps: BOINC starts the executables correctly, but they don't respect the --device flag.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7229961520
RAC: 1155271

koschi wrote:

Well, BOINC already assigns each WU to a different device and starts both FGRPB1G processes with different --device parameters. So that information is passed to the Einstein executable. I don't see what BOINC itself could do better here?

I tried excluding the Vega with <ignore_ati_dev>1</ignore_ati_dev> (and also tried 0).

I also swapped the two cards between their slots, so the Vega is now the primary card. Nothing helps: BOINC starts the executables correctly, but they don't respect the --device flag.

You may wish to review information at:

https://boinc.berkeley.edu/trac/wiki/AppCoprocessor

In particular, the specific device flag you are harping on about is long deprecated. I think the current standard method relies on <gpu_device_num> in the init_data.xml file for a particular task run. But I'm no expert on this matter, and currently only run single-GPU machines.
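If it helps, that assignment can be checked directly: each running task's slot directory contains an init_data.xml with the <gpu_device_num> element. A minimal Python sketch (the sample XML below is illustrative, not copied from a real slot):

```python
import xml.etree.ElementTree as ET

def gpu_device_num(xml_text):
    """Return the value of <gpu_device_num> from an init_data.xml
    document, or None if the element is absent."""
    elem = ET.fromstring(xml_text).find(".//gpu_device_num")
    return int(elem.text) if elem is not None else None

# Illustrative fragment of a slot's init_data.xml (not a real capture):
sample = """<app_init_data>
  <gpu_type>ATI</gpu_type>
  <gpu_device_num>1</gpu_device_num>
</app_init_data>"""

print(gpu_device_num(sample))  # -> 1
```

Running this over every slots/*/init_data.xml would show which device number BOINC handed to each task.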

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 1

I wonder, and assume, that you do not have a monitor connected to each card... If you connect your display to the currently non-working card and reboot, will tasks then run on it and ignore the other?

 

 

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3426956540
RAC: 3882827

Are you still on the same driver as when it was the NV/AMD setup? Does the same driver support both Vega and RX cards?

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1693774207
RAC: 823040

@Gavin, the Vega didn't have a monitor connected and was crunching along fine. The non-computing RX580 was my primary card with the monitor attached.

 

I ran 18.50 in the mixed AMD/Nvidia setup. 18.50 also powered the Vega nicely, as does 19.10.

https://www.amd.com/en/support/kb/release-notes/rn-rad-lin-18-50-unified

These drivers support cards from GCN 2 (Radeon 200 series) up to the latest Radeon VII.

When it comes to OpenCL, though, Polaris/RX580 requires the legacy implementation to be installed, while the Vega requires the PAL implementation. I had both installed, and the cards were recognised by clinfo and BOINC.

 

Not knowing what the exact problem is, I gave AMD ROCm (RadeonOpenCompute) another shot on the Ubuntu 19.04 installation (kernel 5.0.0, ROCm 2.45, Mesa 19.2). It is maybe a few percent slower than the AMDGPU-PRO OpenCL implementations, but it can run Einstein FGRP1B on both cards, which more than makes up for not having the Polaris active at all.

Awesome!

 

======================== ROCm System Management Interface ========================
GPU  Temp   AvgPwr   SCLK     MCLK     Fan     Perf  PwrCap  SCLK OD  MCLK OD  GPU%
0    71.0c  132.0W   1474Mhz  920Mhz   42.75%  auto  130.0W  N/A      15%      97%
1    52.0c  81.239W  1120Mhz  2000Mhz  18.82%  auto  122.0W  0%       -2%      98%
============================== End of ROCm SMI Log ==============================

 

So it seems the official drivers can't reliably run OpenCL code on two cards that require different OpenCL implementations (RX580 => legacy, Vega => PAL). Somewhere in there it gets messy, and all started tasks end up scheduled onto the Vega.
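One plausible mechanism, purely an assumption on my part and not confirmed from the app's source: with two AMD OpenCL implementations installed, each registers its own OpenCL platform, and an application that enumerates devices within only one platform would map every --device index onto that platform's cards. A plain-Python sketch of the two enumeration strategies (platform and device names here are made up for illustration):

```python
# Hypothetical device inventory: two OpenCL platforms, one card each.
platforms = {
    "AMD-APP (PAL)":    ["Radeon RX Vega"],
    "AMD-APP (legacy)": ["Radeon RX 580"],
}

def pick_global(index):
    """Enumerate devices across ALL platforms; this matches the numbering
    BOINC reports (GPU 0 = Vega, GPU 1 = RX 580)."""
    devices = [d for devs in platforms.values() for d in devs]
    return devices[index]

def pick_first_platform(index):
    """Enumerate devices within the FIRST platform only, clamping the index.
    If an app did this, every --device value would land on the same card."""
    devices = next(iter(platforms.values()))
    return devices[min(index, len(devices) - 1)]

print(pick_global(1))          # -> Radeon RX 580
print(pick_first_platform(1))  # -> Radeon RX Vega
```

That would explain why both tasks land on the Vega despite correct --device flags, and why a single implementation (ROCm) exposing both cards on one platform works.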

 

@archae86

thanks, will check the init_data.xml on my main install

 

 

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1693774207
RAC: 823040

This is on my main install, which I haven't fixed yet:

root@frickelbude:/var/lib/bunker2/slots# grep gpu_device_num */init_data.xml
0/init_data.xml:<gpu_device_num>1</gpu_device_num>
1/init_data.xml:<gpu_device_num>0</gpu_device_num>
root@frickelbude:/var/lib/bunker2/slots#

 

Seems about right: each init_data.xml specifies a different target GPU.
