AMD driver problem mixing RX & HD boards

Joseph Stateson
Joined: 7 May 07
Posts: 174
Credit: 3091751743
RAC: 811790
Topic 219181

I put an unused HD7950 in with several RX570s and Einstein tasks failed immediately. Was able to suspend fast enough to avoid running through my allotment for the day.  

This was a new 18.04 Linux system and I used sudo ./amdgpu-install --opencl=legacy. It is not clear exactly what that does, and the AMD release info says to use legacy for anything before "Vega-10". I have never seen a Vega-10. I have heard of the Vega 56 and 64. I would guess that RX is not legacy, but it came out before the Vega 64. Maybe "legacy" refers to OpenCL 1.2, which is all BOINC seems to need. Anyway, the RX boards are all crunching Einstein fine; I just can't have that HD board in at the same time. I just checked the command line options and =legacy,pal is another option, but I don't know if that will fix the HD problem. I first posted this over at the BOINC site, but maybe someone has seen this problem here and can make a suggestion.

mikey
Joined: 22 Jan 05
Posts: 12781
Credit: 1870955374
RAC: 1907350


JStateson wrote:

I put an unused HD7950 in with several RX570s and Einstein tasks failed immediately. Was able to suspend fast enough to avoid running through my allotment for the day.  

This was a new 18.04 Linux system and I used sudo ./amdgpu-install --opencl=legacy. It is not clear exactly what that does, and the AMD release info says to use legacy for anything before "Vega-10". I have never seen a Vega-10. I have heard of the Vega 56 and 64. I would guess that RX is not legacy, but it came out before the Vega 64. Maybe "legacy" refers to OpenCL 1.2, which is all BOINC seems to need. Anyway, the RX boards are all crunching Einstein fine; I just can't have that HD board in at the same time. I just checked the command line options and =legacy,pal is another option, but I don't know if that will fix the HD problem. I first posted this over at the BOINC site, but maybe someone has seen this problem here and can make a suggestion.

Was it trying to run the workunits on the new GPU or the old one? If it was the new one, that could be the problem. You may have to exclude it in the cc_config.xml file until you get through them all, then un-exclude it and let it download new files so BOINC and Einstein 'see' the new GPU too.
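For anyone who has not used a GPU exclusion before, here is a minimal sketch of what it might look like. It writes an example fragment to a scratch file rather than touching a live cc_config.xml. The function name write_exclude_gpu, the project URL and the device number 0 are all assumptions for illustration: the URL has to match the one your client shows for the project, and the device number is whatever BOINC assigns to the card in its startup messages.

```shell
#!/bin/sh
# Sketch only: write an example cc_config.xml that excludes one AMD GPU
# from one project. write_exclude_gpu is a hypothetical helper name.
write_exclude_gpu() {
    out_file=$1      # scratch file to write (not the live config)
    project_url=$2   # assumption: must match the project URL in your client
    device_num=$3    # assumption: BOINC's device number for the card
    cat > "$out_file" <<EOF
<cc_config>
  <options>
    <exclude_gpu>
      <url>$project_url</url>
      <device_num>$device_num</device_num>
      <type>ATI</type>
    </exclude_gpu>
  </options>
</cc_config>
EOF
}

# Example values only; substitute your own URL and device number.
scratch_dir=$(mktemp -d)
write_exclude_gpu "$scratch_dir/cc_config.example.xml" "https://einsteinathome.org/" 0
cat "$scratch_dir/cc_config.example.xml"
```

The exclude_gpu block would then be merged by hand into the options section of the real cc_config.xml in the BOINC data directory, and removed again once the card is meant to be used. Restarting the client afterwards is the safe way to make sure the change is picked up.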

Joseph Stateson
Joined: 7 May 07
Posts: 174
Credit: 3091751743
RAC: 811790

The errors are here:

https://einsteinathome.org/host/12783910/tasks/6/0

The system, an open frame, had room for several more boards, so I added an old HD7950 that was working when last used under Windows 10.

When I rebooted, those 5 tasks listed above all errored out before I could suspend Einstein. I assume the driver does not support the HD series, as that board was not present when the driver was installed.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7276155051
RAC: 1922065


JStateson wrote:
I assume the driver does not support the HD series as that board was not present when installed. 

I'm not a Linux person, and only very recently am I an AMD person, but in my years of putting various combinations of Nvidia cards into and out of Windows systems running Einstein, I had the opinion that a good hardware change practice was first to uninstall the current driver, then to power down and do the card configuration change, then after reboot to install an up-to-date driver, which would immediately see the hardware configuration as it installed.

Not always was that necessary, but I had no way to know when it would be needed or not, so just always did it.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118409641942
RAC: 25804676


JStateson wrote:
I assume the driver does not support the HD series as that board was not present when installed.

It's probably rather more complicated than that.  I doubt you would get that card to work properly even if you re-installed the OS with the card present.

Your HD 7950 belongs to the 1st generation of the Graphics Core Next (GCN) architecture.  I have a couple of HD 7950s and a lot of HD 7850 and R7 370 GPUs that are also GCN 1st gen.  Collectively, these GPUs were code-named "Southern Islands".  The 2nd gen were named "Sea Islands", the 3rd "Volcanic Islands" and the 4th "Polaris".  Your RX 570s are GCN 4th gen.

Vega belongs to the 5th gen and the code name given to the first ones released (the models Vega 64 and Vega 56) was Vega10.  It's all very confusing, isn't it? :-).  You can get information about these different generations from Wikipedia.
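To keep the generations straight, the cards named in this thread can be put into a small lookup. This is only a sketch covering the models mentioned here, and gcn_generation is a made-up helper name, not part of any AMD tool.

```shell
#!/bin/sh
# Sketch: map the card models named in this thread to their GCN generation
# and code name. Only models mentioned in the thread are listed.
gcn_generation() {
    case "$1" in
        "HD 7950"|"HD 7850"|"R7 370") echo "GCN 1st gen (Southern Islands)" ;;
        "R7 260X")                     echo "GCN 2nd gen (Sea Islands)" ;;
        "RX 570")                      echo "GCN 4th gen (Polaris)" ;;
        "Vega 56"|"Vega 64")           echo "GCN 5th gen (Vega10)" ;;
        *)                             echo "unknown" ;;
    esac
}

# Print the generation for the two cards in the problem machine plus a Vega.
for card in "HD 7950" "RX 570" "Vega 64"; do
    printf '%s: %s\n' "$card" "$(gcn_generation "$card")"
done
```

Seen this way, the HD 7950 and the RX 570s sit three generations apart, and both are older than the Vega10 cut-off quoted from the AMD release info.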

For Linux users, the problem is that at the time (2012) of the first release of 1st gen GCN cards, the available driver was the proprietary fglrx.  AMD has progressively transitioned to the open source amdgpu driver and the fglrx driver was deprecated in 2016.  The new amdgpu didn't support (initially) the 1st and 2nd gen GCN cards.  That support is an ongoing work-in-progress and may never be fully complete.  The version of Xorg that came along after that no longer supported fglrx so the choices for a Linux user who wanted to run OpenCL applications were either to invest in new hardware that was supported by the new amdgpu driver OR to use an old version of the OS so that the old Xorg and fglrx (with its OpenCL libs) could be used, whilst waiting for the necessary support to finally arrive in amdgpu.  I chose, and am still using, that second option.

Since there is ongoing development of amdgpu (which is upstreamed into the Linux kernel) I do from time to time test out the latest kernels with GCN 1st gen hardware in a test environment.  I did that a couple of times late last year, immediately after participating in this thread.  I suggest you read it fully since at the end there is mention of a list of environment variables that caused the OP's GCN 1st gen card to start working.  I did try with my test setup but still got the same error that you see.

I suspect that the OP of that thread got tasks to complete but perhaps not to validate reliably.  I did ask if his results were validating but he didn't give an answer to that.  Currently his RAC shows as pretty much zero so I guess he didn't continue for very long or perhaps gave up if he found a lot of invalids.  His computers were hidden so it wasn't possible to see the actual results.  He didn't give a link to his computer.

A couple of weeks ago, I tried again.  I get the OpenCL libs from the Red Hat version of AMDGPU-PRO.  Last year I'd been using the 18.30 version of AMDGPU-PRO.  This time I'd worked out what I needed to extract from both the 18.50 and 19.10 versions so I felt it was time for another test.  This time I had success with the new versions of the libs and with the environment variables.  It wouldn't work without the environment variables.

Before you start cheering, there is a catch.  I've tested with both a HD 7850 and an R7 370.  Tasks do run to completion without compute errors.  However, there seems to be a real problem with validation.  With the HD 7850, I ran about 15-20 tasks and waited to see the validation results.  A couple validated and then several more ended up as "checked but no consensus".  So I put it all on hold to wait for final results.  In the end, probably more than 50% of them ended up being declared invalid.

In the last couple of days, I've retried with a more recent R7 370 (still GCN 1st gen).  The results have been worse.  The tasks list for the test host still shows 4 tasks from the earlier test (the ones sent out on 22 Jun and 29 Jun).  The others from that earlier test are no longer in the online database and these last 4 will probably disappear fairly quickly too.

There are 2 others dated 8 Jul 08:08:40 UTC.  These were originally 2 leftovers from the earlier test that I got the server to resend for the new test (hence the 8 Jul date).  Being older, I assumed there would be a good chance they would be tested for validation immediately - and they were.  One was a validate error (which surprised me) and the other first became "checked but no consensus" until later when it also became invalid.  I allowed 2 more and when I saw the second validate error I decided to pull the plug on the test.  4 tries and 4 invalids was not very encouraging :-).

For the time being the machine is back crunching without any problems using my 2016 version of the OS and the fglrx driver.  For the few tasks that did achieve validation with the amdgpu driver, the crunch time was somewhat slower than for fglrx.  In the case of the HD 7850 GPU it wasn't much slower - maybe 4-5%.  For the R7 370, it was significantly worse than that.  Under fglrx, the R7 370 is around 3-5% faster than the HD 7850.

As you can see from the tasks list, I still have 6 tasks available for a further test if I get some bright idea - well at least I'll have them until 22 Jul when they expire.  I seem to be rather short on bright ideas at the moment :-).

I just had a thought.  I have a lonely R7 260X that still runs on fglrx.  It's GCN 2nd gen (Sea Islands).  Maybe I should use those last 6 tasks to test it.  If those tasks do validate, at least that would be one machine I could bring right up to current :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118409641942
RAC: 25804676


On thinking a bit more about the problem of getting GCN 1st gen GPUs to crunch Einstein, I've replayed in my mind some of the things I considered when I was working out exactly what bits from the full AMDGPU-PRO package to install.  In my reading on various internet sites, I'd come across the information that it was not a problem to have multiple OpenCL implementations installed.  There's a thing called an "Installable Client Driver" (ICD).  Here's a nice concise definition given on an Intel website.  Note the bit about "apps selecting between implementations at run time".

JStateson wrote:
... I just checked command line options and =legacy,pal is another option but I don't know if that will fix the HD problem.

The definition of an ICD suggests it's not a bad idea to have multiple implementations available so the app can choose the best for its purposes.  2+ years ago when I was getting my first Polaris GPUs to crunch, I decided to install two separate implementations.  You can check what you're currently using by noting what is installed in /etc/OpenCL/vendors/.  You should have at least one file with a .icd extension.  On my RX 570 hosts I see two.  The full names are amdocl64.icd and amdocl-orca64.icd.  These are just text files containing a library name.  In my case those two libraries are libamdocl64.so and libamdocl-orca64.so.  Those two libraries are both installed but I've never bothered to work out which one actually provides OpenCL for the Einstein app.

The thought occurred to me that you should see what you have in /etc/OpenCL/vendors/ on your machine.  You know that your RX 570s are happy with that but perhaps your HD 7950 isn't :-).  If you tell us which one you currently have, I might remove that one from my test install and see if the opposite one on its own happens to work.  Or you could try installing with =legacy,pal and see if that gives you more implementations.  I'd be interested to know how many you end up with :-).  At least you have a "supported OS" (Ubuntu).  Mine (PCLOS) definitely isn't :-).
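The check described above can be sketched as a short script that prints each .icd file next to the library name it contains. The helper name list_icds is made up for illustration. The directory defaults to /etc/OpenCL/vendors as mentioned in the post, and the script assumes the library name is on the first line of each file.

```shell
#!/bin/sh
# Sketch: list each OpenCL ICD registration file and the library it names.
# list_icds is a hypothetical helper; pass a directory to override the default.
list_icds() {
    dir=${1:-/etc/OpenCL/vendors}
    for f in "$dir"/*.icd; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -f "$f" ] || continue
        # Assumption: the library name is the first line of the file.
        printf '%s -> %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
    done
}

# Prints nothing if the directory is missing or holds no .icd files.
list_icds
```

One line per implementation is printed, so two lines would correspond to the two-implementation setup described for the RX 570 hosts. Running it before and after reinstalling with =legacy,pal would show whether that option adds an implementation. If the clinfo utility happens to be installed, it can also show which platform each card is exposed through.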

Cheers,
Gary.
