I'm running Ubuntu 18.04.3 LTS and just upgraded to the latest distro, which upgraded the Linux kernel from 5.0.0-37 to 5.3.0-26. That also upgraded AMDGPU drivers from 19.3-934563 to 19.5-967956. Following the reboot, while running GW tasks on my RX 570s (4x on each of 2 GPUs), I noticed the following problems:
- tasks are taking about twice as long to complete;
- all 4 threads of my CPU run consistently near 100%, where previously they ran at ~60%;
- the 'top' terminal command shows one boinc process using ~90% CPU and the other 7 using ~30% each, whereas previously all 8 tasks/processes used the same amount of CPU resources (that is, when no task was in its 99% completion phase);
- and the 'top' terminal command shows several sdma and comp_1 processes each using ~15% to 30% of CPU resources, which they never did previously.
(If anyone uses the amdgpu-utils program to control GPU run parameters, the upgrades also broke s-clock power masking.)
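For anyone who wants to log these numbers rather than eyeball 'top', here is a minimal Python sketch. It assumes the `ps -eo pcpu,user,comm` column layout, and the names it watches for (boinc, sdma, comp_1) are simply the ones I saw in 'top'; the helper names `read_ps`, `parse_ps` and `summarize` are made up for this example. Note that `ps` reports CPU use averaged over each process's lifetime, so the figures will not match the instantaneous values in 'top'.

```python
import subprocess
from collections import defaultdict

# Name fragments taken from the 'top' observations in this post.
# Matching on these substrings is an assumption, not a rule.
WATCHED = ("boinc", "sdma", "comp_1")


def read_ps():
    """Return the raw text of `ps -eo pcpu,user,comm` (Linux procps)."""
    result = subprocess.run(
        ["ps", "-eo", "pcpu,user,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(text):
    """Turn ps output into a list of (cpu_percent, user, command) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)  # command may itself contain spaces
        if len(parts) != 3:
            continue
        cpu, user, comm = parts
        try:
            rows.append((float(cpu), user, comm.strip()))
        except ValueError:
            continue  # ignore lines whose CPU column is not a number
    return rows


def summarize(rows, watched=WATCHED):
    """Sum CPU percent per watched fragment, matched in user or command."""
    totals = defaultdict(float)
    for cpu, user, comm in rows:
        for key in watched:
            if key in user or key in comm:
                totals[key] += cpu
    return dict(totals)
```

Calling `summarize(parse_ps(read_ps()))` before and after a driver upgrade would give two small dictionaries to compare.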
I ran a search on 'gpu sdma' and found this tidbit posted a couple of weeks ago:
https://linuxreviews.org/Mesa_20_Will_Have_SDMA_Disabled_On_AMD_RX-Series_GPUs
While the article is about an upcoming Mesa 20 release, it ends with this note: "mesa 19.3.2, released January 9th, 2019, includes the "disable SDMA on gfx8 to fix corruption on RX 580" patch." AMDGPU uses Mesa drivers.
Although the article doesn't mention OpenCL functions, given the altered sdma and CPU utilization and extended GW crunch times I'm seeing, I suspect that the changes to the recent Linux/Mesa/AMDGPU drivers have hobbled these AMD cards. I'm not sure whether to try to roll back the upgrades or wait it out for new drivers to fix the problems.
Ideas are not fixed, nor should they be; we live in model-dependent reality.
cecht wrote: I'm running ...
If this is a BOINC-only machine and you have a spare drive, I would load the older version on it and set it up just to crunch until they come out with upgrades that fix the problem, leaving the existing drive alone to swap back to once they do.
mikey wrote: If this is a ...
That's a good thought, thanks Mikey. I've realized, however, that crunch times of the good ol' gamma ray binary pulsar tasks are not throttled by the AMDGPU upgrade, so I've taken the lazy man's approach and temporarily switched to running only GRP tasks. I'm guessing that my GPUs appear to be only slightly affected while running these tasks because of the FGRP app's low CPU overhead. This makes sense if the cause of the upgrade "problem" is that RX series GPUs running the GW app rely heavily on SDMA (system Direct Memory Access), which was disabled in the most recent AMDGPU/Mesa drivers. My limited understanding of DMA/SDMA is based on https://en.wikipedia.org/wiki/Direct_memory_access.
In short, beware the upgrade if you are running 2.02 (GW-opencl-ati) work on Linux. I wonder whether this affects the Windows app in the same way?
cecht wrote: mikey wrote: If ...
I run the GRP tasks on my own 5870, and have for a while now, on my Win10 machine, but I am not willing to try the other kinds of tasks as I have had issues with them before. This works and lets me keep posting here while splitting time with MilkyWay. The rest of my GPUs are running Collatz right now, as people are trying to pass me and I need to build up a cushion.
To conclude, I confess that my assumptions were wrong about the upgrade causing problems. While I thought my problems were an SDMA issue embedded in a Mesa driver upgrade, I finally got around to installing glxinfo and discovered that my system is currently running Mesa 19.2.8, not 19.3.2 as I had thought, so SDMA has nothing to do with it. The changes in how GW GPU tasks affect system resources were (are) something else entirely.
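In case it saves someone else the same detour, the check I should have done first can be sketched in a few lines of Python. This assumes glxinfo prints a line containing "Mesa X.Y.Z" (as in its "OpenGL version string"), and it takes 19.3.2 as the first release carrying the SDMA patch only because the linked article says so. The function names `mesa_version` and `includes_sdma_patch` are invented for this example.

```python
import re
import subprocess

# First Mesa release with the "disable SDMA on gfx8" patch, according to
# the linuxreviews.org article quoted earlier in this thread (assumption).
SDMA_PATCH_VERSION = (19, 3, 2)


def mesa_version(glxinfo_text):
    """Extract (major, minor, patch) from glxinfo output, or None."""
    match = re.search(r"Mesa (\d+)\.(\d+)\.(\d+)", glxinfo_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def includes_sdma_patch(version):
    """True if the version is at or above the release named in the article."""
    return version >= SDMA_PATCH_VERSION


def installed_mesa_version():
    """Run glxinfo (from the mesa-utils package on Ubuntu) and parse it."""
    result = subprocess.run(
        ["glxinfo"], capture_output=True, text=True, check=True
    )
    return mesa_version(result.stdout)
```

On my system this would have reported (19, 2, 8), which is below 19.3.2 and rules out the SDMA patch as the cause.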