I have noticed a severe problem with Linux kernel 5.10 in combination with my AMD GPU, using the AMDGPU OSS video driver plus the compute components (only) of the Radeon v20.45 driver. Does anyone else see this?
On kernel 5.9, things are fine, but if I boot kernel 5.10, my Gnome Shell crashes frequently, say 20-30 minutes after boot, when E@H GPU tasks are running. Without E@H GPU tasks, the system seems okay; it's certainly more stable.
My GPU is a 5700XT.
The last time I checked, about six months ago, the 5700XT (aka Navi 10) was still not supported by a fully OSS compute stack. Does anyone know if that has changed?
When I get some time, I plan to remove Radeon driver v20.45, test OSS, and then reinstall the Radeon driver. I'm pretty confident that will not change anything, though. I suspect v20.45 just isn't compatible with 5.10.
I use driver 20.40 with kernel 5.4 (Ubuntu 20.04) in my computer with the same GPU.
I tried updating to driver 20.45 and saw the Gnome crashes. After various tests with that driver, I reinstalled 20.40, and it's stable with that.
Paul wrote: I suspect v20.45 ...
Try the new 20.50 driver. It fixes a compatibility problem with kernel 5.8.
Is that amdgpu-pro 20.45 / 20.50? Would you be using the RHEL/CentOS packages?
Since amdgpu-pro switched to ROCr-based OpenCL in 20.45, I haven't had any success with BOINC GPU processing and have had to revert to 20.40. But that's using the Ubuntu packages. Still, I find it interesting that you seem to be successful with a Navi-based GPU.
https://einsteinathome.org/content/troubleshooting-ubuntu-20-and-fresh-install-amd-drivers (most recent posts)
Soli Deo Gloria
Just following up on this. Yes, this is weird, but it seems pretty familiar too, and that's frustrating.
Since I reported the problem, the situation has improved, and it may not be the same issue, but the combination of kernel 5.10+ and amdgpu compute is not solved. What I see now is that memory usage grows steadily until all RAM is consumed and the system crashes. This is particularly confusing because we have OOM handling now and other protections, so it's not clear why this OOM condition is handled so poorly.
My best guess is unchanged, though: incompatible non-PRO AMD drivers in OSS kernel vs. AMD compute stack.
I'm using the compute components from 21.10 now, and that hasn't resolved the issue either. But the last time I checked, the system took 48 hours to exhaust 32GB of memory.
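For anyone who wants to put numbers on the memory growth described above, here is a minimal Python sketch. It assumes a Linux system that exposes `MemAvailable` in `/proc/meminfo`; the function names and the 10-minute sampling interval are my own choices for illustration, not part of any driver or BOINC tooling. Taking the figures above at face value (32GB gone in 48 hours, assuming nearly all of it was free at the start), the average growth works out to roughly 0.67 GB per hour.

```python
"""Sketch: log available memory over time and estimate an average growth rate."""
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def growth_rate_gb_per_hour(consumed_gb, hours):
    """Average memory growth, assuming roughly linear consumption."""
    return consumed_gb / hours


def sample_available_kb(path="/proc/meminfo"):
    """Read the current MemAvailable value in kB."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemAvailable"]


if __name__ == "__main__":
    # Figures from the post above: 32GB exhausted in 48 hours.
    print(f"average growth: {growth_rate_gb_per_hour(32, 48):.2f} GB/hour")
    # Log a timestamped sample every 10 minutes (interval is an arbitrary choice).
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, sample_available_kb(), "kB available", flush=True)
        time.sleep(600)
```

Redirecting the output to a file and comparing runs with and without GPU tasks should show whether the growth really tracks the compute workload.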
Paul wrote: ... What I see ...
If you take a look at this message you will find some comments about an issue that looks very similar. Two messages later in that same thread, I gave more details.
For me, this happens with any 5.10.x or 5.11.x kernel that I've tried so far. I haven't seen it with 5.4.x, 5.7.x, 5.8.x or 5.9.x kernels. I haven't yet tried a 5.12.x kernel. My personal preference is to use the latest LTS kernel, in this case the latest in the 5.4 series. I'm up to 5.4.115 and will be downloading an even later one in that series shortly. I do try to test a member or two of each kernel series in case the LTS series has a problem. I usually wait until there have been 5 to 10 releases, e.g. for 5.12.x, I'll try from around 5.12.10 or later to see if there are any issues with the series. Usually there aren't, and I was quite surprised when this issue turned up in 5.10.x.
My guess is that some sort of bug has been introduced in the 5.10 (and probably later) series which hopefully will get sorted fairly soon :-).
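To check quickly which group a given machine falls into, something like the following Python sketch could help. The two sets only encode the kernel series listed above; they are not an authoritative list of affected kernels, and the helper names are made up for this example.

```python
"""Sketch: compare the running kernel series against those reported in this post."""
import platform
import re

# Series taken from the observations above; not an official compatibility list.
REPORTED_AFFECTED = {(5, 10), (5, 11)}
REPORTED_OK = {(5, 4), (5, 7), (5, 8), (5, 9)}


def kernel_series(release):
    """Extract (major, minor) from a release string such as '5.10.23-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def classify(release):
    """Return how the given kernel release relates to the series listed above."""
    series = kernel_series(release)
    if series in REPORTED_AFFECTED:
        return "reported affected"
    if series in REPORTED_OK:
        return "reported ok"
    return "not tested"


if __name__ == "__main__":
    release = platform.release()  # e.g. '5.4.0-72-generic' on Ubuntu
    print(release, "->", classify(release))
```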
Cheers,
Gary.
I have been getting "Gnome shell" error messages lately, even on kernel 5.4.
Since there is no evidence at the GUI level of a problem, I just toggled the "don't show me this error anymore" option.
Tom M
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!
System is working okay again. Kernel 5.12 + AMDGPU OSS Drivers 21.10.