Consistent failures on new AMD Radeon 5700 XT

Helsionium
Joined: 24 Dec 06
Posts: 2
Credit: 32280545
RAC: 0
Topic 223891

Hello there,

I can't seem to run any GPU applications. They all either crash or completely lock up the system after about 10 minutes. Some workunits managed to complete, though.

Example failed tasks:

https://einsteinathome.org/de/task/1024319859

https://einsteinathome.org/de/task/1026107562

GPU model: AMD Radeon 5700 XT, 8 GB

Driver: 20.10.1 (latest)

Temperatures: normal (always under 80 °C)

The GPU and CPU work just fine for hours in stress tests (FurMark and Prime95).

 

Can anyone offer some advice?

Kind regards,

Helsionium

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118373622160
RAC: 25561764

Helsionium wrote:
I can't seem to run any GPU applications, they all either crash or completely lock up the system after about 10 minutes. Some workunits managed to complete, though.

I can understand your frustration, but "can't run any", or "Consistent failures" as your thread title puts it, is perhaps not the best description :-).

I had a look at all 273 tasks assigned to your computer: 233 GW GPU tasks and 40 GRP GPU tasks. For the GW tasks, 214 were aborted, 11 were deadline misses, 5 were validated and there were only 3 actual compute errors. So the ratio of 5 successes to 3 failures seems to suggest that the software/firmware/drivers side of things is OK and that perhaps you have some sort of transient hardware issue, despite the stress testing that you have already done.

It's a similar story with the GRP tasks: 38 were aborted, 1 was valid and there was 1 actual failure.
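
To make the arithmetic explicit, here is a minimal Python sketch that checks the tallies and works out what share of the tasks that actually ran ended in a compute error. The dictionary keys and the `error_share` helper are names made up for illustration; the counts are the ones quoted in this post.

```python
# Task counts as quoted in this post (not pulled from the server).
gw = {"aborted": 214, "deadline_miss": 11, "valid": 5, "compute_error": 3}
grp = {"aborted": 38, "valid": 1, "compute_error": 1}


def error_share(counts):
    """Share of tasks that ran to a result and ended in a compute error.

    Aborted tasks and deadline misses are left out because they never
    produced a result either way.
    """
    ran = counts["valid"] + counts["compute_error"]
    return counts["compute_error"] / ran


# The per-category counts should add up to the totals quoted above.
assert sum(gw.values()) == 233
assert sum(grp.values()) == 40

print(f"GW:  {error_share(gw):.1%} of the tasks that ran failed")
print(f"GRP: {error_share(grp):.1%} of the tasks that ran failed")
```

With only 8 GW results and 2 GRP results, these percentages are a rough indication at best.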

Of course the failures are not acceptable, so you need to do something. I don't know what sort of relationship you have with your hardware supplier, but the first thing I would try to do is beg/borrow/steal a known good PSU to see if that solves the problem. While testing the PSU, pay particular attention to all connectors to make sure all devices have a good electrical connection. The next easiest items to test are the memory modules and the GPU itself.

Change things one at a time so you know exactly what is to blame if you suddenly get a string of good results.  If your GPU is relatively new, your supplier might be quite willing to lend you a replacement in order to verify if you have a warranty claim or not.

When you are ready to test, make sure you have a very low work cache size so you don't have to abort hundreds of tasks if the failures continue.  Good luck with finding the problem.

Cheers,
Gary.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 469
Credit: 10398770702
RAC: 3674714

Helsionium wrote:

...

Can anyone offer some advice?

...

 

What comes to my mind and might help:

   Have you tried running "sfc /scannow"?

   Have you done a complete virus scan?

Helsionium
Joined: 24 Dec 06
Posts: 2
Credit: 32280545
RAC: 0

Thank you for your input.

Without making any changes to my hardware, I decided to test the BOINC project Collatz Conjecture's AMD app over a period of more than 24 hours. The idea was that if the hardware were to blame, all apps should have similar crashing/freezing issues.

I still got computation errors on the Collatz project, but significantly fewer than on this project or the LHC@home project. And the Collatz app never froze my computer or caused the video driver to crash. Out of 165 tasks, 150 were completed successfully. I also tried the PrimeGrid app, which completely froze my computer a few seconds into the first task.
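
For a rough side-by-side, here is a small Python sketch comparing success rates. The Einstein@Home figures are the valid and compute-error counts from Gary's tally earlier in the thread (aborted tasks and deadline misses excluded). Counting every Collatz task that did not complete successfully as a failure is an assumption, and the variable names are only illustrative.

```python
# Collatz: 150 of 165 tasks completed successfully (figures from this post).
collatz_ok, collatz_total = 150, 165

# Einstein@Home: valid vs. compute-error tasks from the tally earlier in
# the thread (GW: 5 valid, 3 errors; GRP: 1 valid, 1 error).
einstein_ok = 5 + 1
einstein_total = (5 + 3) + (1 + 1)

collatz_rate = collatz_ok / collatz_total
einstein_rate = einstein_ok / einstein_total

print(f"Collatz:       {collatz_rate:.1%} successful")
print(f"Einstein@Home: {einstein_rate:.1%} successful")
```

The Einstein@Home sample is only 10 tasks, so the gap is suggestive rather than conclusive.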

All of the computation errors were "access violations", just as with this project.

For now, I suspect it is a driver or OS issue rather than a hardware issue.
