I recently replaced two GTX1060s with two RX570s and initially was very happy with the performance. The run time for the RX570s averaged ~1200 seconds with very little deviation. This lasted for about two weeks. A few days ago I noticed a significant number of tasks with times clustered around 1800 seconds. This was not a gradual change; it happened all at once.
I tried a variety of things to figure out what was happening and eventually concluded that one of the RX570s is completing tasks in ~1200 seconds and the other is completing tasks in ~1800 seconds. I used <ignore_ati_dev>0</ignore_ati_dev> and <ignore_ati_dev>1</ignore_ati_dev> to disable them one at a time. Device 0 takes ~1800 seconds and device 1 takes ~1200 seconds.
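For anyone who wants to repeat this kind of per-device comparison, here is a rough Python sketch of the idea: group task run times by device and compare the medians. The function names and the sample run times are hypothetical and made up for illustration (they only resemble the ~1800 and ~1200 second figures above); they are not taken from my actual task logs.

```python
from statistics import median


def median_runtime_by_device(samples):
    """Group (device, seconds) pairs by device and return {device: median seconds}."""
    by_device = {}
    for device, seconds in samples:
        by_device.setdefault(device, []).append(seconds)
    return {device: median(times) for device, times in by_device.items()}


def slowdown_ratio(medians):
    """Ratio of the slowest device's median run time to the fastest device's."""
    return max(medians.values()) / min(medians.values())


# Hypothetical run times in seconds (illustrative values, not real measurements).
samples = [(0, 1795), (0, 1810), (0, 1802), (1, 1198), (1, 1205), (1, 1201)]

medians = median_runtime_by_device(samples)
print(medians)
print(round(slowdown_ratio(medians), 2))
```

Using the median rather than the mean keeps a single odd task from skewing the comparison when only a handful of results are available per device.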
The fan speed and GPU temperature are normal and I have not seen an increase in the number of invalid tasks. One card is just suddenly slower than it used to be. These are both new cards, so I would not have been surprised if one had failed after a few weeks of operation. However, I am surprised that one became significantly slower after a few weeks and the tasks continue to validate.
Has anyone else experienced this? Any suggestions for things to try? I don’t have easy physical access to this computer, so I am doing everything via ssh.
Ryan
Copyright © 2024 Einstein@Home. All rights reserved.
I don't know what ssh allows you to do. If you can, I suggest you reboot the host computer.
I run two RX570 cards on two different Windows machines. Each has suffered sudden drops in performance, healed by rebooting. I have seen performance drop to essentially zero, and I have also seen it drop to 1/3 of the previous value; both states persisted for a good part of a day before I caught them.
I don't know the mechanism. Because rebooting got them out of these states, and because they did not get into these states until they had been running for well over a week since the last reboot, I have for the time being adopted the practice of rebooting about once per week. I cannot promise this has any benefit, but I suspect it may.
Please let us know any additional observations.
Rebooting the host computer had no effect.
I tried a few more things, but what ended up fixing the issue was removing the video driver and reinstalling it. I don't understand how a driver would affect only one of the two cards installed in the same host. I also don't understand what happened after a few weeks to cause the issue. Oh well, at least it is working normally again.