E@H is not using my AMD GPU properly

UnionJack
Joined: 9 Feb 05
Posts: 15
Credit: 76409473
RAC: 37714
Topic 219661

I have six BOINC projects running here, of which three use the AMD GPU. Until a few days ago, no GPU project ever showed total time (elapsed + remaining) of more than a few minutes, but E@H is now expecting to take a day or two - and today I see one task showing 62 days. I've tried aborting jobs, and I've reset the project, but the problem persists.

Radeontop shows barely any GPU graphics use while an E@H task is running, which explains the long execution time.

Other projects are using the GPU as expected, and radeontop shows plenty of activity.
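
In case it's useful, here's the minimal check I can run alongside radeontop (a sketch assuming the amdgpu driver exposes gpu_busy_percent in sysfs, as recent kernels do, and that the Einstein card is card0; adjust the path if not):

# Sample the amdgpu busy percentage once a second for 30 s and summarise it.
import time

BUSY = "/sys/class/drm/card0/device/gpu_busy_percent"

samples = []
for _ in range(30):
    with open(BUSY) as f:
        samples.append(int(f.read().strip()))
    time.sleep(1)

print(f"GPU busy: min {min(samples)}%  max {max(samples)}%  "
      f"avg {sum(samples) / len(samples):.1f}%")

Run that while an E@H task is crunching and again while one of the other projects' tasks is crunching; the two averages make the difference obvious.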

Could I have done something to cause E@H projects to misuse the GPU, or has the code changed?

Gentoo stable system, openrc-0.41.2
gcc 8.3.0, sys-kernel/gentoo-sources 4.19.72
QT 5.12.3, KDE frameworks 5.60.0, KDE plasma 5.16.5
KDE apps 19.04.3 incl KMail 19.04.3 (5.11.3), akonadi 19.04.3
dev-db/mariadb-10.2.22-r1, net-libs/webkit-gtk-2.24.2
x11-drivers/xf86-video-amdgpu 19.0.1
dev-libs/amdgpu-pro-opencl 19.10.785425-r1

--
Rgds
Peter.

Sebastian M. Bobrecki
Joined: 20 Feb 05
Posts: 63
Credit: 1529602972
RAC: 105

This may be because the O2 GPU application is in beta. Try setting "Run test applications?" to NO in the project settings. You should then only get new GPU tasks for the "Gamma-ray pulsar binary search #1" app. If those tasks work properly, the problem is probably with the beta application.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117779901962
RAC: 34780574

UnionJack wrote:
I have six BOINC projects running here, of which three use the AMD GPU. Until a few days ago, no GPU project ever showed total time (elapsed + remaining) of more than a few minutes, but E@H is now expecting to take a day or two - and today I see one task showing 62 days. I've tried aborting jobs, and I've reset the project, but the problem persists.

I looked through your current tasks list in the online database and there is no evidence of any Einstein tasks that have taken just "a few minutes" so I guess you are referring to projects other than Einstein.

There are two current GPU searches here.  There is a long running and stable search for pulsars emitting gamma rays (FGRPB1G).  There is also a search for gravitational waves (GW) - the Observational Run #2 - All Sky (O2AS) search.  The GPU version for the O2AS search is under development and is having issues, both with the length of time that tasks take (parts of the calculations do not use the GPU and perhaps never will - we don't know) and with the validity of results as compared to results from the long running and stable CPU app.

Your tasks list shows no evidence of having run the stable FGRPB1G app.  That search should allow your GPU to complete tasks in minutes rather than many hours or days.  The setting for allowing test apps to run defaults to 'no', so you must have changed your preferences to allow test apps.  Unfortunately, failures can and will happen when running test apps.  If you want stable and well-behaved performance, you should turn off test apps and make sure the gamma-ray pulsar search on GPUs (FGRPB1G) is selected.

In looking through failed tasks (click the task ID link for a failed task on the website), some information about the reason for failure can sometimes be deduced.  In your case (for the couple I looked at) the reason was "Maximum time limit exceeded".  You have a GCN 4th-gen GPU (a workstation version of Polaris 10), and the more common consumer version of that GPU seems able to complete these GW GPU tasks in a couple of hours at most, nowhere near the allowed time limit.  This suggests there may be some issue with the driver/OpenCL libs configuration.  I think the easiest way to test that would be to run a few FGRPB1G tasks and see if they can be crunched in something like 10-15 minutes.  If those tasks run extremely slowly or crash, that would tend to confirm where the problem lies.
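
If you want a quick feel for how far off a running task is, the projection is roughly elapsed time divided by fraction done.  A tiny sketch of that arithmetic (the numbers are invented for illustration, not taken from your tasks, but they show how a "62 days" figure like the one in your first post can arise):

# Naive projection of total runtime from elapsed time and progress,
# e.g. as read off a running task in BOINC Manager.
def projected_minutes(elapsed_seconds, fraction_done):
    return elapsed_seconds / fraction_done / 60.0

# Example with made-up numbers: 3 hours elapsed at 0.2% done.
print(f"{projected_minutes(3 * 3600, 0.002) / 60 / 24:.0f} days projected")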

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7229838194
RAC: 1155113

One other possibility: are you running more than one task on the GPU at a time?  (most of us do).

This can work out badly if two (or more) tasks running at the same time are from different applications that happen not to "share" well.

So far as I understand it, generally a GPU task runs flat-out on the GPU non-stop from the moment it gets a slice until it needs external service.  The nature of the service may be either data not already onboard the GPU, or with more complexity, the result of a computation performed by the CPU, not the GPU.  

Whatever the reason, the currently active GPU task loses use of the GPU, and if another "simultaneously active" task is fully ready to go, it gets a turn.

When two tasks of the same type are paired, and they are consistent over most of their run time in how long they can run before needing service, this can give a remarkably equal share of the GPU to the two running tasks.  But imagine the case where one task needs service after tens of milliseconds, while the other only needs service after a dozen seconds.  In the scenario I've sketched, the task needing frequent service will get a very small proportion of the GPU resource, so it will take an extremely long elapsed time to complete.
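
To put rough numbers on that, here is a toy model of the scheme I've sketched; the burst and service times are made up purely for illustration, not measured from any real application:

# Each task runs flat-out until it needs external service; then the other
# task gets a turn if it is ready, otherwise the GPU idles until one is.
def gpu_shares(burst_a, service_a, burst_b, service_b, horizon=1000.0):
    t = 0.0
    ready_a = ready_b = 0.0      # time at which each task can next use the GPU
    used_a = used_b = 0.0        # accumulated GPU time per task
    while t < horizon:
        if ready_a <= t:
            used_a += burst_a
            t += burst_a
            ready_a = t + service_a
        elif ready_b <= t:
            used_b += burst_b
            t += burst_b
            ready_b = t + service_b
        else:
            t = min(ready_a, ready_b)   # both waiting on service
    return used_a / t, used_b / t

# Task A needs service after every 0.05 s of GPU work, task B only every 12 s.
share_a, share_b = gpu_shares(0.05, 0.2, 12.0, 0.2)
print(f"frequent-service task: {share_a:.1%} of the GPU, "
      f"infrequent-service task: {share_b:.1%}")

The frequent-service task ends up with well under one percent of the GPU, which is the kind of starvation I'm describing.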

Here at Einstein, our current Gamma-Ray Pulsar GPU task and our current Gravity Wave GPU task have this kind of trouble.  If one of each is running on my GPU (I'm set up to run two GPU tasks at once for both of these types), the GRP task will run almost as fast as if it were running alone, while the GW task will make progress at an extremely low rate.

Of course, if you are not running multiple GPU tasks of differing applications "at once" on your GPU, this is not an issue for you.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117779901962
RAC: 34780574

archae86 wrote:
Of course, if you are not running multiple GPU tasks of differing applications "at once" on your GPU, this is not an issue for you.

You raise an interesting point.  As I mentioned, there's no sign of any FGRPB1G tasks - in fact when you look at the task list, the FGRPB1G app isn't even listed - but there are GPU tasks from other projects that are running.  Because there's lots of CPU involvement with the GW app, maybe the app of a different project 'takes over' whenever there's a 'GPU pause' happening with the Einstein app.  There's got to be something causing the task to run so slowly as to exceed the time limit.

Cheers,
Gary.

UnionJack
Joined: 9 Feb 05
Posts: 15
Credit: 76409473
RAC: 37714

Interesting discussion - thanks, all.

I should have thought of the test-apps choice. I've now reset that and aborted all the previously received Einstein jobs. I'll see how it goes from here.

I only run one GPU task at a time.

Gary's suggestion about my opencl setup rings a bell, faintly. I can't put my finger on anything, but I'm not entirely confident in its robustness. I think the AMD code is under some degree of development, and of course, any time open-source and proprietary code modules are mixed, there's bound to be room for suspicion.

I got the opencl version wrong in my report; it's actually dev-libs/amdgpu-pro-opencl-19.30.838629, which AMD released on 7 July 2019.
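
For anyone wanting to check the same thing, a quick way to confirm which OpenCL platform and driver version the apps will actually see (a sketch using the third-party pyopencl package; clinfo reports the same information):

# List every OpenCL platform and device visible on the system, with driver
# versions, to confirm the amdgpu-pro OpenCL libraries are being picked up.
import pyopencl as cl

for platform in cl.get_platforms():
    print(f"Platform: {platform.name}  ({platform.version})")
    for device in platform.get_devices():
        print(f"  Device: {device.name}  driver {device.driver_version}")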

--
Rgds
Peter.
