Issues with O2MD tasks on Radeon GPUs

Gaurav Khanna
Joined: 8 Nov 04
Posts: 42
Credit: 30814435604
RAC: 13512012
Topic 220046

Hi Everyone 

I'm not able to complete the O2MD GPU tasks on my Radeon Fury box. For example:

https://einsteinathome.org/task/899598434

They seem to hang or crash.

This is the machine: https://einsteinathome.org/host/12219055

Any suggestions?

Thanks! 


mikey
Joined: 22 Jan 05
Posts: 12781
Credit: 1870694874
RAC: 1912292

Gary Roberts posted about a problem with some GPUs and Gravitational Wave tasks in another thread that may or may not apply to you:

"As it turns out, I've very recently explained the cause of this in this message. Before digesting that explanation, look at the 4 messages that preceded my comment because they show the initial query and the responses from Holmis who pointed out the error message which then allowed the problem to be explained."

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118405025303
RAC: 25752685

Mikey,

My comment that you quoted was directed at problems that seem specific to Pitcairn and Tahiti series GPUs. As I mentioned, those GPUs are quite old and belong to the 1st gen of the GCN architecture. Gaurav clearly mentioned that his GPUs are "Radeon Fury". If you look here you will see the listing for the Radeon R9 Fury (Fiji Pro), where it clearly states that the architecture is GCN 3rd gen, which is rather more recent than the old 1st gen stuff.

I have one GPU that is 3rd gen, an R9 380 (Tonga Pro), and it has been crunching FGRPB1G tasks without issue. I have no reason to suspect that this card might have a problem with O2MDF, or that there might be any problem with 3rd gen cards in general.

I wasn't intending to shift the R9 380 to GW at the moment.  However, so that Gaurav doesn't go chasing down some unnecessary rabbit holes, I've made a very temporary switch to O2MDF to check for any problems.  I've grabbed a small batch of tasks - just 5 tasks in total - and the first has successfully completed on its own at an elapsed time of just over 16 mins.  The remaining 4 have crunched in pairs (2x) and the average time per task for those is quite a lot less, so everything is working as expected.

Of course, crunch times are rather variable (because of work content variations), so you can't read too much into the values shown for the completed tasks. However, it does seem that GCN 3rd gen in general (and Tonga Pro in particular) doesn't have any issue with the GW app.

Cheers,
Gary.

DF1DX
Joined: 14 Aug 10
Posts: 106
Credit: 3904470653
RAC: 550733

I see the same message

"Warning: Program terminating, but clFFT resources not freed. Please consider explicitly calling clfftTeardown( )."

at the end of the log of this Task.

The WU was stuck at about 75%. After a few minutes I aborted the WU.

This happened only with the first of 100 WUs. The others look OK so far.

Linux Mint, Radeon VII

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118405025303
RAC: 25752685

DF1DX wrote:

I see the same message

"Warning: Program terminating, but clFFT resources not freed. Please consider explicitly calling clfftTeardown( )."

That message is a warning that comes after the actual compute error. It's something to do with the way the OpenCL system terminates once processing (successful or otherwise) has finished. As an example, every one of the hundreds of successful FGRPB1G tasks I've browsed on the server has had (and still has today) that same warning. I've just checked a GW task and there is no warning, so we can assume the GW app has been written in such a way that the OpenCL gods are appeased and that the program has been terminated in a pedantically correct fashion :-).

In your case notice the lonely 'c' character in the output, preceded (and followed) by lots of 'dots'.  Each dot represents a calculation loop.  The 'c' usually represents a loop where a checkpoint is (or perhaps can be) written.  If you compare your output with Gaurav's, he had 2 'c' chars and significantly fewer dots (if I remember correctly).  I suspect your task was really 'spinning its wheels' (for unknown reasons) which you 'fixed' by aborting it.  Did you perhaps try stopping and restarting BOINC before aborting?  Maybe that might have corrected things.
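For anyone who wants to compare two task logs without counting by eye, here is a minimal Python sketch of the dot/'c' tally described above. It assumes the progress output appears on lines made up only of '.' and 'c' characters; that layout is an assumption based on the description here, not a documented format, and the function name and sample text are hypothetical.

```python
def count_progress_marks(stderr_text):
    """Return (dots, checkpoints) found on progress lines of a task log.

    Assumption: progress lines contain only '.' and 'c' characters.
    Any other line (warnings, headers, etc.) is ignored.
    """
    dots = 0
    checkpoints = 0
    for line in stderr_text.splitlines():
        stripped = line.strip()
        # Skip empty lines and lines containing anything besides '.' or 'c',
        # so ordinary log messages with full stops are not counted.
        if stripped and set(stripped) <= {".", "c"}:
            dots += stripped.count(".")
            checkpoints += stripped.count("c")
    return dots, checkpoints


# Hypothetical log excerpt, for illustration only.
sample = (
    "some header line\n"
    ".....c.....\n"
    "..........\n"
    "Warning: Program terminating.\n"
)
print(count_progress_marks(sample))  # (20, 1)
```

Many dots with very few 'c' marks would match the "spinning its wheels" pattern suspected above.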

DF1DX wrote:
The WU was stuck at about 75%. After a few minutes I aborted the WU.

That was probably just some sort of simulated progress.  With just one potential checkpoint, it couldn't really be real.  The task of my own that I looked at just now (mentioned above) completed in around 16 mins and had about 15 'c' chars.  There's supposed to be a checkpoint every minute or so, so 15 checkpoints seems right.
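The checkpoint arithmetic above can be sketched in a few lines of Python. This is only a rough sanity check under stated assumptions: the one-minute interval is the "every minute or so" figure from this post, and the function names and the 0.5 tolerance are hypothetical choices, not anything taken from BOINC or the GW app.

```python
def expected_checkpoints(elapsed_minutes, interval_minutes=1.0):
    """Rough checkpoint count for a task, assuming a fixed interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return int(elapsed_minutes // interval_minutes)


def looks_stalled(observed, elapsed_minutes, interval_minutes=1.0, tolerance=0.5):
    """Flag a task whose checkpoint count is far below the rough estimate.

    The tolerance (0.5 here) is an arbitrary illustrative threshold.
    """
    expected = expected_checkpoints(elapsed_minutes, interval_minutes)
    return observed < tolerance * expected


# A ~16 minute task with about 15 'c' marks: close to the estimate.
print(expected_checkpoints(16))   # 16
print(looks_stalled(15, 16))      # False

# Hypothetical: a single 'c' over the same elapsed time would stand out.
print(looks_stalled(1, 16))       # True
```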

Cheers,
Gary.
