Computation Error on AMD/ATI Radeon GPU with OpenCL and Windows OS

2d3hqJaofRNK8nrMoPFRRMrd3nvf
Joined: 9 Feb 05
Posts: 4
Credit: 187604019
RAC: 133560
Topic 213603

Hello,

I posted my initial message on another thread, here:  https://einsteinathome.org/goto/comment/164384.  However, that thread was geared toward Linux, so I am reposting under a new thread.  Several forum members posted replies there and it appears that my issue may be on-going and not fully understood at this point.

Paul,

My apologies.  The message thread did not mention Linux specifically, but I noticed that all the posts leaned that way, so my post may have been slightly off topic with regard to OS.  I also noticed the CAL vs OpenCL distinction, but my system would not download any GPU WUs until I upgraded the driver to at least OpenCL 1.2.  The Event Log also indicated OpenCL, so I assumed this was an OpenCL issue, and it is on an AMD Radeon series GPU.  Again, my apologies.

Based on your follow-up post, it looks like it may be an ongoing issue.  I would be very willing to help, but unfortunately I do not have the knowledge needed to work with debug packages.  I would need guidance to install one and to make sure that it is working.  Also, my OS is different, so I would need to find people who are knowledgeable about it.

 


Gary,


My apologies.  I thought I had copied enough to show the computer ID.  The computer ID is 12628920, and the task list is linked here:  https://einsteinathome.org/host/12628920/tasks/0/0


I do see several tasks with the stderr.txt file listed after the task info and error messages shown.  Hopefully, this means something to someone as I do not understand the error(s) being reported.  Way above my pay grade.


 


I will have to look into the driver.  I downloaded the most recent driver listed on AMD's website (dated 02/15/2017) and installed it based on the GPU model number and OS.


 


Regarding the other computer that is exactly the same: I neglected to mention that it does not have E@H installed, only SETI@Home.  It crunches WUs on SETI just fine with no errors, so I thought it should be fine on E@H.  However, this does not appear to be the case.  Paul noted that this might be an ongoing issue, so I will likely have to switch to SETI tasks for the GPU for now.


 


I appreciate the help and will look into the driver issue some more to see if I can find a solution.


 


Mikey,


This is how I found the most current driver and installed it.  As noted previously, the driver it found for my system was dated 02/15/2017.  So the driver should be good, but as I also noted previously, I will look into the driver issue more to see if I can find additional information.  Thank you for responding, it is much appreciated.


 


If I find anything out that is helpful, I will definitely post it.


 


Rudabega


 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118443945058
RAC: 25891697

afewdents wrote:
I do see several tasks with the stderr.txt file listed after the task info and error messages shown.  Hopefully, this means something to someone as I do not understand the error(s) being reported.  Way above my pay grade.

I was just looking into the small group of 11 tasks that had thrown an error when all the remaining 'in progress' tasks suddenly got aborted.  If you aren't going to continue, please let me know.  There's no point in spending time on this if you're pulling out.    Here are some bits from the stderr output that give a clue as to what is happening.     

<message>
Network access is denied.
(0x41) - exit code 65 (0x41)</message>

This is spurious.  The science app produces an exit code that would be meaningful to the author of that code.  However, Windows sees that code and interprets it as a Windows system error, which it is not, so the text is just wrong.
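To illustrate how the misleading text arises, here is a minimal sketch.  It assumes only that the number is looked up in the standard Windows system error table, where code 65 (0x41) happens to carry the text "Network access is denied."  The small table and the function name `describe_exit_code` are made up for this illustration and are not part of BOINC or the science app.

```python
# Illustrative subset of the Windows system error table (values from winerror.h).
# The science app's own meaning for exit code 65 is NOT listed here; this only
# shows the text Windows attaches to the same number.
WINDOWS_SYSTEM_ERRORS = {
    2: "The system cannot find the file specified.",
    5: "Access is denied.",
    65: "Network access is denied.",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code the way the <message> block above shows it."""
    text = WINDOWS_SYSTEM_ERRORS.get(code, "Unknown error.")
    return f"{text} (0x{code:x}) - exit code {code} (0x{code:x})"


# Prints: Network access is denied. (0x41) - exit code 65 (0x41)
print(describe_exit_code(65))
```

The number is the only thing the app actually reported; the sentence in front of it comes purely from the lookup.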

% Opening inputfile: ../../projects/einsteinathome.org/LATeah0055L.dat
% Total amount of photon times: 30007
% Preparing toplist of length: 10
% Read 1255 binary points
read_checkpoint(): Couldn't open file 'LATeah0055L_980.0_0_0.0_8218995_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1255
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs

In the above snip, everything is normal for the start of crunching.  A checkpoint file (.cpt extension) will only exist if this task is being restarted from a saved checkpoint.  The read_checkpoint() function looks for a checkpoint that would have been saved if there had been some crunching of this task previously.  No checkpoint was found so crunching is starting right at the beginning at Binary point #1 out of a total of 1255 to be done.
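The restart logic described above can be sketched roughly as follows.  This is not the science app's code: the checkpoint format here (a single integer holding the last completed binary point) and the helper `first_binary_point` are assumptions made purely for illustration.

```python
import os
import tempfile


def read_checkpoint(path):
    """Return the last completed binary point, or None if no checkpoint exists.

    Hypothetical format: the file holds one integer. The real .cpt layout
    is not described in this thread.
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        # Same situation as the "Couldn't open file ... (2)" line in the log:
        # harmless when the task has never been started before.
        return None


def first_binary_point(path):
    """Pick the binary point to start from: 1 on a fresh start, else resume."""
    saved = read_checkpoint(path)
    return 1 if saved is None else saved + 1


with tempfile.TemporaryDirectory() as workdir:
    cpt = os.path.join(workdir, "example.out.cpt")
    print(first_binary_point(cpt))  # no checkpoint yet: prints 1
    with open(cpt, "w") as f:
        f.write("400")
    print(first_binary_point(cpt))  # checkpoint present: prints 401
```

A missing checkpoint is treated as the normal fresh-start case rather than as a failure, which is why that log line is not a concern.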

The problem then arises pretty much immediately.                  

Error in computing index of fft input array, i:1055167508 pair:13768
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 2288728
00:52:46 (2724): [CRITICAL]: ERROR: MAIN() returned with error '1'

The above will be meaningful to the author of the science app.  Hopefully, one of the staff might find time to have a quick look and comment about why this might be happening.  It seems unlikely that this is a program bug as millions of these tasks have already been crunched successfully.  You would expect many similar reports if it were a bug.  That really leaves two possibilities.   

It could be a hardware issue with your graphics card.  Seeing as you have two identical systems, you could swap graphics cards and see if the problem goes with the card or stays with the system.  If it stays with the system, it's not the card - it's hardly likely that two different cards would have the same hardware error.

If it's not the card, the second possibility is the OS/driver combination.  As the 'Northern Islands' cards are getting a bit old now, there could be something in recent drivers that is not quite compatible.  It would be useful to see if there is any difference in behaviour with an earlier driver.  I'd be tempted to try to find a version around the 2015 to early 2016 mark.  From what other people have reported, it's pretty important to clean all the remnants of a previous driver install when updating or downgrading a driver. Maybe the very first step should just be a complete clean and reinstall of the current driver to see if that makes any difference.
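One thing that can be read from the stderr output without knowing the app's internals is how the reported index compares with the FFT length printed a few lines earlier.  The sketch below assumes, and this is only an assumption, that a valid index into the FFT input array should be smaller than the "fft length" value.  The function name `index_vs_fft_length` is made up for this example.

```python
import re

# Two lines copied from the stderr output quoted earlier in this post.
STDERR_SNIPPET = """\
% fft length: 16777216 (0x1000000)
Error in computing index of fft input array, i:1055167508 pair:13768
"""


def index_vs_fft_length(log: str):
    """Extract the FFT length and the failing index, and test 0 <= index < length."""
    fft_len = int(re.search(r"fft length: (\d+)", log).group(1))
    index = int(re.search(r"fft input array, i:(\d+)", log).group(1))
    return index, fft_len, 0 <= index < fft_len


index, fft_len, in_range = index_vs_fft_length(STDERR_SNIPPET)
# Prints: 1055167508 16777216 False
print(index, fft_len, in_range)
```

Under that assumption the index is far outside the array, which would be consistent with a bad value being computed somewhere, but it does not say whether the hardware, the driver or something else produced it.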

Quote:
It crunches WUs on SETI just fine with no errors, so I thought it should be fine on E@H.  However, this does not appear to be the case.

You can't always assume that because it works on one project it will work on all others.  You also can't assume that because one card doesn't work, a similar card in a different machine also won't work.  Did you actually give your 'seti machine' the chance to see if it could handle an Einstein GPU task?

 

Cheers,
Gary.

2d3hqJaofRNK8nrMoPFRRMrd3nvf
Joined: 9 Feb 05
Posts: 4
Credit: 187604019
RAC: 133560

Gary,

Gary Roberts wrote:
I was just looking into the small group of 11 tasks that had thrown an error when all the remaining 'in progress' tasks suddenly got aborted.  If you aren't going to continue, please let me know.  There's no point in spending time on this if you're pulling out.    Here are some bits from the stderr output that give a clue as to what is happening.

I will review your entire post later tonight, but wanted to let you know that I am not pulling out.  I just thought it better to abort the current tasks and not request any new ones until I have a potential solution.  To me it seems wasteful to continue to process WUs that I know will not end up providing useful results, but maybe I am wrong.  In any case, I very much want to understand why I am having the issues so I can get back to crunching data.

 

Thanks,

Rudabega

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118443945058
RAC: 25891697

afewdents wrote:

Gary,

Gary Roberts wrote:
I was just looking into the small group of 11 tasks that had thrown an error when all the remaining 'in progress' tasks suddenly got aborted.  If you aren't going to continue, please let me know.  There's no point in spending time on this if you're pulling out.    Here are some bits from the stderr output that give a clue as to what is happening.

I will review your entire post later tonight, but wanted to let you know that I am not pulling out.

OK, fine.  I thought you may have originally just suspended the balance to prevent them from being wasted whilst waiting for a possible solution.  Then when I saw them all get aborted, I thought you might have decided to pull the plug.  I'm happy to try to help if you want to keep going.

If you want to explore swapping cards between machines or installing/reinstalling drivers, etc, it would be useful to keep a supply of tasks on hand and suspended.  That way you can just resume one of them when needed to see if any change made has solved the problem.  If it has, you can resume them all.  If not, the remainder are protected from instant failure.
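The suspend-then-resume-one approach can also be done from the command line with `boinccmd --task URL task_name {suspend | resume}`.  The sketch below only builds and prints the commands rather than running them.  The project URL and the task names are placeholders (assumptions); the real values are whatever BOINC Manager shows for the project and for each task.

```python
import shlex

# Assumption: the project URL as BOINC knows it. Check the Projects tab in
# BOINC Manager for the exact value on your machine.
PROJECT_URL = "https://einsteinathome.org/"


def task_command(task_name: str, operation: str) -> list:
    """Build the boinccmd argument list for suspending or resuming one task."""
    if operation not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {operation}")
    return ["boinccmd", "--task", PROJECT_URL, task_name, operation]


# Placeholder task names, not real ones.
tasks = ["EXAMPLE_TASK_A", "EXAMPLE_TASK_B", "EXAMPLE_TASK_C"]

# Step 1: suspend everything so nothing fails while hardware or drivers change.
for name in tasks:
    print(shlex.join(task_command(name, "suspend")))

# Step 2: after a change, resume a single task as the test case.
print(shlex.join(task_command(tasks[0], "resume")))
```

To actually run a command, the list could be passed to `subprocess.run(...)` on a machine where `boinccmd` is available.  Selecting the tasks in BOINC Manager and using the Suspend and Resume buttons achieves the same thing.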

I was hoping one of the staff might have seen this thread and commented.  I'll send a PM to Bernd and ask him if he has any suggestions.  He's very busy with the new GW run so we need to be patient.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118443945058
RAC: 25891697

I received a reply from Bernd to my PM.  He is very busy getting the latest GW search running correctly.  He mentioned that it had been quite a while since he had worked on the code for FGRPB1G, so he didn't really remember the details of the message output that mentioned "Error in computing index of fft input array ...".  He wasn't sure, but he wondered if that particular operation was performed on the CPU anyway, prior to the GPU crunching getting started.  If that were true, it might not actually be the GPU that is the problem.  He stressed he wasn't sure, but he wondered if you had been able to run FGRP5 CPU-only tasks without error.

I had a look at your tasks list for CPU tasks and you have 11 of them, all 'in progress' but none returned, error or not.  Are you allowing any of those tasks to crunch?  It would be useful to see if any of those would complete without error.

 

Cheers,
Gary.
