Getting 90% invalid results on Polaris GPU

sapphitron
Joined: 30 Dec 19
Posts: 2
Credit: 6747523
RAC: 0
Topic 220338

Hi, I just started crunching Einstein@Home on my Radeon RX590 (Polaris architecture, so it shouldn't be affected by the Navi issues) and I'm getting over 90% "validation errors" or invalid results.

For the first two days or so the card was running overclocked, but it's been at stock speeds since January 1st and that doesn't seem to have made any difference.
The GPU doesn't display any sort of issue in games, it's running well below Tmax (around 60°C) and in general I don't think it's a hardware problem.

The only thing I can think of is that since the GPU was running at just 60% load, I reduced the gpu load factor to 0.2 so now it's executing 4 tasks at a time, and that keeps it around 90-95% load. Could that be the cause of the issue? VRAM utilization is around 50% so it's not running out of memory or anything.

I'm gonna revert it back to 1 and see how it goes, but if I can't find a fix I'm gonna take it offline so I don't keep polluting the results.


Does anyone have any ideas?

Thanks!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118404198641
RAC: 25759177

sapphitron wrote:
... I reduced the gpu load factor to 0.2 so now it's executing 4 tasks at a time, and that keeps it around 90-95% load.

If you are referring to the GPU utilization factor, a setting of 0.2 would probably try to run 5 concurrent tasks, and even if you had enough CPU cores to support them, that's just crazy :-).  I have a vague recollection that you only get 5 if the setting is a tiny bit below 0.2, which may be why only 4 tasks were attempting to run.  Trying to run that many is very likely the reason for the failures.  You don't actually say whether the errors are 'validate errors' or tasks that are declared invalid.  There is quite a difference, and you need to specify - or allow us to check for ourselves.  Your wording suggests you have a mixture of both, but it would be useful to know which type predominates.
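For what it's worth, the relationship between the factor and the task count can be thought of as simple division: each task reserves `factor` of the GPU, and tasks fit while the total stays at or below 1.  A rough sketch of that idea (this is an assumed model of the scheduler's behaviour, not actual BOINC source code):

```python
import math

def concurrent_tasks(utilization_factor: float) -> int:
    """How many tasks fit on one GPU if each reserves `utilization_factor`
    of it (assumed model of the scheduler, not actual BOINC code)."""
    return math.floor(1.0 / utilization_factor)

print(concurrent_tasks(0.5))   # factor 0.5  -> 2 tasks (x2)
print(concurrent_tasks(0.33))  # factor 0.33 -> 3 tasks (x3)
print(concurrent_tasks(0.25))  # factor 0.25 -> 4 tasks (x4)
```

Exact divisors like 0.2 sit right on the boundary, so floating-point rounding can tip the result to either 4 or 5 - which would fit the "tiny bit below 0.2" recollection above.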

Your computers are hidden so we can't check.  We don't know important details like the above and also things like the OS, the actual search or searches you are trying to run, the driver and OpenCL versions you are using, your CPU cores, RAM, etc., etc., so to get proper help, at least tell us the hostID of your machine so that someone willing to help can make a proper assessment of what is happening.

I run a *lot* of Polaris series GPUs under Linux, and they all work extremely well for me.  I don't have an RX 590 - the bulk of mine are RX 570s, but I also have some 460s, 560s and 580s.  There is no reason I know of why yours won't work very well too.  You will likely need different settings for optimal conditions, depending on which particular GPU search you wish to run: the gamma-ray pulsar search (FGRPB1G) or the gravitational wave (GW) search (O2MD).

Let us know what you want to do.

Cheers,
Gary.

sapphitron
Joined: 30 Dec 19
Posts: 2
Credit: 6747523
RAC: 0

Hi, yeah, I did get both kinds of errors.

I set the utilization factor to 0.2 because a friend of mine said that's what he uses.  VRAM utilization didn't go over 50% (it's an 8GB card, so it still had another 4GB if needed), and the CPU has no problem at all keeping up - it's a Ryzen 7 2700X with 8 cores/16 threads.  I've got 32GB of system memory.

The OS is Windows 10 (x64 obviously) and drivers are the latest Adrenalin (19.12.3).

Here's the host: https://einsteinathome.org/host/12800960
Do I need to change any settings to make it public?


It would seem that I'm no longer getting errors after setting the utilization factor back to 1, though most of the units are still waiting to be validated.

I have no particular preference for the projects, though if I had to pick one I'd pick the pulsar search.

What settings would you recommend? At 1 the GPU isn't fully loaded at all. It still seems strange that just running more tasks would lead to computation errors, though I'll admit I've never dabbled with GPU programming so I don't know the caveats.


Thanks!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118404198641
RAC: 25759177

sapphitron wrote:
What settings would you recommend? At 1 the GPU isn't fully loaded at all. It still seems strange that just running more tasks would lead to computation errors ...

You need to lose this fixation about getting the GPU "fully loaded" :-).  Just remember, we are talking about having a GPU perform number crunching - not the normal type of workload that various utilities might expect you to be running for something like gaming.  Load will depend very much on how well the compute problem can be 'parallelized' and how much back and forth communication there needs to be with the app running on the CPU.  If a GPU has to 'wait' for CPU support, of course its apparent 'load' will go down.

The gamma-ray pulsar app, which analyses data from the Large Area Telescope (LAT) on board the Fermi satellite (ie the FGRPB1G app), is very mature and uses the GPU efficiently with little need for CPU support.  It runs very well singly (x1), but you can get perhaps something like a 10% improvement in throughput if you run x2 - ie a GPU utilization factor of 0.5.  I have tried x3 (0.33) from time to time and most of the time it will work, but the throughput gain is negligible.  Depending perhaps on the specific data files being analysed, random fluctuations in crunch time and random errors start creeping in, which more than negate any potential gain.  This probably varies with GPU model and GPU settings (frequency, voltage), so you would need to experiment to find out for your own particular hardware.  In the end, it's far simpler not to go beyond x2 and keep the nice stable operation, irrespective of what you think the GPU load might be.
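To see why a modest per-task slowdown at x2 can still be a net win, compare throughput rather than per-task runtime.  The runtimes below are made up purely for illustration:

```python
# Hypothetical runtimes: at x2 each task runs slower individually, but two
# finish in less than twice the single-task time, so tasks/hour goes up.
runtime_x1 = 600.0    # seconds per task running singly (made-up figure)
runtime_x2 = 1090.0   # seconds per task running two at once (made-up figure)

tasks_per_hour_x1 = 1 * 3600 / runtime_x1
tasks_per_hour_x2 = 2 * 3600 / runtime_x2

print(f"x1: {tasks_per_hour_x1:.2f} tasks/h")  # prints 6.00
print(f"x2: {tasks_per_hour_x2:.2f} tasks/h")  # prints 6.61, roughly a 10% gain
```

If x2 per-task runtime ever creeps above twice the x1 figure on your card, you've gained nothing and should go back to x1.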

The app that is analysing LIGO data, looking for detections of continuous gravitational wave (GW) emissions from massive rotating bodies like neutron stars, is quite new and relatively immature.  There has been a long running CPU app which is considered to be the benchmark for potentially detecting continuous GW and, to speed up the rate that data can be analysed, there is now a GPU app.  This has been difficult to get running correctly.  It's taken a long time to get it to produce results compatible with the CPU app and it needs way more CPU involvement and support than the GRP app.  I looked through your results and the vast bulk of them are for this app so it's little wonder that you are seeing what you consider to be unacceptably low GPU loads.  Unfortunately, that's just the nature of the beast.  A lot of the calculations still need to be done on the CPU.

I've been experimenting with the GW GPU app on some RX 570 GPUs.  The run is referred to as O2MD, where O2 refers to data from observational run #2 of the LIGO detectors and MD stands for Multi-Directional: the search targets specific directions towards known pulsars in the Milky Way that are as close as possible to Earth.  The signals will be incredibly weak, so using the closest objects gives the best chance of a detection.  At a later stage, the search will spread to 'All Sky' (O2AS), which will run a lot longer.

For the current O2MD search, I'm currently running x3 (GPU utilization 0.33).  Earlier on I was running x4 but when the attention switched to the Vela pulsar, that soon started producing invalid results so I switched back to x3.   Going from x1 to x2 to x3 to x4 all gave worthwhile improvements in throughput, provided that full CPU support was available.  To achieve that on a Ryzen 2600, I allowed all 6 cores (12 threads) to be available for GPU support duties.

At the end of the day, there is no point running more tasks concurrently than will allow continuous error free operation without having to baby sit the equipment.  Until you get a feel for how things run on your equipment, the best advice I could give would be to 'hasten slowly' :-).

Until you get sufficient experience of where the boundaries are, I would recommend running the FGRPB1G search only and using x2 to start with.  It will most efficiently use your GPU and you could run quite a few CPU tasks on some of your CPU cores.  If running x2 you should make sure there are at least 2 CPU threads available for GPU support.

Do not mix the two GPU searches.  Use your project prefs to 'turn off' the GW search if you choose to run FGRPB1G, and also make sure you disable 'non-preferred apps'.  The main reason is that crunch time estimates are wrong for both search types, but in opposite directions, so results for one search type will badly interfere with the estimates for the other, making it very difficult for BOINC to maintain a proper cache of work.  There shouldn't be a problem running FGRPB1G tasks on the GPU while allowing some GW CPU tasks to run on spare CPU cores.  There is a setting that controls the number of cores BOINC can use.  Start at 50% and, if that doesn't affect GPU crunch times, you could try going a little higher.  At some point GPU crunch times will be affected, and you want to avoid that.
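As an aside, besides the website preferences, the per-app GPU share can also be pinned locally with an app_config.xml in the project's directory.  A minimal sketch for running x2 - note that the app name hsgamma_FGRPB1G is an assumption here, so check the app names your client actually reports before using it:

```xml
<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <!-- 0.5 GPU per task means two tasks share one GPU (x2) -->
      <gpu_usage>0.5</gpu_usage>
      <!-- reserve one CPU thread per GPU task for support work -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

The client picks this up after Options -> Read config files, or a restart.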

To start with, have low work cache settings (eg 0.5 days max) until BOINC adjusts the estimates to suit the type you have chosen.  Once that happens you could increase to maybe a day or two to protect against any relatively short term work outages.  Another advantage of FGRPB1G is the 14 day deadline which means you could increase the work cache settings a bit more once things were stable.

If you need further advice, please ask and someone is bound to reply :-).

Cheers,
Gary.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 469
Credit: 10398914033
RAC: 3457564

+1

VERY good and detailed explanation!    (as always)

Thanks Gary

