3 Nvidia GPUs, no GPU tasks (Ubuntu 20.04)

Brandon Clark
Joined: 10 Mar 20
Posts: 9
Credit: 248204104
RAC: 0
Topic 225381

Hello group,

Long time reader, first time poster . . . .

 

Problem

I recently upgraded two systems, and set up one new system. All three have Nvidia GPUs and are running under Ubuntu, but none of them are getting GPU tasks. One system was previously running GPU tasks under an earlier version of Ubuntu (18.xx, I think) as recently as a month or two ago.

 

 

 

System 1: 12849999

GenuineIntel Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz [Family 6 Model 42 Stepping 7] (4 processors)

NVIDIA GeForce GTX 1050 Ti (4038MB) driver: 460.73

Ubuntu 20.04.2 LTS [5.8.0-50-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]

 

System 2: 12850795

GenuineIntel Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz [Family 6 Model 60 Stepping 3] (8 processors)

NVIDIA Quadro K2100M (1999MB) driver: 390.99 (I have tried reverting back to an older driver, but that didn't fix things.)
(Kubuntu) Ubuntu 20.10 [5.8.0-50-generic|libc 2.32 (Ubuntu GLIBC 2.32-0ubuntu3)]

 

System 3: 12879734

GenuineIntel Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz [Family 6 Model 62 Stepping 4] (8 processors)

NVIDIA TITAN Xp (4095MB) driver: 460.73

Ubuntu 20.10 [5.8.0-50-generic|libc 2.32 (Ubuntu GLIBC 2.32-0ubuntu3)]

 

Research

I found a few threads relating to issues with the current Ubuntu Kernel not being compatible (might not be the right word) with AMD drivers. The solutions found were to either revert back to an earlier version of the kernel, or reinstall the OS without allowing updates during the install process. My systems have Nvidia cards though, so I'm not sure if that could be the same issue.

Also, in the other threads I found it seemed like the issue was that BOINC was not able to recognize that GPUs were present. On my systems the GPUs are recognized. The information on the systems above was copied from this website. Also, the event log for each system has entries with verbiage like "requesting tasks for Nvidia GPU", and other entries relating to the GPUs.

I'm kind of stuck at this point.

 

I'm not any kind of power user or linux expert - just an amateur from Seti@home days who finds older hardware and puts it to use for BOINC. There is probably a lot of optimization I could do, or things I'm doing that I shouldn't be. I just have fun contributing without taking it too seriously.

 

Many thanks to everyone. This forum is a great resource.

Brandon

 

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48065408464
RAC: 34426528

Your computers are hidden. So we can’t see the system information. That would be helpful. Because even if you think you have drivers installed, it could be apparent that there’s still a problem if BOINC isn’t detecting your cards. Also helpful to post the startup output of the BOINC Event Log, the first 20-30 lines or so.
 

As a guess, do you have the OpenCL driver components installed? The Nvidia driver packages no longer include them, so you have to install them deliberately.
 

sudo apt install ocl-icd-libopencl1
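To check whether an OpenCL loader library is already visible to the system before (or after) running that install, something like the following may help. It is only a sketch: the function name has_opencl_loader is made up for illustration, and it assumes the loader appears in the linker cache under a name starting with libOpenCL.so.

```shell
# Read `ldconfig -p` output on stdin and succeed if an OpenCL loader
# library (libOpenCL.so.*) is listed in the linker cache.
# The function name is hypothetical, chosen for this example.
has_opencl_loader() {
    grep -q 'libOpenCL\.so'
}

# Typical use (not run here):
#   ldconfig -p | has_opencl_loader && echo "loader found" || echo "loader missing"
```

If this reports the loader as missing, that matches the situation the install command above is meant to fix.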

_________________________________________________________________________

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118358775588
RAC: 25502441

Brandon Clark wrote:
... All three have Nvidia GPUs and are running under Ubuntu, but none of them are getting GPU tasks.

Hi Brandon, as was mentioned, your computers are 'hidden' but since you've listed the host IDs, we can find the same details, just a little less conveniently :-).  If you provide a clickable link to each one, you could save yourself typing out all the individual details.  It also makes it easy for others to look and respond.

I had a look at 12849999 and on the details page there is a further link to the last scheduler contact.  If you follow that link and read through what the scheduler had to say, you will find this particular line of output:-

NVidia device (or driver) doesn't support OpenCL

It gets repeated quite a few times as each different 'plan class' gets checked.

This doesn't mean that the hardware is deficient - just that the libs that handle that capability are not installed.  It pretty much confirms what Ian&Steve C. suggested.
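If you copy the text of the last scheduler contact into a file, counting how often that message appears can be done roughly as follows. This is a sketch under assumptions: the function name count_opencl_rejections and the file name scheduler_log.txt are hypothetical, and the match is on the message text exactly as quoted above.

```shell
# Count how often the scheduler reported missing OpenCL support.
# Reads the scheduler log text on stdin.
count_opencl_rejections() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c -F "doesn't support OpenCL" || true
}

# Typical use (scheduler_log.txt is a hypothetical file you saved yourself):
#   count_opencl_rejections < scheduler_log.txt
```

A count above zero suggests the OpenCL libraries are still missing on that host; zero suggests the scheduler no longer objects on those grounds.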

Cheers,
Gary.

Brandon Clark
Joined: 10 Mar 20
Posts: 9
Credit: 248204104
RAC: 0

Hello group,

Thanks for the help so far. I found the setting to make my computers visible, so they should show up now.

I ran the command from Ian&Steve C. above to install the OpenCL drivers. On two machines the install was successful. I restarted them, and when I opened BOINC Manager after the restart they had already begun downloading GPU tasks. You guys are great!

On the third machine, 12850795, when I tried to install the drivers the terminal indicated that the drivers were already installed. I'm going to restart that system tomorrow, give it a few minutes, and then pull the event log to see what shows up. I'll post a follow-up.

Brandon

Keith Myers
Joined: 11 Feb 11
Posts: 5023
Credit: 18928275173
RAC: 6517451

When in doubt about OpenCL support, always do a sanity check with clinfo. It will show whether the OpenCL component of the drivers is installed.
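As a rough illustration of that sanity check, the platform count can be pulled out of the clinfo output. This is a sketch only: the function name opencl_platform_count is hypothetical, and it assumes clinfo prints a line beginning with "Number of platforms" followed by the count. On Ubuntu, clinfo itself can usually be installed with sudo apt install clinfo.

```shell
# Extract the platform count from clinfo output read on stdin.
# Assumption: the output contains a line beginning "Number of platforms"
# followed by the count. Prints 0 if no such line is found.
opencl_platform_count() {
    awk '/^Number of platforms/ { print $NF; found = 1; exit }
         END { if (!found) print 0 }'
}

# Typical use (not run here):
#   clinfo | opencl_platform_count
```

A result of 0 would suggest that no OpenCL platform is available to applications, even if the GPU itself is detected.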

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118358775588
RAC: 25502441

Brandon Clark wrote:
... On the third machine, 12850795, when I tried to install the drivers the terminal indicated that the drivers were already installed. I'm going to restart that system tomorrow, give it a few minutes, and then pull the event log to see what shows up. I'll post a follow-up.

You could have solved the problem with that one yourself by following the link to the last scheduler contact, which is on the host details page. :-).  I looked just now (thanks for the link) and the very first thing that shows up is:-

No disk space available: disk_max_used_gb 2.00GB disk_max_used_pct 10.00 disk_min_free_gb 1.00GB

One of the restrictions is that you are only allowing BOINC to use 10% of the partition where BOINC is installed, with a maximum of 2GB and a further requirement that at least 1GB be kept free.  If the machine is essentially being used to run BOINC, you need to loosen the restrictions, particularly if it's on a smallish partition.  If you're running GW CPU tasks, you'll quickly eat up 2GB even if you increase the 10% restriction.
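To see how the three restrictions combine, here is a small worked calculation. It is a simplified model and an assumption on my part, not the client's actual code: the allowance is taken as the smallest of the three limits. The function name boinc_disk_allowance and its argument order are made up for this example.

```shell
# Estimate how much disk BOINC may use under the three preferences.
# Arguments (all in GB except the percentage):
#   $1 partition size, $2 currently free space, $3 space BOINC already uses,
#   $4 disk_max_used_gb, $5 disk_max_used_pct, $6 disk_min_free_gb
# Simplified model (an assumption): the allowance is the smallest limit.
boinc_disk_allowance() {
    awk -v total="$1" -v free="$2" -v used="$3" \
        -v max_gb="$4" -v max_pct="$5" -v min_free="$6" 'BEGIN {
        by_pct  = total * max_pct / 100      # percentage-of-partition limit
        by_free = free + used - min_free     # keep min_free GB unused
        limit = max_gb
        if (by_pct  < limit) limit = by_pct
        if (by_free < limit) limit = by_free
        if (limit < 0) limit = 0
        printf "%.2f\n", limit
    }'
}

# Example: a 15 GB partition with the settings quoted above.
#   boinc_disk_allowance 15 10 0.5 2 10 1
```

Under this model, any partition smaller than 20 GB is capped by the 10% rule before the 2 GB maximum is even reached, which is why a smallish partition makes the problem worse.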

What type of GPU search are you attempting to run?  The GPU VRAM is only 2GB so, very likely, it has insufficient memory for most of the GW GPU tasks it will receive.  You should limit that one to gamma-ray pulsar (GRP) GPU tasks to avoid memory related problems.

You can tell that the scheduler is happy that OpenCL is available.  Look for the block of messages that starts:-

Checking plan class 'FGRPopencl1K-nvidia'

and ends:-

plan class ok

In other words, if you fix the disk space allowed problem, the scheduler would be happy to send you GRP GPU tasks.
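For anyone who wants to pull that block out of a saved copy of the scheduler log, a sketch along these lines may work. The function name plan_class_block is hypothetical, and the sketch assumes each plan class section begins with a "Checking plan class" line as in the excerpt above.

```shell
# Print the scheduler-log block for one plan class, from its
# "Checking plan class" line through the first "plan class ok" line.
# Reads log text on stdin; $1 is the plan class name.
plan_class_block() {
    awk -v pc="$1" '
        # Each "Checking plan class" line switches printing on or off,
        # depending on whether it names the requested class in quotes.
        /Checking plan class/ { show = (index($0, "'\''" pc "'\''") > 0) }
        show { print }
        show && /plan class ok/ { exit }
    '
}

# Typical use (scheduler_log.txt is a hypothetical file you saved yourself):
#   plan_class_block FGRPopencl1K-nvidia < scheduler_log.txt
```

If the block printed for a plan class does not end in "plan class ok", the lines in between should say why the scheduler rejected it.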

Cheers,
Gary.

Brandon Clark
Joined: 10 Mar 20
Posts: 9
Credit: 248204104
RAC: 0

Good morning team,

This is pretty cool - I'm learning all kinds of new tools I can use to troubleshoot my systems. I looked up the clinfo tool and that looks handy. I'm going to add that to my notes to run whenever setting up a new system.

As for 12850795, I went into the settings and changed the disk space limits. It should now have up to around 20 GB to work with. That explains another issue I always noted with that system: it never had more than eight or ten tasks showing at any one time.

I also found this link (is this the scheduler log that was mentioned above?) and if I'm reading it correctly it looks like the system started downloading GPU work just after I left this morning. I'll double-check via the GUI this evening when I get home.

With regard to disk space and work units, how much space do typical work units take up? Where are they stored on a system? I have gone looking through the hard drive a few times in the past when I was curious, but never managed to find the files that corresponded to the individual work units. Is there a best practice for setting the "days" of work available on the system?

Back when I had almost all the same systems (recovered office machines) I ran everything using global preferences. These days I have a lot more "oddballs" and a lot more GPUs. Mostly I've paid attention to the computing settings to manage heat and crashes: always keeping 1 CPU core free, and adjusting the % time to keep the fans at low speed. It seems like I need to do a once-over with each system and probably adjust more than just the compute settings.

Thanks everyone,

Brandon

Keith Myers
Joined: 11 Feb 11
Posts: 5023
Credit: 18928275173
RAC: 6517451

When BOINC is installed from a distro, the main BOINC files are located at /var/lib/boinc-client.

The individual project files are in the respective project folders, under projects/ inside the main boinc-client folder.

Your tasks are in those folders also.

My best practice for setting the caching levels is 0.2 days of work and 0.0 additional days of work.

That way you don't overcommit too much work to any one project.
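To see how much space each project folder is actually taking, something like this sketch could be used. It assumes the distro layout described above, with one subfolder per project under /var/lib/boinc-client/projects; the function name project_disk_usage is made up for the example, and reading that directory may need sudo depending on permissions.

```shell
# Print disk usage (in KB) for each project folder.
# Assumption: one subfolder per project inside the "projects" directory;
# pass another path as $1 if your install differs.
project_disk_usage() {
    base="${1:-/var/lib/boinc-client/projects}"
    for d in "$base"/*/; do
        # With no subfolders the glob stays literal, so check it is a directory.
        [ -d "$d" ] && du -sk "$d"
    done
    return 0
}

# Typical use (not run here):
#   project_disk_usage
```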

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118358775588
RAC: 25502441

Brandon Clark wrote:
I also found this link (is this the scheduler log that was mentioned above?)

Yes.  Notice it's the actual time and date (UTC) for when your client most recently made contact.  It tells you exactly what the scheduler response was to the request made at that time.  The date and content pointed to by the link will change with each subsequent request which is why I chose to show the critical excerpts for the one I looked at.  Many of the misconfiguration type problems that inexperienced users have can be solved by using this feature of the website.

Brandon Clark wrote:
With regard to disk space and work units, how much space do typical work units take up? Where are they stored on a system?

You need to understand the difference between tasks and data.  The data files are large and can be reused for many different tasks.  They are generated from the various observatories, LIGO Hanford and LIGO Livingston for GW tasks and the Large Area Telescope (LAT) on board the Fermi satellite for GRP tasks.

GW data files all start with either "h1_" or "l1_" depending on the source of the data.  GRP data files all start with "LATeah".  Data files will hang around in the project directory until all possible tasks that depend on that data have been completed.  The project is supposed to take care of that.  You should not interfere with these.  When the new O3 GW run fully starts (it's currently being tested with simulated data) you may need to store many GBs of data - if it's anything like the previous run, which it probably will be.
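The naming rule above can be turned into a small helper for sorting out what is in a project directory. This is only an illustrative sketch: the function name classify_data_file and the example file names are hypothetical, and it is meant for looking, not for deleting anything.

```shell
# Classify a file name by the prefixes described above:
# "h1_" or "l1_" for GW data, "LATeah" for GRP data.
classify_data_file() {
    case "$1" in
        h1_*|l1_*) echo "GW" ;;
        LATeah*)   echo "GRP" ;;
        *)         echo "other" ;;
    esac
}

# Typical use, run from inside a project directory (read-only):
#   for f in *; do printf '%s\t%s\n' "$(classify_data_file "$f")" "$f"; done
```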

The 'O3' stands for Observation run #3 of the LIGO detectors.  It will be the most sensitive data yet produced and might just lead to the 'holy grail' - the first detection of continuous GW emissions :-).  Quite an exciting time to be involved with this project!

There are no actual 'workunit files' sent for tasks.  A scheduler reply contains a series of parameters that represent each task.  Your BOINC client immediately inserts these parameters into <workunit> ... </workunit> blocks inside the state file (client_state.xml) so there is no separate file on disk that you could see.  Files you could see would represent data (and transient results) and for GW there will eventually be large numbers of data files which is why disk space is important.  As tasks are completed, transient result files will disappear as soon as they are safely uploaded by the client.
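Since the tasks live as blocks inside the state file, one way to get a rough count is to count the opening tags. A minimal sketch, assuming each opening <workunit> tag sits on its own line (grep -c counts matching lines, not occurrences); the function name count_workunits is hypothetical.

```shell
# Count <workunit> blocks in a BOINC state file read on stdin.
# Assumption: every opening <workunit> tag is on a line of its own.
count_workunits() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c '<workunit>' || true
}

# Typical use on a distro install (read-only; may need sudo):
#   count_workunits < /var/lib/boinc-client/client_state.xml
```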

Cheers,
Gary.

Brandon Clark
Joined: 10 Mar 20
Posts: 9
Credit: 248204104
RAC: 0

That's interesting about the difference between the data files and the tasks. Lots more reading to do.

As for the GPUs, all systems are now running normally. On two systems the problem was that the OpenCL drivers were missing. On the other one it was just a disk space issue. That's surprising to me since that machine had the drivers already without them needing to be installed manually. Not much point worrying about it though.

Thanks to everyone for the help.
