I have an Intel PC running Ubuntu 16.04. I have two EVGA 970 GPUs installed, with a cc_config.xml file telling BOINC to use all GPUs. Each GPU was set to run 2 workunits through the appropriate preferences file. This combination has been running successfully for over two years. This computer is only used to crunch Einstein@home.
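For reference, the "use all GPUs" setting is the <use_all_gpus> option in cc_config.xml; a minimal file containing just that option would look like the following (my actual file may contain other options as well):

    <cc_config>
        <options>
            <use_all_gpus>1</use_all_gpus>
        </options>
    </cc_config>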
A couple of days ago, I noticed the daily totals had suddenly started dropping off, so I went to look. Now only one GPU was running a single workunit, and the other GPU was not running anything at all. There had been no intervention on my part between the time everything was working properly and when it started acting strange. I checked the cc_config file: it was still where it was supposed to be according to the BOINC data directory listed in the BOINC event log, and it still correctly said to use all GPUs. I opened the file in Gedit, made one small change, undid the change, and resaved the file. I then rebooted the computer, no change. I checked the output of the Event Log and it still correctly said "use all GPUs". I told BOINC Manager to re-read the config files, no change, then rebooted again, no change. The NVIDIA X Server Settings app correctly recognized both GPU cards, saying one was at 100% utilization and the other was at zero.
I took the computer offline and cleaned it of dust, just in case there were any possible overheating issues (there was not a lot of dust, but I did it anyway), and removed and reseated the GPUs. I also moved the auxiliary power plugs for the GPUs to two different outlets on the power supply, just in case there was any issue there. Turned the computer back on, no change. Turned the computer off again, removed the GPU that was not showing any activity and replaced it with a spare EVGA 1050 Ti card that I had. Turned the computer back on, no change (the NVIDIA X Server Settings app recognized the change in cards) ..... the GPU that had been running one workunit was still doing that, and the "new" GPU was still not running anything.
I then shut down BOINC Manager and all running workunits, and uninstalled it. Rebooted the computer, and reinstalled BOINC Manager from the Ubuntu software centre (so it will have been the same version of BOINC Manager both times). Opened BOINC up, no change.
If it makes any difference, the GPU that is working (albeit running only the one workunit) is card 0, and the GPU that is not showing any work at all is card 1. Not sure if that implies anything or not ..... I have not tried swapping the physical positions of the two GPUs yet.
So ... the way BOINC is currently running, it seems that it is only using the one GPU (just like the default a virgin BOINC installation assumes, one with no cc_config file), and it is also not acting on the preferences file that says to run two workunits at once. However, the Event Log says it is in fact reading the configuration and preferences files correctly; it is just not acting on the settings. Another computer that uses the same preferences file is running two workunits at once properly, so I can't blame the preferences file.
I can't think of anything else to try, except possibly to upgrade from Ubuntu 16.04 to 18.04 (which would install a newer version of BOINC manager too) and see if that makes any difference. However, I was not really ready to upgrade this computer quite yet, as in general I am happy with 16.04. I would rather try to fix this installation if I can.
Anyone have any suggestions as to what might be causing this?
Thanks, Richard
Post the first 30 lines of the event log after you restart BOINC Manager so we can see what it says. I know you say it sees the cc_config file, but you have not said anything about what BOINC says about your cards.
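For reference, the lines of interest near the top of the log look roughly like the following (the exact wording and the driver/CUDA versions and numbers will differ on your machine - this is just illustrative):

    CUDA: NVIDIA GPU 0: GeForce GTX 970 (driver version ..., CUDA version ..., compute capability 5.2, ... MB, ... GFLOPS peak)
    CUDA: NVIDIA GPU 1: GeForce GTX 970 (...)
    Config: use all coprocessors

If the second card or the "use all coprocessors" line is missing from those startup messages, that narrows things down considerably.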
I just looked through your computers list (7 listed) trying to see any mention of a GTX 970. There is no direct mention of that model but there is a machine listed as having 2 NVIDIA GPUs, at least one of which is a 1050Ti, since that's the model listed. Is that the machine you are referring to?
Also, please do what Zalster suggested so we can see exactly what BOINC thinks about the setup through the startup messages.
That is certainly worth trying. If it was just some sort of fault with the card, the 1050 Ti should have started working. If there was some problem with the PCIe slot itself, it could affect whatever card you inserted. If you went back to the original config of two GTX 970s but swapped their positions, that should help you decide if it's the card, the slot, or neither as the problem.
As a separate line of thought, the fact that the working card will only crunch a single task seems to point quite strongly not at a faulty card, faulty power, or a faulty slot, but at some software/config change, however unlikely that seems. Can you get a single card to crunch 2 tasks? Does this change to just one task after you plug in the second card? Does plugging a spare monitor (or dummy plug) into the second card make any difference?
Cheers,
Gary.
What other tasks is BOINC running?
CPU tasks? If so, what are the estimated run times for them?
If BOINC thinks it's in deadline trouble, then the priorities will change and the settings for how one wants things to run might get overruled.
To Gary: Yes, that is the only active computer using two GPUs. I have tried some experimenting with swapping cards as follows. The two GPUs are slightly different models of 970, with one slightly longer than the other, and the long one only fits in the first slot nearest the CPU ... I will refer to them as 970 (L) for the long one and 970 (S) for the shorter one:
(a) original configuration: slot 1 (nearest the CPU) 970 (L), slot 3 970 (S)
(b) next, as described above: slot 1 970 (L), slot 3 1050 Ti. Slot 1 card active with one WU, slot 3 card inactive
(c) slot 1 1050 Ti, slot 3 970 (S). Slot 1 card active with one WU, slot 3 card not active
(d) same as (c), with a dummy plug in the 970 (S). No change.
(e) slot 1 empty, slot 3 970 (S). Slot 3 active with one WU (note: this was the 970 card in the original configuration that had stopped working).
There are probably more configurations I could try, but I think this shows it is not hardware related, and that both slots are working.
Zalster: I can't seem to find a way to copy the text out of the Event Log ... I am sure there must be a simple way to do it, I just don't know what it is. I can take a screen grab of the window though ... the one I have taken this morning following a fresh reboot is with one 970 card and the 1050 Ti card. Hope this works ....
Holmis: in the original configuration, I had the 8-core CPU running 4 GPU tasks and two Gamma-ray Pulsar #5 CPU tasks (i.e. 75% utilization, but with 4 of those being GPU tasks that do not use the CPU very much, as you know). With the sudden change, it now (naturally) runs 1 GPU WU and 6 CPU tasks (with one of those CPU tasks probably running inefficiently). As of yesterday evening, BOINC was still happily downloading tasks of both types. However, I just tried the following .... turned off "allow new workunits", and then paused ALL of the CPU tasks. And what do you know, the one 970 card that I have in at the moment is now crunching two WUs!! So you guessed right :-). ......
So, it looks like something DID happen to the scheduler, although what that might be, all of a sudden after more than two years of no issues, I am not sure. And why it would happily download many more WUs of each type if it thought it was in deadline trouble, I don't know either (in the three days this has been going on, it has probably downloaded a few hundred new WUs, judging by the WU dates on the screen).
I am going to go back and put both 970s back in, cancel a bunch of the CPU WUs, and slightly reduce my cache of work units (currently 3 days with 0 additional) and see if it all gets back to normal. I will post again in a few hours to report whether all is stable.
Thanks for the suggestions and assistance.
Regards
Richard
1. Given your recent troubles, and the massive cancellation, I think it would be better for you and for the project to reduce your number of days a lot instead of slightly. I often advocate trying 0.1 + 0, and inching up slowly from there.
Running dissimilar applications on the same host gets the BOINC capability estimation see-sawing up and down all by itself. When the project actually distributes work that varies in required computation (which it has done more than once recently for Gamma-ray pulsar GPU tasks), this gets even messier. A short queue simplifies life a great deal.
2. If you were to reduce the number of CPU cores that BOINC thinks it will be using (for scheduling and launching purposes), you could better assure that your GPU tasks get enough CPU service to run smoothly. The parameter I have in mind is available on the computing preferences page of your Einstein account web pages, as:
Use at most: nn% of the processors
Keep some CPUs free for other applications.
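If you would rather set that limit locally than through the website, the same value can go in a global_prefs_override.xml file in the BOINC data directory (an alternative, not something to do in addition) - something like:

    <global_preferences>
        <max_ncpus_pct>75.0</max_ncpus_pct>
    </global_preferences>

After editing it, use the manager's "Read local prefs file" option (or restart BOINC) so the change is picked up.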
I would agree with archae86 about the cache size: start small and increase when all is working. And keep an eye on the estimated runtime to see how much it differs from reality and whether the number of tasks in the cache has a reasonable chance of finishing before the deadline.
As to copying lines from the event log, you should be able to select the relevant lines by clicking while holding down the shift key, then look in the bottom right of the window and there should be a "copy selected" button.
The other option is to open "stdoutdae.txt" in the BOINC data folder and copy the lines from there.
archae86 -- even though this particular configuration has been stable for well over two years, I agree that the recent "blip" in run times with one batch of released workunits (an extended period at "long" run times, then about two weeks at much shorter run times, then back to long) was likely the catalyst for confusing the scheduler. When the project first went to long run times, somewhere back near the beginning of this year I think, it was one change from short to long, which seemed to be handled with no worries. This time it was two changes within the normal WU deadline, long to short then short to long, and that probably was the trigger. Anyway, I have cut the cache size in half to 1.5 days but am currently running off the backlog in "no new tasks" mode and monitoring how things are going before I turn new tasks back on. Internet speeds here can be very bad on weekends (too much Netflix and such, I think, by everyone) and if I set the cache too low the machine can starve.
Hi Richard,
I'm sorry I wasn't able to join in sooner - I had commitments over the weekend that took a lot of my time. I was following the discussion and was very happy that Holmis was on the ball and spotted straight away that BOINC was in panic mode. He did that without the key bit of evidence - i.e. that, on top of the GPU tasks problem, there were six CPU tasks running instead of the intended two - which you mentioned later. Congratulations to him!
I'm also happy that you were kind enough to leave the system on NNT and work off the backlog, rather than doing a mass abort. Thanks for doing that.
In looking through all of your descriptions and comments, it's not clear to me that you have identified all the contributing factors. Sure, the GPU task crunch time variations (longer -> shorter for a small period -> longer again) are an important contributing factor, but, I think, not the most important. Sorry to say, but I believe that cache size was the most important, but not for any obvious reason that you might have foreseen. You would be quite entitled to expect a 3 day cache for a 14 day deadline to be perfectly suitable. And it may well be, once you understand all the factors and make allowances for them.
I can't be sure about that until you confirm the configuration for the 4 GPU tasks + 2 CPU tasks mix you had been using "for the last two years". My understanding from what you have said is that you were using the BOINC 'CPU cores allowed' setting of 75%. This would limit BOINC to six threads. To get 2 GPU tasks per card (4 in total) you were using a GPU utilization factor of 0.5 in the project preferences. With the default of one CPU thread per GPU task being 'reserved' for GPU support duties, this would mean that BOINC (when not in panic mode) can only use 2 of the 6 allowed threads for CPU tasks. So you end up with the 4+2 mix that you describe.
If that is an accurate description of how you had been running prior to the problem, then you are probably quite fortunate that the problem didn't occur much earlier than now. I'll wait to explain it until I know for sure how you achieved the 4+2 mix. I did wonder if you might have been using an app_config.xml file at one point because you did mention a "preferences file" rather than something like the "project preferences page" on the website. I don't want to waste time explaining a potentially non-existent issue if I'm wrong about your setup :-).
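For reference, an app_config.xml that gives 2 tasks per GPU, with one CPU thread reserved per GPU task, would look roughly like this (I'm assuming the current Gamma-ray pulsar GPU app name here - adjust it if yours differs):

    <app_config>
        <app>
            <name>hsgamma_FGRPB1G</name>
            <gpu_versions>
                <gpu_usage>0.5</gpu_usage>
                <cpu_usage>1.0</cpu_usage>
            </gpu_versions>
        </app>
    </app_config>

As far as I know, settings in a file like that take precedence over the GPU utilization factor from the website, which matters for working out how you actually arrived at the 4+2 mix.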
For now, I'll comment on the following statement - there was also a similar reference earlier - just to clear up a misunderstanding of the role of the scheduler in all this.
I agree that the crunch time variations were a catalyst, but NOT for confusing the scheduler. It was your BOINC client that was confused. The scheduler just services the requests. Basically, it assumes the client knows what it is doing and just hands over the tasks when asked.
My guess is that you were "stable for well over two years", just like you would be stable standing on the edge of a cliff - as long as you don't wobble a bit :-). Then your crunch time 'wobble' came along and ... :-).
Sorry - just a joke - completely in poor taste, I know, but I couldn't resist :-).
Cheers,
Gary.
Richard,
At the moment (6.30AM UTC - definitely sleep time in Canada) you only have 70 GPU tasks and 44 CPU tasks 'in progress'. A 'back of the envelope' calculation, assuming you have returned to your previous 4+2 configuration, says that the GPU tasks will last for less than a further 12 hours whilst the CPU tasks will all be completed in 3 days from now. Nothing will be under deadline pressure if you intend to use a 1.5 day cache size. Of course, things will be different if you are not running 4+2.
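To show the working behind that (assuming ~40 min per GPU task with 4 running at once, and a bit over 3 hrs per CPU task with 2 running at once): 70 GPU tasks / 4 concurrent x 40 min is roughly 11.7 hours, and 44 CPU tasks / 2 concurrent x ~3.2 hrs is roughly 70 hours, i.e. about 3 days.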
Since you have been steadily running down the excess work, and since there are no 'slow CPU tasks' at the moment (they've all been completed), I believe the current estimates for GPU tasks should be pretty close to reality, perhaps somewhat less than the ~40 mins that each one takes. Also, you are doing a faster breed of CPU tasks (not many completed yet) but this should also be helping to keep estimates much closer to reality. I'm basing this on my own experience.
Your CPU tasks are taking not much more than 3 hrs to complete - 7.3 tasks/core/day is what your figures are suggesting. If the estimates are anything like the real crunch times, you should have more than enough CPU tasks to fill double the 1.5 day cache you mentioned. In other words, you could expect that if you allow new tasks right now, you would get a flood of GPU tasks but no CPU tasks. This all very much depends on you having returned to the previous 4+2 setting, and on that setting having been arrived at in the way I described in my previous message.
I would like to suggest a little experiment to illustrate a point. Before you allow new tasks, could you set the cache to 1.0 days temporarily? With that setting, you'll still get lots of GPU tasks. I want you to do this to help prevent also getting lots of unwanted CPU tasks. My guess is that even though you have 3 days' supply of CPU tasks, you may still get more when you first allow new tasks at that 1.0 day setting. This is not a problem and can easily be dealt with if it happens.
If you get no CPU tasks at first, try inching up to your proposed 1.5 day setting. I think you will start to get some along the way. In any case you are going to need GPU tasks quite soon so I hope you're up early this morning :-). Depending on the results of this little experiment, I'll explain what is going on. If you do get quite a few CPU tasks above a 1 day setting, don't be concerned - it's quite fixable :-).
Also, congratulations on a nice, quite recent double milestone - 2M RAC + 2B total credit :-).
EDIT: It's now almost 8.10AM UTC and the 'in progress' GPU tasks have dropped to 58. I had measured the previous figure of 70 a little before I started writing at 6.30AM. So, 12 tasks returned in perhaps 1hr 45min or so fits in pretty well with 4 concurrent GPU tasks.
I wonder what part of Canada Richard hails from? :-). Hopefully it's Quebec :-).
Cheers,
Gary.
Hi Gary
Yes, I had been hoping for some time to be the 24th person in the known universe to reach 2B E@H credits .... and I finally managed that! Even though you have far more credits than I do, I noticed this afternoon that the basic shapes of our total credit curves (from BOINCstats) look remarkably similar. And although my last name is originally French, I actually live much, much further west than Quebec, almost splashing in the Pacific Ocean, so I will let you guess the rest .... :-)
I do have an app_config file on this machine, which is based on some comments you made on another thread with someone else a number of years ago.
I did not see your messages above until late tonight (hence responding now), so I could not run your experiment before I turned new tasks back on .... but I could try tomorrow if you wish. When I did turn new tasks back on this morning, I got some of both types of tasks, but did not compare the quantities received - certainly many more GPU than CPU, as would be expected.
Cheers, Richard