Problem with info posted by Event Log

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 689583557
RAC: 1040493
Topic 230048

Has anyone seen this situation where the two identical graphic cards are showing different "GFLOPS peak" readings?

9/5/2023 12:18:16 PM |  | OpenCL: AMD/ATI GPU 0: Radeon RX 560 Series (driver version 2906.10, device version OpenCL 2.0 AMD-APP (2906.10), 4096MB, 4096MB available, 2449 GFLOPS peak)
9/5/2023 12:18:16 PM |  | OpenCL: AMD/ATI GPU 1: Radeon RX 560 Series (driver version 2906.10, device version OpenCL 2.0 AMD-APP (2906.10), 4096MB, 4096MB available, 438 GFLOPS peak)

Thanks for any response to this.

I have been getting frequent blue screens lately and am not sure of the reason for them.

The blue screen says that dxgkrnl.sys is the problem.  I have loaded a few different versions of the drivers and none seems to make any difference.

Keith Myers
Joined: 11 Feb 11
Posts: 5020
Credit: 18920770647
RAC: 6509357

Likely points to a card that is failing or is throttling severely and shows the low reading.

Check your power connections and fan rpm and heatsink for obstruction.

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 689583557
RAC: 1040493

Thanks Keith,

That's what I've been thinking too. 

I powered up MSI Afterburner to take a look at some things and it doesn't really show that the two cards are any different, except that one is running about 10°C warmer than the other, but that is likely because it is the one on top.  Fan speeds are about 1250 rpm and temps are 73 and 63.  I used Afterburner to increase the fans to 2400 and they both cooled considerably, but it still didn't change the outcome.  Presently they are running at 52 and 47.

They both seem to be seated properly.  I guess that means one is failing.  Wish I could be sure.  I've got an RX 570 just waiting in the wings.  I'm running 3 WUs on each of them and timing is about every 50 minutes when nothing fails.  Do you think changing to 2 WUs would make any difference?

Allen

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48034680493
RAC: 35305642

Are they both running the same clock speeds?

But otherwise, if they are performing the same, I wouldn't worry too much.  It won't impact your credit/RAC at most projects that use static credit rewards, but it might impact projects using CreditNew.

_________________________________________________________________________

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118328672457
RAC: 25384245

Allen wrote:
Has anyone seen this situation where the two identical graphic cards are showing different "GFLOPS peak" readings?

I've never noticed anything like this but that's probably because it's not a metric that I pay attention to.  If I did happen to notice it, the first thing I would check would be the crunch times for tasks done on each particular GPU to see if there was a difference there.

I've just looked through many pages of results for that host and the validated tasks are showing pretty constant times of close to 51 mins.  There doesn't seem to be much variation so both GPUs seem to be performing pretty much the same.  This would seem to rule out faulty hardware or overheating causing throttling for one of the pair.  From these observations, my guess is that there is a bug in software that is causing the misreporting of the 2nd GFLOPS value.  It doesn't appear to be something to worry about.

I have 2 machines each with a single RX 560 (running x2) with crunch times around 31-32 mins (ie. ~16 mins per task) for the same type of tasks.    Your times of ~51 min at x3 (ie. ~17 mins per task) are only a little slower.  This is to be expected because in the past when I tested x1, x2, x3 situations, x2 was always the best and usually there were slower times (and greater chances of crashes) when trying to run at x3.  You should put that host back to x2 and see if the blue screens problem goes away.

You do have configuration problems much more important than an incorrectly reported GFLOPS value.  As I was looking through your tasks list, I noticed a bunch that were shown as "canceled - not started by deadline" and several more pages where you had actually aborted even more just before deadline (eg. sent on 22nd Aug, aborted 5th Sept).  This is obviously a conscious decision on your part.  You need to lower your work cache settings so this doesn't keep repeating.  You still have >1000 tasks in progress and at 17 mins per task that equates to a full 12 days of 24/7 work.  What are your work cache settings?

I'm pretty sure I've mentioned this same problem to you previously in other threads of yours and asked the same question - which I don't recall you ever answered.  Will you answer it this time? :-).

If you change back to x2 running and get a crunch time around 32 mins (ie. 16 mins per task), you could lower the theoretical total time required from approx 12 days to only 11.3 days :-).

Seriously, you need to stop requesting so much excess work.

Cheers,
Gary.

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 689583557
RAC: 1040493

Thanks all!

Gary, the reason the tasks are timing out is because the WUs start to add to the time to complete for some odd reason and then when I catch them and restart Boinc, it shows the correct time for completion.  I posted this somewhere before and no one had a solution at the time.  When the WUs start going bonkers and say that a starting completion time is 50 minutes and a day later it says 1 day, 10 hours and 30 minutes, something is definitely wrong.  I checked some Boinc files, client state,  and could find nothing that suggested what the time should be.  I think maybe Keith mentioned it to me or maybe Link, I forget just now.

So, I will change to 2 WUs each and see what happens.  Hopefully all the problems disappear and I'm done with it. What's strange is it ran for about a month without any trouble, which makes me suspicious of something else being the problem.  We'll see.

I will temporarily cut back on my cache too.

Thanks, Allen

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 689583557
RAC: 1040493

Ian&Steve C. said:

are they both running the same clock speeds?

Yes!

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 689583557
RAC: 1040493

Oh yeah, Gary, I cut my WU requests to half of what I had before.

I was running a high cache because I hate to run out when there is an unexpected Einstein shutdown.

When the system was running well, there was never any trouble completing work before it was due.

We'll see how this works out.

Thanks again,

 Allen

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118328672457
RAC: 25384245

Allen wrote:
... the WUs start to add to the time to complete for some odd reason and then when I catch them and restart Boinc, it shows the correct time for completion.

That sounds very much like a 'stuck' task - the clock is ticking but the task is not making any actual progress.

Of course, if a task has stopped progressing, the boinc client is going to show the effect of this as a steadily increasing time to completion.  When you notice the problem and decide to do a restart, the client will read the details from the last saved checkpoint and so all progress and time stats will be reset to the values they had when the checkpoint was initially saved.  This is expected behaviour - exactly what boinc was designed to do in these circumstances.

The best thing to look at is not the remaining time to completion because this is ALWAYS fictitious if the task has been stuck for a while.  You should be checking the % progress, which should be increasing every second.  Maybe tasks are getting stuck because you were trying to run 3 at once.  It's not normal behaviour.  I only see it under special circumstances (see later).

It's very easy to get a feel for the progress per second you should be seeing.  When you were running x3, the 51 min total time equates to around 0.033% per second.  Just watch the % progress (when not in the followup stage after 90%) and if you're not seeing ~0.1% increase every 3 secs (that's ~2% every minute), then something is wrong.  It really is quite obvious when a task gets stuck.

If you are now at x2, maybe stuck tasks will stop happening.  Maybe there were conflicts between the 3 executing tasks that caused one or more to become stuck.  At x2, you should be seeing ~0.1% progress every 2 secs (ie. ~3% every minute).  You just need to check occasionally to see if the problem recurs.  If you don't ever look, the machine could be stuck for days.
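That rule of thumb is easy to sketch in a few lines. This is purely an illustrative Python sketch, not anything BOINC itself runs; the wall times (51 min at x3, 32 min at x2) come from the figures in this thread, and the 25% tolerance is an assumption chosen just to make the example concrete.

```python
# Sketch of the "is this task stuck?" rule of thumb described above.
# Wall times and tolerance are illustrative assumptions, not BOINC code.

def expected_rate(total_minutes: float) -> float:
    """Expected progress in percent per second for a task that
    normally completes in total_minutes of wall time."""
    return 100.0 / (total_minutes * 60.0)

def looks_stuck(delta_percent: float, elapsed_seconds: float,
                total_minutes: float, tolerance: float = 0.25) -> bool:
    """Flag a task whose observed rate has fallen well below the
    expected rate (here: below 25% of normal)."""
    observed = delta_percent / elapsed_seconds
    return observed < tolerance * expected_rate(total_minutes)

# At x3 (~51 min per task) the normal rate is ~2% per minute;
# at x2 (~32 min per task) it is ~3% per minute.
print(round(expected_rate(51) * 60, 2))   # → 1.96 (% per minute)
print(looks_stuck(0.05, 60, 51))          # → True (only 0.05% in a whole minute)
print(looks_stuck(2.0, 60, 51))           # → False (normal progress)
```

The exact tolerance doesn't matter much; as Gary says, a genuinely stuck task is obvious because its rate collapses to near zero rather than merely slowing down.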

In my experience, stuck tasks are quite rare.  I have a large number of GPUs and a range of different makes and models.  There are just two particular GPU models that give this problem more regularly, the others, hardly ever.

The major one is Asus Expedition RX 570.  I've tried swapping this model to different hosts (both AMD and Intel) and the problem goes with the GPU.  The minimum time for the problem to show up is a few hours after a reboot.  The maximum time is around 10-15 days.  The mean is somewhere around 3-5 days.  All my hosts get automatically monitored by a script every hour so a message highlighting the issue appears on my daily driver screen.  When this happens, I just reboot the problem machine.

I have 3 units of this particular make and model GPU.  They all tend to do the same thing, with the above-mentioned example being the worst-behaving one.  It was a couple of years after purchase before they started showing the problem.

The other example is an Asus RX 460 Dual.  I have around 20 of these bought back in 2017 and recently a couple have started to get stuck tasks, but less frequently than the above example.  With these, it's probably age related since they've all been working 24/7 for well over 5 years.

Allen wrote:
I posted this somewhere before and no one had a solution at the time.

When you first started asking for advice, I gave detailed answers and asked specific questions.  As I recall, you never gave specific answers to specific questions and didn't seem to be interested in the advice so I figured I was just wasting my time.

Allen wrote:
When the WUs start going bonkers and say that a starting completion time is 50 minutes and a day later it says 1 day, 10 hours and 30 minutes, something is definitely wrong.

That is not "WUs start going bonkers".  Workunits don't say or do anything.  That is the boinc client telling you a very specific message that you are not understanding.  The client is telling you very clearly, exactly this:-

"When this task started, past behaviour indicated that it should take (perhaps very approximately) 50 mins.  However, the current rate of progress has dropped so dramatically for such a long period that it's currently more likely to take <insert huge time amount here> and I will keep reprimanding you with even longer time estimates for as long as you continue to ignore this obvious problem situation."

Please don't take the above the wrong way.  I'm deliberately trying to make it ridiculously memorable so that any other readers who see a similar situation will more likely remember what it all means.  Just enjoy a laugh at the idea of the client reprimanding you :-).

Allen wrote:
I checked some Boinc files, client state,  and could find nothing that suggested what the time should be.

Before you actually run the tasks, there is no way to know exactly "what the time should be".  It will be what it turns out to be when it's completed successfully.  The client may have a quite wrong idea of what the estimate should be.  The Devs don't really know in advance and different hardware generations/types tend to end up getting different values that aren't necessarily predictable in advance.

Once you successfully run a few, you yourself will know because E@H tends to keep tasks of a specific type fairly constant in their duration.  The client will always get it (at least somewhat) wrong when you run multiple task types since there is only a single duration correction factor that will be changing boinc's estimate in incompatible ways depending on how bad the 'baked in' estimates turn out to be.  If you must have 'always reliable' estimates on a given host, just run a single task type on that host.  Over time, the client will converge to a good estimate with the help of the single duration correction factor.
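The incompatible tug-of-war on a single duration correction factor can be illustrated with a toy model. To be clear, this is a hypothetical simplification and NOT BOINC's actual update rule; the "rise fast, fall slow" heuristic and the 0.1 decay constant are assumptions chosen only to show why one shared factor cannot fit two task types with different true runtimes.

```python
# Toy model of a single duration-correction factor (DCF).
# Hedged: not BOINC's exact algorithm, just an illustration of why one
# shared factor keeps oscillating when two task types share it.

def update_dcf(dcf: float, estimated: float, actual: float) -> float:
    """Move the factor toward actual/estimated: jump up immediately on
    underestimates, decay down slowly on overestimates (a common
    scheduler heuristic; the real client differs in detail)."""
    ratio = actual / estimated
    if ratio > dcf:
        return ratio                      # rise fast
    return dcf + 0.1 * (ratio - dcf)      # fall slow

dcf = 1.0
# Alternate two task types, truly 16 min and 50 min, both carrying the
# same hypothetical 'baked in' estimate of 30 min.
for actual in [16, 50, 16, 50, 16]:
    dcf = update_dcf(dcf, estimated=30, actual=actual)
# The factor never settles; it bounces between roughly 1.55 and 1.67,
# so one type is always over-estimated and the other under-estimated.
print(round(dcf, 3))   # → 1.553
```

With a single task type, the same loop converges instead of oscillating, which is exactly the "always reliable estimates" case described above.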

Allen wrote:
So, I will change to 2 WUs each and see what happens.  Hopefully all the problems disappear and I'm done with it. What's strange is it ran for about a month without any trouble, which makes me suspicious of something else being the problem.  We'll see.

There is nothing strange at all about this since you have indicated that you let a stuck task situation continue for a long time.  Also, you are never really "done with it" since crunching puts a lot of stress on the hardware and problems will crop up at unpredictable times.  It would be good practice to take a quick look at a machine at least once or twice a day if that machine has any indication (based on past performance) of being susceptible to having issues.  Once things settle down you may check less regularly.

Allen wrote:
I will temporarily cut back on my cache too.

Once again, you've avoided the question about what cache size you were using.  Also, why temporarily?  If you have an overly large cache size, you're just inviting future work fetch problems when boinc goes into panic mode after its estimate gets distorted on the high side (or overfetches if the estimate is too low).  Einstein tends to be quite reliable and unplanned outages of more than a few hours to a day or so are quite rare.  Your machine can complete about 44 tasks per day per GPU.  Around 200 tasks waiting to run (say around 2 days cache size) should be more than adequate, since you're only running the FGRPB1G search at the moment.  I'm currently using 1.5 days, only because I bumped it up when the electrical work was announced and haven't fully reduced it yet.  It's usually 1 day.
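The arithmetic behind that 2-day suggestion is simple enough to check. The figures below (44 tasks per day per GPU, 2 GPUs) are Gary's numbers from this post, not independently measured values:

```python
# Back-of-the-envelope cache sizing, using the figures quoted above.
tasks_per_day_per_gpu = 44
gpus = 2
throughput = tasks_per_day_per_gpu * gpus    # ~88 tasks/day for the host

cache_tasks = 200
days_of_work = cache_tasks / throughput
print(round(days_of_work, 1))                # → 2.3 days: comfortably inside a 14-day deadline

backlog = 1000
print(round(backlog / throughput, 1))        # → 11.4 days: uncomfortably close to the deadline
```

The same calculation run in reverse (desired days x throughput) gives the task count to aim for when setting the cache preference.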

Cheers,
Gary.

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 689583557
RAC: 1040493

Hi Gary.

Thanks for the very detailed explanations.

I've not been avoiding telling you my cache size.  Boinc/E@H had not been deleting WUs, as not being completed in time or that they wouldn't be completed in time, so I saw no problem until there were problems.  For a very long time I had been running Seti and Milkyway and had no problems with completing tasks on time, so I didn't think there was a problem.  I have always run with 10 and 10, the max.  Now I am just running 10.  Mind you, I didn't always run 10 and 10, until I got caught shorthanded on an unexpected lengthy shutdown sometime in the beginning with Seti.

I have been running 2 and 2 tasks overnight and it takes 1 minute to achieve 1.636% progress.  This seems like half of what you stated above.  I checked it on 2 running tasks, one from each GPU.

I've had no bluescreens since I switched to running 2 tasks per GPU.  Numbers on the elapsed/remaining columns look normal for running 3 tasks, but quite long for running on 2 per.  Don't know why.  Boinc timing says, 55 mins 20 secs per task.  Temperatures are good.

Before things began happening, I was getting upwards of 400K per day on this machine.  Hope that will continue at some point.

If it seems I have avoided answering any of your questions, please remind me and I will let you know what you are asking, if I know how to get the information you seek.

I have just been doing these projects as a contribution and have not studied how everything works.  I've never really had to dig into any of this before and just had very good luck (it seems) with not having any problems.  Everything just plain worked without any real work on my part, so......

I will continue keeping a close eye on this machine and let you know what is happening and if something happens that I don't understand, perhaps you will be able to teach me to track it down.

Thanks again,

Allen

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118328672457
RAC: 25384245

Allen wrote:
Thanks for the very detailed explanations.

You're welcome!  I'm just trying to help you understand how things work and why some of your settings are not optimal.

Allen wrote:
Boinc/E@H had not been deleting WUs, as not being completed in time or that they wouldn't be completed in time, so I saw no problem until there were problems.

From the time you started posting here there have been lots of examples of problems.  Maybe you just weren't looking closely enough to actually see them.

Allen wrote:
For a very long time I had been running Seti and Milkyway and had no problems with completing tasks on time, so I didn't think there was a problem.

Seti has been gone for a long time and I believe they had much longer deadlines so cache size wasn't an issue.  I have no relevant experience with Milkyway but from what others have said, the tasks ran quickly and without deadline issues.  The important point is that all projects are different and you can't just assume that what worked somewhere else will also work here.

Allen wrote:
I have always run with 10 and 10, the max.  Now I am just running 10.

That is just plain crazy.  Consider this.  FGRPB1G has a deadline of 14 days.  If you ask for 10+10 and the project is stupid enough to supply it, how could you possibly complete and return it all within 14 days?  I can understand that if you've experienced lots of outages with work supply at other projects you might believe that E@H is going to be the same.  Surely you've been here long enough now to know that E@H is much more reliable than the average project.

Your new 10 day setting is just as bad.  Depending on your version of BOINC (I don't know if more recent versions handle the problem better) there is a tendency for BOINC to go into 'panic mode' if it thinks there is a risk of a deadline miss.  A 10 day cache is likely to create that situation.  Here is an easy way it could happen.  From time to time (for all sorts of reasons) a single task gets delayed to the point where it takes 50 mins instead of 30 mins to complete.  As soon as it does complete, every other task in your cache will be assumed by BOINC to also be going to take 50 mins.  The result is that the total estimated times will now add up to something like 17 days and BOINC will immediately start doing crazy stuff.
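That 17-day figure can be reproduced with simple arithmetic. The per-GPU division below is an assumption about how the totals work out (the real client's work-fetch and scheduling logic is considerably more involved):

```python
# Sketch of the panic-mode arithmetic described above.
# Assumption: outlook ~= (queued tasks x estimated minutes) split per GPU.
deadline_days = 14
queued = 1000
gpus = 2

def outlook_days(minutes_per_task: float) -> float:
    """Total estimated days of work the client thinks it is holding."""
    return queued * minutes_per_task / gpus / (60 * 24)

normal = outlook_days(30)         # ~10.4 days: fits a 10-day cache, barely
inflated = outlook_days(50)       # ~17.4 days: one slow task re-scaled every estimate
print(round(normal, 1), round(inflated, 1))
print(inflated > deadline_days)   # → True: deadline-miss risk, hence panic mode
```

The point of the sketch is how little it takes: the queue didn't grow at all, yet a single slow completion pushed the estimated total from under the deadline to well past it.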

Until you know for sure that your hardware is reliable and you're not going to keep seeing stuck tasks, do yourself a huge favour and set your cache to 2 + 0 days.  In fact, even smaller would be better.

Allen wrote:

I have been running 2 and 2 tasks overnight and it takes 1 minute to achieve 1.636% progress.  This seems like half of what you stated above.  I checked it on 2 running tasks, one from each GPU.

I've had no bluescreens since I switched to running 2 tasks per GPU.  Numbers on the elapsed/remaining columns look normal for running 3 tasks, but quite long for running on 2 per.  Don't know why.  Boinc timing says, 55 mins 20 secs per task.  Temperatures are good.

When you finally advised that you still had a 10 day cache setting, my immediate thought was, "surely BOINC is already in panic mode??"  The longer and variable times you are now reporting would seem to suggest this.  I've had a look just now at your tasks list and can see the longer (and variable) times as well as the fact that there are a whole bunch more that have been canceled as a result of missing the deadline.

Are you running BOINC Manager in "Simple View" mode rather than "Advanced View"???  If you are, that would explain a lot.  Please click on View->Advanced (shift+ctrl+A) to get to Advanced view.  Click the "Tasks" tab and you will see the first page of tasks.  There should be 4 tasks running but there may well be others (perhaps many others) that have started and are showing partial completion percentages.  This is a characteristic of BOINC in panic mode.

To be sure you are seeing all 'partial' tasks, click the progress column title (all running tasks may disappear) and click it again and all partially completed tasks (running or not) should be visible in decreasing order of the % done.  If you have more than 4 of these showing, this will be why your times are all askew.  Welcome to panic mode.  If you are in panic mode then carefully read the following instructions.

After seeing if there are a bunch of partial tasks, click (and click again if necessary) the deadline heading so that those tasks closest to their deadline are at the top of the list.  Just realise that clicking a column heading orders the data in either ascending or descending order.  The second click reverses the order.  This will be important for the next stage.

To get out of panic mode, you have to suspend the vast majority of the 'in progress' tasks.  That's a huge number since it was close to 1000 last time I looked.  Start by clicking the task in the bottom line of the first page.  It will become highlighted.  Grab the scrollbar at the right and drag it down to the bottom.  You will then be on the very bottom of the complete list of tasks.  While holding the shift key on your keyboard pressed, click the very bottom task and all the intervening tasks should become highlighted.  This will be a seriously large number of tasks - somewhere close to the 1000 mark.  Over on the 'commands' section to the left of all the highlighted tasks, click the 'suspend' button.  Suspending all these tasks should immediately get BOINC out of panic mode.

Once you have all these selected tasks suspended, drag the scrollbar back to the top and take a look at all the tasks on the first page.  If you have partially completed tasks, apart from the ones currently running, you will want to write down a list of the ones closest to deadline so that they get to run immediately after the 4 that are currently running.  The auto-canceling mechanism will not touch any tasks that have already started so if you have some of these partials, you should try to complete them in a 'most urgent first' type of order.  You can use suspend/resume controls to create that order.  If you don't already have a vast number of partials it's relatively easy to do that.

For any that haven't started, look carefully at the deadline displayed (in local time) for each one and select any that are less than a certain amount away from the deadline.  You can choose the appropriate amount by doing a quick assessment of how long it will take to do the partials identified in the previous paragraph.  For example, if it's going to take 5 hours to do all the partials, any that are going to expire in less than that should be aborted immediately, otherwise they'll just get auto-canceled.  CTRL+click will allow you to add them individually into a single selection which can then be aborted with a single click.

Once you have aborted any that can't start before the deadline, you'll need to rinse and repeat for 2nd page tasks that were previously suspended and are now visible on the first page.  You need to keep aborting until there is sufficient time for those left to get to the top of the queue and start before getting auto-canceled.

You should realise that I'm just guessing what the real situation might be and it easily could be something entirely different.  At the end of the day, you'll need to assess the situation and make the appropriate decisions.  I'll just continue the story assuming the panic mode theory is correct.

Once you have aborted any that can't make the deadline, you will (over time) need to slowly resume enough suspended tasks to allow the crunching to continue.  If you resume too many at once, BOINC will likely go back into panic mode.  The crunch times should become more regular and be closer to the 30-35 min range.  Resuming 40 tasks is about 5 hours worth, so if that goes well and crunch times are as assumed, just come back in 5 hours and resume another 40 tasks.
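The pacing of those batches follows directly from the figures in this thread (4 tasks running at once, i.e. 2 per GPU, at roughly 32 min each):

```python
# How long a resumed batch of 40 tasks takes to drain, using the
# figures assumed in the post above.
batch = 40
concurrent = 4            # 2 GPUs x 2 tasks each
minutes_per_task = 32

hours_per_batch = batch / concurrent * minutes_per_task / 60
print(round(hours_per_batch, 1))   # → 5.3 hours per batch of 40
```

If crunch times turn out longer or shorter than 32 min, just rescale the interval between resumes accordingly.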

Allen wrote:
Before things began happening, I was getting upwards of 400K per day on this machine.  Hope that will continue at some point.

You should understand that a RAC of 400K is quite a low value for dual RX 560 GPUs.  I've just had a look through the hosts I have with dual GPUs.  There was one that I thought was dual RX 460 units but on interrogating the machine over the LAN, I find it's actually dual 560s and should be comparable to yours.  I have a script I can run to probe a machine and get a report on the basic performance characteristics as a printout on my daily driver screen.  Here is the output from that machine.  For privacy reasons, I've masked the host ID.  The rest is actual current data as at the time shown.

[gary@eros ~]$ ssh $H99
[gary@g4560-03 ~]$ psa

Hostname: g4560-03   Host ID: 000000   GPU: Baffin   Checked: Thu Sep  7 07:07:17 PM AEST 2023
Disk:  dos/Legacy   Size: 111.8 GiB    Root (Tot/Free): 15G/8.8G   Home (Tot/Free): 93G/88G
Up: 5d 5h 22m 36s  BOINC: 7.16.11  OCL: 20.40    Kernel: 5.4.115-pclos1  On_Frac: 0.999782

          PID  Run_Time  Process_Name         Compute_Progress
         4274  00:13:09  boinc                          --
        16055  00:01:16  hsgamma_FGRPB1G   Frac_done: 0.554906
        16061  00:01:03  hsgamma_FGRPB1G   Frac_done: 0.445446
        16075  00:00:14  hsgamma_FGRPB1G   Frac_done: 0.095094
        16082  00:00:07  hsgamma_FGRPB1G   Frac_done: 0.034038

Current Estimate for Crunch Time: 28m 37s  [1717 secs]

Current credits:-  User Tot: 104.83B   User RAC: 62.4M   Host Tot: 1.21B   Host RAC: 658.4K

[gary@g4560-03 ~]$

As you can see, I just log in over ssh to any host on my LAN (in this case a machine named g4560-03) from my daily driver.  The H99 is just an alias pointing to the host with a static IP address whose final octet is .99.  psa is just a shell script to print out the information shown.  I've color-coded it here basically the same as on my screen (just missing a few highlights).  The stuff in red shows the actual search strings used to find the boinc/EAH processes of interest.  The script just pulls the information from the state file using the boinccmd utility that comes with BOINC.
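For anyone wanting a similar check without writing a full monitoring script, something in this spirit can be put together around boinccmd (the utility that ships with BOINC). The "fraction done:" field parsed here is an assumption to verify against your own client version's output, and the task names in the demo string are made up:

```python
# Minimal sketch in the spirit of the 'psa' script described above:
# extract per-task progress from `boinccmd --get_tasks` output.
# Output field names may vary between BOINC versions - verify locally.
import subprocess

def parse_fractions(text: str) -> list[float]:
    """Pull 'fraction done' values out of boinccmd --get_tasks output."""
    fractions = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("fraction done:"):
            fractions.append(float(line.split(":", 1)[1]))
    return fractions

def running_fractions(host: str = "localhost") -> list[float]:
    """Query a live client; needs boinccmd on PATH and RPC access."""
    out = subprocess.run(["boinccmd", "--host", host, "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    return parse_fractions(out)

# Demo on canned output (hypothetical task names, shaped like the
# printout above):
sample = """\
name: LATeah_demo_task_1
fraction done: 0.554906
name: LATeah_demo_task_2
fraction done: 0.445446
"""
print(parse_fractions(sample))   # → [0.554906, 0.445446]
```

Run periodically (say, from cron), two snapshots a few minutes apart whose fractions haven't moved would flag a stuck task automatically, which is essentially what the hourly monitoring described above does.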

You will note that the average crunch time for both GPUs running x2 is well under 30 mins.  This is the sort of time you should be able to achieve if you follow the advice.  I don't think Linux will be that much faster than Windows to make something around the 30-32 min mark impossible for you.  Notice the current RAC - almost 660K.  Admittedly it's had a bit of a boost from the transition to the 3000 series tasks and you have so much of a backlog that you're still crunching the last of the 4000 series.  Your comment about a 400K RAC is way too low for what your machine should be able to achieve.  You should be aiming for at least 550-600K.  Just remember, when you get it working correctly, it will take quite a while to climb to that level, by which time the whole FGRPB1G series will be history :-).

Allen wrote:
I have just been doing these projects as a contribution and have not studied how everything works.

Your contribution to the project is certainly appreciated so please don't take any of the above as any sort of criticism.  It takes time and effort to become familiar with BOINC's characteristics and that is exactly why you should pay serious attention to experienced users who have seen all these potential issues previously.  As you can see, I've put a serious effort into trying to help explain things simply because I remember what it was like before earlier volunteers helped me.

Allen wrote:
I've never really had to dig into any of this before and just had very good luck (it seems) with not having any problems.

I hate to disillusion you but I'm firmly of the opinion that you've had continuous problems most of the time and you just haven't realised it.

The thing that most makes me suspect this is that you continue to assert that you haven't really had problems previously.  You certainly must have; you're just not looking in the right places to see them.  Just set your work cache to 2 days or less, deal with tasks that can't get started before the deadline, and spend time playing with what you can see in "Advanced view" if you haven't been using it previously.

I haven't properly checked all of the above so I'll fix any obvious mistakes I find when I review it all tomorrow.  I've gotta go right now :-).

Cheers,
Gary.
