A perplexing GPU problem ....

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119449422105

RAC: 25953022


25 Sep 2018 4:49:27 UTC

Message 166968 in response to message 166967


Hi Richard,
Thanks for responding. Don't worry about the "experiment". It was a 'once off' opportunity whilst you had a low number of tasks on board. The most important thing (hopefully) is that you are probably back to pretty much normal by now. I looked at your tasks on that host when I first started this morning and saw the time when the new lot started to arrive so I'd guessed that you were just a tad west of Quebec :-).

Richard de Lhorbe wrote:

I do have an app_config file on this machine, which is based on some comments you made on another thread with someone else a number of years ago.

OK, I presume you are using that in place of a GPU utilization factor? Would you mind telling me what particular parameter values you use in that file? I guess you would have gpu_usage and cpu_usage and perhaps even max_concurrent - or even others. I had a look recently at the docs and saw more options than there used to be when I first started using it.

I'm still confident there will be a way for you to make an adjustment so that the risk of getting BOINC in a panic can be reduced, even with a larger cache than 1.5 days. Is 1.5 days what you intend to use or are you planning to go higher?

Cheers,
Gary.

Richard de Lhorbe

Joined: 15 Dec 05

Posts: 46

Credit: 9602680843

RAC: 1263489


27 Sep 2018 3:19:38 UTC

Message 166987


Hi Gary

The app_config file is not easy to transfer over to the iPad I am typing this on, but the crux of the info is gpu_usage is set to 0.5 and the cpu_usage is set to 0.4. That was your recommendation from the older thread. I would be interested to hear what your current thoughts are.

With the current mix of gpu and cpu workunit time requirements, and 1.5 day cache, the computer certainly seems quite content.

Regards,

Richard

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119449422105

RAC: 25953022


29 Sep 2018 1:06:18 UTC

Message 167031 in response to message 166987


Richard de Lhorbe wrote:

... the crux of the info is gpu_usage is set to 0.5 and the cpu_usage is set to 0.4. That was your recommendation from the older thread.

I've made a number of 'recommendations' to various people at various times, but all designed for the particular circumstances of that person. It's not a 'one size fits all' solution.

I'm struggling to understand how you got the 4+2 outcome with those numbers. Previously, you had mentioned:-

Quote:

I had the 8-core CPU running 4 GPU tasks and two Gamma-ray Pulsar #5 (i.e. 75 % utilization, but with 4 of those GPU tasks that do not use the CPU very much, as you know). With the sudden change, it now (naturally) runs 1 GPU wu and 6 CPU tasks (with one of those CPU tasks running probably inefficiently).

I said at the time that I assumed that the '75% utilization' meant you were using a 75% setting for the 'allowed BOINC cores' preference setting. I don't think you ever corrected that assumption but knowing that the cpu_usage factor is supposed to be 0.4, something can't be correct. Here is the reasoning.

With 4 GPU tasks, the 0.4 cpu_usage would 'reserve' 4 x 0.4 = 1.6, which translates to one reserved thread (the number is always rounded down to an integer). To have only 2 CPU tasks running (and just one thread reserved by GPU tasks) on 8 threads, you must have somehow taken a further 5 threads out of play (8 - 1 - 2 = 5). You could easily have done that by setting the %cores setting to 37.5%, which means only 3 of the 8 threads are available to BOINC. That seems unlikely, since BOINC would then be fetching CPU tasks for only 3 threads, which is hardly likely to put BOINC into panic mode over excess CPU tasks. Also, 37.5% would make it impossible to have 1 GPU task with 6 CPU tasks when the BOINC panic occurred. During the panic, BOINC would only be allowed to run 3 CPU tasks.

So how could 6 CPU tasks be running during a panic? The BOINC % cores must be at least 75%.
So how could 2 CPU tasks be running normally? The cpu_usage must be 1.0 (not 0.4) to reserve 4 threads.
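
The thread arithmetic above can be sketched in a few lines of Python. This is only a simplified model of the rule as described in this post (threads reserved = number of GPU tasks times cpu_usage, rounded down), not BOINC's actual scheduling code. The function name and its parameters are made up for illustration.

```python
import math

def cpu_tasks_running(total_threads, pct_cores, gpu_tasks, cpu_usage):
    """Simplified model: threads BOINC may use, minus threads reserved by GPU tasks."""
    # Threads BOINC is allowed to use under the '% of cores' preference.
    allowed = math.floor(total_threads * pct_cores / 100)
    # Threads reserved by GPU tasks; the tiny epsilon guards against float round-off.
    reserved = math.floor(gpu_tasks * cpu_usage + 1e-9)
    return max(allowed - reserved, 0)

# 75% of 8 threads, 4 GPU tasks, cpu_usage 0.4: 6 allowed - 1 reserved
print(cpu_tasks_running(8, 75, 4, 0.4))    # 5
# Same, but cpu_usage 1.0 (the assumed project default): 6 allowed - 4 reserved
print(cpu_tasks_running(8, 75, 4, 1.0))    # 2
# 37.5% of 8 threads, cpu_usage 0.4: 3 allowed - 1 reserved
print(cpu_tasks_running(8, 37.5, 4, 0.4))  # 2
```

Under this model, a 75% setting with cpu_usage 0.4 would give 5 CPU tasks, not the 2 that were reported, which is why the numbers don't add up.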

In the light of that, my guess is that you may not be using an app_config.xml at all and the cpu_usage is just the project default of 1.0. You may have installed the file but are you sure that all syntax is correct? Do you have the correct app name? There has got to be something wrong somewhere.
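
One quick way to rule out a syntax problem is to run the file through any XML parser. The following is a minimal sketch using Python's standard library. The helper name app_names is hypothetical, and the sample text uses the app name from the app_config.xml shown later in this post. It only checks that the XML is well formed and lists the app names it finds. It cannot tell you whether BOINC accepts those names.

```python
import xml.etree.ElementTree as ET

def app_names(xml_text):
    """Return the <name> of every <app> element; raises ET.ParseError on bad syntax."""
    root = ET.fromstring(xml_text)
    if root.tag != "app_config":
        raise ValueError(f"unexpected root element: {root.tag}")
    return [app.findtext("name", default="").strip() for app in root.findall("app")]

# Sample content; in practice you would read your own app_config.xml instead.
sample = """
<app_config>
    <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.4</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
"""

print(app_names(sample))  # ['hsgamma_FGRPB1G']
```

If the parser raises an error, or the name printed does not match the application you are actually running, that would explain why the file appears to have no effect.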

Richard de Lhorbe wrote:

I would be interested to hear what your current thoughts are.

With the current mix of gpu and cpu workunit time requirements, and 1.5 day cache, the computer certainly seems quite content.

You could still get further problems. With a 1.5 day cache and a 14 day deadline, it doesn't seem likely that the previous situation would occur again, but it could. It's certainly not optimal. It's likely that you will continue to have quite a bit more than 1.5 days of CPU work which could trigger another panic. You could easily calculate this for yourself. Just count the number of 'in progress' CPU tasks. Take half that number (you are crunching two at a time) and multiply that by the actual crunch time (not the estimate). That's how long your CPU tasks will last. How much more than 1.5 days is that?

Having laid out the method, I decided to see for myself. I looked at your in progress tasks - there were 70 at the time.
Current completion time (0039F tasks) = ~10,000 sec
Total time to complete = 35 x 10,000 sec = 350,000 sec = ~4 days (two at a time, assuming they all take 10,000 sec).

However, there is a new data file 0040F and you have started to get tasks for this. They are going to take a lot longer, my guess is about the same time as 0034F, which took close to 25,000sec - check your completed tasks - you still have a couple left. So if you keep going as you are, worst case is you will stay at about 70 in progress and at the point of changeover you will have:-

Time to complete = 35 x 25,000 sec = 875,000 sec = ~10 days (assuming all the new tasks at that point take 25,000 sec).
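
If you want to redo this calculation as the task count or crunch time changes, here is a small sketch of the method described above. The function name is just for illustration, and it assumes every task takes the same time and that the number of concurrent CPU tasks stays fixed.

```python
SECONDS_PER_DAY = 86_400

def days_of_cpu_work(tasks_in_progress, concurrent_tasks, seconds_per_task):
    """Days needed to work through the queue, running a fixed number of tasks at a time."""
    batches = tasks_in_progress / concurrent_tasks
    return batches * seconds_per_task / SECONDS_PER_DAY

# 70 tasks, 2 at a time, ~10,000 sec each (0039F tasks)
print(round(days_of_cpu_work(70, 2, 10_000), 1))  # 4.1
# 70 tasks, 2 at a time, ~25,000 sec each (guess for 0040F tasks)
print(round(days_of_cpu_work(70, 2, 25_000), 1))  # 10.1
```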

This is not a comfortable position to be in. You are in this position, not only because of the drastic changes in crunch time but more particularly because BOINC is fetching for probably 6 cores and crunching with only 2. The 1.5 day setting giving about 4 days of work (close to the 4.5 days that a factor of 3 would predict) just about proves that factor of 3. It will become much worse when the DCF (duration correction factor) induced factor kicks in later, when the first longer running task is completed.

For the moment, you should reduce your cache to say 0.5 days (a true 1.5 days of CPU work) temporarily to allow the excess CPU tasks to reduce. Leave it there until the first longer running 0040F task has reset all the estimates. At that point you could go back to 1.5 days (if you wanted) but you would need to monitor closely in future for any more crunch time variations like those we have just been through.
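
The relationship between the cache setting and the real amount of CPU work can be sketched as below. This is a rough model only, assuming BOINC fetches work for 6 threads while only 2 are crunching, as guessed above. It ignores any DCF effects, and the function name is made up for illustration.

```python
def true_days_of_work(cache_days, threads_fetched_for, threads_crunching):
    """Rough estimate: work fetched for N threads but crunched on fewer lasts N/M times longer."""
    return cache_days * threads_fetched_for / threads_crunching

print(true_days_of_work(1.5, 6, 2))  # 4.5
print(true_days_of_work(0.5, 6, 2))  # 1.5
print(true_days_of_work(1.5, 2, 2))  # 1.5
```

The last line shows the case where BOINC fetches for the same 2 threads it crunches with, so the setting and the real amount of work agree.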

If you would really like a 1.5 day cache setting to stay much closer to 1.5 real days of CPU work, you need to convince BOINC to just fetch for 2 threads, not 6. Because that machine is only crunching for Einstein, there are no other projects to worry about. Here is the app_config.xml that will achieve this.

<app_config>
    <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.2</cpu_usage>
        </gpu_versions>
    </app>
</app_config>

You must also set the %cores that BOINC is allowed to use to 25%, so that only 2 out of 8 threads are available to BOINC. The cpu_usage of 0.2 ensures that the 4 GPU tasks will not cause a further thread to be reserved since 4 x 0.2 = 0.8 < 1.

You could install app_config.xml instead of reducing the cache to 0.5 if you wished. However, you should work out first why your previous app_config.xml didn't appear to be working. If you are in any doubt, reduce your cache setting immediately while you research that issue.

Cheers,
Gary.
