Ian&Steve C. wrote:
For context, I'm running 5 tasks at a time per GPU.
The Titan V uses about 135 W; the V100 uses about 165 W (power limited).
hadron wrote:
I have Einstein configured to run 5 O3AS and 4 BRP7 concurrently, but BOINC has so far only allowed 4 GPU tasks at a time.
However, I had left the resource shares alone on all 3 projects until I could see where this was going to head. Now my Einstein RAC is 51K and climbing, while the one at LHC is struggling to stay above 22K -- which means that my client is giving priority to LHC (no Rosetta tasks are available at the moment). This is, of course, impossible.
I just set the LHC and Rosetta resource shares to 1/5 of Einstein's. Given the number of Theory tasks I currently have running or waiting, it should be a few hours before I can see the results.
No idea here what power the GPU is drawing; my monitor doesn't show that. All I know (from my UPS monitor) is that I'm now drawing 380 W total -- but alas, I can't remember what it was before the card went in.
In reply to hadron:
Try typing this command in your terminal:
nvidia-smi -i 0 --loop-ms=1000 --format=csv,noheader --query-gpu=power.draw
since you only have one GPU. This will show the GPU power draw once per second.
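If you also want GPU utilization, memory use, and temperature in the same readout, the same query can be extended; these field names are listed by nvidia-smi --help-query-gpu:

nvidia-smi -i 0 --loop-ms=1000 --format=csv --query-gpu=timestamp,power.draw,utilization.gpu,utilization.memory,memory.used,temperature.gpu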
KLiK wrote:
Guys, we all know the advantages of running multiple WUs (at least 2) for O3AS -- especially since those (let's call them) saves from 49.5~50% and from 99.5~100% take a lot of CPU time.
But what are the advantages of running multiple BRP7 tasks? Does anybody get more throughput?
Ian&Steve C. wrote:
In most normal cases, BRP7 will run best at 1x. Running more just makes the jobs take more than n times longer, which means less overall production.
Personally, I'm running BRP7 at 3x, but only because I've done some custom tweaking to limit each job to 40% GPU usage (with MPS, which is only possible on Linux).
hadron wrote:
I never felt I really understood what you guys mean by 1x, 3x, etc. The best I can come up with is that it's the number of concurrent tasks. If that is so, it seems counter-intuitive to me: how could running only 1 task at a time be more efficient than 3, if you have the resources to run 3?
GWGeorge007 wrote:
Maybe this will help you understand. Type watch -n 1.0 nvidia-smi into a terminal window.
Then watch your GPU utilization; once it reaches 100%, you are likely at your most efficient use of GPU time. When comparing your times to a single (i.e. 1x) task, each time you increase the task count (2x, 3x, 4x, etc.), divide the run time by the number of tasks on the GPU. This gives you the effective time per task compared to running just one. If it ends up being MORE than your single-task time, lower the 2x/3x/4x until it is less than a single task.
For instance, here is my watch -n 1.0 nvidia-smi in a terminal window.
Note that I'm running two GPUs, with 3 Einstein@Home O3AS tasks on GPU:1 and 2 GPUGrid tasks on GPU:0.
Even though GPU:0 (RTX 3090 Hybrid) has 24 GB of VRAM and I am only using 3.8 GB of it, I am at 100% GPU usage, or "GPU Utilization". That is your target.
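As a quick sketch of that divide-by-n comparison (the 30 and 50 minute figures are made-up placeholders; plug in the times you actually measure at 1x and at Nx):

# effective minutes per task at Nx = (time at Nx) / N; compare against the 1x time
awk -v t1=30 -v tN=50 -v n=2 'BEGIN {
    eff = tN / n
    printf "1x: %.1f min/task   %dx: %.1f min/task -> %s\n", t1, n, eff, (eff < t1 ? "worth it" : "back off")
}'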
hadron wrote:
Strange -- with O3AS settings of 1 CPU and 0.16 GPU, BOINC never had more than 4 GPU tasks running concurrently. I was playing around with the GPU setting to see how that shows up in the nvidia-smi data -- no change at all to the percentage (100%) -- but then I inadvertently set it to 0.1. Instantly, BOINC started 4 more Einstein tasks -- 3 O3AS and 1 BRP7. The nice thing is they all fit into the GPU memory, with about 370 MB to spare. I think I'll leave it like this, unless the GPU temperature goes up too much.
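For reference, if those per-task numbers come from an app_config.xml rather than the project's web preferences, a minimal sketch looks like the following (the file path and the app name are assumptions -- check the project directory and client_state.xml on your own machine):

cat > /var/lib/boinc-client/projects/einstein.phys.uwm.edu/app_config.xml <<'EOF'
<app_config>
  <app>
    <name>einstein_O3AS</name>       <!-- hypothetical app name; use the real one from client_state.xml -->
    <gpu_versions>
      <gpu_usage>0.1</gpu_usage>     <!-- 0.1 lets BOINC count up to 10 tasks per GPU -->
      <cpu_usage>1.0</cpu_usage>     <!-- reserve 1 CPU core per task -->
    </gpu_versions>
  </app>
</app_config>
EOF

After editing, re-read config files from the Manager (or restart the client) for it to take effect.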
I can see the 3 O3AS tasks on GPU1, but on GPU2, isn't there a 4th O3AS task? And I don't see anything that I think would be a task from some other project.
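One way to check what is actually running where, rather than eyeballing the memory numbers, is the per-process listing (the process names show which project's app each one is):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

The plain nvidia-smi output also has a Processes table at the bottom that shows the GPU index next to each PID.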
KLiK wrote:
Well, think of it this way:
- you have a card with a decent amount of VRAM, so the number of concurrent tasks isn't limited by memory
- a single task crunching on the GPU takes 30 min
- two concurrent tasks (CT) on the GPU take 50 min, but you produce 2 WUs
- three CT on the GPU take 100 min, but you produce 3 WUs
- and production drops even further with more CT
When you compare those numbers, you get 30 min/WU at single CT, 25 min/WU at dual CT, and 33.3 min/WU at triple CT. So over a day of run time (which is more or less what RAC reflects), you can complete 48 WUs at single CT, 57.6 WUs at dual CT, and 43.2 WUs at triple CT. That makes dual CT the most efficient way to run that card!
Do you get it now? ;)
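The same arithmetic as a quick one-liner, using the example times above (1440 is the number of minutes in a day):

# batch time in minutes and tasks per batch, per the example above
for t in "30 1" "50 2" "100 3"; do
    set -- $t
    awk -v m="$1" -v n="$2" 'BEGIN { printf "%dx: %.1f min/WU, %.1f WUs/day\n", n, m/n, 1440/(m/n) }'
done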
Ian&Steve C. wrote:
Sometimes (not always -- it's totally application dependent and, to a degree, GPU dependent), running more than 1 just puts undue stress on the GPU in a way that slows down all tasks. For example, when a single task already utilizes 95+% of the GPU core, as BRP7 does, you lose time to excessive context switching between tasks in the GPU scheduler.
BRP7 acts this way. I haven't run it in some time, since I'm focusing only on the O3AS tasks right now, but as an example: say 1 task running alone took 5 minutes; if you ran two tasks concurrently, each would take 11 minutes (5.5 minutes effective), which is fewer tasks per day than just staying at 1x.
The exception to this with BRP7 is when you throw CUDA MPS into the mix (ref: https://docs.nvidia.com/deploy/mps/). You can get better-than-1x performance if you cut the active thread percentage (ATP -- what percentage of the GPU core each task may use, rounded to the nearest SM) and run more than 1 task with slight over-provisioning. For example, 3x tasks at 40% ATP, or 2x tasks at 70% ATP, are generally faster than 1x at 100% ATP. MPS is only available for CUDA workloads (OpenCL tasks will crash/fail while MPS is running) and only in the Linux driver. Without MPS, 1x is the best config.
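For anyone who wants to try this, a minimal sketch of that MPS setup on Linux (the GPU index, the 40% figure, and the boinc-client service name are assumptions to adapt; the control commands themselves are from the MPS documentation linked above):

# stop the client so the science apps restart under MPS afterwards
sudo systemctl stop boinc-client
# start the MPS control daemon for GPU 0 and cap each client process at 40% of the SMs
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d
echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control
sudo systemctl start boinc-client
# later, to shut MPS down again:
# echo quit | nvidia-cuda-mps-control

This assumes the BOINC client and the MPS daemon can share the default pipe directory; the per-user details are covered in the MPS docs.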
The BRP7 app is also fairly memory bound: lots of random access to memory slows down how fast the GPU can process. If the memory bus is maxed out (not VRAM occupancy -- I'm talking about the bandwidth between the GPU core and its memory), adding more tasks generally doesn't help. The 40-series cards are pretty neutered in terms of memory bandwidth, since Nvidia is trying to do "more with less" by having AI/ML pick up the slack in gaming workloads, which lets them reduce cost and complexity in the hardware itself. Your card will be much better suited to running only the O3AS tasks, in my opinion; running BRP7 alongside will slow them all down. I would set it up more like a prime/backup arrangement, where you run BRP7 only when O3AS work is not available. But that's just my opinion -- you can ultimately run it however you like.
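To check whether a card really is memory-bandwidth bound, a quick way (assuming a reasonably recent driver) is the device-monitor view:

nvidia-smi dmon -s u    # the 'mem' column is the % of time the memory controller was busy

If 'mem' sits near its ceiling while 'sm' still has headroom, adding more concurrent tasks is unlikely to help.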