Hi Mikey,
I have canned the BRP5 tasks and now run BRP4G on the GPU, with no CPU tasks for E@H.
So I expect DCF will stabilise after a few more WUs have been crunched.
The only thing now is that I may possibly have made an error in upgrading to the latest version of BOINC, which may or may not lead to increased crunching times for BRP4G tasks. I'm keeping a beady eye on running times :-)
[edit]
I've just reduced the number of CPU cores to 33% in the hope that it may improve GPU throughput.
Quote:
I haven't done any BRP4G for a very long time but my impression when I last did was that the ratio between estimate and actual was pretty much the same for both searches (meaning BRP4G and BRP5).
I just gave it a try and allowed BRP5 for a host that had been running only BRP4G for a while, but I noticed BRP5 tasks come with an estimate of 450,000 GFLOPs vs 280,000 for BRP4G. That's a factor of 1.6 and far from the real ratio, which should be somewhere around 3, taking into account that both applications are rated at the same speed. Under these circumstances a common DCF just can't work. I'll revert to the strategy of running only one type of GPU task on any machine and no CPU tasks with it.
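To see why, here's a toy calculation. The client estimates a task roughly as (flops estimate / device speed) x DCF, with a single DCF per project. The flops figures below are the ones above; the device speed is made up, and the runtimes just assume the ~3:1 real ratio, so this is only a sketch:

    # Why one DCF can't serve two apps whose estimate ratio (1.6)
    # differs from their real runtime ratio (~3).
    speed = 50.0                                  # GFLOP/s - made-up device speed
    est   = {"BRP4G": 280_000, "BRP5": 450_000}   # GFLOPs per task (from above)
    real  = {"BRP4G": 8_000, "BRP5": 24_000}      # seconds - illustrative 3:1 ratio

    for app in est:
        naive = est[app] / speed     # estimated seconds with DCF = 1
        dcf = real[app] / naive      # the DCF that would make the estimate exact
        print(f"{app}: needs DCF {dcf:.2f}")

    # BRP4G needs ~1.43, BRP5 needs ~2.67 - no single value fits both, so
    # the client's one DCF keeps swinging and both estimates stay wrong.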
Quote:
I just gave it a try and allowed BRP5 for a host that had been running only BRP4G for a while ... I'll revert to the strategy of running only one type of GPU task on any machine and no CPU tasks with it.
I have multiple machines, each with a GPU in it, and I only leave one CPU core free for each GPU unit I run. So if I am running 1 GPU unit on the GPU then I only leave one CPU core free, but if I am running 2 GPU units at once on the GPU, which I am here, I leave 2 CPU cores free just to keep the GPU fed and happy. Most of my machines are quad or 6 core machines, so that still leaves me plenty of CPU cores to crunch with. I also never crunch the same project using both the CPU and GPU in the same machine; I always run one project on the GPU and a different project on the CPU. I have, for instance, run Einstein on the GPU on one machine and some CPU project on the same machine, and then reversed the process on a second machine. But right now I am focusing on a CPU project, so most of my CPUs are running it, and it does not have a GPU app.
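In case anyone wants to set up the same thing, the usual way to run two tasks per GPU and reserve a core for each is an app_config.xml in the project's directory (recent BOINC 7 clients read it). This is only a sketch - the app name einsteinbinary_BRP4G is my guess, so check the exact name in your client_state.xml first:

    <app_config>
      <app>
        <name>einsteinbinary_BRP4G</name>
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>  <!-- two tasks share each GPU -->
          <cpu_usage>1.0</cpu_usage>  <!-- reserve a full core per GPU task -->
        </gpu_versions>
      </app>
    </app_config>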
Quote:
... I'll revert to the strategy of running only one type of GPU task on any machine and no CPU tasks with it.
Quote:
I also never crunch the same project using both the CPU and GPU in the same machine. I always run one project on the GPU and a different project on the CPU.
Right, that's what I wanted to say: no CPU tasks for the same project you're running on the GPU, and only one type of GPU task if you want to get rid of that DCF issue. Though I'd sometimes like to be able to run a wider variety.
Of course there's no reason not to run CPU tasks for other projects if you have more cores than you need for GPU support. For a while I also followed the rule of leaving one core free for every GPU task and could still do some CPU work, but now I'm running more than six GPU tasks on my six-core and even one CPU task slows things down.
Quote:
... I just gave it a try and allowed BRP5 for a host that had been running only BRP4G for a while, but I noticed BRP5 tasks come with an estimate of 450,000 GFLOPs vs 280,000 for BRP4G. That's a factor of 1.6 and far from the real ratio, which should be somewhere around 3 ...
As I said, I don't do BRP4G at all so I don't know the comparative FLOPS value that's built in to those tasks. Humour me for a moment if you will be so kind. You have both BRP4G and BRP5 tasks at the moment on your Core 2 Duo rig. Can you please tell me what each type is currently estimated to take? I'm just interested to see if the ratio of the time estimates is also 1.6.
Also could you please mention how many simultaneous tasks you run on the HD7770?
OK, I know I'm getting into this topic late. For me at least, I'm not sure why Cliff is worried about estimated times with this DCF.
I get that he doesn't crunch 24/7, so an exact estimate would help him figure out when these work units should finish.
I guess I've never paid attention to the estimates but just averaged what I was crunching, and that was much more accurate than the estimates BOINC has ever given. I just remember what those averages are.
However, I will say this: estimates from BOINC only work with 1 work unit per GPU.
They also only work with the same GPU in the same PCIe slot.
This goes back to the topic in another thread where we talked about processing speed depending more on which PCIe slot the card is in than on the GPU itself.
The other factor I found was that running more than 1 work unit per GPU affects time to completion depending on:
a) whether both work units are the same type
b) whether the PCIe slot is a higher-speed one (i.e. x16 vs x8)
c) whether you have mixed work units.
In (a), you will find that the time to complete is longer, but less than 2x the normal amount of time.
In (b), the times from (a) will be even shorter in a higher-speed slot compared to a lower-speed PCIe slot.
In (c), we have an unusual case. The Arecibo (BRP4G) task is accelerated, so its time will be quicker than when running two of the same work units. However, the Perseus (BRP5) task in this mixed set-up will be completely outside its normal range (upwards of 50-70% longer) than if it were matched with another Perseus.
In other words, in a mixed pairing the shorter Arecibo task gets quicker while the longer Perseus task gets longer. PCIe slot speed will also affect the times.
So I would say to Cliff: you are going to need to see what the slot speeds are on your MoBo and then how many work units you are running at a time.
If you are running 1 work unit at a time, then you can get an estimate for the different work types depending on what slot they are in (unless all the slots are the same, i.e. x8 or x4, in which case the times should be the same for the same work unit type).
If you run 2 or more work units per GPU, you will never get an accurate estimate of times.
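If you'd rather average your own runtimes than trust the client's estimate, a few lines of script will do it. A rough sketch (Python; the keys are whatever matters on your box - app type, tasks-per-GPU, even the PCIe slot):

    from collections import deque

    class RuntimeTracker:
        # keeps a rolling window of recent runtimes per configuration
        def __init__(self, window=10):
            self.history = {}
            self.window = window

        def record(self, key, seconds):
            self.history.setdefault(key, deque(maxlen=self.window)).append(seconds)

        def estimate(self, key):
            runs = self.history.get(key)
            return sum(runs) / len(runs) if runs else None

    t = RuntimeTracker()
    t.record(("BRP5", "2-up", "x16"), 16_000)
    t.record(("BRP5", "2-up", "x16"), 16_400)
    print(t.estimate(("BRP5", "2-up", "x16")))   # 16200.0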
Quote:
Of course there's no reason not to run CPU tasks for other projects if you have more cores than you need for GPU support ... now I'm running more than six GPU tasks on my six-core and even one CPU task slows things down.
I was interested to see that the CPU in that machine was an FX-6300. Ever since the days of the Athlon XP killing the Northwood P4, I've been hoping that AMD might again become (somewhat) competitive with Intel. About a year or so ago, I bought an FX-6300 so I could see how they were going. I've been quite disappointed.
For comparison purposes, I set up an i3-3240 and the FX-6300, each with a single AMD HD7850 GPU. The i3 is a dual core CPU with HT - so 4 virtual cores. The 6300 has 6 cores but only 3 FP units so each FP unit is shared between 2 cores.
The i3-3240 crunches 2 FGRP4 CPU tasks concurrently with 4 BRP5 GPU tasks. It has a RAC of around 78K. CPU tasks take under 24ksecs and GPU tasks take around 16ksecs (for 4).
The FX-6300 crunches 3 FGRP4 CPU tasks concurrently with 4 BRP5 GPU tasks. It has a RAC of around 60K. CPU tasks take more than 40ksecs and GPU tasks take over 19ksecs (for 4).
Really, the numbers speak for themselves. I know that I could somewhat improve the GPU crunch times, particularly for the FX-6300 host, if I further reduced the number of CPU tasks. For the i3, the improvement is marginal at best. I really want to participate in the FGRP search so I choose to sacrifice a little on BRP5 crunch times to get more FGRP4 done. In order to quantify the effect on the FX-6300, I've just reduced the CPU tasks to 2 so that there is now a free core for each GPU task. I'll let it run this way for a day or two to get some precise numbers. I might get lucky and the RAC might actually improve a little :-).
People often say to leave a free CPU core for each GPU task, but that only seems advisable with recent AMD CPUs. I have a couple of old Phenom II X4 CPUs, also with HD7850 GPUs, running 2 CPU tasks and 4 GPU tasks - i.e. 1 free core for each 2 GPU tasks - and the times on that host are quite interesting. It has a RAC in the mid 70Ks and GPU tasks take around 16ksecs (same as for the i3). CPU tasks take around 39ksecs - slightly better than for the FX-6300. The CPU is about 5 years old.
I also run GTX650 GPUs and for them I don't need to leave any CPU cores free. I run them in Pentium dual-core hosts with 2 CPU tasks and 3 GPU tasks. If I leave a CPU core free, it hardly makes a difference to the GPU crunch time. Obviously it could be quite different for a more powerful GPU. I'm simply making the point that there is no hard and fast 'rule' for the optimal number of free cores to leave; following some formula could mean losing out on overall productivity. As an example, a Pentium G645 host with a GTX650 GPU running 2 CPU tasks and 3 GPU tasks concurrently gives a RAC of around 34K.
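For anyone wanting to redo the comparison, the back-of-envelope arithmetic behind those figures is just tasks-per-day. A quick sketch using the rounded times quoted above:

    # `concurrent` tasks finishing together every ~`secs` seconds
    def tasks_per_day(concurrent, secs):
        return concurrent * 86_400 / secs

    print(tasks_per_day(4, 16_000))   # i3-3240 + HD7850, BRP5: ~21.6/day
    print(tasks_per_day(4, 19_000))   # FX-6300 + HD7850, BRP5: ~18.2/day
    print(tasks_per_day(2, 24_000))   # i3 CPU, FGRP4: ~7.2/day
    print(tasks_per_day(3, 40_000))   # FX CPU, FGRP4: ~6.5/day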
Quote:
You have both BRP4G and BRP5 tasks at the moment on your Core 2 Duo rig. Can you please tell me what each type is currently estimated to take? I'm just interested to see if the ratio of the time estimates is also 1.6.
I was running two BRP4G in quite stable (and correctly estimated) 2:09 hours. The first BRP5 came with an estimate of 3:26. The current estimates are 5:03 and 8:07, which are of course both incorrect, but still a factor of 1.6 apart.
To add to the confusion, the ratio of the real run times is neither 1.6 nor 3.3; it's more like 2.6.
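Which makes sense: the single DCF multiplies both estimates by the same amount, so their ratio is pinned at the flops ratio no matter what DCF does. A quick check, converting the times above to minutes:

    print((3*60 + 26) / (2*60 + 9))   # initial estimates: 206/129 ~ 1.60
    print((8*60 + 7) / (5*60 + 3))    # current estimates: 487/303 ~ 1.61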
Quote:
Also could you please mention how many simultaneous tasks you run on the HD7770?
It's an HD7750 and I'm running two BRP4G or four BRP5. The BRP4G seem more demanding.
Quote:
I was interested to see that the CPU in that machine was an FX-6300 ... I've been quite disappointed.
The C2D is my first Intel CPU and I've been quite happy with it. After my recent experiences with the FX-6300 I am again. ;-) But when I was looking for an affordable multi-GPU board and a CPU to go with it, it quickly became clear that it couldn't be Intel. Considering the price I'm not really disappointed, but I am somewhat surprised that the old C2D still outperforms the FX.
Quote:
For comparison purposes
(Snip)
Quote:
Really, the numbers speak for themselves.
Sorry, I'm not that much into numbers. I try to keep my systems running smoothly, and efficiently if possible, but I don't want to take all this too seriously.
Quote:
I know that I could somewhat improve the GPU crunch times, particularly for the FX-6300 host, if I further reduced the number of CPU tasks.
The CPU load of GPU tasks is also something to keep in mind. It's significantly higher on my FX-6300 than on the Intel, running the same type of tasks on the same GPU. When I forced the CPU to run at top speed, even when mostly idle, it made a noticeable difference in GPU run times. Unfortunately it also made a noticeable difference in power consumption and noise, but if you can live with that it's worth a try.
Quote:
If I leave a CPU core free, it hardly makes a difference to the GPU crunch time. Obviously it could be quite different for a more powerful GPU. I'm simply making the point that there is no hard and fast 'rule' for the optimal number of free cores to leave.
Sure, that depends a lot on the hardware and the tasks running on it. I've seen apps that hardly needed any CPU support at all, and I've seen one actually benefit from a second CPU core.
Quote:
I was interested to see that the CPU in that machine was an FX-6300 ... I've been quite disappointed.
Quote:
The C2D is my first Intel CPU and I've been quite happy with it ... I am somewhat surprised that the old C2D still outperforms the FX.
I did a price check yesterday at Newegg.com and Microcenter.com and both had the AMD FX-6300 and the Intel i3-3240 on sale for about $100 US. I too have been buying FX-6300 CPUs lately because of their cost, and because you get more CPU cores to crunch with for the non-GPU projects.
I run Cosmology on my CPUs right now; my AMD FX-6300s take between 30k and 84k seconds to run a unit. My Intel i7-3612QM, a laptop, takes between 16k and 54k seconds per unit. While the Intel is faster, the sheer number of AMD cores I can throw at the project makes it a better choice for me. Running 5 times as many units at one time makes it a no-brainer compared to the i3 dual-core CPU when used in this way.
What I am trying to say is that I think each CPU has its place in crunching, and it depends on the projects you run as to which is better. I tend to throw lots of resources at a single project, making the AMD better for me. But the Intel is definitely a faster cruncher per core if you are interested in pure speed. In military terms I like to think of the AMD CPUs as the Chinese Army, LOTS AND LOTS of resources, while the Intel CPUs are more like a US Navy SEAL team, smaller but faster. Both get the job done very well, but in the end the AMD loses on each individual unit yet overwhelms the Intel in RAC through its sheer number of cores.
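To put rough numbers on that trade-off, here's the per-host arithmetic (a sketch only: I've taken the midpoint of the 30k-84k range above and assumed all six cores are loaded, which is my simplification):

    def units_per_day(concurrent_units, avg_secs_per_unit):
        # aggregate daily output of one host
        return concurrent_units * 86_400 / avg_secs_per_unit

    print(units_per_day(6, 57_000))   # one FX-6300, all 6 cores: ~9.1 units/day

    # Two or three such ~$100 hosts scale that number linearly - which is
    # how the core count wins on RAC even though each individual unit is
    # slower than on the Intel.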