Perhaps I should be clearer ... what I meant was: how were the 0.22, 0.55 and 0.05 figures arrived at, and what does it mean/imply to vary them from the 'standard' defaults?
For the CPUs it basically means that 5% of one core (or one thread, if we take hyperthreading into account) is dedicated to the CPU-bound part of the Milkyway CUDA application. Since we have four cores here, that works out to 1.25% of the total CPU time across all cores (threads), and of course only of the part that is available to BOINC. I don't have Milkyway CUDA running at the moment, but 0.05 is probably the standard value here.
So this is quite similar to NCI projects like WUProp@Home, which takes only 1% of one core, so BOINC does not reserve a whole core for it either.
AFAIK it is still the same (timeslicing) on (most?) GPUs, but from what I've heard this is changing, or could change soon: GPUs might soon be able to run more than one application truly in parallel, by dedicating part of the shaders to one app and another part to another, so no timeslicing would be necessary up to a certain point. I don't know why these particular numbers were chosen for the GPU here (0.55 / 0.22 / 0.22); maybe 0.50 / 0.25 / 0.25 didn't work out quite right (EDIT: see Alex' explanation above *g*). But it means one Milkyway CUDA task and two Einstein APS2 CUDA tasks share the resources. This should be done by timeslicing, too.
Since Gary responded quite nicely to Alex' experiments I decided to break cover, too, hoping not to be shot on sight. ;-)
I've also been running some tests since yesterday, so far with two tasks in parallel and no other CUDA applications, on this system here. I lost two tasks due to one crash; all the others did fine, though most are still pending. All tasks reported on 26 Aug 2010 15:21:31 UTC or later were run with coproc count = 0.50, which means two APS2 CUDA tasks per GPU.
Contrary to my namesake, I've used an app_info.xml instead of changing the client_state.xml; that is why you see the 'Anonymous platform' entries in the task list. The advantage is that changed values in the coproc count tags are not reset when the servers are contacted. The disadvantage is, well, the anonymous platform. :-)
Regards
the other Alexander
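For illustration, here is a minimal sketch of what such an app_info.xml can look like. The app name, file name and version number below are placeholders, not the real Einstein entries; the point is only where the fractions live, namely in the <avg_ncpus> and <coproc>/<count> tags of the <app_version> block:

    <app_info>
        <app>
            <name>einstein_abp_cuda</name>              <!-- placeholder app name -->
        </app>
        <file_info>
            <name>einstein_abp_cuda.exe</name>          <!-- placeholder file name -->
            <executable/>
        </file_info>
        <app_version>
            <app_name>einstein_abp_cuda</app_name>
            <version_num>100</version_num>              <!-- placeholder version number -->
            <avg_ncpus>0.20</avg_ncpus>                 <!-- CPU fraction reserved per task -->
            <max_ncpus>1</max_ncpus>
            <coproc>
                <type>CUDA</type>
                <count>0.50</count>                     <!-- 0.50 = two tasks per GPU, 0.33 = three -->
            </coproc>
            <file_ref>
                <file_name>einstein_abp_cuda.exe</file_name>
                <main_program/>
            </file_ref>
        </app_version>
    </app_info>

With count = 0.50 BOINC starts two of these tasks on one GPU, as described above; 0.33 allows three.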
Thank you Alex(s)!! :-)
I ( think I ) am getting the gist, now. The idea is to hit on an estimate, or guesstimate, as to the best ratio of CPU thread to GPU work that could reasonably be supported. On a given machine. So a 1.00 CPU is sort of like a 'full time' employee serving up the work to several part time ones ( fractional GPU's ). So the apparent/alleged inefficiency is :
(a) idle GPU capacity. It's only ticking over when there's revs to spare.
and/or
(b) over-designation of CPU usage. A whole thread is allotted when it is barely used.
So you fudge the factors, with knowledge and experiment upon a known machine, and BOINC ( various mechanisms ) says 'OK, this host really could go some more over and above the default allocations'.
Thus any core allocation number, if brought downwards, would also potentially benefit other projects in the given machine's BOINC queue, as well as the project you are estimating for ? And similarly get better usage of GPU for other projects potentially, as well as the one you are estimating for ....
If so, that is indeed a clever move.
[ I don't see anything we should be shooting on sight for thus far .... :-) Polite discussion/criticism/suggestions of the project are healthy. ]
My guess, if I am correctly analysing thus far, is the difficulty at the app/project source code level of yielding 'good' CPU and GPU numbers. That suggests a refined or extended benchmarking facility at the BOINC level - so not just a bland ( cobblestones or whatever ) rating of a machine but an exquisite prediction of CPU/GPU relativities for a given scenario.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Hi,
I was curious about the results, so I decided to turn the internet on again.
I had 5 Einstein cuda-wu's ready, some GC's and some MW. The Einstein cuda's are all accepted and granted; I have not yet checked the CPU-only apps.
The MW apps have all validated too.
So these settings seem to work. Let's see what SETI brings up.
Regards,
Alexander
Mike, you're welcome!
The settings we are talking about are sometimes labelled 'for advanced users only' (as lunatics does, and MW too). A first and VERY helpful step would be if the settings in client_state.xml were not changed back (reset); this is a setting of the app, and MW leaves these settings as they are. I can check SETI when they are online again.
For crunchers who only want to run Einstein cuda-wu's, a simple setting of 0.5 GPUs should work on most cards; users with Fermi cards are usually able to make these changes, or they can get help in a forum or from a team.
I am aware that a new CUDA app is coming ('in about two weeks', as was posted a couple of weeks ago :-) ), so I don't expect a change in the current app. But a new app should be adapted as you posted.
Cheers,
Alexander
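For anyone who goes the client_state.xml route instead, the value in question is, as far as I can tell, the <count> inside the <coproc> block of the project's <app_version> entry. An abridged illustration with a placeholder app name, not a verbatim excerpt:

    <app_version>
        <app_name>einstein_abp_cuda</app_name>          <!-- placeholder app name -->
        <version_num>100</version_num>                  <!-- placeholder version number -->
        <avg_ncpus>0.200000</avg_ncpus>
        <coproc>
            <type>CUDA</type>
            <count>0.500000</count>                     <!-- 0.5 GPUs per task = two tasks per card -->
        </coproc>
    </app_version>

Keep in mind, as noted earlier in the thread, that values changed here can be reset when the project servers are contacted, which is exactly why the app_info.xml route was used above.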
This might require updates to the BOINC core client, but couldn't the application keep track of how much time it spends using the GPU compared to the CPU, and relay that information back to BOINC to let it optimize the ratios between different projects? It doesn't seem like something that would be particularly hard to do - on Windows, just do a QueryPerformanceCounter() whenever you switch between CPU and GPU and add up the times when the WU is done.
Mind you, that does mean that applications shouldn't get exclusive access to the GPU (well, the available CUDA/CAL/OpenCL devices) for the length of a WU, but rather request GPU time only when needed. I'm not sure how BOINC currently implements this.
In the next alpha release [trac]changeset:22283[/trac] will be added, which does exactly that.
IMO, the Fermi GPUs are capable of running at 96% memory and/or GPU load without getting too hot.
More importantly, these cards have a different architecture and more fast memory per SIMD; but my knowledge of CUDA is still almost zero, it has a steep learning curve, not like my RAC.
Why can't there be 2 or more 'threads', or parallel computing, which is the/a solution to gain speed ...
(I know, a lot is probably already running 'parallel' or threaded).
Some minor update: since yesterday afternoon I have been running three APS2 CUDA tasks in parallel on my GTX260, coproc count = 0.33. The tasks were all reported at 27 Aug 2010 17:42:49 UTC and 28 Aug 2010 5:35:39 UTC. No errors so far (apart from the crash in the first run already mentioned). GPU load was ~25%, GPU temperature 65 °C, and video memory load was still below 400 MB, so not even 50% (the card has 896 MB of memory and 216 shaders). For comparison: with only one Collatz CUDA task the GPU load is ~88% and the GPU temperature 74 °C, so there should be enough headroom for another run with four tasks in parallel (I can't do more, since I only have a quad-core without HT *g*). Most noticeably, I don't see any significant performance penalty so far.
Regards
Hi,
I tried that (sure, why not?), and it works fine, but the GPU load is still below 50%. So this is not the goal (at least not for me).
I've got some SETI cuda-wu's; I ran them with the setting 2 Einstein / 1 SETI.
It works fine, SETI finished without a timeout (like my MW-wu's). The GPU load was
Regards,
Alexander
I tried to do the same in the past (when the cuda app was first released); usage-wise I would have been able to run 3-4 WUs on my GTX260-216 at a time!
However, I never managed to make an app_info file that Einstein accepted.
Could anyone please help me, or does anyone still have an old beta Einstein xml I could use to modify?
Thanks!