Actually, the policy could be "spin when small and yield when big".
In that case big kernel launches would allow low CPU usage, while an app with small kernel launches would suffer from increased CPU usage.
I want to understand whether this guess is true, or whether the reason for the differences we observe lies in something else.
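To make the guess concrete, here is a minimal sketch of the two wait policies, assuming the runtime simply polls an OpenCL event (the helper names are mine, not taken from either app):

#include <CL/cl.h>
#include <sched.h>

/* "Spin": poll the event status in a tight loop. Reacts to GPU completion
   almost immediately, but burns a whole CPU core while waiting. */
static void wait_spin(cl_event ev)
{
    cl_int status = CL_QUEUED;
    while (status != CL_COMPLETE)
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
}

/* "Yield": give the time slice back between polls (on Windows this would be
   SwitchToThread() or Sleep(0)). Uses almost no CPU, but completion may only
   be noticed a scheduler quantum later, which hurts an app that launches
   many small kernels. */
static void wait_yield(cl_event ev)
{
    cl_int status = CL_QUEUED;
    while (status != CL_COMPLETE) {
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        sched_yield();
    }
}

With a "spin when small and yield when big" policy, the driver would pick wait_spin() for short kernels and wait_yield() for long ones.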
Our custom kernels (see source code) are small in my opinion, but we have a large number of work items (up to 2^25 IIRC). The work group size is determined (limited) dynamically at runtime to respect the underlying hardware.
Update: I just noticed that our "kernelPowerSpectrum*" kernels have become more complex these days :-) I have to admit that I haven't looked at the code for quite a while, so your profiling efforts could indeed be interesting.
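For illustration, a rough sketch of that kind of runtime clamping of the work-group size (the preferred size of 256 is an assumption; this is not the actual Einstein@Home code):

#include <CL/cl.h>

/* Clamp the requested work-group size to what the kernel/device allows
   and launch a 1-D range over ~2^25 work items. */
static cl_int launch_clamped(cl_command_queue queue, cl_kernel kernel,
                             cl_device_id device, size_t preferred)
{
    size_t device_max, local, global = (size_t)1 << 25;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(device_max), &device_max, NULL);
    local = preferred < device_max ? preferred : device_max;

    /* round the global size up to a multiple of the chosen local size */
    global = ((global + local - 1) / local) * local;
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                  0, NULL, NULL);
}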
This gets curiouser and curiouser. I pointed out that my Einstein tasks run for about 11 minutes when the CPU is 75% loaded, and only record ~20 seconds CPU time. But if the CPU is 100% loaded, the elapsed time jumps massively to around 80 minutes (extrapolated - I didn't complete a full one).
I've just tried the reverse experiment with SETI (astropulse) running on the same hardware - task list.
The most recent one - 15081890, reported 17 Oct 2013, 14:00:10 UTC - was run with 100% CPU loading throughout, and the extra elapsed time is barely noticeable (though some of the SIMAP tasks running in parallel took longer than usual).
From the outside, it looks as if Oliver's app needs a free CPU, but doesn't actually use it much: Raistmer's app uses a lot of CPU, but doesn't actually need it.
Just to check, I had a poke through Task Manager while the Einstein app was running in 75% CPU mode, to see if the extra CPU time was being accounted for in other places. Probably not a robust enough test (even with 'show processes for all users' checked), but the most I could see was occasional spikes up to 2% in 'NT Kernel & System'. A fairly steady 23% was being allocated to 'System Idle Process'.
From the outside, it looks as if Oliver's app needs a free CPU, but doesn't actually use it much: Raistmer's app uses a lot of CPU, but doesn't actually need it.
Doesn't this make sense? I mean, if our app doesn't use much CPU, it depends on very fast context switching for the few parts that do run on the CPU. SETI's app seems to use the CPU extensively, which is why it doesn't depend so much on context switching - it is already running more or less continuously on the CPU, so it doesn't compete with other CPU processes for time slices, but our app does...
Yes, since AP uses ~100% of a CPU core, it's not correct to say that AP "doesn't need it". It just doesn't share it ;)
The strange thing is that this reaction to a fully loaded CPU appeared only after the last driver update on my host (from an OpenCL 1.1 to an OpenCL 1.2 driver). Since Richard runs OpenCL Einstein more than I do, it's quite possible he updated his driver (Einstein requires an OpenCL 1.2 driver) long before I did, so he has no pre-OpenCL 1.2 data points to compare with. I have. The older driver reacted in a different way: under full load the elapsed time increased considerably, but the CPU time decreased. It's exactly this fact that I take as confirmation of my "spend CPU on synching" theory: when the CPU is not available immediately, the app spends less time in waiting loops (the GPU is mostly ready by the time the app is switched back in).
EDIT: maybe with the last driver change the priority of the corresponding driver thread was changed, or something like that - now the app continues to use CPU under full load.
EDIT2: a typical example with the old driver (fully loaded CPU):
32,186.11 / 16,924.63 (elapsed / CPU)
And the current situation:
26,569.22 / 26,345.06 (elapsed / CPU)
If you were following the v7.2.18 release thread on boinc_alpha, and the discussion about OpenCL detection on CPU (yes, not GPU - I got that wrong too), you'll remember that this machine was supplied from the factory with OpenCL 1.2 drivers for the HD 4600 - so no, I have no Astropulse times under an OpenCL 1.1 driver for comparison.
As part of the v7.2.18 testing, I downloaded the additional Intel SDK and runtime support for OpenCL on CPU - that wasn't pre-installed.
Unfortunately, I have absolutely no idea what that "kernels per reduction" setting does in his app, so it's hard to comment on how it would change the load.
I'm not that deep into CC either. But in principle they're checking huge integers (many in parallel on the GPUs) against the Collatz conjecture by running an algorithm on them ("3n+1"). The numbers thereby gradually become smaller, until the algorithm terminates with "CC still holds true" or "not" (which has never happened so far). So... the "reductions per kernel" could be the number of algorithm iterations performed per kernel call.
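As a rough illustration of that "3n+1" reduction (the real Collatz@Home kernels work on much larger multi-word integers, and on many of them in parallel, so this is just the idea):

#include <stdio.h>

/* Count the Collatz steps needed to reduce n to 1:
   halve even numbers, map odd n to 3n+1. */
static unsigned long long collatz_steps(unsigned long long n)
{
    unsigned long long steps = 0;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n / 2;
        ++steps;
    }
    return steps;
}

int main(void)
{
    printf("27 needs %llu steps\n", collatz_steps(27));  /* prints 111 */
    return 0;
}

On that reading, "reductions per kernel" would control how many of these iterations each kernel launch performs before handing control back to the host.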
Then the bigger that number, the bigger the kernel - not the reverse.
If you were following the v7.2.18 release thread on boinc_alpha, and the discussion about OpenCL detection on CPU
No, I don't follow that conversation, and right now I feel a big temptation to unsubscribe from all the BOINC lists altogether.
A fundamental quota-management issue is ignored completely, but there is a very lively discussion about where to put a button on the Android interface screen and how to properly detect Windows 8.1...
Then the bigger that number, the bigger the kernel - not the reverse.
That's how I understand it as well. But still, I was able to get the CPU usage under control by setting smaller values.
Looked at Einstein's sources - only one difference spotted.
You directly call clFinish every time synching is needed; I do indirect synching via blocking reads when required.
That is, Einstein:
enqueue();
clFinish();
bufferRead(false);   // non-blocking read
clFinish();
SETI:
enqueue();
bufferRead(true);    // blocking read does the synching
Could this difference lead to such big consequences or not - no idea right now.
The next thing will be to determine the size of the kernel calls.
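For concreteness, a hedged sketch of what those two call patterns might look like with the actual OpenCL API (queue, kernel, buf, host, global, local and size are assumed handles/values here, not names from either app):

/* Einstein-style: explicit clFinish() after each step. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
clFinish(queue);    /* wait for the kernel */
clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, size, host, 0, NULL, NULL);
clFinish(queue);    /* wait for the non-blocking read */

/* SETI/AP-style: no explicit clFinish(); the blocking read (CL_TRUE) only
   returns once the preceding kernel and the copy have completed. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host, 0, NULL, NULL);

How much CPU each variant costs then depends on how the driver implements the wait inside clFinish() versus inside the blocking read - which loops back to the spin-versus-yield question at the top of the thread.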