CPU utilization while running GPU OpenCL applications is a problem on NVidia as well as Intel. The application uses only asynchronous kernel and memory-copy functions, but the driver "overrides" that and decides to block anyway. One "hack" suggested by NVidia was to allocate a large number of command queues so that the driver would actually take the asynchronous path, but that has not worked for me. If it weren't driver related, why would the same application work fine on AMD GPUs? So, write your local Intel developer and ask them why their driver needs CPU time to run asynchronous OpenCL kernels.
You would think it would work properly with NVidia GPUs since OpenCL came from them. Nope. NVidia can't even follow their own specs properly. And if they can't, why should Intel? AMD isn't off the hook though: while their driver does async properly, it won't always compile the app properly. Maybe NVidia, Intel and AMD should be renamed to Larry, Moe, and Curly!
LoL, well written: http://boinc.thesonntags.com/collatz/forum_thread.php?id=1019&postid=16833
At least I'm not alone in this boat :)
Unfortunately, he has absolutely no idea what that "kernels per reduction" setting does in his app, so it's hard to comment on how it would change the load.
Regarding many queues: this didn't work for me either. I tried creating something like 20 queues, and no change in app behavior was detected. Unlike Slicker, I use a simpler approach most of the time, synchronous launches rather than async ones... but it looks like going async will not help either.
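For reference, the multi-queue workaround I tried looks roughly like this (a sketch only; the context, device and kernel are assumed created elsewhere, and the queue count and work size are just for illustration):

```c
/* Sketch of the "many command queues" workaround: spread asynchronous
 * launches across several queues and synchronize once via events,
 * instead of a blocking call after every launch. Error checking omitted. */
#include <CL/cl.h>

#define NUM_QUEUES 20

static void launch_across_queues(cl_context ctx, cl_device_id dev,
                                 cl_kernel kernel)
{
    cl_command_queue queues[NUM_QUEUES];
    cl_event         events[NUM_QUEUES];
    size_t           gsize = 65536;  /* global work size, illustration only */
    cl_int           err;

    for (int i = 0; i < NUM_QUEUES; ++i)
        queues[i] = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Each enqueue is asynchronous per the spec; none of these calls
     * should block the host. */
    for (int i = 0; i < NUM_QUEUES; ++i)
        clEnqueueNDRangeKernel(queues[i], kernel, 1, NULL, &gsize,
                               NULL, 0, NULL, &events[i]);

    /* One wait point at the end; whether the host sleeps here or
     * spin-polls (burning a CPU core) is up to the driver. */
    clWaitForEvents(NUM_QUEUES, events);

    for (int i = 0; i < NUM_QUEUES; ++i) {
        clReleaseEvent(events[i]);
        clReleaseCommandQueue(queues[i]);
    }
}
```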
Since we're starting to share Intel OpenCL experience here, I would like to discuss another issue I have with the SETI iGPU AP: loss of precision.
It looks like this app produces slightly different results and more often leads to inconclusives. So far I have tracked it down to an FFT call whose output has ALL values slightly bigger than the reference array (the AMD build was used as the reference; it validates against the stock CPU app most of the time). Such a non-random deviation to one side will surely lead to a deviation in the final result, but where this systematic shift first appears is not quite clear.
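One way to tell such a one-sided shift from ordinary rounding noise is to count the sign of the per-element deviation; a quick sketch (array names are hypothetical):

```c
/* Sketch: distinguish a systematic one-sided shift from random rounding
 * noise by counting deviation signs between a test FFT output and a
 * reference output. Array names are hypothetical. */
#include <stdio.h>

static void check_bias(const float *test, const float *ref, size_t n)
{
    size_t bigger = 0, smaller = 0;
    double mean_dev = 0.0;

    for (size_t i = 0; i < n; ++i) {
        double d = (double)test[i] - (double)ref[i];
        mean_dev += d / (double)n;
        if (d > 0.0) ++bigger;
        else if (d < 0.0) ++smaller;
    }
    /* Rounding noise: bigger ~ smaller and mean_dev ~ 0.
     * Systematic shift: almost all deviations share one sign. */
    printf("bigger: %zu  smaller: %zu  mean deviation: %g\n",
           bigger, smaller, mean_dev);
}
```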
What about Einstein's Intel app validation? Any issues compared with the same app on AMD/NV?
Any issue like that would (should) be reported in the BRP4 Intel GPU app feedback thread in Problems and Bug Reports.
A quick search through it doesn't reveal any validation problems, except when ETA deliberately lowered the chip voltage and went too far.
My Haswell (154,193 credits in a month, so at least 2,500 completed tasks) is showing zero errors and zero invalids at the moment.
Quote:
Since we're starting to share Intel OpenCL experience here, I would like to discuss another issue I have with the SETI iGPU AP.
Hm, I think any deeper discussions should move to (a) separate thread(s).
Quote:
Loss of precision.
It looks like this app produces slightly different results and more often leads to inconclusives. So far I have tracked it down to an FFT call whose output has ALL values slightly bigger than the reference array (the AMD build was used as the reference; it validates against the stock CPU app most of the time). Such a non-random deviation to one side will surely lead to a deviation in the final result, but where this systematic shift first appears is not quite clear.
Are you sure AMD's FFT is portable to other vendors' GPUs? Also, as far as I know, they generate different kernel implementations (which you can then dump) based on the FFT setup and potentially even the hardware. We use a customised version of Apple's reference FFT, originally designed for NV's G80 architecture. You can get our version here.
Quote:
What about Einstein's Intel app validation? Any issues compared with the same app on AMD/NV?
If anything, the Intel GPUs are even more stable in terms of validation than the AMD GPUs. Our Intel tasks exhibit less than 0.1% validation issues, which is about the level for CPUs. NVIDIA GPUs are solely covered by CUDA in our case, so there's no point in comparing them to the OpenCL tasks. FYI, we build all our OpenCL apps using AMD's APP SDK.
More precisely: SETI AP is an OpenCL 1.0 app. It can run under all OpenCL drivers, starting from the very first ones (it appeared quite long ago, right when AMD started to implement OpenCL on their GPUs).
We didn't care about the 1.0 models as they were so slow that even a contemporary CPU core was faster :-)
Quote:
If Einstein's app is a true OpenCL 1.1 one, it can use different methods for host-GPU synching (events). So I wonder, is that the case or not?
Nope, nothing fancy there. You may have a look at the source code (binary radio pulsar search application: src/opencl/app) to see for yourself (I know, not the best design). It's not the very latest version, but there's been only one (irrelevant) functional OpenCL change since then.
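For readers wondering what the "events" in the question refer to: OpenCL 1.1 lets the host register a completion callback instead of blocking. A minimal sketch, assuming the queue, kernel and work size already exist:

```c
/* Sketch of OpenCL 1.1 event-based host/GPU synching: register a
 * completion callback instead of blocking in clFinish()/clWaitForEvents().
 * queue, kernel and gsize are assumed to exist already. */
#include <CL/cl.h>
#include <stdio.h>

static void CL_CALLBACK on_done(cl_event ev, cl_int status, void *user_data)
{
    (void)ev; (void)user_data;
    printf("kernel finished with status %d\n", status);
}

static void launch_with_callback(cl_command_queue queue, cl_kernel kernel,
                                 size_t gsize)
{
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL,
                           0, NULL, &ev);
    clSetEventCallback(ev, CL_COMPLETE, on_done, NULL);
    clFlush(queue);      /* submit the work; the host thread stays free */
    clReleaseEvent(ev);  /* drop our reference; the runtime keeps the
                            event alive until the command completes */
}
```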
So I'm interested in the typical kernel launch size for the Einstein app. I'm not aware whether any profiling tools exist for Intel, but as both our apps are capable of running on all 3 GPU types, profiling on NV, for example (or on ATI), would be quite enough for my purpose. I have rich profiling data on ATI GPUs for comparison.
As stated in a previous post, NVIDIA is out of the comparison as we only use CUDA on those devices (for the time being). While the algorithm is more or less the same, the CUDA app itself is only roughly comparable to the OpenCL app as it has some natural differences and, most importantly, uses a different FFT implementation (CUFFT).
While we did use NVIDIA's CUDA profiler at some point, we didn't yet get around to using the AMD profiler because of technical constraints (cross-compiling for Windows, headless Linux nodes). Not sure I can find the time right now to try this again with the latest tools like CodeXL, though...
Quote:
Also, are Einstein's app sources available, and if so, where?
See my previous post :-) You should be able to build the app following the link above and the (basic) instructions provided on that page. You may then give it a try and profile it on your hardware, for a direct comparison.
Quote:
Are you sure AMD's FFT is portable to other vendors' GPUs? Also, as far as I know, they generate different kernel implementations (which you can then dump) based on the FFT setup and potentially even the hardware. We use a customised version of Apple's reference FFT, originally designed for NV's G80 architecture. You can get our version here.
Actually, I use a modded oclFFT (Apple's implementation) too. I just meant "AMD reference" in the sense that the results came from the app running on an AMD GPU, not on Intel. Both apps use oclFFT, not an AMD-specific FFT.
Quote:
FYI, we build all our OpenCL apps using AMD's APP SDK.
Thanks. That differs from my approach: I build the app for a particular GPU rather than with a particular vendor's SDK.
Quote:
We didn't care about the 1.0 models as they were so slow that even a contemporary CPU core was faster :-)
Well, the HD4870 was fast enough even when used via Brook+ ;)
Quote:
You may have a look at the source code (binary radio pulsar search application: src/opencl/app) to see for yourself (I know, not the best design). It's not the very latest version, but there's been only one (irrelevant) functional OpenCL change since then.
Thanks, will look.
PS: If I understood right, the Intel GPU binary I have on my system could run on an NV GPU too without modification, right?
Is there any bench package for offline testing? AFAIK Einstein uses a lot of additional data files, so offline testing could be a big issue...
Quote:
Actually, I use a modded oclFFT (Apple's implementation) too. I just meant "AMD reference" in the sense that the results came from the app running on an AMD GPU, not on Intel. Both apps use oclFFT, not an AMD-specific FFT.
Oh, in that case I recommend you try our version. Apple's implementation had some issues on a few AMD Radeon GPUs (the 6900 series, IIRC) because of its use of the faster but less accurate native_sin/native_cos functions. Have a look at the commit log (starting at 48a3c01) to get an idea.
I know, it's unrelated to the CPU usage issue, but it might help with your potential validation problems.
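To illustrate the kind of change involved: OpenCL's native_* math functions have implementation-defined precision. A simplified, hypothetical twiddle-factor kernel (not the actual oclFFT code):

```c
/* OpenCL C illustration of the native_sin/native_cos trade-off in a
 * twiddle-factor computation. Simplified, hypothetical kernel; the
 * real oclFFT code differs. */
__kernel void twiddle(__global float2 *w, const float scale)
{
    int i = get_global_id(0);
    float angle = scale * (float)i;
#ifdef USE_NATIVE_MATH
    /* fast, but precision is implementation-defined */
    w[i] = (float2)(native_cos(angle), native_sin(angle));
#else
    /* slower, but accurate within the guaranteed ULP bounds */
    w[i] = (float2)(cos(angle), sin(angle));
#endif
}
```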
Quote:
PS: If I understood right, the Intel GPU binary I have on my system could run on an NV GPU too without modification, right?
Well, in principle yes. However, this OpenCL app never really worked on NVIDIA GPUs, primarily in terms of validation - even when we used their SDK to build it. NVIDIA's support of OpenCL is a dead end anyway.
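The portability in question comes from OpenCL discovering platforms at runtime and compiling kernel source with whichever driver is present; a minimal enumeration sketch:

```c
/* Why one OpenCL host binary can, in principle, run on any vendor's
 * GPU: platforms are discovered at runtime and kernel source is built
 * by whichever driver is installed. Minimal enumeration sketch. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint n = 0;
    char name[256];

    clGetPlatformIDs(8, platforms, &n);
    for (cl_uint i = 0; i < n; ++i) {
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        /* e.g. "Intel(R) OpenCL", "NVIDIA CUDA",
         * "AMD Accelerated Parallel Processing" */
        printf("platform %u: %s\n", i, name);
    }
    return 0;
}
```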
Quote:
Is there any bench package for offline testing? AFAIK Einstein uses a lot of additional data files, so offline testing could be a big issue...
Not provided with that source code package, but you can always hook up your host to our project and get a task for it. Take the downloaded files and the command line and you should be good to go. I recommend you use the BRP4 app, as it requires the fewest input/data files (just 3).