Support for (integrated) Intel GPUs (Ivy Bridge and later)

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181400534
RAC: 8964

LoL, good

LoL, well written:

Quote:
CPU utilization while running GPU OpenCL applications is a problem for NVidia as well as Intel. The application uses only asynchronous kernel and memory copy functions but the driver "overrides" and decides to block anyway. One "hack" suggested by NVidia was to allocate a large number of command queues so that the driver would think it should use the actual asynchronous calls, but that has not worked for me. If it wasn't driver related, then why would the same application work fine on AMD GPUs? So, write your local Intel developer and ask them why their driver needs CPU time to run asynchronous OpenCL kernels. You would think it would work properly with NVidia GPUs since OpenCL came from them. Nope. NVidia can't even follow their own specs properly. And if they can't, why should Intel? AMD isn't off the hook though. While their driver does async properly, it won't always compile the app properly. Maybe they should be renamed from NVidia, Intel and AMD to Larry, Moe, and Curly!

http://boinc.thesonntags.com/collatz/forum_thread.php?id=1019&postid=16833

At least I'm not alone in this boat :)

Unfortunately, I have absolutely no idea what "kernels per reduction" does in his app, so it's hard to comment on how it would change the load.
Regarding many queues: this didn't work for me either. I tried creating something like 20 queues and detected no change in app behavior. Unlike Slicker, I use a simpler approach most of the time, synchronous launches rather than async ones... but it looks like going async will not help either.
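
To put numbers on the CPU load being discussed, one can compare wall-clock time with process CPU time around a wait. The sketch below is only an illustration: it assumes, without any confirmation from the driver vendors, that a blocking call which polls for completion behaves like a spin loop. The function names are made up for this example and are not part of any OpenCL API.

```python
import time


def cpu_cost_of_wait(wait_fn, duration_s):
    """Return (wall_seconds, cpu_seconds) spent inside wait_fn(duration_s)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    wait_fn(duration_s)
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)


def spin_wait(duration_s):
    """Poll the clock in a tight loop: one core stays busy for the whole wait."""
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        pass


def sleeping_wait(duration_s):
    """Block in the OS scheduler: the process uses almost no CPU time."""
    time.sleep(duration_s)


if __name__ == "__main__":
    for name, fn in (("spin", spin_wait), ("sleep", sleeping_wait)):
        wall, cpu = cpu_cost_of_wait(fn, 0.2)
        print(f"{name:5s} wall={wall:.3f}s cpu={cpu:.3f}s")
```

A CPU time close to the wall time, as in the spin case, is the pattern a "full core per GPU task" report would show; a near-zero CPU time is what a properly sleeping wait looks like.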

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181400534
RAC: 8964

If we start to share Intel

If we start to share Intel OpenCL experience here, I would like to discuss another issue I have with SETI iGPU AP. Loss of precision.
Looks like this app produces slightly different results, which more often leads to inconclusives. So far I have tracked it to an FFT call whose output has ALL values slightly bigger than in the reference array (the AMD implementation was used as the reference; it validates against the stock CPU app most of the time). Such a non-random deviation, all in the same direction, will surely lead to a deviation in the final result, but where this systematic shift first appears is not quite clear.
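
One way to tell a systematic shift from ordinary rounding noise is to look at the signs of the element-wise differences, not only their size. The following is a minimal sketch under the assumption that both FFT outputs have been dumped as plain lists of floats of equal length; `deviation_summary` is a hypothetical helper, not something from the SETI or Einstein code.

```python
import statistics


def deviation_summary(test, reference):
    """Summarise element-wise differences between two equal-length sequences."""
    if len(test) != len(reference) or not test:
        raise ValueError("arrays must be non-empty and of equal length")
    diffs = [t - r for t, r in zip(test, reference)]
    positive = sum(1 for d in diffs if d > 0)
    negative = sum(1 for d in diffs if d < 0)
    nonzero = positive + negative
    return {
        "mean_diff": statistics.fmean(diffs),
        "max_abs_diff": max(abs(d) for d in diffs),
        "positive": positive,
        "negative": negative,
        # Fraction of non-zero differences that share the majority sign:
        # about 0.5 for unbiased noise, 1.0 for a one-sided shift.
        "same_sign_fraction": max(positive, negative) / nonzero if nonzero else 0.0,
    }


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0, 4.0]
    shifted = [r * (1.0 + 1e-6) for r in reference]      # every value slightly bigger
    noisy = [1.0 + 1e-6, 2.0 - 1e-6, 3.0 + 1e-6, 4.0 - 1e-6]
    print("shifted:", deviation_summary(shifted, reference))
    print("noisy:  ", deviation_summary(noisy, reference))
```

Running the same summary after each processing stage (input upload, FFT, post-processing) would show the first stage at which `same_sign_fraction` jumps towards 1.0.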

What about Einstein's Intel app validation? Any issues over same app for AMD/NV ?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2954713277
RAC: 711935

RE: What about Einstein's

Quote:
What about Einstein's Intel app validation? Any issues over same app for AMD/NV ?


Any issue like that would (should) be reported in the BRP4 Intel GPU app feedback thread in Problems and Bug Reports.

A quick search through it doesn't reveal any validation problems, except when ETA deliberately lowered the chip voltage and went too far.

My Haswell (154,193 credits in a month, so at least 2,500 completed tasks) is showing zero errors and zero invalid at the moment.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171376
RAC: 43

RE: If we start to share

Quote:
If we start to share Intel OpenCL experience here, I would like to discuss another issue I have with SETI iGPU AP.

Hm, I think any deeper discussions should move to (a) separate thread(s).

Quote:
Loss of precision.
Looks like this app produces slightly different results, which more often leads to inconclusives. So far I have tracked it to an FFT call whose output has ALL values slightly bigger than in the reference array (the AMD implementation was used as the reference; it validates against the stock CPU app most of the time). Such a non-random deviation, all in the same direction, will surely lead to a deviation in the final result, but where this systematic shift first appears is not quite clear.

Are you sure AMD's FFT is portable to other vendors' GPUs? Also, as far as I know they generate different kernel implementations (which you then can dump) based on the FFT setup and potentially even the hardware. We use a customised version of Apple's reference FFT, originally designed for NV's G80 architecture. You can get our version here.

Quote:

What about Einstein's Intel app validation? Any issues over same app for AMD/NV ?

If anything, the Intel GPUs are even more stable in terms of validation than the AMD GPUs. Our Intel tasks exhibit less than 0.1% validation issues, which is about the level for CPUs. NVIDIA GPUs are solely covered by CUDA in our case, so there's no point in comparing them to the OpenCL tasks. FYI, we build all our OpenCL apps using AMD's APP SDK.

Oliver

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171376
RAC: 43

RE: More precisely: SETI

Quote:

More precisely: SETI AP is an OpenCL 1.0 app. It can run under all OpenCL drivers, starting from the very first ones (it appeared quite long ago, right when AMD started to implement OpenCL on their GPUs).

We didn't care about the 1.0 models as they were so slow that even a contemporary CPU core was faster :-)

Quote:

If Einstein's app is a true OpenCL 1.1 one, it can use some different methods for host-GPU syncing (events). So I wonder whether that is the case or not?


Nope, nothing fancy there. You may have a look at the source code (binary radio pulsar search application: src/opencl/app) to see for yourself (I know, not the best design). It's not the very latest version but there's been only one (irrelevant) functional OpenCL change since then.

Oliver

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171376
RAC: 43

RE: So I'm interesting

Quote:

So I'm interested in the typical kernel launch size for the Einstein app. I'm not aware whether any profiling tools exist for Intel, but as both our apps are capable of running on all 3 GPU types, profiling on NV (or on ATI), for example, would be quite enough for my purpose. I have rich profiling data on ATI GPUs for comparison.

As stated in a previous post, NVIDIA is out of the comparison as we only use CUDA on those devices (for the time being). While the algorithm is more or less the same, the CUDA app itself is only roughly comparable to the OpenCL app as it has some natural differences and, most importantly, uses a different FFT implementation (CUFFT).

While we did use NVIDIA's CUDA profiler at some point, we haven't yet got around to using the AMD profiler because of technical constraints (cross-compiling for Windows, headless Linux nodes). Not sure I can find the time right now to try this again with the latest tools like CodeXL, though...

Quote:

Also, are Einstein's app sources available and, if so, where?

See my previous post :-) You should be able to build the app following the link above and the (basic) instructions provided on that page. You may then give it a try and profile it on your hardware, for a direct comparison.

Oliver

Einstein@Home Project

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181400534
RAC: 8964

RE: Are you sure AMD's FFT

Quote:

Are you sure AMD's FFT is portable to other vendors' GPUs? Also, as far as I know they generate different kernel implementations (which you then can dump) based on the FFT setup and potentially even the hardware. We use a customised version of Apple's reference FFT, originally designed for NV's G80 architecture. You can get our version here.

Actually, I use a modded oclFFT (Apple's implementation) too. By "reference AMD" I just meant that the results came from the app running on an AMD GPU, not on Intel. But both apps use oclFFT, not an AMD-specific one.

Quote:

FYI, we build all our OpenCL apps using AMD's APP SDK.

Oliver


Thanks. That differs from my approach: I build the app for a particular GPU against that particular vendor's SDK.

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181400534
RAC: 8964

RE: We didn't care about

Quote:

We didn't care about the 1.0 models as they were so slow that even a contemporary CPU core was faster :-)


Well, the HD4870 was fast enough even when used via Brook+ ;)

Quote:


You may have a look at the source code (binary radio pulsar search application: src/opencl/app) to see for yourself (I know, not the best design). It's not the very latest version but there's been only one (irrelevant) functional OpenCL change since then.

Oliver

Thanks, will look.

PS. If I understood right, the Intel GPU binary I have on my system could run on an NV GPU too without modification, right?
Is there any benchmark package for offline testing? AFAIK Einstein uses a lot of additional data files, so offline testing could be a big issue...

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171376
RAC: 43

RE: Actually, I use modded

Quote:

Actually, I use a modded oclFFT (Apple's implementation) too. By "reference AMD" I just meant that the results came from the app running on an AMD GPU, not on Intel. But both apps use oclFFT, not an AMD-specific one.

Oh, in that case I recommend you try our version. Apple's implementation had some issues on a few AMD Radeon series GPUs (the 6900 series IIRC) because of their use of the faster but less accurate native_sin/native_cos functions. Have a look at the commit log (starting at 48a3c01) to get an idea.
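
To illustrate why the accuracy of the sine/cosine used for the twiddle factors matters, here is a minimal sketch that runs the same small DFT twice: once with the library functions and once with a deliberately crude polynomial sine. The `crude_sin`/`crude_cos` helpers are made-up stand-ins and do not model any real `native_sin`/`native_cos` (whose accuracy is implementation-defined in OpenCL); the naive O(n^2) DFT is likewise just for demonstration and is not the oclFFT algorithm.

```python
import math


def crude_sin(x):
    """Low-accuracy sine: 5th-order Taylor polynomial after range reduction.

    Only a stand-in for a fast, less accurate hardware sine.
    """
    x = math.remainder(x, 2.0 * math.pi)      # reduce to [-pi, pi]
    if x > math.pi / 2:                        # fold into [-pi/2, pi/2]
        x = math.pi - x
    elif x < -math.pi / 2:
        x = -math.pi - x
    x2 = x * x
    return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0))   # x - x^3/6 + x^5/120


def crude_cos(x):
    """Cosine derived from crude_sin via cos(x) = sin(x + pi/2)."""
    return crude_sin(x + math.pi / 2)


def dft(samples, sin_fn=math.sin, cos_fn=math.cos):
    """Naive DFT whose twiddle factors come from the supplied sin/cos."""
    n = len(samples)
    out = []
    for k in range(n):
        acc = 0j
        for i, x in enumerate(samples):
            angle = 2.0 * math.pi * k * i / n
            acc += x * complex(cos_fn(angle), -sin_fn(angle))
        out.append(acc)
    return out


def max_abs_difference(a, b):
    """Largest element-wise magnitude difference between two complex lists."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    signal = [math.sin(2.0 * math.pi * 3 * i / 64) for i in range(64)]
    accurate = dft(signal)
    approximate = dft(signal, crude_sin, crude_cos)
    print("max spectrum deviation:", max_abs_difference(accurate, approximate))
```

The deviation printed here is far larger than anything a real GPU would produce, since the polynomial is intentionally poor; the point is only that twiddle-factor error goes straight into every output bin.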

I know, it's unrelated to the CPU usage issue but it might help with your potential validation problems.

Oliver

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171376
RAC: 43

RE: PS. If I understood

Quote:

PS. If I understood right, the Intel GPU binary I have on my system could run on an NV GPU too without modification, right?


Well, in principle yes. However, this OpenCL app never really worked on NVIDIA GPUs, primarily in terms of validation - even when we used their SDK to build it. NVIDIA's support of OpenCL is a dead end anyway.

Quote:

Is there any benchmark package for offline testing? AFAIK Einstein uses a lot of additional data files, so offline testing could be a big issue...


Not provided with that source code package, but you can always hook up your host to our project and get a task for it. Take the downloaded files and the command line and you should be good to go. I recommend you use the BRP4 app as it requires the fewest input/data files (just 3).

Oliver

Einstein@Home Project
