Funny, I know Gergely and could ask him whether he could solve his issue. Are the symptoms the same in your case or did you indeed just list "similar" ones...?
Well, he is on Linux and I'm on Windows.
But I get the same message too when I abort the app. In short, profiling never even started (EDIT: I mean that no data was written to any file, although the CodeXL GUI reports "GPU profiling in progress"). It looks like the app hit some OpenCL runtime crash at the very beginning (same behavior as after a driver crash and restart: no progress plus full usage of one CPU core). But there were no messages about a driver restart in this case.
Is it possible to build the app under MSVC without rewriting your build script completely?
Which external libs are required (besides the BOINC ones, FFTW and OpenCL)?
Hm, should be. But that will surely require some work.
Have a look at build.sh, it downloads every third party lib it needs. In addition to what you already mentioned you'll need GSL, libxml2 and our OpenCL FFT library I referenced earlier.
Oliver
Einstein@Home Project
So far I have had no time to rebuild Einstein's app or profile it on NV (though I still want to do that profiling), but I tried to eliminate the known differences between Einstein's app and SETI's AstroPulse. It looks like AMD SDK vs Intel SDK doesn't matter for CPU consumption. What really matters is the syncing style. When each runtime call is followed by clFinish, CPU usage drops considerably. So syncing on a blocking read is not the same as syncing on clFinish for Intel GPUs (and I suspect for NV too). AMD GPUs aren't affected.
Total syncing, of course, leads to some performance drop for AstroPulse, but later I could eliminate the excessive syncing, and performance should be almost (or even totally) restored.
Here are examples of bench runs that Richard made on his host:
for free CPU core:
WU : Clean_01LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
Elapsed 422.874 secs
CPU 420.251 secs
AP6_win_x86_SSE2_OpenCL_Intel_r1922.exe -verbose :
Elapsed 85.395 secs, speedup: 79.81% ratio: 4.95x
CPU 82.259 secs, speedup: 80.43% ratio: 5.11x
AP6_win_x86_SSE2_OpenCL_Intel_r1922_OCL_SYNCHED.exe -verbose :
Elapsed 91.791 secs, speedup: 78.29% ratio: 4.61x
CPU 8.502 secs, speedup: 97.98% ratio: 49.43x
for fully loaded CPU:
WU : Clean_01LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
Elapsed 422.874 secs
CPU 420.251 secs
AP6_win_x86_SSE2_OpenCL_Intel_r1922.exe -verbose :
Elapsed 85.005 secs, speedup: 79.90% ratio: 4.97x
CPU 81.589 secs, speedup: 80.59% ratio: 5.15x
AP6_win_x86_SSE2_OpenCL_Intel_r1922_OCL_SYNCHED.exe -verbose :
Elapsed 90.389 secs, speedup: 78.63% ratio: 4.68x
CPU 7.675 secs, speedup: 98.17% ratio: 54.76x
So, I would say this enigmatic difference is explained. Thanks to Oliver for sharing the code and to Richard for initiating this comparison.
That rings a bell somewhere. I think we also ran into this issue of implicit vs explicit synchronisation at some point but I can't find any reference to that anymore. However, we only sync where we actually need to in order to not spoil the performance.
Great news! Glad I could help!
Cheers,
Oliver
Einstein@Home Project
Hi again.
I'm trying to follow your other hint regarding possible FFT inaccuracy, so I'm incorporating your changes into clFFT.
Unfortunately, the code can't be built with the MSVC compiler.
This line
gives the following error:
Any ideas how to make it more portable?
EDIT:
Indeed, MSVC 2008 doesn't recognize any suffixes on hexadecimal numbers: http://msdn.microsoft.com/en-us/library/2k2xf226(VS.90).aspx
EDIT2:
Perhaps
float pi2=2.0f*M_PI;
could work.
Sigh, that's standard C99. One reason why we use GCC for all binaries. Converting 2PI from decimal to float will often incur some minor rounding errors so your code might be slightly less accurate.
Oliver
Einstein@Home Project
Yeah, but I'm too bound to MSVS for now to make the change :)
And thanks again, your hint about the poor native_sin/cos implementation on Intel GPUs was very useful too. Your changes in clFFT, plus replacing native_sin/cos with sincos inside SETI's dechirping function, made the results much more accurate. Validation now passes on the test tasks.
That's great news!
Einstein@Home Project
Unfortunately, I failed again to profile your app:
Activated exception handling...
20:05:56 (2824): Can't set up shared mem: -1. Will run in standalone mode.
[20:05:56][2824][INFO ] Starting data processing...
[20:05:56][2824][INFO ] Using OpenCL platform provided by: NVIDIA Corporation
[20:05:56][2824][ERROR] Couldn't find any suitable OpenCL GPU device!
[20:05:56][2824][ERROR] Demodulation failed (error: 2004)!
20:05:56 (2824): called boinc_finish
Because the AMD profiler refused to profile, I tried to use the CUDA profiler on an NV GPU instead. But it looks like the app doesn't accept my NV host config as a valid OpenCL environment.
I use a GTX 260 GPU, and my app reports:
Number of OpenCL platforms: 1
OpenCL Platform Name: NVIDIA CUDA
Number of devices: 1
Max compute units: 27
Max work group size: 512
Max clock frequency: 1242Mhz
Max memory allocation: 234799104
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 939196416
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Scratchpad
Local memory size: 16384
Queue properties:
Out-of-Order: Yes
Name: GeForce GTX 260
Vendor: NVIDIA Corporation
Driver version: 263.06
Version: OpenCL 1.0 CUDA
Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64
Perhaps OpenCL 1.0 is a no-go for your app. Well, I'm afraid I have no PC with OpenCL 1.1 on NV, so such profiling will be hard to do.