The BRP4G application should be intelligent enough to figure what it is running on, and if that piece of hardware is even capable of doing what it wants to do.
Yes, that seems to be exactly what it's doing.
Or not, not when you see [Einstein@Home] [coproc] ATI instance 1: confirming 0.500000 instance for p2030.20131122.G177.14+00.50.S.b3s0g0.00000_3792_0
That still implies it's running on GPU 1, which isn't capable of running the task.
@Heinz, in case you think it's done on the CPU, I don't think so. Despite the CPU being OpenCL capable, I don't think you can run a GPU app indiscriminately on a CPU. It would also not show as having been done on the Cypress GPU.
The BRP4G application should be intelligent enough to figure what it is running on, and if that piece of hardware is even capable of doing what it wants to do.
Yes, that seems to be exactly what it's doing.
Or not, not when you see [Einstein@Home] [coproc] ATI instance 1: confirming 0.500000 instance for p2030.20131122.G177.14+00.50.S.b3s0g0.00000_3792_0
That still implies it's running on GPU 1, which isn't capable of running the task.
No, that merely implies that BOINC thinks the task is running on GPU 1 (which it shouldn't, being excluded).
I wouldn't call BOINC a reliable witness as to what is actually happening behind the scenes, in this instance.
It might be interesting to see what exactly BOINC had directed the application to do, by examining init_data.xml from the slot directory - but I'm not sure even that would be definitive, because I suspect the application is capable of over-riding an impossible directive (using its own internal OpenCL capability check).
Because, at the time this task was running, BOINC Manager showed that the task was running on device 1. I have since added an exclusion to cc_config.xml to ensure that it only runs on device 0.
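(An exclusion like that is done with an <exclude_gpu> block in cc_config.xml. The sketch below shows the general shape; the project URL, device number, and type are illustrative assumptions, not Darrell's actual file.)

```xml
<cc_config>
  <options>
    <exclude_gpu>
      <!-- exclude device 1 for this project, leaving device 0 available -->
      <url>http://einstein.phys.uwm.edu/</url>
      <device_num>1</device_num>
      <type>ATI</type>
    </exclude_gpu>
  </options>
</cc_config>
```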
Quote:
I wouldn't call BOINC a reliable witness as to what is actually happening behind the scenes, in this instance.
Being the managing program, detailing which task should run with what application on what piece of hardware, it should be a reliable witness.
And if it isn't, then that should be added PDQ.
For if you can't trust BOINC anymore to manage what task runs at what time, on what hardware, using how much memory and which application, then there's no use for BOINC really. We can do it all stand-alone.
Oh, PS I am not (yet) writing anything about this to David.
@Heinz, in case you think it's done on the CPU, I don't think so. Despite the CPU being OpenCL capable, I don't think you can run a GPU app indiscriminately on a CPU. It would also not show as having been done on the Cypress GPU.
No, in fact the BRP code does not run on OpenCL CPUs at all (it would fail with an error message).
I suspected that there might be a problem because the inclusion or exclusion of CPUs in the list of OpenCL-enabled devices might prevent the client and the science app from picking the right GPU.
From the perspective of the science app, there isn't much you can do wrong, actually: the science app calls a BOINC API function to get the correct OpenCL platform and device, and the API function does this by parsing information that the core client wrote to the init_data.xml file in the task's slot directory. As you can see from the example snippet above, the device is basically encoded by its position in a list. If OpenCL CPUs share the same list as OpenCL GPUs and can appear at lower indices... we have a problem: device number n would mean something different to the client and to older APIs. I need to inspect the source code when I have time to see how it works.
The BRP app from Einstein@Home will use the values returned by this API function and will log failures to use the device selected by the BOINC API.
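HB's point about positional encoding can be sketched with a toy example (hypothetical device names, no real OpenCL calls): if the client numbers GPUs within a GPU-only list, but an older API resolves that number against a combined CPU+GPU list, the same index lands on different hardware.

```python
# Hypothetical order in which one OpenCL platform reports its devices:
# the CPU first, then the GPUs.
all_devices = ["OpenCL CPU", "Cypress GPU", "HD3000 GPU"]

# The core client enumerates only GPUs, so "ATI instance 0" means the
# first entry of this filtered list:
gpus_only = [d for d in all_devices if d.endswith("GPU")]

device_num = 0  # what the client writes into init_data.xml

# Client's view: device 0 is the first GPU.
client_choice = gpus_only[device_num]

# An API resolving the same index against the unfiltered list instead
# picks the OpenCL CPU - a different device for the same number.
legacy_choice = all_devices[device_num]

print(client_choice, "vs", legacy_choice)
```

This is exactly the ambiguity described above: device number n is only meaningful relative to which list it indexes.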
I'm sure we should flag this for all the developers involved.
But also, if Darrell could de-exclude the GPU for this project, then wait until he sees something similar happen again, stop BOINC, and run everything through the client simulator (http://boinc.berkeley.edu/trac/wiki/ClientSim), pointing out the run he had, then one of us could bring that to the developers as well.
Apropos, BOINC 7.2.28 had OpenCL for CPUs already. So the present recommended 7.2.42 does so as well.
I can do this this weekend; I have college classes Tuesday, Wednesday, and Thursday, after working eight hours.
Utilization factor set at the default value in the preferences.
Contents of the Einstein@Home app_config.xml (0.5 GPU / 0.25 CPU per task for each app):

<app_config>
    <app>
        <name>einsteinbinary_BRP4</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.25</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>einsteinbinary_BRP4G</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.25</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>einsteinbinary_BRP5</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.25</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>hsgamma_FGRP3</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.25</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>einstein_S6CasA</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.25</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
Note: the BRP4G and S6CasA app exclusions were added to the client config XML this week, after they had completed tasks on the HD3000. I just thought it was strange that most OpenCL tasks fail immediately when trying to run on the HD3000, as you would expect, and then suddenly along come a couple that don't.
Stranger and stranger.
Would it be possible for you to download and run
http://boinc.berkeley.edu/dl/clinfo.zip
- from memory, I think the best way is to run it at an administrative command prompt and redirect the output to a text file.
Or not, not when you see [Einstein@Home] [coproc] ATI instance 1: confirming 0.500000 instance for p2030.20131122.G177.14+00.50.S.b3s0g0.00000_3792_0
That still implies it's running on GPU 1, which isn't capable of running the task.
I wonder... @Darrell, what have you set for the GPU Utilization factor for BRP tasks in http://einstein.phys.uwm.edu/prefs.php?subset=project and what are the contents of your app_config.xml file?
No, that merely implies that BOINC thinks the task is running on GPU 1 (which it shouldn't, being excluded).
I wouldn't call BOINC a reliable witness as to what is actually happening behind the scenes, in this instance.
It might be interesting to see what exactly BOINC had directed the application to do, by examining init_data.xml from the slot directory - but I'm not sure even that would be definitive, because I suspect the application is capable of over-riding an impossible directive (using its own internal OpenCL capability check).
Darrell only added the exclusion after that task ran.
In this post,
Being the managing program, detailing which task should run with what application on what piece of hardware, it should be a reliable witness.
And if it isn't, then that should be added PDQ.
For if you can't trust BOINC anymore to manage what task runs at what time, on what hardware, using how much memory and which application, then there's no use for BOINC really. We can do it all stand-alone.
Oh, PS I am not (yet) writing anything about this to David.
Ah, yes. I was looking at
but the task p2030.20131122.G177.14+00.50.S.b3s0g0.00000_3792_0 would have been Binary Radio Pulsar Search (Arecibo, GPU) v1.39 (BRP4G-opencl-ati), or einsteinbinary_BRP4G for short. When did we lose "using [short_name]" from the event log?
I still think that BOINC can be relied on to report on what instructions it has given, but I don't think it is (or was ever meant to be) an analytic tool for detecting that the instructions it has given are being accurately obeyed.
@ Darrell,
I'd still be interested in seeing BOINC's directive from init_data.xml, but don't post the whole thing - there's a surprising amount of private data in there. The bit that would be useful is the section below:
- or even just the bit I've picked out in blue.
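(For anyone reading along: the GPU-related fields in a task's init_data.xml look roughly like the sketch below. This is an illustrative guess at the relevant section, with made-up values; it is not the actual snippet highlighted in the post.)

```xml
<app_init_data>
    ...
    <gpu_type>ATI</gpu_type>
    <gpu_device_num>1</gpu_device_num>
    <gpu_opencl_dev_index>1</gpu_opencl_dev_index>
    <gpu_usage>0.500000</gpu_usage>
</app_init_data>
```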
OK, I've re-enabled my OpenCL/CPU testbed to match this problem as closely as I can here (no AMD devices, sorry):
The init_data.xml in slot 0 contains:
<gpu_type>intel_gpu</gpu_type>
<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>
<gpu_usage>1.000000</gpu_usage>
<ncpus>1.000000</ncpus>
- index 0 throughout, even though clinfo detects the GPU second:
Now that's set up, let me know if there are any useful tests I can run on this platform.
I can do this this weekend; I have college classes Tuesday, Wednesday, and Thursday, after working eight hours.
This weekend is quickly enough. :)