I thought we went over this already. That issue doesn't seem to be in play anymore, or at least it is not implemented in the way you're describing.

I've personally watched my 3 GB NVIDIA card happily use 2 GB of GPU memory (66%) and my 4 GB NVIDIA cards happily use 3.2 GB (80%) on these Einstein OpenCL tasks. That's much more than 25%, and on single tasks.

I think the limit applies only to the size of a single buffer, which nearly all implementations (including SETI and Einstein) work around by using multiple buffers, effectively lifting the limit.
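For what it's worth, the workaround being described is roughly the pattern sketched below. This is a minimal illustration only, not the actual Einstein or SETI code, and the chunk count and sizes are invented; it just shows that several clCreateBuffer calls, each under the reported per-buffer limit, can add up to far more than 25% of the card's memory.

/* Hypothetical sketch: spread one large working set across several
 * OpenCL buffers so that no single clCreateBuffer call exceeds
 * CL_DEVICE_MAX_MEM_ALLOC_SIZE. */
#include <CL/cl.h>
#include <stdio.h>

#define NUM_CHUNKS 4    /* illustrative: 4 smaller buffers instead of 1 huge one */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_ulong max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* A total working set twice the per-buffer limit (~50% of VRAM on
     * NVIDIA), split into chunks that each stay below that limit. */
    size_t total = (size_t)max_alloc * 2;
    size_t chunk = total / NUM_CHUNKS;

    cl_mem bufs[NUM_CHUNKS];
    for (int i = 0; i < NUM_CHUNKS; i++) {
        bufs[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, chunk, NULL, &err);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "chunk %d failed with error %d\n", i, err);
            return 1;
        }
    }
    printf("allocated %zu bytes total in %d buffers (per-buffer limit %llu)\n",
           total, NUM_CHUNKS, (unsigned long long)max_alloc);

    for (int i = 0; i < NUM_CHUNKS; i++)
        clReleaseMemObject(bufs[i]);
    clReleaseContext(ctx);
    return 0;
}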
_________________________________________________________________________
But the limit is there. It is defined in the host details whenever OpenCL is probed for the card.

Whether the limit can be managed by some clever coding in your application is unknown to me, as I am not a developer.

I only state that the 25% limit is real. I don't know how the project app developers get around the issue.
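For reference, that figure in the host details comes straight from the standard OpenCL device query. A minimal sketch of the probe (assuming a single NVIDIA platform with one GPU) would be:

/* Query the two values the 25% figure is derived from.
 * Build with something like: gcc probe.c -lOpenCL */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong global_mem = 0, max_alloc = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    /* NVIDIA's OpenCL driver typically reports the ratio as ~25%. */
    printf("global mem: %llu MB, max single allocation: %llu MB (%.0f%%)\n",
           (unsigned long long)(global_mem >> 20),
           (unsigned long long)(max_alloc >> 20),
           100.0 * (double)max_alloc / (double)global_mem);
    return 0;
}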
_________________________________________________________________________
If you follow your own link from the previous post (https://forums.developer.nvidia.com/t/why-is-cl-device-max-mem-alloc-size-never-larger-than-25-of-cl-device-global-mem-size-only-on-nvidia/47745) to the end, you will find this (posted in 2017):

"The Nvidia driver in practice seems to successfully allocate single memory chunks (for OpenCL) far beyond the reported CL_DEVICE_MAX_MEM_ALLOC_SIZE value, and everything works well (but of course such allocations are inappropriate for production code)."

And this (posted by an NVIDIA development team representative):

"Developers can try to allocate more memory than CL_DEVICE_MAX_MEM_ALLOC_SIZE, but the successful allocation is not guaranteed (this is the same for any allocation call). Developers should check the error returned by clCreateBuffer and use the allocation only if the call returns CL_SUCCESS."

So it looks like this limit is reported but not actually enforced; an allocation beyond this value is merely "not guaranteed". In practice, everything usually works as if no such limit existed.

A strange decision from NVIDIA, and they are not going to fix it. Probably another trick to push software developers away from the open industry standard (OpenCL) toward their own proprietary API (CUDA), where no such "fake" limit exists.
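The pattern those quotes describe would look something like the sketch below (illustration only; the 1.5x request size is arbitrary, and the fallback path is just indicated in a comment):

/* Ask for a single buffer larger than CL_DEVICE_MAX_MEM_ALLOC_SIZE and
 * only use it if clCreateBuffer actually returns CL_SUCCESS. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    cl_ulong max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    /* Deliberately request 1.5x the reported per-buffer limit. */
    size_t want = (size_t)max_alloc + (size_t)(max_alloc / 2);
    cl_mem big = clCreateBuffer(ctx, CL_MEM_READ_WRITE, want, NULL, &err);

    if (err == CL_SUCCESS) {
        /* The NVIDIA driver often accepts this, as the quotes describe;
         * a strictly conforming implementation may instead return
         * CL_INVALID_BUFFER_SIZE, which is why the check is mandatory. */
        printf("oversized buffer of %zu bytes accepted\n", want);
        clReleaseMemObject(big);
    } else {
        printf("oversized buffer rejected (error %d); fall back to "
               "multiple smaller buffers\n", err);
    }
    clReleaseContext(ctx);
    return 0;
}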
_________________________________________________________________________
Keith Myers wrote:
"But the limit is there. It is defined in the host details whenever OpenCL is probed for the card."
All I can say is that this limit doesn't seem to affect Einstein at all, given the way the apps here are coded, so there's really no point in mentioning it.

The Gamma-ray tasks use only a small amount of GPU RAM, and the Gravitational Wave tasks must be using multiple buffers, since they can use 100% of the GPU memory. That is actually the cause of all the issues here with GW tasks: the GPU RAM filling up past 100%, not 25%.
_________________________________________________________________________
OK, I will try to refrain from posting from long-learned muscle memory. Not an issue, so it should never be mentioned in the future.
_________________________________________________________________________
But it's demonstrably not applicable here. It's an interesting nugget of knowledge, but why mention it if it doesn't apply to Einstein? It won't help anyone solve any problems here, since it causes no problems here. It's really information for the devs, so they know their limits when writing their software, and the devs here have, knowingly or unknowingly, coded the apps in such a way that they are not limited by this. How else could we be seeing successful executions using more than 25% if it were a hard limit?

It probably hasn't been an issue for NVIDIA GPUs for a LONG time, since the amount of GPU RAM on most cards has grown. Maybe if you had a GPU with less than 1 GB of RAM you might run into this 25% single-buffer limit, but how many people are running GPUs with that little VRAM anymore?

It's kind of like the whole PCIe bandwidth thing that gets brought up every now and then. Times have changed: maybe it was a problem in the past, but there's no use bringing it up if it's no longer a problem. (GPUGRID is the only project I've found where PCIe bandwidth has a noticeable impact.)
_________________________________________________________________________
I do see the VRAM allocation (via nvidia-smi) ramp up in fairly small steps when an OpenCL GW task starts. This morning I've had several consecutive "large" (i.e. >3 GB) tasks run on my GTX 1060 (6 GB). It takes about 20 seconds from the start of the task to top out. In one example, probing nvidia-smi at roughly 2-second intervals, the VRAM usage (MB) went: 8, 317, 439, 691, 819, 997, 1253, 1381, 1571, 1715, 1963, 3031, 3295. Mostly steps of a few hundred MB, except for the next-to-last one. This suggests, to me, that the app IS managing multiple buffers and thus avoiding the 25% OpenCL constraint. It also sheds light on why low-RAM GPUs don't crash immediately when trying to run a WU that (eventually) needs more VRAM: it isn't all requested in one chunk.
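For anyone who wants that kind of trace without eyeballing nvidia-smi, a small polling loop against NVML (the library nvidia-smi itself is built on) does the same job. A rough sketch, assuming GPU index 0 and a ~60 second observation window:

/* Print used VRAM every ~2 seconds, roughly what repeatedly running
 * "nvidia-smi --query-gpu=memory.used" shows.  Link with -lnvidia-ml. */
#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);   /* GPU 0 assumed */

    for (int i = 0; i < 30; i++) {         /* ~60 seconds of samples */
        nvmlMemory_t mem;
        if (nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS)
            printf("%3d s: %llu MB used\n", i * 2,
                   (unsigned long long)(mem.used >> 20));
        sleep(2);
    }
    nvmlShutdown();
    return 0;
}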
_________________________________________________________________________
It's not just requested in steps: the CPU part of the code needs to prepare the data first, before loading it into GPU RAM.

The actual GPU computation starts only after GPU RAM usage reaches its maximum, i.e. once all the needed data has been prepared on the CPU and loaded into GPU RAM.
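Schematically, that staging looks something like the sketch below (illustrative only, not the actual Einstein app; the chunk count and 256 MB size are invented). Each iteration prepares a piece of data on the CPU and then pushes it to the GPU, which is consistent with the stepped VRAM growth seen above; the compute kernels would only be enqueued after the loop.

/* Prepare data on the CPU, upload it to GPU RAM piece by piece,
 * and only then start the GPU work. */
#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>

#define CHUNKS     4
#define CHUNK_SIZE (256u << 20)   /* 256 MB per chunk, made-up figure */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    void *staging = malloc(CHUNK_SIZE);    /* CPU-side staging area */
    cl_mem bufs[CHUNKS];

    for (int i = 0; i < CHUNKS; i++) {
        /* The CPU part of the code prepares the data first... */
        memset(staging, i, CHUNK_SIZE);

        /* ...then it is loaded into GPU RAM; used VRAM grows in steps. */
        bufs[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, CHUNK_SIZE, NULL, &err);
        clEnqueueWriteBuffer(q, bufs[i], CL_TRUE, 0, CHUNK_SIZE,
                             staging, 0, NULL, NULL);
    }
    clFinish(q);
    /* Only at this point would the actual compute kernels be enqueued. */

    for (int i = 0; i < CHUNKS; i++)
        clReleaseMemObject(bufs[i]);
    free(staging);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}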