Thanks for the insight. I leave a few cores open to help prevent bottlenecking, which definitely helps. GPU memory shouldn't be an issue with our GPUs. I am still learning the details of GPUs, but do single-precision tasks use the same cores of the GPU as double-precision ones? I thought that our Ampere cards have more active tensor cores, which are for double-precision tasks. Am I off base with this?
Tensor cores are not the same as FP cores. Tensor cores are specialized hardware for inferencing workloads like ML and AI. No BOINC project (yet) uses this hardware.
A GA102 die like in your A6000 or the higher-end GeForce 30-series cards (or any GA10x, really) doesn't really have dedicated FP64 hardware. Pretty sure they just double up FP32 cores for that. But the higher-end Nvidia cards based on the GA100 die, like the A100, do have dedicated FP64 cores.

Edit, correction: the GA10x cards (GeForce 30x0, Ax000 "Quadro", etc.) do have FP64 hardware, but only 2 FP64 cores per SM. That isn't depicted in most architecture diagrams, so I missed it; I had to dig into the whitepaper to find it. With 128 FP32 cores per SM, that explains the 1:64 FP64:FP32 performance ratio.

The GA100 (A100) cards, on the other hand, have 32 FP64 cores and 64 FP32 cores per SM, for that nice 1:2 ratio.

They basically swapped out the FP64 cores for the Ray Tracing cores on GA10x, which are not present on GA100.
https://images.nvidia.com/aem-dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
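To make the cores-per-SM arithmetic above concrete, here is a small host-side sketch (plain C++, nothing taken from any project code; the per-SM counts are just the ones quoted above from the whitepapers) that turns them into the theoretical FP64:FP32 throughput ratios:

```cpp
#include <cstdio>

// Theoretical FP64:FP32 throughput ratio derived from cores per SM.
// Counts are the ones quoted above from NVIDIA's GA10x/GA100 whitepapers.
struct SmLayout {
    const char *name;
    int fp32_per_sm;
    int fp64_per_sm;
};

int main() {
    const SmLayout gpus[] = {
        { "GA10x (GeForce 30x0, A6000, ...)", 128, 2  },  // 2/128  -> 1:64
        { "GA100 (A100)",                      64, 32 },  // 32/64  -> 1:2
    };
    for (const SmLayout &g : gpus) {
        printf("%-36s FP64:FP32 = 1:%d\n",
               g.name, g.fp32_per_sm / g.fp64_per_sm);
    }
    return 0;
}
```

That prints 1:64 for GA10x and 1:2 for GA100, matching the ratios discussed above.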
Unless an application is coded specifically to use Tensor cores they won't be used.
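To illustrate what "coded specifically" means here: reaching the Tensor cores from CUDA C++ goes through the explicit WMMA intrinsics, they are never used by ordinary float or double arithmetic. Below is a minimal, purely hypothetical sketch of one warp doing a single 16x16x16 matrix-multiply-accumulate tile on Tensor cores; nothing like this exists in the BRP7 app, it's only here to show the opt-in nature of that hardware.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile C += A * B on the Tensor cores.
// Requires compute capability 7.0 or newer (e.g. nvcc -arch=sm_86 for Ampere).
__global__ void tensor_core_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);            // load 16x16 half tile A
    wmma::load_matrix_sync(b_frag, b, 16);            // load 16x16 half tile B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // C += A * B on Tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Launch with exactly one warp, e.g.: tensor_core_tile<<<1, 32>>>(dA, dB, dC);
```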
Bernd Machenschalk wrote: It seems with v0.11 we now have a working CUDA version on Windows. The first step is to get something working at all, then to improve validation... let's see how validation goes.
I've run dozens of tasks on the Windows/AMD v0.11 code. Most of these WUs are old ones with multiple tasks run on various hosts with older versions of the application, and I've gotten no validations from them.
But I recently got a validation, and the fabulous news is that my quorum partner was Windows/Cuda55 (so Nvidia), also running v0.11.
As I had seen zero previous cases of successful validation for my AMD against Nvidia quorum partners of any description, this is a ray of hope on the validation front.
https://einsteinathome.org/workunit/667201533
Time will tell if this is a false dawn, or a harbinger of better times ahead.
Actually, the only validation we're interested in is the one among the four latest app versions (0.08 on Linux, 0.11 on Windows). Results may still be around from all other app versions (including, unfortunately, the Windows 0.08), but those have slightly different computation code that would likely prevent successful validation.

But it seems we're getting closer: so far we have 361 valid and only one invalid result from comparisons between these app versions. By far the best validation rate yet.
BM
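For anyone wondering why "slightly different computation code" breaks validation: BOINC-style validators typically compare the numbers returned by the hosts in a quorum within some tolerance, so two builds whose math differs by more than that tolerance will never validate against each other even if both are individually fine. A generic sketch of the idea follows; this is not the actual Einstein@Home validator, and the tolerances are made up.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical tolerance-based result comparison, in the spirit of a BOINC
// validator: two results agree if every value matches within a combined
// relative/absolute tolerance. The tolerance values are invented.
bool results_agree(const std::vector<double> &a,
                   const std::vector<double> &b,
                   double rel_tol = 1e-5, double abs_tol = 1e-8) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff  = std::fabs(a[i] - b[i]);
        double scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        if (diff > abs_tol + rel_tol * scale) return false;
    }
    return true;
}

int main() {
    std::vector<double> host1 = {1.000000, 42.0};
    std::vector<double> host2 = {1.000004, 42.0};  // tiny FP difference: validates
    std::vector<double> host3 = {1.02,     42.0};  // different math path: rejected
    printf("host1 vs host2: %s\n", results_agree(host1, host2) ? "valid" : "invalid");
    printf("host1 vs host3: %s\n", results_agree(host1, host3) ? "valid" : "invalid");
    return 0;
}
```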
Hi! I'm doing v0.11 on Windows right now and it has been stuck at 99.998% for almost an hour. I guess I'm gonna abort it if computation time exceeds 3 hours.
Nearly all of the v0.08 WUs have massive problems on my computers. They start, but the GPU doesn't compute anything from the beginning. The load is between 1% and 10%, mostly at 1%. Maybe that's just the regular load caused by Linux.
Even though they show some (very slow) progress, I aborted them because they get stuck sooner or later.
Have aborted the task at 4:13:12 of running time and still at 99.998%.
Maybe there's something wrong with v0.11 because v0.03 worked fine.
0.12 is out; it should fix the memory issue reported by Petri33 (thanks!).
Hopefully this is the only reason for the "hang" problem.
BM
FWIW, the memory issue that probably caused the "hang" problem has been in the BRP7 code all along, since version 0.01. Whether and when it's triggered is a matter of the data (i.e. the workunit), not so much of the app version.
BM
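As a generic illustration of how a memory bug can sit in code "all along" and only bite on certain workunits (this is not the actual BRP7 bug, just a common pattern): a buffer sized for the usual case is silently overrun only when a particular input produces more data than expected, so most tasks finish fine while a few crash, hang, or stop making progress regardless of app version.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical data-dependent bug: the scratch buffer is sized for the
// "usual" number of candidates, so most workunits run fine. A workunit that
// happens to produce more candidates writes past the end of the buffer,
// which is undefined behaviour and can show up as a crash, wrong results,
// or (with GPU buffers) a task that just stops progressing.
void process_workunit(const std::vector<int> &candidates_per_bin) {
    const std::size_t usual_max = 1024;        // assumption baked in long ago
    std::vector<float> scratch(usual_max);

    std::size_t n = 0;
    for (int c : candidates_per_bin) n += static_cast<std::size_t>(c);

    for (std::size_t i = 0; i < n; ++i) {
        // BUG: no bound check against scratch.size(); only workunits with
        // n > usual_max ever trigger it, so the app version is irrelevant.
        scratch[i] = 0.0f;                     // out of bounds when i >= usual_max
    }
    printf("processed %zu candidates\n", n);
}
```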