When Maxwell, with its relatively large L2 cache, launched, we had a short but interesting discussion: the current app streams the entire data array for each operation. That's the usual approach for GPUs, since they have massive memory bandwidth, a massive number of execution units with long latencies, and small caches.
For CPUs one would do it the other way around: perform several calculations on a subset of the data that fits into the cache, and only step through the rest of the array once the previous block has finished.
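Just to illustrate what I mean, here is a rough C sketch. The names, sizes and the "pass" structure are made up for illustration, not taken from the actual BRP code; whether the real processing allows this kind of blocking depends on its data dependencies, which I can't judge from here. The streaming version runs each pass over the whole array, the blocked version keeps a cache-sized chunk resident and runs all passes on it before moving on.

#include <stddef.h>

#define N_POINTS (1u << 24)   /* total data points (made-up size) */
#define N_PASSES 8            /* processing passes per point (made-up) */
#define TILE     (1u << 16)   /* block sized to fit into the cache (made-up) */

/* stand-in for one processing pass; the real app does something else */
static void process_pass(float *data, size_t count, int pass)
{
    for (size_t i = 0; i < count; ++i)
        data[i] = data[i] * 0.5f + (float)pass;
}

/* streaming (current GPU style): each pass reads the whole array,
   so memory sees the full array N_PASSES times */
static void run_streaming(float *data)
{
    for (int pass = 0; pass < N_PASSES; ++pass)
        process_pass(data, N_POINTS, pass);
}

/* blocked (CPU style): run all passes on one cache-sized block
   before moving on, so memory sees the array roughly once */
static void run_blocked(float *data)
{
    for (size_t start = 0; start < N_POINTS; start += TILE) {
        size_t count = N_POINTS - start < TILE ? N_POINTS - start : TILE;
        for (int pass = 0; pass < N_PASSES; ++pass)
            process_pass(data + start, count, pass);
    }
}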
Currently we have a nicely optimized GPU app, which uses almost all the GPU memory bandwidth it can get and shows strong signs of being limited by that bandwidth. With modern GPUs such as Maxwell moving to larger caches and generally focusing on keeping the execution units busy, the question arose: is the traditional streaming scheme still the best option? We didn't pursue this thought any further at the time, as the PCIe communication optimization had higher priority, but I think it would be worth giving this a further look. Apart from Maxwell, the AMD and Intel integrated GPUs could benefit especially, since they have limited bandwidth but comparatively large caches.
What do you guys think? Obviously I haven't seen the code, so I can only speculate. But from this comfortable distance it surely sounds worth a try.
Edit: the "current working set", i.e. the number of data points the chip would have to keep in flight, would not need to fit into the cache entirely. Even if it exceeds the cache size by a factor of 3, 1/3 of all memory operations would still be served from the cache, which should, to a first approximation, reduce the memory bandwidth requirement by 1/3.
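In numbers (my own back-of-the-envelope estimate, assuming perfect reuse of whatever fits in the cache):

    external traffic ≈ streaming traffic × (1 - cache_size / working_set)

so with a working set of 3× the cache size, about 1/3 of the accesses are served from the cache and the external traffic drops to roughly 2/3 of the streaming case.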
MrS
Scanning for our furry friends since Jan 2002
Any thoughts or work on this topic? Seeing how the new GPUs have an ever-decreasing ratio of memory bandwidth to TFLOPS (and make up for that in games by using delta color compression, which is of no help here), the benefit of relieving the memory bandwidth requirements of the BRP app is increasing.
MrS
Scanning for our furry friends since Jan 2002
In short: we don't have any resources to work on the BRP app further.
As far as we are concerned, the BRP app has been developed to its full extent. In this environment of very heterogeneous GPUs on E@H, the benefit of further development for any particular GPU type doesn't justify the effort we would need to invest.
BM
Fair enough.. thanks!
MrS
Scanning for our furry friends since Jan 2002