Binary Radio Pulsar Search (Parkes PMPS XT) "BRP6"

mountkidd
Joined: 14 Jun 12
Posts: 176
Credit: 12595532555
RAC: 8017208

Hi HBE,

Quote:

It's like this: The search code can be thought of as a loop over "templates", where the loop has different stages.

After several years of incremental optimization, almost the complete search code within this main loop runs on the GPU. The only exception now is the management of the list of candidates to send back to the server, the "toplist". This is still done on the CPU, e.g. to periodically write the list of candidates found so far to the disk as "checkpoints", something that code on the GPU cannot do.

Originally, near the end of each loop iteration, we copied the entire result from the GPU processing step back to main RAM, where the candidate-selection code would go sequentially through those results and add to the toplist every candidate that is "better" than the last entry in the toplist.

This is somewhat wasteful. In the new version we look at the toplist *before* starting the GPU part of the iteration to give us a threshold of the minimum "strength" of a candidate for it to make it to the toplist. During the GPU processing, we take note when this threshold is crossed. If we find that the threshold was never crossed during the GPU processing, we can completely skip writing the results back to the main memory in that iteration because there can't be anything in it that will make it to the toplist. This saves PCIe bandwidth (for dedicated GPU cards) and CPU processing time because we don't need to inspect those results for candidates either.
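The threshold check described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual BRP6 code: the names `Toplist` and `run_search` are hypothetical, candidate "strength" is reduced to a single number, and the GPU step is replaced by a plain CPU loop that merely reports whether the threshold was crossed.

```python
import heapq


class Toplist:
    """Fixed-size list of the strongest candidates seen so far (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap: the weakest kept candidate sits at index 0

    def threshold(self):
        # Until the list is full, every candidate qualifies.
        if len(self._heap) < self.capacity:
            return float("-inf")
        return self._heap[0]

    def offer(self, strength):
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, strength)
        elif strength > self._heap[0]:
            heapq.heapreplace(self._heap, strength)

    def entries(self):
        return sorted(self._heap, reverse=True)


def run_search(template_batches, capacity):
    """Return (toplist entries, number of simulated GPU-to-host transfers)."""
    toplist = Toplist(capacity)
    transfers = 0
    for batch in template_batches:
        # Read the threshold *before* the (simulated) GPU step.
        threshold = toplist.threshold()
        # Stand-in for the GPU step: it only notes whether the threshold was crossed.
        crossed = any(strength > threshold for strength in batch)
        if not crossed:
            continue  # nothing can enter the toplist, so skip the copy entirely
        transfers += 1  # stand-in for copying the results back to host memory
        for strength in batch:
            toplist.offer(strength)
    return toplist.entries(), transfers
```

For example, `run_search([[5, 1], [2, 3], [9, 0], [4, 4]], capacity=2)` keeps the candidates 9 and 5 and performs three transfers: the last batch is never copied back, because nothing in it exceeds the threshold of 5 that is in place by then.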

This also explains why some workunits can be "lucky": if many strong signal candidates are found early in the search, this sets higher thresholds for all the rest of the templates and cuts down on the number of transfers needed. If a work unit has no clear outliers at all however, the toplist will build up with candidates more evenly during the runtime and the saving effect is much less.
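The "lucky" effect can be reproduced with a toy comparison. The sketch below is a simplified stand-in, not the real application: `count_transfers` is a hypothetical helper, the strengths are made-up numbers, and the "unlucky" data set is deliberately an extreme case in which every batch is slightly stronger than the one before.

```python
import heapq


def count_transfers(batches, capacity):
    """Count how many batches would have to be copied back to the host."""
    heap = []  # min-heap holding the current toplist strengths
    transfers = 0
    for batch in batches:
        # Threshold is the weakest toplist entry once the list is full.
        threshold = heap[0] if len(heap) == capacity else float("-inf")
        if max(batch) <= threshold:
            continue  # threshold never crossed: no transfer needed
        transfers += 1
        for strength in batch:
            if len(heap) < capacity:
                heapq.heappush(heap, strength)
            elif strength > heap[0]:
                heapq.heapreplace(heap, strength)
    return transfers


# Strong outliers in the very first batch set a high threshold early.
lucky = [[90, 95], [1, 2], [3, 4], [5, 6], [7, 8]]
# No outliers: the toplist keeps filling with slightly better candidates.
unlucky = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

lucky_transfers = count_transfers(lucky, capacity=2)
unlucky_transfers = count_transfers(unlucky, capacity=2)
```

With a toplist of two entries, the "lucky" input needs a single transfer while the "unlucky" one needs all five, even though both contain the same number of templates.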

This is a bit simplified and doesn't explain all the details but the gist of it should describe this effect quite well. A further optimization I'll do now is to allow for partial transfers of results from GPU memory to host memory instead of the yes/no decision implemented now.


Is the processing methodology described above in the opencl-ati beta app or is it something that can/will be added in the future?

Gord

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 578476871
RAC: 195737

As far as I know the part after "In the new version..." is in the current beta, whereas "A further optimization..." is still in development.

MrS

Scanning for our furry friends since Jan 2002

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 729203928
RAC: 1196987

Quote:

As far as I know the part after "In the new version..." is in the current beta, whereas "A further optimization..." is still in development.

MrS

Exactly!
HB

|MatMan|
Joined: 22 Jan 05
Posts: 24
Credit: 249005261
RAC: 0

An update of the cuda version (toolkit) from the old 3.2 to a more recent 5.5 or even 6.5 was discussed some time ago. It should be quite easy to do and could yield a few extra % in processing speed. Is this still on the road map or was it dropped?

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 729203928
RAC: 1196987

Quote:
An update of the cuda version (toolkit) from the old 3.2 to a more recent 5.5 or even 6.5 was discussed some time ago. It should be quite easy to do and could yield a few extra % in processing speed. Is this still on the road map or was it dropped?

Planning is still as described here: http://einsteinathome.org/node/197990&nowrap=true#138717

In a nutshell, once we have this app version stable we are planning to offer both CUDA 3.2 and 5.5 app versions for a transition period, and then we will see a) what we gain by including CUDA 5.5 support but also b) how many hosts we would lose by dropping CUDA 3.2 support and requiring CUDA 5.5+ in the future. We hope to be able to drop CUDA 3.2 support and switch to 5.5. We'll see.

HB

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2958852863
RAC: 713702

I see I'm getting an updated v1.52 for cuda32 and - new this time - intel-gpu. Anything in particular you'd like us to watch out for?

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 1

Getting them for AMD also... promoted a few to run now.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 578476871
RAC: 195737

Quote:
I'm getting an updated v1.52 for ... intel-gpu. Anything in particular you'd like us to watch out for?


My first quick feedback: link

MrS

Scanning for our furry friends since Jan 2002

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225244931
RAC: 1040956

I promoted a full set of 1.52, so I have run a total of eleven, on five different GPUs residing on three hosts. Uneventful during run time, so far as I could tell, with execution times and CPU times never far above the base population for 1.47/1.50. Perhaps this means 1.52 implements the tail-curtailing scheme Bikeman has been foreshadowing, and it works nicely, or perhaps it means this first batch I got just happened to be in the base population anyway, and the real change is something else.

Sadly, one of the eleven raised a Validate error (58:00111010). This was on the GPU which had already generated more than one on 1.50, so it may have nothing specific to do with the 1.52 changes.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 729203928
RAC: 1196987

Quote:
Perhaps this means 1.52 implements the tail-curtailing scheme Bikeman has been foreshadowing,

Yes, the version 1.52 beta apps should hopefully have a more uniform run time, not far from the mean run time of the previous beta app.

Cheers
HB
