Problem with scheduling rules at BRP search?

PulsarOperator

Joined: 29 Jun 20

Posts: 4

Credit: 22420875

RAC: 0

1 Nov 2020 9:00:27 UTC

Topic 223872


Hello all,

I recently switched my E@H effort from Gamma-ray pulsar binary search #1 by GPU to Binary Radio Pulsar Search by GPU. Checking my account I found an increased number of calculations marked as invalid. For example, work unit

495030425

Here, one can see that my computer

12839283

got the task

p2030.20170613.G38.11+01.32.S.b5s0g0.00000_3263_4

(at 28 Oct 2020 6:31:36 UTC) before the second open task

p2030.20170613.G38.11+01.32.S.b5s0g0.00000_3263_1

had been closed (28 Oct 2020 10:18:00 UTC).

Since task _1 was successful, it seems to me that my result for _4 was marked as invalid because it was no longer needed, especially since the calculation log of my task reports a successful calculation.

So for me, it seems to be a problem with the scheduling rules. Could you please double-check this?

Cheers,

PulsarOperator

archae86

Joined: 6 Dec 05

Posts: 3161

Credit: 7272551730

RAC: 1815434

1 Nov 2020 13:36:59 UTC

Message 180794


That machine has generated 25 invalid results on _0 and _1 tasks on the same application within the past week, with only 107 valid ones, from a mix of _0, _1, _2, and _3 tasks.

I think your assertion that a scheduler error is somehow responsible for one out of 26 invalid tasks, on a machine that is generating invalid results at a much higher rate than we see on healthy machines, is highly speculative.
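To put a number on that rate, here is a quick calculation using only the two counts quoted above. It ignores pending or errored tasks, which the post does not mention, so treat it as a rough figure rather than an official statistic.

```python
# Counts quoted in the post above; pending or errored tasks are not included.
invalid = 25
valid = 107

# Fraction of completed-and-checked results that failed validation.
invalid_rate = invalid / (invalid + valid)
print(f"Invalid rate: {invalid_rate:.1%}")  # prints "Invalid rate: 18.9%"
```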

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5874

Credit: 118372142170

RAC: 25542123

1 Nov 2020 23:37:48 UTC

Message 180810


PulsarOperator wrote:

.... For example, work unit

495030425

Here, one can see that my computer

12839283

got the task

p2030.20170613.G38.11+01.32.S.b5s0g0.00000_3263_4

If you would like people to easily see things like work unit IDs, computer IDs, or task IDs, please consider making them clickable links. It saves a lot of unnecessary stuffing around for any volunteers who are otherwise prepared to offer assistance. The lack of such links is probably a major reason why you might not get responses at all, particularly in cases where your computers are also 'hidden', as yours are.

Creating a clickable link is really quite trivial for you to do. If you don't know how, please read the BBCode Help which is always accessible by clicking to expand that help section which is immediately below the message composition box where you are preparing your problem report.

PulsarOperator wrote:

Since task _1 was successful, it seems to me that my result for _4 was marked as invalid because it was no longer needed, especially since the calculation log of my task reports a successful calculation.

So for me, it seems to be a problem with the scheduling rules. Could you please double-check this?

"Scheduling rules", which are applied by the scheduler, have nothing to do with the validation process. A separate program (the validator) checks the returned results for agreement. Your result would only be rejected if it didn't agree closely enough with the others. Additional results are always accepted if they meet the validation tolerances and are returned prior to the task deadline. A "successful" result doesn't guarantee that the data returned does meet those tolerances.
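As a rough illustration of the idea, the sketch below compares returned results pairwise and accepts the first group that agrees within a tolerance. It is not the actual Einstein@Home or BOINC validator: the function names, the quorum of 2, the relative tolerance of 1e-5, and the sample numbers are all assumptions made up for this example.

```python
import math

def agree(a, b, rel_tol=1e-5):
    """Hypothetical check: two results agree if every value is within rel_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )

def validate(results, quorum=2, rel_tol=1e-5):
    """Mark each result valid or invalid once a quorum of agreeing results exists.

    `results` maps a task suffix to its list of computed values (made-up data).
    """
    names = list(results)
    for candidate in names:
        # Collect every result that agrees with this candidate (including itself).
        group = [n for n in names if agree(results[candidate], results[n], rel_tol)]
        if len(group) >= quorum:
            return {n: ("valid" if n in group else "invalid") for n in names}
    # No agreeing group yet: a real project would send out another copy of the task.
    return {n: "inconclusive" for n in names}

# Made-up values: _4 finished without error but differs too much from the others.
status = validate({
    "_1": [1.0, 2.0],
    "_3": [1.000001, 2.000001],
    "_4": [1.01, 2.0],
})
print(status)  # {'_1': 'valid', '_3': 'valid', '_4': 'invalid'}
```

Note that nothing in this sketch looks at when a task was sent out or whether it ran to completion without errors; only the returned numbers matter.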

There are two general reasons why the Arecibo radio pulsar search is likely to continue giving you these problems. Firstly, that search is really designed to supply work for small portable devices like phones, tablets, Raspberry Pis, etc. These span a wide range of hardware types, operating systems, drivers, crunching applications, math libraries, rounding errors, etc., all of which affect the accuracy of the calculated answers. It could easily be that accumulated rounding error forces the validator to request an additional result, and that two such results with rounding errors different from yours cause your result to be rejected, even if yours happened to be the 'most accurate'.
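A toy example of how rounding depends on the order of operations, which is one way different hardware or libraries can end up with slightly different answers. This is a generic floating-point illustration with made-up values, not something taken from the search application itself.

```python
# The same three numbers, added in two different orders (64-bit floats).
values = [1e16, 1.0, -1e16]

# Adding 1.0 to 1e16 first: the 1.0 is lost to rounding before the big terms cancel.
forward = (values[0] + values[1]) + values[2]

# Cancelling the big terms first: the 1.0 survives.
reordered = (values[0] + values[2]) + values[1]

print(forward, reordered)  # prints "0.0 1.0"
```

Real discrepancies are usually far smaller than this, but over a long calculation they can add up to more than a validator's tolerance allows.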

The second reason is to do with using Intel GPUs. Intel GPUs of all the different types, with many different driver versions, have a long history of giving results that fail validation. From what has been reported previously, it seems that some driver versions give imprecise answers when used for crunching.

If you look at the particular quorum that contained the task ID you listed, you can see 5 results, 3 of which went to ARM-type devices with the other two going to Intel GPUs. One of those two was listed as "INTEL Intel(R) HD Graphics 4000 (1400MB)" whilst yours is listed as "INTEL Intel(R) HD Graphics 630 (3230MB)".

You will notice that the other result was validated but yours wasn't. This is a classic example of different Intel devices, probably with different driver versions, giving answers that don't agree closely enough with each other.

My advice to you is to go back to using your NVIDIA GPU on the gamma-ray pulsar tasks (which you know works well) and get rid of the problems you will continue to have with the radio pulsar tasks on your Intel GPU.

Cheers,
Gary.

PulsarOperator

Joined: 29 Jun 20

Posts: 4

Credit: 22420875

RAC: 0

Do you have any clue why so

2 Nov 2020 12:11:01 UTC

Message 180816 in response to message 180794

