All-Sky Gravitational Wave Search on O3 data (O3ASHF1)

Mad_Max

Joined: 2 Jan 10

Posts: 163

Credit: 2,242,749,713

RAC: 675,147

Nope, it didn't hang, but

29 Dec 2024 6:43:48 UTC

Message 231446

Nope, it didn't hang; it actually continues to work, just terribly slowly for some unknown reason. I know about the fake "emulation" of task progress in BOINC Manager when it doesn't get actual info from the running app. But I wasn't looking at the numbers in the Manager, but rather at the "boinc_task_state.xml" file in the working "slot" folder, which stores the task progress reported by the running application. Now it looks like this:

<active_task>
    <project_master_url>https://einstein.phys.uwm.edu/</project_master_url>
    <result_name>h1_0170.80_O3aLC01Cl1In0__O3ASBu_171.00Hz_49994_1</result_name>
    <checkpoint_cpu_time>124833.600000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>126383.105768</checkpoint_elapsed_time>
    <fraction_done>0.005222</fraction_done>
    <peak_working_set_size>1040556032</peak_working_set_size>
    <peak_swap_size>1094828032</peak_swap_size>
    <peak_disk_usage>30425920</peak_disk_usage>
</active_task>

So the app reports 0.5222% done. And it increments in steps of about 0.0111%, not 0.001%.
Which makes perfect sense, as each GW WU contains 9000 "sky dots" that need to be processed:

2024-12-28 00:08:05.7205 (6012) [normal]: Cpt:8, total:9000, sky:1/225, f1dot:9/40

225*40 = 9000 dots total to process

And 1/9000 = +0.0111% after each successfully processed dot

0.5222% done = 47/9000 "dots" were processed by the app so far.
The ".c" sequence repeated in the log only means that the app writes a checkpoint (marked by the 'c' char) after every processed dot (marked by the '.' char).
Whereas an RX570 (or any other decent, normally working GPU) is able to do a few hundred "dots" between checkpoints, so its log output looks like "...............................................c................".
BOTH work, but with a ~1000x speed difference.
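
For anyone who wants to check the arithmetic above, here is a minimal Python sketch. It only uses the two fields copied from the boinc_task_state.xml fragment and the 225 x 40 grid from the log line; the function name `summarize` and the dictionary keys are made up for illustration, and rounding fraction_done to a whole number of dots is my assumption.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the task state shown above (only the fields used here).
TASK_STATE = """
<active_task>
    <checkpoint_elapsed_time>126383.105768</checkpoint_elapsed_time>
    <fraction_done>0.005222</fraction_done>
</active_task>
"""

SKY_POSITIONS = 225  # from "sky:1/225" in the log line
F1DOT_STEPS = 40     # from "f1dot:9/40" in the log line


def summarize(xml_text, total_dots=SKY_POSITIONS * F1DOT_STEPS):
    """Convert the reported fraction_done into a count of processed dots."""
    root = ET.fromstring(xml_text.strip())
    fraction = float(root.findtext("fraction_done"))
    elapsed = float(root.findtext("checkpoint_elapsed_time"))
    # Assumption: progress advances by exactly one dot at a time,
    # so the fraction is rounded to the nearest whole dot.
    dots_done = round(fraction * total_dots)
    return {
        "total_dots": total_dots,
        "step_percent": 100.0 / total_dots,
        "dots_done": dots_done,
        "seconds_per_dot": elapsed / dots_done if dots_done else None,
    }


summary = summarize(TASK_STATE)
print(summary)
```

With the numbers posted above this gives 47 dots out of 9000 and roughly 2689 seconds of elapsed time per dot.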

And I'm very interested in why this might happen, because even software emulation of all OpenCL functions on the CPU should be much faster. We've already seen this with another application, but then it looked different: also 100% usage of one CPU core, but zero GPU load and a resulting difference in processing speed of "only" a few dozen times, not more than 1000 times as in this example.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4,081

Credit: 48,693,412,899

RAC: 34,718,182

i'm confident that it's not

29 Dec 2024 16:41:36 UTC

Message 231454

i'm confident that it's not actually doing any work. it's just counting at the minimal rate by default since that's the BOINC behavior. it's checkpointing after some regular time interval defined in BOINC, not because it reached some milestone in the app to trigger a checkpoint. i said .001 as a general response, i wasn't trying to be precise. if the minimal counting step is .00111 instead of .001, then so be it. all of your proof about the run percentage and task state is a consequence of the task ticking along at default behavior, not any proof that the task is actually writing any meaningful results.

go ahead and let the task "finish". it wont. it's hung or stuck in some kind of infinite loop. it might get to 99.999 but will never complete

just stick to the BRP7 tasks. the GCN 1.0 GPUs are too old.

_________________________________________________________________________

Tom M

Joined: 2 Feb 06

Posts: 6,676

Credit: 9,694,006,492

RAC: 2,188,084

Mad_Max wrote: For example,

30 Dec 2024 20:17:26 UTC

Message 231497 in response to message 231439

Mad_Max wrote:

For example, AMD RX570 (GCN 4.0 micro-architecture, supports OpenCL 2.0) performs one task in about 1.5 hours, provided sufficient CPU support to avoid GPU starvation (in the statistics of my computers, the execution time is usually about 2 times longer, but this is because they process 2 GW tasks in parallel or sometimes even 1 GW task + 2 BRP7/MeerKAT tasks).

Usually, having BRP7/MeerKAT and GW tasks running at the same time on the same GPU results in BOTH running slower.

A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!

Mad_Max

Joined: 2 Jan 10

Posts: 163

Credit: 2,242,749,713

RAC: 675,147

But it is not BOINC report

5 Jan 2025 20:27:44 UTC

Message 231731 in response to message 231454

But it is not BOINC that reports the progress and writes this log; it's the app itself.

it's checkpointing after some regular time interval defined in BOINC, not because it reached some milestone in the app to trigger a checkpoint.

You're getting it wrong. That's not how it works at all. Applications make checkpoints only when they reach certain points in the calculation set by the APPLICATION programmers, where it is possible/convenient to record the state (and later restore from it). The BOINC client simply cannot influence this. All that the corresponding setting in the BOINC client does is tell the app "please do not write checkpoints more often than once every xx minutes." But when, in which places, and how often to write them is up to the scientific application alone. The corresponding option is even worded accordingly:

"Request task to checkpoint at most every xxx seconds"

The app can follow this recommendation by skipping the next checkpoint if less than the specified interval has passed since the previous one was recorded, or it can ignore the recommendation. But in any case, checkpoints are written only at points predefined by the app programmer, when the calculation reaches them. This is true in theory and has been tested repeatedly in practice by me and many other users. For a fresh example with the FGRP5 application, see another topic: https://einsteinathome.org/content/strange-wus-names-and-checkpoint-issues-latest-fgrp5-batch
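
To illustrate the point, here is a toy Python simulation of that policy. It is not the real application code: `run_log` is a hypothetical function, the 60 second minimum interval is an assumed client setting, and the per-dot times are illustrative (2700 s is roughly what the slow task above works out to, 0.5 s is just a stand-in for a normally working GPU).

```python
def run_log(seconds_per_dot, n_dots, min_checkpoint_interval=60.0):
    """Return the '.'/'c' log string produced while processing n_dots dots."""
    log = []
    now = 0.0
    last_checkpoint = 0.0
    for _ in range(n_dots):
        now += seconds_per_dot  # the app finishes one dot
        log.append(".")
        # A checkpoint is only possible here, at a point chosen by the app.
        # The client setting only limits how often it may be written.
        if now - last_checkpoint >= min_checkpoint_interval:
            log.append("c")
            last_checkpoint = now
    return "".join(log)


slow = run_log(seconds_per_dot=2700.0, n_dots=3)   # assumed slow GPU
fast = run_log(seconds_per_dot=0.5, n_dots=250)    # assumed normal GPU
print(slow)
print(fast)
```

In this sketch the slow run writes a 'c' after every single dot, because each dot alone already takes longer than the requested interval, while the fast run gets through many dots between two checkpoints. The checkpoint still only ever happens right after a dot, never in the middle of one.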

go ahead and let the task "finish". it wont. it's hung or stuck in some kind of infinite loop. it might get to 99.999 but will never complete

That's exactly what I was going to do out of curiosity, to check whether the result produced is valid or not. But unfortunately the application is interrupted after a few days of work by a "watchdog timer" implemented by the E@H programmers to limit the maximum working time of one WU.

But up to this point, the app has managed to process 10 sky positions (out of a total of 225, i.e. about 4.5% were processed at the time of the interruption), 40 dots each. In addition to the dots, the log also contains the number of the sky position that is currently being processed, and they change just like in all normally working tasks after successful processing of 40 dots.
I also restarted the task on purpose once, and it successfully read the recorded checkpoints and continued working from the last recorded one (at 6/225 sky position), which proves the validity of the intermediate data.
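
If someone wants to follow the sky position counter in their own logs, a minimal Python sketch for pulling the numbers out of a progress line could look like this. The regular expression is written against the single log line quoted earlier in this thread, so treating it as the general log format is an assumption, and the names `parse_progress` and `dots_after_sky_positions` are made up for illustration.

```python
import re

# The progress line quoted earlier in this thread.
LINE = ("2024-12-28 00:08:05.7205 (6012) [normal]: "
        "Cpt:8, total:9000, sky:1/225, f1dot:9/40")

PATTERN = re.compile(
    r"Cpt:(?P<cpt>\d+), total:(?P<total>\d+), "
    r"sky:(?P<sky>\d+)/(?P<sky_total>\d+), "
    r"f1dot:(?P<f1dot>\d+)/(?P<f1dot_total>\d+)"
)


def parse_progress(line):
    """Return the counters from one progress line, or None if it doesn't match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def dots_after_sky_positions(n_sky, progress):
    """Number of dots contained in n_sky fully processed sky positions."""
    return n_sky * progress["f1dot_total"]


p = parse_progress(LINE)
done = dots_after_sky_positions(10, p)
percent = 100.0 * done / p["total"]
print(p, done, round(percent, 2))
```

For the 10 sky positions mentioned above this comes out at 400 of the 9000 dots, i.e. about 4.4% of the WU.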

Link to the task in the database: https://einsteinathome.org/task/1703855450

And another one: https://einsteinathome.org/task/1703741537

A copy of the log (because the tasks at the links above will soon be deleted from the E@H database): https://pastebin.com/vG89f1fv

just stick to the BRP7 tasks

That's what I originally intended to do. But as I wrote last time, BRP7 WUs compute fine on the client but produce a huge number of validation errors on the server side, probably also because of the different SW platforms (OpenCL 1.2 vs OpenCL 2.0 vs CUDA). So I doubt whether letting it run BRP7 does any good at all, or whether it even does more harm, because often two additional tasks are sent to other participants after each validation error before the WU finally gets a "canonical" scientific result.

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 523

Credit: 10,505,917,332

RAC: 5,748,435

Thanks for the rock solid

6 Jan 2025 7:27:21 UTC

Message 231751 in response to message 231731

Thanks for the rock solid explanations ...

tish

Joined: 12 Jan 20

Posts: 5

Credit: 1,212,998,249

RAC: 7,018,634

hello im not receiving any

12 Jan 2025 15:05:31 UTC

Message 232012

hello

I'm not receiving any new O3 tasks

is it only me?

Harri Liljeroos

Joined: 10 Dec 05

Posts: 4,519

Credit: 3,300,800,873

RAC: 1,972,757

Me neither...

12 Jan 2025 17:25:45 UTC

Message 232013

Me neither...

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 523

Credit: 10,505,917,332

RAC: 5,748,435

... one of those typical

12 Jan 2025 21:08:07 UTC

Message 232017

... one of those typical weekends ...

Harri Liljeroos

Joined: 10 Dec 05

Posts: 4,519

Credit: 3,300,800,873

RAC: 1,972,757

They seem to be back again.

12 Jan 2025 22:12:13 UTC

Message 232021

They seem to be back again.

tish

Joined: 12 Jan 20

Posts: 5

Credit: 1,212,998,249

RAC: 7,018,634

yep!

13 Jan 2025 14:46:31 UTC

Message 232035 in response to message 232021

yep!
