Nope, it didn't hang; it actually continues to work, just terribly slowly for some unknown reason. I know about the fake "emulation" of task progress in BOINC Manager when it doesn't get actual info from the running app. But I didn't look at the numbers in the Manager; I looked at the "boinc_task_state.xml" file in the working "slot" folder, which stores the task progress reported by the running application. Now it looks like this:
So the app reports 0.5222% done. And it increments in steps of about 0.0111%, not 0.001%.
Which makes perfect sense, as each GW WU contains 9000 "sky dots" that need to be processed:
225*40 = 9000 dots total to process
And 1/9000 = +0.0111% after each successfully processed dot.
0.5222% done = 47/9000 "dots" were processed by the app so far.
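For anyone who wants to double-check the arithmetic, here is a minimal Python sketch. The values 225 and 40 are the ones from the task log; the variable and function names are just illustrative and not part of any BOINC or E@H tool.

```python
# Progress arithmetic for one GW work unit (numbers taken from the post).
SKY_POSITIONS = 225      # sky positions per WU
DOTS_PER_POSITION = 40   # "dots" processed per sky position

total_dots = SKY_POSITIONS * DOTS_PER_POSITION   # 9000
step_percent = 100.0 / total_dots                # progress added per dot, in %


def dots_done(percent_done: float) -> int:
    """Convert a reported percentage into the number of finished dots."""
    return round(percent_done / 100.0 * total_dots)


print(total_dots)               # 9000
print(f"{step_percent:.4f}%")   # 0.0111%
print(dots_done(0.5222))        # 47
```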
".c" sequence repeated in the log means only that app write checkpoint (marked by 'c' char) after every processed dot (marked by '.' char).
While RX570(and any other normally working decent GPU) able to do few hundreds "dots" between checkpoints intervals so logs output looks like "...............................................c................".
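To compare two logs without eyeballing them, something like the following rough sketch could count how many dots precede each checkpoint marker. It assumes the input contains only the progress-marker part of the log ('.' and 'c' characters); a real stderr log has other text mixed in that would have to be filtered out first. The function name and the sample strings are made up for illustration.

```python
def dots_per_checkpoint(log: str) -> list[int]:
    """Count the '.' characters written before each 'c' marker."""
    counts, current = [], 0
    for ch in log:
        if ch == ".":
            current += 1
        elif ch == "c":
            counts.append(current)
            current = 0
    return counts  # dots after the last 'c' are not reported


slow_log = ".c.c.c.c"                       # checkpoint after every dot
fast_log = "." * 200 + "c" + "." * 150      # hypothetical healthy GPU

print(dots_per_checkpoint(slow_log))   # [1, 1, 1, 1]
print(dots_per_checkpoint(fast_log))   # [200]
```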
BOTH work, but with a ~1000 times speed difference.
And I'm very interested in why this might happen, because even software emulation of all OpenCL functions on the CPU should be much faster. We've already seen something similar with another application, but it looked different then: also 100% usage of one CPU core, but completely zero GPU load, and a resulting difference in processing speed of "only" a few dozen times, not more than 1000 times as in this example.
I'm confident that it's not actually doing any work. It's just counting at the minimal rate by default, since that's the BOINC behavior. It's checkpointing after some regular time interval defined in BOINC, not because it reached some milestone in the app to trigger a checkpoint. I said .001 as a general response; I wasn't trying to be precise. If the minimal counting step is .00111 instead of .001, then so be it. All of your proof about the run percentage and task state is a consequence of the task ticking along at default behavior, not any proof that the task is actually writing any meaningful results.
Go ahead and let the task "finish". It won't. It's hung or stuck in some kind of infinite loop. It might get to 99.999 but will never complete.
Just stick to the BRP7 tasks. The GCN 1.0 GPUs are too old.
Mad_Max wrote: For example, an AMD RX570 (GCN 4.0 microarchitecture, supports OpenCL 2.0) completes one task in about 1.5 hours, provided there is sufficient CPU support to avoid GPU starvation (in the statistics of my computers, the execution time is usually about 2 times longer, but this is because they process 2 GW tasks in parallel, or sometimes even 1 GW task + 2 BRP7/MeerKAT tasks).
Usually, having BRP7/MeerKAT and GW tasks running at the same time on the same GPU results in BOTH running slower.
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!
But it is not BOINC that reports the progress and writes this log; it's the APP itself.
It's checkpointing after some regular time interval defined in BOINC, not because it reached some milestone in the app to trigger a checkpoint.
You're getting it wrong. That's not how it works at all. Applications make checkpoints only when they reach certain points in the calculation set by the APPLICATION programmers, where it is possible/convenient to record a checkpoint (and later restore from it). The BOINC client simply cannot influence this. All that the corresponding setting in the BOINC client does is tell the app "please do not write checkpoints more often than once every xx minutes." But when, at which points, and how often to write them is up to the scientific application alone. The corresponding option is even worded accordingly:
"Request task to checkpoint at most every xxx seconds"
The APP can follow this recommendation by skipping the next checkpoint if less than the specified interval has passed since the previous one was recorded. Or it can ignore this recommendation. But in any case, checkpoints are written only at points predefined by the app programmer, when the calculation process reaches them. This holds in theory and has been tested repeatedly in practice by me and many other users. For example, a fresh example with the FGRP5 application in another topic: https://einsteinathome.org/content/strange-wus-names-and-checkpoint-issues-latest-fgrp5-batch
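The logic described above can be illustrated with a small simulation. This is only a sketch of the idea, not the actual BOINC or E@H code: the function name, the timings, and the 60-second interval are all assumptions chosen for the example. (In the real BOINC API, as far as I know, the app asks via boinc_time_to_checkpoint() at its own milestones and reports with boinc_checkpoint_completed().)

```python
def checkpoints_written(milestone_times, min_interval):
    """Return the times (seconds) at which checkpoints get written.

    milestone_times: moments when the app reaches a point where its state
                     can be saved (here: one finished "dot").
    min_interval:    the "checkpoint at most every N seconds" preference.
    """
    written = []
    last = 0.0  # task start; nothing has been written yet
    for t in milestone_times:
        # The app only considers checkpointing at its own milestones.
        # The preference can suppress a checkpoint but never trigger one.
        if t - last >= min_interval:
            written.append(t)
            last = t
    return written


# Hypothetical fast GPU: one dot per second -> one checkpoint per 60 dots.
fast = checkpoints_written([float(i) for i in range(1, 181)], 60)
# Hypothetical slow GPU: one dot per 300 s -> checkpoint after every dot.
slow = checkpoints_written([300.0 * i for i in range(1, 4)], 60)

print(fast)   # [60.0, 120.0, 180.0]
print(slow)   # [300.0, 600.0, 900.0]
```

With a slow device every milestone is already further apart than the interval, so each dot is followed by a checkpoint, which is exactly the ".c.c.c" pattern in the log.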
Go ahead and let the task "finish". It won't. It's hung or stuck in some kind of infinite loop. It might get to 99.999 but will never complete.
That's exactly what I was going to do out of curiosity, to check whether the result produced is valid or not. But unfortunately the application is interrupted after a few days of work by a "watchdog timer" implemented by the E@H programmers to limit the maximum working time of one WU.
But up to this point, the app had managed to process 10 sky positions (out of a total of 225, i.e. about 4.5% was processed at the time of the interruption), 40 dots each. In addition to the dots, the log also contains the number of the sky position currently being processed, and it changes after every 40 successfully processed dots, just like in all normally working tasks.
I also restarted the task on purpose once, and it successfully read the recorded checkpoints and continued working from the last recorded one (at 6/225 sky position), which proves the validity of the intermediate data.
A copy of the log (because the linked tasks will soon be deleted from the E@H database): https://pastebin.com/vG89f1fv
Just stick to the BRP7 tasks.
That's what I originally intended to do. But as I wrote last time, BRP7 WUs calculate fine on the client but produce a huge number of validation errors on the server side, probably also because of the different software platforms (OpenCL 1.2 vs OpenCL 2.0 vs CUDA). So I doubt whether it does any good at all to let it run BRP7, or whether it even does more harm, because often two additional tasks are sent to other participants after each validation error before the WU finally gets a "canonical" scientific result.
Link to the task in the database: https://einsteinathome.org/task/1703855450
And another one: https://einsteinathome.org/task/1703741537
Thanks for the rock solid explanations ...
Hello,
I'm not receiving any new O3 tasks.
Is it only me?
Me neither...
... one of those typical weekends ...
They seem to be back again.
yep!