There is another effect that can hinder the regular checkpointing of tasks. Suppose BOINC processes a mix of CPU tasks (BRP4X64, FGRP5, O3MD1). BOINC requests additional work as soon as there are not enough tasks left according to the client configuration ("store x days of work ... and up to additional y days"). Assume a mix of FGRP5 and O3MD1 CPU tasks is currently being executed. These have a deadline of two weeks. When new workunits are downloaded, BRP4X64 tasks may be included. These have a shorter deadline of just one week. Depending on the mix of tasks waiting to run and tasks currently being processed, it often happens that BOINC's scheduler then changes the processing order: one or more running FGRP5 and O3MD1 tasks are suspended immediately, and the newly downloaded BRP4X64 tasks with the shorter deadline are prioritized. If the "leave non-gpu tasks in memory while suspended" option is set, the suspended tasks remain paused in RAM and keep their current processing state, although the last checkpoint may have been written a long time ago. Only after the BRP4X64 tasks have finished do the suspended FGRP5/O3MD1 tasks resume processing. I don't know whether the BOINC scheduler would also suspend tasks (and terminate their processes in the OS) if that option is not set, which would instantly waste a lot of computation progress. In any case, you will lose a lot of computation time when simply shutting down a computer that is processing a mix of FGRP5 and O3MD1 CPU tasks.
Gary, Thank you for your very clear and detailed explanation of the log file outputs. And also your rough estimation of computation time between checkpoints of 20..30 minutes for "djejoine's" machine.
djejoine wrote:
So there was no checkpoint between 0 and 3h.
I've spent (3+1)h x 12 for nothing.
The problem may not be the checkpoint interval of probably 20..30 minutes (Gary's estimate), but, for example, too many (12?) parallel tasks using far too much memory (BOINC configuration: "Memory: when computer is in use, use at most: xx % of memory" and "use at most yy % of the processors"). Maybe virtual memory far larger than the physical memory is in use here, possibly with continuous swapping to HDD/SSD? In the good old days one could hear the HDD rattling then, but not with SSDs anymore.
Your computer offers eight physical cores (16 virtual ones through hyper-threading). It has 16 GiB RAM, which allows running maybe four, at most five, O3MD1 tasks in parallel (up to 2 GiB each) under Windows 11. It will never be possible to run 12 O3MD1 tasks in parallel on this machine, only a mix of very few O3MD1 tasks plus some FGRP5 (700 MiB each) and BRP4X64 (200 MiB each) tasks.
In some GPU configuration thread (and in BOINC's client documentation) it is explained how to limit the number of tasks of a certain science app (e.g. O3MD1 CPU) that run in parallel.
For example, create a file app_config.xml in the Einstein@Home project directory under .../ProgramData/BOINC/projects/einstein.phys.uwm.edu/ containing at least the following (e.g. limiting to three O3MD1 tasks in parallel):
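A minimal sketch of such a file, generated here with Python so that it can be checked for well-formedness before it is copied into the project directory. The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are the ones described in BOINC's client configuration documentation. The app name `einstein_O3MD1` is an assumption taken from the science app's file name seen in task logs, so please verify the short app name in your own client_state.xml. `build_app_config` is just an illustrative helper name.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Return the text of a minimal app_config.xml limiting one app."""
    text = (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{app_name}</name>\n"
        f"    <max_concurrent>{max_concurrent}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )
    # Parse the result once so a malformed file is caught here, not by BOINC.
    ET.fromstring(text)
    return text


# Assumed app name; at most three O3MD1 tasks at the same time.
config_text = build_app_config("einstein_O3MD1", 3)
print(config_text)
```

After saving the printed text as app_config.xml, the client has to re-read its config files (or be restarted) before the limit takes effect.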
Scrooge McDuck wrote:
Gary, Thank you for your very clear and detailed explanation of the log file outputs. And also your rough estimation of computation time between checkpoints of 20..30 minutes for "djejoine's" machine.
Thanks. Glad you liked it.
I've stumbled across an example that's relevant to my earlier comments. It's an O3MD1 task that was interrupted after about 1.5 checkpoints so you see the first complete line of dots and a following partial line. Here is a snip of the startup and subsequent shutdown.
Then the crunching was restarted and went to completion with no further interruptions. This one had a total of 50 checkpoints - all very clear to see. An interesting part is the very end so I've included below a snip of the final 3 checkpoints followed by information about the followup stage (Recalculating statistics) where the 'toplist' (the set of the most important candidate signals) is reprocessed.
There are two takeaways. No checkpointing occurs in that final stage. It actually took ~50 minutes according to the timestamps. I didn't realise that the final stage could take that long.
...............c
...............c
...............c
2023-01-16 13:09:09.9200 (762) [normal]: Finished main analysis.
2023-01-16 13:09:09.9209 (762) [normal]: Recalculating statistics for the final toplist...
2023-01-16 13:59:48.4314 (762) [normal]: Finished recalculating toplist statistics.
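The ~50 minutes can be read directly off the two timestamps around the "Recalculating statistics" stage. A small sketch, assuming every log line starts with a date and a time with a fractional-seconds part as in the snip above (`stage_duration` is a name made up for this example):

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def stage_duration(start_line, end_line):
    """Return the time between two stderr lines as a timedelta."""

    def stamp(line):
        # The first two whitespace-separated fields are date and time.
        date_part, time_part = line.split()[:2]
        return datetime.strptime(f"{date_part} {time_part}", STAMP_FORMAT)

    return stamp(end_line) - stamp(start_line)


start = "2023-01-16 13:09:09.9209 (762) [normal]: Recalculating statistics for the final toplist..."
end = "2023-01-16 13:59:48.4314 (762) [normal]: Finished recalculating toplist statistics."
elapsed = stage_duration(start, end)
print(elapsed)  # a bit more than 50 minutes
```

The same helper can be pointed at any two timestamped lines of a task's stderr output, for example to time the stretch between a restart and the next log entry.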
Gary Roberts wrote:
There are two takeaways. No checkpointing occurs in that final stage. It actually took ~50 minutes according to the timestamps. I didn't realise that the final stage could take that long.
That's interesting. In all my finished tasks which I checked (all doing 32 checkpoints) the final stage took less than 10 minutes on an old machine (Core i7-4770) which runs throttled (to about 60% of maximum speed via tool 'TThrottle').
The example task you mentioned is from an Apple host with Apple's own "M1" CPU. The science app logs:
einstein_O3MD1_1.03_x86_64-apple-darwin__GW-SSE2
which is an Intel x64 app for Darwin. That means there had to be some Apple magic layer emulating x86 code to run on Apple's M1 CPU (which is some type of ARM64, I assume). Maybe that's the reason for ~50 minutes in final stage.
Thanks Gary, your intervention was very instructive.
Gary Roberts wrote:
You cannot force a task to checkpoint any faster than the data analysis allows. You should interpret the default BOINC setting of 60s as a limit, below which the app will not be allowed to create one, even if it wants to.
Since I never had issues with other projects, I thought that BOINC was the one creating checkpoints. I know better now.
Gary Roberts wrote:
After clicking the link, find the heading Stderr output and a few lines below that look for the line:-
putenv 'LAL_DEBUG_LEVEL=3'
... INFO: No checkpoint checkpoint.cpt found - starting from scratch
2023-02-27 20:37:17.8515 (4752) [normal]: Cpt:0, total:29, sky:1/1, f1dot:1/29
.....putenv 'LAL_DEBUG_LEVEL=3'
Note that there are two things mixed here, a line of 5 dots and a 'putenv' startup string. The 5 dots indicate that there were 5 sub-loops of calculations towards the very first checkpoint when BOINC was stopped and some time later, restarted. There is a timestamp on the next line (2023-02-27 20:46:56.4279). This is less than 10 mins after crunching first started and it represents the sum of any run time plus the time it took to stop and restart. The important takeaway from this is that a checkpoint would have been created if the task had been allowed to run for a bit longer. It was stopped at less than 10 mins. It didn't crash.
Ok let me take another example to see if i understood correctly.
After that I noticed the CPU time did not account for the morning session, so I experimented a little by stopping and starting BOINC again. And I saw this:
Does it mean that all the work done before wasn't lost in the ninth dimension and I was close to a checkpoint?
If so, the CPU time of 0 was misleading, letting me think the WU restarted from the beginning.
And second question: why does it go backward from 1 dot to 3 dots?
I'm sure you can enlighten me on this matter.
Gary Roberts wrote:
A final comment about the expected run time. It will be slowed down if your machine becomes overloaded so you need to experiment on how many simultaneous threads you allow BOINC to start. You haven't indicated if you run other compute intensive apps apart from BOINC stuff. You also have a high performance GPU. Do you use that for gaming? I noticed some MeerKAT (BRP) tasks you must have run on that. The run times should be fairly stable but I saw a low of 483s and a high of 7,478s so you're obviously running something else on the GPU to cause such a huge variation. That GPU should be able to give run times consistently towards the low end of the range.
Well, I don't really do anything else very intensive with my computer, but I have enough power to let BOINC compute on 16 threads while I'm using it. My own use can be intensive (Visual Studio, GIMP, ...) for short periods of time, but the 75% CPU non-BOINC limit does the job for me.
The only other thing I've limited is BOINC RAM, at 50%. The original 75% was a nightmare; 60% was OK but not very smooth, because the system was starting to use virtual memory.
I've done some testing today: BRP7 takes 6 mins when it's the only WU and around 10 mins when I let all threads be used. And that's when I'm not doing anything on the computer. Maybe the BOINC scheduler doesn't manage the 0.5 CPU for the GPU task well. My next try is to allow only 1 GPU task and add more CPU tasks to see when it starts to be a problem.
I know that when I'm working, it has an impact on the duration. It's clear that it's slowing down BOINC, and I'm OK with that as long as I can work properly. And when I'm gaming, all games are defined as exclusive applications, so BOINC is suspended.
Like you said, I'll try to tune it for the best (maybe 80% CPU).
Scrooge McDuck wrote:
Your computer offers eight physical cores (16 virtual ones through hyper-threading). It has 16 GiB RAM, which allows running maybe four, at most five, O3MD1 tasks in parallel (up to 2 GiB each) under Windows 11. It will never be possible to run 12 O3MD1 tasks in parallel on this machine, only a mix of very few O3MD1 tasks plus some FGRP5 (700 MiB each) and BRP4X64 (200 MiB each) tasks.
That was not my experience. I remember that O3MD1 was around 1 GiB, and right now BRP is 200 MiB and FGRP5 is 300 MiB.
I had 5 FGRP, 8 O3MD with 1 waiting for memory, 3 BRP4 and 1 BRP7. That was with RAM at 75% max (too much):
It was taking about 11 GB of RAM, which was too much for the rest of the system.
No, you had 14 sub-loops towards the first checkpoint. There is no 'c' character at the end of the line of dots so there were more sub-loops needed before a checkpoint could be created. Unfortunately, for this example, all the calculations represented by the 14 dots have been lost.
djejoine wrote:
... and in the afternoon when i restarted it i got this ...
Unfortunately, this is entirely as expected.
In an earlier response to other suggestions, you had said that you would try the 'hibernation' mechanism as a low power method for saving state and not losing all that progress. If you really need to continue shutting down your machine at regular intervals, hibernation is probably the best solution. Otherwise, you will continue to lose lots of progress, exactly as this example shows.
In case you're not aware of the 'task properties' feature of BOINC Manager, here is what you can do. Pick a running task on the tasks list to select it. Click the properties button in the side bar to see a full list of properties for that task. I'm not running any CPU tasks, so I can't show you a relevant example. I've chosen a GPU task to show an abbreviated list (some lines omitted). I'm using BOINC version 7.16.11. The properties display could be a bit different with different BOINC versions.
Application Gamma-ray pulsar binary search #1 on GPUs 1.18 (FGRPopencl1K-ati)
Resources 0.3 CPUs + 0.5 AMD/ATI GPUs
CPU time 00:00:16
CPU time since checkpoint 00:00:02
Elapsed time 00:04:54
Estimated time remaining 00:11:58
Fraction done 26.834%
My checkpoints occur at ~60s intervals (the BOINC default), and the CPU time component is very small (just 16s in nearly 5 mins) for these GPU tasks. The next checkpoint for the above would happen just after 5 mins elapsed, when the CPU time since checkpoint would revert to zero.
I'm not at all suggesting that checking properties is a viable method for timing a shut down. It's absolutely not. I'm pointing it out just as a technique you could use while experimenting, simply to verify if there is a recent checkpoint or not.
Please first reduce the number of parallel CPU tasks to a maximum of four (BOINC settings: "use at most 25% of the processors": 25% of 16 = 4). You can gradually increase this later, once everything runs smoothly and you know how long tasks take in total, to find the limits of your computer. The 16 GiB RAM in your computer isn't enough to load eight cores (16 virtual cores) with BOINC CPU tasks when O3MD1 tasks are involved. 16 GiB isn't enough for more than four parallel O3MD1 CPU tasks (based on my own observation: quad-core CPU; 8 virtual cores; 16 GiB RAM; Windows). Too many concurrent O3MD1 tasks will always cause problems. These O3MD1 tasks are large, resource-hungry chunks. They initially require up to 2.1 GiB (own observation). During processing, memory usage drops to between 1.1 and 1.3 GiB, as you observed, and then rises again to as much as 1.8 GiB. This varies depending on the specific O3MD1 task and its parameters. Please use the Windows Task Manager: look at the process list and observe the memory usage of each Einstein task over a longer period of time.
Your BOINC manager's screenshot:
7 x O3MD1 tasks (up to 2.1 GiB each) running
1 x O3MD1 task is waiting for memory (BOINC scheduler's decision)
5 x FGRP5 tasks (up to 760 MiB each) running
3 x BRP4X64 (up to 210 MiB each) running
With only 16 GiB RAM you are clearly exceeding your memory resources. I don't know how this works at all: either an extreme amount of virtual memory, much larger than the physical memory, or continuous swapping to and from disk.
In the BOINC manager, please also look at the task details (the memory requirements of each task are listed there).
With a BOINC configuration of "use max. 50% of memory", as you said, it should not be possible to start more than four O3MD1 CPU tasks. Maybe Windows virtual memory is much larger than the physical memory size, and BOINC calculates 50% of maybe 32 GB virtual memory, giving 16 GiB usable memory for BOINC. That's crazy.
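The arithmetic behind these statements can be sketched as follows. The per-task figures are the peak values quoted for the screenshot (2.1 GiB, 760 MiB, 210 MiB); real tasks do not all peak at the same moment, so this is a worst-case estimate, and the function names are made up for this example.

```python
MIB_PER_GIB = 1024


def peak_demand_gib(tasks):
    """Sum count * peak memory (in MiB) over all task types, in GiB."""
    return sum(count * peak_mib for count, peak_mib in tasks) / MIB_PER_GIB


def max_tasks(ram_gib, memory_fraction, per_task_gib):
    """How many tasks of one type fit into BOINC's share of the RAM."""
    return int((ram_gib * memory_fraction) // per_task_gib)


# Running tasks from the screenshot: (count, peak memory in MiB).
running = [
    (7, 2.1 * MIB_PER_GIB),  # O3MD1
    (5, 760),                # FGRP5
    (3, 210),                # BRP4X64
]

total_gib = peak_demand_gib(running)
print(f"worst-case demand: {total_gib:.1f} GiB of 16 GiB")

# O3MD1 tasks that fit into 50% of 16 GiB physical RAM.
print(max_tasks(16, 0.5, 2.0), "tasks at 2.0 GiB each")
print(max_tasks(16, 0.5, 2.1), "tasks at 2.1 GiB each")
```

With these numbers the worst case comes to roughly 19 GiB, which is more than the installed 16 GiB, and a 50% memory limit leaves room for four O3MD1 tasks at 2.0 GiB each but only three at 2.1 GiB each.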
djejoine wrote:
Does it mean that all the work done before wasn't lost in the ninth dimension and I was close to a checkpoint?
If so, the CPU time of 0 was misleading, letting me think the WU restarted from the beginning.
And second question: why does it go backward from 1 dot to 3 dots?
I'm sure you can enlighten me on this matter.
The logfile output is misleading if you are new to O3MD1 CPU tasks. Whenever the O3MD1 app is started it begins to write into the logfile. If it's started the first time for a task the logfile is empty. The first string written into the logfile after app startup always is:
putenv 'LAL_DEBUG_LEVEL=3'
Once the science app is running, it logs a lot of preparation steps (reading input data, possibly previously saved checkpoints...) until it enters the analysis routine, which processes a number of skypoints. That number varies from task to task, between about 32 and maybe 64 (the highest I have seen so far was 58). It is also the number of checkpoints for the whole task: after each completely processed skypoint, a checkpoint is written. Processing a single skypoint runs a number of sub-loops, and each finished sub-loop is logged with a single dot ("."). The number of sub-loops per skypoint depends on the number of skypoints in the task and on the task parameters. In my last finished tasks I observed 23 sub-loops until checkpoint "c", both for a task containing 29 skypoints and for another containing 32 skypoints.
I also observed a task containing 35 skypoints doing 21 sub-loops per skypoint. So it differs from task to task.
Whenever an O3MD1 task is terminated, whether by exiting BOINC manually, by shutting down the computer, or maybe because the process is killed by the OS* for lack of memory, the science app stops somewhere in the middle of a sub-loop. That means all computation since the last checkpoint is lost. If no checkpoint has been written so far, the task starts again at 0% progress. This can happen repeatedly if the app is always terminated before it reaches a checkpoint.
(* MS Windows' process scheduler kills large processes early when memory becomes very scarce)
When a previously terminated O3MD1 task is started again by BOINC, it will at first write the string "putenv 'LAL_DEBUG_LEVEL=3'" at the end of the existing logfile:
...putenv 'LAL_DEBUG_LEVEL=3'
That's in the last line after the last dot(s) of a previous task run. The three dots here, in front of "putenv..." now represent the lost computation effort from terminating the previous run of the app. Those (hopefully few) dots/sub-loops had to be computed again.
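The reading rules described so far (a dot per finished sub-loop, a trailing "c" per written checkpoint, dots in front of a "putenv" string for lost work) can be sketched as a small parser. The sample log below is made up for illustration and is not taken from a real task. The function name and the returned keys are hypothetical as well.

```python
import re

# A run of dots at the start of a line, followed by whatever comes after.
DOT_LINE = re.compile(r"^(\.+)(.*)$")


def summarize_progress(stderr_text):
    """Count checkpoints, lost sub-loops and not yet checkpointed sub-loops."""
    checkpoints = lost = pending = 0
    for raw in stderr_text.splitlines():
        match = DOT_LINE.match(raw.strip())
        if match is None:
            continue  # not a progress line
        dots, rest = match.groups()
        if rest == "c":
            checkpoints += 1          # skypoint finished, checkpoint written
            pending = 0
        elif rest.startswith("putenv"):
            lost += len(dots)         # app restarted, these dots were redone
            pending = 0
        elif rest == "":
            pending = len(dots)       # work since the last checkpoint
    return {
        "checkpoints": checkpoints,
        "lost_subloops": lost,
        "pending_subloops": pending,
    }


sample_log = """putenv 'LAL_DEBUG_LEVEL=3'
...............c
.......putenv 'LAL_DEBUG_LEVEL=3'
...............c
...
"""
print(summarize_progress(sample_log))
```

For the sample this reports two checkpoints, seven lost sub-loops from the interrupted run, and three sub-loops that would be lost if the task were stopped right now.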
A successfully read checkpoint (13 of 32) after restarting a task looks like this:
Gary Roberts wrote:
No, you had 14 sub-loops towards the first checkpoint. There is no 'c' character at the end of the line of dots so there were more sub-loops needed before a checkpoint could be created. Unfortunately, for this example, all the calculations represented by the 14 dots have been lost.
Sorry, that's what I was trying to say (toward). I have some trouble explaining myself clearly in English.
Gary Roberts wrote:
In an earlier response to other suggestions, you had said that you would try the 'hibernation' mechanism as a low power method for saving state and not losing all that progress. If you really need to continue shutting down your machine at regular intervals, hibernation is probably the best solution. Otherwise, you will continue to lose lots of progress, exactly as this example shows.
Yes, sorry for not updating on the matter, but it works very well. From now on I will put the computer into hibernation.
Scrooge McDuck wrote:
Please first reduce the number of parallel CPU tasks to a maximum of four (BOINC settings: "use at most 25% of the processors": 25% of 16 = 4). You can gradually increase this later, once everything runs smoothly and you know how long tasks take in total, to find the limits of your computer. The 16 GiB RAM in your computer isn't enough to load eight cores (16 virtual cores) with BOINC CPU tasks when O3MD1 tasks are involved. 16 GiB isn't enough for more than four parallel O3MD1 CPU tasks (based on my own observation: quad-core CPU; 8 virtual cores; 16 GiB RAM; Windows). Too many concurrent O3MD1 tasks will always cause problems. These O3MD1 tasks are large, resource-hungry chunks. They initially require up to 2.1 GiB (own observation). During processing, memory usage drops to between 1.1 and 1.3 GiB, as you observed, and then rises again to as much as 1.8 GiB. This varies depending on the specific O3MD1 task and its parameters. Please use the Windows Task Manager: look at the process list and observe the memory usage of each Einstein task over a longer period of time.
I'm sure memory use will grow with time. I will configure BOINC to run only 2 concurrent O3MD1 tasks like you suggested in your first message, and all will be good (I hope). I've limited BOINC to 90% CPU (it's doing 14 CPU WUs + 1 GPU) and already the BRP7 run time is stable between 6 and 7 mins, which is the duration I got when I ran it alone.
I will check again when new O3MD1 tasks start.
To conclude, I want to thank you all for your advice and all your explanations. It's now very clear to me. I've been using BOINC for quite some time, but only with WCG, and I never had to tweak anything for it (install, launch, full power, let it run and forget about it). Einstein is different, and I'm learning lots of things in the process.
I'll keep you posted about O3MD1 checkpoints and whether more tweaking is needed.
djejoine wrote:
I'm sure memory use will grow with time. I will configure BOINC to run only 2 concurrent O3MD1 tasks like you suggested in your first message, and all will be good (I hope). I've limited BOINC to 90% CPU (it's doing 14 CPU WUs + 1 GPU) and already the BRP7 run time is stable between 6 and 7 mins, which is the duration I got when I ran it alone.
Your BRP7 (MeerKAT) tasks run on the external ATI GPU card, which has its own 16 GiB of GPU memory. So they do not stress system RAM, and they consume CPU cycles only part of the time (the 0.5 CPU figure can be seen as a rule of thumb).
You should also try out different numbers of CPU tasks running in parallel (independent of the memory constraints discussed). Your CPU has 8 physical cores, presenting 16 virtual ones to the OS. Whether this hyper-threading feature (2 virtual cores instead of 1 physical core) improves analysis throughput (earned credits) or adds overhead that reduces it depends on the current mix of tasks (O3MD1, FGRP5, BRP4, other BOINC projects) running in parallel. You can measure task runtimes (yes, they also differ between tasks... anyway) for different numbers of tasks running in parallel. Sometimes it's better to limit the number of parallel tasks, even down to 50% (no hyper-threading), to increase task throughput per day.
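To compare two settings, it helps to convert measured runtimes into tasks finished per day. A rough sketch follows. The two runtimes are invented placeholders, not measurements from any real host, so replace them with the averages you observe yourself.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(parallel_tasks, avg_runtime_s):
    """Throughput when `parallel_tasks` run side by side all day."""
    return parallel_tasks * SECONDS_PER_DAY / avg_runtime_s


# Hypothetical averages: number of parallel tasks -> runtime per task (s).
measurements = {
    8: 6.0 * 3600,    # no hyper-threading, 6 h per task (placeholder)
    16: 13.0 * 3600,  # all virtual cores, 13 h per task (placeholder)
}

for n, runtime in measurements.items():
    print(f"{n:2d} parallel tasks: {tasks_per_day(n, runtime):.1f} tasks/day")

best = max(measurements, key=lambda n: tasks_per_day(n, measurements[n]))
print("best setting in this example:", best, "parallel tasks")
```

In this made-up case 8 parallel tasks finish 32 tasks per day while 16 parallel tasks finish fewer than 30, so fewer tasks at once would be the better setting. With other runtimes the result can easily go the other way.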
memory constraints:
I configured a maximum of 3 concurrent O3MD1 tasks (via app_config.xml) because I encountered memory allocation problems without a limit for O3MD1. That solved my problems. Four or five O3MD1 tasks can be started manually in the BOINC manager, carefully, waiting some time in between, but not automatically by BOINC's scheduler. Even if there is just enough memory, MS Windows sometimes rejects four or five of these O3MD1 processes trying to allocate 8..10 GiB at the same time. This often ends with at least one task erroring out with a memory allocation failure (wasting ALL computation done so far, and immediately deleting the checkpoint and all files in the task's slot directory).
It's good to have such problem discussions here. It's a science project. Problems can only be discovered (or excluded) this way.
I think Einstein isn't very different from other projects. Only the memory requirements of O3MD1 CPU tasks are challenging. So it's always a good idea to look at the computer's process list and memory usage when running CPU tasks on a large proportion of the available (virtual) CPU cores. I don't know about other projects, but some FGRP5 and all O3MD1 tasks checkpoint rarely. The developers seem to have good reasons for this. We client users simply have to adapt (or disable such apps in the preferences).
Ok let me take another example to see if I understood correctly.
Trying this one O3MD1V2a
So I have 14 sub-loops for the first checkpoint.
Then I stopped my computer, and in the afternoon when I restarted it I got this
There are only 4 dots left.
After that I noticed the CPU time did not account for the morning session, so I experimented a little by stopping and starting BOINC again. And I saw this:
1 dot and then
3 dots???
Thanks to you all.
@djejoine:Please first
)
@djejoine:
Please first reduce the number of parallel CPU tasks to a maximum of four (BOINC settings: "use at most 25% of the processors": 25% of 16 = 4). You can gradually increase this later as soon as it runs smoothly and you know how long tasks run in total. So please test the limits of your computer later. The 16 GiB RAM in your computer isn't enough to load eight cores (16 virtual cores) with BOINC CPU tasks when O3MD1 tasks are involved. 16 GiB isn't enough for more than four parallel O3MD1 CPU tasks (based on my own observation: quad-core CPU; 8 virtual cores; 16 GiB RAM; Windows). Too many concurrent O3MD1 tasks will always cause problems then. These O3MD1 tasks are large resource-hungry chunks. They initially require up to 2.1GiB (own observation). During processing, memory usage drops to 1.3 ... 1.1 GiB, as you observed. Then it rises again up to 1.8 GiB. This varies depending on the specific O3MD1 task and the parameters set in it. Please use the Windows task manager. Look at the process list and observe the memory usage of each Einstein task over a longer period of time.
Your BOINC manager's screenshot:
With only 16 GiB RAM you are clearly exceeding your memory resources. I don't know how this works out in practice: either an extreme amount of virtual memory, much larger than physical memory, or continuous swapping to/from disk. I don't know.
In the BOINC manager, please also look into the task details (the memory requirement of each task is listed there).
With a BOINC configuration of "use max. 50% of memory", as you said, it should not be possible to start more than four O3MD1 CPU tasks. Maybe Windows virtual memory is much larger than the physical memory size, and BOINC calculates: 50% of maybe 32 GiB virtual memory is 16 GiB usable memory for BOINC. That's crazy.
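To make the arithmetic behind that estimate explicit, here is a rough Python sketch. It is only my back-of-the-envelope calculation, not how the BOINC client decides internally; the function name and the idea of dividing a memory budget by a per-task size are my own assumptions, and the per-task sizes are the observed values mentioned above.

```python
import math

def max_tasks_within_budget(total_gib, mem_limit_percent, per_task_gib):
    """Rough estimate: how many tasks of a given size fit into the
    share of memory that BOINC is allowed to use (assumed model)."""
    budget_gib = total_gib * mem_limit_percent / 100.0
    return math.floor(budget_gib / per_task_gib)

# 16 GiB physical RAM, "use at most 50% of memory"
print(max_tasks_within_budget(16, 50, 2.1))  # sized by the initial 2.1 GiB peak
print(max_tasks_within_budget(16, 50, 1.8))  # sized by the later 1.8 GiB peak

# Hypothetical case: the 50% is taken from 32 GiB of virtual memory instead
print(max_tasks_within_budget(32, 50, 2.1))
```

With the physical 16 GiB this gives three to four tasks, which matches what I see on my machine. Only if the limit were applied to a much larger virtual memory size would more tasks appear to fit.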
The logfile output is misleading if you are new to O3MD1 CPU tasks. Whenever the O3MD1 app is started it begins to write into the logfile. If it's started the first time for a task the logfile is empty. The first string written into the logfile after app startup always is:
Once the science app has been running for a while, it logs a lot of preparation steps (reading input data, maybe already saved checkpoints...) before it enters the analysis subroutine. This subroutine analyses a number of skypoints, and that number differs from task to task (between 32 and maybe 64; the highest I have seen so far was 58). That's also the number of checkpoints for the whole task: after each completely processed skypoint, a checkpoint is written. Processing a single skypoint (of 32...64 in total) runs a number of sub-loops. Each finished sub-loop is logged with a single dot ("."). The number of sub-loops within a skypoint depends on the number of skypoints in the task and on the task parameters. For my last finished tasks I observed:
an O3MD1 task containing 29 skypoints:
--> 23 sub-loops until checkpoint "c"
another task containing 32 skypoints:
--> also 23 sub-loops until checkpoint "c"
I also observed a task containing 35 skypoints doing 21 sub-loops per skypoint. So it differs from task to task.
Whenever an O3MD1 task is terminated, whether by exiting BOINC manually, by shutting down the computer, or maybe because the process is killed by the OS* for lack of memory, the science app stops somewhere in the middle of a sub-loop. That means all computation since the last checkpoint is lost. If no checkpoint had been written so far, the task will start again at 0% progress. This can also happen repeatedly if the app is always terminated before it reaches a checkpoint.
(* MS Windows' process scheduler kills large processes early when memory becomes very scarce)
When a previously terminated O3MD1 task is started again by BOINC, it will at first write the string "putenv 'LAL_DEBUG_LEVEL=3'" at the end of the existing logfile:
That's in the last line after the last dot(s) of a previous task run. The three dots here, in front of "putenv..." now represent the lost computation effort from terminating the previous run of the app. Those (hopefully few) dots/sub-loops had to be computed again.
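If you want to count those lost sub-loops instead of eyeballing them, a small Python sketch could look like this. It assumes (my assumption, based only on what I described above) that the progress dots and the "c" checkpoint marks sit directly in front of the restart string in the logfile text; the function name is made up and a real logfile may need a more careful parser.

```python
RESTART_MARKER = "putenv 'LAL_DEBUG_LEVEL=3'"

def lost_subloops_per_restart(log_text):
    """For every restart of the app, count the dots that were logged
    after the last checkpoint and directly before the restart marker."""
    lost = []
    pos = 0
    while True:
        idx = log_text.find(RESTART_MARKER, pos)
        if idx == -1:
            break
        before = log_text[:idx].rstrip()
        if before:  # skip the very first start, where the log was still empty
            trailing_dots = len(before) - len(before.rstrip("."))
            lost.append(trailing_dots)
        pos = idx + len(RESTART_MARKER)
    return lost

# Made-up log text: one full skypoint (23 dots + "c"), then 3 dots, then a restart
example = (RESTART_MARKER + "\nstartup\n" + "." * 23 + "c" + "..."
           + RESTART_MARKER + "\n..")
print(lost_subloops_per_restart(example))
```

Each number in the returned list is the count of sub-loops that had to be computed again after that restart.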
A successfully read checkpoint (13 of 32) after restarting a task looks like this:
I'll give an example run and some time measurements later.
Sorry, that's what I was trying to say (toward). I have some trouble explaining myself clearly in English.
Yes, sorry for not updating on the matter, but it works very well. As of now I will put the computer in hibernation.
I'm sure memory usage will grow with time. I will configure BOINC to run only 2 concurrent O3MD1 tasks like you suggested in your first message, and all will be good (I hope). I've limited BOINC to 90% CPU (it's doing 14 CPU WUs + 1 GPU) and BRP7 run time is already stable between 6 and 7 mins, which is the duration I got when I tried it alone.
I will check again when new O3MD1 tasks start.
To conclude, I want to thank you all for your advice and all your explanations. It's now very clear to me. I've been using BOINC for quite some time, but only with WCG, and I never had to tweak anything for it (install, launch, full power, let it run and forget about it). Einstein is different and I'm learning lots of things in the process.
I'll keep you posted about O3MD1 checkpoints and whether more tweaking is needed.
Again thanks to the community.
Your BRP7 (MeerKAT) tasks run on the external ATI GPU card, which has its own 16 GiB of GPU memory. So they do not stress system RAM, and they only consume CPU cycles part of the time (the 0.5 CPU can be seen as set by rule of thumb).
You should also try out different numbers of CPU tasks running in parallel (independent of the memory constraints discussed). Your CPU has 8 physical cores, presenting 16 virtual ones to the OS. Whether this hyper-threading feature (2 virtual cores instead of 1 physical core) improves analysis throughput (earned credits) or adds overhead that reduces throughput depends on the current mix of tasks (O3MD1, FGRP5, BRP4, other BOINC projects) running in parallel. You can measure task runtimes (yes, they also differ between tasks... anyway) for different numbers of tasks running in parallel. Sometimes it's better to limit the number of parallel tasks, even down to 50% (no hyper-threading), to increase task throughput per day.
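As a sketch of how to compare such measurements, this small Python snippet converts "tasks in parallel" and "average runtime" into tasks per day. The runtimes in the example are invented placeholders, not measurements; replace them with what you observe on your own machine.

```python
def tasks_per_day(parallel_tasks, avg_runtime_hours):
    """Throughput if `parallel_tasks` tasks run side by side and each
    one needs `avg_runtime_hours` on average in that configuration."""
    return parallel_tasks * 24.0 / avg_runtime_hours

def best_setting(measurements):
    """measurements: dict {parallel_tasks: avg_runtime_hours}.
    Returns the number of parallel tasks with the highest throughput."""
    return max(measurements, key=lambda n: tasks_per_day(n, measurements[n]))

# Placeholder numbers only: 16 tasks at 12 h each vs. 8 tasks at 5.5 h each
measured = {16: 12.0, 8: 5.5}
for n, hours in measured.items():
    print(n, "parallel:", round(tasks_per_day(n, hours), 1), "tasks/day")
print("best:", best_setting(measured))
```

In this made-up example, running 8 tasks gives slightly more finished tasks per day than running 16, because each task finishes in less than half the time.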
memory constraints:
I configured a maximum of 3 concurrent O3MD1 tasks (via app_config.xml) because I encountered memory allocation problems when O3MD1 ran without limits. It solved my problems. Four or five O3MD1 tasks can be started manually in the BOINC manager, carefully, waiting some time in between, but not automatically by BOINC's scheduler. Even if there is just enough memory, MS Windows sometimes rejects four or five of these O3MD1 processes when they try to allocate 8..10 GiB at the same time. This often ends with at least one task erroring out with a memory allocation failure (wasting ALL computation done so far, and immediately deleting the checkpoint and all files in the task's slot directory).
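For anyone who wants to set up the same limit, here is a small Python sketch that builds the contents of such an app_config.xml. The app name "einstein_O3MD1" is my assumption; please check the app names listed in your own client_state.xml before using it. The resulting text goes into a file named app_config.xml in the Einstein@Home project directory, and BOINC should pick it up after "Read config files" or a client restart.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name, max_concurrent):
    """Return the text of an app_config.xml that limits how many
    tasks of one application run at the same time."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")

# "einstein_O3MD1" is an assumed app name, verify it on your machine
xml_text = build_app_config("einstein_O3MD1", 3)
print(xml_text)
```

The sketch only prints the text; saving it to the right directory is left to you, since the path differs between installations.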
It's good to have such a discussion of problems here. It's a science project. Problems can only be discovered (or excluded) this way.
I think Einstein isn't very different from other projects. Only the memory requirements of O3MD1 CPU tasks are challenging. So it's always a good idea to have a look at the computer's process list and memory usage when running CPU tasks on a large proportion of the available (virtual) CPU cores. I don't know about other projects, but some FGRP5 and all O3MD1 tasks checkpoint rarely. The developers seem to have good reasons for this. We client users simply have to adapt (or disable such apps in the preferences).