Erroneously low values for memory bounds in properties of O3MD1 tasks

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17697180
RAC: 12898
Topic 229389

At the moment, O3MD1 CPU tasks with a very high memory requirement are being distributed: after such a task starts, it allocates 3.2 GiB of virtual memory. The task's properties, however, state ONLY 1.9 GiB as its memory bound:

======== Workunits ========
1) -----------
  name: h1_1331.40_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1332.00Hz_49
  FP estimate: 1.440000e+014
  FP bound: 2.880000e+015
  memory bound: 1931.00 MB
  disk bound: 100.00 MB

My host computer is configured to use max. 50 % of memory (of 4 GiB RAM) for BOINC. That's 2,048 MiB. This task is within the limit and was therefore distributed to my computer by the einstein@home scheduler. But this computer cannot run a task that in reality needs 3.2 GiB RAM.
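The arithmetic behind that dispatch decision can be sketched roughly as follows. This is only an illustration, assuming the scheduler simply compares the workunit's declared memory bound with the RAM share configured on the host, and assuming the "1931.00 MB" shown in the task properties is really MiB. The function names are hypothetical and not part of BOINC.

```python
MIB_PER_GIB = 1024


def usable_ram_mib(total_gib: float, boinc_share: float) -> float:
    """RAM that BOINC may use, in MiB, for a given share (0.5 = 50 %)."""
    return total_gib * MIB_PER_GIB * boinc_share


def fits(bound_mib: float, total_gib: float, boinc_share: float) -> bool:
    """Hypothetical check: does a memory requirement fit into BOINC's share?"""
    return bound_mib <= usable_ram_mib(total_gib, boinc_share)


declared_mib = 1931.0            # "memory bound" from the task properties
actual_mib = 3_262_648 / 1024    # observed allocation, converted from KiB

print(usable_ram_mib(4, 0.5))        # BOINC's share on this host
print(fits(declared_mib, 4, 0.5))    # what the declared bound suggests
print(fits(actual_mib, 4, 0.5))      # what the task really needs
```

With the declared bound the task appears to fit into the 50 % share; with the observed allocation it does not.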

After starting the task it immediately allocates 3.2 GiB (3,262,648 KiB) of virtual memory (sufficient swap space available). It takes more than a minute for the actually used physical memory to increase to 2 GiB (2048 MiB). Then the BOINC scheduler suspends the task ("waiting for memory") because the configured memory limit has been reached. BOINC would now wait forever until, miraculously, more memory is available. If one manually increases the usable memory share in the BOINC configuration to, e.g. 60%, then BOINC resumes the task. The task's used physical memory increases further to around 2.5 GiB. Then the task crashes with the end status:

198 (0x000000C6) EXIT_MEM_LIMIT_EXCEEDED

The problem here is that the memory requirement specified in the task's properties is too low.

My computer processes tasks that require a maximum of 2 GiB RAM without errors. Example.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3927
Credit: 45682082642
RAC: 63982667

could you increase the 50% limit to 95% in BOINC compute properties? that should prevent it from going into the wait state.

_________________________________________________________________________

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17697180
RAC: 12898

Ian&Steve C. wrote:
could you increase the 50% limit to 95% in BOINC compute properties? that should prevent it from going into the wait state.

Yes, of course. But that's not the problem here. The task was way 'heavier' than specified. In reality you can't utilize 95% of physical memory for BOINC alone. The OS consumes ~25 % of (only) 4 GiB after startup. Increasing BOINC's memory limit to 60..70% allows BOINC to resume the task, but the lack of physical memory will still crash the task. I don't know whether the science task itself crashed because it ran out of memory, or whether BOINC terminated the task with an error result because the process exceeded its memory bound by so much.

If there is a memory bound within the task's properties that specifies 1.9 GiB, then this task shouldn't allocate 3.2 GiB. There's also an FP bound set for each task so that BOINC will terminate a task that runs far longer than the estimated run time. One would assume BOINC will also kill the processes of tasks which consume a lot more memory than specified in the task's properties ('memory bound').

Harri Liljeroos
Joined: 10 Dec 05
Posts: 4283
Credit: 3172545311
RAC: 1945114

Yes, the task should fail if it uses more memory than specified (rsc_memory_bound). Same applies also to disk usage (rsc_disk_bound).

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17697180
RAC: 12898

I think I've seen erroneously low memory bounds also in (all?) previous O3MD1 tasks that required less memory. The real memory usage of these tasks was 2.1 GB and their specified memory bound in the task's properties was, I think, 1.6...1.8 GiB. So the delta was much smaller at about 300...500 MiB and has not caused any problems so far.

Now the delta is: 3.2 GiB - 1.9 GiB = 1.3 GiB. These current monster tasks now use about 70% more RAM than specified in the task's properties. I assume this causes problems.
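For reference, the overshoot can be worked out like this. It is just a small sketch using the rounded GiB figures from this thread; the variable names are my own.

```python
declared_gib = 1.9   # memory bound stated in the task's properties
actual_gib = 3.2     # memory the task actually allocates

delta_gib = actual_gib - declared_gib
excess_percent = delta_gib / declared_gib * 100

# About 68 %, i.e. roughly the 70 % quoted above.
print(f"{delta_gib:.1f} GiB above the bound ({excess_percent:.0f} % more than declared)")
```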

GWGeorge007
Joined: 8 Jan 18
Posts: 3034
Credit: 4938584357
RAC: 756113

Scrooge, since you only have 4 GB of memory, wouldn't it be wise, and simpler, to add 4 GB of RAM so you have a total of 8 GB?  I assume that this particular computer is an older laptop running Win7.  Does it even have the ability to add 4 GB?  As you said, you essentially are running BOINC with just 2 GB of memory, and most of BOINC's Einstein projects are getting to be memory hungry, needing over 2 GB.

George

Proud member of the Old Farts Association

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17697180
RAC: 12898

GWGeorge007 wrote:

Scrooge, since you only have 4 GB of memory, wouldn't it be wise, and simpler, to add 4 GB of RAM so you have a total of 8 GB?  I assume that this particular computer is an older laptop running Win7.  Does it even have the ability to add 4 GB?  As you said, you essentially are running BOINC with just 2 GB of memory, and most of BOINC's Einstein projects are getting to be memory hungry, needing over 2 GB.

Yes, that would work around the problem... until tasks eventually need even more memory.

Let me explain my point in a different way: BOINC was developed by D. Anderson et al. in line with the Seti@home approach, designed to use otherwise unused CPU and memory resources on normal work computers, desktop computers and notebooks (today also tablets, smartphones and embedded computers) for distributed computing. In addition, BOINC lets you flexibly configure the resources it may use (memory, CPU cores, CPU time, network usage, data volume, ...). BOINC was not primarily designed for power crunchers[*] who make all their resources available to BOINC. The whole approach only makes sense as long as BOINC projects respect the client configuration.

Now, if I upgrade the memory to 8 GiB and configure BOINC to use up to 25% of that, then BOINC should automatically not be sent any tasks that exceed those 25%. That is what the workunit property "memory bound" is for.

Coming back to Harri Liljeroos' point:

Harri Liljeroos wrote:
Yes, the task should fail if it uses more memory than specified (rsc_memory_bound). Same applies also to disk usage (rsc_disk_bound).

The question is how BOINC monitors and enforces this rsc_memory_bound, which is set as an attribute of each workunit. Apparently, a task is not aborted immediately as soon as its memory usage exceeds the defined rsc_memory_bound; instead, the abort depends on the extent of the excess.

Three examples:

  • Example 1: 4 GiB RAM, BOINC config: up to 70 % of RAM (= 2,867 MiB). This O3MD1 CPU task allocated 3,342.5 MiB of virtual memory and was aborted by BOINC once its physical memory usage reached 3,117.77 MiB. There is NO error message from the science app in the log file. See the task's result:
    - end status: 198 (0x000000C6) EXIT_MEM_LIMIT_EXCEEDED
    - Peak swap size (MB): 3342.5
    - Peak working set size (MB): 3117.77

  • Example 2: 4 GiB RAM, BOINC config: "up to 50% of RAM" (= 2,048 MiB). I reduced the memory limit to 50%, but the einstein@home scheduler still distributed this 3.2 GiB task (3340.99 MiB). Once started, the task was aborted by BOINC earlier, as soon as its physical memory usage reached 2.5 GiB (2582.52 MiB).
    - Peak working set size (MB): 2582.52
    - Peak swap size (MB): 3340.99

    The abort due to EXIT_MEM_LIMIT_EXCEEDED is independent of the free physical RAM, but takes into account the extent to which memory usage exceeds rsc_memory_bound AND BOINC's client config for memory usage.

In both examples, there was no error message from the science app in the log file about insufficient memory.  The memory would have been just about enough: 3.2 GiB free physical memory and sufficient swap memory (total of 8 GiB virtual memory). The science app did not crash. It was BOINC which aborted these two tasks.

  • Example 3: 16 GiB RAM, different host, BOINC config: "use up to 60% of RAM" (= 9,830 MiB). Here, other non-BOINC processes occupied a large part of the RAM, so that a (smaller) O3MD1 CPU task (requiring 1.8 GiB of memory) could not allocate sufficient virtual memory when it started. The science app crashed and logged lots of error messages:
    - Peak working set size (MB): 1881.87
    - Peak swap size (MB): 1885.38


    [...] Memory allocation error

    In this case, BOINC did not initiate the task abort. The science app crashed, exactly as expected.

In short: The workunit generator for O3MD1 CPU tasks (I don't know whether this also applies to O3MDF GPU tasks) sets a memory bound that is much too small. Either it is a static upper limit that workunits never reached in the previous months or years, or the estimate or calculation of the upper memory limit from the workunit parameters is far too optimistic.

Harri Liljeroos
Joined: 10 Dec 05
Posts: 4283
Credit: 3172545311
RAC: 1945114

Might be a good idea to post your findings in the Technical News section under the proper thread (O3MD1/F) or send a personal message to Bernd to lure him into taking a peek at this thread.

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17697180
RAC: 12898

I would also like to refer to the server status page: https://einsteinathome.org/server_status.php

[updated 17 Apr 2023, 10:25:01 UTC]

"O3MD1" (CPU):

Tasks...

  • valid: 98,555
  • invalid: 64    (0.04% invalids... due to no verification against wingman)
  • inconclusive: 0
  • pending: 0      (initial replication: only 1 - no wingman)
  • failed: 77,421      (44% failed)
  • too late: 1,201

"O3MDF" (GPU):

  • valid: 855,950
  • invalid: 1,408    (0.1% invalids)
  • inconclusive: 667
  • pending: 166,601 (pending tasks finished successfully: not failed)
  • failed: 44,856       (only 4% failed)
  • too late: 704
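The percentages in the two lists can be reproduced with a few lines of Python. This is only a sketch: I assume each percentage is taken relative to the sum of all categories listed for that application, which the server status page itself does not state. The dictionary layout and the helper `share_percent` are hypothetical.

```python
# Task counts copied from the server status page (17 Apr 2023, 10:25:01 UTC).
counts = {
    "O3MD1 (CPU)": {"valid": 98_555, "invalid": 64, "inconclusive": 0,
                    "pending": 0, "failed": 77_421, "too late": 1_201},
    "O3MDF (GPU)": {"valid": 855_950, "invalid": 1_408, "inconclusive": 667,
                    "pending": 166_601, "failed": 44_856, "too late": 704},
}


def share_percent(app_counts: dict, key: str) -> float:
    """Share of one category among all listed tasks of an app, in percent."""
    return 100 * app_counts[key] / sum(app_counts.values())


for app, app_counts in counts.items():
    print(f"{app}: {share_percent(app_counts, 'failed'):.1f} % failed, "
          f"{share_percent(app_counts, 'invalid'):.2f} % invalid")
```

Under that assumption the failure rate of the CPU application comes out roughly ten times higher than that of the GPU application.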

I don't know the reasons for the many failed O3MD1 CPU tasks. I don't have the time to explore other users' failed tasks. Maybe there are many user-initiated aborts due to extremely long runtimes (> 1 day), infrequent checkpoints (30..60 or more minutes apart), frustration about it... I have no idea.

But maybe many of these failed tasks were aborted by BOINC itself because the "memory bound" value in the workunits is far below the actual memory consumption. On any current multi-core host with 6...8 or more cores and the default BOINC client configuration (use 100% of CPU cores), running several of these tasks concurrently will surely exceed BOINC's configured memory limit (e.g. 50% of 16 or 32 GiB). It will also overload the available RAM of any up-to-date computer if each of 6...8 or more O3MD1 CPU tasks allocates 3.2 GiB instead of only 1.9 GiB as stated in the workunit's properties.
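To put numbers on this, here is a rough sketch of how many such tasks fit into BOINC's memory share when counted with the declared bound versus the observed allocation. It is pure arithmetic and does not model how the BOINC client really schedules tasks. The helper `tasks_that_fit` is hypothetical, the "1931.00 MB" bound is treated as MiB, and 3186 MiB is the observed 3,262,648 KiB rounded down.

```python
DECLARED_MIB = 1931   # "memory bound" from the task properties
ACTUAL_MIB = 3186     # observed allocation (3,262,648 KiB, rounded down)


def tasks_that_fit(total_gib: int, share_percent: int, per_task_mib: int) -> int:
    """Number of tasks of a given size that fit into BOINC's RAM share."""
    budget_mib = total_gib * 1024 * share_percent // 100
    return budget_mib // per_task_mib


for total_gib in (16, 32):
    declared = tasks_that_fit(total_gib, 50, DECLARED_MIB)
    actual = tasks_that_fit(total_gib, 50, ACTUAL_MIB)
    print(f"{total_gib} GiB host, 50 % for BOINC: "
          f"{declared} tasks by declared bound, {actual} by actual usage")
```

With a 50 % share, a 32 GiB host has room for 8 tasks by the declared bound but only 5 by actual usage; a 16 GiB host has room for 4 versus 2.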

It's like loading a plane with cargo containers whose declared weight is far lower than their actual weight.

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17697180
RAC: 12898

Harri Liljeroos wrote:
Might be a good idea to post your findings in the Technical News section under the proper thread (O3MD1/F) or send a personal message to Bernd to lure him into taking a peek at this thread.

I'll forward a link to this thread to Bernd. I think there's no need to post everything again in Tech News. But I'll post a link in a proper thread there.

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17697180
RAC: 12898

GWGeorge007 wrote:
Scrooge, since you only have 4 GB of memory, wouldn't it be wise, and simpler, to add 4 GB of RAM so you have a total of 8 GB?  I assume that this particular computer is an older laptop running Win7.  Does it even have the ability to add 4 GB?  As you said, you essentially are running BOINC with just 2 GB of memory, and most of BOINC's Einstein projects are getting to be memory hungry, needing over 2 GB.

George, it's a maximum RAM configuration of an old ThinkPad. I'll try to replace all DIMMs with 2RX8 dual-ranked ones which in theory should be compatible... TODO list ...or retiring this oldie.

High end crunchers compete for TOP10. Low end crunchers test minimums. FGRP5 - runs smoothly on 32bit 1-core Intel Atom N280@1.66GHz, 2.5W TDP, 2 GB RAM... as well as BRP4 or previous O2 runs... ;-)   It's only O3MD1 exceeding bounds. If requirements were set correctly, these wouldn't fail.
