Losing progress on shutdown

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7359641687
RAC: 2271407

Gary Roberts wrote:
It's quite a while since I've run other projects so I'm not familiar with the naming used elsewhere.  In the past I've seen the default location expressed as "--" or maybe it was three dashes :-).

While I have not actually run SETI in quite a while, my login still works there.  I checked just now, and at my account's preferences page set to multi-column view I see locations named:

default, Home, School, Work.

I don't spot the function (which seems present at Einstein) to designate home, school, or work as the location to be assigned to newly attached computers by default.  So maybe this naming is not as bad as it looks, but I don't like it.  It can't help people who try to understand how their preference changes at one project affect their work elsewhere.

Consistent and useful naming is wonderfully helpful and remarkably hard to attain.  Digital Equipment's VMS environment was the best I have personally seen.  I suspect there was one clever, motivated person (or a very few of them) who made a lot of good choices early on, and there probably was also a brutal enforcement regime to confine people to the authorized path.

 

Ged
Joined: 7 May 05
Posts: 4
Credit: 12143322
RAC: 0

Thanks Mikey.

I know checkpointing is not done by BOINC itself; it's functionality that a project's application writers can choose to include. My concern was that, if the Einstein application developers use a common framework for checkpoint creation, maintenance and recovery across the whole project, a problem in that shared code could explain why more than one Einstein application has had checkpoint problems reported.
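
To make clearer the sort of shared pattern I have in mind, here's a toy sketch of the usual checkpoint/restart idea (in Python purely for illustration - I have no insight into how the real Einstein applications are actually written, and the file name, state layout and loop below are invented for the example):

import os, json

CKPT = "example_task.out.cpt"        # hypothetical checkpoint file name

def read_checkpoint():
    # On start-up, try to resume from the last checkpoint; if the file is
    # missing (e.g. on the very first run), start from scratch.
    try:
        with open(CKPT) as f:
            return json.load(f)["next_unit"]
    except (FileNotFoundError, KeyError, ValueError):
        return 0

def write_checkpoint(next_unit):
    # Write to a temporary file and rename it, so a crash or shutdown
    # part-way through the write can never leave a corrupt checkpoint.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_unit": next_unit}, f)
    os.replace(tmp, CKPT)

N_UNITS = 8                          # e.g. one unit of work per sky point

for unit in range(read_checkpoint(), N_UNITS):
    pass                             # ... the expensive science calculation ...
    write_checkpoint(unit + 1)

If a bug lived in shared code like that write/read pair, it would show up in every application built on top of it - which is really all I was asking about.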

Let me reiterate my thanks for your reply ;-)

Ged
Joined: 7 May 05
Posts: 4
Credit: 12143322
RAC: 0

Hi Gary,

 

Thanks for looking into this. It's much appreciated.

I've checked that my AV/malware checker still has the BOINC data folder excluded (it does), to avoid any 'false positive' actions being taken on that folder's content.

I'm keeping an eye on the current eight 'in progress' FGRP5 tasks. Seven of the eight have checkpointed; the one that hasn't (yet!) has been running for approximately 30 minutes and is around 7% complete...

Cheers,

Ged

Redvibe
Joined: 5 Apr 18
Posts: 11
Credit: 2189846
RAC: 0

Gary,

Looking at the tasks under My Account, I have found this:

Task 836034410
Gravitational Wave All-sky search on LIGO O1 Open Data v0.04 () x86_64-apple-darwin
Crashed executable name: einstein_O1OD1_0.04_x86_64-apple-darwin__Lion
Machine type Intel x86-64h Haswell (64-bit executable)
System version: Macintosh OS 10.13.4 build 17E199
Mon Mar 18 09:08:45 2019

atos cannot load symbols for the file einstein_O1OD1_0.04_x86_64-apple-darwin__Lion for architecture x86_64.
0   einstein_O1OD1_0.04_x86_64-apple-darwin__Lion 0x000000010d0bbe39  

Thread 1 crashed with X86 Thread State (64-bit):
  rax: 0x0100001f  rbx: 0x00000003  rcx: 0x700003c711b8  rdx: 0x00000028
  rdi: 0x700003c71228  rsi: 0x00000003  rbp: 0x700003c71210  rsp: 0x700003c711b8
   r8: 0x00001503   r9: 0x00000000  r10: 0x000009c8  r11: 0x00000206
  r12: 0x000009c8  r13: 0x00000003  r14: 0x700003c71228  r15: 0x00000000
  rip: 0x7fff6d1c420a  rfl: 0x00000206

___________________________________________________________________________________________

Could a problem with this task be causing me to lose all progress from all tasks? There are generally three tasks running in parallel (Gamma-ray pulsar search and/or Gravitational Wave all-sky search - I don't get any other kind of task). Anyway, if this one task is the source of the problem, is there a way to remove it?

 

I do actually have Einstein@Home running on two computers (both literally at home). The other computer is a Windows PC and the problem of lost progress is NOT happening on that one. It is only happening on the Mac.

 

anniet
Joined: 6 Feb 14
Posts: 1348
Credit: 5079314
RAC: 0

There is a task that I'm currently running:

LATeah0052F_88.0_976_-4.6e-11 (https://einsteinathome.org/workunit/396396137) which is not checkpointing.

[edit: The one above has now checkpointed for the first time, after a runtime of 4 hours. In case it's helpful to know, LAT cpu tasks usually take between 11 and 12 hours to complete on my computer, and H1 O1OD1, 16 hours]

I did notice two h1_0497.90 (etc) tasks last week, or it may even be the week before, that didn't checkpoint either. Unfortunately I didn't note their full file names at the time and they are no longer in my completed task lists.

I think there's somewhere I should be able to find the old event log entries, so I will go and see if I can find that when I get a moment. 

Also, if I'm remembering correctly, one of two h1_0497.65 tasks that I received, didn't checkpoint either, (this one I think: https://einsteinathome.org/workunit/395567645*) but I'll need to confirm that. It seems really quite random when it does happen.

 

[edit: so far I can confirm that neither of these (downloaded 13th March) checkpointed: h1_0497.65_O1C02Cl3In0__O1OD1_497.90Hz_1200_0 ; h1_0497.65_O1C02Cl3In0__O1OD1_497.90Hz_1199_0

 * but has, since I posted that - gone from the database[/edit] 

 

Please wait here. Further instructions could pile up at any time. Thank you.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5887
Credit: 119223646938
RAC: 25169750

Redvibe wrote:

Looking at the tasks under My Account, I have found this:

Task 836034410
Gravitational Wave All-sky search on LIGO O1 Open Data v0.04 () x86_64-apple-darwin
Crashed executable name: einstein_O1OD1_0.04_x86_64-apple-darwin__Lion
Machine type Intel x86-64h Haswell (64-bit executable)
System version: Macintosh OS 10.13.4 build 17E199
Mon Mar 18 09:08:45 2019

atos cannot load symbols for the file einstein_O1OD1_0.04_x86_64-apple-darwin__Lion for architecture x86_64.
0   einstein_O1OD1_0.04_x86_64-apple-darwin__Lion 0x000000010d0bbe39  

Thread 1 crashed with X86 Thread State (64-bit):
  rax: 0x0100001f  rbx: 0x00000003  rcx: 0x700003c711b8  rdx: 0x00000028
  rdi: 0x700003c71228  rsi: 0x00000003  rbp: 0x700003c71210  rsp: 0x700003c711b8
   r8: 0x00001503   r9: 0x00000000  r10: 0x000009c8  r11: 0x00000206
  r12: 0x000009c8  r13: 0x00000003  r14: 0x700003c71228  r15: 0x00000000
  rip: 0x7fff6d1c420a  rfl: 0x00000206

___________________________________________________________________________________________

Could a problem with this task be causing me to lose all progress from all tasks? There are generally three tasks running in parallel (Gamma-ray pulsar search and/or Gravitational Wave all-sky search - I don't get any other kind of task). Anyway, if this one task is the source of the problem, is there a way to remove it?

Because your computers are hidden, I couldn't (at first) see your tasks list to get a better picture.  I've reviewed all your previous messages and found your computer ID in the event log messages that you posted earlier.  That allowed me to 'cheat' by leveraging that ID to get the full list of tasks that currently remain in the online database.

It is highly unlikely that any one crashing task would also cause the loss of all progress on other tasks running on other cores in your machine.  The machine itself might crash, but when restarted, the other in-progress tasks should be able to restart from their respective saved checkpoints, or from the beginning if there were none.  BOINC is deliberately designed to be fault tolerant in this way.

I've now looked through the complete stderr output returned to the project for the same task that you listed above.  Because of the limit to the size of these files, that one has been truncated at the beginning.  You can tell this because the very first line shows "819." when (from the following numbers) it should read "7819." (and there should be many more similar lines preceding it as well).  I also noticed that your full tasks list only has 7 tasks in total, 4 of which show as aborted and seem to be pretty much in the same boat as the one you mentioned.

All show quite low CPU times and are recorded as aborted rather than just as a computation error.  Also, all of the stderr outputs seem to show evidence of restarting several times without ever saving a checkpoint.  In the 'header section' at the top of the stderr output there is the following, which seems rather unusual:


Outcome:          Computation error
Client state:     Aborted by user
Exit status:      203 (0x000000CB) EXIT_ABORTED_VIA_GUI

This seems to suggest that the computation error was caused by having all 4 in-progress tasks selected in BOINC Manager and then clicking the abort button.  I don't know about that because I don't think I've ever aborted an in-progress task.

Of the 4 aborted tasks, 3 were GW tasks and the 4th was an FGRP5 task.  I found the stderr output for that one quite intriguing for a number of reasons.  Here is an excerpt, starting at a time stamp of 11:05:42.  Note there are only 8 sky points whilst the figure for nf1dots is quite large at 547.  This means there will only be 8 checkpoints created, so the time between them will be quite long.  There should be very long (length=547) runs of 'dots' - but there aren't, so a checkpoint was never written.  This seems to suggest that computation was being interrupted and then, much later, restarted - eg 11:05:42 to 14:10:43 to 09:29:21 - presumably the next day.  To make the log easier to read, I've truncated the very long lines (those starting "command line:" and "output files:") since they aren't important to the discussion.


11:05:42 (1042): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
11:05:42 (1042): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0052F.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
read_checkpoint(): Couldn't open file 'LATeah0052F_56.0_168_-5.6e-11_0_0.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 1/8
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 547 df1dot: 1.834098234e-15 f1dot_start: -5.7e-11 f1dot_band: 1e-12
% Filling array of photon pairs
.......
14:10:43 (1225): [normal]: This Einstein@home App was built at: Jul 26 2017 12:06:48

14:10:43 (1225): [normal]: Start of BOINC application 'hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE'.
14:10:43 (1225): [debug]: 2.1e+15 fp, 5.1e+09 fp/s, 408920 s, 113h35m19s61
command line: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE --inputfile ../ <- truncated
output files: 'LATeah0052F_56.0_168_-5.6e-11_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0052F_ <- truncated
14:10:43 (1225): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
14:10:43 (1225): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0052F.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
read_checkpoint(): Couldn't open file 'LATeah0052F_56.0_168_-5.6e-11_0_0.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 1/8
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 547 df1dot: 1.834098234e-15 f1dot_start: -5.7e-11 f1dot_band: 1e-12
% Filling array of photon pairs
............
09:29:21 (613): [normal]: This Einstein@home App was built at: Jul 26 2017 12:06:48

09:29:21 (613): [normal]: Start of BOINC application 'hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE'.
09:29:21 (613): [debug]: 2.1e+15 fp, 5.1e+09 fp/s, 408920 s, 113h35m19s61
command line: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE --inputfile ../ <- truncated
output files: 'LATeah0052F_56.0_168_-5.6e-11_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0052F_ <- truncated
09:29:21 (613): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
09:29:21 (613): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0052F.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
read_checkpoint(): Couldn't open file 'LATeah0052F_56.0_168_-5.6e-11_0_0.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 1/8
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 547 df1dot: 1.834098234e-15 f1dot_start: -5.7e-11 f1dot_band: 1e-12
% Filling array of photon pairs
............................................................................................................................................................................................................................
11:02:45 (1055): [normal]: This Einstein@home App was built at: Jul 26 2017 12:06:48

11:02:45 (1055): [normal]: Start of BOINC application 'hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE'.
11:02:45 (1055): [debug]: 2.1e+15 fp, 5.1e+09 fp/s, 408920 s, 113h35m19s61
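
If anyone wants to do the same sort of digging on their own copy of a task's stderr output, here's roughly what I did by eye, written up as a quick Python sketch (the file name is just a placeholder for wherever you've saved the text, and the idea that one dot is printed per f1dot step is my assumption, based on the length=547 reasoning above):

import re

LOG = "saved_stderr.txt"    # placeholder for a saved copy of the stderr output

restarts = failed_reads = 0
dot_runs = []

with open(LOG) as f:
    for line in f:
        if "Start of BOINC application" in line:
            restarts += 1
        elif "read_checkpoint(): Couldn't open file" in line:
            failed_reads += 1
        else:
            m = re.match(r"^(\.+)\s*$", line)
            if m:
                dot_runs.append(len(m.group(1)))

print("restarts:", restarts, "  failed checkpoint reads:", failed_reads)
print("dot run lengths (a completed sky point should reach 547):", dot_runs)

In the excerpt above, each restart comes after a dot run well short of 547 and each attempt to read a checkpoint fails - exactly the 'never got far enough to write one' pattern.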

In thinking about the stderr outputs of the four aborted tasks and looking again at the event log you posted earlier, I notice you also support Milkyway.  Presumably, BOINC will be switching between Einstein tasks and Milkyway tasks from time to time.  Do you have the preference setting for 'keeping tasks in memory when suspended' selected?  If you don't, that might explain why tasks seem to be restarting from the beginning.  With that preference engaged, the state of a suspended task can be kept in memory whilst a Milkyway task is processing, even if no checkpoint exists.  If the preference is not set and there is no checkpoint, the task would need to restart from the beginning.

Also, and maybe even more importantly, the event log shows "suspend work if non-BOINC CPU load exceeds 25%".  Whilst you are using your computer, it's quite easy for a spike in non-BOINC activity to exceed 25%.  That could also be causing tasks to restart from the beginning.  Please realise that BOINC work runs at low priority and can usually 'get out of the way' easily when your computer needs to do something more important.  If spikes in user activity are causing BOINC to suspend tasks (and they are not being kept in memory), that might explain what you see.
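
If you'd like to confirm what the client is actually using for those two settings, here's one quick way to peek (just a sketch with my assumptions baked in: I believe the local override file is called global_prefs_override.xml, that the macOS path below is the usual default data folder, and that the relevant elements are named leave_apps_in_memory and suspend_cpu_usage; preferences set only on the website won't appear in this file):

import xml.etree.ElementTree as ET

# Assumed default BOINC data folder on macOS; adjust the path if yours differs.
PREFS = "/Library/Application Support/BOINC Data/global_prefs_override.xml"

root = ET.parse(PREFS).getroot()
for tag in ("leave_apps_in_memory", "suspend_cpu_usage"):
    el = root.find(".//" + tag)
    value = el.text.strip() if el is not None and el.text else "(not set locally)"
    print(tag, "=", value)

Of course, simply looking at the computing preferences in BOINC Manager tells you the same thing - this is only for anyone who prefers poking at the files directly.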

I'm just a volunteer like yourself so I have no particular insight into the design or internal workings of the applications.  Whilst I try to deduce what is going on, I could easily be wide of the mark.  There are many more 'cycles' after the ones I've shown above and you can check for yourself.  Perhaps you can correlate the various time stamps given in the complete log to what may have been happening on your machine at the relevant times.

Hopefully, I've given you some things to think about as you try to work out what is happening.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5887
Credit: 119223646938
RAC: 25169750

anniet wrote:

There is a task that I'm currently running:

LATeah0052F_88.0_976_-4.6e-11 (https://einsteinathome.org/workunit/396396137) which is not checkpointing.

[edit: The one above has now checkpointed for the first time, after a runtime of 4 hours. In case it's helpful to know, LAT cpu tasks usually take between 11 and 12 hours to complete on my computer, and H1 O1OD1, 16 hours]

Hi Annie,
What a pleasant surprise to see you here <==    rather than over there ==>   where you usually hang out :-).

Because it's still 'in progress', there is nothing on the website yet that shows how many sky points it contains.  I suspect it will be a very small number - probably the same (8) as showed up in that previous diatribe I posted just before answering you.  Don't worry, you're not going to get another diatribe - I'm fresh out of them, unfortunately :-).  However, there will be a mandatory test assignment with an enormous number of arcane questions to prove you've done your homework and taken in all the relevant details .... :-).  [Very sorry - couldn't resist ...]

If your tasks normally take around 12 hours, 8 checkpoints should mean about 1.5 hours between each one.  That's quite a rough estimate since part of that 12 hours (perhaps 30 to 60 mins) is for the followup stage which is the bit between 90-100% where progress seems to have stopped (but it hasn't).  Previous data files (eg LATeah0050F.dat) produced tasks with 79 sky points (you have a completed one in your list) so that would mean checkpoints about every 9 minutes, roughly.

In case you're a bit miffed on being denied a diatribe, here's a mini-one just for you.  With 79 sky points and therefore 79 checkpoints to be written for the main calculations, you can know exactly when each checkpoint has been written without examining the task properties.  Since 90%/79=1.139%, whenever you see the %done tick over to 1.139%, 2.278%, 3.417%, 4.556%, .... then you know a new checkpoint has just been written.  If your current in-progress task has 8 checkpoints, the sequence should be 11.25%, 22.50%, 33.75%, 45.00%, 56.25%, 67.50%, 78.75% and 90.00%.
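
If you'd rather let the computer produce that list of tick-over percentages, the arithmetic is trivial - a throwaway Python sketch, with the 90% figure being the main-calculation portion before the followup stage as described above:

def checkpoint_marks(sky_points, main_fraction=90.0):
    # One checkpoint per sky point, spread evenly across the first 90%
    # of the progress bar, so %done 'ticks over' in equal steps.
    step = round(main_fraction / sky_points, 3)
    return [round(step * i, 3) for i in range(1, sky_points + 1)]

print(checkpoint_marks(79)[:4])    # [1.139, 2.278, 3.417, 4.556]
print(checkpoint_marks(8))         # 11.25, 22.5, 33.75, ... 78.75, 90.0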

anniet wrote:
I did notice two h1_0497.90 (etc) tasks last week, or it may even be the week before, that didn't checkpoint either. Unfortunately I didn't note their full file names at the time and they are no longer in my completed task lists.

In view of the other similar reports, I would think there was a staff mini-oops that got quietly and quickly rectified before a huge cacophony broke out.  I'm not surprised that the 'evidence' has disappeared fairly quickly.  Probably best to get rid of the problems promptly and get back to normal operations :-).  I think there would be many ongoing reports if that problem still existed.

anniet wrote:
I think there's somewhere I should be able to find the old event log entries, so I will go and see if I can find that when I get a moment.

The file is named 'stdoutdae.txt' and it's usually in the BOINC data folder.  I wouldn't worry about it if I were you because, unless you have special cc_config.xml logging flags set, there won't be any way to deduce if checkpoints were being created or not.
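
In case you do decide you'd like the event log to record checkpoints in future, I believe the relevant log flag is checkpoint_debug in cc_config.xml (treat the flag name as my recollection rather than gospel).  This little Python sketch just reports whether it's already switched on:

import xml.etree.ElementTree as ET

# cc_config.xml lives in the BOINC data folder (adjust the path for your
# system); it only exists if you, or an add-on tool, created it.
CFG = "cc_config.xml"

try:
    flags = ET.parse(CFG).getroot().find("log_flags")
    value = flags.findtext("checkpoint_debug") if flags is not None else None
    print("checkpoint_debug:", value if value else "not set")
except FileNotFoundError:
    print("no cc_config.xml - only the default event log messages are recorded")

With that flag set to 1, and after telling BOINC to re-read its config files (or restarting the client), the event log should note each time a task writes a checkpoint.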

anniet wrote:

Also, if I'm remembering correctly, one of two h1_0497.65 tasks that I received, didn't checkpoint either, (this one I think: https://einsteinathome.org/workunit/395567645*) but I'll need to confirm that. It seems really quite random when it does happen.

 

[edit: so far I can confirm that neither of these (downloaded 13th March) checkpointed: h1_0497.65_O1C02Cl3In0__O1OD1_497.90Hz_1200_0 ; h1_0497.65_O1C02Cl3In0__O1OD1_497.90Hz_1199_0

 * but has, since I posted that - gone from the database[/edit] 

 

Don't worry about the old tasks, but it would be helpful to know if there is any further sign of the problem in your latest tasks.  If you do see any task with more than, say, 11.25% progress whose property page doesn't show a 'CPU time at last checkpoint' entry, a report would be appreciated.  I chose that figure because I don't know of any FGRP task with fewer than 8 sky points.  I don't know for certain how often the GW tasks are supposed to checkpoint, but I would guess it's rather more frequently than FGRP tasks.

 

Cheers,
Gary.

Redvibe
Joined: 5 Apr 18
Posts: 11
Credit: 2189846
RAC: 0

Good news! The work started yesterday has not been lost. What I did was go into the computing preferences in my account and select YES next to "disconnect when done". Not sure why that worked, but it seems to have done the trick. Coincidence, perhaps?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5887
Credit: 119223646938
RAC: 25169750

Redvibe wrote:
Good news! The work started yesterday has not been lost. What I did was go into the computing preferences in my account and select YES next to "disconnect when done". Not sure why that worked, but it seems to have done the trick. Coincidence, perhaps?

I'm glad things are working for you but I'd be extremely surprised if it had anything to do with the 'disconnect when done' setting.

That setting is under the Network Usage heading and, as I understand it (I don't know for sure as I've never needed to use it), its purpose is to hold a network connection open only for the period it is really needed.  Normally, BOINC will complain repeatedly if it doesn't have an 'always on' network connection, so I think the setting is a way to silence those complaints if you need to crunch without a connection.

Imagine you had a laptop with wifi only and no wifi router at home.  Each day if you went to the office (or any wifi hotspot) with that setting in place, you could use the wifi to establish a connection to the Einstein servers for the purpose of returning completed tasks and downloading a bunch of new work.  As soon as that operation completed, the connection would be closed.  The machine could continue to process tasks, storing the results - even in the evening when you go home.  The next day, the cycle could be repeated.  I think something like that is the purpose of that setting.

It's a bit hard to see the creation of checkpoints as being in any way related to this ability to run without a live network connection.  Only BOINC knows about the state of the network connection.  The science app will continue running and creating checkpoints when needed quite independently.

 

Cheers,
Gary.

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139002861
RAC: 0

I believe the “disconnect when done” setting is intended for dial-up connections (i.e. hang up when done with sending and receiving). Most people these days have an always-on internet connection, so it's largely irrelevant to the majority of us.

As Gary has said, I can't see how this would affect checkpointing. A more likely explanation is that the “bad” work units are done, so now you are getting ones that do checkpoint.
