In the last three days I have had four or five work units, almost all of them Einstein@Home, 'freeze' for hours and hours until I chose to abort the computation. I was wondering if anyone else has seen this happen. Current system: Mac Pro 3 GHz, 3 GB RAM.
Copyright © 2024 Einstein@Home. All rights reserved.
Computational Errors
What BOINC Client are you using?
Are there any messages in the messages tab?
Is there any activity on the Mac equivalent of Task Manager, where you check how much CPU each process is getting (I have no idea what it's called)?
What other projects are running on the system and have they frozen too?
Kathryn :o)
Einstein@Home Moderator
Activity Monitor, as it is called, shows that normally, when the machine is sitting idle, the CPU load is near 100%, occupied with the computation tasks. When the glitch occurs, it drops to about 50-70%. I am using client 5.4.9. Note that it is an Intel-based system, if you didn't catch that. SETI and Rosetta are also running, and both have frozen twice. In the time between posts two more errors occurred: one SETI, one Einstein.
Here is a message for the most recent Einstein error:
Tue Dec 5 04:22:39 2006|Einstein@Home|Restarting task h1_0255.0_S5R1__2103_S5R1a_0 using einstein_S5R1 version 428
Tue Dec 5 04:22:39 2006|Einstein@Home|Restarting task h1_0255.0_S5R1__2102_S5R1a_0 using einstein_S5R1 version 428
Tue Dec 5 04:22:39 2006|Einstein@Home|Restarting task h1_0255.0_S5R1__2101_S5R1a_0 using einstein_S5R1 version 428
Tue Dec 5 04:22:39 2006|rosetta@home|Pausing task 1tvg_1_NMRREF_1_1tvg_1_idid_model_01IGNORE_THE_REST_idl_1432_144_0 (removed from memory)
Tue Dec 5 04:22:39 2006|rosetta@home|Pausing task 1tvg_1_NMRREF_1_1tvg_1_idid_model_02IGNORE_THE_REST_idl_1432_144_0 (removed from memory)
Tue Dec 5 04:22:39 2006|rosetta@home|Pausing task 2snm_1_NMRREF_1_2snm_1_id_model_03_core_0001IGNORE_THE_REST_idl_1433_137_0 (removed from memory)
Tue Dec 5 04:22:39 2006|Einstein@Home|Can't create shared memory: system shmat
Tue Dec 5 04:22:39 2006|Einstein@Home|Unrecoverable error for result h1_0255.0_S5R1__2100_S5R1a_0 (Couldn't start or resume: -146)
Tue Dec 5 04:22:39 2006|Einstein@Home|Deferring scheduler requests for 1 minutes and 0 seconds
Tue Dec 5 04:22:39 2006||Rescheduling CPU: start failed
Tue Dec 5 04:22:39 2006|Einstein@Home|Unexpected state 7 for task h1_0255.0_S5R1__2100_S5R1a_0
Tue Dec 5 04:22:40 2006|Einstein@Home|Computation for task h1_0255.0_S5R1__2100_S5R1a_0 finished
At midnight last night this occurred repeatedly as well, and then BOINC seems to have resolved the issue:
Mon Dec 4 23:56:31 2006|Einstein@Home|Restarting task h1_0339.5_S5R1__16658_S5R1a_0 using einstein_S5R1 version 428
Mon Dec 4 23:58:45 2006||Can't rename state file: Error -1
Mon Dec 4 23:58:45 2006||Couldn't write state file: system rename
Mon Dec 4 23:59:45 2006|Einstein@Home|Task h1_0339.5_S5R1__16658_S5R1a_0 exited with zero status but no 'finished' file
Mon Dec 4 23:59:45 2006|Einstein@Home|If this happens repeatedly you may need to reset the project.
Mon Dec 4 23:59:45 2006||Rescheduling CPU: application exited
Aside from these two errors, everything else has been me aborting the task, with the abort being the only message. I aborted because, after monitoring the tasks for over an hour of 'running', no progress had been made, nor had the CPU time counter incremented at all.
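For anyone who wants to sift a long message log for failures like the ones quoted above, here is a minimal Python sketch. It assumes only what the pasted lines show: each entry is `timestamp|project|message`, with an empty project field for messages from the client itself. The list of error markers is my own guess taken from the messages in this thread, not an official BOINC list, and the function names are made up for illustration.

```python
from collections import Counter

# Substrings treated as failure markers. These are taken from the messages
# quoted in this thread (an assumption), not from any official BOINC list.
ERROR_MARKERS = (
    "Can't",
    "Couldn't",
    "Unrecoverable error",
    "Unexpected state",
    "no 'finished' file",
    "start failed",
)


def parse_line(line):
    """Split 'timestamp|project|message' into a tuple; None if malformed."""
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    # An empty project field means the message came from the client itself.
    return timestamp.strip(), project.strip() or "(client)", message.strip()


def find_errors(lines):
    """Return the parsed entries whose message contains an error marker."""
    errors = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and any(marker in parsed[2] for marker in ERROR_MARKERS):
            errors.append(parsed)
    return errors


def errors_per_project(lines):
    """Count flagged messages per project."""
    return Counter(project for _, project, _ in find_errors(lines))


if __name__ == "__main__":
    sample = [
        "Tue Dec 5 04:22:39 2006|Einstein@Home|Can't create shared memory: system shmat",
        "Tue Dec 5 04:22:39 2006||Rescheduling CPU: start failed",
        "Tue Dec 5 04:22:40 2006|Einstein@Home|Computation for task h1_0255.0_S5R1__2100_S5R1a_0 finished",
    ]
    for timestamp, project, message in find_errors(sample):
        print(timestamp, project, message, sep=" | ")
```

Paste the contents of the messages tab into a text file and feed its lines to `find_errors` to see whether the failures cluster on one project or around one time of day.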
My suggestion would be to try setting the projects to stay in memory. For some reason, when they are being paused and removed from memory, something is getting lost in the transition.
It almost sounds like you might have a bad sector that is being written to but then can't be read from after a unit goes into pause. Have you done a good drive test in a while? I know I've seen a perfectly good drive develop a bad sector once in a while, after enough use.
I ran a disk verify. Turns out there was an error in the disk header. I ran the repair utility, and so far no crashes... but they were infrequent. Nevertheless, it's looking better already... performance system-wide seems to have increased. Thanks for all the help.
Jared
My Mac Pro had this twice, calculating for Einstein only.
On one occasion I traced the error to an endless loop polling a blocked semaphore. Only two of the four Einstein threads made any progress, but all four were at 99% CPU usage. The semaphore seemed to be held by neither BOINC nor Einstein, though, so this is probably messy to fix. The weird part is that the two hanging Einstein threads were counted as executing by both BOINC and Activity Monitor, while Shark shows only syscalls. Simply quitting and restarting the BOINC app worked, but at that point it had wasted approximately two days of calculation time.
If anyone wants to investigate further, I still have the Shark trace. At least one thread seems to loop within boinc_init_options_graphics_impl.
Daran
And a followup, as it just happened again:-(
This time the trace shows three Einstein threads working fine, but the fourth is busy with syscalls (> 300000 calls per sec). BOINC again shows no progress on the thread. The thread is busy calling pthread_mutex_lock from timer_thread, worker_signal_handler and loadSkyGridFile, all three of which live in einstein_S5R1_4.28_i686-apple-darwin. No other process seems to have high activity.
It's been blocked for about 9h. Snoozing BOINC does not help, restarting the app does. This is annoying...
Daran
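One pattern that would produce a trace like this (a guess only, nothing in the Shark trace confirms it) is a non-reentrant lock being requested again by code that already holds it, for instance when a signal handler interrupts a function that is inside the locked region. The sketch below is not the Einstein or BOINC code. It uses Python's `threading` module as a stand-in for pthread mutexes, and `try_reacquire` is a made-up helper name. A timeout is used so the demonstration returns instead of hanging.

```python
import threading


def try_reacquire(lock, timeout=0.1):
    """Try to take `lock` once more; return True if that succeeds.

    With a lock that is already held and is not reentrant, the attempt can
    never succeed, so without the timeout this call would block forever.
    """
    got_it = lock.acquire(timeout=timeout)
    if got_it:
        lock.release()
    return got_it


def demo():
    """Compare a plain lock with a reentrant one, both held by this thread."""
    plain = threading.Lock()
    plain.acquire()
    plain_ok = try_reacquire(plain)          # times out: lock is not reentrant
    plain.release()

    reentrant = threading.RLock()
    reentrant.acquire()
    reentrant_ok = try_reacquire(reentrant)  # same thread may lock it again
    reentrant.release()
    return plain_ok, reentrant_ok


if __name__ == "__main__":
    plain_ok, reentrant_ok = demo()
    print("plain lock re-acquired:", plain_ok)
    print("reentrant lock re-acquired:", reentrant_ok)
```

If the real cause is along these lines, the thread would look busy while never advancing, which would match BOINC reporting no progress on it.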