I've noticed several times that the computation of a work unit seems to be going on (CPU 100%, screensaver loaded and running), but the progress indicator "hangs" at a certain percentage. In these situations, the screensaver isn't rotating smoothly; instead it jumps several degrees every half second, then gets "stuck" for half a second, and so on.
Eventually, I get messages that my work unit is overdue. I can't find a way to get that unit to finish its results. Here are the messages from the latest occurrence:
---------------------------------------------------
4-1-2008 13:47:46||Starting BOINC client version 5.2.13 for windows_intelx86
4-1-2008 13:47:46||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
4-1-2008 13:47:46||Data directory: C:\Util\BOINC
4-1-2008 13:47:47||Processor: 1 GenuineIntel Mobile Intel(R) Pentium(R) 4 CPU 2.80GHz
4-1-2008 13:47:47||Memory: 2.00 GB physical, 2.60 GB virtual
4-1-2008 13:47:47||Disk: 100.00 GB total, 73.70 GB free
4-1-2008 13:47:47|Einstein@Home|Computer ID: 588777; location: home; project prefs: default
4-1-2008 13:47:47|SETI@home|Computer ID: 2298551; location: home; project prefs: default
4-1-2008 13:47:48||General prefs: from unknown project http://bbc.cpdn.org/ (last modified 2006-04-18 01:03:10)
4-1-2008 13:47:48||General prefs: using your defaults
4-1-2008 13:47:48|Einstein@Home|Result h1_0654.65_S5R2__219_S5R3a_1 is 0.67 days overdue.
4-1-2008 13:47:48|Einstein@Home|You may not get credit for it. Consider aborting it.
4-1-2008 13:47:49||Remote control not allowed; using loopback address
4-1-2008 14:31:00|SETI@home|Deferring computation for result 01dc06af.24376.4571.9.6.99_0
4-1-2008 14:31:00|Einstein@Home|Deferring computation for result h1_0654.65_S5R2__219_S5R3a_1
4-1-2008 14:31:00||Resuming computation and network activity
4-1-2008 14:31:00||request_reschedule_cpus: Resuming activities
4-1-2008 14:31:00||Suspending work fetch because computer is overcommitted.
4-1-2008 14:31:00||Using earliest-deadline-first scheduling because computer is overcommitted.
4-1-2008 14:31:01|Einstein@Home|Restarting result h1_0654.65_S5R2__219_S5R3a_1 using einstein_S5R3 version 415
4-1-2008 14:32:42||Suspending computation and network activity - user is active
4-1-2008 14:32:42|Einstein@Home|Pausing result h1_0654.65_S5R2__219_S5R3a_1 (removed from memory)
4-1-2008 14:32:43||request_reschedule_cpus: process exited
4-1-2008 15:03:54||Resuming computation and network activity
4-1-2008 15:03:54||request_reschedule_cpus: Resuming activities
4-1-2008 15:03:54|Einstein@Home|Restarting result h1_0654.65_S5R2__219_S5R3a_1 using einstein_S5R3 version 415
4-1-2008 15:08:55||Suspending computation and network activity - user is active
4-1-2008 15:08:55|Einstein@Home|Pausing result h1_0654.65_S5R2__219_S5R3a_1 (removed from memory)
4-1-2008 15:09:01||request_reschedule_cpus: process exited
4-1-2008 15:12:41||Resuming computation and network activity
4-1-2008 15:12:41||request_reschedule_cpus: Resuming activities
4-1-2008 15:12:41|Einstein@Home|Restarting result h1_0654.65_S5R2__219_S5R3a_1 using einstein_S5R3 version 415
4-1-2008 15:21:28||Suspending computation and network activity - user is active
4-1-2008 15:21:28|Einstein@Home|Pausing result h1_0654.65_S5R2__219_S5R3a_1 (removed from memory)
4-1-2008 15:21:45||request_reschedule_cpus: process exited
4-1-2008 15:23:37|Einstein@Home|Unrecoverable error for result h1_0654.65_S5R2__219_S5R3a_1 (aborted via GUI RPC)
4-1-2008 15:23:37||request_reschedule_cpus: result op
4-1-2008 15:23:37|Einstein@Home|Computation for result h1_0654.65_S5R2__219_S5R3a_1 finished
4-1-2008 15:23:40||request_reschedule_cpus: project op
---------------------------------------------------
Anyone have a clue on what's happening here?
Thnx.
Computation goes on but progress-indication "hangs"
The progress indicator (% completed) does move in jumps anyway. The key indicator of lack of progress is the accumulated CPU time which should be increasing continuously. Can you confirm that the CPU time is also "stuck" and not increasing?
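As a rough illustration of that check, the sketch below compares two readings of the same task taken a few minutes apart. The function name `classify_task`, the reading format and the example numbers are all hypothetical; the values are assumed to be copied by hand from the BOINC Manager task list, and nothing here calls a real BOINC API.

```python
def classify_task(earlier, later):
    """Compare two (cpu_seconds, percent_done) readings of one task.

    Illustrative helper only, not part of BOINC. The readings are
    assumed to be taken a few minutes apart while the task is
    supposed to be running.
    """
    cpu_before, pct_before = earlier
    cpu_after, pct_after = later
    if cpu_after <= cpu_before:
        # No CPU time accumulated: the task is stuck (or suspended).
        return "stalled: CPU time is not increasing"
    if pct_after <= pct_before:
        # CPU time is ticking over; the % done simply has not jumped yet.
        return "running: CPU time is increasing, % done has not jumped yet"
    return "running: CPU time and % done are both increasing"


# Hypothetical readings: (CPU seconds, % done)
print(classify_task((41250.0, 62.1), (41250.0, 62.1)))
print(classify_task((41250.0, 62.1), (41370.0, 62.1)))
```

Only the first case, where the CPU time itself stands still, would point to a genuinely stuck task.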
I've seen this "stuck" condition a few times. Once "stuck", the task will indeed not complete and will eventually exceed the deadline unless you "unstick" it. From experience, the way to do this is simply to stop and restart BOINC. That has always worked for me each time I've seen the condition occur.
I can't comment about this as I never allow the graphics to run. To me it's more important that every available CPU cycle goes into the science calculations rather than into displaying a pretty image so I have zero experience with the workings of the graphics thread.
Have you tried completely stopping BOINC and then restarting it?
Here are some comments about the messages you have listed.
These show approximately 1.5 hours of action after the task had already exceeded its deadline by about 16 hours. There is nothing in these messages that shows anything about how the computation got stuck in the first place. Can you confirm that during this 1.5 hour period, the CPU time expended on the partially completed task was stuck at some particular value and not increasing at all? If so, I doubt that what you are seeing is exactly the same as what I've seen myself. Since this particular log snippet commences with a restart of BOINC, I would have expected that the CPU time and the % completed would both have been increasing if it were the same problem as the ones I've seen.
The version of BOINC you are using (5.2.13) is quite an old one. Whilst this shouldn't affect the computation of a task, there have been many enhancements and bug fixes in more recent versions and an upgrade (which is very easy to do) is more than warranted. One of the things that is useful for trying to see exactly what is happening during computation is the stderr.out log that is available when tasks are completed and returned. Current versions of BOINC show quite detailed information in these logs. In your case, clicking on your task only shows the fact that you aborted it. There is no detailed information log showing what was happening up to the time you aborted it.
In your messages log, there seems to be a large gap between 13:47:49 and 14:31:00. Did anything happen during that period? I would have expected to see entries showing the stopping and starting of the science app in response to your use of the computer. Also there should have been an entry showing the initial startup of the science app immediately after the starting of BOINC.
Because of your choice to stop the science app when the user is active, it's always going to be slow for you to complete tasks. In the 1.5 hour period of the messages snippet, there is only about 15 minutes of actual computation showing. You also have chosen to have the task removed from memory when suspended. This will result in the loss of all calculations towards the next checkpoint each time a task is suspended because of user activity. Are you sure that it's not simply user activity that is causing a task to appear to be stalled?
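The "about 15 minutes" figure can be checked against the log itself. The sketch below sums the time between each "Restarting result" line and the next "Pausing result" line in the snippet posted at the top of the thread. The function names are made up for this example, and the timestamp format is assumed to be day-month-year.

```python
from datetime import datetime

# The six relevant lines, copied from the message log posted above.
LOG = """\
4-1-2008 14:31:01|Einstein@Home|Restarting result h1_0654.65_S5R2__219_S5R3a_1 using einstein_S5R3 version 415
4-1-2008 14:32:42|Einstein@Home|Pausing result h1_0654.65_S5R2__219_S5R3a_1 (removed from memory)
4-1-2008 15:03:54|Einstein@Home|Restarting result h1_0654.65_S5R2__219_S5R3a_1 using einstein_S5R3 version 415
4-1-2008 15:08:55|Einstein@Home|Pausing result h1_0654.65_S5R2__219_S5R3a_1 (removed from memory)
4-1-2008 15:12:41|Einstein@Home|Restarting result h1_0654.65_S5R2__219_S5R3a_1 using einstein_S5R3 version 415
4-1-2008 15:21:28|Einstein@Home|Pausing result h1_0654.65_S5R2__219_S5R3a_1 (removed from memory)
"""


def parse_time(line):
    """Read the timestamp before the first '|' (assumed day-month-year)."""
    return datetime.strptime(line.split("|", 1)[0], "%d-%m-%Y %H:%M:%S")


def compute_seconds(log_text):
    """Sum the seconds between each restart and the following pause."""
    total = 0.0
    started = None
    for line in log_text.splitlines():
        if "|Restarting result " in line:
            started = parse_time(line)
        elif "|Pausing result " in line and started is not None:
            total += (parse_time(line) - started).total_seconds()
            started = None
    return total


seconds = compute_seconds(LOG)
print(f"{seconds:.0f} s of computation, about {seconds / 60:.1f} minutes")
```

For these six lines the three runs last 101, 301 and 527 seconds, 929 seconds in total, which is roughly 15.5 minutes out of the 1.5 hours covered.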
One final point. Even though the task you aborted had exceeded the deadline, you probably didn't need to terminate it at that point and lose all (around 62%) the accumulated crunching. It has been assigned to another cruncher who is still working on it. There may have been time for you to have finished it anyway before the third person does so.
Hopefully, the above comments may give you some points to consider. If I'm not properly understanding your problem, please respond with more information as it's certainly not usual for the science app to be stalled and not making progress.
Cheers,
Gary.
Hi Gary,
Thanks for the reply. I'll try to answer your questions, and I will check the conditions and try the solutions the next time it occurs.
I'm not sure about that, but I think it was.
Does a reboot of the PC count as a restart? If so, yes, I've tried that.
As said above, I didn't really check the total computation time; all I noticed was the progress indicator not increasing at all, the CPU going at 100% (as usual) and the screensaver doing its thing (but irregularly).
It did the job thus far... so no reason to go looking if there is a new one, but it would be nice to have an update function in the BOINC client app.
I'm not sure, there are three possibilities:
1. I was working on the PC during that period,
2. BOINC had no connection with the host during that period (that happens fairly often; I then open the BOINC manager and select the local host, which seems to work),
3. BOINC was running that period.
What can I say? I'm working on that computer... It's not a server doing almost nothing. ;-)
I didn't know that. It seemed a logical thing to do when I had only 0.5 GB of memory and the memory counter went towards 900 MB... Maybe I'll change this now so it holds the data in memory.
I know that, but I also know from experience that, when Einstein is in this state, it will never get to the end of it. What's more, it could already have been stuck for days; the only reason I notice these "hangs" is because of the "overdue" messages in the message window. Earlier I had one which was more than a week overdue.
I will keep an eye on this (and check out the upgrade possibilities to a newer version of BOINC). Thanks so far,
Grtz,
Richard
RE: RE: The version of
Since 5.8.x, the BOINC client warns you when a new BOINC client is available; the warning shows in the Messages tab. Since you don't even have that 'new' a client, you won't see it.
Why not bookmark http://boinc.berkeley.edu/download.php and check it about once a month?
Hi Richard,
Thanks very much for taking the trouble to answer my questions in detail.
Normally, the CPU time will "tick over" approximately every second. The % done will increment every minute or two if the process is not being interrupted somewhere during that interval. You haven't described your usual usage pattern and the sort of apps that you regularly run on your computer but if you normally start up your box when you want to use it and then do a session of activities like browsing, reading and responding to email, composing or editing documents and other normal "office-like" activities, it wouldn't be surprising to find that you could do several hours of such activities with little progress in the % done figure for your BOINC projects. If you then shut down your computer to go do something else, BOINC would be hard pressed to make any progress at all. Perhaps something like this is causing you to think it is "stuck".
If you are committed to your current settings of allowing BOINC to run only when the computer is idle, you could make a considerable improvement by adjusting the default interval after keyboard/mouse activity ceases before BOINC is allowed to start. If I remember correctly, the default value for your version of BOINC was 3 minutes, which is a very long interval if you sit and watch it expire. If you were to set it to 0.1 minutes, BOINC would be able to kick in a lot sooner and a lot more frequently. Now that you have 2.0 GB of RAM, there's no real reason to force the app out of RAM when suspended. The combination of these two changes should make a fairly significant improvement in the rate of progress.
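To get a feel for why the start delay matters, here is a minimal sketch. It assumes BOINC computes during each idle gap only after the delay has expired; the gap lengths are invented for illustration and `compute_minutes` is a hypothetical helper, not a BOINC setting or API.

```python
def compute_minutes(idle_gaps, start_delay):
    """Minutes of computation obtained from a list of idle gaps.

    idle_gaps: lengths (in minutes) of the pauses between bursts of
    keyboard/mouse activity. start_delay: minutes of idleness BOINC
    waits for before it resumes. Simplified model for illustration.
    """
    return sum(max(0.0, gap - start_delay) for gap in idle_gaps)


gaps = [2, 5, 10, 1, 4]  # hypothetical idle gaps during a working session
for delay in (3.0, 0.1):
    minutes = compute_minutes(gaps, delay)
    print(f"delay {delay} min -> {minutes:.1f} min of computation")
```

With these made-up gaps, a 3 minute delay leaves 10.0 minutes of computation out of 22 idle minutes, while a 0.1 minute delay leaves 21.5 minutes.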
Absolutely! This is one of the reasons why I'm now thinking that you don't have the genuine freezing of the app that I've occasionally noticed but rather a "pseudo-stuck" situation due to the way your preferences are working.
When software is under very active development (like BOINC is) you miss out on important bug fixes and new features if you don't occasionally upgrade. When you do eventually upgrade after quite an interval, you usually then realise that the old version that you thought was working fine, wasn't really working optimally after all.
One of the reasons that BOINC doesn't "auto upgrade" is that many people who volunteer for distributed computing want to maintain full control and aren't particularly happy with "stealth upgrades" that change things without their direct knowledge and consent. It could be quite a politically sensitive issue. As Ageless mentions, more modern versions of BOINC now keep you directly informed when new recommended versions are available.
The upgrade procedure is straightforward and quick to do. It can be done at any time and there is no loss of data or settings if you follow the steps precisely:-
* Completely stop BOINC if it is currently running.
* Use Windows "Add/Remove..." to uninstall the old version of BOINC (DO NOT manually change anything else).
* If you are cautious you could make a backup copy of the residual BOINC folder at this point before proceeding.
* Use the desktop icon to launch the installation of the new version of BOINC and answer the questions asked.
* Make sure you select the old BOINC folder as the place to install it (in your case C:\Util\BOINC).
You will be asked to choose between different types of installs - in particular single user install, shared install or service install. There is probably general reluctance to do so but I firmly believe that a service install is the best unless you have a very strong reason for choosing another type. However that is a choice that you have to make and is beyond the scope of this reply.
At the end of the install, the new version will launch and will find all your previous data and settings and will simply carry on from where the old version left off.
In which case there should have been messages announcing the fact each time things stopped and restarted.
I don't really follow you here. Unless you are running BOINC as a service, BOINC is usually launched by BOINC Manager which acts as a GUI front end to BOINC. BOINC is simply a process running on your computer and using very little of your computer's resources. BOINC controls the scheduling of the various science apps. These are what consume resources but because they run at very low priority, they tend to consume mainly the CPU cycles that otherwise would be wasted. When you refer to "no connection with the host" I assume you may be referring to situations where the communication between BOINC Manager and BOINC may have broken down, perhaps? Newer versions of BOINC are likely to be more robust in this regard.
I presume you mean "WASN'T running...". Once started, BOINC is always running until you tell it to stop. Depending on circumstances, there are a couple of ways you may have done this. However, if BOINC did stop, so would have the messages. You appear to have a continuous log in which case BOINC was running all the time.
I was mainly wondering if (to conserve space) you had deliberately left out some messages for the period I specified.
It's a common fallacy to believe that BOINC is likely to interfere with your normal use of your own computer. (It's also a fallacy to think that a personal desktop machine is likely to be more short of available CPU cycles than a server but that's another story for another time :) ). Sure, there may be a few power users out there that regularly consume 20% or more of their computer's CPU cycles but they would be pretty few and far between. On average, I would suspect that a modern personal desktop would be lucky to have more than 5% of all its CPU cycles used for anything but the idle loop.
BOINC does a very good job of getting out of the way when it has to. Have you ever tried to allow BOINC to run continuously for a few hours during your normal usage and see if you can notice the difference? Since your machine appears to be a laptop, you would tend to notice the fan running at higher speed all the time but I'm talking about performance rather than the problem of dealing with excess heat. Because of the heat issue, I refuse to run BOINC on laptops these days, but that again is a different issue.
I think you would be wise to do that, particularly on older versions of BOINC. I think that more recent versions are smarter at waiting for a checkpoint to be written before switching apps for example. You could imagine the loss if BOINC were regularly switching between projects that were not held in memory and where checkpoints weren't being written all that often. E@H is quite frequent with its checkpointing so the potential losses are not that large.
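A small sketch of that loss, under a deliberately simple model: a checkpoint is written every `checkpoint_interval` seconds of run time counted from each restart, and everything after the last checkpoint is thrown away when the task is removed from memory. The run lengths are the three restart-to-pause intervals in the log at the top of this thread (101, 301 and 527 seconds); the two checkpoint intervals are made-up values, not the real E@H figure.

```python
def lost_seconds(run_lengths, checkpoint_interval):
    """Seconds of work discarded across a series of interrupted runs.

    Simplified model: each run keeps only whole checkpoint intervals;
    the remainder is lost when the task is removed from memory.
    """
    return sum(run % checkpoint_interval for run in run_lengths)


runs = [101, 301, 527]  # seconds between each restart and the next pause
for interval in (60, 300):  # hypothetical checkpoint intervals in seconds
    lost = lost_seconds(runs, interval)
    print(f"checkpoint every {interval} s -> {lost} of {sum(runs)} s lost")
```

In this model, frequent checkpoints (every 60 s) cost 89 of the 929 seconds, while infrequent ones (every 300 s) cost 329 seconds, more than a third of the work done.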
OK, you're welcome! Keep us informed as to how things work out.
Cheers,
Gary.
Possibly related to this problem is the phenomenon my copy of boinc_5.10.32_macOSX_universal is presently exhibiting: BOTH the "Elapsed Time" AND the "Time Remaining" count UP! I should think that the latter should count DOWN!
RE: Possibly related to
I would think that it's quite unrelated because the OP was describing a situation where the progress (% done) had completely stalled whereas your situation doesn't seem to be stalled at all.
Unless you give a lot more details and unless you have observed this over a lengthy period of time and over several tasks, the above may in fact be quite normal. You have two active computers and the results lists of both look quite normal and uneventful. There's no sign of anything but the normal variation in crunch times that everyone experiences.
A possible explanation is that the current task on which you are observing the above is going to take somewhat longer to crunch than previous tasks have. If that's the case, BOINC is simply responding to that by increasing its estimate of how much more time will be needed to complete the task. That way you would see (for a period anyway) both elapsed time and remaining time increasing. If you observe for long enough, the remaining time will reach a peak and then start to decline.
If this doesn't explain your situation, please feel free to start a new thread and more fully describe what it is that you are seeing over a period of time.
Cheers,
Gary.
RE: RE: Possibly related
I have the same "issue" with both Windows and Linux installs. Since the beginning of the S5 run, the "Time Remaining" counts up for a minute or two and then it falls back. For example, on this Linux box, I just watched the current workunit work its way from 19:19:xx to 19:21:xx and suddenly fall back to 19:17:xx. The time issue is tied into the S5 app (?, have no idea) and wasn't present in the previous S4 run.
RE: I have the same
I strongly suspect that this is not a bug or anything like it, it's a side effect caused by differences between the way work was split into workunits between S5R2 and S5R3.
In S5R2 (and earlier), each workunit consisted of (IIRC) tens of thousands of individual computations. In S5R3, a workunit consists of only ca. 1200 such steps (but they take longer).
After each such step is finished, the "progress" reported to the boinc core client is updated (and the crosshair in the screensaver will move to a new position in the sky). Because of this change, progress now is updated less frequently (depending on speed of the PC between ca 15 seconds and more than a minute!). During the time when progress is not updated, some versions of the BOINC GUI will still calculate a new estimated remaining time, which will *increase* if the GUI sees no progress (as the estimate is based on the time it took to get to the current progress). Once a "progress ping" is received from the science app, the remaining time is adjusted and will drop noticeably.
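A minimal sketch of that effect, assuming the simplest possible estimator (every remaining step takes as long as the average step so far). This is not the actual BOINC GUI code; `remaining_estimate` and the example numbers are hypothetical, with 1200 steps taken from the approximate S5R3 figure above.

```python
def remaining_estimate(elapsed_seconds, steps_done, steps_total=1200):
    """Naive remaining-time estimate in seconds.

    Assumes each remaining step takes the average time per step so far.
    Illustrative only; the real BOINC estimate may be computed differently.
    """
    return elapsed_seconds * (steps_total - steps_done) / steps_done


# Halfway through after 10 hours of crunching.
print(remaining_estimate(36000, 600))
# 30 seconds later, still no "progress ping": the estimate goes UP.
print(remaining_estimate(36030, 600))
# Step 601 is reported 60 seconds after step 600: the estimate drops.
print(remaining_estimate(36060, 601))
```

The three estimates are 36000.0, 36030.0 and 35940.0 seconds: the remaining time climbs second by second while no step is reported, then falls back by more than it gained once the next step arrives.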
CU
Bikeman
Ah, no worries Bikeman. I never considered it a bug and figured it had something to do with how current S5 processing was implemented. I apologize that the tone of my last post sounded like a query. I was trying to state my observation directed toward DanielRKilloran, as I'm pretty sure that if he checked that time status in a period of 2-5 mins., he would confirm the same.