Sometimes a task will stall, and I can't get it to run. The task's status will say "running", but it is making no progress.
stalled computation
I've observed exactly this several times over the last few weeks and have commented on it towards the end of this message. I have also noticed it just today on another machine. The only difference in my case is that I seem to be able to "un-stall" it rather easily. I'm presuming that your stalled result was the one that you aborted recently?
When it happens, I simply stop BOINC completely. I have it running as a service (or as a daemon on Unix), so I simply stop the service. On restarting the service, crunching resumes from the last saved checkpoint and there seems to be no further problem.
Bernd has said that he will be releasing a new suite of apps soon and hopefully glitches like this will be sorted out at that point.
The one I found today had been stalled for long enough that the result in progress had already passed its deadline before I noticed the problem. It has now completed crunching and has been successfully validated. The particular result has this result ID.
Cheers,
Gary.
Hi Gary,
The last result you quoted started with the 4.17 application, right? To me it looks as if it had run into a null-pointer problem, but may have stalled while invoking the runtime-debugger?? It was then able to recover from the previous checkpoint with the new app version.
CU
BRM
I was running BOINC with E@H on some dual CPU servers (with 2 cores each, that is '4 CPUs' for Einstein).
BOINC is installed as a service, of course, with the screensaver switched off.
On 2 of these servers it happened that the BOINC service itself occupied a whole CPU and the application didn't get any CPU time...
Perhaps this is the same situation you notice as 'stalled' on a single CPU?
I could 'resolve' this problem by switching my BOINC client back from 5.8.16 to 5.4.11.
EDIT: Servers are Win2003 Ent.Edition SP1
Udo
Quote:
The last result you quoted started with the 4.17 application, right?
That's correct. 4.17 was the current version on June 18 when the result was first received.
Quote:
To me it looks as if it had run into a null-pointer problem, but may have stalled while invoking the runtime-debugger??
Not being literate in C (or whatever was used), I'm not familiar with null-pointers or runtime debuggers, but yes, you can see in the stderr.txt output where an unhandled exception was detected and the Windows runtime debugger was loaded. The debugger announced its version number so I presume it was engaged successfully?? Notice the dump timestamp of June 26 at 00:26:46 local time. At that point the result had been processing for around 90+ hours (from memory) and was about 95% completed.
You will notice that it was restarted today at 13:30 local time, more than a week after it had initially stalled. Maybe the stall was to do with the debugger not being able to proceed?? You will also notice that when it was restarted, 4.17 was initially being used, but then I had the bright idea to speed up the final crunching by switching to 4.24. If you scan down the output you will see that this happened at 15:15 local time, after I had set up a hacked app_info.xml to override the 4.17 version that the result was "branded" with in the state file.
Quote:
It was then able to recover from the previous checkpoint with the new app version.
Almost correct :) just exchange "new" for "old" - see above :).
One final point - I've never opened a zip archive containing the actual result itself so I have no idea what it looks like. I did read with interest your comment about this (to Brian, I think) elsewhere, so no doubt one of these days I'll satisfy my curiosity :). I'm just hoping that there may be information there, rather than in stderr.txt, that may be of use to Bernd.
Cheers,
Gary.
Quote:
Perhaps this is the same situation you notice as 'stalled' on a single CPU?
I don't think so, because I recall on one occasion starting the Windows Task Manager to see what was running. Both BOINC and the science app were listed but not really consuming any CPU. The idle process was at 98-99%.
Quote:
I could 'resolve' this problem by switching back my BOINC client from 5.8.16 to 5.4.11
I've only ever needed to stop and restart BOINC, not change version.
In today's episode, the machine in question was headless, keyboardless and mouseless. Using BoincMgr on another machine, I could see that BOINC had stalled, but to get direct access to the box I had to plug in a USB mouse and a monitor. I have icons on the desktop that invoke small scripts to stop and start the BOINC service, so it's quite easy to get things going and even install the app_info.xml with just the USB mouse :).
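For anyone wanting something like those stop/start scripts, here is a minimal sketch in Python. It is not the script used above. It assumes a Windows machine, an account with rights to control services, and that the service was registered under the name "BOINC"; check the real name in the Services panel first.

```python
import subprocess
from typing import List

# Assumption: the service name BOINC was registered under on this machine.
SERVICE_NAME = "BOINC"


def service_commands(service: str = SERVICE_NAME) -> List[List[str]]:
    """Build the Windows commands that stop and then start a service."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_service(service: str = SERVICE_NAME) -> None:
    """Run the stop and start commands in order.

    Raises subprocess.CalledProcessError if either step fails, for
    example when the service name is wrong or permissions are missing.
    """
    for command in service_commands(service):
        subprocess.run(command, check=True)
```

Calling `restart_service()` does the same as clicking a "stop" icon followed by a "start" icon; crunching should then pick up from the last checkpoint, as described above.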
Cheers,
Gary.
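Quote:
Not being literate in C (or whatever was used) I'm not familiar with null-pointers or runtime debuggers but yes, you can see in the stderr.txt output where an unhandled exception was detected and the Windows runtime debugger was loaded. The debugger announced its version number so I presume it was engaged successfully?? Notice the dump timestamp of June 26 at 00:26:46 local time.
With the new app, the runtime debugger will produce tons of output, not just a few lines. Maybe the app really stalled while in the debugger.
Quote:
One final point - I've never opened a zip archive containing the actual result itself so I have no idea what it looks like. I did read with interest your comment about this (to Brian I think) elsewhere so no doubt one of these days I'll satisfy my curiosity :). I'm just hoping that there may be information there, rather than in stderr.txt that may be of use to Bernd.
The unzipped result file is pretty "boring"...it's pure science. If I remember correctly, it's 10,000 rows in plain ASCII, each containing 5 floating point numbers separated by spaces. Each row represents one "candidate" for a pulsar (ok, this is probably simplified).
1st column: has something to do with the spinning frequency of the pulsar
2nd & 3rd columns: sky coordinates of the candidate, in a longitude/latitude style coordinate system (RA/DEC)
4th column: something about the change in the spinning frequency (most pulsars seem to change their rotation over time, depending on their age)
5th column: the so-called "F-Statistic", a numerical value that measures how well the observed data from the detectors matches the hypothesis that there is a gravitational wave coming from a pulsar with the given spinning characteristics and given sky position.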
The final paper on the S3 analysis linked on the E@H home page explains some of this stuff pretty well, even for a physics novice like me :-).
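If you want to poke at an unzipped result file yourself, the column layout described above can be read with a few lines of Python. This is only a sketch based on that from-memory description: the field names are my own, the order RA then DEC for columns 2 and 3 is an assumption, and the real file may differ in detail.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    frequency: float    # 1st column: related to the pulsar's spin frequency
    ra: float           # 2nd column: sky position (assumed right ascension)
    dec: float          # 3rd column: sky position (assumed declination)
    spindown: float     # 4th column: related to the change in spin frequency
    f_statistic: float  # 5th column: the F-Statistic


def parse_candidates(lines: Iterable[str]) -> List[Candidate]:
    """Parse whitespace-separated rows of five floats, skipping blank lines."""
    candidates = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"line {number}: expected 5 columns, got {len(fields)}")
        candidates.append(Candidate(*(float(field) for field in fields)))
    return candidates


def strongest(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the largest F-Statistic."""
    return max(candidates, key=lambda c: c.f_statistic)
```

With a file open as `handle`, `strongest(parse_candidates(handle))` gives the row whose data best matched the pulsar hypothesis, in the sense of the 5th column.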
CU
BRM
Quote:
With the new app, the runtime debugger will produce tons of output, not just a few lines. Maybe the app really stalled while in the debugger.
At the time the debugger was called it wasn't the new app; it was still 4.17, so it was using the symbol information from the locally stored .pdb file, I guess. So I guess there was supposed to be a whole lot more output in stderr.txt, and I would tend to agree that things were stalled in the debugger.
I guess that this result won't be much use for troubleshooting in that case. Also, since the OP aborted his stalled result (I think), there is no stderr.txt output in his case to see what happened there.
BTW, thanks for the info about the result file structure.
Cheers,
Gary.
If your result is actually stalled, it would be a good idea to look into the last line(s) of the stderr.txt in the slot directory of that result, and e.g. post it here.
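A small helper for grabbing those last lines, as a sketch only. The example path `slots/0/stderr.txt` is an assumption; use whichever slot directory under your BOINC data directory belongs to the stalled result.

```python
from collections import deque
from pathlib import Path
from typing import List


def last_lines(path: Path, count: int = 20) -> List[str]:
    """Return up to `count` trailing lines of a text file, newlines removed."""
    # errors="replace" keeps odd bytes in a crash dump from raising an error.
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the newest `count` lines while reading.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

For example, `print("\n".join(last_lines(Path("slots/0/stderr.txt"))))` prints the tail that is worth pasting into a post. The file is only read, so this should be safe to run while the result is still stalled.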
At the very beginning, "Reading SFTs and setting up stacks..." can take quite some time, depending on the speed of the machine. Another operation that can take pretty long, depending on the history of the run, is resuming from a checkpoint. While these are in progress, no progress counters etc. are updated.
Another possibility is that the communication between App, BOINC Client and Manager is broken somewhere in between. In these cases it might be helpful to open the App's graphics, as the progress counter displayed there is independent of the communication with the Core Client.
BM
My result was stalled for over a week before I noticed it.
After I stopped and restarted the BOINC service, crunching resumed immediately and completed without further incident. The result has been uploaded, reported and validated, so any info in the slot directory is long gone.
Isn't the stderr.txt file that was in the slot directory now visible on the website if you follow the result ID link I posted?
This stalled result syndrome has happened a few times to me on different machines. If I get another one, I will leave it stalled and post the .txt while it is still in the stalled condition.
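To avoid a stall going unnoticed for a week, one could log the task's progress now and then and flag it when the value stops changing. The sketch below is hypothetical: how the `(timestamp, fraction done)` samples get collected is left open, and the 6-hour default threshold is just a guess, chosen to be generous because long phases such as resuming from a checkpoint update no counters.

```python
from typing import List, Tuple

# (unix timestamp in seconds, fraction done between 0 and 1)
Sample = Tuple[float, float]


def stalled_for(samples: List[Sample]) -> float:
    """Seconds between the last change in fraction done and the newest sample."""
    if not samples:
        return 0.0
    last_change_time, previous = samples[0]
    for timestamp, fraction in samples[1:]:
        # Any change counts as activity, including a drop in progress
        # after a restart from an earlier checkpoint.
        if fraction != previous:
            last_change_time = timestamp
            previous = fraction
    return samples[-1][0] - last_change_time


def is_stalled(samples: List[Sample], threshold_seconds: float = 6 * 3600) -> bool:
    """True when progress has not changed for at least the threshold."""
    return stalled_for(samples) >= threshold_seconds
```

A flagged task is only a candidate: as noted earlier in the thread, it is worth checking the end of stderr.txt before restarting anything.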
Cheers,
Gary.
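Quote:
Isn't the stderr.txt file that was in the slot directory now visible on the website if you follow the result ID link I posted?
Yes, it is. Unfortunately it only reveals that it did indeed get stuck starting the Windows Runtime debugger. Someone should point Rom Walton (BOINC) to this.
(Actually there is more it tells us: whatever the reason was for the general access violation, it didn't persist across a restart of the App, as the result completed successfully afterwards.)
It would be interesting to know if this is the only reason for stalled results, or if there are others.
Quote:
This stalled result syndrome has happened a few times to me on different machines. If I get another one, I will leave it stalled and post the .txt while it is still in the stalled condition.
Yep, that's what I meant.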
BM