Got two (161484922, 161484846) signal 11 results on my i7 920 root server with the quad ABP2 WUs yesterday. Before and afterwards everything worked like a charm. Both WUs crashed at the same time, with no hints in the system log files.
Interesting. Anything running on this machine that could eat up memory at that time?
Quote:
Interesting thing is I still get ABP2 WUs stamped as done by app 1.08, while the actual app should be 1.11.
The CUDA App version is at 1.11; the CPU App is 1.08. That's ok.
BM
The server has 8GB RAM and low load, just some Apache instances, a mail server and PostgreSQL running.
I see these 'segfault' errors occasionally on some machines. Usually all app instances running there get the signal at the very same time, regardless of which source code line they are executing or which data they are processing, so this isn't a real programming error.
I suspected Linux's 'optimistic memory allocation' was responsible: it randomly kills processes when physical memory isn't enough for the memory it 'optimistically' assigned to them. But it's hard to believe that this is the case here.
We currently lose up to ~2000h of computing time per day due to this problem.
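One way to test the overcommit suspicion on an affected host, sketched in Python under some assumptions: the overcommit policy is read from /proc/sys/vm/overcommit_memory (0 = heuristic, 1 = always, 2 = strict), and the out-of-memory killer leaves kernel log lines containing "invoked oom-killer" or "Out of memory: Kill...". The function names are made up for illustration. Note that the OOM killer normally ends a process with SIGKILL (signal 9), not signal 11, so an empty result here would fit the doubt expressed above.

```python
import re
from pathlib import Path
from typing import List, Optional

# Meaning of the three values of /proc/sys/vm/overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}

# Typical fragments of kernel log lines written when the OOM killer acts.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory: Kill", re.IGNORECASE)


def describe_overcommit(mode: int) -> str:
    """Translate the numeric overcommit mode into a readable description."""
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def read_overcommit_mode(path: str = "/proc/sys/vm/overcommit_memory") -> Optional[int]:
    """Return the current overcommit mode, or None if the file is missing."""
    proc_file = Path(path)
    if not proc_file.exists():
        return None
    return int(proc_file.read_text().strip())


def find_oom_lines(log_text: str) -> List[str]:
    """Return the log lines that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]
```

Feeding find_oom_lines() the text of the kernel log (for example the output of dmesg) around the crash time would show whether the OOM killer was involved at all.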
Quote:
I see these 'segfault' errors occasionally on some machines. Usually all app instances running there get the signal at the very same time, regardless of which source code line they are executing or which data they are processing, so this isn't a real programming error.
Same thing here. Server is running kernel...think you know.. ;)
Quote:
I suspected Linux's 'optimistic memory allocation' was responsible: it randomly kills processes when physical memory isn't enough for the memory it 'optimistically' assigned to them. But it's hard to believe that this is the case here.
Hm, anything related to 64bit os and 32bit compatibility libs maybe?
Quote:
We currently lose up to ~2000h of computing time per day due to this problem.
Oh, this is ugly. Any information about the distributions/kernels involved? I haven't seen this problem on my other hosts so far. The server runs OpenSuse 11.1 (64bit), my laptop runs OpenSuse 11.2 (32bit), and the old Athlon XP 3000 runs OpenSuse 10.3, like my development host (64bit). The former root server ran OpenSuse 10.3 (64bit/8GB) without segfaults. And there is still a small chance of CPU errors or memory failures. I hear about memory problems more and more, maybe a consequence of low profit for the manufacturers and higher integration.
But: why don't other apps (except for FF ;)) crash from time to time if this is a Linux problem? I really can't remember a fatal crash on one of my systems in the last few years.
And last but not least, could the problem be circumvented by restarting the program after it is killed by OOM? That would require changing the BOINC client or adding a wrapper program that calls/controls the science apps (overhead?). But I'm no C/C++ coder, so I might be way off. ;)
cu,
Michael
[Edit]'killed by OOM' should read as 'ended by out-of-memory killer'.
[Edit2]Last signal 11 on my X2 5000 with E@H:
2008-01-18 18:06:30 [Einstein@Home] Reason: Unrecoverable error for result h1_0762.95_S5R2__255_S5R3a_0 (process got signal 11)
Logfile started 2006 :)
Athlon XP 3000+ running 24/365:
Never ever any signal 11 since 14-May-2008(logging started)
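The wrapper idea raised in this post (relaunch a science app after it has been ended by a signal) could look roughly like the Python sketch below. It is an illustration only, not how the BOINC client or any real wrapper works: run_with_restart and its parameters are hypothetical, and it is POSIX-only, where subprocess reports a child ended by signal N as return code -N.

```python
import signal
import subprocess
from typing import Iterable, Sequence, Tuple


def run_with_restart(
    cmd: Sequence[str],
    max_restarts: int = 3,
    restart_on: Iterable[int] = (signal.SIGKILL, signal.SIGSEGV),
) -> Tuple[int, int]:
    """Run cmd and relaunch it if it was ended by one of the given signals.

    Returns (last return code, number of attempts). At most max_restarts
    relaunches are made after the first run.
    """
    # On POSIX, a child ended by signal N has returncode == -N.
    restart_codes = {-int(sig) for sig in restart_on}
    attempts = 0
    while True:
        attempts += 1
        returncode = subprocess.run(list(cmd)).returncode
        if returncode not in restart_codes or attempts > max_restarts:
            return returncode, attempts
```

The overhead of such a wrapper is small, since it only waits for the child process. Whether a restart helps depends on the app being able to resume from a checkpoint rather than starting over.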
Hi Gary, I have been working with the Windows 7 Task Manager, and it seems that when I am not using my computer I can punch up the running programs by changing the CPUs and raising the usage... I also bring BOINC to the front.
Since I switched from 9800GTX+ and 8500GT to GTX470 and now 480, no problems with CUDA anymore.
I don't know if it's accepted, but according to the card's GPU and memory load, it's possible to run 2 at a time?
(I run 3 SETI MB at a time, which gives a good load on the GPU: 99%, and 60% for the memory controller on its 384-bit bus.)
On my SuSE Linux 11.1 32-bit pae I can see the ABP2 graphics when I want, but not the S5GC1 graphics. Although I use it rarely because it takes a lot of CPU, I am wondering why.
Tullio
I don't know if this is the right thread, but I can't find a better one. Can anybody tell me about the update cycle of the webpages? I see a number of tasks that were finished and uploaded almost 24 hours ago but are still shown as "in progress". What's the reason for this delay?
Well, they are still in progress. But you'd be waiting on your 'wingman'. All work is duplicated (at least) to two different hosts, of which you are one in this case. When the other host returns its work, validation occurs, credit is awarded etc., and all being well the matter is settled. How long to wait? Well, that depends on the activity of the other host and/or other circumstances like missed deadlines, a possible re-issue to complete the quorum (2 validated results) and the like ...
Cheers, Mike.
(edit) One is always welcome to fire up a new thread if you judge there is no current suitable one ... :-)
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
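The wingman and quorum logic described in this reply can be pictured with a small Python sketch. It is a simplification under stated assumptions, not the project's actual validator: workunit_state, its state labels, and the idea of comparing results as plain strings are all made up for illustration. The only detail taken from the post is the quorum of 2 validated results.

```python
from collections import Counter
from typing import Optional, Sequence


def workunit_state(outputs: Sequence[Optional[str]], quorum: int = 2) -> str:
    """Classify a workunit from the results of the hosts it was sent to.

    Each entry is the returned result of one host, or None if that host
    has not returned its task yet.
    """
    returned = [out for out in outputs if out is not None]
    if not returned:
        return "in progress"
    # Count how many returned results agree with the most common one.
    _, agreeing = Counter(returned).most_common(1)[0]
    if agreeing >= quorum:
        return "validated"
    if len(returned) < len(outputs):
        # Still waiting on the wingman.
        return "in progress"
    # Everything came back but the results disagree: re-issue is needed.
    return "needs another task"
```

For example, workunit_state(["a", None]) stays "in progress" until the second host returns a matching result.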
Hello all!! Before I go out and get a loan for the new GTX 580, will it be usable to crunch? Is there a compatibility issue? Oh, I will get one regardless, unless there is a better card?
If and/or when it can crunch, I will post the results and/or report any bugs.
DO WHAT THOU WILT SHALL BE THE WHOLE OF THE LAW.
PROUD MEMBER OF THE CARL SAGAN TEAM.
Quote:
I don't know if it's accepted, but according to the card's GPU and memory load, it's possible to run 2 at a time?
You are in the wrong thread (CPU vs. GPU), but I think it's accepted anyway. ;-)
You'll just have to create an app_info.xml with the correct entries. There is at least one other thread with info about that.
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
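As a rough idea of what such an app_info.xml contains, the Python sketch below generates a minimal skeleton. The element names follow BOINC's anonymous-platform format as far as I recall it, and a coproc count of 0.5 is the usual way to ask for two tasks per GPU. Everything else is an assumption: build_app_info is a made-up helper, and the app name, executable name and version number are placeholders that must be replaced with the real values from the thread mentioned above.

```python
import xml.etree.ElementTree as ET


def build_app_info(app_name: str, executable: str, version_num: int, gpu_share: float) -> str:
    """Build a minimal app_info.xml skeleton as a string.

    gpu_share is the fraction of one GPU a task claims; 0.5 lets two
    tasks share one card. All names passed in are placeholders.
    """
    root = ET.Element("app_info")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name

    file_info = ET.SubElement(root, "file_info")
    ET.SubElement(file_info, "name").text = executable
    ET.SubElement(file_info, "executable")

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "version_num").text = str(version_num)

    coproc = ET.SubElement(version, "coproc")
    ET.SubElement(coproc, "type").text = "CUDA"
    ET.SubElement(coproc, "count").text = str(gpu_share)

    file_ref = ET.SubElement(version, "file_ref")
    ET.SubElement(file_ref, "file_name").text = executable
    ET.SubElement(file_ref, "main_program")

    return ET.tostring(root, encoding="unicode")
```

A real file usually lists additional files (libraries, for instance) as further file_info and file_ref entries, so treat this as a starting point only.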
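Quote:
I see a number of tasks that were finished and uploaded almost 24 hours ago but are still shown as "in progress". What's the reason for this delay?
There's no delay. Tasks are considered "in progress" until they are reported, which is a process separate from uploading.
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)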