High Failure rate on Line Veto
That same host seems to have a higher than usual recent error rate on Gravitational Wave S6 GC search v1.01 (see the recent error tasks for that host).
For reference, my five hosts between them currently display only one "error while computing", and all but one of them run Einstein 24/7 at more than 90%. I'd suggest you review the usual suspects: overclocking, overheating, software conflicts, marginal RAM, or slightly insufficient CPU voltage even without overclocking...
So far all of the Line Veto WUs have failed on my Linux machine. The stderr.txt looks as follows:
process exited with code 22 (0x16, -234)
execv: No such file or directory
The machine in question is Nostromo.
All other Einstein apps work flawlessly.
mickydl*
The "execv: No such file or directory" message usually points to a missing shared library required by the application.
However, the libraries required are identical for the S6Bucket and the S6LV1 (Linux) Apps:
These are, however, 32Bit Apps; there is no 64Bit Linux App for S6LV1 yet. The BOINC Client used to check whether the compatibility libs are present on a system before reporting to the server that it can run 32Bit Apps. From the errors we've seen so far, I suspect this check isn't working properly on (some) 7.0.x Clients.
Are you running a 64Bit system, and do you have the 32Bit compatibility libs installed?
BM
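To check the 32-bit question directly on a Linux host, one can read the app's ELF header: byte 4 (EI_CLASS) is 1 for a 32-bit and 2 for a 64-bit binary, and the PT_INTERP program header names the dynamic loader that has to be started first. According to the execve(2) man page, exec fails with ENOENT ("No such file or directory") when that ELF interpreter does not exist, which is what a 64-bit system without the 32-bit compatibility libs would typically look like. The Python sketch below is only an illustration, not a BOINC tool; elf_info and check_app are hypothetical names, and which app file to pass in depends on your BOINC data directory.

```python
import os
import struct

PT_INTERP = 3  # program header type that stores the dynamic loader path


def elf_info(data: bytes):
    """Return (bits, interpreter) for an ELF image held in memory.

    bits comes from e_ident[EI_CLASS]; interpreter is the PT_INTERP path,
    or None for a statically linked binary.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    end = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    if data[4] == 1:  # ELFCLASS32
        bits = 32
        (phoff,) = struct.unpack_from(end + "I", data, 28)
        phentsize, phnum = struct.unpack_from(end + "HH", data, 42)
    elif data[4] == 2:  # ELFCLASS64
        bits = 64
        (phoff,) = struct.unpack_from(end + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(end + "HH", data, 54)
    else:
        raise ValueError("unknown ELF class")
    # Walk the program headers looking for PT_INTERP.
    for i in range(phnum):
        base = phoff + i * phentsize
        (p_type,) = struct.unpack_from(end + "I", data, base)
        if p_type != PT_INTERP:
            continue
        if bits == 32:
            (offset,) = struct.unpack_from(end + "I", data, base + 4)
            (size,) = struct.unpack_from(end + "I", data, base + 16)
        else:
            (offset,) = struct.unpack_from(end + "Q", data, base + 8)
            (size,) = struct.unpack_from(end + "Q", data, base + 32)
        return bits, data[offset:offset + size].rstrip(b"\x00").decode()
    return bits, None


def check_app(path: str) -> list:
    """List problems that commonly make exec of `path` fail."""
    with open(path, "rb") as f:
        bits, interp = elf_info(f.read())
    problems = []
    if interp is not None and not os.path.exists(interp):
        problems.append(f"{bits}-bit loader {interp} is missing")
    if not os.access(path, os.X_OK):
        problems.append("execute permission is not set")
    return problems
```

An empty list from check_app means neither of these two causes applies; a missing shared library other than the loader would still need something like ldd to find.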
As for the original question: You seem to get "signal 8" (Floating-Point Exception, FPE) sporadically with both the old (S6Bucket) and the new (S6LV1) Apps, so I'd conclude that the timing of this error relative to the release of S6LV1 is more of a coincidence.
We get a couple of these FPEs from systems with fairly recent kernels. We are currently investigating one such issue that occurs in the BOINC API on recent Debian testing systems. It is still not clear what exactly happens there and how to avoid it. Does the occurrence of these errors relate in any way to a kernel update or change?
BM
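Signal 8 is SIGFPE on Linux. A parent process that reaps a child with waitpid() gets a raw status word that distinguishes death by a signal from an ordinary exit code, and Python's os module exposes the standard W* macros for decoding it. A minimal sketch; describe_wait_status is a hypothetical helper, not part of the BOINC API:

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Decode a raw waitpid() status into a readable string."""
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)  # low 7 bits; core-dump flag is ignored
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"terminated by signal {signum} ({name})"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return f"other status {status:#x}"
```

For example, a raw status of 8 decodes to "terminated by signal 8 (SIGFPE)", while a status of 22 << 8 decodes to a normal exit with code 22.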
Hi Bernd!
The machines with the highest error rates have only been in operation for a few months, and are both running 64-bit Fedora 16 with 32-bit libraries installed. The one with the highest error rate, the one I've already referenced, is running Linux kernel 3.2. The one with the second-highest error rate hasn't been updated in a while, and is still running Linux kernel 3.1. So, the problem doesn't seem to have anything to do with having the most recent kernel updates.
Here's a 64-bit Debian 6 machine which doesn't have quite as bad an error rate. It's running Debian stable, with Linux kernel 2.6.32.
I also have two 64-bit OpenSuSE 12.1 machines with Linux kernel 3.1, but their error rates for the Bucket workunits are quite low. (I haven't received any Line Veto workunits on them, so I can't say anything about those yet.) The error rates for all of my 64-bit Scientific Linux and CentOS machines are also quite low.
So, that makes me wonder, is there something strange with the way Debian and Fedora are compiling their kernels?
Edit--By the time you see this, the second Fedora machine will have been updated to Linux kernel 3.2. (Update is currently in progress.)
Hi Bernd,
Thanks for your response.
The machine has been running all Einstein applications for several months now, and I don't think I have ever had any errors, except when I trashed the WUs by doing something stupid here.
However, I may have found the problem. When checking the app, I noticed that the executable flag was not set. That might explain the execv failure. I have no idea why this happened. So far BOINC has always downloaded the application and made it executable without any intervention by me. Maybe a problem with the 7.0.18 client?
I am now waiting for new work to see if it works.
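If a downloaded app turns up without its executable flag, the usual manual fix is chmod +x on the file. The same check and fix from Python, as a small sketch with hypothetical function names; the app's path inside the BOINC data directory is not given here and has to be filled in by you:

```python
import os
import stat

# Execute permission for owner, group and others (what chmod +x adds).
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def is_executable(path: str) -> bool:
    """True if the owner execute bit is set on `path`."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def ensure_executable(path: str) -> bool:
    """Add the execute bits if the owner bit is missing; return True if changed."""
    mode = os.stat(path).st_mode
    if mode & stat.S_IXUSR:
        return False
    os.chmod(path, stat.S_IMODE(mode) | EXEC_BITS)
    return True
```

This only repairs the symptom; if the client keeps installing apps without the flag, the cause still has to be found on the client side.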
@Donald: Just a thought. There is an old kernel bug that crashes the application from time to time. From Bernd's description, it sounds like it could be this problem. To avoid this bug, make sure that you are using a NON-preemptive kernel.
regards,
mickydl*
Hi Micky!
Yeah, you're right, and that was my problem a couple of years ago. However, that doesn't seem to be the problem now, since I have the stock pre-emptive kernel running on all my machines, but only a few are giving me problems.
But then, who knows? When I get time, I might compile my own kernel for one of the problem-children, just to see what happens.
Okay, compilation of a non-preemptive kernel is in progress on one Fedora 16 machine. We'll see if that fixes the problem.
(As you might know, Fedora doesn't offer any pre-built non-preemptive kernels in its repository.)
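To see which preemption model a stock or home-brew kernel was built with, one can look at its build configuration, which Fedora and Debian typically install as /boot/config-<release>. The model is a build-time choice between CONFIG_PREEMPT_NONE ("No Forced Preemption (Server)"), CONFIG_PREEMPT_VOLUNTARY ("Voluntary Kernel Preemption (Desktop)") and CONFIG_PREEMPT ("Preemptible Kernel (Low-Latency Desktop)"). A small Python sketch, assuming that file location; the function names are hypothetical, and on newer kernels built with CONFIG_PREEMPT_DYNAMIC the model can also be overridden at boot, so the result is only the build-time default:

```python
import os

# Classic preemption models; exactly one is selected at kernel build time.
MODELS = {
    "CONFIG_PREEMPT": "preemptible (low-latency desktop)",
    "CONFIG_PREEMPT_VOLUNTARY": "voluntary preemption (desktop)",
    "CONFIG_PREEMPT_NONE": "no forced preemption (server)",
}


def preemption_model(config_text: str):
    """Return the selected CONFIG_PREEMPT_* option from a kernel .config, or None."""
    enabled = {
        line.strip()[:-2]
        for line in config_text.splitlines()
        if line.strip().endswith("=y")
    }
    for option in ("CONFIG_PREEMPT", "CONFIG_PREEMPT_VOLUNTARY", "CONFIG_PREEMPT_NONE"):
        if option in enabled:
            return option
    return None


def running_kernel_model(boot_dir: str = "/boot"):
    """Read the config of the running kernel (assumed at /boot/config-<release>)."""
    path = os.path.join(boot_dir, "config-" + os.uname().release)
    with open(path) as f:
        return preemption_model(f.read())
```

When compiling your own kernel, the same options appear under "Preemption Model" in make menuconfig.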
I've just rebooted on my new home-brew kernel. I'll watch it over the next few days to see if things improve. If so, I'll compile a kernel for the other machines, as well.
So far, so good with the Bucket work-units. But, I'm still getting validate errors with the Gamma-Ray ones, so there's obviously another problem with them.