On my machines I am getting SIGABRT in every case where a machine is using core_client_5.4.9 and a WU '...S5R2c_1', but not in other cases. I have three machines running core_client_5.8.15, two AMD and one Intel, and so far these machines are completing these WUs successfully. All of my machines running 5.4.9 are getting the SIGABRT errors described.
Here's one more (not mine). He had no problems with S5RIa, but every S5R2c gives him:
5.4.9
Couldn't start or resume: -108
This one gets a totally different error with S5R2c:
5.2.13
process exited with code 2 (0x2)
2007-04-20 20:28:51 [Einstein@Home] execv(../../projects/einstein.phys.uwm.edu/einstein_S5R2_4.14_i686-pc-linux-gnu) failed: error -1
execv: No such file or directory
Those totally different errors look very much like a pointer or array size problem. Too many open files could cause something like that too.
Its status shows 2 successful results (a quorum) pending validation, plus a 3rd with client error (hence posting in this thread), plus a 4th 'unsent'. The WU appears in my Results list as 'pending' but doesn't show up in my Pending Credit list at all (hence the limbo).
No change in validation status for over 24 hours. No indication what's keeping it from validating with a quorum, though I notice that Server Status doesn't yet show the S5R2 validators/assimilators either up OR down.
It's not about the credit; I just hate to see 30-plus hours of CPU time go down the toilet, since I have other project mouths to feed. With the larger workunits, time invested in a single result that won't validate becomes a concern.
No indication what's keeping it from validating with a quorum, though I notice that Server Status doesn't yet show the S5R2 validators/assimilators either up OR down.
The validator has looked, and does not like what it sees enough to accept the result. If you look at the end of the result detail, this shows as:
Checked, but no consensus yet
So another result is already prepared to send out to another host. When that goes out and comes back, if it is sufficiently similar to one of the two already returned, it will decide the issue of who is right.
I'll speculate that one of the hosts suffered a computational error, but not of the sort which generates an illegal memory access or anything else caught by run-time checking in the application.
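To make the "no consensus yet" state concrete, here is a rough sketch in C of how a quorum check of this kind could work. It is not the actual Einstein@Home validator: each result is reduced to a single number for illustration, and the tolerance, `results_agree`, and `find_canonical` are all hypothetical names and values.

```c
#include <stddef.h>

/* Hypothetical tolerance; the real validator's criteria are not stated in this thread. */
#define REL_TOLERANCE 1e-5

static double abs_d(double x) { return x < 0 ? -x : x; }

/* Two results "agree" if they match within a relative tolerance. */
static int results_agree(double a, double b) {
    double scale = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    if (scale == 0.0) return 1;              /* both exactly zero */
    return abs_d(a - b) <= REL_TOLERANCE * scale;
}

/* Returns the index of the first result that, counting itself, belongs to a
   group of at least `quorum` agreeing results, or -1 if there is no
   consensus yet. */
int find_canonical(const double *results, size_t n, size_t quorum) {
    for (size_t i = 0; i < n; i++) {
        size_t matches = 1;                  /* a result agrees with itself */
        for (size_t j = 0; j < n; j++) {
            if (j != i && results_agree(results[i], results[j])) matches++;
        }
        if (matches >= quorum) return (int)i;
    }
    return -1;
}
```

With two returned results that disagree, `find_canonical` returns -1, which corresponds to "Checked, but no consensus yet". Once a third result comes back and matches one of the first two, a quorum of 2 is reached and the odd one out is the result that fails to validate.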
So far every one has errored out on one of my machines. It says client error, and when I look into the work units it has 'invalid function' logged in it.
I have about 35 hosts, maybe more
almost none of them have received credit in the last week
NONE of them are overclocked, almost all are linux servers at different companies
for example, my dual Xeon at home CANNOT be overclocked (HPxw6000)
(BIOSes for dual procs don't have such options, servers are made for stability)
but I haven't received credit since 22 Apr
more than 800 credit lost, and that's only on 1 host!
ALL other projects (seti, climate, predictor) are OK on ALL hosts
for me the decision is simple:
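#include <stdlib.h>
int ResourceShare = 1;
int veryFewClientErrors = 0;
int bugMessages = 1;
long date = 1; /* in days; long, not int, in case it takes more than 100 years: 365*100 > 32767 */
const long theEndOfTheUniverse = 365L * 100; /* placeholder value */
/* placeholders for what I see when I check the project each week */
int sawVeryFewErrors(void) { return 0; }
int sawFewBugMessages(void) { return 0; }
void increaseBack(int *share) { (*share)++; }
void checkNextWeek(void);
int main(void) {
    while (date < theEndOfTheUniverse) {
        checkNextWeek();
        if (veryFewClientErrors && !bugMessages) {
            increaseBack(&ResourceShare);
            exit(EXIT_SUCCESS);
        } // end if
    } // end while
    return 0;
} // end main
void checkNextWeek(void) {
    if (sawVeryFewErrors()) veryFewClientErrors = 1;
    if (sawFewBugMessages()) bugMessages = 0;
    date += 7;
} // end check
// EOF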
It seems that every WU that was interrupted/resumed gets a compute error with signal 11/SIGABRT on a Linux machine.
Example: http://einsteinathome.org/task/83757575, and this host has more like it.
It's a pity, as I have to restart them quite often...
On my machines I am getting SIGABRT in every case where a machine is using core_client_5.4.9 and a WU '...S5R2c_1', but not in other cases. I have three machines running core_client_5.8.15, two AMD and one Intel, and so far these machines are completing these WUs successfully. All of my machines running 5.4.9 are getting the SIGABRT errors described.
Jim Howe
Zhuhai, China and Portland, Oregon
Here's one more (not mine). He had no problems with S5RIa, but every S5R2c gives him:
Couldn't start or resume: -108
This one gets a totally different error with S5R2c:
2007-04-20 20:28:51 [Einstein@Home] execv(../../projects/einstein.phys.uwm.edu/einstein_S5R2_4.14_i686-pc-linux-gnu) failed: error -1
execv: No such file or directory
Those totally different errors look very much like a pointer or array size problem. Too many open files could cause something like that too.
Since switching over I've lost 730000 CPU secs (280 credits) and have yet to see one reach a successful conclusion :o((
The Barsteward
Beer is proof that God loves us and wants us to be happy.
--Benjamin Franklin
Another 60000 secs lost for no apparent reason, so I will not transfer WUs until it seems to be resolved.
The Barsteward
Got a new one @ CPU time = 22089 :
- Unhandled Exception Record -
Reason: Privileged Instruction (0xc0000096) at address 0x0044FC48
http://einsteinathome.org/workunit/33428928
The other result in that WU is nice too :
5.2.13
Maximum disk usage exceeded
Nice to see that Bruce Allen still uses 5.2.13 too :-)
Signal 11 on Bruce's box as well, btw: resultid=83588082
One of my S5R2 workunits: http://einsteinathome.org/workunit/33355866 looks to be stuck in database limbo.
Its status shows 2 successful results (a quorum) pending validation, plus a 3rd with client error (hence posting in this thread), plus a 4th 'unsent'. The WU appears in my Results list as 'pending' but doesn't show up in my Pending Credit list at all (hence the limbo).
No change in validation status for over 24 hours. No indication what's keeping it from validating with a quorum, though I notice that Server Status doesn't yet show the S5R2 validators/assimilators either up OR down.
It's not about the credit; I just hate to see 30-plus hours of CPU time go down the toilet, since I have other project mouths to feed. With the larger workunits, time invested in a single result that won't validate becomes a concern.
B/W
Best wishes :)
The validator has looked, and does not like what it sees enough to accept the result. If you look at the end of the result detail, this shows as:
Checked, but no consensus yet
So another result is already prepared to send out to another host. When that goes out and comes back, if it is sufficiently similar to one of the two already returned, it will decide the issue of who is right.
I'll speculate that one of the hosts suffered a computational error, but not of the sort which generates an illegal memory access or anything else caught by run-time checking in the application.
So far every one has errored out on one of my machines. It says client error, and when I look into the work units it has 'invalid function' logged in it.
Machine with errors
You like Myst? Uru Live returns! www.urulive.com
I have about 35 hosts, maybe more
almost none of them have received credit in the last week
NONE of them are overclocked, almost all are linux servers at different companies
for example, my dual Xeon at home CANNOT be overclocked (HPxw6000)
(BIOSes for dual procs don't have such options, servers are made for stability)
but I haven't received credit since 22 Apr
results
more than 800 credit lost, and that's only on 1 host!
ALL other projects (seti, climate, predictor) are OK on ALL hosts
for me the decision is simple:
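#include <stdlib.h>
int ResourceShare = 1;
int veryFewClientErrors = 0;
int bugMessages = 1;
long date = 1; /* in days; long, not int, in case it takes more than 100 years: 365*100 > 32767 */
const long theEndOfTheUniverse = 365L * 100; /* placeholder value */
/* placeholders for what I see when I check the project each week */
int sawVeryFewErrors(void) { return 0; }
int sawFewBugMessages(void) { return 0; }
void increaseBack(int *share) { (*share)++; }
void checkNextWeek(void);
int main(void) {
    while (date < theEndOfTheUniverse) {
        checkNextWeek();
        if (veryFewClientErrors && !bugMessages) {
            increaseBack(&ResourceShare);
            exit(EXIT_SUCCESS);
        } // end if
    } // end while
    return 0;
} // end main
void checkNextWeek(void) {
    if (sawVeryFewErrors()) veryFewClientErrors = 1;
    if (sawFewBugMessages()) bugMessages = 0;
    date += 7;
} // end check
// EOF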
Have a nice (and successful bug-hunting) week.
It seems that every WU that was interrupted/resumed gets a compute error with signal 11/SIGABRT on a Linux machine.
Example: http://einsteinathome.org/task/83757575, and this host has more like it.
It's a pity, as I have to restart them quite often...