Recently, one of my hosts (running Einstein S5R4 6.02 Linux SSE) has been suffering from repeated errors, exiting several hours into execution with a signal 4. My understanding is that signal 4 on Linux is SIGILL, or illegal instruction. In case the results for this host have been wiped from the server, below is the stack trace from stderr.txt, which has the same pattern for all (at least three so far) work-units which have ended in signal 4:
APP DEBUG: Application caught signal 4.
Obtained 6 stack frames for this thread.
Use gdb command: 'info line *0xADDRESS' to print corresponding line numbers.
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_1[0x8069513]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_1[0x80631fc]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_1[0x805d789]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_1[0x806ad57]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0)[0xb7d7c450]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_1(shmat+0x55)[0x804b5c1]
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 211 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R4_5.08_1/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 313 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R4_5.08_1/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED
called boinc_finish
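As a quick way to confirm what a signal number in such a trace means on the machine at hand, here is a minimal sketch using Python's standard `signal` module. The helper name `signal_name` is only illustrative and is not part of BOINC or the Einstein application.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known to this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # Signal number reported in the trace above.
    print(4, signal_name(4))
```

The mapping is taken from the platform the script runs on, so it should be run on the affected Linux host (or a comparable one).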
The Einstein S5R4 6.02 SSE executable was downloaded automatically by BOINC and has the MD5 checksum: 6b0e4078b69908ca2d820e94aa57b6c8
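For anyone wanting to compare their own copy of the executable against that checksum, the following is a rough sketch using Python's standard `hashlib`. The function name `md5_of_file` is hypothetical, and the path to the executable has to be supplied on the command line, since it depends on where BOINC is installed.

```python
import hashlib
import sys

# Checksum quoted in the post above.
EXPECTED_MD5 = "6b0e4078b69908ca2d820e94aa57b6c8"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Usage: python check_md5.py /path/to/executable
    actual = md5_of_file(sys.argv[1])
    print(actual, "matches" if actual == EXPECTED_MD5 else "differs")
```

A mismatch would point to a damaged download or disk rather than a CPU problem; a match only rules that one cause out.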
I am willing to accept that the age of the hardware might be a factor; however, I have several older machines which have been used much more heavily without fault. At present, this particular host does not appear to suffer from any problems other than failing Einstein work-units. I could not find anyone with a similar problem after a cursory examination of the forums, so I may be the only one affected. Does anyone have any idea why Einstein keeps failing on this particular host? It is not overclocked at all.
Soli Deo Gloria
Copyright © 2024 Einstein@Home. All rights reserved.
Repeated Signal 4 Errors - Please Help
In my experience it is most likely to be a hardware problem of some sort.
1. Have you checked the motherboard for any sign of swollen capacitors?
2. Have you tried swapping the RAM with another machine?
3. Have you tried swapping the PSU?
4. Since you have another AMD Sempron box, I'd be tempted to swap the hard disk into the other box temporarily and boot it up there. If it starts producing good results, you may have a flaky motherboard in the original box.
Good luck with it.
Cheers,
Gary.
I had it checked whether we are getting any other "signal 4" errors on i386 Linux at the moment: negative. Your Sempron box seems to be the only one currently generating this particular error code, which is another reason to follow Gary's advice, because it's most likely a hardware problem.
CU
Bikeman
Okay, thanks for the information.
I forgot to mention that the host has no problems with SETI workunits (on which it spends most of its time), but I suppose Einstein may use specific instructions which trigger something in the hardware.
Also, the CPU may read as "Sempron", but it's actually an Athlon XP which has been rebranded. In fact, both of my Sempron boxes are rebadged Athlon XPs (Barton cores)...
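Since the failing application is the SSE build, one thing that can be checked from software is which instruction-set flags the kernel reports for this CPU. The sketch below is only an illustration: it assumes an x86 Linux host, where `/proc/cpuinfo` contains a `flags` line, and the function name `cpu_flags` is made up for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (for example on a non-x86 system).
    return set()


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:
        flags = cpu_flags(handle.read())
    for name in ("mmx", "sse", "sse2"):
        print(name, name in flags)
```

If `sse` is reported, a missing instruction set is an unlikely explanation for the signal 4, which would be consistent with the hardware suspicion raised earlier in the thread.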
As far as I can tell, there are no swollen capacitors and the PSU is a non-generic AcBel (I have an identical one serving in a very similar host (CPU, motherboard, etc) without any problems). memtest86+ doesn't detect any faults, although I realise that's no guarantee of working memory anyway.
The system is old enough that if the hardware really is at fault, I won't bother replacing it - the system is just acting as a local web server and database at the moment anyway. But if Einstein continues to fail while SETI keeps on going, I may just stop obtaining Einstein work-units for it, which will be a shame.
Soli Deo Gloria
You could place a file "CPU_TYPE_0" in your BOINC directory. This will tell the switcher to use the generic App.
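A minimal sketch of creating that file from Python follows. It assumes that an empty file with exactly this name is sufficient (the post does not say whether the contents matter) and that the BOINC directory is passed in by the user; the function name `create_cpu_type_override` is purely illustrative.

```python
from pathlib import Path


def create_cpu_type_override(boinc_dir: str) -> Path:
    """Create an empty CPU_TYPE_0 file in the given BOINC directory.

    Assumption: an empty file is enough for the app switcher to notice it.
    """
    marker = Path(boinc_dir) / "CPU_TYPE_0"
    marker.touch(exist_ok=True)
    return marker


if __name__ == "__main__":
    import sys

    # Usage: python make_override.py /path/to/BOINC
    print("created", create_cpu_type_override(sys.argv[1]))
```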
BM
Alas, even the generic application failed. It took a while before it did, though - must be something funny with this particular host. I am still puzzled as to why everything else appears to work fine, but that's the way things go, I suppose.
process exited with code 41 (0x29, -215)
Overridden CPU type 0
2008-11-20 03:25:01.7333 [normal]: This program is published under the GNU General Public License, version 2
2008-11-20 03:25:01.7334 [normal]: For details see http://einstein.phys.uwm.edu/license.php
2008-11-20 03:25:01.7334 [normal]: This Einstein@home App was built at: Jul 27 2008 15:01:05
2008-11-20 03:25:01.7334 [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0'.
Unrecognized XML in parse_init_data_file: computation_deadline
Skipping: 1228616110.000000
Skipping: /computation_deadline
2008-11-20 03:25:01.7900 [debug]: Set up communication with graphics process.
2008-11-20 03:25:01.9344 [debug]: Reading SFTs and setting up stacks ... done
2008-11-20 03:25:15.0319 [normal]: INFO: Couldn't open checkpoint h1_0685.70_S5R4__277_S5R4a_2_0.cpt
2008-11-20 03:25:15.0320 [debug]: Total skypoints = 840. Progress: 0,
$Revision: 1.129 $ REV:$Revision, OPT:4, SCVAR:9, SCTRIM:2, HOTVAR:0, HOTDIV:0, HGHPRE:0, HGHBAT:2
c
1, c
2, c
3, c
190, c
191,
APP DEBUG: Application caught signal 11.
Obtained 9 stack frames for this thread.
Use gdb command: 'info line *0xADDRESS' to print corresponding line numbers.
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0[0x80695be]
/lib/tls/i686/cmov/libc.so.6(__libc_malloc+0x23)[0xb7e8e7f3]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0[0x809ddeb]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0[0x80c959b]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0[0x80629a5]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0[0x805d88c]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0[0x806ae17]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe5)[0xb7e31685]
../../projects/einstein.phys.uwm.edu/einstein_S5R4_6.02_i686-pc-linux-gnu_0(shmat+0x45)[0x804b5b1]
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 209 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R4_5.08_0/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 311 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R4_5.08_0/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LALGetMultiSSBtimes at line 1830 of file ComputeFstat.c
LALGetSSBtimes at line 1808 of file ComputeFstat.c
At lowest level status code = 0, description: NO LAL ERROR REGISTERED
called boinc_finish
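The trace suggests using gdb's `info line *0xADDRESS` on each frame. As a rough sketch, the addresses belonging to the application binary can be pulled out of a pasted trace and turned into those commands as shown below. The function name `gdb_info_line_commands` and the `module_hint` parameter are hypothetical helpers for this example, not part of BOINC or gdb.

```python
import re

# Matches frames such as "path/to/binary[0x80695be]" or
# "path/to/lib.so(symbol+0x23)[0xb7e8e7f3]".
FRAME = re.compile(
    r"^(?P<module>[^\[(]+)(?:\([^)]*\))?\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"
)


def gdb_info_line_commands(trace: str, module_hint: str) -> list[str]:
    """Build 'info line' commands for frames whose module path contains module_hint."""
    commands = []
    for line in trace.splitlines():
        match = FRAME.match(line.strip())
        # Skip non-frame lines and frames from other modules such as libc.
        if match and module_hint in match.group("module"):
            commands.append(f"info line *{match.group('address')}")
    return commands


if __name__ == "__main__":
    import sys

    # Usage: python frames.py einstein_S5R4 < stderr.txt
    for command in gdb_info_line_commands(sys.stdin.read(), sys.argv[1]):
        print(command)
```

The resulting commands are meant to be entered in a gdb session with the same executable loaded; they will only yield source line numbers if that binary carries debug information.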
Soli Deo Gloria