Lots of FPU errors

next_ghost
Joined: 25 Mar 05
Posts: 12
Credit: 246383
RAC: 0
Topic 193462

I have been getting tons of compute errors since around December 5, 2007 (my guess from the BoincStats credit graph). Since then, all work units but one have ended up like this:

5.10.28

process exited with code 38 (0x26, -218)

2008-01-23 15:41:35.6203 [normal]: Built at: Nov 29 2007 15:00:43

2008-01-23 15:41:35.6205 [normal]: Start of BOINC application 'einstein_S5R3_4.20_i686-pc-linux-gnu'.
2008-01-23 15:41:36.0558 [debug]: Reading SFTs and setting up stacks ... done
2008-01-23 15:45:31.6187 [normal]: INFO: Couldn't open checkpoint h1_0749.05_S5R2__89_S5R3a_1_0.cpt
2008-01-23 15:45:31.6288 [debug]: Total skypoints = 1199. Progress: 0,
$Revision: 1.80 $ OPT:0 SCV:2, SCTRIM:8
c
1,
APP DEBUG: Application caught signal 8.

FPU status word ffff98c1, flags: ERR_SUMM STACK_FAULT INVALID
Obtained 7 stack frames for this thread.
Use gdb command: 'info line *0xADDRESS' to print corresponding line numbers.
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a4b9e]
einstein_S5R3_4.20_i686-pc-linux-gnu(LocalComputeFStatFreqBand+0x1b33)[0x80ad153]
einstein_S5R3_4.20_i686-pc-linux-gnu(MAIN+0x352d)[0x80a495d]
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a5b34]
../../projects/einstein.phys.uwm.edu/einstein_S5R3_4.20_i686-pc-linux-gnu.so(_Z6foobarPv+0x14)[0xb7ce6e24]
/lib/libpthread.so.0[0xb7ec918b]
/lib/libc.so.6(clone+0x5e)[0xb7e5314e]
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED


I use Gentoo Linux. Einstein@Home stopped working on BOINC 5.8.15 (which had worked fine for several months), and updating to 5.10.28 didn't help. All other projects (SETI and CPDN) work fine.
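For anyone reading the log above: the "FPU status word" line can be decoded by hand. Below is a minimal Python sketch, assuming the low 16 bits of the printed value are the standard x87 status word and the upper bits are padding. Only ERR_SUMM, STACK_FAULT and INVALID are names taken from the log; the other flag names and the function name are my own labels.

```python
# Exception flags in the low byte of the 16-bit x87 FPU status word.
# Only INVALID, STACK_FAULT and ERR_SUMM appear in the log; the other
# names are illustrative labels chosen for this sketch.
X87_FLAGS = [
    (0, "INVALID"),      # IE: invalid operation
    (1, "DENORMAL"),     # DE: denormalized operand
    (2, "ZERO_DIVIDE"),  # ZE: divide by zero
    (3, "OVERFLOW"),     # OE
    (4, "UNDERFLOW"),    # UE
    (5, "PRECISION"),    # PE: inexact result
    (6, "STACK_FAULT"),  # SF: x87 register stack overflow/underflow
    (7, "ERR_SUMM"),     # ES: at least one unmasked exception is pending
]


def decode_x87_status(word):
    """Decode the low 16 bits of `word` as an x87 status word."""
    sw = word & 0xFFFF  # assumption: upper bits are not part of the status word
    flags = [name for bit, name in X87_FLAGS if sw & (1 << bit)]
    top = (sw >> 11) & 0x7       # TOP: current top-of-stack register index
    busy = bool(sw & 0x8000)     # B: busy bit
    return {"flags": flags, "top": top, "busy": busy}


status = decode_x87_status(0xFFFF98C1)  # value from the log above
print(status["flags"])  # ['INVALID', 'STACK_FAULT', 'ERR_SUMM']
```

The decoded flags match the three names the application printed, which suggests the crash was an invalid-operation exception combined with an x87 stack fault rather than, say, a divide by zero.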

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 804838877
RAC: 1239143


Hi!

What's striking is that this PC is running extremely slowly!

Quote:

2008-01-23 15:41:35.6205 [normal]: Start of BOINC application 'einstein_S5R3_4.20_i686-pc-linux-gnu'.
2008-01-23 15:41:36.0558 [debug]: Reading SFTs and setting up stacks ... done
2008-01-23 15:45:31.6187 [normal]:

This shows it takes almost 4 minutes (!) to just read in the input files, something that should take only 10-20 seconds on your hardware. Either there is a process running on this machine that takes almost 100% of the CPU and slows the E@H app down to a crawl, or the CPU is throttled down because of a problem with cooling, which would also explain the computation errors. There's definitely something wrong with this PC.
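The "almost 4 minutes" figure comes straight from the timestamps in the quoted log lines. A small Python sketch of that arithmetic, assuming each log line starts with a date and a time carrying four fractional digits (the helper name `parse_stamp` is made up for this example):

```python
from datetime import datetime


def parse_stamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffff' timestamp of a log line."""
    date_part, time_part = line.split()[:2]
    # %f accepts one to six fractional digits, so '.6187' is read as 0.6187 s
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S.%f")


read_start = "2008-01-23 15:41:36.0558 [debug]: Reading SFTs and setting up stacks ... done"
next_line = "2008-01-23 15:45:31.6187 [normal]:"

elapsed = (parse_stamp(next_line) - parse_stamp(read_start)).total_seconds()
print(f"{elapsed:.1f} s")  # 235.6 s, i.e. just under 4 minutes
```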

Bikeman

next_ghost


Message 77686 in response to message 77685

Quote:
This shows it takes almost 4 minutes (!) to just read in the input files, something that should take only 10-20 seconds on your hardware. Either there is a process running on this machine that takes almost 100% of the CPU and slows the E@H app down to a crawl, or the CPU is throttled down because of a problem with cooling, which would also explain the computation errors. There's definitely something wrong with this PC.

It's a laptop so the HDD transfer rates suck. And I think I was running an update at that time which does take 100% of both CPU cores on Gentoo. If you take a look at the long list of failed tasks in my profile, you'll find out that most of them read input files in 20-30 seconds. Temperature is not a problem according to ACPI.

Bikeman (Heinz-Bernd Eggenstein)


Hi!

So far the wingmen (even those using the same Linux app) do not get those errors, so I really think this is specific to your PC, either corruption of files or hardware failure.

Quote:

All other projects (SETI and CPDN) work fine.

Please note that the same host did produce client errors on SETI@Home and CPDN recently, so I really do think that PC needs repair.

http://setiathome.berkeley.edu/results.php?hostid=2720482
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=818431

Sorry I have no better news,

Bikeman

next_ghost


Message 77688 in response to message 77687

Quote:
So far the wingmen (even those using the same Linux app) do not get those errors, so I really think this is specific to your PC, either corruption of files or hardware failure.

Yes, it may be specific to my PC, but it's *NOT* a hardware failure. The problem must be somewhere between the Einstein@Home client and the system libraries or kernel.

I'm using kernel 2.6.23 since December 17, I used 2.6.22 before. Nothing else was changed around the time of first failures. Were there any Einstein@Home client updates around December 15?

Quote:

Please note that the same host did produce client errors on SETI@Home and CPDN recently, so I really do think that PC needs repair.

http://setiathome.berkeley.edu/results.php?hostid=2720482

2 segfaults out of 13 reported work units. Most likely a minor SETI client bug. 5 results were correct; the rest are pending validation.

As for CPDN: those were broken work units which have crashed on all computers so far, see for yourself. Another work unit has been running for 124 hours and the results are correct so far.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5887
Credit: 119157334060
RAC: 24594457


Message 77689 in response to message 77688

Quote:

I'm using kernel 2.6.23 since December 17, I used 2.6.22 before. Nothing else was changed around the time of first failures. Were there any Einstein@Home client updates around December 15?

You are currently using version 4.20 of the science app which was made the official app around December 03. There have been bugfixes recently and the current beta (4.27) is much improved. It would be a very good idea for you to give the 4.27 beta app a spin to see if your problems are solved. It's quite easy to do this and you should check out the beta test page for details about the test app and to download the test package.

To run the beta app, all you need to do is:

    * Download and extract the contents of the zip package
    * Completely stop BOINC
    * Copy the package contents into your Einstein project directory, overwriting any existing app_info.xml file that might be there
    * Restart BOINC

If there is an "in-progress" task, the new app will pick up from where the old app left off - nothing is lost. It might appear that the old app is still being used but that is not so. New tasks downloaded will be "branded" with the version number of the new app. You should get a nice little performance boost as well.
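The copy step in the list above can be sketched in Python as follows. This is only an illustration: the function name and both directory paths are hypothetical, and stopping and restarting BOINC is deliberately left out because the right command depends on how BOINC was installed.

```python
import shutil
from pathlib import Path


def install_app_package(package_dir, project_dir):
    """Copy every file from the extracted package into the project directory.

    Files with the same name (including an existing app_info.xml) are
    overwritten. BOINC must already be stopped when this runs.
    """
    package_dir, project_dir = Path(package_dir), Path(project_dir)
    copied = []
    for src in sorted(package_dir.iterdir()):
        if src.is_file():
            # copy2 keeps permission bits, so the app stays executable
            shutil.copy2(src, project_dir / src.name)
            copied.append(src.name)
    return copied


# Example call (both paths are placeholders, adjust to your installation):
# install_app_package("/tmp/extracted_package",
#                     "/path/to/BOINC/projects/einstein.phys.uwm.edu")
```

Listing the returned names afterwards is a quick way to confirm that app_info.xml and the new executable actually landed in the project directory before restarting BOINC.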

If this doesn't solve your problems then I'm afraid I'd have to agree that you are very likely to have a hardware issue, despite your confidence that you don't. Also please realise that different projects put different stresses on different parts of your system so that it's not impossible to see just one project falling over and perhaps not the others.

Cheers,
Gary.

next_ghost


4.24 crashes as well, rebooting to the old kernel (on which Einstein@Home worked fine for over a month) didn't help either.

The problem always results in one of only two error messages:

Stack trace of LAL functions in worker thread:
GetSemiCohToplist at line 3173 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/HierarchicalSearch.c
At lowest level status code = 0, description: NO LAL ERROR REGISTERED

or

Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED

These errors happen in both 4.20 and 4.24.

next_ghost


And it appears that my PC has finished only 1 workunit since December 5 (or it might have been a workunit finished before December 5 which was pending credit for a long time), which pretty much overlaps with the official 4.20 release.

Bikeman (Heinz-Bernd Eggenstein)


Message 77692 in response to message 77691

Quote:
And it appears that my PC has finished only 1 workunit since December 5 (or it might have been a workunit finished before December 5 which was pending credit for a long time), which pretty much overlaps with the official 4.20 release.

I really don't think this is related to 4.20, but to double check, you could actually re-install the version you were using prior to 4.20 and see if it crashes.

Version 4.17 (in the beta package with app_info.xml) can still be downloaded from

http://einstein.phys.uwm.edu/app_test/linux/einstein_S5R1_4.17_i686-pc-linux-gnu.tar.gz

If this one crashes as well, but worked before, I guess you should be convinced that there might be something wrong with that CPU of yours.

CU

Bikeman

next_ghost


Scheduler won't give me any work units for 4.17.

Gary Roberts


Message 77694 in response to message 77693

Quote:
Scheduler won't give me any work units for 4.17.

Did you modify your app_info.xml file?

You probably need to add clauses so that the more recent data versions like 4.20 and 4.24 are allowed to be processed by the older 4.17 app. Open the file with a text editor and see how different task versions are handled.
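To see at a glance which task versions an app_info.xml maps to which executable, something like the following Python sketch may help. The sample XML is an assumption for illustration only: the app and file names are made up, and the version numbers follow the usual BOINC convention of 417 for 4.17, 420 for 4.20 and so on. Check the structure against your own file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_info.xml: three <app_version> clauses that all point
# at the same (made-up) 4.17 executable.
SAMPLE_APP_INFO = """
<app_info>
  <app><name>example_app</name></app>
  <file_info><name>example_app_4.17</name><executable/></file_info>
  <app_version>
    <app_name>example_app</app_name>
    <version_num>417</version_num>
    <file_ref><file_name>example_app_4.17</file_name><main_program/></file_ref>
  </app_version>
  <app_version>
    <app_name>example_app</app_name>
    <version_num>420</version_num>
    <file_ref><file_name>example_app_4.17</file_name><main_program/></file_ref>
  </app_version>
  <app_version>
    <app_name>example_app</app_name>
    <version_num>424</version_num>
    <file_ref><file_name>example_app_4.17</file_name><main_program/></file_ref>
  </app_version>
</app_info>
"""


def versions_handled(xml_text):
    """Map each <version_num> to the file marked as its main program."""
    root = ET.fromstring(xml_text.strip())
    handled = {}
    for app_version in root.findall("app_version"):
        version = int(app_version.findtext("version_num"))
        for ref in app_version.findall("file_ref"):
            if ref.find("main_program") is not None:
                handled[version] = ref.findtext("file_name")
    return handled


for version, program in sorted(versions_handled(SAMPLE_APP_INFO).items()):
    print(version, "->", program)
```

With a real file, read it with `Path("app_info.xml").read_text()` and pass the text in; a task version missing from the result has no clause yet.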

What version is shown against any tasks you have showing on your BOINC Manager Tasks tab?

One other point. I posted previously and suggested using the 4.27 version app. I was careless and called it a beta app whereas it's really called a "power user" app - see the appropriate sticky thread. I've got it installed on at least 20 different machines and it's working without issue for me. It's supposed to be the 4.24 code base but maybe there are changes that might make a difference to your machine. It's worth a try. You could use the 4.27 app_info.xml unchanged if you had either 4.20 or 4.24 "branded" tasks in your tasks list.

Cheers,
Gary.
