Crash of einstein_S5R5_1.01_i686-pc-linux-gnu_2

kimmerin
kimmerin
Joined: 29 Sep 08
Posts: 16
Credit: 11090767
RAC: 0
Topic 194225

On one of my boxes, the einstein-client is crashing all the time. Here is the link to that computer:
http://einsteinathome.org/host/1864884/tasks

The error message is always the same:

*** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x08241060 ***

There have been four crashes (out of six WUs processed so far): two of them crashed while calculating part 99, one after writing the output (i.e. after actually finishing the calculation), and one quite early, while working on part 16.

The box is a standard SuSE installation, so there are no exotic libraries that might cause problems. Maybe there is a bug in the client that could have the same effect on other machines.

tullio
tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

Crash of einstein_S5R5_1.01_i686-pc-linux-gnu_2

Quote:

On one of my boxes, the einstein-client is crashing all the time. Here is the link to that computer:
http://einsteinathome.org/host/1864884/tasks

The error message is always the same:

*** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x08241060 ***

There have been four crashes (out of six WUs processed so far): two of them crashed while calculating part 99, one after writing the output (i.e. after actually finishing the calculation), and one quite early, while working on part 16.

The box is a standard SuSE installation, so there are no exotic libraries that might cause problems. Maybe there is a bug in the client that could have the same effect on other machines.


I am running SuSE Linux 10.3 32 bit with no problem on 7 projects. Could it be a hardware problem on one CPU?
Tullio

kimmerin
kimmerin
Joined: 29 Sep 08
Posts: 16
Credit: 11090767
RAC: 0

RE: RE: *** glibc

Message 90832 in response to message 90831

Quote:
Quote:
*** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x08241060 ***

I am running SuSE Linux 10.3 32 bit with no problem on 7 projects. Could it be a hardware problem on one CPU?

The box only has one CPU and is running all the time. The OS itself is stable. As well, I doubt that the double calling of free can be caused by a hardware-failure.

tullio
tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

RE: RE: RE: *** glibc

Message 90833 in response to message 90832

Quote:
Quote:
Quote:
*** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x08241060 ***

I am running SuSE Linux 10.3 32 bit with no problem on 7 projects. Could it be a hardware problem on one CPU?

The box only has one CPU and is running all the time. The OS itself is stable. As well, I doubt that the double calling of free can be caused by a hardware-failure.


My CPU is an Opteron 1210 with 2 cores, each of them running a different project (no hyperthreading). I have never had a compute error in Einstein, only in QMC/ORCA, which is a beta project. SETI enhanced and Astropulse, both SSE3 optimized, run OK, same as climateprediction.net. CPDN Beta had one crash; AQUA@home and LHC@home rarely have work. Your other CPUs with Linux work well, as far as I can see. This is why I was thinking of a hardware problem (memory?).

mikey
mikey
Joined: 22 Jan 05
Posts: 12863
Credit: 1884357890
RAC: 238400

RE: RE: RE: *** glibc

Message 90834 in response to message 90832

Quote:
Quote:
Quote:
*** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x08241060 ***

I am running SuSE Linux 10.3 32 bit with no problem on 7 projects. Could it be a hardware problem on one CPU?

The box only has one CPU and is running all the time. The OS itself is stable. As well, I doubt that the double calling of free can be caused by a hardware-failure.

I only have one thought...have you done all the latest upgrades?

Bernd Machenschalk
Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4345
Credit: 252815400
RAC: 41141

RE: The box is a

Quote:
The box is a standard SuSE installation, so there are no exotic libraries that might cause problems. Maybe there is a bug in the client that could have the same effect on other machines.


I'm not aware of any such bugs left in the application. Other machines have completed the same tasks without error (partly even with the same application, e.g. http://einsteinathome.org/workunit/49339040). The error log (http://einsteinathome.org/task/119675653) looks pretty weird, the application crashes _after_ finishing the task ("calling boinc_finish"), and after handling a signal, apparently it catches another while trying to shut down the application. Pretty weird, definitely not a common problem.

BM
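
For readers wondering how a "double free" can show up during shutdown at all: one generic cause is a cleanup path that runs twice, for example once on normal exit and once more from a path triggered by a signal. Whether that is what happens in this application is not established by the log. The following is only a minimal sketch in C of a release function that is safe to call twice; the names work_buffer, work_buffer_init and work_buffer_release are hypothetical and not taken from the Einstein@Home or BOINC source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource holder, for illustration only. */
struct work_buffer {
    double *data;  /* heap block owned by this struct, or NULL */
    size_t  len;   /* number of elements in data */
};

/* Allocate len zeroed doubles. Returns 0 on success, -1 on failure. */
int work_buffer_init(struct work_buffer *wb, size_t len)
{
    assert(wb != NULL);
    wb->data = calloc(len, sizeof *wb->data);
    wb->len  = wb->data ? len : 0;
    return wb->data ? 0 : -1;
}

/*
 * Free the block and clear the pointer, so that a second call finds
 * NULL and does nothing. Returns 1 if memory was freed, 0 otherwise.
 */
int work_buffer_release(struct work_buffer *wb)
{
    assert(wb != NULL);
    if (wb->data == NULL)
        return 0;
    free(wb->data);
    wb->data = NULL;
    wb->len  = 0;
    return 1;
}
```

This guard only helps when both calls happen one after the other in the same thread. It is not thread-safe, and free() is not async-signal-safe, so it must not be called from inside a signal handler.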


kimmerin
kimmerin
Joined: 29 Sep 08
Posts: 16
Credit: 11090767
RAC: 0

RE: RE: RE: RE: ***

Message 90836 in response to message 90834

Quote:
Quote:
Quote:
Quote:
*** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x08241060 ***

I am running SuSE Linux 10.3 32 bit with no problem on 7 projects. Could it be a hardware problem on one CPU?

The box only has one CPU and is running all the time. The OS itself is stable. As well, I doubt that the double calling of free can be caused by a hardware-failure.

I only have one thought...have you done all the latest upgrades?

It's an OpenSuSE 10.2 so there are no new updates. But all updates that have been available are on the system.

Here is the output of ldd:
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ldd einstein_S5R5_1.01_i686-pc-linux-gnu_2
linux-gate.so.1 => (0xb7f7e000)
libpthread.so.0 => /lib/libpthread.so.0 (0xb7f30000)
libm.so.6 => /lib/libm.so.6 (0xb7f0a000)
libc.so.6 => /lib/libc.so.6 (0xb7ddc000)
/lib/ld-linux.so.2 (0xb7f7f000)
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libm.so.6
lrwxrwxrwx 1 root root 11 Dec 6 18:40 /lib/libm.so.6 -> libm-2.5.so
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libc.so.6
lrwxrwxrwx 1 root root 11 Dec 6 18:40 /lib/libc.so.6 -> libc-2.5.so
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libpthread.so.0
lrwxrwxrwx 1 root root 17 Dec 6 18:40 /lib/libpthread.so.0 -> libpthread-2.5.so

Hope that helps.

Regards, Lothar

kimmerin
kimmerin
Joined: 29 Sep 08
Posts: 16
Credit: 11090767
RAC: 0

RE: RE: The box is a

Message 90837 in response to message 90835

Quote:
Quote:
The box is a standard SuSE installation, so there are no exotic libraries that might cause problems. Maybe there is a bug in the client that could have the same effect on other machines.

I'm not aware of any such bugs left in the application. Other machines have completed the same tasks without error (partly even with the same application, e.g. http://einsteinathome.org/workunit/49339040). The error log (http://einsteinathome.org/task/119675653) looks pretty weird, the application crashes _after_ finishing the task ("calling boinc_finish"), and after handling a signal, apparently it catches another while trying to shut down the application. Pretty weird, definitely not a common problem.

The box is really completely standard. It's just running a cronjob that fetches all mail from my mailbox so that it can be downloaded via POP3 at a later point in time. So we have an OpenSuSE installation more or less as it popped out after installation, with all updates available for the box until it was declared "obsolete". This wasn't too long ago, so the libraries here should be current enough.

One thing I can think of is a dependency on the system time. Every ten minutes, another cronjob starts ntpdate to synchronize the time with a time server. So if there is a thread whose calls to free depend on the system time, this might lead to problems. As a test I will deactivate the cronjob.
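
A clock step like the one ntpdate can perform is observable from inside a program by comparing how far CLOCK_REALTIME and CLOCK_MONOTONIC advanced over the same interval. Here is a minimal sketch in C, assuming a POSIX system with clock_gettime; the names clock_pair, clock_pair_sample and clock_step_seconds are made up for illustration and do not come from the BOINC or Einstein@Home code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Both clocks sampled at (nearly) the same moment. */
struct clock_pair {
    struct timespec wall;  /* CLOCK_REALTIME: can be stepped by ntpdate */
    struct timespec mono;  /* CLOCK_MONOTONIC: not affected by steps */
};

/* Elapsed seconds from a to b, computed per field to keep precision. */
static double ts_delta(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Fill cp with the current readings. Returns 0 on success, -1 on error. */
int clock_pair_sample(struct clock_pair *cp)
{
    assert(cp != NULL);
    if (clock_gettime(CLOCK_REALTIME, &cp->wall) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &cp->mono) != 0)
        return -1;
    return 0;
}

/*
 * Seconds by which the wall clock advanced more (positive) or less
 * (negative) than the monotonic clock between the two samples.
 */
double clock_step_seconds(const struct clock_pair *before,
                          const struct clock_pair *after)
{
    assert(before != NULL && after != NULL);
    return ts_delta(&before->wall, &after->wall)
         - ts_delta(&before->mono, &after->mono);
}
```

Taking one sample before and one after the cronjob fires, a result far from zero would indicate that the wall clock was stepped rather than slowly adjusted. Older glibc versions may need -lrt at link time for clock_gettime.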

Regards, Lothar

tullio
tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

RE: It's an OpenSuSE 10.2

Message 90838 in response to message 90836

Quote:


It's an OpenSuSE 10.2 so there are no new updates. But all updates that have been available are on the system.

Here is the output of ldd:
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ldd einstein_S5R5_1.01_i686-pc-linux-gnu_2
linux-gate.so.1 => (0xb7f7e000)
libpthread.so.0 => /lib/libpthread.so.0 (0xb7f30000)
libm.so.6 => /lib/libm.so.6 (0xb7f0a000)
libc.so.6 => /lib/libc.so.6 (0xb7ddc000)
/lib/ld-linux.so.2 (0xb7f7f000)
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libm.so.6
lrwxrwxrwx 1 root root 11 Dec 6 18:40 /lib/libm.so.6 -> libm-2.5.so
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libc.so.6
lrwxrwxrwx 1 root root 11 Dec 6 18:40 /lib/libc.so.6 -> libc-2.5.so
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libpthread.so.0
lrwxrwxrwx 1 root root 17 Dec 6 18:40 /lib/libpthread.so.0 -> libpthread-2.5.so

Hope that helps.

Regards, Lothar


My OpenSuSE is 10.3, but I have a copy of 11.1 downloaded should I decide to upgrade. But, as long as everything works, I am keeping 10.3, which is stable.

kimmerin
kimmerin
Joined: 29 Sep 08
Posts: 16
Credit: 11090767
RAC: 0

RE: I'm not aware of any

Message 90839 in response to message 90835

Quote:
I'm not aware of any such bugs left in the application.

I think there still is one. After commenting out the cronjob for the ntpdate call, the client has been working without a crash (at least so far). Currently it's at step 67, so I expect it to finish around tomorrow (the calculation is suspended overnight to keep the room quieter).

Without knowing the source, I think the following happens:

A thread is waiting in the function pthread_cond_timedwait, which expects its timeout as an absolute time. During this wait, ntpdate is called and sets the clock somewhere into the future. This wakes the thread well before it was intended to wake, so it frees resources before their time. Later on, another thread frees the same resource as well, leading to the error being observed.

I know this specific effect because there is a similar problem in Java when using Thread.sleep() or Thread.wait() on a system that changes the clock due to daylight saving time. This can cause a Thread.sleep(50) to pause for an hour instead of the intended 50 milliseconds.
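
To make the hypothesis concrete: by default, a POSIX condition variable measures the pthread_cond_timedwait deadline against CLOCK_REALTIME, so a clock step can cut the wait short or prolong it. Below is a minimal sketch in C of a wait that ignores such steps, assuming the platform supports pthread_condattr_setclock with CLOCK_MONOTONIC. The names mono_event, mono_event_init, mono_event_wait and mono_event_signal are hypothetical. Nothing here is taken from the application's source, and whether it uses pthread_cond_timedwait at all is not established in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>   /* ETIMEDOUT, for callers checking the result */
#include <pthread.h>
#include <time.h>

/* A flag guarded by a mutex, with timeouts measured on CLOCK_MONOTONIC. */
struct mono_event {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             signaled;
};

/* Returns 0 on success or a pthread error number. */
int mono_event_init(struct mono_event *ev)
{
    pthread_condattr_t attr;
    int rc;

    assert(ev != NULL);
    ev->signaled = 0;
    if ((rc = pthread_mutex_init(&ev->lock, NULL)) != 0)
        return rc;
    if ((rc = pthread_condattr_init(&attr)) != 0) {
        pthread_mutex_destroy(&ev->lock);
        return rc;
    }
    /* Bind the condition variable to the monotonic clock. */
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc == 0)
        rc = pthread_cond_init(&ev->cond, &attr);
    pthread_condattr_destroy(&attr);
    if (rc != 0)
        pthread_mutex_destroy(&ev->lock);
    return rc;
}

/* Wait up to timeout_ms. Returns 0 if signaled, ETIMEDOUT on timeout. */
int mono_event_wait(struct mono_event *ev, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;

    assert(ev != NULL && timeout_ms >= 0);
    /* The absolute deadline must come from the same clock as the condvar. */
    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return errno;
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&ev->lock);
    /* Loop to absorb spurious wakeups; stop on signal or on error. */
    while (!ev->signaled && rc == 0)
        rc = pthread_cond_timedwait(&ev->cond, &ev->lock, &deadline);
    if (ev->signaled)
        rc = 0;
    pthread_mutex_unlock(&ev->lock);
    return rc;
}

/* Set the flag and wake all waiters. */
void mono_event_signal(struct mono_event *ev)
{
    assert(ev != NULL);
    pthread_mutex_lock(&ev->lock);
    ev->signaled = 1;
    pthread_cond_broadcast(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}
```

With this setup, mono_event_wait(&ev, 50) should time out after roughly 50 ms even if ntpdate steps the wall clock in the meantime. Depending on the glibc version, linking may need -pthread and -lrt.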

Just my thinking, I will report again, when the current WU is finished (this or that way ;-)

Regards, Lothar

kimmerin
kimmerin
Joined: 29 Sep 08
Posts: 16
Credit: 11090767
RAC: 0

RE: Just my thinking, I

Message 90840 in response to message 90839

Quote:
Just my thinking, I will report again, when the current WU is finished (this or that way ;-)

It was "that way" :-(

74, *** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x082410a8 ***

But it happened more or less exactly at 22:00:00, the time at which the calculation is suspended for the night, whatever that might mean...

Regards, Lothar
