On one of my boxes, the einstein-client is crashing all the time. Here is the link to that computer:
http://einsteinathome.org/host/1864884/tasks
The error message is always the same:
*** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x08241060 ***
There have been four crashes out of six WUs processed so far: two while calculating part 99, one after writing the output (i.e. after actually finishing the calculation), and one quite early, while working on part 16.
The box is a standard SuSE installation, so there are no exotic libraries that might cause problems. Maybe there is a bug in the client that could show up on other machines as well.
Crash of einstein_S5R5_1.01_i686-pc-linux-gnu_2
I am running SuSE Linux 10.3 32 bit with no problem on 7 projects. Could it be a hardware problem on one CPU?
Tullio
RE: RE: *** glibc
The box only has one CPU and is running all the time. The OS itself is stable. I also doubt that calling free twice on the same block can be caused by a hardware failure.
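For readers unfamiliar with the message: glibc aborts with "double free or corruption" when free is handed a block that has already been released. The following is only a minimal sketch of how that situation is commonly avoided in single-threaded code. The macro FREE_AND_CLEAR and the function demo_release_twice are hypothetical names made up for this illustration, and nothing here is taken from the Einstein@Home application.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: release a block and clear the pointer, so that a
 * second release becomes free(NULL), which the C standard defines as a no-op. */
#define FREE_AND_CLEAR(p) do { free(p); (p) = NULL; } while (0)

/* Releases the same buffer twice; returns 1 on success, -1 if malloc failed. */
int demo_release_twice(void)
{
    char *buf = malloc(16);
    if (buf == NULL)
        return -1;

    FREE_AND_CLEAR(buf);   /* first release: block goes back to the allocator */
    FREE_AND_CLEAR(buf);   /* second release: buf is NULL, nothing happens */

    assert(buf == NULL);
    return 1;
}
```

This pattern only protects a single owner of the pointer. If two threads can reach the same block, the release also has to be guarded by a lock.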
RE: RE: RE: *** glibc
My CPU is an Opteron 1210 with 2 cores, each of them running a different project (no hyperthreading). I have never had a compute error in Einstein, only in QMC/ORCA, which is a beta project. SETI enhanced and Astropulse, both SSE3 optimized, run OK, same as climateprediction.net. CPDN Beta had one crash; AQUA@home and LHC@home rarely have work. Your other CPUs with Linux work well, as far as I can see. This is why I was thinking of a hardware problem (memory?).
RE: RE: RE: *** glibc
I only have one thought...have you done all the latest upgrades?
RE: The box is a
I'm not aware of any such bugs left in the application. Other machines have completed the same tasks without error (some even with the same application, e.g. http://einsteinathome.org/workunit/49339040). The error log (http://einsteinathome.org/task/119675653) looks pretty weird: the application crashes _after_ finishing the task ("calling boinc_finish"), and after handling one signal it apparently catches another while trying to shut down. Definitely not a common problem.
BM
RE: RE: RE: RE: ***
It's an OpenSuSE 10.2 so there are no new updates. But all updates that have been available are on the system.
Here is the output of ldd:
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ldd einstein_S5R5_1.01_i686-pc-linux-gnu_2
linux-gate.so.1 => (0xb7f7e000)
libpthread.so.0 => /lib/libpthread.so.0 (0xb7f30000)
libm.so.6 => /lib/libm.so.6 (0xb7f0a000)
libc.so.6 => /lib/libc.so.6 (0xb7ddc000)
/lib/ld-linux.so.2 (0xb7f7f000)
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libm.so.6
lrwxrwxrwx 1 root root 11 Dec 6 18:40 /lib/libm.so.6 -> libm-2.5.so
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libc.so.6
lrwxrwxrwx 1 root root 11 Dec 6 18:40 /lib/libc.so.6 -> libc-2.5.so
polly2:/home/kimmerin/BOINC/projects/einstein.phys.uwm.edu # ls -l /lib/libpthread.so.0
lrwxrwxrwx 1 root root 17 Dec 6 18:40 /lib/libpthread.so.0 -> libpthread-2.5.so
Hope that helps.
Regards, Lothar
RE: RE: The box is a
The box is really completely standard. It's just running a cronjob to fetch all mails from my mailbox so that they can be downloaded using POP3 at a later point in time. So we have an OpenSuSE installation more or less as it popped out after installation, with all updates that were available for the box until it was declared "obsolete". This wasn't too long ago, so the libraries here should be current enough.
One thing I can think of is a dependency on the system time. Every ten minutes, another cronjob starts ntpdate to synchronize the time with a timeserver. So if there is a thread whose calls to free depend on the system time, this might lead to problems. As a test I will deactivate the cronjob.
Regards, Lothar
RE: It's an OpenSuSE 10.2
My OpenSuSE is 10.3, but I have a copy of 11.1 downloaded should I decide to upgrade. As long as everything works, though, I am keeping 10.3, which is stable.
RE: I'm not aware of any
I think there still is one. After commenting out the cronjob for the ntpdate call, the client is working without a crash (at least until now). Currently it's at step 67, so I expect it to finish sometime tomorrow (the calculation is suspended overnight to keep the room quieter).
Without knowing the source, I think the following happens:
A thread is waiting in pthread_cond_timedwait, which expects the timeout as an absolute time. During this wait, ntpdate is called and sets the clock somewhere into the future. This wakes the thread well before it was intended to wake, so it frees resources too early. Later on, another thread frees the same resource as well, which leads to the error being observed.
I know this very specific effect because Java has a similar problem with Thread.sleep() or Object.wait() on a system that changes the clock due to daylight saving time. It can have the effect that a Thread.sleep(50) stops for an hour instead of the intended 50 milliseconds.
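To make the pthread_cond_timedwait idea concrete, here is a minimal sketch, assuming Linux with glibc. By default the absolute deadline is compared against CLOCK_REALTIME, which a tool like ntpdate can step; pthread_condattr_setclock lets a condition variable use CLOCK_MONOTONIC instead, which is not affected by such steps. The names mono_event, mono_event_init, mono_event_wait and mono_event_set are hypothetical and exist only for this illustration. Whether the Einstein@Home application waits this way at all is unknown to me.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper type: an event whose timeouts are measured against
 * CLOCK_MONOTONIC, so stepping the wall clock cannot end a wait early. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             ready;   /* predicate, protected by mutex */
} mono_event;

/* Returns 0 on success, otherwise a pthread error number. */
int mono_event_init(mono_event *ev)
{
    pthread_condattr_t attr;
    int rc = pthread_condattr_init(&attr);
    if (rc != 0)
        return rc;
    /* Tell the condition variable which clock its deadlines refer to. */
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc == 0)
        rc = pthread_cond_init(&ev->cond, &attr);
    pthread_condattr_destroy(&attr);
    if (rc != 0)
        return rc;
    ev->ready = 0;
    return pthread_mutex_init(&ev->mutex, NULL);
}

/* Waits until the event is set or timeout_ms has elapsed.
 * Returns 0 if the event was set, ETIMEDOUT on timeout. */
int mono_event_wait(mono_event *ev, long timeout_ms)
{
    assert(timeout_ms >= 0);

    /* Build the absolute deadline on the same clock the condvar uses. */
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    int rc = 0;
    pthread_mutex_lock(&ev->mutex);
    /* Loop on the predicate: wakeups may be spurious. */
    while (!ev->ready && rc == 0)
        rc = pthread_cond_timedwait(&ev->cond, &ev->mutex, &deadline);
    int ready = ev->ready;
    pthread_mutex_unlock(&ev->mutex);
    return ready ? 0 : rc;
}

/* Marks the event as set and wakes all waiters. */
void mono_event_set(mono_event *ev)
{
    pthread_mutex_lock(&ev->mutex);
    ev->ready = 1;
    pthread_cond_broadcast(&ev->cond);
    pthread_mutex_unlock(&ev->mutex);
}
```

With a default-initialized condition variable and a deadline taken from CLOCK_REALTIME, the same wait could return ETIMEDOUT immediately after the clock is stepped forward. Note that an early timeout alone does not cause a double free; the code after the wait would also have to release something that another thread still owns.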
Just my thinking, I will report again, when the current WU is finished (this or that way ;-)
Regards, Lothar
RE: Just my thinking, I
It was "that way" :-(
74, *** glibc detected *** ../../projects/einstein.phys.uwm.edu/einstein_S5R5_1.01_i686-pc-linux-gnu_2: double free or corruption (fasttop): 0x082410a8 ***
But it happened more or less exactly at 22:00:00, the time at which the calculation is suspended for the night, whatever that might mean...
Regards, Lothar