Possible Reasons for Linux - Segmentation Violation

FalconFly
Joined: 16 Feb 05
Posts: 191
Credit: 15650710
RAC: 0
Topic 192079

I've built a new System using components almost identical to those in several previous ones.

But while all previous Systems worked like a champ right from the start, the new one repeatedly quits computation with a Kernel Panic - Segmentation Violation within 24 hours of operation while running an EAH WorkUnit.

It seems, though, that the BOINC Binary itself may also be the cause.

Basic question is :
What should I look for as possible reasons ?

So far, I've tried different Linux Kernels (Fedora Core 5 and 6) with identical results.
Also, I've replaced the Video Card (trial & error), but it made no difference.
Finally, I've swapped the two 512MB DDR2 RAM Modules around to see if that helps, but no change.

Voltages and Temperatures are perfect, just like the previous Systems I built.

System :
AMD Athlon64 X2 4600+ EE
2x 512MB MDT PC2-800 CL5
ASUS M2V
Optimized BOINC V5.2.13 (which runs perfectly on all other similar Systems)

As said, the System runs fine again for some hours after being re-installed and given the Backup of the BOINC folder.

TruXsoft V5.3.12 crashes with the same error, gonna test the vanilla BOINC release in a Moment.

The whole behaviour leads me to believe that the EAH Client has reached a point of immediate error once it attempts to resume.

Right now I'm rolling back to the backup again and will Reset EAH before going into Production again... But after that I'll be out of Ideas :(

Desti
Joined: 20 Aug 05
Posts: 117
Credit: 23762214
RAC: 0

Do you have any errors in the logfiles after or before the segfault?
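
For anyone checking the logs by hand, here is a minimal sketch of how the lines around a crash could be filtered out of a saved log. The pattern list and the function name `find_crash_lines` are my own assumptions for illustration, not anything BOINC or the Kernel defines, and the list is certainly not exhaustive.

```python
import re

# Patterns that commonly appear in Linux kernel logs around a crash.
# Illustrative assumption only; extend the list as needed.
CRASH_PATTERNS = re.compile(
    r"segfault|segmentation|kernel panic|oops|machine check|general protection",
    re.IGNORECASE,
)

def find_crash_lines(log_text):
    """Return (line_number, line) pairs that match a crash-related pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if CRASH_PATTERNS.search(line):
            hits.append((number, line.strip()))
    return hits
```

To use it, read a log file (on Fedora that would typically be /var/log/messages, but that path is an assumption about your setup) into a string and pass it to `find_crash_lines`. If the machine panics before anything reaches the disk, the log may simply contain nothing.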

FalconFly
Joined: 16 Feb 05
Posts: 191
Credit: 15650710
RAC: 0

Message 51932 in response to message 51931

No, the running Programs have no chance to write or record anything of the Event (apart from the Kernel Dump).

It seems I've nailed the error though (the System has been up since this morning and is running well so far) :

Lengthy Prime95 Torture tests revealed intermittent Rounding (bit-level) errors, indicating that my two PC2-800 DDR2 Modules cannot run at full Speed in my Motherboard.
That would explain why the Client wrote bogus Data within hours of operation, eventually crashed while using that Data for computation, and could not run again until the BOINC backup was restored.

I've reduced clocking from DDR800 to DDR667 and had no more errors during hours of stress-testing.

I hope that workaround fixes the Problem.
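
To illustrate the principle behind that kind of test: run the same deterministic calculation several times and compare the results, since on healthy hardware they must be bit-for-bit identical. The sketch below is not Prime95's algorithm and puts very little load on the RAM, so it is no replacement for a real torture test. The function names and the formula are made up for illustration.

```python
import hashlib
import math

def workload_digest(size=50_000):
    """Run a fixed floating-point workload and return a hash of its results."""
    digest = hashlib.sha256()
    value = 0.5
    for i in range(1, size + 1):
        # Deterministic mix of operations; the exact formula is arbitrary.
        value = math.sin(value * i) + math.sqrt(i) / (1.0 + value * value)
        digest.update(repr(value).encode("ascii"))
    return digest.hexdigest()

def count_mismatches(runs=5, size=50_000):
    """Repeat the workload and count runs whose digest differs from the first."""
    reference = workload_digest(size)
    return sum(1 for _ in range(runs - 1) if workload_digest(size) != reference)
```

A non-zero result from `count_mismatches()` on the same machine would point at unstable hardware (RAM, CPU or their settings) rather than at the software, which is the same conclusion the Rounding errors led to here.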
