All "Gamma-ray pulsar binary search on GPU" units fail with computation fault

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118524314543

RAC: 26259269

28 Apr 2018 1:01:31 UTC

Message 165195 in response to message 165186

There are a couple of comments you made in your latest message that perhaps point to hardware (plus heat) as the real problem.

Richard Bertrand wrote:

... The calculations went well for more than an hour and a half (the projected calculation time for the task was one hour or so, but during calculations, this changed to more than 4 hours).

The error occurred after the 402nd (of 1255) photon-load event according to the task log.

Firstly, 1.5 hours of crunching pretty much confirms the driver is OK. I wouldn't think you would get that far with a faulty driver. Last time I looked, virtually all tasks were failing after about 30 secs.

The estimate is being correctly adjusted by BOINC so that's not a problem. 30% of the work done in 1.5 hours points to a corrected time of close to 5 hours. This is exactly what BOINC should be doing. Your GPU is listed as a GT 650M. You did mention this in the opening message but it didn't immediately click with me that you were using a laptop. My (admittedly limited) experience is that laptops can't really handle the stresses of crunching and do tend to fail if used for crunching over extended periods.
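The arithmetic behind that corrected estimate can be sketched as below. This is only a simple linear extrapolation, not BOINC's actual estimator, and the function name `projected_total_hours` is made up for illustration. Treating the photon-load event count (402 of 1255) as a proxy for the fraction done is also an assumption.

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linearly extrapolate the total run time from the progress so far.

    Hypothetical helper for illustration; BOINC's own estimate is
    computed differently.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Roughly 30% done after 1.5 hours, as in the post above.
rough = projected_total_hours(1.5, 0.30)

# Assumption: progress is proportional to photon-load events (402 of 1255).
from_log = projected_total_hours(1.5, 402 / 1255)

print(f"rough estimate:    {rough:.1f} h")
print(f"from the task log: {from_log:.1f} h")
```

Both figures land a little under or at 5 hours, which is consistent with the estimate climbing from about one hour to more than four during the run.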

You are using both the discrete GPU and the Intel GPU, and most tasks done on the latter are failing with validate errors. If you are also running CPU tasks at all, you will be generating a lot of internal heat which will tend to 'cook' components on the main logic board. I'm thinking particularly about capacitors which can fail prematurely in elevated temperature conditions.

Quote:

As all other programs work fine (I tried a game and Google Earth, and SETI is calculating constantly and without errors with OpenCL and CUDA applications), I am wondering why only the Einstein OpenCL application has problems.

Because that app is probably the one that's putting the highest stress on your machine. If there was a fundamental flaw in the EAH app, everybody else would be affected and the boards would be littered with complaints.

Quote:

A few months ago I noticed that the fan speed is higher than I was used to. That is a signal that I need to clean the fan and cooling system.

This is likely to be a key bit of information. Components age much more rapidly as the temperature goes up. The CPU and GPU cores are not usually the problem. It's often ripple and spikes in voltages that are not being cleaned up because capacitors aren't doing their job any more. Over the years, I would have done around 50 motherboard repairs to replace bulging electrolytic capacitors. Apart from the visual symptoms, the first thing likely to be noticed is tasks failing or the whole system locking up or crashing. If I replace any bulging electrolytic capacitors, the machine usually stops misbehaving. I've never had to replace a polymer cap.

When you check and clean your cooling system, carefully check all caps you can find for any signs of bulging tops or deformation of the sealing 'bung' (black rubber or plastic seal) at the base.

Quote:

... I will probably also renew the cooling paste on the CPU and video card after 5 years of operating at rather high temperatures: at this moment each core and the nVidia card is constantly throttling, with the temperature switching between 80 and 100+ degrees Celsius.

Certainly worthwhile doing that while it's all in bits. If you've used that machine for crunching for 5 years, I'm surprised you haven't run into problems before this. Have you cleaned the fans or heat sinks previously? For light duties, it's not so critical but for crunching, you need to do it rather more regularly.

Quote:

The problem is, that I need this computer every day for work and private stuff.

Then please think carefully about how much crunching load you should impose on it.

Quote:

Anyway, as the temperature sensors of the CPU and nVidia card seem to work OK and throttle the speed of these, I don't see why temperature would be an issue with the Einstein calculations?

This is exactly why I think it's a hardware problem, possibly related to voltage cleanliness and stability. This could be due to the PSU itself or to voltage stability components on the main logic board.

I see the same sort of behaviour with desktop configurations as they age. They do run hot but normally don't have problems coping with the heat. These days, at the first sign of running problems / task failures, I inspect both the motherboard and PSU for swollen caps. A lot of my machines are in the 7+ year age bracket, and most of those have had this problem and needed cap replacements in order to keep working. I realise this isn't a viable option for most people. I'm just fortunate that I invested in a decent soldering station and have acquired (through lots of practice) some basic soldering skills.

Cheers,
Gary.

Richard Bertrand

Joined: 2 Apr 06

Posts: 8

Credit: 5833265

RAC: 0

29 Apr 2018 22:08:16 UTC

Message 165213

I have cleaned the fan and renewed the cooling paste (the paste was completely dried out). Temperatures are down by 5 to 10 degrees Celsius. The laptop is now running with BOINC tasks enabled (most of the time 3 or 4 of them) at temperatures between 85 and 95 degrees Celsius. The nVidia chip runs a lot cooler now: maybe 10 to 20 degrees Celsius less, at 80 to 82 degrees Celsius. The fan is a lot quieter now: it runs at about 4000 rpm (before, it ran at full speed, above 5000 rpm).

The laptop is running almost as well / cool as it did a few years ago (I suspect it still runs a little bit hotter). The CPU will now hold its speed (at turbo speed, 3.1 to 3.2 GHz), where it was throttling a lot over the last few months.

However, the first two gamma-ray tasks ended with an error again: the first after 90 minutes or so (more or less the same run time as after re-installation of the video driver), the second within 30 seconds after the first. At that time, the PC ran slightly hotter than before: at about 90 to 98 degrees Celsius.

So maybe there is something that gets the computer running hotter with the gamma-ray calculations and it is more sensitive in my case. That will leave me no other option than to stop calculating for Einstein, otherwise I keep trashing the gamma-ray calculations as Gary said. Or I need to reduce the number of BOINC tasks that are running to keep the temperature down.

However, I also just noticed a message in the (old and new) output "De limiet voor het aantal netwerk-BIOS-sessies is overschreden." (The limit for the number of network BIOS sessions has been exceeded). I don't know whether that message has something to do with the computation errors?

Richard Bertrand

Joined: 2 Apr 06

Posts: 8

Credit: 5833265

RAC: 0

29 Apr 2018 22:36:44 UTC

Message 165214

Gary Roberts wrote:

replace bulging electrolytic capacitors

As far as I know and have seen on my board, there are only a few electrolytic capacitors, and they looked fine. Or there are more, but maybe nowadays they are able to make these electrolytics so small that I can't recognize them (and then I cannot repair them as they are too small).

Quote:

Have you cleaned the fans or heat sinks previously?

Yes, I have done that twice before. That's why I know that it is a daring task, because all sorts of plugs are breaking in the laptop after five years of service (and probably because it constantly runs hot). So besides the cleaning, I have to be extra careful disassembling and reassembling this laptop.

As this laptop functions fine for all sorts of other work, I have no intention of replacing it (moreover, I will buy some extra memory for it). Maybe I will buy another one in a year or two, unless it dies on me before that. So I think I will suspend calculating for Einstein until I have a new laptop, unless it is possible to refuse the gamma-ray units and calculate on the other types of Einstein units only....

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118524314543

RAC: 26259269

30 Apr 2018 0:21:46 UTC

Message 165216 in response to message 165214

Richard Bertrand wrote:

As far as I know and have seen on my board, there are only a few electrolytic capacitors, and they looked fine. Or there are more, but maybe nowadays they are able to make these electrolytics so small that I can't recognize them (and then I cannot repair them as they are too small).

No, you can't miss electrolytics. They are cylindrical 'cans' and the smallest are probably around 3-5mm in diameter and perhaps 5-8mm in height. Of course, those used for voltage stability/ripple suppression can be rather larger. Failure of the bigger ones becomes obvious through distortion of the 'can'. With smaller ones, they tend to 'dry out' faster because of their small size so there is usually no visible indication. In a 5 year old machine that has had an overly warm life, there is bound to be some deterioration of those components.

If you put any value on your time (I don't), it's false economy to keep repairing stuff. I've been retired for more than 10 years and my time has negative value. The thing that is most valuable to me these days is the mental stimulation and sense of purpose that comes from maintaining a fleet of crunchers. I've had a lot of fun upgrading 7+ year old CPU crunchers with modern AMD Polaris GPUs and figuring out how to get OpenCL under Linux working correctly on them. Undoubtedly, motherboard deterioration will put a stop to this at some point, probably sooner rather than later :-).

With all the things you have done in trying to identify/correct the cause of your problems, there is one other thing I would suggest. You now have a situation where heat issues and driver installation issues are hopefully taken care of. It's starting to look like out of spec components on the logic board - things seem OK as long as you don't stress the machine too much. If this is a reasonable assessment, I would expect that you may be able to get a single GPU task to run to conclusion if you stop all other crunching activities while doing so. If you just suspend all tasks except for a single FGRPB1G GPU task, it would be interesting to see if that one can finish. This would prove that the driver is fine and that it must be something related to overall stress on the machine when multiple tasks are crunching.
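If clicking through BOINC Manager is tedious, the single-task experiment could also be scripted with the `boinccmd` command-line tool, which accepts `--task URL task_name suspend`. The sketch below only builds the commands and does not run them. The project URL and the task names are hypothetical placeholders; the real ones would come from `boinccmd --get_tasks` on the affected machine.

```python
from typing import List, Tuple


def suspend_all_but(tasks: List[Tuple[str, str]], keep_name: str) -> List[List[str]]:
    """Build boinccmd commands that suspend every task except keep_name.

    tasks is a list of (project_url, task_name) pairs. Nothing is
    executed here; the caller decides whether to run the commands.
    """
    if keep_name not in [name for _, name in tasks]:
        raise ValueError(f"task {keep_name!r} is not in the task list")
    return [
        ["boinccmd", "--task", url, name, "suspend"]
        for url, name in tasks
        if name != keep_name
    ]


# Hypothetical project URL and task names, for illustration only.
EINSTEIN = "https://einsteinathome.org/"
tasks = [
    (EINSTEIN, "example_FGRPB1G_task_0"),
    (EINSTEIN, "example_BRP4_task_0"),
    (EINSTEIN, "example_CPU_task_0"),
]

for cmd in suspend_all_but(tasks, keep_name="example_FGRPB1G_task_0"):
    print(" ".join(cmd))
```

Each printed line could then be run by hand, or passed to `subprocess.run`, and the same commands with `resume` in place of `suspend` would undo the change once the test task has finished or failed.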

However, even if that task succeeds, I agree with you that it's probably wise to stop crunching these tasks, both on the discrete GPU and also the BRP4 tasks on the internal GPU (based on all the validate errors). There are no other types of Einstein GPU tasks that you can replace them with. I don't know what sort of tasks (if any) you run on the CPU cores but they will create a lot of heat too if you run too many simultaneously. The project preferences allow you to enable/disable particular searches as you wish.

Cheers,
Gary.
