Strange Behaviour with GT240

ashes999
Joined: 17 Jun 11
Posts: 9
Credit: 2125257
RAC: 0
Topic 195913

I have a GT240 which I use primarily for BOINC. I run it at 100% all the time, with TThrottle to keep it under 90C.

A couple of months ago, NVIDIA released an updated driver version. Since upgrading, my computer chops like crazy when running the app; it freezes every second, for a full second. It's very, very hard to use when this happens.

I tried downgrading the drivers to several old versions (can't recall which one I had before), and it helped, but not too much; the problem is still there.

The other problem is, even with my app set to "use GPU only when PC is idle," after a few hours (not sure how long), the GPU application just stops. Based on the temperature graph, it looks like the GPU usage goes really low -- maybe 5-10% or so (fan is inaudible). I know it's still running, because if I move my mouse, the GPU temperature drops.

None of this occurred before, and I suspect the drivers, but I'm not sure how to troubleshoot, isolate, or fix the problem.

This happens both with Einstein@Home and with GPUGRID; it's not project-specific, as best as I can tell.

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2699403
RAC: 0

Try downgrading to the 266.58 Cuda32 drivers. The Cuda4 drivers don't like being interrupted and tend to freak out and downclock the GPU, and a reboot is needed to fix this.
Jason G over at Seti reported this to the Boinc Devs months ago, and supplied thread-safe API code, but I've yet to hear of any GPU project updating their GPU apps with thread-safe code:

Quote:

OK I will. It's quite involved, but I'll try to give the detail first, then explain further if needed.

Certain new methods by which Cuda4 drivers deal with memory & Cuda transfers are sensitive to being abruptly terminated without warning. All Windows-Boinc-Cuda app releases to date use boincApi code for their exit code, given that Boinc needs to tell applications through this channel when to snooze/resume/exit etc, as well as when the worker needs to exit normally.

Symptoms directly pertaining to effects using Cuda 4 drivers with current Boinc-Cuda applications are primarily the 'sticky downclock' problem, but also other forms of unexplained erroring out.

There are other non-Cuda related symptoms visible across non-Cuda (CPU) applications as well, most visible being truncation or erasure of the stderr.txt contents, and less visible possibly checkpoint & result files as well.

These sorts of symptoms, being apparently related to how 'nicely' the program treats the active buffer transfers when the application shuts down, seemed to be statistically more common on lower bus/memory speed systems, probably as a result of the transfers etc taking longer (i.e. higher contention).

The trial solution in testing is to install exit code within boincAPI that 'asks' the worker thread (that feeds the Cuda device etc) to shut down 'nicely', so that it can quickly finish what it is doing & tidy up before being 'killed'. At present this seems effective at preventing the downclock problem & possibly the stderr/etc truncation symptoms as well, though we're poking at it to look for unexpected issues at this time. I've relayed as much information as I can to Berkeley & will leave it in their hands.

If you experience the downclock problems, there are currently 2 options I'm aware of:
- Downgrade to driver 266.58, which is not as sensitive to its tasks being summarily executed, or
- Determine if it's a situation where you absolutely need the fix now: That would only be a possibility for this Project (other projects don't have the fix yet & may not even be aware of the issue), and only under special circumstances, as it would involve pre-alpha testing unproven code. We are a bit overworked at the moment with V7 & other development considerations, so please don't expect a rush release of this unproven code.

In any case, high throughput hosts are statistically less susceptible to this problem, so it is quite possible many hosts don't see the symptoms appear even with newer drivers & existing applications.

HTH, Jason

Claggy
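
The 'ask the worker to shut down nicely' idea from the quote can be illustrated with a small sketch. This is not the boincAPI fix Jason describes (that code isn't shown in this thread); it is only a generic cooperative-shutdown pattern in Python, and the names `worker`, `run_and_stop` and `stop_requested` are made up for illustration. The controlling thread sets a flag instead of killing the worker, and the worker finishes its current chunk and tidies up before exiting.

```python
import threading
import time


def worker(stop_requested: threading.Event, log: list) -> None:
    """Hypothetical stand-in for the thread that feeds the GPU."""
    chunks = 0
    while not stop_requested.is_set():
        time.sleep(0.01)  # pretend to run one short unit of GPU work
        chunks += 1
    # Only reached after a polite stop request: finish up and release resources.
    log.append(f"finished {chunks} chunks")
    log.append("buffers released")


def run_and_stop(run_seconds: float = 0.05, timeout: float = 1.0) -> list:
    """Start the worker, let it run briefly, then ask it to stop and wait."""
    log = []
    stop_requested = threading.Event()
    thread = threading.Thread(target=worker, args=(stop_requested, log))
    thread.start()
    time.sleep(run_seconds)
    stop_requested.set()   # 'ask' rather than kill
    thread.join(timeout)   # give the worker time to tidy up
    log.append("clean exit" if not thread.is_alive() else "timed out")
    return log


if __name__ == "__main__":
    for line in run_and_stop():
        print(line)
```

The point of the pattern is simply that the worker gets a chance to reach a safe point before the process goes away, which is what the quoted post says the Cuda4 drivers need.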

ashes999
Joined: 17 Jun 11
Posts: 9
Credit: 2125257
RAC: 0

Got it. The clock speed is downgrading from 549 to 135. How do I prevent this? This is not because of temperature; the temperature holds steady at 90C.
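
For anyone wanting to spot this automatically from logged clock readings, here is a rough sketch. It assumes you already have a list of core clock samples in MHz taken while the GPU should be busy (how you log them is up to you); the function name and the `ratio` and `min_run` defaults are arbitrary choices for illustration, with 549 taken from the numbers above as the nominal clock.

```python
from typing import Optional, Sequence


def find_sticky_downclock(
    clocks_mhz: Sequence[float],
    nominal_mhz: float = 549.0,
    ratio: float = 0.5,
    min_run: int = 3,
) -> Optional[int]:
    """Return the index where a sustained low-clock run starts, or None.

    A reading counts as 'low' when it is below ratio * nominal_mhz.
    A run is 'sticky' once min_run consecutive readings are low.
    """
    threshold = nominal_mhz * ratio
    run_start = None
    for i, mhz in enumerate(clocks_mhz):
        if mhz < threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None  # clock recovered, so the dip was not sticky
    return None


if __name__ == "__main__":
    readings = [549, 549, 135, 135, 135, 135]
    print(find_sticky_downclock(readings))
```

A single low reading followed by a recovery is ignored; only a run of low readings is reported, which matches the 'stays at 135 until reboot' behaviour described in this thread.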

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2960835998
RAC: 702349

Quote:
Got it. The clock speed is downgrading from 549 to 135. How do I prevent this? This is not because of temperature; the temperature holds steady at 90C.


There are three known workrounds.

1) Downgrade to driver 266.58
2) Avoid suspending an Einstein task mid-run. If you are sharing the card with other BOINC projects, set the 'Task Switch Interval' longer than the run time of your longest-running project CUDA app.
3) Wait for the project to re-compile the BRP3/4 app, or for somebody else to do it for them from the published sources.

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

Quote:
Quote:
Got it. The clock speed is downgrading from 549 to 135. How do I prevent this? This is not because of temperature; the temperature holds steady at 90C.

There are three known workrounds.

1) Downgrade to driver 266.58
2) Avoid suspending an Einstein task mid-run. If you are sharing the card with other BOINC projects, set the 'Task Switch Interval' longer than the run time of your longest-running project CUDA app.
3) Wait for the project to re-compile the BRP3/4 app, or for somebody else to do it for them from the published sources.

4) don't try to "fix" it - just reboot if it happens.

ashes999
Joined: 17 Jun 11
Posts: 9
Credit: 2125257
RAC: 0

It happens constantly. Changing the switch time to 9999 minutes did nothing. I'll downgrade and see what happens.

Worst case, I'm stuck with just running GPU apps on idle time.

ashes999
Joined: 17 Jun 11
Posts: 9
Credit: 2125257
RAC: 0

The solution was, very strangely, to change my PC power settings to "always on" and disable my screensaver. This is documented on my SuperUser question here.
