Strange Behaviour with GT240

ashes999

Joined: 17 Jun 11

Posts: 9

Credit: 2125257

RAC: 0

21 Aug 2011 9:27:32 UTC

Topic 195913

I have a GT240 which I use primarily for BOINC. I run it at 100% all the time, with TThrottle to keep it under 90C.

A couple of months ago, NVidia released an updated version of its drivers. Since upgrading, my computer chops like crazy when running the app; it freezes every second, for a full second. It's very, very hard to use when this happens.

I tried downgrading the drivers to several old versions (can't recall which one I had before), and it helped, but not too much; the problem is still there.

The other problem is, even with my app set to "use GPU only when PC is idle," after a few hours (not sure how long), the GPU application just stops. Based on the temperature graph, it looks like the GPU usage goes really low -- maybe 5-10% or so (fan is inaudible). I know it's still running, because if I move my mouse, the GPU temperature drops.

None of this occurred before, and I suspect the drivers, but I'm not sure how to troubleshoot, isolate, or fix the problem.

This happens both with Einstein@Home and with GPUGRID; it's not project-specific, as best as I can tell.

Claggy

Joined: 29 Dec 06

Posts: 560

Credit: 2798290

RAC: 2801

Strange Behaviour with GT240

21 Aug 2011 10:55:07 UTC

Message 106403

Try downgrading to the 266.58 Cuda32 drivers. The Cuda4 drivers don't like being interrupted and tend to freak out and downclock the GPU, and a reboot is needed to fix this.
Jason G over at Seti reported this to the Boinc Devs months ago, and supplied thread-safe API code, but I've yet to hear of any GPU project updating their GPU apps with thread-safe code:

Quote:

OK I will. It's quite involved, but I'll try detail first then explain further if needed.

Certain new methods by which Cuda4 drivers deal with memory & Cuda transfers are sensitive to being abruptly terminated without warning. All Windows-Boinc-Cuda app releases to date use boincApi code for their exit code, given that Boinc needs to tell applications through this channel when to snooze/resume/exit etc, as well as when the worker needs to exit normally.

Symptoms directly pertaining to effects using Cuda 4 drivers with current Boinc-Cuda applications are primarily the 'sticky downclock' problem, but also other forms of unexplained erroring out.

There are other non-Cuda related symptoms visible across non-Cuda (CPU) applications as well, most visible being truncation or erasure of the stderr.txt contents, and less visible possibly checkpoint & result files as well.

These sorts of symptoms, being apparently related to how 'nicely' the program treats the active buffer transfers when the application shuts down, seemed to be statistically more common on lower bus/memory speed systems, probably as a result of the transfers etc taking longer (i.e. higher contention).

The trial solution in testing is to install exit code within boincAPI that 'asks' the worker thread (that feeds the Cuda device etc) to shut down 'nicely', so that it can quickly finish what it is doing & tidy up before being 'killed'. At present this seems effective at preventing the downclock problem & possibly the stderr/etc truncation symptoms as well, though we're poking at it to look for unexpected issues at this time. I've relayed as much information as I can to Berkeley & will leave it in their hands.

If you experience the downclock problems, there are currently 2 options I'm aware of:
- Downgrade to driver 266.58, which is not as sensitive to its tasks being summarily executed, or
- Determine if it's a situation where you absolutely need the fix now: That would only be a possibility for this Project (other projects don't have the fix yet & may not even be aware of the issue), and only under special circumstances, as it would involve pre-alpha testing unproven code. We are a bit overworked at the moment with V7 & other development considerations, so please don't expect a rush release of this unproven code.

In any case, high throughput hosts are statistically less susceptible to this problem, so it is quite possible many hosts don't see the symptoms appear even with newer drivers & existing applications.

HTH, Jason
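To make the 'ask nicely' idea from the quote concrete, here is a minimal sketch of a cooperative shutdown. It is written in Python purely for illustration and is not the boincAPI code Jason refers to; the names `worker`, `run_and_stop` and `stop_requested` are made up for this example, and the "transfer" step only stands in for a short unit of GPU work.

```python
import threading
import time


def worker(stop_requested, log):
    """Hypothetical stand-in for the thread that feeds the GPU."""
    while not stop_requested.is_set():
        log.append("transfer")  # pretend to run one short unit of work
        time.sleep(0.01)
    # The flag was set: finish up and release resources before exiting.
    log.append("cleanup")


def run_and_stop(run_seconds=0.05, timeout=1.0):
    """Start the worker, then ask it to stop instead of killing it outright."""
    stop_requested = threading.Event()
    log = []
    thread = threading.Thread(target=worker, args=(stop_requested, log))
    thread.start()
    time.sleep(run_seconds)
    stop_requested.set()   # 'ask' the worker to shut down
    thread.join(timeout)   # wait a bounded time for it to tidy up
    return log, not thread.is_alive()


if __name__ == "__main__":
    log, stopped_cleanly = run_and_stop()
    print(log[-1], stopped_cleanly)
```

The point of the pattern is that the worker only checks the flag between units of work, so it is never cut off in the middle of a transfer, and the caller waits a bounded time rather than forever.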

Claggy

ashes999

Joined: 17 Jun 11

Posts: 9

Credit: 2125257

RAC: 0

Got it. The clock speed is

21 Aug 2011 14:47:38 UTC

Message 106404 in response to message 106403

Got it. The clock speed is downgrading from 549 to 135. How do I prevent this? This is not because of temperature; the temperature holds steady at 90C.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3001431938

RAC: 699671

RE: Got it. The clock speed

21 Aug 2011 17:19:30 UTC

Message 106405 in response to message 106404

Quote:

Got it. The clock speed is downgrading from 549 to 135. How do I prevent this? This is not because of temperature; the temperature holds steady at 90C.

There are three known workarounds.

1) Downgrade to driver 266.58
2) Avoid suspending an Einstein task mid-run. If you are sharing the card with other BOINC projects, set the 'Task Switch Interval' longer than the run time of your longest-running project CUDA app.
3) Wait for the project to re-compile the BRP3/4 app, or for somebody else to do it for them from the published sources.
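For workaround 2, the switch interval can be set in the BOINC Manager preferences or, as far as I know, in a local `global_prefs_override.xml` file in the BOINC data directory. The sketch below builds such a file's contents; it assumes the tag name is `cpu_scheduling_period_minutes`, so check that against the BOINC client documentation before relying on it. The function name `build_override` is made up for this example.

```python
import xml.etree.ElementTree as ET


def build_override(switch_minutes):
    """Build override-file contents with a longer task switch interval.

    Assumption: BOINC reads the switch interval from the
    <cpu_scheduling_period_minutes> element of <global_preferences>.
    """
    root = ET.Element("global_preferences")
    period = ET.SubElement(root, "cpu_scheduling_period_minutes")
    period.text = str(switch_minutes)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example: 600 minutes, longer than a typical GPU task run time.
    print(build_override(600))
```

After saving the output as `global_prefs_override.xml`, the client needs to re-read its preferences (or be restarted) for the change to take effect.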

FrankHagen

Joined: 13 Feb 08

Posts: 102

Credit: 272200

RAC: 0

RE: RE: Got it. The clock

21 Aug 2011 20:30:40 UTC

Message 106406 in response to message 106405

Quote:

Quote:
Got it. The clock speed is downgrading from 549 to 135. How do I prevent this? This is not because of temperature; the temperature holds steady at 90C.

There are three known workarounds.

1) Downgrade to driver 266.58
2) Avoid suspending an Einstein task mid-run. If you are sharing the card with other BOINC projects, set the 'Task Switch Interval' longer than the run time of your longest-running project CUDA app.
3) Wait for the project to re-compile the BRP3/4 app, or for somebody else to do it for them from the published sources.

4) don't try to "fix" it - just reboot if it happens.

ashes999

Joined: 17 Jun 11

Posts: 9

Credit: 2125257

RAC: 0

It happens constantly.

21 Aug 2011 22:48:40 UTC

Message 106407 in response to message 106406

It happens constantly. Changing the switch time to 9999 minutes did nothing. I'll downgrade and see what happens.

Worst case, I'm stuck with just running GPU apps on idle time.

ashes999

Joined: 17 Jun 11

Posts: 9

Credit: 2125257

RAC: 0

The solution was, very

24 Aug 2011 1:36:18 UTC

Message 106408

The solution was, very strangely, to change my PC power settings to "always on" and disable my screensaver. This is documented on my SuperUser question here.
