system with 3 GPUs constantly erroring out

Joseph Stateson
Joined: 7 May 07
Posts: 174
Credit: 3072734214
RAC: 550603
Topic 209686

https://einsteinathome.org/host/12468425

Single GTX 670 plus a pair of GTX 650 Ti; the host has 1 valid result out of 46. Could the wrong checkpoint be handed back to a task after a task switch?

Sid
Joined: 17 Oct 10
Posts: 164
Credit: 972731812
RAC: 396788

I'd say it is very unlikely. I have one system with three 750 Ti GPUs, two on x16 and one on x8.

The error rate is almost 0.

mikey
Joined: 22 Jan 05
Posts: 12702
Credit: 1839107661
RAC: 3603

BeemerBiker wrote:

https://einsteinathome.org/host/12468425

Single GTX 670 plus a pair of GTX 650 Ti; the host has 1 valid result out of 46. Could the wrong checkpoint be handed back to a task after a task switch?

Did you load the Nvidia drivers three separate times, or at least twice? With different GPUs that is often necessary so the PC doesn't get confused. One unit I looked at reported an OpenCL error near the bottom of the page.

Sebastian M. Bobrecki
Joined: 20 Feb 05
Posts: 63
Credit: 1529603097
RAC: 103

It looks like a problem with power. Maybe your PSU has insufficient power or, as these GTX 650 Ti cards probably do not have additional power connectors, the motherboard is unable to deliver the required power to the PCIe slots.

Edit: I made a mistake, the 650 Ti should have power connectors (the 650 without Ti doesn't).

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

What's the exact brand and model of each GPU... and PSU... and motherboard?

What procedure did you use to install those cards? Did you put them on the board all at once and then install the driver, or did you boot the computer after adding each card?

I would try this:

1. Remove those 650 Ti's so that only the GTX 670 is installed on the board.

2. Use Display Driver Uninstaller (Wagnard DDU) to remove all Nvidia drivers, in Safe Mode.

3. Use a registry cleaner to remove any remnants of old configurations. CCleaner Free, for example, is free software for that purpose.

4. Install new Nvidia driver, http://www.nvidia.com/download/driverResults.aspx/123219/en-us and reboot.

5. After the reboot, shut down the computer and install one 650 Ti. Power up and let the computer boot. Reboot once, then shut down the computer.

6. Install the second 650 Ti. Power up and let the computer boot. Reboot once.

7. Download GPU-Z, install it and check that it displays information for all three GPUs properly.

8. Update the BOINC client to version 7.8.2.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Looking at the stderr output for each failed task shows that only the GTX 670 is receiving (and failing) tasks:

Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 670" by: NVIDIA Corporation
Max allocation limit: 536870912
Global mem size: 2147483648
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0138L.dat
% Total amount of photon times: 30007
% Preparing toplist of length: 10
% Read 1255 binary points
read_checkpoint(): Couldn't open file 'LATeah0138L_1204.0_0_0.0_23414535_2_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1255
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 31  df1dot: 3.344368011e-015  f1dot_start: -1e-013  f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1176: clFinish failed. status=-36
error in opencl_qsort
00:00:17 (7136): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags:  PRECISION

The error is the clFinish failure (status=-36) near the end of the log. It occurs early, at the point where the application allocates GPU memory. OpenCL error -36 is CL_INVALID_COMMAND_QUEUE, and it often comes up when a Windows update breaks OpenCL. Richie's point 4 (installing a new Nvidia driver) should fix that.
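For anyone who wants to check many stderr outputs for this pattern, here is a minimal Python sketch. It assumes the log format shown in the output above. The function name `summarize_stderr` and the small error-code table are illustrative and not part of the Einstein@Home app; the numeric codes are the ones defined in the OpenCL headers.

```python
import re

# A few OpenCL status codes as defined in CL/cl.h; not an exhaustive table.
OPENCL_ERRORS = {
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Patterns assume the log lines look like the ones quoted in this thread.
DEVICE_RE = re.compile(r'Using OpenCL device "([^"]+)"')
STATUS_RE = re.compile(r"ERROR: .*?:(\d+): (\w+) failed\. status=(-?\d+)")


def summarize_stderr(text):
    """Return the device name and the first OpenCL failure found in a stderr dump."""
    device = DEVICE_RE.search(text)
    failure = STATUS_RE.search(text)
    summary = {
        "device": device.group(1) if device else None,
        "call": None,
        "status": None,
        "name": None,
    }
    if failure:
        status = int(failure.group(3))
        summary.update(
            call=failure.group(2),
            status=status,
            name=OPENCL_ERRORS.get(status, "unknown"),
        )
    return summary


sample = '''Using OpenCL device "GeForce GTX 670" by: NVIDIA Corporation
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1176: clFinish failed. status=-36
'''
print(summarize_stderr(sample))
```

Running this over the stderr text of each failed task would show at a glance whether every failure is on the same device and with the same status code.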

I don't know exactly how much memory the 1.20 app needs, but it has been suggested that about 1 GB is needed, so you may also be getting close to that limit.

I don't know why one task would have worked before. Perhaps the amount of GPU memory that can be allocated has changed, or a driver or OS update has reduced it.

For comparison, my RX-480, which has 8 GB of GPU memory, shows this in the stderr output of the same app:

Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Ellesmere" by: Advanced Micro Devices, Inc.
Max allocation limit: 4244635648
Global mem size: 5970087936
OpenCL device has FP64 support

hth

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

The memory available to OpenCL applications is not the same on Nvidia GPUs and ATI GPUs.

ATI GPUs get to use somewhere between 50% and 70% of all RAM on the card.

Nvidia, I believe, is limited to 25%.

The system also uses some memory, so always subtract a little more for the OS.
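These percentages can be checked against the two stderr outputs quoted earlier in the thread. The sketch below is only arithmetic on those reported numbers. It assumes that "Max allocation limit" is the largest single buffer the OpenCL driver will hand out (not the total usable memory), and the helper name `alloc_fraction` is made up for illustration.

```python
def alloc_fraction(max_alloc_bytes, global_mem_bytes):
    """Fraction of reported global memory that one allocation may use."""
    return max_alloc_bytes / global_mem_bytes


# (Max allocation limit, Global mem size) copied from the stderr outputs above.
devices = {
    "GeForce GTX 670": (536870912, 2147483648),
    "Ellesmere (RX-480)": (4244635648, 5970087936),
}

for name, (max_alloc, global_mem) in devices.items():
    print(f"{name}: {alloc_fraction(max_alloc, global_mem):.1%}")
```

For these two logs it comes out at 25.0% for the GTX 670 and about 71% for the RX-480, which is in line with the rough figures above.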

eXtreme Warhead
Joined: 30 Nov 05
Posts: 2
Credit: 12768163
RAC: 0

Same problem here on two different GPUs.

Yesterday every WU stopped with an error. Then I changed the config so that only one GPU runs Einstein and the other one works on another project. That ran fine.

Later yesterday I tested it again, and everything ran fine with both GPUs on Einstein.

Today I started the same PC, in the state I shut it down in yesterday when everything worked perfectly, and now I only get errors again. Every WU stopped with a calculation error until only one was left: the one that was started yesterday on the primary GPU, and that one runs perfectly.

So I had to get new WUs, which failed every time on the other GPU. I then restarted BOINC on that PC and now it works again on both GPUs! It's ridiculous if even the smallest system fart or something can cause errors for the application!

The GPUs are a 780 Ti and a 660 Ti.
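For reference, keeping one project off one GPU is normally done with an `<exclude_gpu>` entry in BOINC's cc_config.xml. Below is a minimal Python sketch that builds such a file. The project URL and the device number are assumptions (the URL is guessed from the project directory name in the stderr output earlier in the thread); both must match what your own BOINC client reports at startup, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url, device_num, gpu_type="NVIDIA"):
    """Build a cc_config.xml tree that keeps one project off one GPU."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    ET.SubElement(exclude, "type").text = gpu_type
    return root


# Assumed values: adjust the URL and device number to your own setup.
config = build_exclude_gpu_config("http://einstein.phys.uwm.edu/", device_num=1)

# Pretty-print where available (ET.indent was added in Python 3.9).
if hasattr(ET, "indent"):
    ET.indent(config)

print(ET.tostring(config, encoding="unicode"))
```

The printed XML would go into cc_config.xml in the BOINC data directory, after which the client needs to re-read its config files or be restarted.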

mikey
Joined: 22 Jan 05
Posts: 12702
Credit: 1839107661
RAC: 3603

eXtreme Warhead wrote:

Same problem here on two different GPUs.

Yesterday every WU stopped with an error. Then I changed the config so that only one GPU runs Einstein and the other one works on another project. That ran fine.

Later yesterday I tested it again, and everything ran fine with both GPUs on Einstein.

Today I started the same PC, in the state I shut it down in yesterday when everything worked perfectly, and now I only get errors again. Every WU stopped with a calculation error until only one was left: the one that was started yesterday on the primary GPU, and that one runs perfectly.

So I had to get new WUs, which failed every time on the other GPU. I then restarted BOINC on that PC and now it works again on both GPUs! It's ridiculous if even the smallest system fart or something can cause errors for the application!

The GPUs are a 780 Ti and a 660 Ti.

I would guess your system is getting confused between the two different GPUs, and the order in which they start crunching could make all the difference between valid and invalid workunits. Personally I would run one project on one GPU and a different project on the other to work around the problem. If that isn't in your plans, you might have to find a way to suspend one GPU on startup and enable it after the other GPU is already crunching. I have no clue how to do that, though.

eXtreme Warhead
Joined: 30 Nov 05
Posts: 2
Credit: 12768163
RAC: 0

No, that can't be it, because the primary GPU keeps working on its WU while the secondary card crashes every WU at around 0,5%. It crashes WUs over and over again while, at the same time, the primary still works fine.

You can restart BOINC multiple times: no change.

You can reboot the whole PC, and if you're lucky it runs normally on the secondary as well on the first try and stays that way hours later. If you're unlucky, you have to try multiple reboots.

The system is a BOINC-only machine without other applications.

 
