Computation error on GPUs

A Lang

Joined: 26 Dec 18

Posts: 4

Credit: 14097460

RAC: 0

24 Jan 2019 5:38:30 UTC

Topic 218001

I have 3 computers running boinc-client 7.12.0+dfsg-1. On one computer I get a computation error on the GPU tasks after about 16-17 seconds of runtime (as shown in boinc-manager).

log output:

06:16:42 [Einstein@Home] [coproc] NVIDIA instance 0; 1.000000 pending for LATeah2103L_1204.0_0_0.0_403180_0
06:16:42 [Einstein@Home] [coproc] NVIDIA instance 1: confirming 1.000000 instance for LATeah2103L_1204.0_0_0.0_403180_0
06:16:42 hp-z600 boinc[18838]: No protocol specified
06:16:43 hp-z600 boinc[18838]: No protocol specified
06:16:43 [Einstein@Home] Computation for task LATeah2103L_1204.0_0_0.0_403180_0 finished
06:16:43 [Einstein@Home] Output file LATeah2103L_1204.0_0_0.0_403180_0_0 for task LATeah2103L_1204.0_0_0.0_403180_0 absent
06:16:43 [Einstein@Home] Output file LATeah2103L_1204.0_0_0.0_403180_0_1 for task LATeah2103L_1204.0_0_0.0_403180_0 absent

I have Ubuntu 18.10 and 2 Nvidia Quadro 600 cards with driver Nvidia 390.87 on that machine.

I need some help to get the GPU tasks to compute.

/Anders

Logforme

Joined: 13 Aug 10

Posts: 332

Credit: 1714373961

RAC: 0

24 Jan 2019 8:04:42 UTC

Message 169044

Please enable the "Should Einstein@Home show your computers on its web site?" setting on the page https://einsteinathome.org/account/prefs/privacy so people can help you diagnose the problem.

A Lang

Joined: 26 Dec 18

Posts: 4

Credit: 14097460

RAC: 0

24 Jan 2019 19:58:20 UTC

Message 169055

The setting has been changed to show my computers. The trouble machine is the HP-Z600. The HP-Z400 machine works fine with an Nvidia GT1030 graphics card. The uplinksrv is a VM without any GPU.

/Anders

MarkJ

Joined: 28 Feb 08

Posts: 437

Credit: 139002861

RAC: 0

25 Jan 2019 5:44:46 UTC

Message 169064

The Quadros only have 1 GB of memory. I am not sure if that’s enough for the Einstein GPU apps.

Logforme

Joined: 13 Aug 10

Posts: 332

Credit: 1714373961

RAC: 0

25 Jan 2019 7:51:42 UTC

Message 169065

It does look like a memory problem on the GPU. Looking at the result of one of the failed tasks I see:

Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_WRITE_BUFFER on Quadro 600 (Device 0).

Since it worked before I can only think of 2 causes:

1. You changed something on your GPU that leaves less memory for E@H (e.g. running other stuff in parallel to E@H).

2. E@H changed the tasks so they consume more memory. Since the tasks are progressing up the frequency band maybe they require more memory? I don't know.
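
To tell the two causes above apart, it can help to see how much memory each card has free while nothing from E@H is running. The following is only a rough sketch, assuming an NVIDIA driver with the nvidia-smi tool installed; the helper names parse_gpu_memory and query_gpu_memory are made up for this illustration and are not part of BOINC or Einstein@Home.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'name, total, used' rows (values in MiB) into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name would not break parsing.
        name, total, used = (field.strip() for field in line.rsplit(",", 2))
        gpus.append({
            "name": name,
            "total_mib": int(total),
            "used_mib": int(used),
            "free_mib": int(total) - int(used),
        })
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory figures (assumes nvidia-smi is on PATH)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling query_gpu_memory() on the affected machine returns one entry per card. If free_mib is already low before any task starts, something else is holding GPU memory (cause 1); if it is close to total_mib and tasks still fail, the tasks themselves probably need more than the card has (cause 2).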

A Lang

Joined: 26 Dec 18

Posts: 4

Credit: 14097460

RAC: 0

25 Jan 2019 12:01:13 UTC

Message 169074

I have changed the GPU on that computer from one 2 GB GPU to two 1 GB GPUs. I think I need to upgrade to a 2 GB GPU card again; I've got an Nvidia Quadro P620 2 GB on its way in the mail.

Thanks for your help!

/Anders

kb9skw

Joined: 25 Feb 05

Posts: 21

Credit: 378431045

RAC: 16971

8 Feb 2019 1:41:21 UTC

Message 169383

I also have some computation errors popping up on a new bit of hardware.

I built a new crunching-only PC: an older C2D Pentium Dual-Core with two RX 570 GPUs. Both GPUs are at their stock frequencies. It has completed 229 tasks but I have 34 with an error. Any clue what is up?

https://einsteinathome.org/host/12765822/tasks/6/0

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119739903516

RAC: 25612440

8 Feb 2019 4:03:18 UTC

Message 169384 in response to message 169383

kb9skw wrote:

... Any clue what is up?

https://einsteinathome.org/host/12765822/tasks/6/0

Did you click on the task ID link for one of the failed tasks? If you do, you will see something like

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
couldn't start app: Input file templates_LATeah1044L_0172_41675312.dat missing or invalid: md5 checksum failed for file</message>
]]>

This seems to imply that a downloaded template file has a bad checksum. Do you have anti-virus software that perhaps is interfering? Have you done a cleanup and perhaps deleted some files? You could investigate in the Einstein project directory to see if the template file named in the message actually exists. If it does, you could get a utility that can determine the MD5 checksum and see if the value agrees with what is stored in the state file (client_state.xml) for that particular template. The best time to do this would be immediately after a task fails and before it gets uploaded, reported and deleted. If you turn off network comms, so the failed task can't be dealt with, you would have the opportunity to really check what is causing the checksum failure.

You would need to source a suitable utility for calculating MD5 checksums under Windows. I have no idea what that might be. For Linux, I use a utility called md5sum if I need to verify a checksum.
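
If no ready-made utility is at hand, the check described above can also be done with a few lines of Python, which runs on both Windows and Linux. This is only a sketch: the function names md5_of_file and checksum_matches are invented for this example, and the expected value is assumed to have been copied by hand from the entry for that template in client_state.xml.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the file's MD5 with an expected hex string, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, checksum_matches("templates_LATeah1044L_0172_41675312.dat", expected) would be run from inside the Einstein project directory, with expected holding the value taken from the state file. A result of False would confirm that the file on disk differs from what the server sent.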

Cheers,
Gary.

kb9skw

Joined: 25 Feb 05

Posts: 21

Credit: 378431045

RAC: 16971

10 Feb 2019 13:44:13 UTC

Message 169409

Thanks Gary

The problem does not seem to have happened again, so I am not going to worry about it.
