I have 3 computers running boinc-client 7.12.0+dfsg-1. On one of them I get a computation error on the GPU tasks after about 16-17 seconds of runtime (as shown in boinc-manager).
Log output:
06:16:42 [Einstein@Home] [coproc] NVIDIA instance 0; 1.000000 pending for LATeah2103L_1204.0_0_0.0_403180_0
06:16:42 [Einstein@Home] [coproc] NVIDIA instance 1: confirming 1.000000 instance for LATeah2103L_1204.0_0_0.0_403180_0
06:16:42 hp-z600 boinc[18838]: No protocol specified
06:16:43 hp-z600 boinc[18838]: No protocol specified
06:16:43 [Einstein@Home] Computation for task LATeah2103L_1204.0_0_0.0_403180_0 finished
06:16:43 [Einstein@Home] Output file LATeah2103L_1204.0_0_0.0_403180_0_0 for task LATeah2103L_1204.0_0_0.0_403180_0 absent
06:16:43 [Einstein@Home] Output file LATeah2103L_1204.0_0_0.0_403180_0_1 for task LATeah2103L_1204.0_0_0.0_403180_0 absent
I have Ubuntu 18.10 and 2 Nvidia Quadro 600 cards with driver Nvidia 390.87 on that machine.
I need some help to get the GPU tasks to compute.
/Anders
Copyright © 2024 Einstein@Home. All rights reserved.
Please enable the "Should Einstein@Home show your computers on its web site?" setting on the page https://einsteinathome.org/account/prefs/privacy so people can help you diagnose the problem
The setting has been changed to show my computers. The trouble machine is the HP-Z600. The HP-Z400 machine works fine with an Nvidia GT1030 graphics card. The uplinksrv is a VM without any GPU.
/Anders
The Quadros only have 1 GB of memory. I am not sure if that’s enough for the Einstein GPU apps.
BOINC blog
It does look like a memory problem on the GPU. Looking at the result of one of the failed tasks I see:
Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_WRITE_BUFFER on Quadro 600 (Device 0).
Since it worked before I can only think of 2 causes:
1. You changed something on your GPU that leaves less memory for E@H. (e.g. run other stuff in parallel to E@H).
2. E@H changed the tasks so they consume more memory. Since the tasks are progressing up the frequency band maybe they require more memory? I don't know.
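To check the first possibility, one could look at how much GPU memory is already in use before an Einstein task starts. The following is only a rough sketch in Python: it assumes the `nvidia-smi` tool that ships with the Nvidia driver is on the PATH and that the driver reports memory figures for the card (older cards may report `[N/A]`, which is returned here as `None`). The function names are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse rows of 'index, name, memory.total, memory.used' (values in MiB)."""

    def to_int(value):
        # nvidia-smi prints e.g. "[N/A]" when a field is not supported.
        return int(value) if value.isdigit() else None

    gpus = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # skip anything that is not a complete row
        index, name, total, used = fields
        total_mib, used_mib = to_int(total), to_int(used)
        free_mib = None
        if total_mib is not None and used_mib is not None:
            free_mib = total_mib - used_mib
        gpus.append({
            "index": to_int(index),
            "name": name,
            "total_mib": total_mib,
            "used_mib": used_mib,
            "free_mib": free_mib,
        })
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory; returns [] if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` with BOINC suspended and again while a task is starting would show whether something else (a desktop session, another project) is already holding a large share of the 1 GB.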
I changed the GPU setup on that computer from one 2 GB GPU to two 1 GB GPUs. I think I need to upgrade to a 2 GB card again; I have an Nvidia Quadro P620 2 GB on its way in the mail.
Thanks for your help!
/Anders
I also have some computation errors popping up on a new bit of hardware.
I built a new crunching-only PC, an older C2D Pentium Dual core with two RX 570 GPUs. Both GPUs are at their stock frequencies. It has completed 229 tasks, but 34 have ended with an error. Any clue what is up?
https://einsteinathome.org/host/12765822/tasks/6/0
kb9skw wrote: ... Any clue what is up?
Did you click on the task ID link for one of the failed tasks? If you do, you will see something like
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
couldn't start app: Input file templates_LATeah1044L_0172_41675312.dat missing or invalid: md5 checksum failed for file</message>
]]>
This seems to imply that a downloaded template file has a bad checksum. Do you have anti-virus software that perhaps is interfering? Have you done a cleanup and perhaps deleted some files? You could investigate in the Einstein project directory to see if the template file named in the message actually exists. If it does, you could get a utility that can determine the MD5 checksum and see if the value agrees with what is stored in the state file (client_state.xml) for that particular template. The best time to do this would be immediately after a task fails and before it gets uploaded, reported and deleted. If you turn off network comms, so the failed task can't be dealt with, you would have the opportunity to really check what is causing the checksum failure.
You would need to source a suitable utility for calculating MD5 checksums under Windows. I have no idea what that might be. For Linux, I use a utility called md5sum if I need to verify a checksum.
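If no ready-made utility is at hand, the check described above can also be done with a few lines of Python, which runs the same way on Windows and Linux. This is a minimal sketch, not anything BOINC itself provides: the function names are invented here, and the expected value has to be copied by hand from the state file (I believe it sits in an `md5_cksum` element next to the file name in client_state.xml, but treat that element name as an assumption and check your own file).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 with the value recorded for it (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `checksum_matches(path_to_template, value_from_state_file)` returning `False` for the template named in the error message would confirm that the file on disk really differs from what the client expects. Both arguments here are placeholders to be filled in from your own system.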
Cheers,
Gary.
Thanks Gary
The problem does not seem to have happened again, so I am not going to worry about it.