It started a couple of days ago... on routine work fetches I'm getting these lines in the event log:
20-Aug-2023 16:44:06 [Einstein@Home] Sending scheduler request: To fetch work.
20-Aug-2023 16:44:06 [Einstein@Home] Requesting new tasks for CPU and NVIDIA GPU
20-Aug-2023 16:44:15 [---] app version refers to missing GPU type ibo,GBT,long) is not available for your type of computer.
20-Aug-2023 16:44:15 [Einstein@Home] Scheduler request completed: got 2 new tasks
20-Aug-2023 16:44:15 [Einstein@Home] App version uses non-existent ibo,GBT,long) is not available for your type of computer. GPU
... bold text added for emphasis ...
It appears that (Arecibo,GBT,long) got mangled somehow and came through as just ...ibo,GBT,long), and probably more text is missing, but it's not clear to me whether it's just me or whether it is an E@H server thing. I can't seem to find out which app it is referring to. If anybody else is seeing something like this then it's not just me... OTOH, if it IS JUST ME... is a project reset the best way to recover? I did think of deleting "the app" and letting BOINC reload it. The only reference to (Arecibo,GBT,long) that I can find is in the client_state.xml file, where it is shown as the "user friendly name" for the BRP4G app. Would it make sense to delete that app and see if BOINC recovers? Or just do a project reset to cover all bases? Meanwhile, FGRP5 and O3AS work is continuing normally.
I'm not quite sure what has happened, but I'll try to offer some explanations as to what I think it could be.
First, I'd recommend that you upgrade your client from 7.14 to at least 7.18. If I'm correct, the outdated client may have something to do with your missing GPU, because it is no longer recognizing your computer and therefore cannot see your GPU. In addition, BOINC no longer uses HTTP for addressing your computer; it only sees HTTPS, which was not in operation at the time you had 7.14 installed.
Second, I'd recommend that you reduce your project selections to one and see if it does recognize it again. Yes, you do have 6GB in your 1060 GPU, but you are using only 4GB of it. Some of these projects take over 4GB now and therefore won't recognize your GPU.
Try using one at a time until you get one that works.
If this does or does not work, we'd appreciate it if you get back to us and tell us the good or bad news.
Proud member of the Old Farts Association
A reset will delete every Einstein task you have on your PC and force you to get all new ones.
Quote: Yes, you do have 6GB in your 1060 GPU, but you are using only 4GB of it.
This is incorrect. Displaying only 4GB of memory for the 1060 GPU is just a flaw in older BOINC versions; it was remedied in later versions, which use 64-bit calls for probing memory. The OP really should upgrade their client.
The GPU applications will use ALL of the card's installed memory regardless of what BOINC reports, if the application is well designed.
See below for some more recent diagnostic searching...
But first, responding to some of the suggestions/issues in the responses.
[gwgeorge & keith] suggest upgrading the client (to something later than my 7.14.2). I am intentionally holding at the 7.14.2 version for two reasons: (1) using "project_max_concurrent" in the app_config.xml file fails catastrophically in later versions - see other threads regarding downloading work units endlessly when using that parameter; (2) the later versions do NOT have, in the FILE pull-down menu, "shutdown connected client" and "exit BOINC manager". I find both of those functions very useful to shut down, and resume, BOINC gracefully in my setup with two instances of boinc and boincmgr running different projects.
[gwgeorge] The https: configuration is set up in the project global_prefs.xml file and as far as I know is not dependent on the client version. And, anyway, that part of the server link is working properly. And, to clarify, e@h is not failing to detect the GPU. It is running O3AS (opencl) tasks normally.
[mikey] Yes, I know all the bad things a project "reset" would do. I would do an NNT and drain the cache before going down that path. But if nothing else helps then that is always an option.
[keith] Following up on your 4GB reporting limit in older clients... I'm finding all kinds of <gpu_ram> parameters reported in the client_state.xml file. 7.864G for the FGRPB1G app down to 2.004G for the O3MDF app. And in <coproc_cuda> parameters <available_ram> is 4.167G while <coproc_opencl> shows <global_mem_size> as 6.359G. Never looked at that stuff before and I have no idea where those numbers come from.
Some additional file scanning gave some interesting (relevant?) information. These 2 lines from client_state.xml.
<name>einsteinbinary_BRP4G</name>
<user_friendly_name>Binary Radio Pulsar Search (Arecibo,GBT,long)</user_friendly_name>
and these lines from sched_reply_einstein.phys.uwm.edu.xml.
<coproc>
<type>ibo,GBT,long) is not available for your type of computer.</type>
<count>647500445489094944987862487032585421213412207048870776038754297507971394031017770238076521610590413666285928503412500294475737461178726716849130298534562777624215719772160.000000</count>
</coproc>
SORRY about that exceedingly long line. It's what was in the sched_reply file !!!
Something is terribly wrong here. As I understand it, a sched_request goes up to the server and it responds with a sched_reply. There is nothing like a ...(Arecibo,GBT,long)... in the sched_request, so where does that mangled reply come from? And what's with that ~200 digit "count" in the reply? That's a lot of coprocessors...<grin>!
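For anyone who wants to check their own sched_reply file for the same thing, here is a minimal Python sketch. It is only a rough screen, not anything taken from BOINC itself: the function name, the "type should be a single word" rule and the MAX_PLAUSIBLE_COUNT threshold are all assumptions made for illustration. It uses a regular expression rather than an XML parser because these files are not guaranteed to be strictly well-formed.

```python
import re

# Matches one <coproc> block holding a <type> and a <count>, as in the excerpt above.
COPROC_RE = re.compile(
    r"<coproc>\s*<type>(?P<type>.*?)</type>\s*<count>(?P<count>.*?)</count>\s*</coproc>",
    re.DOTALL,
)

# Assumption: no real host needs more than this many coprocessors per task.
MAX_PLAUSIBLE_COUNT = 64.0


def find_suspicious_coprocs(xml_text):
    """Return (type, count) pairs for <coproc> blocks that look corrupted."""
    suspicious = []
    for match in COPROC_RE.finditer(xml_text):
        ctype = match.group("type").strip()
        try:
            count = float(match.group("count"))
        except ValueError:
            count = float("nan")
        # Assumption: a sane type is one token of letters, digits or underscores.
        bad_type = re.fullmatch(r"[A-Za-z0-9_]+", ctype) is None
        # NaN fails this comparison too, so unparsable counts are flagged.
        bad_count = not (0.0 < count <= MAX_PLAUSIBLE_COUNT)
        if bad_type or bad_count:
            suspicious.append((ctype, count))
    return suspicious


# Usage (the path depends on your BOINC data directory):
# with open("sched_reply_einstein.phys.uwm.edu.xml", errors="replace") as f:
#     for ctype, count in find_suspicious_coprocs(f.read()):
#         print(repr(ctype), count)
```

An empty result only means nothing tripped these two heuristics; it does not prove the reply is healthy.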
I've set NNT with the expectation that a project reset may be the best/only recovery. This error condition does not occur on every work request. As best as I can deduce, it is only when the server is trying to send me a BRP4G task, which does not happen on every work request.
Aren't computers fun...?
I was about to post exactly the same issue. I have been seeing this problem on a new machine I just attached to E@H which is failing on the hsgamma_FGRP5 task.
The problem is there's garbage in the <coproc> tag in the client_state.xml file for this app which shouldn't be there:
<app_name>hsgamma_FGRP5</app_name>
<version_num>108</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>1.000000</avg_ncpus>
<flops>1000000000.000000</flops>
<plan_class>FGRPSSE</plan_class>
...........
<coproc>
<type>ibo,GBT,long) is not available for your type of computer.</type>
<count>647500445489094944987862487032585421213412207048870776038754297507971394031017770238076521610590413666285928503412500294475737461178726716849130298534562777624215719772160.000000</count>
</coproc> etc
Notice that hsgamma is defined in an <app_version> block.
The text in bold exactly matches the error message I see in the system logs & boincmgr
If I look on another machine I have which is successfully running the hsgamma app, then I do NOT have the <coproc> block for this app_version.
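To make that comparison between machines less of an eyeball job, something like the following Python sketch could list every <app_version> block in client_state.xml and say whether it carries a <coproc> block. The function and helper names are made up for illustration, and it assumes the tags appear as in the excerpt above; it reads the file as text with regular expressions instead of relying on a strict XML parser.

```python
import re

# One <app_version> ... </app_version> block, as in client_state.xml.
APP_VERSION_RE = re.compile(r"<app_version>(.*?)</app_version>", re.DOTALL)


def _tag(block, name):
    """Return the text of the first <name>...</name> in block, or None."""
    m = re.search(rf"<{name}>(.*?)</{name}>", block, re.DOTALL)
    return m.group(1).strip() if m else None


def summarize_app_versions(state_text):
    """List app name, plan class and coproc type (None if absent) per app_version."""
    rows = []
    for match in APP_VERSION_RE.finditer(state_text):
        block = match.group(1)
        coproc = re.search(r"<coproc>(.*?)</coproc>", block, re.DOTALL)
        rows.append(
            {
                "app_name": _tag(block, "app_name"),
                "plan_class": _tag(block, "plan_class"),
                "coproc_type": _tag(coproc.group(1), "type") if coproc else None,
            }
        )
    return rows


# Usage (run on each machine and compare the output):
# with open("client_state.xml", errors="replace") as f:
#     for row in summarize_app_versions(f.read()):
#         print(row)
```

On the machine described here, a CPU-only plan class such as FGRPSSE showing a non-None coproc_type would be the thing to look for.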
So it looks as if the project is sending out a malformed app description, or, something very weird happened on my machine (but now I know it's not just me!)
I will run down the existing tasks and try a project reset to see if that cures it. However, that may not explain why it happened, which I am curious about as I work with CPDN.
Detaching/attaching the project didn't solve the problem.
I cleared running tasks. I then removed the project; checked that all instances of hsgamma had gone from client_state.xml; reattached and watched the log.
And again I see:
It appears E@H is responsible.
Maybe it's related to specific hardware? In this case the machine is a 5900x + 1650 card. I have another machine 12400 + 1650 which doesn't have this issue.
Can someone at E@H investigate this? It appears the server is adding a corrupt <coproc> XML block. I think I've done all I can here.
I made some changes to the server code last week, in particular to how the coproc usage is communicated to the client, in order to get the Apple M GPU app version delivered and working. Likely something went wrong there.
1. Can you find out and report when you started getting this error, as precisely as possible?
2. Does this happen on Macs only?
3. Does this actually hinder work fetch, or is it just a strange error?
Thanks a lot for reporting!
BM
I just found a flaw in the code (uninitialized variable) and fixed it. Does the problem persist?
BM
That's fixed it. I reattached to E@H; none of the previous errors now appear in the logs, and hsgamma tasks are running normally.
Thanks for the quick response. Appreciated.
I was going nuts, thinking there was a problem on my end. Two machines were suffering this problem over the past week or so. But things seem to be stabilising. Thanks for the information.
Soli Deo Gloria