For some days now, one of my BOINC hosts (ID: 3768222) has been reporting
"[error] Can't parse workunit in scheduler reply: unexpected XML tag or syntax
No close tag in scheduler reply"
instead of downloading and calculating tasks.
Unfortunately it is not obvious to me which XML file contains the error, or at which line (it might be as simple as adding a missing close tag, if it is only an XML problem).
How can I fix this?
Some history that might be related: the allowed amount of hard disk space (10 GB) was completely used up, and there was no space left for new tasks anymore.
After all jobs were done, I tried to reset the project.
This doesn't sound like a problem that you can fix on your end; it is rather on the server side.
Could you try to make the file sched_reply_einstein.phys.uwm.edu.xml available to me?
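(In case it helps to locate it: the BOINC client usually keeps sched_reply_einstein.phys.uwm.edu.xml in its data directory, e.g. /var/lib/boinc-client/ on many Linux package installs or C:\ProgramData\BOINC\ on recent Windows versions. These paths are just typical examples and may differ depending on how your client was installed.)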
BM
It has been sent by email to your AEI address...
I documented this exact problem two months ago in this message but didn't get a response at the time. I also provided a link to the workunit, which still works today. Every time the scheduler attempted to resend the task, the send date was updated, so the task would never expire.
Some time later, I got sick of seeing the error message so I ran the work cache dry and reset the project, thinking that would get rid of the problem. I had been running beta test (OpenCL) tasks on that machine and had decided to stop the beta test anyway.
Even after a full project reset, and even though the machine had been shifted to a venue where CasA was not selected, the server still tried to send this particular task. This interfered quite a bit with receiving and reporting other work (BRP5), but there were enough occasions where the machine requested BRP5 only, without the server trying to send the lost CasA task, that it was able to maintain a work supply despite the error message. (No work could actually be received or successfully reported whenever the 'lost task problem' was part of the exchange.)
After a while, I forgot about the problem until I saw this fresh report today. You have obviously 'fixed' something as a result of this report, since the error message has now stopped on my host and the lost task has finally been delivered after two months of the error message. I've checked my machine to see what happened. The task was delivered as a CPU task and not as a beta test GPU task. Because I'd reset the project at the time, there were 108 large data files, plus the apps and the ancillary files like sun, earth, etc., sent with the task. Quite a huge download for just one task :-).
Was the problem related to the xml_doc max buffer size, as mentioned in my original message?
Cheers,
Gary.
Sorry. Must have missed that.
Indeed. We come across this problem occasionally; the first implementation of the current solution dates back to 2010.
A GW run usually progresses from lower to higher frequency data. Due to the nature of the analysis (the effect of the "spindown"), at higher frequencies we need more data files per task to cover a larger frequency range. For each data file there are two entries (file_info and file_ref) in each workunit. In addition, for each file there is one (full) URL per available download mirror. So as the run progresses, the buffer that holds the XML blob of a task fills up more and more.
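To illustrate what fills that buffer (a rough sketch only; the file name and mirror hostnames below are made up, not taken from a real reply), each data file adds roughly this much text to the scheduler reply, once as a file_info block and once as a file_ref inside the workunit:

<file_info>
  <name>h1_1234.50_S6CasA</name>
  <url>http://mirror1.example.org/data/h1_1234.50_S6CasA</url>
  <url>http://mirror2.example.org/data/h1_1234.50_S6CasA</url>
  <url>http://mirror3.example.org/data/h1_1234.50_S6CasA</url>
  ...
</file_info>
...
<workunit>
  ...
  <file_ref>
    <file_name>h1_1234.50_S6CasA</file_name>
  </file_ref>
  ...
</workunit>

Multiply that by the number of data files a high-frequency task needs, and by one full URL line per mirror the client is shown, and the blob grows quickly.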
Currently we cut down the size by limiting the download URLs that are transmitted to the clients for each file. Although we have five mirrors distributed around the world, each client only sees the n nearest. Last night I lowered n from 3 to 2, which apparently fixed the problem, hopefully until the end of the S6CasA run.
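As a back-of-the-envelope illustration (all numbers here are assumptions, not measurements): if a late-run task references about 100 data files and each full mirror URL takes roughly 80 characters, then showing one mirror less per file saves on the order of

  100 files x 1 URL x ~80 characters ≈ 8,000 characters

per scheduler reply, which can be enough to keep the XML blob under a fixed buffer limit.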
I do hope that by the end of S6CasA (or at least before its successor starts) Einstein@Home will be running the server software that we are currently testing on Albert@Home, and that this version no longer exhibits the problem.
BM