[error] Can't parse workunit in scheduler reply

Olaf
Joined: 16 Sep 06
Posts: 26
Credit: 190763630
RAC: 0
Topic 197638

For some days now, one of my BOINC hosts
(ID: 3768222) has been reporting

"[error] Can't parse workunit in scheduler reply: unexpected XML tag or syntax
No close tag in scheduler reply"

instead of calculating tasks.
Unfortunately it is not obvious to me which XML file has the bug, or at which line (it might be simple to add a missing close tag, if it's only XML).

How to fix this?

Some history that might be related: the allowed amount of hard disk space (10 GB) was completely used up,
and there was no room left for new tasks.
After all jobs were done, I tried to reset the project.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4341
Credit: 252634559
RAC: 35595

[error] Can't parse workunit in scheduler reply

This doesn't sound like a problem that you can fix on your end; it is rather on the server side.

Could you try to make the file sched_reply_einstein.phys.uwm.edu.xml available to me?
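
(For reference: the BOINC client keeps this file in its data directory; typical default locations are /var/lib/boinc-client/ on Linux and C:\ProgramData\BOINC\ on Windows, though the exact path depends on the installation.)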

BM

Olaf
Joined: 16 Sep 06
Posts: 26
Credit: 190763630
RAC: 0

It was sent by email to your

It was sent by email to your AEI address...

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 118996525099
RAC: 24373285

RE: This doesn't sound like

Quote:

This doesn't sound like a problem that you can fix on your end; it is rather on the server side.

Could you try to make the file sched_reply_einstein.phys.uwm.edu.xml available to me?

BM


I documented this exact problem two months ago in this message, but didn't get a response at the time. I also provided a link to the workunit, which still works today. Every time the scheduler attempted to resend the task, the send date was updated, so the task would never expire.

Some time later, I got sick of seeing the error message, so I ran the work cache dry and reset the project, thinking that would get rid of the problem. I had been running beta test (OpenCL) tasks on that machine and had decided to stop the beta test anyway.

Even after a full project reset, and even though the machine had been shifted to a venue where CasA was not selected, the server still tried to send this particular task. This interfered quite a bit with receiving and reporting other work (BRP5), but there were enough occasions where the machine requested BRP5 only, without the server trying to resend the lost CasA task, that it was able to maintain a work supply despite the error message. (No work could actually be received or successfully reported if the 'lost task problem' was part of the exchange.)

After a while, I forgot about the problem until I saw this fresh report today. You have obviously 'fixed' something as a result of this report, since the error message has now stopped on my host and the lost task has finally been delivered after two months. I've checked my machine to see what happened. The task was delivered as a CPU task and not as a beta test GPU task. Because I'd reset the project at the time, 108 large data files, plus the apps and the ancillary files like sun, earth, etc., were sent along with the task. Quite a huge download for just one task :-).

Was the problem to do with the xml_doc max buffer size, as mentioned in my original message?

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4341
Credit: 252634559
RAC: 35595

RE: I documented this exact

Quote:
I documented this exact problem two months ago in this message but didn't get a response at the time.

Sorry. Must have missed that.

Quote:
Was the problem to do with xml_doc max buffer as mentioned in my original message?

Indeed. We come across this problem occasionally; the first implementation of the current solution dates back to 2010.

A GW run usually progresses from lower- to higher-frequency data. Due to the nature of the analysis (the effect of the "spindown"), at higher frequencies we need more data files per task to cover a larger frequency range. For each data file there are two entries (file_info and file_ref) in each workunit. In addition, for each file there is one (full) URL per available download mirror. So as the run progresses, the buffer that holds the XML blob of a task fills up more and more.
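
To illustrate (a rough sketch only; the tag structure follows the standard BOINC scheduler reply, but the file name, checksum, size and mirror hosts below are invented), each data file adds a pair of entries like this to the workunit:

<!-- illustrative sketch; all values invented -->
<file_info>
    <name>h1_0840.60_S6CasA</name>
    <url>http://einstein-dl1.example.edu/data/h1_0840.60_S6CasA</url>
    <url>http://einstein-dl2.example.edu/data/h1_0840.60_S6CasA</url>
    <md5_cksum>d41d8cd98f00b204e9800998ecf8427e</md5_cksum>
    <nbytes>4000000</nbytes>
</file_info>
<file_ref>
    <file_name>h1_0840.60_S6CasA</file_name>
</file_ref>

Multiply that by dozens of data files per task, with one full URL line per mirror inside each file_info, and the blob grows quickly.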

Currently we cut the size down by limiting the number of download URLs transmitted to the clients for each file. Although we have five mirrors distributed around the world, each client only sees the n nearest ones. Last night I lowered n from 3 to 2, which apparently fixed the problem, hopefully for the rest of the S6CasA run.
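
As a rough back-of-envelope illustration (the per-URL byte figure is an assumption; the 108 data files per task are taken from Gary's report above): at roughly 100 bytes per full URL, dropping from 3 to 2 URLs per file saves about 108 × 100 ≈ 11 KB of XML, which can be just enough to fit the blob back under a fixed buffer limit (the xml_doc fields in the BOINC database are capped at a compile-time BLOB_SIZE, traditionally 64 KB).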

I do hope that by the end of S6CasA (or at least before its successor starts), Einstein@Home will be running the server software that we are currently testing on Albert@Home, and that that version no longer exhibits this problem.

BM
