[error] Can't parse workunit in scheduler reply

Olaf
Joined: 16 Sep 06
Posts: 26
Credit: 190763630
RAC: 0
Topic 197638

For some days now, one of my BOINC hosts
(ID: 3768222) has been reporting

"[error] Can't parse workunit in scheduler reply: unexpected XML tag or syntax
No close tag in scheduler reply"

instead of calculating tasks.
Unfortunately it is not obvious to me which XML file has the bug, or at which line (it might be simple to add a missing close tag, if it's only XML).

How to fix this?

Some history that might be related: the allowed amount of hard disk space (10 GB) was completely used up,
and there was no room left for new tasks.
After all jobs were done, I tried to reset the project.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4341
Credit: 252634559
RAC: 35595

[error] Can't parse workunit in scheduler reply

This doesn't sound like a problem that you can fix on your end; it is rather on the server side.

Could you try to make the file sched_reply_einstein.phys.uwm.edu.xml available to me?
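
(For reference: the BOINC client keeps this file in its data directory; typical default locations are /var/lib/boinc-client/ on Linux and C:\ProgramData\BOINC\ on Windows, though the exact path depends on the installation.)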

BM

Olaf
Joined: 16 Sep 06
Posts: 26
Credit: 190763630
RAC: 0

It was sent by email to your

It was sent by email to your AEI address...

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 118996525099
RAC: 24373285

RE: This doesn't sound like

Quote:

This doesn't sound like a problem that you can fix on your end; it is rather on the server side.

Could you try to make the file sched_reply_einstein.phys.uwm.edu.xml available to me?

BM


I documented this exact problem two months ago in this message, but didn't get a response at the time. I also provided a link to the workunit, which still works today. Every time the scheduler attempted to resend the task, the send date was updated, so the task would never expire.

Some time later, I got sick of seeing the error message, so I ran the work cache dry and reset the project, thinking that would get rid of the problem. I had been running beta test (OpenCL) tasks on that machine and had decided to stop the beta test anyway.

Even after a full project reset, and even though the machine had been shifted to a venue where CasA was not selected, the server still tried to send this particular task. This interfered quite a bit with receiving and reporting other work (BRP5), but there were enough occasions where the machine requested BRP5 only, without the server trying to resend the lost CasA task, that it was able to maintain a work supply despite the error message. (No work could actually be received or successfully reported if the 'lost task problem' was part of the exchange.)

After a while, I forgot about the problem until I saw this fresh report today. You have obviously 'fixed' something as a result of this report, since the error message has now stopped on my host and the lost task has finally been delivered after two months. I've checked my machine to see what happened. The task was delivered as a CPU task and not as a beta test GPU task. Because I'd reset the project at the time, 108 large data files, plus the apps and the ancillary files like sun, earth, etc., were sent along with the task. Quite a huge download for just one task :-).

Was the problem to do with the xml_doc max buffer size, as mentioned in my original message?

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4341
Credit: 252634559
RAC: 35595

RE: I documented this exact

Quote:
I documented this exact problem two months ago in this message but didn't get a response at the time.

Sorry. Must have missed that.

Quote:
Was the problem to do with xml_doc max buffer as mentioned in my original message?

Indeed. We come across this problem occasionally; the first implementation of the current solution dates back to 2010.

A GW run usually progresses from lower- to higher-frequency data. Due to the nature of the analysis (the effect of the "spindown"), at higher frequencies we need more data files per task to cover a larger frequency range. For each data file there are two entries (file_info and file_ref) in each workunit. In addition, for each file there is one (full) URL per available download mirror. So as the run progresses, the buffer that holds the XML blob of a task fills up more and more.
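
To illustrate (a rough sketch only; the tag structure follows the standard BOINC scheduler reply, but the file name, checksum, size and mirror hosts below are invented), each data file adds a pair of entries like this to the workunit:

<!-- illustrative sketch; all values invented -->
<file_info>
    <name>h1_0840.60_S6CasA</name>
    <url>http://einstein-dl1.example.edu/data/h1_0840.60_S6CasA</url>
    <url>http://einstein-dl2.example.edu/data/h1_0840.60_S6CasA</url>
    <md5_cksum>d41d8cd98f00b204e9800998ecf8427e</md5_cksum>
    <nbytes>4000000</nbytes>
</file_info>
<file_ref>
    <file_name>h1_0840.60_S6CasA</file_name>
</file_ref>

Multiply that by dozens of data files per task, with one full URL line per mirror inside each file_info, and the blob grows quickly.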

Currently we cut the size down by limiting the number of download URLs transmitted to the clients for each file. Although we have five mirrors distributed around the world, each client only sees the n nearest ones. Last night I lowered n from 3 to 2, which apparently fixed the problem, hopefully for the rest of the S6CasA run.
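
As a rough back-of-envelope illustration (the per-URL byte figure is an assumption; the 108 data files per task are taken from Gary's report above): at roughly 100 bytes per full URL, dropping from 3 to 2 URLs per file saves about 108 × 100 ≈ 11 KB of XML, which can be just enough to fit the blob back under a fixed buffer limit (the xml_doc fields in the BOINC database are capped at a compile-time BLOB_SIZE, traditionally 64 KB).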

I do hope that by the end of S6CasA (or at least before its successor starts), Einstein@Home will be running the server software that we are currently testing on Albert@Home, and that that version no longer exhibits this problem.

BM
