new units not downloading

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: Any chance to reset the

Message 13526 in response to message 13525

Quote:
Any chance to reset the "daily quota" things too for today?

Good idea. I should be able to reset the daily quota for any host that has had WU cancelled. I'll work on this now.

[Update 10 minutes later]

DONE!

I've reset the daily result quota for any host that received an h1 workunit.
By the way, I don't think I ever said 'thank you' to those people who pointed out that something was wrong.

THANK YOU VERY MUCH!!

Could anyone suggest a simple and reliable way to abort h1_ workunits from any host, including those running old clients? Since the input data file is no longer on the download servers, I would have thought a simple and guaranteed solution was (1) stop BOINC (2) delete all files named h1_* (LOWER CASE!) and (3) restart BOINC. Can anyone confirm that this works? Is there an easier way?
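Step (2) can be sketched roughly as follows. This is only an illustration, assuming BOINC has already been stopped and that you pass in the Einstein@Home project directory yourself; the function names and the `dry_run` flag are made up for this example. The match uses a case-sensitive `str.startswith` rather than a shell wildcard, because a Windows wildcard such as `h1_*` would also match the `H1_*` files, which should be kept.

```python
from pathlib import Path


def find_lowercase_h1_files(project_dir):
    """Return files in project_dir whose names start with lower-case 'h1_'."""
    # str.startswith is case-sensitive on every platform, so a name such as
    # 'H1_0050.0' is never selected by this check.
    return sorted(
        p for p in Path(project_dir).iterdir()
        if p.is_file() and p.name.startswith("h1_")
    )


def delete_lowercase_h1_files(project_dir, dry_run=True):
    """Delete the matching files; with dry_run=True only report them."""
    targets = find_lowercase_h1_files(project_dir)
    for path in targets:
        if not dry_run:
            path.unlink()
    # Return the names so the caller can see what was (or would be) removed.
    return [p.name for p in targets]
```

Calling `delete_lowercase_h1_files(some_dir)` first with the default `dry_run=True` lists what would be removed; passing `dry_run=False` actually deletes the files before BOINC is restarted.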

Bruce

Director, Einstein@Home

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2500681
RAC: 0

RE: DONE! I've reset the

Message 13527 in response to message 13526

Quote:

DONE!

I've reset the daily result quota for any host that received an h1 workunit.

Bruce

Great, that worked - thanks :-)

Two of my dual-CPU machines had been sitting there with one SETI WU each; they didn't download more SETI work because Einstein has a much higher share. Now they are busy on both CPUs again :-)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117975691624
RAC: 21952321

I've had a large number of

I've had a large number of boxes affected by this. I've only just noticed it a few minutes ago on one box. I stopped BOINC, deleted the h1 file (lower case h), restarted BOINC, forced an update and got a new file (l1 this time - lower case ell) and everything seems sweet again.

I've started looking at other boxes that I can't physically get to immediately and have found quite a number (probably about 10 so far) that have errored out work for no apparent reason today. Interestingly a number of these show signs of autorecovering in that fresh work is appearing in the list of results.

I'm not at all angry about this - c'est la vie, as they say. All I'd like to know is whether all affected boxes will autorecover now that the 8 per day has been reset, or will I physically have to go to each box and delete the offending h1 file?

Cheers,
Gary.

littleBouncer
Joined: 22 Jan 05
Posts: 86
Credit: 12206010
RAC: 0

@ Bruce Allen, Why you

@ Bruce Allen,

Why haven't you changed the application from 4.79 to 0.03 (Windows) yet?

-only a Q.-

greetz littleBouncer

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: I've had a large number

Message 13530 in response to message 13528

Quote:
I've had a large number of boxes affected by this. I've only just noticed it a few minutes ago on one box. I stopped BOINC, deleted the h1 file (lower case h), restarted BOINC, forced an update and got a new file (l1 this time - lower case ell) and everything seems sweet again.

I'm glad this works. I think that this is probably the easiest procedure for most users.

Quote:
I've started looking at other boxes that I can't physically get to immediately and have found quite a number (probably about 10 so far) that have errored out work for no apparent reason today. Interestingly a number of these show signs of autorecovering in that fresh work is appearing in the list of results.

The basic problem is that some hosts may have WU that refer to different files, named (for example) H1_0050.0 and h1_0050.0. These have different lengths and different checksums. But Windows treats these files as the same and will replace one with the other. Hence a WU may error out because the checksum stated in the workunit does not agree with the calculated checksum from the file. If this happens, then all is well because the WU will exit immediately with no wasted CPU time.
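The failure mode described above can be illustrated with a small sketch. It assumes the checksum stated in the workunit is an MD5 hex digest (the thread does not say which algorithm is used), and both function names are made up for the example.

```python
import hashlib
from collections import defaultdict


def case_collisions(names):
    """Group file names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Only groups with more than one distinct spelling can clash on a
    # case-insensitive file system such as the Windows default.
    return [sorted(g) for g in groups.values() if len(set(g)) > 1]


def checksum_matches(data, expected_hex):
    """Compare the MD5 of the file contents with the expected digest."""
    # MD5 is an assumption here; substitute whatever the workunit specifies.
    return hashlib.md5(data).hexdigest() == expected_hex


print(case_collisions(["H1_0050.0", "h1_0050.0", "l1_0001.0"]))
# prints [['H1_0050.0', 'h1_0050.0']]
```

If one file of a colliding pair overwrites the other, `checksum_matches` returns `False` for the workunit that expected the overwritten contents, which corresponds to the immediate error described above.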

Quote:
I'm not at all angry about this - c'est la vie, as they say. All I'd like to know is whether all affected boxes will autorecover now that the 8 per day has been reset, or will I physically have to go to each box and delete the offending h1 file?

I'm glad you're not mad, though I imagine that others will be! In a few hours I will again re-run the script that resets the daily result quota for machines that got h1_ workunits. This should help the machines to get more work right away.

If you don't delete the offending h1 file, I am not sure what will happen. In some cases, if there is no conflict with an H1 file name, the WU may well complete. Then the main issue is wasted CPU cycles, since I cancelled these WU on the server side.

Cheers,
Bruce

Director, Einstein@Home

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: @ Bruce Allen, Why you

Message 13531 in response to message 13529

Quote:

@ Bruce Allen,

Why haven't you changed the application from 4.79 to 0.03 (Windows) yet?

-only a Q.-

greetz littleBouncer

We should probably have this discussion in the other thread. But the short answer is that the new executable seems to be slower in most cases. We need to understand and fix that problem before distributing it widely.

Bruce

Director, Einstein@Home

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117975691624
RAC: 21952321

OK, thanks very much for the

OK, thanks very much for the reply. Let me get this straight. If I'm seeing repeated attempts to get a file and repeated checksum errors it's due to a clash between a H1_xxxx and a h1_xxxx and this results in rapidly errored out work.

However, if I see any box with work in its results list starting h1_nnnn then whilst it appears at the moment to be proceeding normally, I'm going to get a rude awakening when that work is finished and attempted to be reported so I'm going to be wasting cycles big time unless I go and delete all h1_ work on all boxes that have it.

Does that about sum it up in layman's terms? :).

If so, then AAAAAARRRRRRRRGGGGGGGGGGGHHHHHHHHHH!!!!!!! :).

Seriously, I'm still not at all mad about this. One redeeming feature is that, because of my fetish for keeping small caches, there is not a huge number of work units to be wasted, even though (at a quick search) there are h1_ files on most of my boxes. In fact most of the wasted cycles have already occurred, and if I spend the time it takes to get around every box, I'm probably not going to save very much anyway.

Would like to be assured that the basic layman's summary is correct though.

Cheers,
Gary.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: OK, thanks very much

Message 13533 in response to message 13532

Quote:

OK, thanks very much for the reply. Let me get this straight. If I'm seeing repeated attempts to get a file and repeated checksum errors it's due to a clash between a H1_xxxx and a h1_xxxx and this results in rapidly errored out work.

However, if I see any box with work in its results list starting h1_nnnn then whilst it appears at the moment to be proceeding normally, I'm going to get a rude awakening when that work is finished and attempted to be reported so I'm going to be wasting cycles big time unless I go and delete all h1_ work on all boxes that have it.

Does that about sum it up in layman's terms? :).

Yes!

I have CANCELLED all h1_ workunits. That means that any CPU time spent on them is entirely wasted. No credits, no glory, no purpose.

Shoot those workunits before they tire out your CPUs.

(And once again, sincere apologies for this fiasco.)

Bruce

Director, Einstein@Home

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117975691624
RAC: 21952321

RE: Yes! I have

Message 13534 in response to message 13533

Quote:

Yes!

I have CANCELLED all h1_ workunits. That means that any CPU time spent on them is entirely wasted. No credits, no glory, no good.

Shoot those workunits before they tire out your CPUs.

In anticipation of that answer I've just finished deleting h1_nnnn work on about a dozen boxes that I can actually get physical access to. Bit of a struggle for V4.19 as it doesn't have the nice abort button that the later CCs have. Here's basically what I had to do.

1. Stop BOINC.
2. Delete the large h1_nnnn file in the einstein subdir of the projects dir.
3. Restart BOINC. It would complain about missing files and would try to reget them.
4. The current WU would error out and the reget would mostly fail, but occasionally it seemed to succeed.
5. Stop BOINC and repeat the procedure. The next h1_nnnn would then seem to error out.
6. I think on all second passes, BOINC would then get an l1_nnnn data file and I knew I was winning.
7. I'd throw in the odd "update", which occasionally seemed to help. I also had to stop and start BOINC to get processing started.

The interesting thing was that on at least three occasions BOINC claimed to be able to reget at least part of the h1_nnnn large file. I thought they were all supposedly deleted? Maybe BOINC was kidding itself :).

Cheers,
Gary.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: The interesting thing

Message 13535 in response to message 13534

Quote:
The interesting thing was that on at least three occasions BOINC claimed to be able to reget at least part of the h1_nnnn large file. I thought they were all supposedly deleted? Maybe BOINC was kidding itself :).

E@H uses five different data servers. Four are mirrored off the root server at UWM. I deleted the files from that root server about 8 hours ago, and the secondary servers are supposed to mirror that change after no more than 15 minutes. However if one or more of them failed to mirror the changes, then it will continue to serve out the files and might cause the behavior that you saw.
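As a rough illustration of the stale-mirror situation, the sketch below compares file listings that are assumed to have been collected already. The function name, the mirror names and the way listings are obtained are all hypothetical; nothing here reflects how the E@H servers are actually monitored.

```python
def stale_mirrors(root_files, mirror_files):
    """Return {mirror: files it still serves that the root no longer has}."""
    root = set(root_files)
    stale = {}
    for mirror, files in mirror_files.items():
        # Anything present on a mirror but absent from the root server
        # means the deletion has not been mirrored yet.
        leftovers = sorted(set(files) - root)
        if leftovers:
            stale[mirror] = leftovers
    return stale


# Hypothetical listings: the root no longer has the h1_ file,
# but one mirror has not picked up the deletion.
listings = {
    "mirror-a": ["l1_0001.0"],
    "mirror-b": ["l1_0001.0", "h1_0050.0"],
}
print(stale_mirrors(["l1_0001.0"], listings))
# prints {'mirror-b': ['h1_0050.0']}
```

A mirror reported here would keep serving the deleted file, which would be consistent with a client managing to reget part of an h1_ file.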

Bruce

Director, Einstein@Home
