Validator offline??

BarryAZ

Joined: 8 May 05

Posts: 190

Credit: 325828203

RAC: 14967

8 Feb 2006 23:22:05 UTC

Topic 190745

I noticed in the server status that the validator is offline -- apparently for the past several hours.

Normally things here run rock steady.

Didn't see any notes regarding problems with the validator -- anyone have any news on this?

history

Joined: 22 Jan 05

Posts: 127

Credit: 7573923

RAC: 0

Validator offline??

9 Feb 2006 0:32:15 UTC

Message 25026

BarryAZ: Uploaded over a hundred WU's, got diddly for credit. The weakest link has been identified! With all the 1 hour or less WU's (some of mine processed in 24 minutes), the Pentium 233 driving the validator has had a serious increase in read/write demand. This sucks. They got Optys driving the servers and junk running the validator. I have noticed on a good day that this sucker has some serious delay problems servicing the RAC after heavy uploads. God speed and good crunching.

Regards-tweakster

Schnappi

Joined: 12 Mar 05

Posts: 1

Credit: 96786

RAC: 0

Hello, I am from Germany and

9 Feb 2006 3:06:58 UTC

Message 25027

Hello, I am from Germany and I don't know if I understand the problem right, but my credits are the same as last week. I uploaded results but they didn't count. So maybe a problem with BOINC or whatever!?!

KSMarksPsych

Moderator

Joined: 15 Oct 05

Posts: 2702

Credit: 4090227

RAC: 0

RE: Hello, I am from

9 Feb 2006 3:42:47 UTC

Message 25028 in response to message 25027

Quote:

Hello, I am from Germany and I don't know if I understand the problem right, but my credits are the same as last week. I uploaded results but they didn't count. So maybe a problem with BOINC or whatever!?!

All but one of your pending WU are waiting for quorum to form.

The one at the top of the list is apparently waiting for the validator to run it.

Kathryn

Kathryn :o)

Einstein@Home Moderator

BarryAZ

Joined: 8 May 05

Posts: 190

Credit: 325828203

RAC: 14967

Looks like the validator went

9 Feb 2006 6:35:46 UTC

Message 25029 in response to message 25026

Looks like the validator went back online after another hour or so.

But you are right, all these short cycle work units are increasing the workload. The SETI folks have that problem as well. In the past, the longer (larger) work units that Einstein was working with helped them....

Quote:

BarryAZ: Uploaded over a hundred WU's, got diddly for credit. The weakest link has been identified! With all the 1 hour or less WU's (some of mine processed in 24 minutes), the Pentium 233 driving the validator has had a serious increase in read/write demand. This sucks. They got Optys driving the servers and junk running the validator. I have noticed on a good day that this sucker has some serious delay problems servicing the RAC after heavy uploads. God speed and good crunching.

Regards-tweakster

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

After six weeks of continuous

9 Feb 2006 14:09:19 UTC

Message 25030

After six weeks of continuous operation, the validator exposed a bug in the unzip library which it uses to uncompress results, and crashed. I noticed this within a short time, and restarted the validator, but it crashed again on the same bug and this time a number of hours went by before I could look more carefully.

The authors of the zip library have been notified about the bug, and the validator has been restarted. After a number of hours offline, the validator had a backlog of about 14000 workunits to validate, which took some time to grind through. Right now the validator backlog is normal -- a handful of workunits. I really don't understand the P233 remarks: normally, workunits never wait more than about ten seconds before validation.

Cheers,
Bruce

Director, Einstein@Home

history

Joined: 22 Jan 05

Posts: 127

Credit: 7573923

RAC: 0

Bruce: please understand that

10 Feb 2006 1:52:58 UTC

Message 25031

Bruce: please understand that lack of communication breeds idle speculation. A red box is a dead box. Forgive my reaction, I am a wounded veteran of the "boikly follies". The validator needed a stress test before the dreaded "24 divided by 16 equals one" WU's hit the grid. 12 hours of wtf on the part of loyal crunchers was clearly a point of order. The validator still seems to have a "random hand"

Regards-tweakster

BarryAZ

Joined: 8 May 05

Posts: 190

Credit: 325828203

RAC: 14967

Bruce -- thanks for the

10 Feb 2006 2:27:34 UTC

Message 25032 in response to message 25030

Bruce -- thanks for the information - I started the thread in the informational void that sometimes happens. Since Einstein runs a lot more smoothly than SETI, the offline status seemed more glaring I suppose.

That being said, I am seeing an increased number of pending credit. My 'run rate' hasn't changed all that much since late last year -- but back in December, my pending number was in the 2K to 2.5K range. Over the past several weeks, that number has climbed to 7K or so.

My own guess is that this reflects something of a load problem (not just on the validator side perhaps) that is being driven by the extra processing load the database encounters with a large increase of results to handle as the average size (time to complete) of the result has dropped.

Quote:

After six weeks of continuous operation, the validator exposed a bug in the unzip library which it uses to uncompress results, and crashed. I noticed this within a short time, and restarted the validator, but it crashed again on the same bug and this time a number of hours went by before I could look more carefully.

The authors of the zip library have been notified about the bug, and the validator has been restarted. After a number of hours offline, the validator had a backlog of about 14000 workunits to validate, which took some time to grind through. Right now the validator backlog is normal -- a handful of workunits. I really don't understand the P233 remarks: normally, workunits never wait more than about ten seconds before validation.

Cheers,
Bruce

Pooh Bear 27

Joined: 20 Mar 05

Posts: 1376

Credit: 20312671

RAC: 0

RE: That being said, I am

10 Feb 2006 3:13:26 UTC

Message 25033 in response to message 25032

Quote:

That being said, I am seeing an increased number of pending credit. My 'run rate' hasn't changed all that much since late last year -- but back in December, my pending number was in the 2K to 2.5K range. Over the past several weeks, that number has climbed to 7K or so.

My own guess is that this reflects something of a load problem (not just on the validator side perhaps) that is being driven by the extra processing load the database encounters with a large increase of results to handle as the average size (time to complete) of the result has dropped.

It actually reflects that only 3 results are sent, instead of 4. My pending has more than doubled since this happened. Since it needs a minimum of 3 returned to validate, all 3 must return. With 4 there was an extra one out there. Now the fault tolerance is that we need to wait until one or more expire before another one goes out, and that can happen many times. Another thing I notice is that more people are carrying larger caches.

So, we get to live with a lot more pending, but hopefully more work is done. I personally do not see it, because if I am over double, and you are over double, it means that more than double of the work is waiting for people than when 4 went out, which seems to me that more than double the time for each WU is taken to finish. My thinking might be slightly flawed, but it's just an observation of what I see.
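To put the 3-versus-4 point in concrete terms, here is a rough sketch, not anything taken from the project's server code. It assumes a quorum of 3 and treats validation as possible once the third result is back; the function name and the return times (in days) are made up for illustration.

```python
def time_to_quorum(return_times, quorum=3):
    """Time at which the quorum-th result arrives, or None if too few were sent.

    return_times: hypothetical days-until-return for each copy sent out.
    """
    if len(return_times) < quorum:
        return None
    # The workunit can validate once the quorum-th fastest copy is back.
    return sorted(return_times)[quorum - 1]


# Hypothetical hosts: two quick ones, one slow one, one average one.
four_sent = [1.0, 2.0, 6.5, 3.0]
three_sent = four_sent[:3]  # same workunit, but without the extra copy

print(time_to_quorum(four_sent))   # waits for the third fastest of four
print(time_to_quorum(three_sent))  # waits for the slowest of three
```

With three copies the wait is set by the slowest host, while the fourth copy lets the quorum form without it. A copy that expires and has to be reissued would stretch the three-copy case further; this sketch does not model that.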

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

RE: That being said, I am

10 Feb 2006 14:21:09 UTC

Message 25034 in response to message 25032

Quote:

That being said, I am seeing an increased number of pending credit. My 'run rate' hasn't changed all that much since late last year -- but back in December, my pending number was in the 2K to 2.5K range. Over the past several weeks, that number has climbed to 7K or so.

This is because a week or so ago I changed one of the scheduler parameters so that unsent results only get 'forced' out to a host machine if they are more than a week old. Previously this happened if they were more than about two days old. The primary reason I made this change is that it will result in fewer large data file downloads by volunteers. To say it another way, it will tend to localize data files more, so that a given volunteer with a given data file will get more work for that file before having to download a new data file. I think this is a better choice for the project, although it may lead to somewhat longer average times to validation.
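As a rough illustration only (this is not the actual scheduler code, and every name below is hypothetical), the rule can be pictured like this. The sketch assumes that an over-age unsent result takes priority over one matching a data file the host already holds; the real ordering may differ.

```python
from dataclasses import dataclass


@dataclass
class UnsentResult:
    name: str        # hypothetical result identifier
    data_file: str   # large data file the result needs
    age_days: float  # how long the result has been waiting unsent


def pick_result(unsent, host_files, force_after_days=7.0):
    """Pick the next result for a host, or None if nothing suitable exists."""
    # Assumption: results older than the threshold are forced out first,
    # even if the host must download a new data file for them.
    overdue = [r for r in unsent if r.age_days > force_after_days]
    if overdue:
        return max(overdue, key=lambda r: r.age_days)
    # Otherwise keep the host on data files it already has.
    local = [r for r in unsent if r.data_file in host_files]
    if local:
        return max(local, key=lambda r: r.age_days)
    return None  # the caller would have to assign a new data file


queue = [UnsentResult("a", "f1", 3.0), UnsentResult("b", "f2", 1.0)]
print(pick_result(queue, {"f2"}, force_after_days=7.0).name)  # stays on f2
print(pick_result(queue, {"f2"}, force_after_days=2.0).name)  # forced onto f1
```

With the one-week threshold the host keeps working on the file it already has; with the old two-day value the three-day-old result is forced out and triggers another download, but reaches a host (and so validation) sooner.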

Cheers,
Bruce

Director, Einstein@Home

BarryAZ

Joined: 8 May 05

Posts: 190

Credit: 325828203

RAC: 14967

Ah -- OK. Again, thanks for

11 Feb 2006 5:44:33 UTC

Message 25035 in response to message 25034

Ah -- OK. Again, thanks for the explanation.

Quote:

This is because a week or so ago I changed one of the scheduler parameters so that unsent results only get 'forced' out to a host machine if they are more than a week old. Previously this happened if they were more than about two days old. The primary reason I made this change is that it will result in fewer large data file downloads by volunteers. To say it another way, it will tend to localize data files more, so that a given volunteer with a given data file will get more work for that file before having to download a new data file. I think this is a better choice for the project, although it may lead to somewhat longer average times to validation.

Cheers,
Bruce
