Validator offline??

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 325269513
RAC: 16089
Topic 190745

I noticed in the server status that the validator is offline -- apparently for the past several hours.

Normally things here run rock steady.

Didn't see any notes regarding problems with the validator -- anyone have any news on this?

history
Joined: 22 Jan 05
Posts: 127
Credit: 7573923
RAC: 0

Validator offline??

BarryAZ: Uploaded over a hundred WU's, got diddly for credit. The weakest link has been identified! With all the 1 hour or less WU's (some of mine processed in 24 minutes), the Pentium 233 driving the validator has had a serious increase in read/write demand. This sucks. They got Optys driving the servers and junk running the validator. I have noticed on a good day that this sucker has some serious delay problems servicing the RAC after heavy uploads. God speed and good crunching.

Regards-tweakster

Schnappi
Joined: 12 Mar 05
Posts: 1
Credit: 96786
RAC: 0

Hello, I am from Germany and

Hello, I am from Germany and I don't know if I understand the problem correctly, but my credits are the same as last week. I uploaded results but they didn't count. So maybe a problem with BOINC or whatever!?!

KSMarksPsych
Moderator
Joined: 15 Oct 05
Posts: 2702
Credit: 4090227
RAC: 0

RE: Hello, I am from

Message 25028 in response to message 25027

Quote:
Hello, I am from Germany and I don't know if I understand the problem correctly, but my credits are the same as last week. I uploaded results but they didn't count. So maybe a problem with BOINC or whatever!?!

All but one of your pending WUs are waiting for a quorum to form.

The one at the top of the list is apparently waiting for the validator to run it.

Kathryn

Kathryn :o)

Einstein@Home Moderator

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 325269513
RAC: 16089

Looks like the validator went

Message 25029 in response to message 25026

Looks like the validator went back online after another hour or so.

But you are right, all these short cycle work units are increasing the workload. The SETI folks have that problem as well. In the past, the longer (larger) work units that Einstein was working with helped them....

Quote:

BarryAZ: Uploaded over a hundred WU's, got diddly for credit. The weakest link has been identified! With all the 1 hour or less WU's (some of mine processed in 24 minutes), the Pentium 233 driving the validator has had a serious increase in read/write demand. This sucks. They got Optys driving the servers and junk running the validator. I have noticed on a good day that this sucker has some serious delay problems servicing the RAC after heavy uploads. God speed and good crunching.

Regards-tweakster


Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

After six weeks of continuous

After six weeks of continuous operation, the validator exposed a bug in the unzip library which it uses to uncompress results, and crashed. I noticed this within a short time, and restarted the validator, but it crashed again on the same bug and this time a number of hours went by before I could look more carefully.

The authors of the zip library have been notified about the bug, and the validator has been restarted. After a number of hours offline, the validator had a backlog of about 14000 workunits to validate, which took some time to grind through. Right now the validator backlog is normal -- a handful of workunits. I really don't understand the P233 remarks: normally, workunits never wait more than about ten seconds before validation.

Cheers,
Bruce

Director, Einstein@Home

history
Joined: 22 Jan 05
Posts: 127
Credit: 7573923
RAC: 0

Bruce: please understand that

Bruce: please understand that lack of communication breeds idle speculation. A red box is a dead box. Forgive my reaction, I am a wounded veteran of the "boikly follies". The validator needed a stress test before the dreaded "24 divided by 16 equals one" WU's hit the grid. 12 hours of wtf on the part of loyal crunchers was clearly a point of order. The validator still seems to have a "random hand".

Regards-tweakster

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 325269513
RAC: 16089

Bruce -- thanks for the

Message 25032 in response to message 25030

Bruce -- thanks for the information - I started the thread in the informational void that sometimes happens. Since Einstein runs a lot more smoothly than SETI, the offline status seemed more glaring I suppose.

That being said, I am seeing an increased amount of pending credit. My 'run rate' hasn't changed all that much since late last year -- but back in December, my pending number was in the 2K to 2.5K range. Over the past several weeks, that number has climbed to 7K or so.

My own guess is that this reflects something of a load problem (not just on the validator side perhaps) that is being driven by the extra processing load the database encounters with a large increase of results to handle as the average size (time to complete) of the result has dropped.

Quote:

After six weeks of continuous operation, the validator exposed a bug in the unzip library which it uses to uncompress results, and crashed. I noticed this within a short time, and restarted the validator, but it crashed again on the same bug and this time a number of hours went by before I could look more carefully.

The authors of the zip library have been notified about the bug, and the validator has been restarted. After a number of hours offline, the validator had a backlog of about 14000 workunits to validate, which took some time to grind through. Right now the validator backlog is normal -- a handful of workunits. I really don't understand the P233 remarks: normally, workunits never wait more than about ten seconds before validation.

Cheers,
Bruce


Pooh Bear 27
Joined: 20 Mar 05
Posts: 1376
Credit: 20312671
RAC: 0

RE: That being said, I am

Message 25033 in response to message 25032

Quote:

That being said, I am seeing an increased amount of pending credit. My 'run rate' hasn't changed all that much since late last year -- but back in December, my pending number was in the 2K to 2.5K range. Over the past several weeks, that number has climbed to 7K or so.

My own guess is that this reflects something of a load problem (not just on the validator side perhaps) that is being driven by the extra processing load the database encounters with a large increase of results to handle as the average size (time to complete) of the result has dropped.

It actually reflects that only 3 results are sent, instead of 4. My pending has more than doubled since this happened. Since it needs a minimum of 3 returned to validate, all 3 must return. With 4 there was an extra one out there. Now the fault tolerance is that we need to wait until one or more expire before another one goes out, and that can happen many times. Another thing I notice is that more people are carrying larger caches.

So, we get to live with a lot more pending, but hopefully more work is done. I personally do not see it, because if my pending has more than doubled, and yours has too, it means that more than double the work is waiting on other people compared to when 4 went out, which suggests to me that each WU takes more than double the time to finish. My thinking might be slightly flawed, but it's just an observation of what I see.
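
The 3-versus-4 argument in this post can be illustrated with a little arithmetic. The sketch below is only a toy model, not anything taken from the project's server code: it assumes each host returns its result on time independently with the same probability, and the value 0.9 is a made-up number chosen for illustration. The function name `quorum_probability` is hypothetical.

```python
from math import comb


def quorum_probability(sent: int, quorum: int, p_return: float) -> float:
    """Chance that at least `quorum` of `sent` results have come back.

    Toy assumption: every host returns its result independently with
    probability `p_return` (real hosts are neither independent nor equal).
    """
    return sum(
        comb(sent, k) * p_return**k * (1 - p_return) ** (sent - k)
        for k in range(quorum, sent + 1)
    )


if __name__ == "__main__":
    p = 0.9  # hypothetical on-time return rate, not a measured value
    for sent in (3, 4):
        chance = quorum_probability(sent, quorum=3, p_return=p)
        print(f"{sent} sent, quorum 3: {chance:.4f}")
```

Under these assumptions, sending only 3 means every single host has to deliver, so one slow or lost result holds up credit for the other two until a replacement is issued. With 4 sent, one host can drop out without delaying validation.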

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: That being said, I am

Message 25034 in response to message 25032

Quote:
That being said, I am seeing an increased amount of pending credit. My 'run rate' hasn't changed all that much since late last year -- but back in December, my pending number was in the 2K to 2.5K range. Over the past several weeks, that number has climbed to 7K or so.

This is because a week or so ago I changed one of the scheduler parameters so that unsent results only get 'forced' out to a host machine if they are more than a week old. Previously this happened if they were more than about two days old. The primary reason I made this change is that it will result in fewer large data file downloads by volunteers. To say it another way, it will tend to localize data files more, so that a given volunteer with a given data file will get more work for that file before having to download a new data file. I think this is a better choice for the project, although it may lead to somewhat longer average times to validation.

Cheers,
Bruce

Director, Einstein@Home
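
For readers trying to picture the trade-off described in the post above, here is a rough sketch of how an age threshold on unsent results might interact with data-file locality. This is not the project's scheduler code: the names `UnsentResult`, `pick_result`, and `force_after` are invented for illustration, and the rule that overdue results are handed out before local ones is an assumption.

```python
from dataclasses import dataclass

DAY = 86400  # seconds


@dataclass
class UnsentResult:
    name: str
    data_file: str
    created: float  # creation time in seconds


def pick_result(unsent, host_files, now, force_after=7 * DAY):
    """Choose one unsent result for a host that is asking for work.

    Hypothetical policy: a result older than `force_after` is forced out
    even if the host lacks its data file; otherwise only results for a
    data file the host already holds are considered.
    """
    overdue = [r for r in unsent if now - r.created > force_after]
    if overdue:
        # Oldest overdue result goes first; the host may need a new download.
        return min(overdue, key=lambda r: r.created)
    local = [r for r in unsent if r.data_file in host_files]
    if local:
        # No download needed: the host keeps working on a file it has.
        return min(local, key=lambda r: r.created)
    return None  # assigning a fresh data file is not modeled here


if __name__ == "__main__":
    now = 10 * DAY
    unsent = [
        UnsentResult("old_b", "B", now - 3 * DAY),
        UnsentResult("new_a", "A", now - 1 * DAY),
    ]
    for threshold in (2 * DAY, 7 * DAY):
        chosen = pick_result(unsent, {"A"}, now, force_after=threshold)
        print(threshold // DAY, "day threshold ->", chosen.name)
```

In this toy setup, the two-day threshold pushes the three-day-old result for file B onto a host that only holds file A, while the one-week threshold lets that host keep working on file A. The cost is that the file B result stays unsent longer, which is consistent with the longer time to validation mentioned in the post.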

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 325269513
RAC: 16089

Ah -- OK. Again, thanks for

Message 25035 in response to message 25034

Ah -- OK. Again, thanks for the explanation.

Quote:


This is because a week or so ago I changed one of the scheduler parameters so that unsent results only get 'forced' out to a host machine if they are more than a week old. Previously this happened if they were more than about two days old. The primary reason I made this change is that it will result in fewer large data file downloads by volunteers. To say it another way, it will tend to localize data files more, so that a given volunteer with a given data file will get more work for that file before having to download a new data file. I think this is a better choice for the project, although it may lead to somewhat longer average times to validation.

Cheers,
Bruce

