Validator offline??
BarryAZ: Uploaded over a hundred WU's, got diddly for credit. The weakest link has been identified! With all the 1 hour or less WU's (some of mine processed in 24 minutes), the Pentium 233 driving the validator has had a serious increase in read/write demand. This sucks. They got Optys driving the servers and junk running the validator. I have noticed on a good day that this sucker has some serious delay problems servicing the RAC after heavy uploads. God speed and good crunching.
Regards-tweakster
Hello, I am from Germany and
Hello, I am from Germany and I don't know if I understand the problem correctly, but my credits are the same as last week. I uploaded results but they didn't count. So maybe it's a problem with BOINC or something!?!
RE: Hello, I am from
All but one of your pending WU are waiting for quorum to form.
The one at the top of the list is apparently waiting for the validator to run it.
Kathryn :o)
Einstein@Home Moderator
Looks like the validator went
Looks like the validator went back online after another hour or so.
But you are right, all these short cycle work units are increasing the workload. The SETI folks have that problem as well. In the past, the longer (larger) work units that Einstein was working with helped them....
After six weeks of continuous
After six weeks of continuous operation, the validator exposed a bug in the unzip library which it uses to uncompress results, and crashed. I noticed this within a short time, and restarted the validator, but it crashed again on the same bug and this time a number of hours went by before I could look more carefully.
The authors of the zip library have been notified about the bug, and the validator has been restarted. After a number of hours offline, the validator had a backlog of about 14000 workunits to validate, which took some time to grind through. Right now the validator backlog is normal -- a handful of workunits. I really don't understand the P233 remarks: normally, workunits never wait more than about ten seconds before validation.
Cheers,
Bruce
Director, Einstein@Home
Bruce: please understand that
Bruce: please understand that lack of communication breeds idle speculation. A red box is a dead box. Forgive my reaction, I am a wounded veteran of the "boikly follies". The validator needed a stress test before the dreaded "24 divided by 16 equals one" WU's hit the grid. 12 hours of wtf on the part of loyal crunchers was clearly a point of order. The validator still seems to have a "random hand".
Regards-tweakster
Bruce -- thanks for the
Bruce -- thanks for the information - I started the thread in the informational void that sometimes happens. Since Einstein runs a lot more smoothly than SETI, the offline status seemed more glaring I suppose.
That being said, I am seeing an increased amount of pending credit. My 'run rate' hasn't changed all that much since late last year -- but back in December, my pending number was in the 2K to 2.5K range. Over the past several weeks, that number has climbed to 7K or so.
My own guess is that this reflects something of a load problem (not just on the validator side perhaps) that is being driven by the extra processing load the database encounters with a large increase of results to handle as the average size (time to complete) of the result has dropped.
RE: That being said, I am
It actually reflects that only 3 results are sent out instead of 4. My pending has more than doubled since this happened. Since a minimum of 3 returned results is needed to validate, all 3 must come back. With 4 there was a spare out there. Now the only fault tolerance is waiting until one or more results expire before another one goes out, and that can happen many times. Another thing I notice is that more people are carrying larger caches.
So we get to live with a lot more pending, but hopefully more work gets done. I personally do not see it: if my pending has more than doubled and so has yours, then more than twice as much work is waiting on other people as when 4 went out, which suggests to me that each WU takes more than twice as long to finish. My thinking might be slightly flawed, but it's just an observation of what I see.
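To put rough numbers on the 3-versus-4 point, here is a minimal sketch in Python. It is only a toy model, not the project's actual scheduler or validator logic: the function name `time_to_quorum`, the 14-day deadline, and the turnaround times are all made up for illustration. It assumes a replacement result goes out only when an outstanding one expires.

```python
import heapq
from typing import Optional, Sequence


def time_to_quorum(
    turnarounds: Sequence[Optional[float]],
    quorum: int,
    initial: int,
    deadline: float,
) -> float:
    """Days until `quorum` results are back for one workunit (toy model).

    turnarounds[i] is how long the i-th issued result takes to come back,
    or None if the host never returns it. `initial` results go out at t=0;
    a replacement is issued only when an outstanding result expires.
    """
    events = []  # heap of (time, returned_ok)
    issued = 0

    def issue(at: float) -> None:
        nonlocal issued
        if issued >= len(turnarounds):
            raise ValueError("not enough results to reach quorum")
        t = turnarounds[issued]
        issued += 1
        if t is None or t > deadline:
            # Lost or late: nothing happens until the deadline passes.
            heapq.heappush(events, (at + deadline, False))
        else:
            heapq.heappush(events, (at + t, True))

    for _ in range(initial):
        issue(0.0)

    returned = 0
    while events:
        now, ok = heapq.heappop(events)
        if ok:
            returned += 1
            if returned >= quorum:
                return now
        else:
            issue(now)  # replacement goes out only after the expiry
    raise ValueError("not enough results to reach quorum")


# Hypothetical workunit: two quick hosts, one that vanishes, one 3-day host.
hosts = [1.0, 2.0, None, 3.0]
wait_with_3 = time_to_quorum(hosts, quorum=3, initial=3, deadline=14.0)
wait_with_4 = time_to_quorum(hosts, quorum=3, initial=4, deadline=14.0)
print(wait_with_3, wait_with_4)
```

With these invented numbers, one vanished host costs a full deadline plus the replacement's turnaround when only 3 go out, while the spare result hides it when 4 go out. That is the effect described above, though real turnaround distributions will change the size of the gap.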
RE: That being said, I am
This is because a week or so ago I changed one of the scheduler parameters so that unsent results only get 'forced' out to a host machine if they are more than a week old. Previously this happened if they were more than about two days old. The primary reason I made this change is that it will result in fewer large data file downloads by volunteers. To say it another way, it will tend to localize data files more, so that a given volunteer with a given data file will get more work for that file before having to download a new data file. I think this is a better choice for the project, although it may lead to somewhat longer average times to validation.
Cheers,
Bruce
Director, Einstein@Home
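For anyone trying to picture the trade-off Bruce describes, here is a rough sketch in Python. It is not the real scheduler code: the names `Unsent`, `pick_result`, and `force_after_days`, and the selection rule itself, are assumptions made up for this example. The idea is only that an unsent result older than the threshold gets sent regardless of data file, and otherwise the host is given work for a file it already holds.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Unsent:
    name: str         # hypothetical result name
    data_file: str    # large data file the result needs
    age_days: float   # how long the result has been waiting unsent


def pick_result(
    unsent: Iterable[Unsent],
    host_files: Set[str],
    force_after_days: float,
) -> Optional[Unsent]:
    """Choose one result for a host (illustrative rule only)."""
    pool = list(unsent)

    # Results past the threshold are forced out, even if the host
    # has to download a new data file for them.
    overdue = [r for r in pool if r.age_days > force_after_days]
    if overdue:
        return max(overdue, key=lambda r: r.age_days)

    # Otherwise stay local: prefer work for a file the host already has.
    local = [r for r in pool if r.data_file in host_files]
    if local:
        return max(local, key=lambda r: r.age_days)

    # Nothing local and nothing overdue: the host would need a new
    # data file, which this sketch does not model.
    return None


queue = [Unsent("a", "f1", 3.0), Unsent("b", "f2", 1.0)]
host = {"f2"}
old_rule = pick_result(queue, host, force_after_days=2.0)
new_rule = pick_result(queue, host, force_after_days=7.0)
print(old_rule.name, new_rule.name)
```

Under the two-day threshold the host is pushed onto result "a" and must fetch file "f1". Under the one-week threshold it keeps working on "f2", which it already has, while "a" waits longer for a host. That is the fewer-downloads, longer-time-to-validation trade-off in miniature.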
Ah -- OK. Again, thanks for
Ah -- OK. Again, thanks for the explanation.