My current suspicion is that these validate errors occur preferentially on 64-bit machines, either Linux or recent Mac OS versions.
Yes, this is obviously one of those where the workunit itself is the problem. Without manual intervention by an admin, it will eventually reach the limit of 20 error results.
I checked several of the latest ones and the error message for all the ones I checked is
Validate error [6] (00100000)
- result file has too few or too many rows
If you get a resend with lots of validate errors on previous results like this, feel free to abort it.
Actually only two validate errors at the time of checking. One of them has these messages associated with it
Validate error [6] (00111010)
- result file has entries that aren't numbers
- a number is out of valid range for this result
- result file has (lines with) wrong number of columns
- result file has too few or too many rows
I would think that this one is due to an overstressed GPU. The other one has just this single message
Validate error [6] (00001000)
- a number is out of valid range for this result
So at this point there's no indication that there must be a problem with the workunit. You would need to see quite a few with the same message to blame the WU. Resend tasks should NOT be deleted at this point.
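The flag strings quoted above can be lined up against their messages. The following is a minimal sketch, assuming the eight-digit string is a binary bitmask and that each bit maps to one message. The mapping is inferred only from the three samples in this thread, and `INFERRED_FLAGS` and `decode_validate_flags` are hypothetical names, not part of the project's validator.

```python
# ASSUMPTION: bit meanings are inferred from the three samples quoted in this
# thread; the real validator may define other bits or different meanings.
INFERRED_FLAGS = {
    0b00000010: "result file has entries that aren't numbers",
    0b00001000: "a number is out of valid range for this result",
    0b00010000: "result file has (lines with) wrong number of columns",
    0b00100000: "result file has too few or too many rows",
}


def decode_validate_flags(bits):
    """Return the inferred messages for a flag string such as '00111010'."""
    value = int(bits, 2)
    # Report messages in ascending bit order, matching the order seen in the samples.
    messages = [text for mask, text in sorted(INFERRED_FLAGS.items()) if value & mask]
    # Any set bit outside the inferred table is reported rather than guessed at.
    unknown = value & ~sum(INFERRED_FLAGS)
    if unknown:
        messages.append(f"unknown flag bits: {unknown:08b}")
    return messages


for sample in ("00100000", "00111010", "00001000"):
    print(sample, decode_validate_flags(sample))
```

Under this reading, a run of results that all decode to the same single message looks different from a result that sets several unrelated bits at once.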
RE: My current suspicion is
I have a lot of Linux hosts doing FGRP1 tasks. They all run a 32 bit OS. Every one I've looked at gets validate errors. I haven't done a proper investigation but whenever I happen to be perusing a tasks list, I routinely see them.
Cheers,
Gary.
RE: RE: Is it possible
Thanks Gary!
My biggest issue of course is whether this is the result of something I'm doing.
RE: My biggest issue of
If it's something you're doing then it's also something I'm doing on lots of hosts. I'm not quite ready to accept that yet :-).
Cheers,
Gary.
As I posted in reply to a
As I posted in reply to a thread in Number Crunching about an hour ago, I've recently observed a case in which a single WU has so far been sent to fifteen hosts and twelve of them have reported Validate Errors. That suggests some common cause, not an unlucky conjunction of twelve random host failures.
I don't have any idea how rare or common this particular sort may be, but given the rather conservative 20,20,20 setting, this repeat offender WU may yet go to five more unlucky hosts before central dispatch gives up on it.
Here are TaskIDs from hosts reporting Validate Error on this single WU:
257934442
258537109
257516015
259260242
259002951
259260243
257391991
258910534
259002952
257934441
258910533
258537110
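The two bits of arithmetic in this post can be checked with a short sketch. It assumes hosts fail independently, uses an invented 10% per-host validate-error rate purely for illustration, and reads "20,20,20" as caps of 20 on error, total, and success results. `MAX_TOTAL_RESULTS` and `prob_at_least` are hypothetical names.

```python
from math import comb

# ASSUMPTION: "20,20,20" is read here as caps of 20 on error, total, and
# success results; only the total cap is used below.
MAX_TOTAL_RESULTS = 20


def prob_at_least(k, n, p):
    """Chance that at least k of n independent hosts fail, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


sent, validate_errors = 15, 12

# ASSUMPTION: 0.10 is an invented per-host validate-error rate, not a measured one.
chance = prob_at_least(validate_errors, sent, 0.10)
remaining_sends = MAX_TOTAL_RESULTS - sent

print(f"chance of {validate_errors}+ independent failures in {sent}: {chance:.1e}")
print(f"sends left before the total cap: {remaining_sends}")
```

Even with a generous assumed failure rate, twelve independent failures out of fifteen come out far below one in a million, which is why a common cause is the more plausible reading.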
That looks like a good way to
That looks like a good way to help the people debugging.
Here's one with 10 validate errors, one completed, one in progress, and one error while computing:
http://einsteinathome.org/workunit/109589019
Occasionally there are
Occasionally there are workunits which error out that way. But these are rare, being watched, and I have only seen this with BRP4 WUs. Usually we cancel such WUs, but this requires manual intervention we didn't have time for recently. This is completely independent of the validate errors of FGRP tasks from certain App versions / platforms.
BM
Another one with lots of
Another one with lots of validate probs
http://einsteinathome.org/workunit/109582646
http://einstein.phys.uwm.edu/
http://einsteinathome.org/workunit/111838125 - only 3 for now :)
RE: Another one with lots
Yes, this is obviously one of those where the workunit itself is the problem. Without manual intervention by an admin, it will eventually reach the limit of 20 error results.
I checked several of the latest ones and the error message for all the ones I checked is
Validate error [6] (00100000)
- result file has too few or too many rows
If you get a resend with lots of validate errors on previous results like this, feel free to abort it.
Cheers,
Gary.
RE: http://einstein.phys.uw
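Actually only two validate errors at the time of checking. One of them has these messages associated with it
Validate error [6] (00111010)
- result file has entries that aren't numbers
- a number is out of valid range for this result
- result file has (lines with) wrong number of columns
- result file has too few or too many rows
I would think that this one is due to an overstressed GPU. The other one has just this single message
Validate error [6] (00001000)
- a number is out of valid range for this result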
So at this point there's no indication that there must be a problem with the workunit. You would need to see quite a few with the same message to blame the WU. Resend tasks should NOT be deleted at this point.
Cheers,
Gary.
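The rule of thumb in the two posts above (blame the workunit only when quite a few results fail with the same message) can be sketched as a simple tally. `workunit_looks_bad` and its `min_same` threshold of 5 are hypothetical. The thread gives no exact number, so treat the threshold as an assumption to adjust.

```python
from collections import Counter


def workunit_looks_bad(flag_strings, min_same=5):
    """True if at least min_same results share one validate-error flag string.

    ASSUMPTION: min_same=5 is an illustrative threshold, not a project rule.
    """
    if not flag_strings:
        return False
    # most_common(1) returns [(flag_string, count)] for the most frequent entry.
    _, count = Counter(flag_strings).most_common(1)[0]
    return count >= min_same


# Twelve results with the same single message: the workunit is suspect.
print(workunit_looks_bad(["00100000"] * 12))
# Two results with different messages: more likely host problems.
print(workunit_looks_bad(["00111010", "00001000"]))
```

Comparing the full flag string keeps the check strict: a result that sets several bits at once, like the suspected overstressed GPU case, does not count toward a run of identical failures.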