I had a first look at the invalid results from BRP4. I can clearly see a certain type of failure that I haven't seen before. The worst part is that the validator didn't explicitly catch these, so these scientifically unusable results probably made their way into the canonical results. It's unlikely that we missed a pulsar discovery because of that, but still not as good as it should be.
The first thing was to update the validator so results with this property will now be marked "validate error" (instead of ending up as invalid). I'll have another look hopefully tomorrow to correlate these with specific application versions, driver versions etc. and find a way to exclude these machines from getting work and wasting computing time.
The results at first glance look so weird that I currently have little hope of making an application that could deal with these driver oddities in a way that produces acceptable results.
BM
That seems to have worked. As far as I can see, the results of the HD4600s and 4400s are now "validate errors" if they try to validate with a result from an HD4000. One example is here:
http://einsteinathome.org/workunit/211070094
However, the task is then sent out again, and if it goes to another host with an HD4600/4400 it ends up as a validate error again, as in this case:
http://einsteinathome.org/workunit/211111170
I'm afraid this goes on until the task is finally sent out to a host not affected by the problem.
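Quote:
I'll have another look hopefully tomorrow to correlate these with specific application versions, driver versions etc. and find a way to exclude these machines from getting work and wasting computing time.
I think for starters this would be very useful!
Quote:
The results at first glance look so weird that I currently have little hope of making an application that could deal with these driver oddities in a way that produces acceptable results.
Ouch. Doesn't sound good :(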
I'm pleased to note that my HD 4600 'Haswell' (host 8864187, running driver 10.18.10.3621) hasn't had a task declared 'invalid' since yesterday, but has had several validated today. I think that is evidence that the problem is 'Haswell+driver', rather than a straight hardware erratum.
I've tried to be neutral in describing this issue, as being an incompatibility between application and driver - without pointing a finger at either participant. But (and subject to Bernd's further checking in the database about correlations between these anomalous results, and the hardware/application/driver versions employed), I'm beginning to be more convinced that this is something that will have to be addressed at the driver level, with Intel and/or Khronos.
When we were discussing a similar problem between OpenCL applications and a newly-released NVidia driver for pre-Fermi NV cards, Jacob Klein found https://developer.nvidia.com/opencl - a suite of 31 standalone sample/test applications demonstrating various OpenCL capabilities. Although the pre-compiled examples are for NV cards, full sources are supplied, and I'm given to understand that the changes needed to compile for other OpenCL platforms aren't too great.
Jacob and I found that the NV driver problems caused test failures in these cases from the test suite:
oclConvolutionSeparable
oclDXTCompression
oclFDTD3d
oclParticles
oclQuasirandomGenerator
oclVolumeRender
and those simple, but replicable, samples were sufficient to get NVidia to look further into the driver. They say that their problem has now been solved, although we're still waiting for an update/hotfix driver release to incorporate the solution. That form of approach might be useful here too.
Richard Haselgrove wrote:
I'm pleased to note that my HD 4600 'Haswell' (host 8864187, running driver 10.18.10.3621) hasn't had a task declared 'invalid' since yesterday
Same for me and my HD4000 (same driver version).
I just found a couple of cases in which my HD4000 validated against an HD4600. Interestingly, in this case two HD4600s were not able to validate against each other, but I was able to validate against one of them. I think that supports your opinion that it is a driver issue.
Keep an eye on their scheduler logs and you'll eventually see what drivers they're running:
http://einstein.phys.uwm.edu/host_sched_logs/6317/6317416
http://einstein.phys.uwm.edu/host_sched_logs/11671/11671864
Claggy
Yes, I've been doing that.
Aleksey (6317416 - valid) is using 9.18.10.3186
Still waiting on tron.
And tron (11671864 - invalid) is using 10.18.10.3907
I think both cases confirm our previous expectations.
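The correlation discussed here can be sketched as a small tally of validation outcomes per driver version. This is only an illustration: the three observations are the ones reported in this thread (hosts 6317416, 11671864 and 8864187), and the names `observations`, `tally_by_driver` and `build_number` are hypothetical, not part of any Einstein@Home tool.

```python
from collections import defaultdict

# (host id, driver version, outcome) exactly as reported in this thread
observations = [
    (6317416, "9.18.10.3186", "valid"),
    (11671864, "10.18.10.3907", "invalid"),
    (8864187, "10.18.10.3621", "valid"),
]


def tally_by_driver(obs):
    """Count valid/invalid outcomes for each driver version."""
    counts = defaultdict(lambda: {"valid": 0, "invalid": 0})
    for _host, driver, outcome in obs:
        counts[driver][outcome] += 1
    return dict(counts)


def build_number(driver):
    """Last component of the dotted driver version, e.g. 3907."""
    return int(driver.split(".")[-1])


if __name__ == "__main__":
    # Print drivers ordered by build number so a cut-off becomes visible
    tally = tally_by_driver(observations)
    for driver in sorted(tally, key=build_number):
        print(driver, tally[driver])
```

Three data points prove nothing on their own, but sorted this way they show the pattern: the older builds validate and the newest one does not.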
For now I disabled (automatically) sending work to Intel GPUs with drivers of 10.18.10.3907 and newer.
Later today or early tomorrow I'll make sure that older hardware (up to HD 4000) that can be identified as such gets work as well.
I currently don't have any means in house to find out whether there is a newer driver that works. I'll set up another Beta application version, such that Intel GPUs may get work with any driver version.
BM
Ooops, there goes my HD 4000 for the duration - she'll be dry long before midnight:
2015-02-19 16:06:35.3977 [PID=20462] [version] [HOST#5744895] device name: 'Intel(R) HD Graphics 4000'; OpenCL driver version: 10.18.10.4061; platform version: OpenCL 1.2; device version: OpenCL 1.2
2015-02-19 16:06:35.3977 [PID=20462] [version] driver version 1018104061, min: 0, max: 1018103906
2015-02-19 16:06:35.3977 [PID=20462] [version] driver version required max: 1018103906, supplied: 1018104061
Now, where did I leave that copy of 3621? ;)
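The three scheduler log lines suggest how the driver check works: the dotted version appears to be packed into a single integer and compared against a configured range. Below is a minimal sketch, assuming the last three components occupy 2, 2 and 4 decimal digits. That layout is inferred only from `10.18.10.4061` showing up as `1018104061` in the log. The function names are hypothetical and this is not the actual scheduler code.

```python
def encode_driver_version(version: str) -> int:
    """Pack 'a.b.c.d' into a*10^8 + b*10^6 + c*10^4 + d (assumed layout)."""
    major, minor, build, revision = (int(part) for part in version.split("."))
    return major * 10**8 + minor * 10**6 + build * 10**4 + revision


def driver_allowed(version: str, min_version: int, max_version: int) -> bool:
    """True if the encoded version lies within [min_version, max_version]."""
    return min_version <= encode_driver_version(version) <= max_version


# Limits taken from the log lines above: min 0, max 1018103906
MIN_VERSION = 0
MAX_VERSION = 1018103906

if __name__ == "__main__":
    for v in ("9.18.10.3186", "10.18.10.3621", "10.18.10.3907", "10.18.10.4061"):
        print(v, encode_driver_version(v), driver_allowed(v, MIN_VERSION, MAX_VERSION))
```

With these limits, 9.18.10.3186 and 10.18.10.3621 pass, while 10.18.10.3907 and 10.18.10.4061 are rejected. That matches the stated rule of excluding drivers 10.18.10.3907 and newer.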
I'll run another shift for you.
Should work in a few minutes from now.
BM