Dear developers, is there any way I can participate in pinpointing and fixing the Skylake issue? I have a relatively powerful Iris 540 and, like many others here, I'm looking for useful work for it.
It's really frustrating to see how much work is thrown away because of invalid WUs. I'm not even among the worst affected, at about 40% invalid WUs, and even after deducting the invalid ones the GPU still produces a similar amount of work to the CPU, or more…
I'm not a programmer/developer, so I can't write or change the code myself, but I can run any test you want and be the one who keeps pressing Intel support to fix the issue. My long experience as a network engineer and designer at a reputable company serving large enterprises and government agencies has taught me that it's much easier to get the fix you expect if you narrow down exactly what is wrong, not just the symptoms, before opening a support case. Hence my idea: run computation tasks in parallel on the CPU and GPU and compare the results, breaking a WU into several steps to isolate what causes the wrong results, down to particular OpenCL commands/calls…
Are any of the developers willing to take this up?
We just recently got feedback from Intel about the problem and possible solutions. I briefly discussed this with Benjamin, and we will change the validation threshold slightly so that newer Intel iGPUs validate fine. I haven't had time to deploy this change yet, but I'll do it as soon as possible.
Edit (13:40 UTC): I deployed a new validator with the increased tolerance. Please test using the Beta application. If the validation rate increases, I'm going to include newer Intel iGPUs in the non-Beta application.
Thank you for the reply. Increasing the tolerance sounds really strange.
I would expect that even though you work with probabilities, the calculation itself is exact, and repeating it on the same input data always produces the same results. But this sounds like the results are close but not identical, as if a random number generator were somehow involved and a change in its physical characteristics could change the results of the calculation.
Could you provide a link explaining the essence of why the results differ?
I'll report validation results once there is a reasonable number of WUs.
This goes down to the level of assembler code that is executed on the GPU. Here is the most basic explanation I got from Intel:
Say you have the following:
Answer_mul = float0 * float1;
Answer_add = Answer_mul + float2;
This gets converted to the following in assembly:
Mul %answer_mul, %float0, %float1
Add %answer_add, %answer_mul, %float2
The value in the register "answer_mul" is rounded before the addition is done. In the Intel case (and on AArch64 too), these two instructions get fused into a single "mad" instruction:
Mad %answer_mad, %float0, %float1, %float2
The result of the mad instruction is more precise because it does not round after the multiply. And since we do a lot of summing of multiplications, the seemingly small rounding errors turn out to be significant in the end. No random numbers are involved.
Good news on the "fix" for Skylake.
Running some more WUs.
Not looking good: 1 invalid, several inconclusive.
https://www.einsteinathome.org/host/12407179/tasks
I checked the one invalid task, and the value in question is again just above our new threshold. This is somewhat expected and in the nature of thresholds. Let's see what the pending and inconclusive tasks do. At the very least we should see a better ratio of valid to invalid results over time.
In light of that I have re-enabled two machines with HD Graphics 530s, and they are now running all the Beta Intel_GPU OpenCL apps (FGRP1, BRP4, BRP4G and BRP6).
Hosts
https://einsteinathome.org/host/6181626
https://einsteinathome.org/host/2871149
BOINC blog
The 2871149 host totally over-fetched work; I have had it doing nothing else in an attempt to get it under control.
It seems all the BRP6 1.52 tasks are considered invalid, so I've aborted the remaining ones. They've been taking over 8 hours each, and I think there are enough examples of validate errors by now.
It will now process the 12 remaining BRP 1.34 tasks in the hope they might validate.
I'm at 3 valid, 5 invalid for the work on my 530; several are still pending/inconclusive.