As the title says, I haven't been able to submit a valid unit for a while, it seems.
https://einsteinathome.org/host/11700141/tasks&offset=0&show_names=0&state=5&appid=
Anyone able to shed any light?
I run SETI / Milky Way and Collatz on this GPU and they all come out fine.
All tasks reporting invalid
Can you share some information about the GPU? Model, maker, any overclocking etc?
What does your CPU run if anything? Free cores etc?
How many work units are you running in parallel?
Fiji GPUs (and some other AMD GPUs) seem to have problems when running more than 1 WU at once.
-----
OK, it's a Sapphire R9 Fury. It has had half of its fused-off cores re-enabled. I thought this might be an issue at first, but other projects run fine, and Furmark runs overnight fine and has never had a crash, so I'm eliminating that.
Only running the one unit on it, CBA messing around to get all my projects to run multiple units on the card.
Unlocked shaders might indeed be the issue. There's (sometimes) a reason why they are disabled and only AMD knows exactly why - whether they really didn't pass all validation tests, or they are just artificially disabled.
The problem might not have manifested in other projects - it might not cause a crash, just a wrong computing result, which might not be visible in tests like Furmark.
If you want to be sure, try disabling those shaders again and then run tasks here again.
-----
I'm confident it's not the shaders, but I'll flip it to the backup BIOS for a bit to check. Is there anything in the details for the units? There is a lot of detail in there and I am not sure what I am looking at.
Can anyone recommend anything that will stress the card from a compute point of view to test stability? Something that will either crash the machine if unstable or output "failed"?
I just looked at the validation logfiles. Your host is producing results that differ from everyone else's, and not just by a small margin. As a quick example: we accept a difference of less than 0.00005 (5e-5), but across most of the data your GPU's values differ by at least 0.00010 (1e-4) and by as much as 0.05. So they are rejected by the validator.
Other projects may use different instructions on the GPU, or they may validate results less strictly. I don't think that stress testing the card will help. Depending on how the stress test is implemented, its goal may only be to produce load, not to check that the calculations are done correctly.
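To make the tolerance check above a bit more concrete, here is a rough Python sketch. It is not the project's actual validator: it assumes the 5e-5 limit is applied as a plain absolute difference per value, and the function names (`find_mismatches`, `is_valid`) and all the numbers are made up for illustration.

```python
# Hypothetical sketch only: NOT the real Einstein@Home validator.
# Assumption: the 5e-5 limit is an absolute difference applied per value.
TOLERANCE = 5e-5  # acceptance threshold quoted in the post above


def find_mismatches(reference, candidate, tol=TOLERANCE):
    """Return (index, abs_diff) for each pair that differs by tol or more."""
    if len(reference) != len(candidate):
        raise ValueError("result sets must have the same length")
    return [
        (i, abs(r - c))
        for i, (r, c) in enumerate(zip(reference, candidate))
        if abs(r - c) >= tol
    ]


def is_valid(reference, candidate, tol=TOLERANCE):
    """A result passes only if every value is within tol of the reference."""
    return not find_mismatches(reference, candidate, tol)


# Made-up numbers, chosen to mirror the magnitudes mentioned above.
reference = [1.00000, 2.50000, 3.75000]
good_host = [1.00001, 2.50002, 3.74999]  # off by about 1e-5 to 2e-5
bad_host = [1.00010, 2.55000, 3.75000]   # off by about 1e-4 and 0.05

print(is_valid(reference, good_host))
print(is_valid(reference, bad_host))
print(find_mismatches(reference, bad_host))
```

A check like this only says whether the numbers agree. It says nothing about whether the card is stable under load, which is why a stress test passing and a validator rejecting results are not in conflict.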
OK thanks, I'll try swapping the BIOS tonight back to the stock number of shaders and see if that helps :)
Does it sound like something that could be caused by a slightly faulty cluster of cores?
I'll chime in here with some perspective from someone whose career was almost entirely spent on the design, testing, manufacture, and reliability issues of microprocessors.
Some home truths:
1. there is no such thing as a complete full-coverage test.
2. there is an incredibly diverse set of possible defects--and amazingly enough every unit shipped contains many locations which if you viewed them carefully you would consider defective--but most of these happen not to harm the correct logical or speed operation at any condition of interest--so shipping those is OK. But the primary containment of these is testing, and as the test is not perfect, defects that matter do ship, regularly.
3. the popular notion that running some "highly stressful" test constitutes a complete test, and that any system that passes that is perfect, so any malfunction must come from something other than the system is nonsense. Such a universal "perfect test" is likely far less comprehensive in coverage than is the manufacturer's final test--and that for certain is far from complete in coverage, and escapees reach the wild at an appreciable rate.
Getting back to the specifics of your case, I think it is clear your system is getting wrong answers, if, for the purpose of this discussion, we define "right" as the answer that would be given by a preponderance of systems with identical installed hardware and software.
This might be simply due to one or more defects in the hardware which the manufacturer marked out of service which you somehow put to use. Or it could be a defect elsewhere. Or...
My 2 cents.
In your shoes the first two things I'd do would be to turn off the revivified shaders and, if that made no difference, turn down the speeds on any accessible clocks (assuming I'd already looked over the fans to ensure they were turning as intended and the dust bunnies were well under control).
Good luck
I think he took that first option soon after he posted and seems to have all good results from that point onward.
The interesting thing is that the good results and the bad results both have about the same elapsed times, on a very rough inspection. In other words, even if there was no downside (bad results) from unlocking the extra shaders, there was no performance upside either.
Cheers,
Gary.