Sure, there is a temperature change when a CPU goes from 100% load to idle or vice versa. I've never heard of that causing silicon to crack :). If it could, then you had better not fire up your computer in the morning or shut it down at night, because there is a bigger temperature change at those times :).
Actually, several standard stress tests employed in IC reliability work involve cyclic changes in temperature (at my former employer, the two most common were called Thermal Shock and Thermal Cycling). The most common failure is not cracking of the substrate or epitaxial silicon, but rather delamination in the layers above.
More cycles are bad. A larger temperature change per cycle is bad.
From this point of view, running a BOINC application all the time, and thus keeping the die temperature much more nearly constant than it is under more normal usage, is probably reducing the probability of catastrophic delamination failure.
On the other hand, it is certainly increasing the probability of the large set of failure mechanisms which are more simply temperature related. Unless you have unwittingly purchased a die from a population with a bad thermal cycling problem, you are therefore, overall, almost certainly raising your probability of catastrophic CPU failure by running BOINC, but not because of "silicon cracking".
By the way, I'm a former practicing semiconductor reliability guy from an extremely large semiconductor manufacturer.
.... I've gone from crunching 1 work unit in about 1 hour with the old optimized applications (maybe it was 2 hours with the standard application, I don't remember) to crunching 1 work unit in about 17 hours with the standard application. This is kind of ridiculous....
A lot of people don't seem to remember correctly what has happened. I certainly get confused easily if I don't consult my records. I've kept records, so you might be interested in the brief stats for a particular machine. The numbers relate to an "average" long work unit from both the S4 and S5 data runs. Also, these comments relate to machines running Windows. Linux users are doing even better, so it seems, and if Bodley is to be believed, Mac users are doing quite nicely as well :). I only know about Windows.
Machine: P4 - 2.0GHz/256MB/20GB - WinXP SP2 - BOINC 5.2.13 - Albert app crunching S4 data
Stock crunch time : approx 11 hours
S41.07 optimised time : approx 1.5 hours
Speed increase factor : approx 7x
Credit granted : approx 50 credits
Credit rate (stock app) : approx 4.5 per hour
Credit rate (S41.07 optimised) : approx 33.3 per hour (if the quorum had two "stock" claims of 50)
The rate of credit granting was quite variable, ranging from a low of around 10 per hour to maybe 60 or more per hour at the highest.
Exact same machine - Einstein app crunching S5 data
Stock crunch time : approx 19 hours
Credit granted : approx 175 credits
Credit rate : approx 9.2 per hour
The rate of credit granting is much more consistent and transparent.
Bernd has stated that long S5 results contain 5 times the work of long S4 results. If the Albert app used in S4 were crunching an S5 result, it would take close to 60 hours on this machine, so the current Einstein app is roughly three times faster than the previous Albert app. There is room to further improve the efficiency of the Einstein app, and Bernd has stated that there is an ongoing intention to develop these improvements. I wouldn't be surprised to see worthwhile gains in efficiency over the coming months.
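If you want to check those figures, the arithmetic is simple enough. Here's a back-of-envelope sketch using the approximate times and credits listed above:

    # Back-of-envelope arithmetic for the S4 -> S5 comparison,
    # using the approximate figures quoted above.
    s4_stock_hours = 11.0   # Albert app, long S4 result
    s5_stock_hours = 19.0   # Einstein app, long S5 result
    work_ratio     = 5.0    # Bernd: a long S5 result is 5x the work of a long S4 result

    albert_on_s5 = s4_stock_hours * work_ratio     # 55 hours, i.e. "close to 60"
    speedup      = albert_on_s5 / s5_stock_hours   # ~2.9, i.e. roughly 3x

    s4_credit_rate = 50.0 / s4_stock_hours         # ~4.5 credits per hour
    s5_credit_rate = 175.0 / s5_stock_hours        # ~9.2 credits per hour

    print(albert_on_s5, speedup, s4_credit_rate, s5_credit_rate)

Plug in your own crunch times and the same pattern falls out.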
So everybody now enjoys a credit rate significantly higher than they had under the previous stock app. No longer do we need to be concerned about a credit system that was a complete lottery as to what would actually be granted for a set amount of work. The new credit system is one of the bonuses that has come out of the change to S5. The real bonus is, of course, the fact that the project is progressing 3 times faster than it was six months ago. So why should anyone be upset about how the project is going?
Here's a suggestion for a change that could be made for people who like seeing results tick over much faster. Please realise that I'm not seriously suggesting that this change should be made. I imagine that within the BOINC Manager it would be fairly easy to implement the concept of a "virtual result". BOINC downloads a single workunit but puts it in the task list as 10 virtual tasks. When the result starts crunching, the status of the first virtual task is shown as 100% complete when the underlying result is only at 10%. Then you could even simulate a virtual upload if you wished :). The second virtual task could then start ticking over - and so on until the 10 virtual tasks were all complete. At that stage the real result could be uploaded and reported and we could see 10 virtual results being uploaded and reported and could feel good about that batch of 10 results we just returned to the project.
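To put the idea in concrete terms, here's a toy sketch of the bookkeeping involved. All the names (virtual_view, N_VIRTUAL and so on) are invented for illustration; real BOINC Manager code would look nothing like this:

    # Toy sketch of the "virtual result" idea: one real workunit shown as
    # 10 virtual tasks. All names here are invented for illustration.
    N_VIRTUAL = 10

    def virtual_view(real_progress):
        """Map the real result's progress (0.0 to 1.0) onto N_VIRTUAL virtual tasks."""
        tasks = []
        for i in range(N_VIRTUAL):
            lo = i / N_VIRTUAL          # slice of the real result this task "covers"
            hi = (i + 1) / N_VIRTUAL
            if real_progress >= hi:
                tasks.append((i, 1.0, "uploaded"))   # the simulated "virtual upload"
            elif real_progress > lo:
                tasks.append((i, (real_progress - lo) * N_VIRTUAL, "running"))
            else:
                tasks.append((i, 0.0, "waiting"))
        return tasks

    # At 37% real progress, virtual tasks 0-2 show as uploaded and task 3 as 70% done:
    for task in virtual_view(0.37):
        print(task)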
I guess the point I'm trying to make is that the idea that it's somehow "better" to return 10 smaller results rather than 1 big result is simply an illusion.
By the way, I'm a former practicing semiconductor reliability guy from an extremely large semiconductor manufacturer.
Thanks for that. A very interesting and informative set of comments. I've no real knowledge in your field but I can easily imagine that thermal cycling is quite bad. The other point that does concern me is possible diffusion of doping elements in the silicon under higher temperatures. I imagine that people (like me :).) who flog their CPUs by both 100% loading and by overclocking are at possible extra risk of CPU failure due to time-related element diffusion. Mind you, in several years of being involved in DC projects using overclocked machines, I've yet to see a single CPU failure.
Here's a suggestion for a change that could be made for people who like seeing results tick over much faster. Please realise that I'm not seriously suggesting that this change should be made. I imagine that within the BOINC Manager it would be fairly easy to implement the concept of a "virtual result". BOINC downloads a single workunit but puts it in the task list as 10 virtual tasks. When the result starts crunching, the status of the first virtual task is shown as 100% complete when the underlying result is only at 10%. Then you could even simulate a virtual upload if you wished :). The second virtual task could then start ticking over - and so on until the 10 virtual tasks were all complete. At that stage the real result could be uploaded and reported and we could see 10 virtual results being uploaded and reported and could feel good about that batch of 10 results we just returned to the project.
CPDN does this, so it's doable in theory. The big question is how much stress it would put on the DB box. In theory, if nothing except a progress report was sent until the end, the DB would only need a single 1-byte field per WU to store the progress, so the load should be fairly light. Whether it's worth the diversion of effort or not is another story entirely.
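As an aside, a single byte really is enough if progress is stored as a whole percentage. A hypothetical sketch (this is not the real BOINC or Einstein@Home schema, just the shape of the idea):

    # Hypothetical sketch of the "single 1-byte field per WU" point; the
    # table and column names are invented, not the real project schema.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE workunit (id INTEGER PRIMARY KEY, progress_pct INTEGER)")
    db.execute("INSERT INTO workunit (id, progress_pct) VALUES (1, 0)")

    def report_progress(wu_id, pct):
        # Each trickled progress report boils down to one tiny UPDATE per WU.
        pct = max(0, min(100, int(pct)))
        db.execute("UPDATE workunit SET progress_pct = ? WHERE id = ?", (pct, wu_id))

    report_progress(1, 37)
    print(db.execute("SELECT progress_pct FROM workunit WHERE id = 1").fetchone())  # (37,)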
The other point that does concern me is possible diffusion of doping elements in the silicon under higher temperatures. I imagine that people (like me :).) who flog their CPUs by both 100% loading and by overclocking are at possible extra risk of CPU failure due to time-related element diffusion.
( I think the CPU would fail for other reasons long before the dopants migrate.... )
The point is that cooler is better, and we should all inspect, buff, clean and replace our cooling systems regularly. I bought this USB-powered mini vacuum cleaner last week. Seriously - it sucks the fans and sinks quite nicely! I also got a cheap ( $40 AUD ) hand-held blower too - it looks and works like a rechargeable drill but pumps air instead - to which I attach some plastic tubing with a nozzle so as to reach those hard-to-get places.
One of my boxes has:
- an entrance and exit fan for the case ( big, low RPM, high flow ).
- an entrance and exit fan for the power supply.
- a CPU fan and thumping great sink, with an extensible cylindrical tunnel from the lateral case side to it.
- a GPU fan and sink.
- a northbridge fan and sink.
I'm a quaint gearhead :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
A lot of people don't seem to remember correctly what has happened. I certainly get confused easily if I don't consult my records.
Well, as I said, I didn't remember the standard application times because I very quickly changed to the optimized applications under S4. Thanks for the information; I didn't realize that the standard S4 app took ~11 hours to run. Akosf made some great optimizations that improved its speed by a factor of 7. Perhaps you're right that the change isn't so drastic if it's going from 11 hours to 19 hours while the amount of work is 5 times greater. But when you're used to pumping out 1 work unit an hour, 1 work unit per day seems *awfully* slow.
As for the credit thing, I didn't bother to work it out. I think part of the argument was that the credit/hour was greater for small work units than for large work units. I'm not so interested in gaining credit; I mainly like to help return the work quickly. And the changes from S4 optimized to S5 standard have made it take so much longer that I'm disappointed. Still, thanks for the information. I really didn't realize the standard application took so long in S4.
The other point that does concern me is possible diffusion of doping elements in the silicon under higher temperatures.
The temperatures employed during manufacture are substantially higher than any you could reach with a still-functional CPU.
Thermal diffusion of dopants during the manufacturing steps (and, more importantly, variation thereof) is a genuine concern, but when last I was active in this art it was not even on the list of end-use failure mechanisms. Other things kill you first.
Boring as it may sound, for populations not afflicted with special problems, a big part of in-service failure probability (when last I was privy to the data) was simple oxide breakdown associated with unintended shapes: ones you would have called defects had they been just a bit worse and caused a zero-time failure. People misinterpret this as saying that gate oxides at their intended thickness, or other oxides, are failing, but that is (usually, for decent production processes) not the point. Nearly all defect classes which create shorts also create "almost shorts". Some fraction of those leave the dielectric thick enough to survive test and burn-in, but thin enough to fail in service. Many, many field failures are nothing more complicated than that. The only two perfect remedies are "zero defects" and "zero use", which are about equally practical. However, "fewer defects" and "lower temperature of use" both help, for simple statistical reasons.
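The poster doesn't quantify the temperature effect, but the standard tool for this in reliability work is the Arrhenius acceleration model. A rough sketch, with an assumed activation energy of 0.7 eV (a typical order of magnitude for illustration, not a figure from this thread):

    # Rough illustration of why "lower temperature of use" helps: the standard
    # Arrhenius acceleration model. The 0.7 eV activation energy is an assumed,
    # typical-order value, not a number from this thread.
    import math

    K_BOLTZMANN = 8.617e-5   # eV per kelvin

    def acceleration_factor(t_cool_c, t_hot_c, ea_ev=0.7):
        """Relative speed-up of thermally activated failure mechanisms at the hotter temperature."""
        t_cool = t_cool_c + 273.15
        t_hot = t_hot_c + 273.15
        return math.exp((ea_ev / K_BOLTZMANN) * (1.0 / t_cool - 1.0 / t_hot))

    # e.g. a die at 70 C vs. 50 C: thermally activated mechanisms run roughly
    # 4x faster at the higher temperature (with Ea = 0.7 eV).
    print(acceleration_factor(50, 70))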
That is why I cringe each time I see someone posting that their CPU worked for tens of hours at nnn degrees C and thus is "reliable". The speed impact of temperature (which is what getting-the-right-answer tests respond to) and the catastrophic-failure impact of temperature have almost nothing to do with each other.
So, as our esteemed moderator has suggested, follow good practice to keep your CPU as cool as is convenient, and accept that with CPUs, as with automobiles, using them does eventually hurt their likelihood of continuing to work. For neither one is not using them a reasonable alternative.