A case in point: the ABS computer in my car has developed a fault. The garage assures me that it's safe to drive, because the computer has built-in self-monitoring, detects the fault, and switches itself off so the car drives like an old model without ABS fitted. I know how to cope with that, even with snow on the road.
The trouble comes when I reboot the computer (turn the ignition key). For the first couple of miles, the self-monitoring doesn't notice the fault, and the ABS tries to 'help' me slow down - actually quite alarming the first time it did it. So now I have to drive extra-carefully to start with, until the fault monitoring kicks in, and only then can I drive as normal.
We had a similar problem many years ago on a brand-new fire engine we bought. Antilock (ABS) brakes were new technology back then, and they didn't work correctly on this engine. The shop ended up removing them altogether and we went back to standard air brakes, which worked fine for the life of the fire engine.
No, computers aren't perfect; that is the whole point of a test workunit: to make sure each individual piece of hardware is up to snuff, project-wise. Yes, our PCs can do email and internet and whatever just fine, but can they produce scientific results with sufficient accuracy to be helpful? Or do they have a problem that produces scientific junk? If the former, let's welcome them in and encourage their participation. If the latter, let's let them know, so they can either choose a different, less demanding project, or not crunch.

I think inherently we all want to believe our computers are decent and up to snuff; unfortunately that just is not true in all cases. Some of that can be maintenance (dust, etc.), some could be because we do upgrades that are not well thought out and actually make things worse. Sometimes it is the manufacturer that makes compromises that we then choose to buy, based on decisions we make at the time of purchase. Ideally BOINC would put out a PCI card that could be installed in any computer and used to crunch on, but that is just not going to happen.
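To make the test-workunit point concrete: conceptually, projects catch bad hardware by sending the same workunit to several hosts and only accepting a result that a quorum agrees on. Here is a minimal sketch in Python; the function name, quorum size, and tolerance are illustrative assumptions, not BOINC's actual API.

```python
def validate(results, quorum=2, tol=1e-9):
    """results: the same workunit's answer as computed by different hosts."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tol]
        if len(agreeing) >= quorum:
            # Enough hosts agree: average them as the canonical result.
            return sum(agreeing) / len(agreeing)
    return None  # no quorum: flag the hosts for a test workunit


# The third host's divergent answer is simply voted down.
print(validate([1.3338204, 1.3338204, 1.3337391]))
```

The weakness the thread keeps circling back to is visible right in the sketch: if most hosts share the same flawed CPU, the quorum happily agrees on the wrong answer.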
Every big scientific mainframe is now made up from thousands, or even hundreds of thousands, of CPUs, be they Intel, AMD, Power or even Cell CPUs. If they are faulty, so are the mainframes. See the Top500 list if you don't believe this. So if the Roadrunner mainframe, made up from AMD CPUs and Cell CPUs, gives good scientific results, so does my PC.
Tullio
Something tells me that Roadrunner doesn't suffer from dust bunnies to quite the same extent as your or my computer.
Well, my Sun workstation is completely open at its front panel and has very good airflow, so dust bunnies are unlikely. I appreciated the Sun engineering design, as I had appreciated the design of the Onyx computers by Scott McNealy in the Eighties, the first UNIX computers sold in Italy.
Tullio
The problem with injected test signals is that they tend to be singular. And yes, I was aware of that test-signal injection. Just as I am aware that there are situations where the black box tests fine in the aircraft, tests bad at the black-box level at the next level of maintenance, and the cards detected as faulty test good at the circuit-card level. I was a member of a group that ran these tests at the three levels of Navy maintenance. The problem is that the testing methods used in each case are different ... and so detect different failure modes.
With single test injection you may be able to detect that one signal correctly, but what about the others?
As to the point about resistance to the new methods ... well, all the more reason to increase the level of rigor, to prevent naysayers like me, and those other staid scientists holding up their noses at this new-fangled way, from holding it back. And the reason I have been bringing this up is that I am well aware of the resistance-to-the-new-way problem ... which, as I said, is all the more reason to be more rigorous ...
The fact that you have not had a compute failure on CPDN is truly remarkable ... I have not had that many models run flawlessly over the long haul ... I get streaks ... but the CPDN models are known to be unstable, and computer errors and crashes are part of the territory.
But the problems I am talking about are the more insidious ones, where you have a "silent failure" that is not an obvious death of the model/task, but one where all looks good. And, as I have pointed out, we have the classic problem of the Pentium FDIV bug in a whole line of CPUs, which means that "poisoned" calculations can occur ... but if the CPU is in wide use, then all the outputs will agree, even though they are wrong.
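The FDIV case is easy to demonstrate. The widely reported consistency check below should print exactly 0; on a flawed Pentium the division came back as 1.333739... instead of 1.333820..., leaving a residual of 256. (The printed values are the reported behaviour of that era's hardware; take the numbers as historical, not something to reproduce today.)

```python
# Classic Pentium FDIV consistency check: divide, multiply back,
# and compare against the original numerator.
x, y = 4195835.0, 3145727.0
residual = x - (x / y) * y
print(residual)  # 0.0 on a correct FPU; 256.0 was reported on flawed Pentiums
```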
There are other long-standing bugs that have shown up in our computers; one was the calculator that effectively could not subtract two numbers correctly, giving rise to the situation where 1 - 1 was not zero (I forget the exact error values).
Oh, and in many fly-by-wire aircraft, if you untether the pilot he can initiate a movement of the aircraft that will cause it to come apart in flight. So the plane will miss the mountain, but the parts, and you, will not ... raining debris all over that mountainside you wanted to miss ...
To add to what you've said ...
I'm a former electronics worker for the U.S. Navy. One of the most frustrating things for us was when we would run a test on some subsystem, only to have it fail. Then, we'd try to troubleshoot it, and it would mysteriously start to pass. It makes one wonder how many silent, intermittent failures occur inside of a microprocessor.
Read "The soul of a new machine" by Tracy Kidder. It's all being told in that book, which I translated into Italian.
Tullio
Darn cosmic rays ... when I was in the field, that's what I blamed all my flaky faults on :-)
Edit: Or radioactive trace elements in the IC packaging ... but they took care of that a while back.
There are some who can live without wild things and some who cannot. - Aldo Leopold
I've read that germanium transistors are more resilient to nuclear radiation than silicon transistors, and are therefore used in integrated circuits to be sent into space. But cosmic radiation also reaches ground level, and can carry very energetic particles, mostly protons, as far as I know. Recently a strong component of antiprotons has been measured in space, but the atmosphere shields against it through a proton-antiproton annihilation mechanism.
Tullio
Thus you should remember one of the main lessons from the book: that machines are built by man and can contain flaws ... the wire-wrap parties to keep up with the change orders come to mind.
And the next point: the diagnostics that were part and parcel of the design of that machine. You can buy superficial diagnostic tools, Norton, TechTool 5 and the like ... MemTest, etc. ... but how many of us run those as a normal part of a maintenance cycle?
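For anyone who has never looked inside such a tool: at bottom, a memory tester just writes known patterns and checks that they read back. A toy sketch follows, nowhere near the thoroughness of a real tester like MemTest86, which runs below the OS against raw physical memory with many patterns and address schemes:

```python
# Toy memory check: fill a large buffer with a bit pattern and verify it.
# This only exercises user-space memory; real testers run below the OS.
import array

def pattern_test(n_words=1_000_000, pattern=0xA5A5A5A5):
    buf = array.array('I', [pattern]) * n_words           # write phase
    bad = [i for i, w in enumerate(buf) if w != pattern]  # read-back phase
    return bad  # an empty list means every word read back intact

print(pattern_test()[:10])  # any indices printed here would be suspect words
```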
Ronald Reagan got one thing right ... "Trust but verify"... we just trust ...
As to the other point you raised, about clusters of PCs being used to make supercomputers ... this is true ... and, as some scientists have found out, a serious problem, because some problems are not amenable to being divided up and parallelized. Very large weather models, for example, where the connection between adjacent cells is critical and the ability to pass values back and forth is needed. With loosely coupled computers this is impractical, because the network's bandwidth is limited and its latency is high.
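To see why the weather-model case is so communication-hungry, consider a toy grid model: every timestep, each cell needs its neighbours' values, so on a cluster the boundary ("halo") rows must cross the network on every single step. A sketch with numpy, single-machine only; the padding call stands in for the halo exchange a real distributed code would perform:

```python
import numpy as np

def step(grid):
    # 5-point stencil: each cell is averaged with its four neighbours.
    padded = np.pad(grid, 1, mode='edge')  # stand-in for the halo exchange
    return (padded[1:-1, 1:-1] + padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:]) / 5.0

grid = np.random.rand(512, 512)
for _ in range(10):   # ten timesteps means ten rounds of boundary traffic
    grid = step(grid)
```

On a tightly coupled machine that exchange is a memory copy; over a commodity network it is a round trip per step, which is exactly where loosely coupled clusters fall down.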
In the old days there were any number of companies that made supercomputers (Cray et al.), and they were of this tightly coupled architecture. Then the loosely coupled model hit, and with the lower cost of the systems it came to dominate.
You are certainly right. When I worked on the Onyx computers in the Eighties, hard disks (then called Winchester disks) were very fragile, so we made incremental backups every day on streamer tapes. Then somebody started putting 8-inch and 5.25-inch floppy disks on Unix systems instead of tapes (they cost less), and nobody made any backups. Now I back up my personal files to flash drives in a USB port, and the OS stays on CDs and DVDs. Times have changed.
Tullio