I have long advocated that we test and validate the machines being used, in addition to redundant computing, because real science is about the accuracy of the results and the care with which we produce them.
Redundant computing seems good enough to me - I'm sure all the projects that get back a particularly interesting result re-run the relevant WUs on their own hardware.
How could you possibly reliably test & validate the BOINC hosts anyway?
An easy way, which was rejected many, many moons ago, was to send every host a small test workunit as the first unit it gets. That unit would have certain parameters the host would have to meet in order to crunch for that project, one of which would be to come up with a known result, within a certain margin of error.
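As a rough sketch of how a project might score such a test unit on the server side, assuming a hypothetical reference answer generated on trusted hardware and a project-chosen error tolerance (the names below are invented for illustration and are not part of BOINC):

# Hypothetical sketch: compare a host's result for a known test workunit
# against the project's reference answer, allowing a chosen relative error.
def passes_test_workunit(host_result, reference_result, rel_tol=1e-6):
    """True if every value the host returned is within the allowed
    relative error of the corresponding reference value."""
    if len(host_result) != len(reference_result):
        return False
    for got, expected in zip(host_result, reference_result):
        scale = max(abs(expected), 1e-30)   # guard against division by zero
        if abs(got - expected) / scale > rel_tol:
            return False
    return True

reference = [3.141592653589793, 2.718281828459045]
good_host = [3.141592653590100, 2.718281828459100]   # only rounding noise
bad_host  = [3.141592653589793, 2.717900000000000]   # possible faulty FPU
print(passes_test_workunit(good_host, reference))    # True
print(passes_test_workunit(bad_host, reference))     # False

A real validator would compare whatever structured output the science application produces, but the principle is the same: a known input, a known answer, and an agreed error margin.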
Indeed! That could be done regularly, say every 100 WUs or so one would be such a test unit, and it could be required/agreed that awarding credit was contingent upon passing it. No doubt one could design code behaviour to detect unstable/undesirable host characteristics. To get around users who might wish to change host features during such a test run, you could even have code embedded in each WU, some small but fault-sensitive task (consuming say 1% of the workload) to validate host 'honesty' (the hardware, not the user).
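A minimal sketch of what that embedded 'honesty' task could look like, assuming the project picks some cheap, deterministic, FPU-sensitive side calculation and precomputes its answer on trusted hardware (the particular calculation and all names below are illustrative, not anything Einstein@Home actually ships):

import math

def honesty_check(n_terms=100000):
    """A small, deterministic calculation that exercises the floating-point
    unit: a partial sum of the Leibniz series for pi."""
    s = 0.0
    for k in range(n_terms):
        s += (-1.0) ** k / (2 * k + 1)
    return 4.0 * s

def host_passes(reported, expected, rel_tol=1e-9):
    # The tolerance must allow for legitimate cross-platform rounding
    # differences while still catching genuinely faulty hardware.
    return math.isclose(reported, expected, rel_tol=rel_tol)

expected = honesty_check()   # computed once by the project on trusted hardware
reported = honesty_check()   # what the host would embed in its result file
print(host_passes(reported, expected))   # True on a healthy host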
This type of thing could be a useful mechanism, outside of the quorum/validation pathway, both to ensure solid compliance with the science problem and to be seen to do so. That would encourage any otherwise reluctant institutions/projects/efforts to have greater confidence in the distributed computing paradigm. It would not require any alteration of BOINC per se, simply a design decision in the executable provided by the project and in the post-processing.
A viewpoint we as volunteers could find hard to appreciate is that the scientists involved here and in other projects could well be taking a professional/career risk in relying on us. In this sense I must say Prof Allen has taken a bold/brave step by entrusting what is real cutting-edge science to us at all! Clearly that is also to do with economics, his especial interest in hardware (see Our Prof in the News), and the sheer volume of material/data that must be sieved to yield substantial/sensible scientific output. Quite frankly I'm amazed we are in the loop at all on this project ... see "Gravity's Shadow" for insights into the sociological aspect of this corner of science.
So calibration of each host could then occur against some clever piece of code designed to weed out faulty hardware, as well as via the current technique of testing hosts against the remainder of the machine pool.
Cheers, Mike.
( edit ) It's not simply the case that any 'positive' result flagged by one of our machines will necessarily be re-examined by in-house procedures. Bear in mind that, to date, a number of conclusions have been drawn regarding upper bounds on astrophysical events, i.e. what we didn't find!
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
In two of the projects I am running, the quorum is 1: AQUA@home and QMC@home. They both make use of Monte Carlo methods. I once asked the QMC developer how they separate the wheat from the chaff in results coming from different hardware/software combinations, and he told me, without detailing how, that the Monte Carlo method allows this.
Tullio
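Tullio's developer didn't say how, but one plausible reading, offered here purely as a guess, is that a Monte Carlo estimate carries its own statistical error bar (the standard error shrinks like 1/sqrt(N)), so even a single returned result can be checked against the expected spread. A toy illustration, not QMC@home's actual validation:

import math
import random

def mc_estimate_pi(n_samples, rng):
    """Monte Carlo estimate of pi: fraction of random points in the unit
    square that fall inside the quarter circle, times 4."""
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n_samples

def looks_plausible(estimate, n_samples, true_value=math.pi, n_sigma=5.0):
    # Standard error of the hit fraction, propagated to the pi estimate.
    p = true_value / 4.0
    sigma = 4.0 * math.sqrt(p * (1.0 - p) / n_samples)
    return abs(estimate - true_value) <= n_sigma * sigma

rng = random.Random(42)
est = mc_estimate_pi(100000, rng)
print(looks_plausible(est, 100000))    # True: within the expected spread
print(looks_plausible(3.3, 100000))    # False: far outside it, flag the host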
Mike has a good summary of one of the things I proposed.
The only place I went further than he did is that, in my proposal, the user does not even know which tasks are calibration/test tasks and which are actual run data. The external presentation would be the same, and the credit awarded likewise.
SaH, and I think Einstein, would be among the easiest to build replicable tests for, in that artificial test signals can be generated ... since the content of the signals is known, the capability of the system can then be accurately validated.
The problem with redundant comparison is that if two computers have the same flaw, such as the FDIV bug in Intel's Pentium chips, we can have perfect agreement on the wrong answer. In a case such as this the only way to detect the error would be to compare the return from a machine that has the bug with one that does not ... but even then, using the "third time's the charm" system (as in BOINC), if the next machine has the same bug then the wrong answer is selected as the "correct" answer.
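That failure mode is easy to see in a toy "first two results that agree win" quorum. The sketch below follows the same general idea as redundant validation but is not the actual BOINC validator code, and the numbers are made up:

import math

def pick_canonical(results, rel_tol=1e-9):
    """Return the first value on which at least two hosts agree, else None."""
    for i, a in enumerate(results):
        for b in results[i + 1:]:
            if math.isclose(a, b, rel_tol=rel_tol):
                return a
    return None

correct = 2.718281828    # what a healthy machine returns
flawed  = 2.718100000    # the same wrong value from two hosts sharing a flaw
print(pick_canonical([flawed, correct, flawed]))   # the wrong answer wins

Two hosts with the same systematic flaw outvote the one correct machine, exactly as described.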
Since Intel CPUs and AMD CPUs are the majority, maybe comparing a result obtained on an Intel CPU with one from an AMD CPU would be sufficient. I remember that, when working at the Trieste Science Park, I could compare the results of a computation done on a MIPS CPU and a SUN CPU, and they were different, although both machines ran a UNIX OS. The floating-point processors used different rounding methods.
Tullio
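Tullio's observation is the reason result comparison across heterogeneous hosts is normally done within a tolerance rather than bit-for-bit. A tiny Python illustration; the different summation orders below merely stand in for different floating-point rounding/accumulation behaviour on different CPUs, and this is not any project's code:

import math

a = 0.1 + 0.2 + 0.3      # "machine A" evaluation order
b = 0.3 + 0.2 + 0.1      # "machine B" evaluation order
print(a == b)                              # False: last-bit mismatch
print(math.isclose(a, b, rel_tol=1e-9))    # True: agree within tolerance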
Then have a NICE way of telling the user that their hardware is not up to the standards of whatever project, and lead them to a list of other BOINC projects they could try. Maybe something like: "This project is doing science that requires a very high level of hardware reliability. Your computer does not currently meet that level; here is a list of other BOINC projects you could try instead." Then invite them to come back and retest their machine if something changes in the future, and give them the full current list of BOINC projects.
Fair point. Does anyone know how Stardust @ Home dealt with those contributors who they thought may have been 'overzealous' in their reporting?
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
A viewpoint we as volunteers could find hard to appreciate is that the scientists involved here and other projects could well be taking a professional/career risk in relying on us.
Is the risk that the BOINC hardware is less reliable than a large in-house super-computer / cluster a significant risk?
Is the risk that BOINCers deliberately return incorrect results more often than another researcher / institution fakes their results a significant risk?
I understand false results from BOINC could lead to interesting WUs being ignored by a project. But surely, prior to publishing, every project either checks the interesting WUs on their own hardware, or ignores oddball outliers?
And as for "check WUs":
How do you know that the in-house hardware used to compare the "check WU" is accurate?
How do you ensure that the "check WU" will quickly find a hardware issue given that the issue might not appear until the hardware is stressed (run 24x7, or overheats)?
Well, it's been a while since I graded any photos for Stardust.
IIRC, scoring there is primarily a rating of how well you do on the 'calibration' slides which they throw at you quasi-randomly.
Other than that, they list and rank participants by the total number of slides they have reviewed. They also list the potential candidates a participant has found, and a few more tidbits.
Slides which a participant flags as having a possible dust particle target are then sent out again for review by other participants, and if enough agree, the slide is tagged for review by the lab staff and possibly the actual gel sample will get examined.
Alinator
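A toy sketch of the review flow Alinator describes, with the agreement threshold and minimum review count invented for illustration (this is not Stardust@home's real scoring code):

def review_slide(other_votes, agree_threshold=0.7, min_reviews=5):
    """Decide what happens to a slide one participant flagged as a possible
    dust-particle candidate, based on later reviewers' True/False votes."""
    if len(other_votes) < min_reviews:
        return "send to more reviewers"
    agreement = sum(other_votes) / len(other_votes)
    if agreement >= agree_threshold:
        return "tag for lab staff review"
    return "set aside"

print(review_slide([True, True, True, False, True]))     # tag for lab staff review
print(review_slide([False, True, False, False, False]))  # set aside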
Quote:
Is the risk that the BOINC hardware is less reliable than a large in-house super-computer / cluster a significant risk?
Yes, it is. Machines used and maintained by professionals will inherently be more reliable. Also, depending on the machine, it may have specialized checking hardware built in to detect and report problems. And regular maintenance is part and parcel of the operation of a facility such as that ...
Quote:
Is the risk that BOINCers deliberately return incorrect results more often than another researcher / institution fakes their results a significant risk?
Already proven. One of the primary motivations for the redundant tasks is to check that they are not being "forged" and that dummy results are not being returned. The same reasoning is why you cannot move tasks from one computer to another.
In SaH Classic there was a whole group of people who were just "cloning" result files and not doing any work at all ... they were just feeding junk back to the system ... all so they could have higher numbers of "processed" tasks ... so they could "win" the race ... but that kind of obviates the whole point of the race in the first place ...
Quote:
I understand false results from BOINC could lead to interesting WUs being ignored by a project. But surely, prior to publishing, every project either checks the interesting WUs on their own hardware, or ignores oddball outliers?
Um, therein lies the point. Unless they duplicate all the work, if we miss "The Signal", they will never know that it was there ...
Quote:
And as for "check WUs":
How do you know that the in-house hardware used to compare the "check WU" is accurate?
How do you ensure that the "check WU" will quickly find a hardware issue given that the issue might not appear until the hardware is stressed (run 24x7, or overheats)?
If I "build" the task from scratch I have no need to process it, I know what is in the task because I am the one that created the signals / data that is in the task. In a simple example I create a task on SaH and put into that task exactly one pulse ... well, it should be pretty easy to find that pulse ...
My objection to the use, as current standard, of real world signals is that with all the noise in the sample you really don't know what is in there. We run them and say that we know what is in there ... but, it is not something that you can eyeball to validate ... with generated and "clean" test signals you can actually look at the data in the file and can see the test signals.
After we have validated the system, and note that this is not only a test of the computer hardware, it is also a test of the running software, we can then add noise to the test samples and rerun the tests ...
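A minimal sketch of that validation idea, with a trivial threshold "detector" standing in for the real analysis code; nothing here is the SaH or Einstein search, it just shows the plant-a-known-pulse, then add-noise sequence:

import random

def make_data(n=1000, pulse_index=417, pulse_height=10.0, noise_sigma=0.0, seed=1):
    """Synthetic data stream containing exactly one planted pulse."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, noise_sigma) if noise_sigma > 0 else 0.0
            for _ in range(n)]
    data[pulse_index] += pulse_height
    return data

def detect_pulses(data, threshold=5.0):
    """Trivial stand-in detector: indices of samples above the threshold."""
    return [i for i, x in enumerate(data) if x > threshold]

# Step 1: clean data -- we know exactly what a correct system must report.
print(detect_pulses(make_data(noise_sigma=0.0)))   # [417]
# Step 2: the same known pulse, now buried in Gaussian noise.
print(detect_pulses(make_data(noise_sigma=1.0)))   # should still contain 417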
Were the SaH people to come to me to tell me about their wonderful system and ask for money, these are the types of questions I would be asking. How do you know that the software is working? How did you prove it? Etc ...
And the problem is that they don't have the "chain" of proof that you would expect ... at least not to my mind ... but then again ... I like rigor ... I spent too much time out on the pointy end with someone's unproven thingie and I got tired of being shafted ...