After looking at your invalid tasks, I would guess that there is some sort of incompatibility or integration problem between the Einstein applications, the ATI drivers and the OpenCL implementation. It looks like a lack of card resources when running 2X that immediately throws an exception-handling event at task startup. I think the Einstein developers need to look closely at this. I don't believe it has anything to do with one specific host. The first thing I would do is update to the latest 7.6.6 BOINC Manager, since it includes some specific fixes made for the SETI and MW projects to prevent invalids. It might help with Einstein. The second would be to set some of the debug flags in the cc_config.xml file using the BM interface. I would set coproc_debug, mem_usage_debug, checkpoint_debug, statefile_debug and task_debug. Then post the log results for an invalidated task so we can see if we can figure out just what the application or BOINC is complaining about that causes the invalid.
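For anyone who would rather create the file directly than go through the Manager, here is a minimal Python sketch that builds a cc_config.xml with the debug flags listed above switched on. The helper name build_cc_config and the DEBUG_FLAGS list are made up for illustration, and the flag names are the ones I understand the BOINC client to accept inside <log_flags>, so check them against the client configuration documentation for your version. The script only prints the XML; it does not touch the BOINC data directory.

```python
import xml.etree.ElementTree as ET

# Assumed flag names for the <log_flags> section of cc_config.xml.
# Verify them against the BOINC client configuration docs before use.
DEBUG_FLAGS = [
    "coproc_debug",
    "mem_usage_debug",
    "checkpoint_debug",
    "statefile_debug",
    "task_debug",
]


def build_cc_config(flags):
    """Return a cc_config.xml document (as a string) with each flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is its own element; "1" turns the extra logging on.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config(DEBUG_FLAGS))
```

If you use the output, merge it by hand with any options already in your existing cc_config.xml rather than overwriting the file, then have the client re-read its config files or restart it so the extra messages show up in the event log.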
Maybe this could help
The WU crunches normally for the usual time; the problem apparently appears only at the end of the process.
Hi Keith,
I don't think the problem lies with the BOINC Manager version; I already have several machines on 7.6.6... I also don't think there's a card/system resource issue at play here. The 'Activated exception handling' message in the stderr output is informational only and not a sign of error; all Windows hosts display this line (my Linux hosts do not). Take a look at the output from your own (valid) tasks and you will find that line :-) I also don't believe setting debug flags will help in this instance because, as Juan BFB states, the tasks run normally to completion. The validation server then decides the result is rubbish and marks it as an error... From my end all has proceeded correctly.
In answer to John's (Chase 1902) question, none of the cards applicable to this thread are overclocked beyond factory settings and the same is true for all but two of my cards (non 'X' 280's running OC'd under Kubuntu with tasks x4 without issue for many months).
Gav.
Gav,
Sorry, my question about overclocking wasn't really related to the thread; I probably should have said that.
I'm just interested in getting better throughput. My systems are in desperate need of updating, cleaning etc., and at the same time I could do with optimizing them a bit better.
John
Yes, I agree. I should have looked at my own valid tasks and noticed that all tasks get the exception message. The fact that the task apparently runs normally to completion, and only then does the server decide it is invalid, lends credence to my suspicion that the stderr.txt output is getting truncated or not closed correctly at the time of reporting. This is the issue we fought for over half a year at SETI and MilkyWay. The new 7.6.6 BOINC client was created to resolve it. I would suggest Juan at least try the update, even though you state you are running 7.6.6 BM and seeing the issue on your own machines, Gavin. I would still like to see the logfile output for a failed task. I would also set the slot_debug flag because it helps show just how a task gets moved into and out of a slot. The issue with MW and SETI was that the slots weren't getting cleared of their previous occupant before a new task was assigned to them, which corrupted the stderr.txt file at MW. The new 7.6.6 client added some code to make sure that Windows had enough time to close out the files properly. You can read about it in my "What is the cause of these 'validate errors'" thread at MilkyWay@Home, and in the "Stderr Truncations" thread over at SETI@Home. I strongly suggest the new 7.6.6 client first.
Thanks, will do that ASAP. The computer is at a remote location.
Now running 2 at a time with 7.6.9. Let's see if the error really disappears.
Do let us know how you make out with the 7.6.9 client. It rolled up some of the corrections and additions for 7.6.6. Some questions about VBoxWrapper are the only thing outstanding.
Changed to BOINC 7.6.9 and the problem remains.
Yeah, Juan, I took a look at your tasks today and saw the problem remains. The only thing left to do would be to set some of the cc_config.xml flags I suggested and post the resulting logfile entries to the thread for a task that was marked invalid. I'm no expert, but I do have experience with setting the flags and looking for problems in the way a task moves onto the GPU and off again for reporting, from troubleshooting my own invalids at MilkyWay, which led to the fix in the 7.6.6 client. Once you post the logfile I would like to entice Richard Haselgrove and Jason Gee to look the logs over. They understand the nitty-gritty of how BOINC works and were instrumental in developing the fixes for 7.6.6 that David Anderson implemented.
Have you heard of anyone else having issues that exactly mirror your symptoms? There is always a chance that you have a real hardware problem that only rears its head when the card is stressed with more than one work unit. I had a recent CPU failure, and it took some time to diagnose why it was producing invalids. I would be interested in seeing the mem_usage_debug and coproc_debug results in the logfile.
Cheers, Keith
Hi Juan long time,
One interesting thing in the stderr of the invalids is that the sumspec pages on some of them seem to increase way beyond what I would consider a normal value. Some of the entries are over 4000 pages, while the successful valids (run one at a time) are normal at around 500. If you are only running two at a time, I would not expect sumspec to be much over 1500.
At least that is the way it works on my HD7950's.
Glad you are back and posting from Panama, sorry you couldn't bring your GTX690's with ya.
Tom* aka Bill