As of this morning I've given up on working this problem. I removed the dual 750Ti cards and put a single GTX 1050 in the primary slot, and resumed production work.
In summary, it appears that the combination of the Windows 10 Fall Creators Update, the most recent Nvidia driver, and my system configuration somehow leads to a situation in which recognition of the installed graphics cards partly fails, for at least three distinct combinations of cards I've tried. Clinfo and BOINC both report that only one card is installed, and in all three cases where there were really two cards, it was the card in the secondary slot which BOINC and clinfo reported to be present (and which BOINC actually used). When only a single card is installed, in the primary slot, it is recognized by clinfo and BOINC and used successfully by BOINC.
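For anyone wanting to repeat the clinfo check after each card swap, here is a minimal sketch that totals the device count from saved clinfo text. It assumes the output contains lines of the form "Number of devices  N", which is how the clinfo builds I know of print it; the function name and the sample text are made up for illustration and are not a captured log from my machine.

```python
import re


def count_opencl_devices(clinfo_text: str) -> int:
    """Sum the 'Number of devices' entries across all platforms.

    Assumes each platform block has a line like 'Number of devices   2'
    (optionally with a colon after 'devices').
    """
    total = 0
    for line in clinfo_text.splitlines():
        match = re.match(r"\s*Number of devices:?\s+(\d+)\s*$", line)
        if match:
            total += int(match.group(1))
    return total


# Hypothetical sample text, illustrative only (not real output).
SAMPLE = """\
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce GTX 750 Ti
"""

if __name__ == "__main__":
    found = count_opencl_devices(SAMPLE)
    installed = 2  # number of cards physically in the box
    if found < installed:
        print(f"clinfo sees {found} of {installed} installed cards")
```

Saving the clinfo output to a file before and after a driver change and feeding both through the same function makes it easy to spot when a card drops out.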
I reviewed the BOINCstats top Einstein host list, and found more than one top 100 host which runs Windows 10, has installed the Fall Creators Update, runs Nvidia, and runs more than one card. I don't know their install date, so can't rely on their credit history to confirm or deny that they share my pain.
I had two more dual-GPU machines running Windows 10 which had yet to take the big Fall Creators Update, and was very concerned that I'd lose the ability to run a second GPU on them as well.
Finally today I ran the update on the machine I call Stoll7. To my great surprise, it immediately was able to run Einstein on both GPUs, with no additional work (no installing a fresh Nvidia driver, no DDU scrub).
I'll be trying the third machine soon, and will post here my success or failure.
"Finally today I ran the update on the machine I call Stoll7. To my great surprise, it immediately was able to run Einstein on both GPUs, with no additional work (no installing a fresh Nvidia driver, no DDU scrub)."
Today I ran the Fall Creators Update on my daily driver, a dual GPU (1070 + 1060) host I call Stoll9.
I'm happy to report that after the 2.5 hour process it was immediately able and willing to run Einstein on both GPUs, continuing at the 2X configuration I've recently been using.
The sole relevant oddity I've noticed so far is that the device 0 and device 1 designations, as reported by BOINCTasks (and thus presumably BOINC itself) and by TThrottle, have swapped.
Oddly, the little script I run to set my overclocks shortly after I log in, which uses Nvidia Inspector command-line commands, set the clocks as I wished, so apparently the 0, 1 designation as seen by that application at that moment was as before.
I don't know whether my success with two machines where one dismally failed (to see and use both GPUs) was due to some oddball difference in the failing machine, or whether Microsoft or Nvidia fixed something in the weeks between.
I should mention that the Einstein web page for that machine describes it as running Nvidia driver 388.13. That must have been installed as part of the 2.5 hour update process, as that is not a version I have ever downloaded to this machine. So in the last two cases I have reported, running the Windows Fall Creators Update installed a fresh Nvidia driver which was able to talk to and run with BOINC and Einstein successfully as initially configured.
archae86 wrote: "The sole relevant oddity I've noticed so far is that the device 0 and device 1 designations, as reported by BOINCTasks (and thus presumably BOINC itself) and by TThrottle, have swapped."
Updates. The device number oddity reversed again when I rebooted. Good news.
Bad news: the GTX 1060 has repeatedly and suddenly downclocked its core clock to 1506 MHz. I've never before seen that rate on this card. This happens to be a card on which I ran very fine-grained overclock tests for many weeks, just ended. While I have seen compute errors (mostly number 28) and a few other things, I've never before seen a 1506 downclock. 1506 is a long way down from the slightly over 2000 it has been stably running at, with a nominal requested core overclock of 175.
1506 is not some random number, nor a crazy low "safe mode" like the classic 405 memory clock, but instead is actually the nominal core clock for this card. Everything above that is boost clock plus overclock. But under normal conditions I have never seen any of the five Pascal cards I own run Einstein without substantial boost clock (hundreds of MHz).
I first saw the 1506 downclock about two hours into the first run after the Creators Update install. It went away with a reboot, but returned after two hours of uptime. I uninstalled the graphics driver, and did a DDU-enhanced clean install of the latest Nvidia driver. That ran about six hours before dropping to 1506. All of these runs were at the overclocks I had recently and laboriously found to be safe.
I next rebooted and halved both the memory and core clock overclocks. That stayed up for 2.5 hours before falling to 1506. I next set it to zero overclock (which, by the way, gives a core clock of 1848, not 1506, because of what Nvidia calls Boost clock). So far that has gotten a little past four hours without losing core boost, but I have little confidence.
I don't have good candidate causes for an instant full loss of core overclock AND core boost clock (while the memory overclock stays intact). I don't know what about the Fall Creators Update might make this more likely, nor why only this card, of the five Nvidia Pascal cards I currently run on Einstein work under the Fall Creators Update, is doing this.
I have some more observations regarding the persistent 1506 MHz core clock state my 1060 has repeatedly dropped into since this software update.
It even downclocked there after about six hours when I started it with a requested core clock overclock of -100 MHz. Mind you, since the request is with respect to default boost, that run started up at a core clock of 1759 MHz.
In the 1506 state GPU-Z reports the card as sitting at a VDDC of 0.843 V, whereas it had been stably at 1.0378 V. No surprise that the combined clock and voltage reduction gave a big reduction in reported power consumption, down to 45.7% of TDP as opposed to 65.8%. In the 1506 state GPU-Z reports the PerfCap reason as "Idle", where before it reported something like VRel. Also no surprise that the reported GPU temperature came down by about 9 degrees C.
The power reduction is so great that, in terms of power efficiency of BOINC work production, I may well be getting a good deal, even after amortizing system overhead. At the card level I'm definitely getting a power efficiency improvement.
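To put a rough number on the card-level claim, here is a back-of-envelope sketch. It rests on assumptions that are not measurements: that Einstein throughput scales linearly with core clock (probably pessimistic, since the memory clock is unchanged), that 2000 MHz stands in for the "slightly over 2000" overclocked state, and that the 65.8% TDP reading belongs to that same state. The function name is just for illustration.

```python
def relative_efficiency(clock_before: float, clock_after: float,
                        power_before: float, power_after: float) -> float:
    """Ratio of work-per-watt after vs. before the downclock.

    Assumes throughput is proportional to core clock, which is only
    an approximation for a real GPU workload.
    """
    throughput_ratio = clock_after / clock_before
    power_ratio = power_after / power_before
    return throughput_ratio / power_ratio


# Core clock in MHz, card power as percent of TDP as shown by GPU-Z.
# 2000.0 is an assumed stand-in for "slightly over 2000".
ratio = relative_efficiency(2000.0, 1506.0, 65.8, 45.7)
print(f"card-level work per watt, after/before: {ratio:.2f}")
```

Under those assumptions the ratio comes out a little above 1, that is, somewhat more work per watt at the card in the 1506 state. Whole-system efficiency would need the fixed host power added to both sides, which shrinks the advantage.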
As I'm running out of ideas, I may just embrace my power reduction and accept the rather moderate loss of total Einstein useful output.
Any bright ideas?