Pascal again available, Turing may be coming soon

archae86
Joined: 6 Dec 05
Posts: 3158
Credit: 7244800062
RAC: 1306388

Moments ago, I finally submitted a trouble report to Nvidia, using the feedback site that the Nvidia forum on Reddit points to: Nvidia driver feedback.

I was able to say that four of four Einstein users reporting their Turing card experience have seen the same rapid failure syndrome on the "high-pay" WUs, and that one person (Vyper from SETI) had successfully reproduced the problem relying only on the ZIP file test case I provided them.  (This was the portable test I developed with massive guidance from Juha, and extra help from Gary Roberts and Richard Haselgrove).

I see a series of obstacles:

- We are not their dominant user base, and they are probably knee-deep in new release issues

- They are probably not used to being pointed to ZIP files with test environments

- If they do see my test case fail, they may lack tools to investigate what is going on

- They may be inclined to blame the application

- They may request application instrumentation

- If they think they understand the problem, it still may not make it onto the fix priority list

But I've done what I can, and the input here made it vastly better than it would otherwise have been.  Thank you.

Keith Myers
Joined: 11 Feb 11
Posts: 4992
Credit: 18831198789
RAC: 5829723

Maybe you could explain to me your use of "high-pay" and "low-pay" task terminology in your posts.  As far as I have been able to figure out, Einstein uses a fixed credit mechanism that allots 3465 credits for a GPU task, no matter how long it takes to compute.  So how can there be a higher or lower paying task?

archae86
Joined: 6 Dec 05
Posts: 3158
Credit: 7244800062
RAC: 1306388

Keith Myers wrote:
So how can there be a higher or lower paying task?

We are doing piecework.  Constant credit per piece, but the high-pay units take far less time to finish.  So the pay rate (per unit time) is much higher on the high-pay units.
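To put numbers on the piecework idea, here is a minimal sketch of the pay-rate arithmetic. The 3465 credits per GPU task is the figure quoted earlier in this thread; the two elapsed times are made up purely for illustration and are not measurements from any host, and the function name `credit_per_hour` is my own.

```python
# Fixed credit per GPU task, as quoted earlier in this thread.
CREDIT_PER_TASK = 3465.0


def credit_per_hour(elapsed_seconds: float, credit: float = CREDIT_PER_TASK) -> float:
    """Return the credit earned per hour of run time for one task."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return credit * 3600.0 / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical elapsed times, chosen only to illustrate the idea.
    high_pay_seconds = 300.0   # assumed: a task that finishes quickly
    low_pay_seconds = 600.0    # assumed: a task that takes twice as long

    print("high-pay:", credit_per_hour(high_pay_seconds), "credits/hour")
    print("low-pay: ", credit_per_hour(low_pay_seconds), "credits/hour")
```

Same credit per piece in both cases; only the time per piece changes, so the rate per hour moves in inverse proportion to the elapsed time.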

Gary Roberts spoke against my terminology also, but I half thought he was joking.  Maybe I should adopt another, but I don't like the one he proposed, either.

By the way, I don't even know whether the lethal difference between the two different work types distributed in the last month is actually in the data files, template files, or the (very long) string of input parameters.  Personally, I suspect the input parameters.  Vyper tried hacking off crudely more than half the parameter string on my test case, and the application then got the GPU going.  But of course it seems unlikely the result would have met requirements, so it is a stretch to say that made it "work".

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117979638242
RAC: 21685111

archae86 wrote:
... Gary Roberts spoke against my terminology ...

No I didn't!  I called it 'sexy and fashionable' and then said I wasn't complaining :-).

archae86 wrote:
... but I half thought he was joking.

As I certainly was.  I just knew someone was bound to come along sooner or later and ask the obvious question that was just begging to be asked :-).  So I tried to make a joke about it so you wouldn't have to spend time explaining that there weren't actually any tasks that 'paid' more than the standard amount :-).  Looks like that didn't work too well either :-).

archae86 wrote:
Maybe I should adopt another ...

Don't you dare do that!!  Your chosen terminology is part of the folklore now so changing it would be a disaster :-).  It's just like the old days when somebody came up with the term 'wingman' instead of 'quorum partner' (or some other more official equivalent - if there ever was one).  Everyone quickly got to know what 'wingman' meant and if this chopping and changing of tasks with distinctly different duration continues, we'll certainly need a popular term for it.  High-pay and Low-pay are as good as any.

 

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4992
Credit: 18831198789
RAC: 5829723

Ahh, OK, got it.  At Seti, we call the fast-computing Arecibo tasks "shorties".  You are correct: within a short time on the forums, the vernacular shorthand becomes common and accepted.  OK, low-pay and high-pay it is.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2968465235
RAC: 693706

archae86 wrote:
Vyper tried hacking off crudely more than half the parameter string on my test case, and the application then got the GPU going.  But of course it seems unlikely the result would have met requirements, so it is a stretch to say that made it "work".

But if he could file a proper bug report stating which parameters were 'hacked off', that might point a programmer to the area of code which is either incompatible with, or needs re-compiling for, the new hardware. RTX cards aren't going away - and going by previous experience, people will just throw them into a working machine and break it. Einstein will have to get the debugger and the compiler out sooner or later, or suffer the error rate.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

Until the new GW tasks are fully sorted out and transitioned from beta to production, I doubt E@H is going to have any developer resources available to look into other problems.

mmonnin
Joined: 29 May 16
Posts: 292
Credit: 3444636540
RAC: 2214915

Not sure what NV will do, considering that a different dataset runs OK. On the other hand, only the RTX cards are having issues. I'd think a joint investigation with E@H would be needed to really resolve it. Otherwise, as mentioned, we're just a minority.

archae86
Joined: 6 Dec 05
Posts: 3158
Credit: 7244800062
RAC: 1306388

As of somewhat over half a day ago, Einstein's current issue of Gamma-ray Pulsar GPU work has switched from the recent string of O104* files, which have "high-pay" characteristics and fail fast on Turing cards, to 1025L file work. On the established naming pattern, I expect this to be low-pay work that will function entirely properly on Turing cards.

I plan to work down my stock of existing work before putting the 2080 card back in the box, but if anyone has an interest in trying out their Turing, now would be a good time.

archae86
Joined: 6 Dec 05
Posts: 3158
Credit: 7244800062
RAC: 1306388

And now there are five Einstein users with the same syndrome: Turing fast failures on Einstein high-pay GRP WUs.

User CElliott has a 2070 host (the first of that variant for which we have any report here).

The system has processed 22 high-pay WUs in the 104V file, all failing, with typical elapsed time around 22 seconds.  Error 36 is returned.

The user reports seeing a short dark screen interval, and has observed the error report "Display driver nvlddmkm stopped responding and has successfully recovered" (this specific observation matches one by Vyper when he tried out my portable trial directory).

 
