Latest data file for FGRPB1G GPU tasks

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 618397977
RAC: 902880

That is what I am seeing too. My Nvidia 2080 Ti Founders Edition completed its first successful Einstein execution. Interestingly, it was slightly slower than the EVGA 1080 Ti, at 510 seconds vs. 495.

GPU-Z shows power at 80% of TDP, GPU temperature at 70 degrees, and 65% GPU load.

Oops. Just got a 2008L task and it failed.

 

archae86 wrote:
Gary Roberts wrote:

A new data file LATeah0104Y.dat came into play more than 12 hours ago.  It has the same size as, and would appear to be a continuation of, a previous series that ended with LATeah0104X.dat (first mentioned in the opening post of this thread).  The tasks based on the new file will most likely crunch faster than the previous 2103L tasks.

Based on the fact that 0104X tasks did fail on Turing GPUs, I imagine the new ones will fail as well, unfortunately.

 

As was typical for that series of data files in the past, 0104Y was only in new issue for a few days.  Now we are getting new work issued from 1041L.  If the past behavior of the file-name groups holds, these tasks would be predicted to have long elapsed times and to work correctly on Turing cards with the current applications and drivers.  The data file size is 819,029 bytes, which exactly matches the data file size Gary Roberts reported for previous groups of tasks in that series.

To re-state our observations on similar-behaving sets of Einstein Gamma-Ray Pulsar tasks:

Filename   Size (bytes)   Elapsed time   Turing
10nnL         819,029     longest        works
0104?       2,720,502     shortest       fails
20nnL       1,935,482     intermediate   fails

These behaviors have held true since Turing cards appeared on Einstein late in September 2018.  The 2103L file run very recently is, for this purpose, classed in the 20nnL group because of the renaming mentioned earlier in this thread.
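To make the grouping concrete, here is a minimal sketch assuming the group is determined purely by the digits and letter in the file name, as observed above. classify_datafile() and the mapping are my own illustration, not a project tool, and the outcomes are just the observations from the table:

```python
# Hedged sketch: classify an FGRPB1G data file into the observed groups from
# the table above. Group names and outcomes are forum observations only.
import re

GROUPS = {
    "10nnL": {"size": 819_029,   "elapsed": "longest",      "turing": "works"},
    "0104?": {"size": 2_720_502, "elapsed": "shortest",     "turing": "fails"},
    "20nnL": {"size": 1_935_482, "elapsed": "intermediate", "turing": "fails"},
}

def classify_datafile(name: str) -> str | None:
    # e.g. "LATeah1041L.dat" -> "10nnL", "LATeah0104Y.dat" -> "0104?",
    #      "LATeah2103L.dat" -> "20nnL" (renamed; classed as 20nnL above)
    m = re.match(r"LATeah(\d{4})([A-Z])\.dat", name)
    if not m:
        return None
    digits, letter = m.groups()
    if digits.startswith("10") and letter == "L":
        return "10nnL"
    if digits == "0104":
        return "0104?"
    if digits.startswith("20") or digits == "2103":
        return "20nnL"
    return None

print(classify_datafile("LATeah1041L.dat"), GROUPS["10nnL"]["turing"])  # 10nnL works
```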

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232297973
RAC: 1158135

rjs5 wrote:
That is what I am seeing too. My Nvidia 2080 Ti Founders Edition completed its first successful Einstein execution. Interestingly, it was slightly slower than the EVGA 1080 Ti, at 510 seconds vs. 495.

Interesting:

User Jan's 2080 Ti host has recently run a few 1041L tasks successfully, with elapsed times around 367 seconds.  There must be an important configuration difference from your machine.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 19

DanNeely wrote:
I think having a single moderator on point to send "have you seen this" messages to project staff is probably sufficient.

To my understanding, this has been the modus operandi up to now, and it has worked quite well from my point of view. I'm not sure where or why it broke down in this particular case. If this process does need some adjustment, we can and should of course talk about that. I'll open a thread on the moderators' mailing list this week.

Cheers,
Oliver

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 19

OK, I presume you guys are longing for any kind of feedback from us, so I'd rather post updates as I/we get them than have you wait for a full solution. Please take all of these coming updates as preliminary.

  • I'll regard this thread as the canonical one for the RTX 2080 problem, unless advised otherwise (by you)
  • Bernd is currently unable to look into this, so I took over for the time being
  • I'm currently running a task on our GeForce RTX 2080 Ti on 64-bit Linux (Driver 410.73 / OpenCL 1.2 CUDA 10.0.185)
  • Workunit: LATeah1041L_180.0_0_0.0_17609907
  • App: hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl-nvidia OK (runtime 6:12.18)
  • App: hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia OK (runtime 6:22.58)
  • Will try Peter's test set next
  • Shot in the dark: error -36 on Windows could indicate a driver timeout issue. You could try increasing the TdrDelay and/or TdrDdiDelay registry settings to rule that out (see the sketch after this list). You could also disable those timeouts entirely by setting TdrDebugMode to 1, but that could lock up your system, so please proceed with caution.
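A minimal sketch of that registry change (Python on Windows, run as administrator, reboot to apply). TdrDelay and TdrDdiDelay are the standard GraphicsDrivers TDR values; the 60-second figure is just an example, not a project recommendation:

```python
# Sketch only: raise the Windows TDR (Timeout Detection and Recovery) limits.
# Run as administrator and reboot afterwards; pick your own delay values.
import winreg

TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, TDR_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
    # Seconds the GPU may appear hung before Windows resets the display driver.
    winreg.SetValueEx(key, "TdrDelay", 0, winreg.REG_DWORD, 60)
    # Seconds the OS waits for the driver to respond to the reset request.
    winreg.SetValueEx(key, "TdrDdiDelay", 0, winreg.REG_DWORD, 60)
```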

Stay tuned...

 

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 19

Update:

  • I ran Peter's test case on Linux, switching the app only, and it fails there as well - GOOD!
  • The test on Linux seems to fail at the same stage as the Windows apps - GOOD!
  • The Linux app does hang instead of returning an error -> this corroborates my idea that there might be a (protective) timeout involved on Windows which Linux doesn't have/use
  • The underlying issue is thus definitely related to the dataset being analyzed and not a general incompatibility with the RTX 2080 cards
  • I already see a major difference between the two sets so that gives us something to look into

Cheers

Einstein@Home Project

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 618397977
RAC: 902880

Thanks MUCH for the update. It tells me that you were able to devote some time to this problem AND that you are seeing the exact behavior that I am. Very good.

Oliver Behnke wrote:

Update:

  • I ran Peter's test case on Linux, switching the app only, and it fails there as well - GOOD!
  • The test on Linux seems to fail at the same stage as the Windows apps - GOOD!
  • The Linux app does hang instead of returning an error -> this corroborates my idea that there might be a (protective) timeout involved on Windows which Linux doesn't have/use
  • The underlying issue is thus definitely related to the dataset being analyzed and not a general incompatibility with the RTX 2080 cards
  • I already see a major difference between the two sets so that gives us something to look into

Cheers

Keith Myers
Joined: 11 Feb 11
Posts: 4977
Credit: 18789002557
RAC: 7520879

Were there previous failures with Titan V cards on these fast datasets? If so, I would suspect the same mechanism is in play.  Does the science app expect the same architecture for Turing cards as for Pascal?  There is a difference in how those card families report how many cores per SM.

Pascal seems to use cores_per_proc = 128

Turing and Titan V seem to use cores_per_proc = 64

So if you are setting up your arrays with a parameter set that expects 128 cores per SM when what you actually have is 64 cores per SM, then I would think there would be issues.

The BOINC developers already had to change the code in BOINC to properly calculate peak_flops values for the new Turing cards.
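To illustrate why the per-SM core count matters for that bookkeeping, here is a minimal sketch (not BOINC's or the science app's actual code; the SM counts and clocks below are example figures) of how a peak-FLOPS estimate shifts when a Turing or Volta card reports 64 cores per SM instead of Pascal's 128:

```python
# Hedged sketch: estimate theoretical peak FP32 throughput from the SM count
# and per-SM core count that differ between GPU generations.
CORES_PER_SM = {
    "Pascal": 128,   # e.g. GTX 1080 Ti (compute capability 6.1)
    "Volta": 64,     # e.g. Titan V (7.0)
    "Turing": 64,    # e.g. RTX 2080 Ti (7.5)
}

def peak_gflops(arch: str, sm_count: int, boost_clock_ghz: float) -> float:
    # Each FP32 core retires one fused multiply-add per clock = 2 FLOPs.
    return CORES_PER_SM[arch] * sm_count * boost_clock_ghz * 2

# Example: a 68-SM Turing card at ~1.5 GHz vs. a 28-SM Pascal card at ~1.6 GHz.
print(peak_gflops("Turing", 68, 1.5))   # ~13056 GFLOPS
print(peak_gflops("Pascal", 28, 1.6))   # ~11469 GFLOPS
```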

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232297973
RAC: 1158135

Oliver Behnke wrote:

The underlying issue is thus definitely related to the dataset being analyzed and not a general incompatibility with the RTX 2080 cards
I already see a major difference between the two sets so that gives us something to look into

Some time ago it occurred to me that, in addition to differing in data and template files, the Turing-"good" and Turing-"bad" task groups might differ systematically in one or more input parameters in the project-provided command string, which might influence the bad Turing outcome.

Comparing the input parameter strings for the two groups of tasks, I identified 

Alpha
Delta
skyRadius
IdiBins
Df1dot

as potential candidates, purely on the basis of having fixed values within each group that differed between the two groups.
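A minimal sketch of that comparison (the parameter names come from the list above; the helper functions and sample values are made up for illustration, not actual task data):

```python
# Sketch only: find command-line parameters that are fixed within each task
# group but differ between the "good" and "bad" groups. Values are invented.
def constant_params(tasks: list[dict]) -> dict:
    # Parameters that take a single fixed value across every task in a group.
    first = tasks[0]
    return {k: v for k, v in first.items()
            if all(t.get(k) == v for t in tasks)}

def candidate_params(good_tasks: list[dict], bad_tasks: list[dict]) -> dict:
    good, bad = constant_params(good_tasks), constant_params(bad_tasks)
    # Fixed in both groups but with different values -> candidate suspects.
    return {k: (good[k], bad[k]) for k in good.keys() & bad.keys()
            if good[k] != bad[k]}

good = [{"Alpha": "1.33", "Delta": "-0.05", "IdiBins": "512"}]   # illustrative
bad  = [{"Alpha": "4.27", "Delta": "0.73",  "IdiBins": "1024"}]  # illustrative
print(candidate_params(good, bad))
```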

One question I tested was: could I change the behavior of a "good" task to a fast fail by changing a single parameter value from the one used in the good group to the one used in the bad group?  The answer was YES, and, to my surprise, it was true of two of these five parameters individually: Alpha or Delta.

So the data file plus template file of the troublesome tasks is not necessary to create the Turing fast failure.  But another test was to alter all five of these values on a failing task to their "good group" values.  This did not convert the failing task into a passing task.

Assuming I actually did what I intended to do, it appears to me that the application code responds both to the data and template file input and to at least two of the command line parameters in ways which trip the condition that leads to the Turing-associated fast fail.

I did not include this result in the summary I prepared for project staff recently, and don't think I posted it on the forums here.  Quite likely it is not useful.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 19

Update:

  • Ran Peter's test case on a Quadro GV100 (Volta): same FAILURE
  • Ran Peter's test case on a GeForce GTX 1080 Ti (Pascal): different ERROR, similar "area"
  • Interestingly, the GTX 1080 Ti and RTX 2080 Ti both sport 11 GB of memory, yet they throw errors at different stages

This all paints a pretty clear picture right now. I'm curious which NVIDIA GPUs were able to process this and similar datasets (i.e. all LATeah0xxxy) at all in the past. If you guys know any for sure, please let me know. I'm going to dig through our archives in the meantime. I'm also going to review any potential code changes that might play a role here.

Cheers,
Oliver

Einstein@Home Project

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232297973
RAC: 1158135

Oliver,

My test case should have worked on the GTX 1080 Ti.  Many of those cards are successfully running work here at Einstein, including the group of tasks represented by the test case.

I'm afraid you may have found a flaw in my test case--or at least an imperfection in how portable it is.

In my flotilla, I have a GTX 1050, GTX 1060 3GB, GTX 1060 6GB, and GTX 1070--all Pascal cards that correctly run work in the LATeah0104? group.

On the other hand, if on a Windows RTX 2080 Ti machine the test case generates a black screen (and driver restart) about seven seconds after initiation and terminates with the reported error syndrome after about 25 seconds, then you probably really are seeing the behavior of interest for that test case--so perhaps the test case is not completely useless.
