Gravitational Wave Engineering run on LIGO O1 Open Data

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1589395724
RAC: 762526


Richie wrote:
Betreger wrote:
Since the GPU app arrived, most work stalls after completing with a "waiting to acquire lock" message. The tasks eventually clear and validate. I don't know if this is a feature or a bug.

Hi! That was a bug in v0.12, which is now deprecated. The current version is v0.13.

Well, v0.13 seems to have that bug on my GTX 1060.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117580009872
RAC: 35190334


DanNeely wrote:
Zalster wrote:

Gary, what about an exclude gpu in the cc_config?

 

Your attempt to limit the exclusion to GW GPU tasks didn't work. It also showed the GPU as missing for my Fermi tasks, failed over to a backup project, and at some point in the process began aborting the Fermi GPU tasks (I managed to stop BOINC and revert the change before it took out more than 50 or 60 of them).

Dan,
I'm not sure who you were referring to when you said, "Your attempt ...".  Your message is responding to Zalster but I suspect it might have been my suggestion that caused your grief.  If it was, I'm very sorry for the adverse effects.

I don't use a cc_config.xml on most machines.  For the few that I do, it was for the purpose of increasing the daily quota when the fast crunching GPU tasks were on offer a while ago.  I've perused all the cc_config options again now and have noted the extra functionality available for the <exclude_gpu> option.  Last time I looked at that, (probably a long time ago) I don't remember it being able to select on the basis of the app name.  I thought it worked just at the project level.  I'm not familiar with that option at all so I could easily be mistaken.

I think Zalster's suggestion should work.  I imagine you just want to exclude the GW GPU app so no device_num or type should be needed, just the project URL and the app short name.  If the documentation describes things accurately,  just add the <exclude_gpu> stuff inside the <options> clause of your current config file or create the file outright if you don't already have one.  Here is what I believe you would need if you were starting with a new file and not adding further options.  Read the documentation if you're not sure but it seems to me that the following should do what you want.

<cc_config>
    <options>
        <exclude_gpu>
            <url>http://einstein.phys.uwm.edu</url>
            <app>einstein_O1OD1E</app>
        </exclude_gpu>
    </options>
</cc_config>
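
For anyone who already has a cc_config.xml with other options in it, the merge described above could be sketched in Python as follows. This is only an illustration, not part of BOINC: the function name `add_exclude_gpu` is hypothetical, it assumes the file has the `<cc_config>`/`<options>` layout shown above, and `ET.indent` needs Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET


def add_exclude_gpu(config_xml: str, url: str, app: str) -> str:
    """Return config_xml with an <exclude_gpu> entry for (url, app) under <options>.

    Hypothetical helper: an empty string is treated as "no config file yet".
    """
    if config_xml.strip():
        root = ET.fromstring(config_xml)
    else:
        root = ET.Element("cc_config")
    if root.tag != "cc_config":
        raise ValueError("root element must be <cc_config>")

    # Reuse the existing <options> element so other settings are preserved.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    # Do nothing if the same exclusion is already present.
    for existing in options.findall("exclude_gpu"):
        same_url = (existing.findtext("url", "") or "").strip() == url
        same_app = (existing.findtext("app", "") or "").strip() == app
        if same_url and same_app:
            return ET.tostring(root, encoding="unicode")

    entry = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(entry, "url").text = url
    ET.SubElement(entry, "app").text = app

    ET.indent(root, space="    ")  # pretty-print; Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_exclude_gpu("", "http://einstein.phys.uwm.edu", "einstein_O1OD1E"))
```

The returned text would still have to be saved as cc_config.xml in the BOINC data directory and picked up by the client before it has any effect.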

Once again, sorry if my suggestions stuffed things up for you.  One of the side effects of using an app_config.xml file (as opposed to the current cc_config suggestion) is that you can't remove the influence by just deleting the app_config.xml file.  This is because stuff from the file gets permanently inserted into the state file.  So if your attempt with app_config has ended with just deleting the file, there might still be issues.

The documentation says you need to reset the project to clean out that inserted stuff.  I've found that just editing the file to change all options back to the default values seems to work OK without actually performing a full project reset.  I also believe it should be possible to find what has been inserted into the state file and manually remove it while the client is not running.  If you show exactly what you were using when the problems occurred, it might be possible to see what would be best to do.

You also asked about sched_request and sched_reply.  When your client contacts the scheduler, the full content of the request and the reply are stored as xml files in the BOINC data directory, replacing the previous exchange.  You can always grab a copy of these before a subsequent exchange overwrites them :-).  The project URL is part of the name so there should be a pair of files for each different project you support, documenting the most recent exchange.
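
Grabbing those copies before the next exchange overwrites them could be scripted roughly like this. It is a sketch under assumptions: the glob patterns `sched_request_*.xml` and `sched_reply_*.xml` are my guess at how the files are named, the function name `snapshot_sched_files` is hypothetical, and the location of the BOINC data directory depends on your installation.

```python
import shutil
import time
from pathlib import Path


def snapshot_sched_files(data_dir, backup_dir):
    """Copy scheduler request/reply XML files into a timestamped folder.

    The file name patterns are assumptions; adjust them to match what is
    actually present in your BOINC data directory.
    """
    data_dir, backup_dir = Path(data_dir), Path(backup_dir)
    target = backup_dir / time.strftime("%Y%m%d-%H%M%S")
    copied = []
    for pattern in ("sched_request_*.xml", "sched_reply_*.xml"):
        for src in sorted(data_dir.glob(pattern)):
            # Create the folder only when there is something to copy.
            target.mkdir(parents=True, exist_ok=True)
            dest = target / src.name
            shutil.copy2(src, dest)  # copy2 also keeps the modification time
            copied.append(dest)
    return copied
```

Calling it right after a manual project update would keep one pair of files per project from that exchange.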

 

Cheers,
Gary.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1589395724
RAC: 762526

Overnight, this is what I have been doing almost exclusively:

Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 (GW-opencl-nvidia-V1)

h1_0421.75_O1C02Cl2In0__O1OD1E_421.95Hz_1170_0 397449078 2 Apr 2019 15:37:08 UTC 5 Apr 2019 19:26:24 UTC Error while computing 0 0 0

Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 () windows_x86_64

 

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0


Betreger wrote:

Overnight, this is what I have been doing almost exclusively:

Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 (GW-opencl-nvidia-V1)

h1_0421.75_O1C02Cl2In0__O1OD1E_421.95Hz_1170_0 397449078 2 Apr 2019 15:37:08 UTC 5 Apr 2019 19:26:24 UTC Error while computing 0 0 0

Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 () windows_x86_64   

 

That was an old v0.08 task that was buried a long time ago. But I see that host has been struggling with current v0.13 tasks today. The error looks different for them. For example, this task: https://einsteinathome.org/task/841574537

[ERROR] Couldn't get OpenCL device from BOINC (-1)!

Your host is currently running an Nvidia driver that is almost 2 years old. That made me wonder whether the driver could be a problem for this GPU application. A newer driver version might be worth trying.

EDIT: I can't see a GPU on that host now. Did I mix up two different hosts... did they both have a GTX 1060?

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1589395724
RAC: 762526


Digging a bit deeper, BOINC says the GTX 1060 is missing. A reboot did not help. The card does show up in Device Manager, which thinks it's OK. I have no clue what to do next.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0


I would try these two things:

1. Update BOINC to the newest version (unless there's a special reason for running the older version).

2. "Clean install" latest Nvidia driver:

a) Download the driver from the Nvidia web site: https://www.nvidia.com/download/driverResults.aspx/145870/en-us
b) Disconnect the internet connection.
c) Run the driver installer, choose custom installation, and check the "clean install" box below the list of components.
d) Reboot after installation.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1589395724
RAC: 762526


The latest and greatest driver seems to have fixed the problem ATM. Why the old driver ceased working boggles my mind.

Now the task is to process some work and see if it validates. 

Thanx 

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0


Gary Roberts wrote:
DanNeely wrote:
Zalster wrote:

Gary, what about an exclude gpu in the cc_config?

 

Your attempt to limit the exclusion to GW GPU tasks didn't work. It also showed the GPU as missing for my Fermi tasks, failed over to a backup project, and at some point in the process began aborting the Fermi GPU tasks (I managed to stop BOINC and revert the change before it took out more than 50 or 60 of them).

Dan,
I'm not sure who you were referring to when you said, "Your attempt ...".  Your message is responding to Zalster but I suspect it might have been my suggestion that caused your grief.  If it was, I'm very sorry for the adverse effects.

I don't use a cc_config.xml on most machines.  For the few that I do, it was for the purpose of increasing the daily quota when the fast crunching GPU tasks were on offer a while ago.  I've perused all the cc_config options again now and have noted the extra functionality available for the <exclude_gpu> option.  Last time I looked at that, (probably a long time ago) I don't remember it being able to select on the basis of the app name.  I thought it worked just at the project level.  I'm not familiar with that option at all so I could easily be mistaken.

I think Zalster's suggestion should work.  I imagine you just want to exclude the GW GPU app so no device_num or type should be needed, just the project URL and the app short name.  If the documentation describes things accurately,  just add the <exclude_gpu> stuff inside the <options> clause of your current config file or create the file outright if you don't already have one.  Here is what I believe you would need if you were starting with a new file and not adding further options.  Read the documentation if you're not sure but it seems to me that the following should do what you want.

<cc_config>
    <options>
        <exclude_gpu>
            <url>http://einstein.phys.uwm.edu</url>
            <app>einstein_O1OD1E</app>
        </exclude_gpu>
    </options>
</cc_config>

Once again, sorry if my suggestions stuffed things up for you.  One of the side effects of using an app_config.xml file (as opposed to the current cc_config suggestion) is that you can't remove the influence by just deleting the app_config.xml file.  This is because stuff from the file gets permanently inserted into the state file.  So if your attempt with app_config has ended with just deleting the file, there might still be issues.

The documentation says you need to reset the project to clean out that inserted stuff.  I've found that just editing the file to change all options back to the default values seems to work OK without actually performing a full project reset.  I also believe it should be possible to find what has been inserted into the state file and manually remove it while the client is not running.  If you show exactly what you were using when the problems occurred, it might be possible to see what would be best to do.

You also asked about sched_request and sched_reply.  When your client contacts the scheduler, the full content of the request and the reply are stored as xml files in the BOINC data directory, replacing the previous exchange.  You can always grab a copy of these before a subsequent exchange overwrites them :-).  The project URL is part of the name so there should be a pair of files for each different project you support, documenting the most recent exchange.

 

 

With the exception of not trying to narrow it further by a plan class, that looks the same as the cc_config addition that Zalster suggested I try, which did disable the GPU for the Fermi app in addition to the GW one.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0


I tried your cut-down version (minus the stray ]), same result.  BOINC appears to parse the config correctly but then doesn't handle it the right way and turns off the GPU for everything:

 

https://i.imgur.com/D2MGAeC.png

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0


A few minutes after thinking I'd wrapped up the tests and restored everything to normal, I noticed my Fermi GPU tasks still weren't running, even after I'd reloaded the cc_config file version that should have re-enabled them.

 

I had to restart BOINC to get them running.  Having seen one case where the reload config option wasn't working correctly, I tried inserting Gary's suggested lines in again.  This time it did work, at least to the extent of running Fermi GPU and Einstein CPU tasks while excluding Einstein GPU tasks. That makes me think something is still not right about how cc_config updates set BOINC's state when the file is reloaded while running versus read at startup.

 

I'm also not convinced the BOINC client correctly understands what work it can request.  Shortly after I got the cc_config working with the apps I do and don't want to run, it downloaded a GPU task from one of my 0% share backup projects, and when I aborted it as unneeded the client promptly downloaded a task from a different GPU backup.  The event log for an E@H request I manually triggered doesn't look quite right in that it's not reporting the status of the GPU (my box without the cc_config changes reports "CPU: job cache full; NVIDIA GPU: job cache full").

 

4/21/2019 6:56:10 PM | Einstein@Home | Sending scheduler request: Requested by user.
4/21/2019 6:56:10 PM | Einstein@Home | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: )
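
To spot this kind of blank status without reading the event log by eye, a small parser along these lines might help. It is a sketch only: the regular expression is inferred from the two log lines above and is an assumption about the format, and the function name `resource_status` is hypothetical.

```python
import re

# Matches e.g. "Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: )"
LINE_RE = re.compile(r"Not requesting tasks: [^(]*\((?P<detail>.*)\)\s*$")


def resource_status(line):
    """Map each resource named in the parenthesised part to its status text.

    Returns None for lines that are not "Not requesting tasks" messages.
    """
    match = LINE_RE.search(line)
    if match is None:
        return None
    status = {}
    for part in match.group("detail").split(";"):
        name, _, text = part.partition(":")
        status[name.strip()] = text.strip()
    return status


if __name__ == "__main__":
    line = ("4/21/2019 6:56:10 PM | Einstein@Home | Not requesting tasks: "
            "don't need (CPU: job cache full; NVIDIA GPU: )")
    status = resource_status(line)
    # Resources with an empty status are the suspicious ones.
    print([name for name, text in status.items() if not text])
```

For the line above, the NVIDIA GPU entry comes back with an empty status, whereas the unmodified box would report "job cache full" for both resources.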

 

I uploaded the sched request/responses for that update.  They're large enough that I don't want to do a blind search through them, but if you or anyone else has an idea of what to look for specifically they're both on pastebin.

 

request:

https://pastebin.com/39Xh5Cm9

response:

https://pastebin.com/vLAbJ6GJ

 

I'll be keeping an eye on my system for the next day or so. If it continues to not download any new Fermi tasks, I'll have to revert the cc_config changes and either resume manually aborting GW GPU tasks, or see if enough accumulate to stop my fetching more Fermi GPU work before they time out and automatically fail.

 
