Gravitational Wave Engineering run on LIGO O1 Open Data

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1589395724

RAC: 762526

21 Apr 2019 0:49:57 UTC

Message 170837 in response to message 170833

Richie wrote:

Betreger wrote:
Since the GPU app arrived, most work stalls after completing with a "waiting to acquire lock" message. The tasks eventually clear and validate. I don't know if this is a feature or a bug.

Hi! That was a bug in v0.12, which is now deprecated; the current version is v0.13.

Well, v0.13 seems to have that bug on my GTX 1060.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117580009872

RAC: 35190334

21 Apr 2019 6:47:00 UTC

Message 170842 in response to message 170825

DanNeely wrote:

Zalster wrote:

Gary, what about an exclude gpu in the cc_config?

Your attempt to limit the exclusion to GW GPU tasks didn't work. It also showed the GPU as missing on my Fermi tasks, failed over to a backup project, and at some point in the process began aborting the Fermi GPU tasks (I managed to stop BOINC and revert the change before it took out more than 50 or 60 of them).

Dan,
I'm not sure who you were referring to when you said, "Your attempt ...". Your message is responding to Zalster but I suspect it might have been my suggestion that caused your grief. If it was, I'm very sorry for the adverse effects.

I don't use a cc_config.xml on most machines. For the few that I do, it was for the purpose of increasing the daily quota when the fast crunching GPU tasks were on offer a while ago. I've perused all the cc_config options again now and have noted the extra functionality available for the <exclude_gpu> option. Last time I looked at that, (probably a long time ago) I don't remember it being able to select on the basis of the app name. I thought it worked just at the project level. I'm not familiar with that option at all so I could easily be mistaken.

I think Zalster's suggestion should work. I imagine you just want to exclude the GW GPU app so no device_num or type should be needed, just the project URL and the app short name. If the documentation describes things accurately, just add the <exclude_gpu> stuff inside the <options> clause of your current config file or create the file outright if you don't already have one. Here is what I believe you would need if you were starting with a new file and not adding further options. Read the documentation if you're not sure but it seems to me that the following should do what you want.

<cc_config>
    <options>
        <exclude_gpu>
            <url>http://einstein.phys.uwm.edu</url>
            <app>einstein_O1OD1E</app>
        </exclude_gpu>
    </options>
</cc_config>
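
If you want to double-check the file before restarting the client, here is a minimal Python sketch that parses a cc_config.xml and lists the <exclude_gpu> entries it finds under <options>. It is not part of BOINC, and the function name list_gpu_exclusions is made up for illustration. It only confirms that the XML is well-formed and that the exclusion sits where the documentation says it should; it says nothing about how the client will act on it.

```python
import xml.etree.ElementTree as ET

# Same content as the suggested config above, used here as sample input.
SAMPLE = """<cc_config>
    <options>
        <exclude_gpu>
            <url>http://einstein.phys.uwm.edu</url>
            <app>einstein_O1OD1E</app>
        </exclude_gpu>
    </options>
</cc_config>"""


def list_gpu_exclusions(xml_text):
    """Return a list of (url, app) pairs, one per <exclude_gpu> under <options>.

    Hypothetical helper for illustration only. Raises
    xml.etree.ElementTree.ParseError if the text is not well-formed XML,
    and ValueError if the root element is not <cc_config>.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError(f"unexpected root element: {root.tag}")
    exclusions = []
    for node in root.findall("./options/exclude_gpu"):
        url = (node.findtext("url") or "").strip()
        # <app> is optional; None here means no app name was given.
        app = (node.findtext("app") or "").strip() or None
        exclusions.append((url, app))
    return exclusions


if __name__ == "__main__":
    for url, app in list_gpu_exclusions(SAMPLE):
        print(url, app)
```

To run it against a real file, read the file's text and pass it in, with the path adjusted to wherever your BOINC data directory lives.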

Once again, sorry if my suggestions stuffed things up for you. One of the side effects of using an app_config.xml file (as opposed to the current cc_config suggestion) is that you can't remove the influence by just deleting the app_config.xml file. This is because stuff from the file gets permanently inserted into the state file. So if your attempt with app_config has ended with just deleting the file, there might still be issues.

The documentation says you need to reset the project to clean out that inserted stuff. I've found that just editing the file to change all options back to the default values seems to work OK without actually performing a full project reset. I also believe it should be possible to find what has been inserted into the state file and manually remove it while the client is not running. If you show exactly what you were using when the problems occurred, it might be possible to see what would be best to do.

You also asked about sched_request and sched_reply. When your client contacts the scheduler, the full content of the request and the reply are stored as xml files in the BOINC data directory, replacing the previous exchange. You can always grab a copy of these before a subsequent exchange overwrites them :-). The project URL is part of the name so there should be a pair of files for each different project you support, documenting the most recent exchange.
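
For grabbing those copies, something like the following Python sketch might do. It assumes the files match the patterns sched_request_*.xml and sched_reply_*.xml in the data directory (consistent with the project URL being part of the name, but check your own directory). The function name snapshot_scheduler_files and both directory arguments are made up for illustration.

```python
import shutil
import time
from pathlib import Path


def snapshot_scheduler_files(data_dir, backup_dir):
    """Copy the scheduler request/reply files to backup_dir with a timestamp.

    Hypothetical helper for illustration only. data_dir is your BOINC data
    directory; the glob patterns below are an assumption about the file
    naming. Returns the list of copies that were written.
    """
    data_dir = Path(data_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    copied = []
    for pattern in ("sched_request_*.xml", "sched_reply_*.xml"):
        for src in sorted(data_dir.glob(pattern)):
            # Keep the original name and add the timestamp before ".xml".
            dst = backup_dir / f"{src.stem}.{stamp}{src.suffix}"
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied
```

Run it right after a scheduler contact and the copies survive the next exchange, which would otherwise overwrite the originals.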

Cheers,
Gary.

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1589395724

RAC: 762526

21 Apr 2019 15:15:22 UTC

Message 170850

Overnight, this is what I have been doing almost exclusively:

Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 (GW-opencl-nvidia-V1)

h1_0421.75_O1C02Cl2In0__O1OD1E_421.95Hz_1170_0 397449078 2 Apr 2019 15:37:08 UTC 5 Apr 2019 19:26:24 UTC Error while computing
Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 () windows_x86_64

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

21 Apr 2019 16:02:13 UTC

Message 170851 in response to message 170850

Betreger wrote:

Overnight, this is what I have been doing almost exclusively:

Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 (GW-opencl-nvidia-V1)

h1_0421.75_O1C02Cl2In0__O1OD1E_421.95Hz_1170_0 397449078 2 Apr 2019 15:37:08 UTC 5 Apr 2019 19:26:24 UTC Error while computing 0 0 0
Gravitational Wave Engineering run on LIGO O1 Open Data v0.08 () windows_x86_64

That was an old v0.08 task that was buried a long time ago. But I see that host has been struggling with current v0.13 tasks today, and the error looks different for them. For example, this task: https://einsteinathome.org/task/841574537

[ERROR] Couldn't get OpenCL device from BOINC (-1)!

Your host is currently running an Nvidia driver that is almost two years old. It made me wonder if that could be a problem with this GPU application. Maybe a newer driver version would be worth trying.

EDIT: I can't see a GPU on that host now. Did I mix up the two different hosts? Did they both have a GTX 1060?

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1589395724

RAC: 762526

21 Apr 2019 16:12:17 UTC

Message 170852 in response to message 170851

Digging a bit deeper, BOINC says the GTX 1060 is missing. A reboot did not help. The card does show in Device Manager, which thinks it's OK. I have no clue what to do next.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

21 Apr 2019 16:27:58 UTC

Message 170854

I would try these two things:

1. Update BOINC to the newest version (if there's no special reason for currently running that older version).

2. "Clean install" the latest Nvidia driver:

a) Download the driver from the Nvidia web site: https://www.nvidia.com/download/driverResults.aspx/145870/en-us
b) Disconnect the internet connection.
c) Run the driver installer, choose custom installation, and below the component list check the "clean install" box.
d) Reboot after installation.

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1589395724

RAC: 762526

21 Apr 2019 17:05:36 UTC

Message 170855 in response to message 170854

The latest and greatest driver seems to have fixed the problem ATM. Why the old driver ceased working boggles my mind.

Now the task is to process some work and see if it validates.

Thanx

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

21 Apr 2019 19:06:00 UTC

Message 170859 in response to message 170842

Gary Roberts wrote:

DanNeely wrote:
Zalster wrote:

Gary, what about an exclude gpu in the cc_config?

Your attempt to limit the exclusion to GW GPU tasks didn't work. It also showed the GPU as missing on my Fermi tasks, failed over to a backup project, and at some point in the process began aborting the Fermi GPU tasks (I managed to stop BOINC and revert the change before it took out more than 50 or 60 of them).

Dan,
I'm not sure who you were referring to when you said, "Your attempt ...". Your message is responding to Zalster but I suspect it might have been my suggestion that caused your grief. If it was, I'm very sorry for the adverse effects.

I don't use a cc_config.xml on most machines. For the few that I do, it was for the purpose of increasing the daily quota when the fast crunching GPU tasks were on offer a while ago. I've perused all the cc_config options again now and have noted the extra functionality available for the <exclude_gpu> option. Last time I looked at that, (probably a long time ago) I don't remember it being able to select on the basis of the app name. I thought it worked just at the project level. I'm not familiar with that option at all so I could easily be mistaken.

I think Zalster's suggestion should work. I imagine you just want to exclude the GW GPU app so no device_num or type should be needed, just the project URL and the app short name. If the documentation describes things accurately, just add the <exclude_gpu> stuff inside the <options> clause of your current config file or create the file outright if you don't already have one. Here is what I believe you would need if you were starting with a new file and not adding further options. Read the documentation if you're not sure but it seems to me that the following should do what you want.
<cc_config>
    <options>
        <exclude_gpu>
            <url>http://einstein.phys.uwm.edu</url>
            <app>einstein_O1OD1E</app>
        </exclude_gpu>
    </options>
</cc_config>
Once again, sorry if my suggestions stuffed things up for you. One of the side effects of using an app_config.xml file (as opposed to the current cc_config suggestion) is that you can't remove the influence by just deleting the app_config.xml file. This is because stuff from the file gets permanently inserted into the state file. So if your attempt with app_config has ended with just deleting the file, there might still be issues.

The documentation says you need to reset the project to clean out that inserted stuff. I've found that just editing the file to change all options back to the default values seems to work OK without actually performing a full project reset. I also believe it should be possible to find what has been inserted into the state file and manually remove it while the client is not running. If you show exactly what you were using when the problems occurred, it might be possible to see what would be best to do.

You also asked about sched_request and sched_reply. When your client contacts the scheduler, the full content of the request and the reply are stored as xml files in the BOINC data directory, replacing the previous exchange. You can always grab a copy of these before a subsequent exchange overwrites them :-). The project URL is part of the name so there should be a pair of files for each different project you support, documenting the most recent exchange.

Apart from not trying to narrow it further by plan class, that looks the same as the cc_config addition Zalster suggested I try, which disabled the GPU for the Fermi app as well as the GW one.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

21 Apr 2019 22:33:59 UTC

Message 170862

I tried your cut-down version (minus the stray ]), with the same result. BOINC appears to parse the config correctly but then doesn't handle it the right way and turns off the GPU for everything:

https://i.imgur.com/D2MGAeC.png

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

22 Apr 2019 10:39:45 UTC

Message 170863

A few minutes after thinking I'd wrapped up the tests and restored everything to normal, I noticed my Fermi GPU tasks still weren't running, even after I'd reloaded the cc_config file version that should have re-enabled them.

I had to restart BOINC to get them running. Having seen one case where the reload config option wasn't working correctly, I tried inserting Gary's suggested lines again. This time it did work, at least to the extent of running Fermi GPU and Einstein CPU tasks while excluding Einstein GPU tasks. That makes me think there's still something not right about how cc_config updates set BOINC state when reloaded while running versus at startup.

I'm also not convinced the BOINC client correctly understands what work it can request. Shortly after I got the cc_config working with the apps I do and don't want to run, it downloaded a GPU task from one of my 0% share backup projects, and when I aborted it as unneeded the client promptly downloaded a task from a different GPU backup. The event log for an E@H request I manually triggered doesn't look quite right, in that it's not reporting the status of the GPU (my box without the cc_config changes reports "CPU: job cache full; NVIDIA GPU: job cache full").

4/21/2019 6:56:10 PM | Einstein@Home | Sending scheduler request: Requested by user.
4/21/2019 6:56:10 PM | Einstein@Home | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: )
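
To make that oddity easier to spot in a long event log, here is a small Python sketch that splits the "don't need (...)" part of such a line into per-resource statuses. It assumes the line format shown in the two lines above and nothing more. The helper name resource_statuses is my own, not anything from BOINC.

```python
import re


def resource_statuses(line):
    """Map each resource named in a "don't need (...)" line to its status text.

    Hypothetical helper for illustration only; it assumes the event log
    format shown above. Returns an empty dict if the line has no such part.
    """
    match = re.search(r"don't need \((.*)\)\s*$", line)
    if not match:
        return {}
    statuses = {}
    for part in match.group(1).split(";"):
        # Split "CPU: job cache full" into the resource name and its status.
        name, _, status = part.partition(":")
        statuses[name.strip()] = status.strip()
    return statuses


if __name__ == "__main__":
    line = ("4/21/2019 6:56:10 PM | Einstein@Home | Not requesting tasks: "
            "don't need (CPU: job cache full; NVIDIA GPU: )")
    blank = [name for name, status in resource_statuses(line).items() if not status]
    print(blank)  # prints ['NVIDIA GPU']
```

A resource that comes back with an empty status is the case I'm seeing here for the NVIDIA GPU.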

I uploaded the sched request/responses for that update. They're large enough that I don't want to do a blind search through them, but if you or anyone else has an idea of what to look for specifically they're both on pastebin.

request:

https://pastebin.com/39Xh5Cm9

response:

https://pastebin.com/vLAbJ6GJ

I'll be keeping an eye on my system for the next day or so. If it continues to not download any new Fermi tasks, I'll have to revert the cc_config changes and either resume manually aborting GW GPU tasks, or see if enough accumulate to stop my fetching more Fermi GPU work before they time out and automatically fail.
