Computation Error O3MDF on Nvidia Windows 11

Colin Haig
Joined: 7 Mar 05
Posts: 7
Credit: 547555773
RAC: 2422799
Topic 228748

Multi-Directional Gravitational Wave search on O3 (GPU) v1.02 () windows_x86_64

I'm seeing computation errors (I think floating-point exceptions, exit code 0xc0000090) with this app on a Windows 11 machine with NVIDIA GPUs (a 980 Ti and a 980, which have the same compute capability). I've tried changing NVIDIA driver versions, but it made no difference.

e.g. https://einsteinathome.org/task/1394004030

I realize people are on holidays, but thought I'd ask in case anyone else is seeing this.

TPCBF
Joined: 24 Nov 12
Posts: 17
Credit: 188534230
RAC: 1350275

I got a bunch of those this morning on a Windows 10 host running an NVIDIA RTX A2000. My only other GPU-enabled host at the moment is my programming laptop, which has a GTX 1060; it received four of those tasks a few minutes ago, but they have not started yet.
Binary Radio Pulsar Search (MeerKAT) runs just fine on the same machine. I have noticed, though, that not all of the WUs I receive and return show up under my account here, but that's a story for another thread...

TPCBF
Joined: 24 Nov 12
Posts: 17
Credit: 188534230
RAC: 1350275

Well, after the last post I got a lot more of those GPU WUs, and all of them terminate with a computation error (after anywhere from a couple of seconds to a couple of minutes). Now, for the second time within an hour, the whole system has blue-screened and rebooted... :(

mikey
Joined: 22 Jan 05
Posts: 12640
Credit: 1839026224
RAC: 5478

TPCBF wrote:

Well, after the last post I got a lot more of those GPU WUs, and all of them terminate with a computation error (after anywhere from a couple of seconds to a couple of minutes). Now, for the second time within an hour, the whole system has blue-screened and rebooted... :(

If you don't already use it, get MSI Afterburner; it works with both AMD and NVIDIA GPUs. Check how much heat those O3 tasks generate on that laptop, then scroll down and check the CPU temps as well. You could be seeing problems due to overheating.
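
A scriptable alternative to MSI Afterburner on NVIDIA hosts is nvidia-smi, which is installed with the NVIDIA driver. The sketch below is only an illustration: it assumes nvidia-smi is on the PATH, and the helper names parse_gpu_rows and read_gpus are made up for this example. It reads GPU temperature and load only and does not cover CPU temperatures.

```python
import subprocess

# Assumption: nvidia-smi is reachable on PATH (it ships with the NVIDIA driver).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def _to_int(text):
    """Convert a field to int; return None for values such as [N/A]."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output into a list of dicts, one per GPU."""
    rows = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) < 4:
            continue  # skip blank or unexpected lines
        rows.append({
            "index": _to_int(parts[0]),
            "name": ", ".join(parts[1:-2]),  # tolerate a comma in the name
            "temp_c": _to_int(parts[-2]),
            "util_pct": _to_int(parts[-1]),
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)
```

Calling read_gpus() every few seconds while an O3 task starts up would show whether the temperature climbs before the task errors out.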

TPCBF
Joined: 24 Nov 12
Posts: 17
Credit: 188534230
RAC: 1350275

mikey wrote:

TPCBF wrote:

Well, after the last post I got a lot more of those GPU WUs, and all of them terminate with a computation error (after anywhere from a couple of seconds to a couple of minutes). Now, for the second time within an hour, the whole system has blue-screened and rebooted... :(

If you don't already use it, get MSI Afterburner; it works with both AMD and NVIDIA GPUs. Check how much heat those O3 tasks generate on that laptop, then scroll down and check the CPU temps as well. You could be seeing problems due to overheating.

No, I don't run that program (yet), but I doubt that it is a heat-related issue. In one batch that then led to a blue screen, the WUs went from 0 to 90% in a couple of seconds, then showed a computation error a couple of seconds later, before the machine even had a chance to run hot. It also runs other GPU tasks just fine (like the OPNG ones from WCG).
Also, when the latest batch came in last night, I suspended all the NVIDIA tasks (the Intel GPU tasks ran just fine all this time, and so did the CPU tasks), then resumed them manually one by one until I went to bed, and again this morning when I got back to my desk. Those WUs run and finish just fine (I just still don't get any credit for them)...

mikey
Joined: 22 Jan 05
Posts: 12640
Credit: 1839026224
RAC: 5478

TPCBF wrote:

mikey wrote:

TPCBF wrote:

Well, after the last post I got a lot more of those GPU WUs, and all of them terminate with a computation error (after anywhere from a couple of seconds to a couple of minutes). Now, for the second time within an hour, the whole system has blue-screened and rebooted... :(

If you don't already use it, get MSI Afterburner; it works with both AMD and NVIDIA GPUs. Check how much heat those O3 tasks generate on that laptop, then scroll down and check the CPU temps as well. You could be seeing problems due to overheating.

No, I don't run that program (yet), but I doubt that it is a heat-related issue. In one batch that then led to a blue screen, the WUs went from 0 to 90% in a couple of seconds, then showed a computation error a couple of seconds later, before the machine even had a chance to run hot. It also runs other GPU tasks just fine (like the OPNG ones from WCG).
Also, when the latest batch came in last night, I suspended all the NVIDIA tasks (the Intel GPU tasks ran just fine all this time, and so did the CPU tasks), then resumed them manually one by one until I went to bed, and again this morning when I got back to my desk. Those WUs run and finish just fine (I just still don't get any credit for them)...

So you are running tasks on your CPU, your GPU AND the GPU built into the CPU, all at the same time? AND all of that on a laptop as well?

rbpeake
Joined: 18 Jan 05
Posts: 266
Credit: 1118007797
RAC: 742575

I started seeing these errors on a machine that had been validating successfully, so something changed. These work units show multiple "Error while computing" results.

Here is one of these units: Workunit 693628022 (einsteinathome.org)

Here is another from my other computer: Workunit 694423029 (einsteinathome.org)

Thank you.

Colin Haig
Joined: 7 Mar 05
Posts: 7
Credit: 547555773
RAC: 2422799

I enabled some extra logging, and I am also seeing an error with the AMD/ATI version of the task.

Log snippet here:

20-Dec-2022 09:06:13 [Einstein@Home] Output file h1_0432.40_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_776_3_0 for task h1_0432.40_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_776_3 absent
20-Dec-2022 09:06:13 [Einstein@Home] [task] result state=COMPUTE_ERROR for h1_0432.40_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_776_3 from CS::app_finished
20-Dec-2022 09:06:13 [Einstein@Home] [coproc] NVIDIA instance 0; 1.000000 pending for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3
20-Dec-2022 09:06:13 [Einstein@Home] [coproc] ATI instance 0; 1.000000 pending for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3
20-Dec-2022 09:06:13 [Einstein@Home] [coproc] NVIDIA instance 1: confirming 1.000000 instance for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3
20-Dec-2022 09:06:13 [Einstein@Home] [coproc] ATI instance 0: confirming 1.000000 instance for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3
20-Dec-2022 09:06:13 [Einstein@Home] [coproc] Assigning NVIDIA instance 0 to h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2936_3
20-Dec-2022 09:06:14 [Einstein@Home] [task_debug] task is running in processor group 0
20-Dec-2022 09:06:14 [Einstein@Home] [task] task_state=EXECUTING for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2936_3 from start
20-Dec-2022 09:06:14 [Einstein@Home] Starting task h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2936_3
20-Dec-2022 09:06:15 [Einstein@Home] [task] Process for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3 exited, exit code 3221225616, task state 1
20-Dec-2022 09:06:15 [Einstein@Home] [task] task_state=EXITED for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3 from handle_exited_app
20-Dec-2022 09:06:15 [Einstein@Home] [task] result state=COMPUTE_ERROR for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3 from CS::report_result_error
20-Dec-2022 09:06:15 [Einstein@Home] [task] Process for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3 exited
20-Dec-2022 09:06:15 [Einstein@Home] [task] exit code -1073741680 (0xc0000090): (unknown error)
20-Dec-2022 09:06:15 [Einstein@Home] Finished download of p2030.20200619.G31.86+04.57.S.b1s0g0.00000.zap
20-Dec-2022 09:06:15 [Einstein@Home] Finished download of p2030.20200619.G32.32+03.17.N.b0s0g0.00000_2072.bin4
20-Dec-2022 09:06:15 [Einstein@Home] Started download of p2030.20200619.G32.32+03.17.N.b0s0g0.00000_2073.bin4
20-Dec-2022 09:06:15 [Einstein@Home] Started download of p2030.20200619.G32.32+03.17.N.b0s0g0.00000_2074.bin4
20-Dec-2022 09:06:15 [Einstein@Home] Computation for task h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3 finished
20-Dec-2022 09:06:15 [Einstein@Home] Output file h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3_0 for task h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3 absent
20-Dec-2022 09:06:15 [Einstein@Home] [task] result state=COMPUTE_ERROR for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2912_3 from CS::app_finished
20-Dec-2022 09:06:15 [Einstein@Home] [task] result state=FILES_DOWNLOADED for p2030.20200619.G31.86+04.57.S.b1s0g0.00000_3600_1 from CS::update_results
20-Dec-2022 09:06:15 [Einstein@Home] [coproc] NVIDIA instance 0; 1.000000 pending for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2936_3
20-Dec-2022 09:06:15 [Einstein@Home] [coproc] ATI instance 0; 1.000000 pending for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3
20-Dec-2022 09:06:15 [Einstein@Home] [coproc] NVIDIA instance 0: confirming 1.000000 instance for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2936_3
20-Dec-2022 09:06:15 [Einstein@Home] [coproc] ATI instance 0: confirming 1.000000 instance for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3
20-Dec-2022 09:06:15 [Einstein@Home] [coproc] Assigning NVIDIA instance 1 to h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2931_3
20-Dec-2022 09:06:16 [Einstein@Home] [task_debug] task is running in processor group 0
20-Dec-2022 09:06:16 [Einstein@Home] [task] task_state=EXECUTING for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2931_3 from start
20-Dec-2022 09:06:16 [Einstein@Home] Starting task h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2931_3
20-Dec-2022 09:06:17 [Einstein@Home] [task_debug] task is running in processor group 0
20-Dec-2022 09:06:17 [Einstein@Home] [task] task_state=EXECUTING for p2030.20200619.G31.86+04.57.S.b1s0g0.00000_3600_1 from start
20-Dec-2022 09:06:17 [Einstein@Home] Starting task p2030.20200619.G31.86+04.57.S.b1s0g0.00000_3600_1
20-Dec-2022 09:06:19 [Einstein@Home] Finished download of p2030.20200619.G32.32+03.17.N.b0s0g0.00000_2073.bin4
20-Dec-2022 09:06:19 [Einstein@Home] Finished download of p2030.20200619.G32.32+03.17.N.b0s0g0.00000_2074.bin4
20-Dec-2022 09:06:19 [Einstein@Home] Started download of p2030.20200619.G32.32+03.17.N.b0s0g0.00000_2075.bin4
20-Dec-2022 09:06:19 [Einstein@Home] Started download of p2030.20200619.G32.32+03.17.N.b0s0g0.00000_2076.bin4
20-Dec-2022 09:06:20 [Einstein@Home] [task] Process for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3 exited, exit code 1057, task state 1
20-Dec-2022 09:06:20 [Einstein@Home] [task] task_state=EXITED for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3 from handle_exited_app
20-Dec-2022 09:06:20 [Einstein@Home] [task] result state=COMPUTE_ERROR for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3 from CS::report_result_error
20-Dec-2022 09:06:20 [Einstein@Home] [task] Process for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3 exited
20-Dec-2022 09:06:20 [Einstein@Home] [task] exit code 1057 (0x421): The account name is invalid or does not exist, or the password is invalid for the account name specified.
 (0x421)
20-Dec-2022 09:06:21 [Einstein@Home] Computation for task h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3 finished
20-Dec-2022 09:06:21 [Einstein@Home] Output file h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3_0 for task h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3 absent
20-Dec-2022 09:06:21 [Einstein@Home] [task] result state=COMPUTE_ERROR for h1_0432.20_O3aC01Cl1In0__O3MDFG1_G34731_432.50Hz_2979_3 from CS::app_finished
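
With task_debug enabled, the exit codes can be pulled out of a longer log with a few lines of Python. This is a minimal sketch that assumes the line format shown in the snippet above; the function names exit_codes and tally are made up for this example.

```python
import re
from collections import Counter

# Matches "[task] Process for <task> exited, exit code <n>, ..." lines
# in the format the client log above uses.
EXIT_RE = re.compile(
    r"\[task\] Process for (?P<task>\S+) exited, exit code (?P<code>-?\d+)"
)


def exit_codes(log_text):
    """Return (task_name, exit_code) pairs found in the log text."""
    return [(m.group("task"), int(m.group("code")))
            for m in EXIT_RE.finditer(log_text)]


def tally(log_text):
    """Count occurrences of each exit code, shown as 32-bit hex."""
    return Counter(f"0x{code & 0xFFFFFFFF:08x}"
                   for _, code in exit_codes(log_text))
```

Run over the snippet above, this gives one task exiting with 0xc0000090 and one with 0x00000421, which makes it easy to see which failure dominates on a host that runs both NVIDIA and AMD/ATI tasks.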

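The NVIDIA version exits with 0xc0000090, the Windows status code STATUS_FLOAT_INVALID_OPERATION (a floating-point exception), which the client labels as an unknown error: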

20-Dec-2022 09:06:15 [Einstein@Home] [task] exit code -1073741680 (0xc0000090): (unknown error)
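
The log shows the same exit code in three forms: 3221225616 (unsigned), -1073741680 (signed 32-bit) and 0xc0000090 (hex). The sketch below converts between them and looks the value up in the floating-point NTSTATUS range. The function name describe_exit_code is made up for this example, and the table only covers the floating-point codes, not every NTSTATUS value.

```python
# Floating-point exception codes from the Windows NTSTATUS list.
FLOAT_STATUS = {
    0xC000008D: "STATUS_FLOAT_DENORMAL_OPERAND",
    0xC000008E: "STATUS_FLOAT_DIVIDE_BY_ZERO",
    0xC000008F: "STATUS_FLOAT_INEXACT_RESULT",
    0xC0000090: "STATUS_FLOAT_INVALID_OPERATION",
    0xC0000091: "STATUS_FLOAT_OVERFLOW",
    0xC0000092: "STATUS_FLOAT_STACK_CHECK",
    0xC0000093: "STATUS_FLOAT_UNDERFLOW",
}


def describe_exit_code(code):
    """Accept a signed or unsigned 32-bit exit code and name it if known."""
    unsigned = code & 0xFFFFFFFF  # fold a negative value into 0..2**32-1
    name = FLOAT_STATUS.get(unsigned, "not a floating-point NTSTATUS")
    return f"0x{unsigned:08X} {name}"


print(describe_exit_code(-1073741680))  # signed form from the log
print(describe_exit_code(3221225616))   # unsigned form from the log
```

Both calls print 0xC0000090 STATUS_FLOAT_INVALID_OPERATION, so the signed and unsigned values in the log are one and the same error.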

The AMD/ATI version exits with code 1057 (0x421), which the client reports as an account/password error:

20-Dec-2022 09:06:20 [Einstein@Home] [task] exit code 1057 (0x421): The account name is invalid or does not exist, or the password is invalid for the account name specified.
 (0x421)

I am happy to run specific tests or provide more data.

Best regards

Colin

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3927
Credit: 45682462642
RAC: 63982393

The project has acknowledged an issue with the Windows application, but they won't be able to address it until early next year.

https://einsteinathome.org/content/multi-directional-gravitational-wave-search-o3-data-o3md1f?page=4#comment-205427

Oliver Behnke wrote:

We are aware of an issue that can affect the Windows GPU app right now. We'll look into it ASAP but it'll take until the first week of January, unfortunately (see above). We'll update this thread as soon as we think we've resolved the issue. Until then it's of course perfectly fine to opt out of the app for the time being.

Sorry for the hassle, sometimes these bugs only manifest themselves when launching the apps full-scale, despite all beta testing we do.

Colin Haig
Joined: 7 Mar 05
Posts: 7
Credit: 547555773
RAC: 2422799

Thanks for confirming that other folks have the issue too.

I've modified my cc_config.xml to disable the O3MDF tasks, and everything else is running fine.

<cc_config>
    <log_flags>
        <coproc_debug>1</coproc_debug>
        <task_debug>1</task_debug>
    </log_flags>
    <options>
        <use_all_gpus>1</use_all_gpus>
        <exclude_gpu>
            <url>https://einstein.phys.uwm.edu/</url>
            <app>einstein_O3MDF</app>
        </exclude_gpu>
    </options>
</cc_config>
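
To double-check that an exclusion like the one above is well-formed and attached to the intended project, a small script can read the file back. This is a sketch under assumptions: the function name excluded_apps is made up, and it only checks the XML structure. Whether the url and app values match what the client expects for this project is not verified here. Note also that the client should only pick up changes to cc_config.xml after a restart or after being told to re-read its config files.

```python
import xml.etree.ElementTree as ET


def excluded_apps(cc_config_xml, project_url):
    """Return the <app> names in <exclude_gpu> blocks for a project URL.

    A block without an <app> element is reported as "*"; that marker is
    this sketch's own convention, not something the client uses.
    """
    root = ET.fromstring(cc_config_xml)
    apps = []
    for block in root.findall("./options/exclude_gpu"):
        url = (block.findtext("url") or "").strip()
        # Compare without the trailing slash so both spellings match.
        if url.rstrip("/") == project_url.strip().rstrip("/"):
            apps.append((block.findtext("app") or "*").strip())
    return apps
```

For the file above, excluded_apps(text, "https://einstein.phys.uwm.edu/") returns a list containing only einstein_O3MDF. An empty list would mean the exclusion is missing or filed under a different URL.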
