Rejoin and everything bombs!

James L. Neill
Joined: 14 Dec 10
Posts: 13
Credit: 141558696
RAC: 0
Topic 222854

Computer 12492607

Work unit example:-

Name: LATeah1064L12_452.0_0_0.0_20332046_0
Workunit ID: 463702827
Created: 16 Jun 2020 15:17:50 UTC
Sent: 16 Jun 2020 15:23:24 UTC
Report deadline: 30 Jun 2020 15:23:24 UTC
Received: 16 Jun 2020 15:32:22 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 11 (0x0000000B) Unknown error code
Computer: 12492607

All my units after rejoining are failing and I do not know where to start looking. I suspect that the problem revolves around this line:-

read_checkpoint(): Couldn't open file 'LATeah1064L12_452.0_0_0.0_20332046_0_0.out.cpt': No such file or directory (2)

Any help in pointing my nose in the right direction would be much appreciated.

James

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Hi and welcome back!

The line about not being able to read the checkpoint is normal when starting a new task from scratch.

I looked at a few of your failed tasks, and the Gravitational Wave tasks log this message:

OpenCL is not available with these features!

The problem seems to be with your graphics drivers and OpenCL support.
Try downloading and installing new drivers from AMD's website.
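
For anyone triaging similar failures, here is a minimal Python sketch of the distinction made above: the checkpoint message is treated as harmless and the OpenCL message as the real failure. The two patterns are taken from this thread only; the function name `triage` and the benign/fatal split are my own illustration, not an official list of Einstein@Home messages.

```python
import re

# Assumption: patterns come from the messages quoted in this thread only.
BENIGN = [re.compile(r"read_checkpoint\(\): Couldn't open file")]
FATAL = [re.compile(r"OpenCL is not available with these features")]


def triage(stderr_text):
    """Return (fatal_lines, benign_lines) found in a task's stderr text."""
    fatal, benign = [], []
    for line in stderr_text.splitlines():
        if any(p.search(line) for p in FATAL):
            fatal.append(line.strip())
        elif any(p.search(line) for p in BENIGN):
            benign.append(line.strip())
    return fatal, benign


# Sample text built from the two messages quoted in this thread.
sample = (
    "read_checkpoint(): Couldn't open file "
    "'LATeah1064L12_452.0_0_0.0_20332046_0_0.out.cpt': "
    "No such file or directory (2)\n"
    "OpenCL is not available with these features!\n"
)

fatal, benign = triage(sample)
print("fatal:", fatal)
print("benign (normal for a fresh task):", benign)
```

Paste the stderr output from a failed task page into `sample` to see which of the two categories its lines fall into.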

James L. Neill
Joined: 14 Dec 10
Posts: 13
Credit: 141558696
RAC: 0

Thank you Holmis.

I will try that and then wait until I can get work again. I had not thought of that going wrong. I should add a caveat that I run Milkyway@home too and I have no problem with GPU tasks. Time to start digging and once again thank you for your assistance.

James

James L. Neill
Joined: 14 Dec 10
Posts: 13
Credit: 141558696
RAC: 0

Good morning Holmis.

I reinstalled the video drivers and sadly that did not work. Oddly, when I got work, BOINC became sluggish and would not respond when I tried to stop Einstein.

I have since changed the settings from two tasks per GPU to one. Before I forget: the error message was the same as last time.

My startup info may help shed some light on the matter:-

Wed 17 Jun 2020 16:40:24 BST |  | Starting BOINC client version 7.16.6 for x86_64-pc-linux-gnu
Wed 17 Jun 2020 16:40:24 BST |  | log flags: file_xfer, sched_ops, task
Wed 17 Jun 2020 16:40:24 BST |  | Libraries: libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Wed 17 Jun 2020 16:40:24 BST |  | Data directory: /var/lib/boinc-client
Wed 17 Jun 2020 16:40:43 BST |  | OpenCL: AMD/ATI GPU 0: AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.37.0, 5.4.0-37 (driver version 20.0.4, device version OpenCL 1.1 Mesa 20.0.4, 26214MB, 26214MB available, 3773 GFLOPS peak)
Wed 17 Jun 2020 16:40:43 BST |  | OpenCL: AMD/ATI GPU 1: AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.37.0, 5.4.0-37 (driver version 20.0.4, device version OpenCL 1.1 Mesa 20.0.4, 26214MB, 26214MB available, 3773 GFLOPS peak)
Wed 17 Jun 2020 16:40:43 BST |  | OpenCL: AMD/ATI GPU 2: AMD Radeon (TM) RX 480 Graphics (driver version 3110.6, device version OpenCL 1.2 AMD-APP (3110.6), 8147MB, 8147MB available, 6036 GFLOPS peak)
Wed 17 Jun 2020 16:40:43 BST |  | OpenCL: AMD/ATI GPU 3: AMD Radeon (TM) RX 480 Graphics (driver version 3110.6, device version OpenCL 1.2 AMD-APP (3110.6), 8152MB, 8152MB available, 6036 GFLOPS peak)
Wed 17 Jun 2020 16:40:43 BST |  | libc: Ubuntu GLIBC 2.31-0ubuntu9 version 2.31
Wed 17 Jun 2020 16:40:43 BST |  | Host name: orac002
Wed 17 Jun 2020 16:40:43 BST |  | Processor: 8 AuthenticAMD AMD FX(tm)-8320 Eight-Core Processor [Family 21 Model 2 Stepping 0]
Wed 17 Jun 2020 16:40:43 BST |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate ssbd ibpb vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
Wed 17 Jun 2020 16:40:43 BST |  | OS: Linux Ubuntu: Ubuntu 20.04 LTS [5.4.0-37-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9)]
Wed 17 Jun 2020 16:40:43 BST |  | Memory: 31.28 GB physical, 122.07 GB virtual
Wed 17 Jun 2020 16:40:43 BST |  | Disk: 103.13 GB total, 82.68 GB free
Wed 17 Jun 2020 16:40:43 BST |  | Local time is UTC +1 hours
Wed 17 Jun 2020 16:40:43 BST |  | Config: GUI RPCs allowed from:
Wed 17 Jun 2020 16:40:43 BST |  | Config: use all coprocessors
Wed 17 Jun 2020 16:40:43 BST | Einstein@Home | General prefs: from Einstein@Home (last modified 15-Jun-2020 22:45:07)
Wed 17 Jun 2020 16:40:43 BST | Einstein@Home | Host location: none
Wed 17 Jun 2020 16:40:43 BST | Einstein@Home | General prefs: using your defaults
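
A quick way to read the OpenCL lines in a startup log like the one above is to pull out the device index, the OpenCL version and the platform name for each entry. The sketch below is only an illustration: the regular expression is fitted to the lines in this post, and the function name `opencl_devices` is hypothetical, not part of BOINC.

```python
import re

# Assumption: pattern fitted to the "OpenCL:" lines shown in this thread.
GPU_LINE = re.compile(
    r"GPU (\d+)([^:]*): .*device version OpenCL (\d+\.\d+) ([^,(]+)"
)


def opencl_devices(log_text):
    """Return one dict per OpenCL GPU line found in a BOINC startup log."""
    devices = []
    for line in log_text.splitlines():
        m = GPU_LINE.search(line)
        if m:
            devices.append({
                "index": int(m.group(1)),
                "ignored": "ignored by config" in m.group(2),
                "version": m.group(3),
                "platform": m.group(4).strip(),
            })
    return devices


# Two lines copied from the startup log in this post.
log = (
    "Wed 17 Jun 2020 16:40:43 BST |  | OpenCL: AMD/ATI GPU 0: AMD Radeon (TM) "
    "RX 480 Graphics (POLARIS10, DRM 3.37.0, 5.4.0-37 (driver version 20.0.4, "
    "device version OpenCL 1.1 Mesa 20.0.4, 26214MB, 26214MB available, "
    "3773 GFLOPS peak)\n"
    "Wed 17 Jun 2020 16:40:43 BST |  | OpenCL: AMD/ATI GPU 2: AMD Radeon (TM) "
    "RX 480 Graphics (driver version 3110.6, device version OpenCL 1.2 "
    "AMD-APP (3110.6), 8147MB, 8147MB available, 6036 GFLOPS peak)\n"
)

for dev in opencl_devices(log):
    print(dev["index"], dev["version"], dev["platform"])
```

Seeing both a Mesa platform and an AMD-APP platform in the output suggests two OpenCL implementations are registered at once.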

Back to waiting for more work!

James

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139002861
RAC: 0

I don’t think Mesa is suitable for Einstein. Try the AMDGPU OpenCL.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

I've got no experience with GPU crunching under Linux as I'm a Windows user. But I seem to remember others having problems with Mesa drivers here at Einstein. I'd follow Mark's advice and try the other driver type.
If you need help then I'm sure some of our resident Linux users will chime in and help get you going.

cecht
Joined: 7 Mar 18
Posts: 1537
Credit: 2914408654
RAC: 2130283

It's not clear which video drivers you reinstalled, but you will need to run:
~$ ./amdgpu-install -y --opencl=legacy --headless
from within the directory of the newly released amdgpu-pro-20.20-1089974-ubuntu-20.04 package (downloaded from https://www.amd.com/en/support/graphics/radeon-400-series/radeon-rx-400-series/radeon-rx-480). Read the included install documentation to know what's what.

The Mesa video drivers are fine, but you need AMD's OpenCL 1.2 for crunching on the RX 480s. The earlier AMD Radeon driver stack didn't play well with Ubuntu 20.04, but the recent AMD update is supposed to have fixed the issues. (I'm still on 18.04, so I haven't tried the latest AMD offerings.)

If you previously installed any AMD Radeon drivers, you will need to run amdgpu-pro-uninstall before installing the OpenCL drivers. (This is true for any AMD Radeon driver installation or reinstallation.) There is no need to reboot after the uninstall; just reboot after the --opencl install.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

James L. Neill
Joined: 14 Dec 10
Posts: 13
Credit: 141558696
RAC: 0

I have just seen everybody's new posts and I will try this in the morning. Thank you all!

James

James L. Neill
Joined: 14 Dec 10
Posts: 13
Credit: 141558696
RAC: 0

I had not tried a headless install. I have now done as you suggested and removed this line from cc_config.xml:-

<use_all_gpus>1</use_all_gpus>

That move disabled the "mesa" GPUs!

Fri 19 Jun 2020 12:08:40 BST |  | Data directory: /var/lib/boinc-client
Fri 19 Jun 2020 12:08:40 BST |  | OpenCL: AMD/ATI GPU 0 (ignored by config): AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.37.0, 5.4.0-37 (driver version 20.0.4, device version OpenCL 1.1 Mesa 20.0.4, 26214MB, 26214MB available, 3773 GFLOPS peak)
Fri 19 Jun 2020 12:08:40 BST |  | OpenCL: AMD/ATI GPU 1 (ignored by config): AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.37.0, 5.4.0-37 (driver version 20.0.4, device version OpenCL 1.1 Mesa 20.0.4, 26214MB, 26214MB available, 3773 GFLOPS peak)
Fri 19 Jun 2020 12:08:40 BST |  | OpenCL: AMD/ATI GPU 2: AMD Radeon (TM) RX 480 Graphics (driver version 3110.6, device version OpenCL 1.2 AMD-APP (3110.6), 7786MB, 7786MB available, 6036 GFLOPS peak)
Fri 19 Jun 2020 12:08:40 BST |  | OpenCL: AMD/ATI GPU 3: AMD Radeon (TM) RX 480 Graphics (driver version 3110.6, device version OpenCL 1.2 AMD-APP (3110.6), 8184MB, 8184MB available, 6036 GFLOPS peak)
Fri 19 Jun 2020 12:08:40 BST |  | libc: Ubuntu GLIBC 2.31-0ubuntu9 version 2.31
Fri 19 Jun 2020 12:08:40 BST |  | Host name: orac002
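
To check whether the option is still set without opening the file by hand, something like the following Python sketch could be used. It assumes the usual cc_config.xml layout with the flag inside an `<options>` element under `<cc_config>`; the thread itself shows only the single line, so the wrapper elements in the samples and the helper name `use_all_gpus` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def use_all_gpus(cc_config_xml):
    """Return True if <use_all_gpus> is set to 1 in the given XML text."""
    root = ET.fromstring(cc_config_xml)
    # Assumption: the flag lives at cc_config/options/use_all_gpus.
    node = root.find("./options/use_all_gpus")
    return node is not None and (node.text or "").strip() == "1"


# Hypothetical minimal files: with the line present, and with it removed.
before = (
    "<cc_config><options>"
    "<use_all_gpus>1</use_all_gpus>"
    "</options></cc_config>"
)
after = "<cc_config><options></options></cc_config>"

print("before:", use_all_gpus(before))
print("after:", use_all_gpus(after))
```

On this host the real file would be read from the data directory shown in the log (/var/lib/boinc-client) and its text passed to the function.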

Now I must wait 12 hours for more work!

Once again thank you for your help.

James

James L. Neill
Joined: 14 Dec 10
Posts: 13
Credit: 141558696
RAC: 0

Oh dear, it definitely does not work.

Below is, I think, the relevant error text:-

LAL Error - MAIN (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/LIBC215/TARGET/linux-x86_64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:869): OpenCL is not available with these features!
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/LIBC215/TARGET/linux-x86_64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:869): Internal function call failed
2020-06-19 15:26:31.2462 (22630) [CRITICAL]: ERROR: MAIN() returned with error '1'
2020-06-19 15:26:31.2462 (22630) [debug]: resultfile '../../projects/einstein.phys.uwm.edu/h1_1611.80_O2C02Cl4In0__O2MDFV2i_VelaJr1_1612.70Hz_453_1_0' (len 95), current config file: 0

computer: https://einsteinathome.org/host/12492607

Back to the drawing board!

James L. Neill
Joined: 14 Dec 10
Posts: 13
Credit: 141558696
RAC: 0

Hooray!

I managed to solve the problem and am now happily crunching Einstein work!

Solution:-

******** It looked a bit funny that BOINC recognised 4 GPUs. ********

1). Shut down the projects: set no new tasks and suspend.

2). Shut down the BOINC service.

3). Uninstall the amdgpu-pro installation.

4). Reboot.

5). Run sudo apt purge mesa-opencl-icd ********

6). Reinstall (full) amdgpu-pro and reboot.

7). Restart BOINC.

8). Restart projects one at a time.

9). BULLSEYE!

I do find it interesting that Milkyway could use the Mesa drivers while Einstein could not. I suspect that any pre-existing OpenCL package should be removed before installing the amdgpu-pro stack.
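
For anyone wanting to check for a leftover OpenCL package before reinstalling, here is a rough Python sketch that lists the registered OpenCL ICD files and flags any that look like Mesa. It assumes the common Linux ICD loader convention of one `*.icd` file per implementation under /etc/OpenCL/vendors, each holding a library name. I did not verify that directory or the file names on my machine, so treat the path, the sample entries and the helper names as assumptions.

```python
from pathlib import Path


def registered_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Map each *.icd file name to the library name written inside it."""
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return {}
    return {
        path.name: path.read_text().strip()
        for path in sorted(directory.glob("*.icd"))
    }


def mesa_entries(icds):
    """Return the ICD file names that appear to belong to Mesa."""
    return [
        name for name, lib in icds.items()
        if "mesa" in name.lower() or "mesa" in lib.lower()
    ]


if __name__ == "__main__":
    icds = registered_icds()
    for name, lib in icds.items():
        print(name, "->", lib)
    leftovers = mesa_entries(icds)
    if leftovers:
        print("Mesa OpenCL still registered:", ", ".join(leftovers))
```

If a Mesa entry shows up next to the AMD one, that would match the doubled GPU list I saw in BOINC before the purge.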

Finally, thank you cecht, Holmis and MarkJ for helping me along the way.

James

PS: I have just completed my first successful Einstein unit this very second.
