My WU completion rate has been pretty consistent for the last couple of years, but I upgraded the OS to F31 and E@H immediately started failing. All outputs look the same, failing in LLVM-9. Can someone interpret this for me?
I found a similar thread from three weeks ago that mentioned a LIBC215 option. I hadn't heard of that, but I tried toggling it, updating, and fetching new WUs, with no joy. I don't have output from the latest failed WU, but this is one from before I tried LIBC215:
<core_client_version>7.14.2</core_client_version> <![CDATA[ <message> process exited with code 11 (0xb, -245)</message> <stderr_txt> 17:16:23 (4170162): [normal]: This Einstein@home App was built at: Jan 16 2017 08:09:16
17:16:23 (4170162): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati'.
17:16:23 (4170162): [debug]: 1e+16 fp, 7.5e+09 fp/s, 1401846 s, 389h24m05s78
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1062L12.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.526880e-07 --ldiBins 30 --f0start 340.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah1062L12_0348_37827783.dat --debug 1 --device 0 -o LATeah1062L12_348.0_0_0.0_37827783_0_0.out
output files: 'LATeah1062L12_348.0_0_0.0_37827783_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1062L12_348.0_0_0.0_37827783_0_0' 'LATeah1062L12_348.0_0_0.0_37827783_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1062L12_348.0_0_0.0_37827783_0_1'
17:16:23 (4170162): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
17:16:23 (4170162): [debug]: glibc version/release: 2.30/stable
17:16:23 (4170162): [debug]: Set up communication with graphics process.
-- signal handler called: signal 1
3 stack frames obtained for this thread:
Frame 3:
Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati (0x48b101)
Source file: hs_boinc_extras.c (Function: sighandler / Line: 291)
Frame 2:
Binary file: /lib64/libLLVM-9.so (0x7ff1bf0beca3)
Offset info: +0x3785ca3
Frame 1:
Binary file: /lib64/libLLVM-9.so (0x7ff1bf0beca3)
Offset info: +0x3785ca3
End of stcaktrace
17:16:23 (4170162): called boinc_finish
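For anyone reading a trace like this one, a quick way to see where the crash happened is to count which binaries show up in the stack frames. Below is a minimal sketch in Python, assuming the stderr text keeps the "Binary file: PATH (ADDRESS)" shape shown above. The function name crashing_binaries is my own and is not part of BOINC or the Einstein@Home app.

```python
import re
from collections import Counter

# Matches frame lines of the form seen in the trace above, e.g.
#   Binary file: /lib64/libLLVM-9.so (0x7ff1bf0beca3)
FRAME_RE = re.compile(r"Binary file:\s+(\S+)\s+\((0x[0-9a-fA-F]+)\)")


def crashing_binaries(stderr_text):
    """Count how often each binary appears in the reported stack frames."""
    return Counter(m.group(1) for m in FRAME_RE.finditer(stderr_text))


# Frame lines copied from the failed task's output above.
sample = """\
Frame 3:
Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati (0x48b101)
Frame 2:
Binary file: /lib64/libLLVM-9.so (0x7ff1bf0beca3)
Frame 1:
Binary file: /lib64/libLLVM-9.so (0x7ff1bf0beca3)
"""

# Print the binaries, most frequent first.
for path, count in crashing_binaries(sample).most_common():
    print(count, path)
```

Here libLLVM-9.so accounts for two of the three frames, and the remaining frame is the app's own signal handler. That suggests the fault is inside the OpenCL compiler stack rather than in the science app itself, though the trace alone does not prove it.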
Please unhide your computers or post a link to the troubled host.
What are you looking for, is
Actually not really sure what I'm looking for but going through the latest tasks (failed or not), scheduler logs and system specs might point to something. And if I'm not able to, as a mostly Windows user, then others might see something amiss and have advice for you to try.
If you're concerned about what will be shown then click on my alias and browse my computers, I think it's harmless to show my computers here on the site.
As for solving your problem, start by making sure that the graphics driver, and especially OpenCL, is installed correctly. Other than that I'll have to defer to one of the resident Linux specialists.
Thanks Holmis, much appreciated. Sent you my host link.
I am using the OSS driver and OpenCL stack so I didn't have to install anything, which is not to say it's not broken, only that I didn't break it, knowingly. SETI@Home has been broken for 5 years, now, after working for 3; so, who knows. Just hoping that someone has an idea why this is happening all of a sudden. It's very likely that the stack was updated, but the main problem remains that, when there are problems in the stack, it's impossible, AFAIK, to troubleshoot. I just rely on folks recognizing the error or knowing from some upstream source that there is a bug and I can check to see if that bug exists on my system.
Paul wrote: Can someone interpret this for me?
Nope, I can't :-).
Let's summarise things. Correct me if I get it wrong. You had an older version of the OS (presumably Fedora) and you upgraded to the latest. Do you remember what you had to do with the older install to get OpenCL libs installed and everything working properly? Did you repeat this procedure for the latest install, using the latest (presumably compatible) versions of OpenCL? You don't mention what GPU you are using and your computers are hidden so I can't look to see. Could it be that your GPU isn't supported any more by the latest drivers? What exactly is your GPU?
My guess is that there is some sort of incompatibility between the OS and/or the OpenCL libs you have installed and/or your hardware.
I'm running lots of AMD GPUs on a Linux (PCLinuxOS) which is not supported directly by the stuff that AMD supplies in the AMDGPU-PRO package. The supported systems are Red Hat, Ubuntu, and OpenSUSE. Does the Red Hat version also support Fedora? Do the Fedora maintainers provide their own version of that package? Have you asked about this on Fedora Forums?
Here is a summary of what I do with relatively modern GPUs, seeing as I have a totally unsupported OS. PCLOS is RPM based so I download the Red Hat version of AMDGPU-PRO and extract the contents (approx 50 separate RPMs). I select about 5 of those that contain libs that I need - mainly pertaining to OpenCL. I install these bits under two main paths - /opt/amdgpu/ and /opt/amdgpu-pro/. I could give you a list of the RPMs I use and the filenames of everything I install. It's not that large a list.
There are a couple of other bits that go elsewhere (eg. under /etc/). I make sure that BOINC can find what it needs by setting the LD_LIBRARY_PATH environment variable
LD_LIBRARY_PATH=/opt/amdgpu-pro/lib64:/opt/amdgpu/lib64
in the script I use to launch BOINC. If you have OpenCL properly installed, you should be able to run clinfo and have it report the OpenCL capabilities of your GPU. Have you tried running clinfo to see what it says? Here is a small extract from a terminal session where I run clinfo (it's not in my $PATH) with the LD_LIBRARY_PATH set and just grab the first 15 lines to see if all looks OK. I do this after a new install just to make sure everything looks OK.
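The same check can be scripted. This is only a rough sketch in Python, assuming the libs live under the two /opt paths mentioned above. The helper names env_with_libs and clinfo_head are made up for illustration, and the clinfo location is whatever applies on your own system.

```python
import os
import shutil
import subprocess

# Directories taken from the description above; adjust to your own install.
LIB_DIRS = ["/opt/amdgpu-pro/lib64", "/opt/amdgpu/lib64"]


def env_with_libs(lib_dirs, base_env=None):
    """Return a copy of the environment with lib_dirs prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    parts = list(lib_dirs) + ([existing] if existing else [])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


def clinfo_head(clinfo_path="clinfo", lines=15, lib_dirs=LIB_DIRS):
    """Run clinfo with the adjusted environment and return its first output lines.

    Returns None when the clinfo executable cannot be found.
    """
    exe = shutil.which(clinfo_path)
    if exe is None:
        return None
    result = subprocess.run([exe], env=env_with_libs(lib_dirs),
                            capture_output=True, text=True)
    return result.stdout.splitlines()[:lines]
```

Calling clinfo_head("/path/to/clinfo") and printing the returned lines gives roughly the "first 15 lines" check described above. If the list comes back empty or clinfo exits with an error, the OpenCL libs are probably not being found.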
I've been installing the OpenCL libs this way since the 16.60 version of the amdgpu-pro package in late 2016. There have been changes along the way but I've been able to work out what to do to handle the differences. The latest version I've tried is 19.10. There doesn't seem to be any real difference in crunching performance with the different versions of the libs. There have been reliability benefits by keeping up with the latest versions of the amdgpu graphics driver that's built in to the kernel. To get these benefits, you need to be running relatively recent kernels.
One thing you could do to start with is post a copy of all the event log messages you get when you launch BOINC. That should provide some information about what BOINC thinks of the OpenCL capabilities of your GPU. Here is an example of what I see on one of mine.
Cheers,
Gary.
I'd started composing a few thoughts on this problem but see that Gary has given a much better reply than I ever could when it comes to Linux, so I'll leave you in his capable care.
Hey Gary!
I think you and I have been around this block once before. I'm using an all-OSS stack, and I have not been able to get the hybrid system you described working on my system. My computers are no longer hidden, if you want to look at that stuff.
The OSS stack only seems to get better. Now, even the clpeak and clinfo work all the time, when they used to give errors or show missing pieces. So, I'm more reluctant than before to go to the -PRO. I keep up with the kernels every week and I know Fedora stays a bit ahead of Debian/Ubuntu on that front.
Platform Name Portable Computing Language
Platform Vendor The pocl project
Platform Version OpenCL 1.2 pocl 1.5-pre, RelWithDebInfo, LLVM 9.0.0, RELOC, SLEEF, DISTRO, POCL_DEBUG
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd
Platform Extensions function suffix POCL
Platform Name Clover
Number of devices 1
Device Name AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.33.0, 5.3.7-301.fc31.x86_64, LLVM 9.0.0)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 19.2.2
Driver Version 19.2.2
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 36
Max clock frequency 1288MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 64
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 8589934592 (8GiB)
Error Correction support No
Max memory allocation 6871947673 (6.4GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support No
Local memory type Local
Local memory size 32768 (32KiB)
Max number of constant args 16
Max constant buffer size 2147483647 (2GiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Profiling timer resolution 0ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_fp16
Platform Name Portable Computing Language
Number of devices 1
Device Name pthread-AMD Ryzen 7 3700X 8-Core Processor
Device Vendor AuthenticAMD
Device Vendor ID 0x6c636f70
Device Version OpenCL 1.2 pocl HSTR: pthread-x86_64-unknown-linux-gnu-znver1
Driver Version 1.5-pre
Device OpenCL C Version OpenCL C 1.2 pocl
Device Type CPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 16
Max clock frequency 3600MHz
Device Partition (core)
Max number of sub-devices 16
Supported partition types equally, by counts
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 4096x4096x4096
Max work group size 4096
Preferred work group size multiple 8
Preferred / native vector sizes
char 16 / 16
short 16 / 16
int 8 / 8
long 4 / 4
half 0 / 0 (n/a)
float 8 / 8
double 4 / 4 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 14654013440 (13.65GiB)
Error Correction support No
Max memory allocation 4294967296 (4GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 16777216 (16MiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 268435456 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 128
Local memory type Global
Local memory size 8388608 (8MiB)
Max number of constant args 8
Max constant buffer size 8388608 (8MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
printf() buffer size 16777216 (16MiB)
Built-in kernels (n/a)
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Clover
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [MESA]
clCreateContext(NULL, ...) [default] Success [MESA]
clCreateContext(NULL, ...) [other] Success [POCL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Clover
Device Name AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.33.0, 5.3.7-301.fc31.x86_64, LLVM 9.0.0)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Clover
Device Name AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.33.0, 5.3.7-301.fc31.x86_64, LLVM 9.0.0)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Clover
Device Name AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.33.0, 5.3.7-301.fc31.x86_64, LLVM 9.0.0)
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.12
ICD loader Profile OpenCL 2.2
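A dump this long is easier to compare between systems once it is boiled down to one line per device. Here is a small sketch in Python, assuming the "Platform Name", "Device Name" and "Device Version" labels appear at the start of their lines as they do above. summarize_clinfo is a name I made up and is not part of clinfo.

```python
def summarize_clinfo(text):
    """Collect platform, device name and device version from clinfo text output."""
    devices = []
    platform = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("NULL platform behavior"):
            # The trailing section repeats platform/device names; stop before it.
            break
        if line.startswith("Platform Name"):
            platform = line[len("Platform Name"):].strip()
        elif line.startswith("Device Name"):
            current = {"platform": platform,
                       "name": line[len("Device Name"):].strip(),
                       "version": None}
            devices.append(current)
        elif line.startswith("Device Version") and current is not None:
            current["version"] = line[len("Device Version"):].strip()
    return devices


# A few lines copied from the clinfo output above.
sample = """\
Platform Name Clover
Number of devices 1
Device Name AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.33.0, 5.3.7-301.fc31.x86_64, LLVM 9.0.0)
Device Version OpenCL 1.1 Mesa 19.2.2
Platform Name Portable Computing Language
Number of devices 1
Device Name pthread-AMD Ryzen 7 3700X 8-Core Processor
Device Version OpenCL 1.2 pocl HSTR: pthread-x86_64-unknown-linux-gnu-znver1
NULL platform behavior
Platform Name Clover
Device Name AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.33.0, 5.3.7-301.fc31.x86_64, LLVM 9.0.0)
"""

for dev in summarize_clinfo(sample):
    print(dev["platform"], "|", dev["name"], "|", dev["version"])
```

In the output above, both the Clover device name and the pocl platform version mention LLVM 9.0.0, which is the same library that shows up in the failed task's stack trace.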
Hey Gary! I tried your method again, and now it seems all OpenCL is broken on my system. clinfo crashes, whereas, before, it worked fine. BOINC cannot find any OpenCL devices.
Here is what I did. I got all the AMD driver stuff and ran amd-install, which installs just the non-PRO stuff. That worked for all the packages...except two.
amdgpu-core-19.30-934563
amdgpu-dkms
Now, as far as I can tell, these are not the important packages in the "hybrid" system you described. Is that right?
Second, the install did not install any -PRO things, but I see that you are adding -pro/ files to your LD path. So, I'm not sure I installed all the same things you did. You said 50 pkgs?! I just want to double check: you installed both non-PRO and PRO libraries, but *not* the actual driver or dkms packages. Is that right?
To answer your specific questions, I studied the install scripts provided by AMD to work out the absolute minimum to install to provide OpenCL (eg. things pointed to by --headless or --compute type options). I found what I needed in a very limited set of rpms - from memory just 5 - and some of that may not have been required.
It's probably going to take me a while to document everything I have done so it will be in a form useful to you. This is just a preliminary response to let you know that I'm working on it. I do understand quite a bit more about what I'm doing now, so hopefully I can explain things a bit better this time on the merry-go-round :-).
As a teaser, you might be interested to know that I'm now able to choose any version of the OpenCL compute libs right up to the very latest 19.30 package that AMD released late last year. With just a few components from that 19.30 package, and with the latest kernels/amdgpu modules (I'm using a 5.4.6 kernel), I can now run on *any* GCN GPU including my swag of GCN 1st Gen (Southern Islands) GPUs. The hosts with these GPUs were running a mid-2016 version of PCLOS which had the last version of Xorg that supported the proprietary fglrx/OpenCL components from the final Catalyst package that AMD released before deprecating fglrx.
I've been documenting my attempts to get valid results on SI series GPUs in this thread. I started the story around the middle of last year when I was able to start using the 19.10 version of the former AMDGPU-PRO package for testing. Just recently, I've updated that thread with the success story, now that I've moved to the latest of everything, including the 19.30 version of AMDGPU-PRO (Radeon Software for Linux). Links, particularly in the opening post, are not likely to show much since, once confirmed working well, machines are being shut down again until the autumn. Tasks referred to in earlier times are no longer going to exist in the current online database. Current tasks will tend to disappear quite quickly after a machine is shut down.
Cheers,
Gary.
Okay, I follow that. I can read the scripts, too. Thanks for explaining.