I'm getting errors on every WU I try; the stderr output is shown below. Any hint as to what is wrong? BRP4 OpenCL ATI WUs are crunched without problems.
Linux Mint, BOINC 7.2.33, CAL 1.4.1741, OpenCL 1.2 (1016.4)
7.2.33
process exited with code 8 (0x8, -248)
[22:25:34][9891][INFO ] Application startup - thank you for supporting Einstein@Home!
[22:25:34][9891][INFO ] Starting data processing...
[22:25:34][9891][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[22:25:34][9891][INFO ] Using OpenCL device "Tahiti" by: Advanced Micro Devices, Inc.
[22:25:35][9891][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[22:25:35][9891][INFO ] Header contents:
------> Original WAPP file: ./PA0060_00581_DM36.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 54023.03363677561
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 75444.7089996
------> DEC (J2000): -370056.157001
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4539477
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 36 cm^-3 pc
------> Scale factor: 1.57895
[22:25:36][9891][INFO ] Seed for random number generator is 1090763708.
[22:25:41][9891][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-08
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[22:25:42][9891][ERROR] Application caught signal 8.
------> Obtained 17 stack frames for this thread.
------> Backtrace:
Frame 17:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x480ae6)
Source file: erp_boinc_wrapper.cpp (Function: sighandler / Line: 167)
Frame 16:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc57993f)
Offset info: +0x32993f
Frame 15:
Binary file: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fbfff463cb0)
Offset info: +0xfcb0
Frame 14:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc7bd549)
Offset info: +0x56d549
Frame 13:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc6500da)
Offset info: +0x4000da
Frame 12:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc650260)
Offset info: +0x400260
Frame 11:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc9eadeb)
Offset info: +0x79adeb
Frame 10:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc6f4c82)
Offset info: +0x4a4c82
Frame 9:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc5a7779)
Offset info: +0x357779
Frame 8:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc5c78d9)
Offset info: +0x3778d9
Frame 7:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc5cab81)
Offset info: +0x37ab81
Frame 6:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc55e839)
Offset info: +0x30e839
Frame 5:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc55f38d)
Offset info: +0x30f38d
Frame 4:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc57b2b1)
Offset info: +0x32b2b1
Frame 3:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fbffc578adc)
Offset info: +0x328adc
Frame 2:
Binary file: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fbfff45be9a)
Offset info: +0x7e9a
Frame 1:
Binary file: /lib/x86_64-linux-gnu/libc.so.6 (0x7fbffeb893fd)
Offset info: clone+0x6d
------> End of backtrace
22:25:47 (9891): called boinc_finish
Perseus Arm Survey v1.39 (BRP5-opencl-ati) computing error in Li
I got an error with exit code 11 in the same circumstances.
WUs for Einstein@Home never start on the GPU, but Milkyway@Home is working fine.
Ubuntu 12.04 LTS, BOINC 7.2.33, ATI 7730, with the proprietary Catalyst driver v13.
These lines are from stdoutdae.txt:
And this is the stderr output:
process exited with code 11 (0xb, -245)
[16:23:40][2662][INFO ] Application startup - thank you for supporting Einstein@Home!
[16:23:40][2662][INFO ] Starting data processing...
[16:23:40][2662][ERROR] Application caught signal 11.
------> Obtained 22 stack frames for this thread.
------> Backtrace:
Frame 22:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x480ae6)
Source file: erp_boinc_wrapper.cpp (Function: sighandler / Line: 167)
Frame 21:
Binary file: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fc869e61cb0)
Offset info: +0xfcb0
Frame 20:
Binary file: /lib/x86_64-linux-gnu/librt.so.1 (0x7fc8686eb14b)
Offset info: clock_gettime+0xb
Frame 19:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc8663fa331)
Offset info: +0x3bd331
Frame 18:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc8663fa393)
Offset info: +0x3bd393
Frame 17:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc866423246)
Offset info: +0x3e6246
Frame 16:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc86642510b)
Offset info: +0x3e810b
Frame 15:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc866427952)
Offset info: +0x3ea952
Frame 14:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc866427bc6)
Offset info: +0x3eabc6
Frame 13:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc8663da324)
Offset info: +0x39d324
Frame 12:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc8663f6aae)
Offset info: +0x3b9aae
Frame 11:
Binary file: /usr/lib/fglrx/libamdocl64.so (0x7fc8663c7693)
Offset info: clIcdGetPlatformIDsKHR+0x93
Frame 10:
Binary file: /usr/lib/fglrx/libOpenCL.so.1 (0x7fc86a071172)
Offset info: +0x2172
Frame 9:
Binary file: /usr/lib/fglrx/libOpenCL.so.1 (0x7fc86a073106)
Offset info: +0x4106
Frame 8:
Binary file: /usr/lib/fglrx/libOpenCL.so.1 (0x7fc86a0727e0)
Offset info: clGetPlatformIDs+0x20
Frame 7:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x728290)
Offset info: _Z24boinc_get_opencl_ids_auxPciiPP13_cl_device_idPP15_cl_platform_id+0x76
Source file: unknown (Function: boinc_get_opencl_ids_aux(char*, int, int, _cl_device_id**, _cl_platform_id**) / Line: 0)
Frame 6:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x72855d)
Offset info: _Z20boinc_get_opencl_idsPP13_cl_device_idPP15_cl_platform_id+0xf3
Source file: unknown (Function: / Line: 0)
Frame 5:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x4855ee)
Offset info: MAIN+0x485e
Source file: demod_binary.c (Function: MAIN / Line: 462)
Frame 4:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x4802cd)
Source file: erp_boinc_wrapper.cpp (Function: worker / Line: 453)
Frame 3:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x4807a3)
Offset info: main+0x113
Source file: erp_boinc_wrapper.cpp (Function: main / Line: 554)
Frame 2:
Binary file: /lib/x86_64-linux-gnu/libc.so.6 (0x7fc8694b776d)
Offset info: __libc_start_main+0xed
Frame 1:
Binary file: ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP5_1.39_x86_64-pc-linux-gnu__BRP5-opencl-ati (0x47fa79)
Offset info: realloc+0x249
Source file: unknown (Function: _start / Line: 0)
------> End of backtrace
16:23:40 (2662): called boinc_finish
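For anyone reading the two logs above: the number in "Application caught signal N" is a plain POSIX signal number. A minimal sketch for looking up the name on your own machine follows. The helper name `signal_name` is made up for illustration, and the mapping is platform-dependent; on Linux x86-64, 8 is SIGFPE (arithmetic exception) and 11 is SIGSEGV (invalid memory access).

```python
import signal


def signal_name(signum: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number is not a known signal on this platform.
        return f"unknown signal {signum}"


# The two signal numbers reported in the stderr logs of this thread.
for signum in (8, 11):
    print(signum, signal_name(signum))
```

This only names the signal. It does not say why the OpenCL library raised it, so the backtrace frames are still the main clue.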
This MAY not be a high-priority problem for the project. I found this on the server status page:
Tasks valid 86,356
Tasks invalid 20,752
That is nearly a 25% error rate, which to me is very high! I have turned off the BRP5 units and will see what other GPU units I get, or I will switch to another project that gives me a much lower error rate.
Here is a link to the page:
http://einstein.phys.uwm.edu/server_status.html
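As a quick check of the arithmetic, here is a small sketch using only the two counts quoted from the server status page. The variable names are mine. It shows that "nearly 25%" matches invalid tasks divided by valid tasks, while invalid tasks as a share of all listed tasks comes out lower.

```python
# Counts quoted from the server status page in this post.
valid = 86_356
invalid = 20_752

# Invalid tasks relative to valid tasks (the "nearly 25%" figure).
ratio_to_valid = invalid / valid

# Invalid tasks as a share of all listed tasks (valid + invalid).
share_of_total = invalid / (valid + invalid)

print(f"invalid / valid:             {ratio_to_valid:.1%}")
print(f"invalid / (valid + invalid): {share_of_total:.1%}")
```

Either way, both figures describe a snapshot of what is in the database at that moment, not a per-host failure rate.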
RE: I got error with exit
I notice you are getting the same error on BRP4G as well ...
http://einsteinathome.org/task/420680341
RE: This MAY not be a high
Have you actually looked for errors on your own host(s) to see if your own error rate is anything like the one you are quoting?
Have you tried to think of any particular reasons why the real error rate is likely to be quite a bit less than the rather simplistic number you have deduced?
Can you tell us how the project Admins are supposed to assign a higher priority to and presumably fix problems that (in the main) are most likely being caused by hardware or software issues on the volunteer's computer?
Here are a few things to think about. The server status page is a snapshot of the data in the online database. A lot of the successfully completed tasks would get validated promptly and then be removed from the database quite quickly. Some time ago Bernd posted that he was reducing the amount of time for retaining tasks after validation in order to keep the size of the database within acceptable bounds. For tasks that fail, they have to be retained for a much longer period until one or more resends are crunched and returned to give a successful validation. Failed tasks are going to keep showing up in the figures long after successful tasks have been removed.
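The retention effect described above can be sketched with a toy steady-state model. Everything in it is an assumption for illustration: the function name, the 2% true failure rate, and the 7-day and 60-day retention periods are invented, not the project's real settings. The idea is simply that the number of tasks of each kind sitting in the database is roughly the arrival rate multiplied by how long that kind is kept.

```python
def snapshot_invalid_share(true_error_rate: float,
                           days_valid_kept: float,
                           days_failed_kept: float,
                           tasks_per_day: float = 1000.0) -> float:
    """Share of failed tasks seen in a database snapshot at steady state.

    Hypothetical model: tasks in the database = arrival rate * retention time.
    """
    valid_in_db = tasks_per_day * (1.0 - true_error_rate) * days_valid_kept
    failed_in_db = tasks_per_day * true_error_rate * days_failed_kept
    return failed_in_db / (valid_in_db + failed_in_db)


# Assumed numbers only: 2% of tasks really fail, valid tasks are purged
# after 7 days, failed tasks stay visible for 60 days.
share = snapshot_invalid_share(0.02, days_valid_kept=7, days_failed_kept=60)
print(f"true error rate: 2.0%, share seen in snapshot: {share:.1%}")
```

With equal retention periods the snapshot share equals the true rate; the longer failed tasks are kept relative to valid ones, the more the snapshot overstates it.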
I did a survey of 5 of my own hosts that are crunching BRP5. In total they had around 1000 tasks in the database. There were no computation errors but there were 7 validate errors. Not all tasks were completed. Some were 'in progress' or 'pending' and some of these could eventually become errors. Let's imagine the number of failed tasks doubles to 14. Apparently I have about a 1.4% error rate for those 5 hosts. It would seem that there's not too much of a problem for the Devs to prioritize and 'fix'.
It takes quite a while to accumulate a given number of crunched tasks. An entire cache of work can be trashed in an instant if there is a hardware failure. This happened to me recently. A machine that had been crunching without incident for months suddenly trashed the entire cache of work. I took it out of service, gave it a good clean and put it back to work. It seemed to be fine and when some new tasks completed, I was able to fill the cache. The next day I found that it had trashed all the tasks again overnight. So I fired up memtest and let it run for a couple of hours. Sure enough, after a while, a small number of memory errors started showing up in the log. I replaced the RAM, put it back to work and there hasn't been any further problem.
The point is that a relatively small number of hardware issues like this can put a relatively large number of error results into the database quite quickly. Is it fair to use random events like this to claim that the project has an unacceptably high error rate that is not being addressed in a timely fashion by the Devs?
Cheers,
Gary.
You are both running an AMD GPU under Linux.
A driver issue seems most likely. Both use driver version 1.4.1741. Is it possible to try a different version and see if the problem goes away?
RE: RE: This MAY not be a
Yes and yes. I have more than a couple of errors on each machine.
Not to get snippy, but telling YOU how to run YOUR project is not my job or my interest.
I have now installed the latest BOINC, version 7.2.42, and the problem vanished. BRP5-opencl-ati began crunching.