Error while computing

Apostolos Papag...

Joined: 16 Nov 10

Posts: 1

Credit: 51916

RAC: 0

18 Nov 2010 13:12:00 UTC

Topic 195459

Can someone please check these links to see what happened and whether anything can be done about it? Thanks :)

http://einsteinathome.org/task/206299674

http://einsteinathome.org/task/206299620

http://einsteinathome.org/task/206299080

Gundolf Jahn

Joined: 1 Mar 05

Posts: 1079

Credit: 341280

RAC: 0

Error while computing

18 Nov 2010 15:49:45 UTC

Message 100786

Did you try a reboot to clear up anything stuck with your GPU?

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

mikey

Joined: 22 Jan 05

Posts: 12829

Credit: 1883772828

RAC: 1095498

RE: Can someone please

19 Nov 2010 12:33:13 UTC

Message 100787

Quote:

Can someone please check these links to see what happened and whether anything can be done about it? Thanks :)

http://einsteinathome.org/task/206299674

http://einsteinathome.org/task/206299620

http://einsteinathome.org/task/206299080

When you installed BOINC, did you use the defaults or did you specify the directories, etc.? Also, are you the only person using the PC, or are there several people using it, each with a different login?

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5877

Credit: 118660320489

RAC: 19088619

RE: Can someone please

20 Nov 2010 2:53:09 UTC

Message 100788

Quote:

Can someone please check these links to see what happened and whether anything can be done about it? Thanks :)

Hi Apostolos,

Welcome to the Einstein project.

I've checked those three links and the info that each contained. I don't run any CUDA-capable GPUs, so I have no direct experience, but I think I can figure out basically what happened, though not why it happened. Unfortunately, there is nothing that can be done about the tasks that were trashed. They will have been resent to someone else for completion.

As you listed them, it was actually the third one that failed first (check out the 'sent' and 'returned' times recorded in each file), and the other two failed as a direct consequence of what happened in the first failure. So, looking at the data for the third link, we see the following snippets, together with my commentary:

One of the files in the registry database had to be recovered by use of a log or alternate copy. The recovery was successful. (0x3f6) - exit code 1014 (0x3f6)

This is a message from your OS that you should investigate (google?) but it doesn't seem to be causing a problem for BOINC.

Activated exception handling...
....
....

These lines are quite normal and indicate the successful start of an ABP2 task. If you ever stop and restart BOINC or reboot your computer, you will see a repeat of these startup lines each time. You will find one of these restarts later in the file.

[17:00:22][5700][INFO ] Checkpoint committed!
[17:01:22][5700][INFO ] Checkpoint committed!
[17:02:22][5700][INFO ] Checkpoint committed!
[17:03:22][5700][INFO ] Checkpoint committed!
....
....
[17:24:53][5700][INFO ] Data processing finished successfully!

These follow immediately after the previous initialisation output and show a checkpoint being saved every minute. ABP2 tasks (binary pulsar search) each consist of 10 'mini-tasks' sent together to form one large task. This is simply for server convenience. So in a fully completed task, you should find 10 sets of these 'Checkpoint committed' messages. In your case there are 7 full sets and an 8th 'partial set'. This partial one is immediately followed by

....
[20:15:48][5700][INFO ] Checkpoint committed!
[20:16:48][5700][INFO ] Checkpoint committed!
Activated exception handling...

which indicates that soon after the 20:16:48 checkpoint was written, either crunching was stopped and restarted or the machine was rebooted, or something of this nature. The timestamp given immediately after restarting [21:00:30] shows that crunching had stopped for about 43 minutes.
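The 43-minute figure can be checked directly from the timestamps. Here is a minimal Python sketch that does this, assuming the line layout seen in the snippets above, `[HH:MM:SS][number][LEVEL]`, and assuming the second bracketed field is a process ID that changes when the app is restarted. The function name `restart_gaps` and the sample text are made up for illustration; they are not part of BOINC or the Einstein app.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the snippets quoted in this post:
# [HH:MM:SS][number][LEVEL] message
ENTRY_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[\w+\s*\]")


def restart_gaps(lines):
    """Return (last_stamp, first_stamp, gap) for each change in the second field."""
    gaps = []
    prev = None  # (timestamp string, assumed process ID) of the previous entry
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if m is None:
            continue  # e.g. "Activated exception handling..."
        stamp, pid = m.group(1), m.group(2)
        if prev is not None and pid != prev[1]:
            gap = (datetime.strptime(stamp, "%H:%M:%S")
                   - datetime.strptime(prev[0], "%H:%M:%S"))
            if gap < timedelta(0):
                # The log has no dates, so assume a single midnight rollover.
                gap += timedelta(days=1)
            gaps.append((prev[0], stamp, gap))
        prev = (stamp, pid)
    return gaps


# Illustrative sample assembled from lines quoted in this thread.
SAMPLE = """\
[20:15:48][5700][INFO ] Checkpoint committed!
[20:16:48][5700][INFO ] Checkpoint committed!
Activated exception handling...
[21:00:30][3488][INFO ] Starting data processing...
"""

for before, after, gap in restart_gaps(SAMPLE.splitlines()):
    print(f"{before} -> {after}: stopped for {gap}")
```

On the sample lines this reports a gap of 0:43:42 between the last checkpoint and the restart.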

On restarting, you can see that the seven completed 'mini-tasks' were acknowledged and skipped, and that an attempt was made to reload the 8th, uncompleted one from a saved checkpoint. Immediately following this you see

[21:00:30][3488][INFO ] Starting data processing...
[21:00:30][3488][ERROR] Error acquiring "real" CUDA device!
------> The acquired device is a "Device Emulation (CPU)"
[21:00:30][3488][ERROR] Demodulation failed (error: 1014)!
21:00:30 (3488): called boinc_finish

This is the actual problem. My guess is that at the time the checkpoint was to be loaded into GPU memory, there wasn't enough free and available GPU memory to hold it - or something like that. I have no idea why this happened. You might be able to deduce the reason if you can remember why crunching was off for 43 minutes as logged. Did you run something else that consumed and didn't release your GPU RAM?
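If you want to scan a task's output for this kind of failure yourself, a small script along the following lines might help. It is only a sketch: it assumes the same `[HH:MM:SS][number][LEVEL] message` layout as the snippets above, and it assumes that each completed 'mini-task' ends with one 'Data processing finished successfully!' line. The function name `summarize` and the sample text are hypothetical.

```python
import re

# Assumed layout, taken from the snippets quoted in this post:
# [HH:MM:SS][number][LEVEL] message
ENTRY_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")


def summarize(log_text):
    """Count finished mini-tasks (assumed marker line) and collect ERROR entries."""
    finished = 0
    errors = []
    for line in log_text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m is None:
            continue  # continuation lines such as "------> ..." are skipped
        stamp, _pid, level, message = m.groups()
        if level == "ERROR":
            errors.append((stamp, message))
        elif message.startswith("Data processing finished successfully"):
            finished += 1
    return {"finished": finished, "errors": errors}


# Illustrative sample assembled from lines quoted in this thread.
SAMPLE = """\
[17:24:53][5700][INFO ] Data processing finished successfully!
[20:16:48][5700][INFO ] Checkpoint committed!
Activated exception handling...
[21:00:30][3488][INFO ] Starting data processing...
[21:00:30][3488][ERROR] Error acquiring "real" CUDA device!
------> The acquired device is a "Device Emulation (CPU)"
[21:00:30][3488][ERROR] Demodulation failed (error: 1014)!
21:00:30 (3488): called boinc_finish
"""

report = summarize(SAMPLE)
print("finished mini-tasks:", report["finished"])
for stamp, message in report["errors"]:
    print(stamp, message)
```

Run against the full output of a task, the count of finished mini-tasks and the first ERROR entry should show where in the task things went wrong.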

Once the first task had failed, the next two were immediate casualties of the same set of circumstances. At least you didn't have any crunching time wasted with those two.

Will this problem happen again? Possibly. The CUDA app is being worked on and a new version is expected 'when it's ready'. We are currently consuming the available ABP2 data at 7 times the rate at which new data is being produced, so ABP2 tasks will probably be a lot scarcer soon. The current CUDA app requires both a CPU and a GPU, and not very much of the total calculation load can actually be run on the GPU. There will usually be an improvement in total crunch time, but the downside is that you tie up both a CPU and the GPU to achieve it. The improvement will be modest or even non-existent with a low-end GPU. For these reasons, some volunteers prefer to use their GPUs for projects that make more efficient use of them.

Cheers,
Gary.
