Error while computing

Apostolos Papageorgiou
Joined: 16 Nov 10
Posts: 1
Credit: 51916
RAC: 0
Topic 195459

Can someone please check these links to see what happened and whether anything can be done about it? Thanks :)

http://einsteinathome.org/task/206299674

http://einsteinathome.org/task/206299620

http://einsteinathome.org/task/206299080

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Error while computing

Did you try a reboot to clear up anything stuck with your GPU?

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

mikey
Joined: 22 Jan 05
Posts: 12829
Credit: 1883772828
RAC: 1095498

RE: Can someone please

Quote:

Can someone please check these links to see what happened and whether anything can be done about it? Thanks :)

http://einsteinathome.org/task/206299674

http://einsteinathome.org/task/206299620

http://einsteinathome.org/task/206299080

When you installed BOINC, did you use the defaults or did you specify the directories yourself? Also, are you the only person using the PC, or do several people use it, each with a different login?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118660320489
RAC: 19088619

RE: Can someone please

Quote:
Can someone please check these links to see what happened and whether anything can be done about it? Thanks :)


Hi Apostolos,

Welcome to the Einstein project.

I've checked those three links and the info that each contained. I don't run any CUDA-capable GPUs, so I have no direct experience, but I think I can work out roughly what happened, though not why it happened. Unfortunately, nothing can be done about the tasks that were trashed. They will have been resent to someone else for completion.

In the order you listed them, it was actually the third one that failed first (check the 'sent' and 'returned' times recorded in each file), and the other two failed as a direct consequence of what happened in the first failure. So, looking at the data for the third link, we see the following snippets, together with my commentary:

One of the files in the registry database had to be recovered by use of a log or alternate copy. The recovery was successful. (0x3f6) - exit code 1014 (0x3f6)


This is a message from your OS that you should investigate (google?) but it doesn't seem to be causing a problem for BOINC.

Activated exception handling...
....
....


These lines are quite normal and indicate the successful start of an ABP2 task. If you ever stop and restart BOINC or reboot your computer, you will see a repeat of these startup lines each time. You will find one of these restarts later in the file.

[17:00:22][5700][INFO ] Checkpoint committed!
[17:01:22][5700][INFO ] Checkpoint committed!
[17:02:22][5700][INFO ] Checkpoint committed!
[17:03:22][5700][INFO ] Checkpoint committed!
....
....
[17:24:53][5700][INFO ] Data processing finished successfully!


These follow immediately after the previous initialisation output and show a checkpoint being saved every minute. ABP2 tasks (binary pulsar search) each consist of 10 'mini-tasks' sent together to form one large task. This is simply for server convenience. So in a fully completed task, you should find 10 sets of these 'Checkpoint committed' messages. In your case there are 7 full sets and an 8th 'partial set'. This partial one is immediately followed by

....
[20:15:48][5700][INFO ] Checkpoint committed!
[20:16:48][5700][INFO ] Checkpoint committed!
Activated exception handling...


which indicates that soon after the 20:16:48 checkpoint was written, either crunching was stopped and restarted or the machine was rebooted, or something of this nature. The timestamp given immediately after restarting [21:00:30] shows that crunching had stopped for about 43 minutes.
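The gap described above can be checked with a few lines of Python. This is only a rough sketch: it assumes every timestamped line has the shape `[HH:MM:SS][number][LEVEL] message` seen in the snippets in this thread, and that the number in the second bracket is a process ID that changes when the app is restarted. The function name `find_restarts` is made up for illustration and is not part of BOINC.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape, taken from the snippets quoted in this thread:
# [HH:MM:SS][process id][LEVEL] message
LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]")

def find_restarts(lines):
    """Yield (last time before, first time after, gap) for each change of process ID."""
    prev_time = prev_pid = None
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # lines such as "Activated exception handling..." carry no timestamp
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        pid = m.group(2)
        if prev_pid is not None and pid != prev_pid:
            gap = t - prev_time
            if gap < timedelta(0):
                gap += timedelta(days=1)  # the log has no dates, so assume one midnight rollover
            yield prev_time.time(), t.time(), gap
        prev_time, prev_pid = t, pid

# Lines copied from the snippets in this post
sample = [
    "[20:15:48][5700][INFO ] Checkpoint committed!",
    "[20:16:48][5700][INFO ] Checkpoint committed!",
    "Activated exception handling...",
    "[21:00:30][3488][INFO ] Starting data processing...",
]

restarts = list(find_restarts(sample))
for before, after, gap in restarts:
    print(before, "->", after, "gap", gap)  # 20:16:48 -> 21:00:30 gap 0:43:42
```

Since the log lines carry only a time of day, a stop lasting longer than 24 hours would not be measured correctly by this approach.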

On restarting, you can see that the seven completed 'mini-tasks' were acknowledged and skipped, and that an attempt was made to reload the 8th, uncompleted one from a saved checkpoint. Immediately following this you see

[21:00:30][3488][INFO ] Starting data processing...
[21:00:30][3488][ERROR] Error acquiring "real" CUDA device!
------> The acquired device is a "Device Emulation (CPU)"
[21:00:30][3488][ERROR] Demodulation failed (error: 1014)!
21:00:30 (3488): called boinc_finish


This is the actual problem. My guess is that at the time the checkpoint was to be loaded into GPU memory, there wasn't enough free and available GPU memory to hold it - or something like that. I have no idea why this happened. You might be able to deduce the reason if you can remember why crunching was off for 43 minutes as logged. Did you run something else that consumed and didn't release your GPU RAM?
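If you want to scan a long output file for this kind of thing rather than reading it by eye, something like the following Python sketch would do. It assumes the same `[HH:MM:SS][number][LEVEL] message` line shape as the snippets above, and it assumes that each "Data processing finished successfully!" line marks one finished 'mini-task'. The function name `summarize` is hypothetical and not anything provided by BOINC or the Einstein app.

```python
import re

# Assumed line shape, taken from the snippets quoted in this thread:
# [HH:MM:SS][process id][LEVEL] message
LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")

def summarize(lines):
    """Return (number of 'finished successfully' lines, list of (time, error message))."""
    finished = 0
    errors = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # continuation lines such as "------> ..." are skipped
        time, _pid, level, message = m.groups()
        if level == "ERROR":
            errors.append((time, message))
        elif message.startswith("Data processing finished successfully"):
            finished += 1
    return finished, errors

# Lines copied from the snippets in this post
sample = [
    "[17:24:53][5700][INFO ] Data processing finished successfully!",
    "[21:00:30][3488][INFO ] Starting data processing...",
    '[21:00:30][3488][ERROR] Error acquiring "real" CUDA device!',
    '------> The acquired device is a "Device Emulation (CPU)"',
    "[21:00:30][3488][ERROR] Demodulation failed (error: 1014)!",
]

finished, errors = summarize(sample)
print("finished sets:", finished)
for time, message in errors:
    print(time, message)
```

Run against the full output of the third task, the count would be expected to come out at 7, matching the seven complete sets mentioned earlier, provided the assumption about the "finished" line holds.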

Once the first task had failed, the next two were immediate casualties of the same set of circumstances. At least you didn't have any crunching time wasted with those two.

Will this problem happen again? Possibly. The CUDA app is being worked on and a new version is expected 'when it's ready'. We are currently consuming the available ABP2 data at 7 times the rate at which new data is being produced, so ABP2 tasks will probably be a lot scarcer soon. The current CUDA app requires both a CPU and a GPU, and not very much of the total calculation load can actually be run on the GPU. There will usually be an improvement in total crunch time, but the downside is that you tie up both a CPU and the GPU to achieve it. The improvement will be modest or even non-existent if it's a low-end GPU. For these reasons, some volunteers prefer to use their GPU for projects that make more efficient use of it.

Cheers,
Gary.
