Late completion

Pete
Joined: 31 Jul 10
Posts: 14
Credit: 1020243718
RAC: 0
Topic 198110

I wonder if anyone could explain what happened to this work unit, 219483160. As I see it, LATeah1052E_1424.0_374536_0.0 was initially sent out twice on the 26th of May with a 4 day deadline. It got 2 replies within that time and was then sent out again twice, but now with a 2 day deadline. I got one of the latter and replied in 2 and a half days. I completed too late to validate and therefore wasted my time/electricity.
1/ Why the ultra-short reply time? 2/ Why the halved reply time for me? I have never seen a 4 day, let alone a 2 day, reply time. Is it normal for LAT... work? My computer ID is 11715962. Regards, Peter

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2989343147
RAC: 701619

Late completion

Let's make that WU 219483160.

If you look at this Technical News thread, you'll see that Bernd was trying to hurry through some stragglers before re-launching the application with some minor alterations that would have caused validation problems if the two versions had run at the same time. It looks like you got caught up in the collateral damage, but it was a one-off situation and shouldn't be an ongoing problem.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118600280849
RAC: 18205008

Richard has given the

Richard has given the succinct answer, but you may want to limit your reading of the Technical News thread he linked to by starting at message #140950, where Bernd announced the problem with the previous app and gave a brief statement on how it would be fixed. None of the earlier messages in the thread are relevant.

When Bernd posted the initial message, I posted a comment about potential issues and he responded with more detail. If you look through what he wrote about how he intended to speed up the completion of outstanding work - by increasing the replication of in-progress tasks and reducing the deadlines so as to force them to come back quickly - you should be able to understand what happened to you.

Some time after the initial warning (about 15 days later), Bernd announced that all the remaining primary tasks for the old app had been sent out about 15 hours earlier and that he was now creating an extra 2 tasks for every workunit 'without a canonical result'. These had only a 2 day deadline and you obviously got one of them. It actually took a while to distribute these, and when they were all gone and the number of workunits 'without a canonical result' had dropped to almost zero, I posted again to advise people that they could quickly check for, and abort, any of their own tasks whose quorum already had a canonical result. Of course, for this to be of any use, you would need to have been following the progress of all this very closely.

The vast bulk of volunteers don't regularly read the message boards and would only have been aware of all this if they had noticed the short deadline work much earlier, around the 5 or 4 day deadline stage. One warning sign would have been the high priority mode triggered by the short deadlines. If you had a multi-day cache size it would have shown up quickly, and if you had taken a look at Technical News or asked a question then, you could have been alerted to what was happening.

There is a whole range of factors - BOINC version, mix of projects, mix of science runs within Einstein, work cache settings, number of concurrent GPU tasks, number of allowed CPU cores, etc. - that would influence how well your particular client could handle the short deadlines. Personally, I knew this was coming and I still got caught. I reduced my work cache settings a couple of times over the entire exercise, and by the end I was down to a 1 day cache (or less) after the 2 day deadline tasks appeared.

One of my problems was that running AMD GPUs with 4 concurrent GPU tasks (4x) reduces the cores available to crunch CPU tasks to just 2 on a quad core host. The BOINC client, however, still fetches CPU work as if there were 4 available CPU cores. At some point, with decreasing deadlines, BOINC is bound to go into high priority mode, and when it does, all 4 CPU cores start crunching tasks alongside the concurrent GPU tasks. I first noticed this when 5 day deadline tasks were being distributed and I had a 2.5 day cache size. The problem is the abysmal CPU crunching rate when the GPU is still active and all cores are crunching (none free). The CPU task output across all 4 cores is actually much worse than it was when only 2 cores were crunching - an untenable situation that rapidly deteriorates further. Fortunately, I noticed this early enough that I could suspend tasks to get out of high priority mode and then find enough already completed quorums to let me abort enough CPU tasks to get things back on track for those AMD-equipped hosts.

To avoid having to use a very low cache setting, or having to keep finding tasks to abort on AMD GPU hosts, I came up with a different strategy toward the end when the 2 day deadline was in play. I set BOINC to use 50% of CPU cores and changed (using app_config.xml) the CPU and GPU resources needed for each GPU task to 0.25 GPUs + 0.2 CPUs. That way, BOINC would know to fetch work for 2 CPU cores only and not to reserve any further cores for the 4 concurrent GPU tasks. Even if high priority mode subsequently kicked in (BOINC did this unnecessarily with each new 2 day task), the host would remain at 2 CPU tasks and 4 GPU tasks, as intended, and no real damage would result (since CPU tasks actually take less than the estimate if this crunching regime is maintained).
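
For anyone wanting to try something similar, here is a minimal sketch of the kind of app_config.xml I'm describing. The app name below is only a placeholder (check client_state.xml or the project directory for the exact name of the GPU app you actually run), and the file goes in the Einstein@Home project folder inside the BOINC data directory.

<app_config>
  <app>
    <!-- placeholder: replace with the exact app name from client_state.xml -->
    <name>your_gpu_app_name</name>
    <gpu_versions>
      <!-- 0.25 GPUs per task = 4 concurrent tasks per GPU -->
      <gpu_usage>0.25</gpu_usage>
      <!-- 0.2 CPUs per task: 4 x 0.2 = 0.8, so no whole core is reserved for GPU support -->
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Combined with 'Use at most 50% of the CPUs' in computing preferences, that leaves BOINC budgeting for 2 CPU tasks alongside the 4 GPU tasks. After saving the file, tell BOINC to re-read config files from the Manager (or just restart the client) so it takes effect.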

Having worked out the problem, I decided to keep aborting any tasks for which the quorum was already completed (and therefore kept getting further replacements) as a way of cleaning up as many of these 2 day deadline tasks as possible. I would have aborted several hundred tasks over the entire period, mainly from quad core hosts running AMD GPUs at 4x, but also from others I happened to notice.

I normally run a 3 to 4 day cache setting on all hosts and I try not to do any micro-managing of clients. I have an automatic management script that allows me to detect misbehaving clients very quickly. This script can also do lots of things on all clients (changing cache settings, for example) with a single input value. Even forewarned and forearmed with such a script, I found the sudden changes in deadline from 14 to 5 to 4 and finally 2 days quite difficult to deal with, in terms of understanding all the issues to be faced in keeping around 90 disparate hosts crunching smoothly. I'm not complaining. If a program has a bug it has to be fixed quickly, so what happened was necessary. I'm just hoping I can get back to a much less 'hands-on' existence now, with no more sudden shocks to come :-).
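
For anyone wondering how cache settings can be changed on lots of clients without visiting each preferences page, one common mechanism (a sketch only - not necessarily what my script actually does) is to write a global_prefs_override.xml into each host's BOINC data directory and then have the client re-read it with boinccmd --read_global_prefs_override. For example, to drop to a 1 day cache:

<global_preferences>
  <!-- 'store at least' this many days of work -->
  <work_buf_min_days>1.0</work_buf_min_days>
  <!-- 'store up to an additional' days of work - none here -->
  <work_buf_additional_days>0.0</work_buf_additional_days>
</global_preferences>

The override takes precedence over the web-based preferences until the file is removed, so it's an easy thing to push out and later delete across many hosts.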

Cheers,
Gary.
