"Unsent" WUs

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5,872

Credit: 117,586,503,171

RAC: 35,253,769

RE: Is this something

16 Sep 2008 13:22:58 UTC

Message 85526 in response to message 85525

(moderation:

)

Quote:

Is this something that's been instigated with the new application? I don't recall in the last 3 years having to wait an extended period of time for either the wu to be sent out to a wingman or a 5 to 7 day run of work to not receive credits within a resonable period of time!

It's certainly got nothing to do with the new science run. Comments about "unsent" tasks like this have come up several times in the past. If you do an advanced search on the word "unsent", you will find a few examples of this issue being commented on previously. Unfortunately, you can only go back a year with the advanced search. Here is a post by archae86 that was made almost a year ago now where he states that the second task for one of his, remained unsent 5 days after he received the original task.

I know for certain that there are much older examples than this. Obviously the advanced search function needs the ability to go back much longer than just one year to be useful for showing just how old this topic really is.

Quote:

... when you are used to seeing a consistant amount of credit showing up everyday and then suddenly it's 1/3 or 1/4 the normal amount you start looking for answers.

Do you also look for answers if (a little later on) you start getting days of double the normal credit? :-).

Cheers,
Gary.

Arion

Joined: 20 Mar 05

Posts: 147

Credit: 1,626,747

RAC: 0

RE: Do you also look for

17 Sep 2008 0:48:57 UTC

Message 85527 in response to message 85526

(moderation:

)

Quote:

Do you also look for answers if (a little later on) you start getting days of double the normal credit? :-).

Actually no I don't because I generally know that the servers have been down or that there was another problem involved.

As I stated before I have some of the unsent wu's but my biggest concern right now is that on one system I have 18 wu's out of 20 now that are still pending. Starting on Sept. 10th through today. That is unusual! If I hadn't taken one of the cores off it would be even larger. Both cores for 5 days straight of pending. Since I dropped it to only 1 core on Einstein those are now continuing the same streak.

If this werent' so unusual I wouldn't even to have bothered to say that I also had unsent wu's in response to someone else's inquiry.

As a side note I went back and looked through some of the wu's that are pending and there are very few that are unsent. Most have been sent out to a wingman but the catch is that it is either the same day I report my results or days later... 1 was even 4 days later.. maybe I am missing something here but I'm used to seeing both units going out at the same time or within hours of each other. NOT DAYS later!!! The point is that in 3 years I've never seen this before and I was looking for an explanation or if there was a potential problem at least notifying someone...

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5,872

Credit: 117,586,503,171

RAC: 35,253,769

RE: ... my biggest concern

17 Sep 2008 5:12:45 UTC

Message 85528 in response to message 85527

(moderation:

)

Quote:

... my biggest concern right now is that on one system I have 18 wu's out of 20 now that are still pending. Starting on Sept. 10th through today.

I really don't understand why you think this is a big concern. It's absolutely zero concern because this is precisely what can happen from time to time with locality scheduling.

Quote:

That is unusual!

It's NOT unusual - it happens all the time. For example, I decided to have a quick look at some of my hosts and within five minutes I had found several that were showing the exact same symptoms you describe - large pending lists and a discrepancy of several days between the issue of the first and second tasks for any given quorum.

If you think about it for a bit, you will be able to come up with a good reason why this is particularly noticeable at the moment. When we were doing S5R3 there tended to be a relatively small range of data frequencies that were "in play" at any one point in time. For instance when we went above 800Hz, the range in play extended from 800Hz to about 930Hz. When that range was largely finished, we proceeded on to the 940Hz to 1060Hz range. Finally towards the end, the range in play was something like 1070Hz to 1200Hz. The small range of active frequencies meant that there were larger numbers of hosts assigned to any given frequency step.

With the advent of S5R4, the noticeable difference is that all frequencies are in play. As I look through all my hosts, I've seen frequencies from as low as 70Hz to as high as 1200Hz, all at the same time. If there are 10 times more frequency steps in play, there will be 10 times less hosts available for any one frequency step. Don't you think that might lead to a difficulty for the scheduler to immediately issue the second task in a quorum once the first task has been issued?

Quote:

If I hadn't taken one of the cores off it would be even larger. Both cores for 5 days straight of pending. Since I dropped it to only 1 core on Einstein those are now continuing the same streak.

Actually, by removing one core you have probably made the behaviour worse. My experience is that the scheduler is more likely to do something sooner if the "unsent" queue of second tasks is getting larger. If you had added an extra core, the scheduler may have noticed the imbalance sooner and decided to do something about it (ie assign new hosts) in a more timely fashion. The only way the imbalance is rectified is by the addition of extra hosts to the particular frequency step.

Hundreds or thousands of new hosts are being added each day. The scheduler has to make a decision (for each one) about which frequency band is most deserving of the additional resources. Why wouldn't you try to convince the scheduler that your particular band is one of the most deserving?

Quote:

If this werent' so unusual I wouldn't even to have bothered to say that I also had unsent wu's in response to someone else's inquiry.

It's NOT unusual! I have the luxury of about 130 hosts to observe it on and I can assure you that it's quite common at the moment. I have hosts that have just as big an imbalance as yours.

Cheers,
Gary.

Arion

Joined: 20 Mar 05

Posts: 147

Credit: 1,626,747

RAC: 0

I stand informed!!!! I hadn't

17 Sep 2008 9:05:20 UTC

Message 85529 in response to message 85528

(moderation:

)

I stand informed!!!! I hadn't really looked at it that way and not knowning the reasoning behind everything just made it guess work.

I most assuredly appreciate the time and trouble you have gone through to give the details so that I can set up and run more effiecently.

Thanks Again Gary for giving me the insight to help resolve this...

Arion

[edit] Hey at least I didn't detatch or reset the project. I think I'm learning.
[/edit]

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5,872

Credit: 117,586,503,171

RAC: 35,253,769

RE: I most assuredly

17 Sep 2008 20:27:45 UTC

Message 85530 in response to message 85529

(moderation:

)

Quote:

I most assuredly appreciate the time and trouble you have gone through to give the details so that I can set up and run more effiecently.

That's quite OK and you're most welcome! I'm sorry if it appeared that I was nagging you as that wasn't my intention. I was concerned to make sure that other casual readers didn't think there really was an ongoing problem that wasn't being addressed.

Quote:

[edit] Hey at least I didn't detatch or reset the project. I think I'm learning.
[/edit]

Good for you :-).

Just out of interest I notice that one of my machines where the "unsents" were accumulating has now "caught up" quite dramatically. I saw that it had a bit of a boost in its RAC and when I checked, sure enough, a lot of the pendings had been granted. Are you seeing any "progress" yet?

Cheers,
Gary.

Arion

Joined: 20 Mar 05

Posts: 147

Credit: 1,626,747

RAC: 0

I got 3 cleared today, but

18 Sep 2008 2:12:15 UTC

Message 85531 in response to message 85530

(moderation:

)

I got 3 cleared today, but added another 3..... Sigh. But maybe the dam is about to break. I put NNW for the other project, as per your suggestion, last night and waiting for it to finish out the wru's that still need to be processed. Maybe that will help speed it along.

Arion

Joined: 20 Mar 05

Posts: 147

Credit: 1,626,747

RAC: 0

RE: Just out of interest I

22 Sep 2008 9:30:56 UTC

Message 85532 in response to message 85530

(moderation:

)

Quote:

Just out of interest I notice that one of my machines where the "unsents" were accumulating has now "caught up" quite dramatically. I saw that it had a bit of a boost in its RAC and when I checked, sure enough, a lot of the pendings had been granted. Are you seeing any "progress" yet?

Okay gave it a few more days and some of them have caught up. Looking at This WU as an example what is the reasoning behind the delay? It's sent to me first, after I finish it a few days later it is then sent to someone else. That couple of days delay in sending it out to someone else to validate as you said can take up to 5+ days. Until recently the work was sent out to 2 hosts USUALLY within hours of each other not days. Is this a result of the new server upgrade or as you said its because of the large band of frequencies that we are now expanding to?

Either way it seems to me that I would rather get the wu after someone else has had the opportunity to complete it already. And with any luck I'll only be the next guy he was waiting for instead of maybe the 3rd or 4th host down the line that hasn't timed out or had client errors and he hasn't had to wait but maybe a week for credit instead of between 2 to 6 weeks after he finished it.

Maybe I just need to get a life and not spend so much time watching over my computers and making sure everything is working right.

archae86

Joined: 6 Dec 05

Posts: 3,157

Credit: 7,221,884,931

RAC: 954,190

I also have a case of a

22 Sep 2008 20:36:39 UTC

Message 85533

(moderation:

)

I also have a case of a long-used machine which recently has seen extremely delayed issue of first quorum partner results.

The last result for which a partner result has been sent out was sent to my host on 18 Sep at 03:11, but only issued to another 3.5 days later. This for 541.75 work.

Previously it was doing 541.70 work, for which the last initial partner issues took place a full seven days after issue to my host, and the last few 541.70s in my series still have not gone to a partner at all.

This is not a case of my host having suddenly gobbled a giant dose--it is a Q6600 running with nominal 1.6 day queue.

I only have five hosts, but on a quick review I noticed that another one has just seen a seven-day delay from issue to it to first issue to a partner.

I'll speculate that something has altered the dynamics of WU issue the current work to make greatly delayed first partner issue much more common than in the past. Berndt has, I think, mentioned a deliberate change to a "more random" pattern of issue. More speculatively, perhaps the ATLAS sudden appearance followed by almost equally sudden reduction has someone put some transients in the system.

DigitalDingus

Joined: 15 Oct 06

Posts: 15

Credit: 1,105,785

RAC: 0

I currently have 5,000

23 Sep 2008 4:56:23 UTC

Message 85534 in response to message 85533

(moderation:

)

I currently have 5,000 pending credits. This is at the 66.xx count, so realistically, I'm looking at about 18,000 credits. I have WU's pending from August 31st.

Arion

Joined: 20 Mar 05

Posts: 147

Credit: 1,626,747

RAC: 0

RE: I'll speculate that

23 Sep 2008 8:41:16 UTC

Message 85535 in response to message 85533

(moderation:

)

Quote:

I'll speculate that something has altered the dynamics of WU issue the current work to make greatly delayed first partner issue much more common than in the past. Berndt has, I think, mentioned a deliberate change to a "more random" pattern of issue. More speculatively, perhaps the ATLAS sudden appearance followed by almost equally sudden reduction has someone put some transients in the system.

I knew it wasn't just my imagination that something was out of the ordinary. I usually check the message boards 3 or 4 times a day and while I'm at it I generally check to see how my returns are doing. One of the main reasons I've been keeping a close eye on the stats for 1 particular system is that I had troubles with it and it was RMA'ed and I'm watching for problems. Before the motehrboard was RMA'ed it was crashing WU's anywhere between 10 seconds and 10 minutes. That had gone on over a weekend before I caught up with it and then spent a week fighting to fix whatever the problem was. I want to make sure that nothing is screwed up on this system again and have it go on for an exteneded period of time before I catch it.

It's bad enough with all the client errors, detachments, time outs and delayed resends that I don't want to be one of the ones causing another wingman to be thrust into a LONG extented wait for someone else to get a unit to validate.

I'm not new at this and have been pretty regular in watching how my work has gone - good and bad. When I said "Something has Changed" and this is "Unusual" it was!!!!!! Einstein has been my MAIN project for 3 years with a 7 month hias after the changeover before the last one and there were a lot of problems with the AMD CPU's. Here we have another changeover and there's problems as well. That's no big deal as long as someone is aware that there is something wrong. Even if the parameters have changed that's not bad, but if no one stands up and says "Hey what's going on here?" then no one is alerted to a potential problem or can explain to those questioning what changed or offer decent explaination.

Prior to the changeover I was running around 3200+ credits per day "AVERAGE". After the changeover it was 2000+ on an average. There was a lot of disussions about the credit adjustment - good and bad. I got a reasonable explaination what was going on and said that I'm not really that worried about credits as long as they're reasonable, granted in a reasonable time frame and I don't care what other projects do, I'm staying here with Einstein. That was easy....

Now I'm seeing something that has changed "RECENTLY" with delays between first host and second host and I question it. Those unsent workunits are there because the server hasn't sent them out like it used to and its waiting a long time after it was originally sent and the result is returned before it is reissued. If the parameters have changed SAY SO so that (I)(We) can make reasonable decisions on how to set our work, especially if we are working on more than one project. If they haven't changed and this behavior is not an expected result, then its a problem and someone needs to be alerted so it can be fixed.

If this system after the latest update averages say 1500 credits a day and then it starts averaging 250, I'm going to look and see what the problem is. Rather than jump in and ask questions right away I waited until I had a week of this system getting NO credits and was seeing them completed and returned with no errors then I want to know what is going on...

The truth be known if there has been a configuration change for whatever explaination is given and the developers are expecting a certain behavior with the "locality" thing a bob and then start gettign reports such as mine and others and this is NOT what they wanted, then by golly someone needs to step up and inform them. Or even if it is an expected result, if some of the volunteers think this is too extreme maybe the developers want to know and make adjustments. I don't expect a noob - generalized reply when I've made it perfectly clear what the standards were at certain point sin this project and that those standards have changed. All I really want is a reasonable answer. Gary suggested in a detailed explaination that is was related to the "Locality" feature. Okay I'll buy that, if that's really whats going on here. In that case is the expected behavior that someone gets a wu first, finishes it and reports it back to Einstein and then 3 to 5 days, maybe 7 days later it is sent to the second host for validation? Is that the expected result? If not then it needs to be fixed. Somehow I think that isn't really what was expected, in that case the "locality" settings need to be adjusted.

Whatever it is needs to be looked at and let us know that is what we should be seeing or "Opps, we3 need to adjust some parameters." Thanks for pointing it out...

Just my 2cents worth.

"Unsent" WUs

Forums › Cruncher's Corner

Comment viewing options

Forums › Cruncher's Corner