Is this something that's been introduced with the new application? In the last 3 years I don't recall ever having to wait an extended period of time for a WU to be sent out to a wingman, or having a 5 to 7 day run of work not receive credit within a reasonable period of time!
It's certainly got nothing to do with the new science run. Comments about "unsent" tasks like this have come up several times in the past. If you do an advanced search on the word "unsent", you will find a few examples of this issue being commented on previously. Unfortunately, you can only go back a year with the advanced search. Here is a post by archae86, made almost a year ago now, where he states that the second task for one of his workunits remained unsent 5 days after he received the original task.
I know for certain that there are much older examples than this. Obviously the advanced search function needs the ability to go back much longer than just one year to be useful for showing just how old this topic really is.
Quote:
... when you are used to seeing a consistent amount of credit showing up every day and then suddenly it's 1/3 or 1/4 the normal amount, you start looking for answers.
Do you also look for answers if (a little later on) you start getting days of double the normal credit? :-).
Do you also look for answers if (a little later on) you start getting days of double the normal credit? :-).
Actually no I don't because I generally know that the servers have been down or that there was another problem involved.
As I stated before, I have some of the unsent WUs, but my biggest concern right now is that on one system 18 out of 20 WUs are still pending, starting on Sept. 10th through today. That is unusual! If I hadn't taken one of the cores off it would be even larger - both cores, 5 days straight of pending. Since I dropped it to only 1 core on Einstein, those are now continuing the same streak.
If this weren't so unusual, I wouldn't even have bothered to say that I also had unsent WUs in response to someone else's inquiry.
As a side note, I went back and looked through some of the WUs that are pending and there are very few that are unsent. Most have been sent out to a wingman, but the catch is that it happens either the same day I report my results or days later... one was even 4 days later. Maybe I am missing something here, but I'm used to seeing both units going out at the same time or within hours of each other - NOT DAYS later!!! The point is that in 3 years I've never seen this before, and I was looking for an explanation, or, if there was a potential problem, at least notifying someone...
... my biggest concern right now is that on one system I have 18 wu's out of 20 now that are still pending. Starting on Sept. 10th through today.
I really don't understand why you think this is a big concern. It's absolutely zero concern because this is precisely what can happen from time to time with locality scheduling.
Quote:
That is unusual!
It's NOT unusual - it happens all the time. For example, I decided to have a quick look at some of my hosts and within five minutes I had found several that were showing the exact same symptoms you describe - large pending lists and a discrepancy of several days between the issue of the first and second tasks for any given quorum.
If you think about it for a bit, you will be able to come up with a good reason why this is particularly noticeable at the moment. When we were doing S5R3 there tended to be a relatively small range of data frequencies that were "in play" at any one point in time. For instance when we went above 800Hz, the range in play extended from 800Hz to about 930Hz. When that range was largely finished, we proceeded on to the 940Hz to 1060Hz range. Finally towards the end, the range in play was something like 1070Hz to 1200Hz. The small range of active frequencies meant that there were larger numbers of hosts assigned to any given frequency step.
With the advent of S5R4, the noticeable difference is that all frequencies are in play. As I look through all my hosts, I've seen frequencies from as low as 70Hz to as high as 1200Hz, all at the same time. If there are 10 times more frequency steps in play, there will be 10 times fewer hosts available for any one frequency step. Don't you think that might make it difficult for the scheduler to immediately issue the second task in a quorum once the first task has been issued?
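To put some very rough numbers on that, here's a toy back-of-the-envelope sketch in Python. It is NOT the project's scheduler code, and the host counts and request rates below are made-up assumptions; it just shows how spreading the same host population over ten times as many frequency steps stretches the wait for an eligible wingman to come along:

# Toy model only: assume the unsent second task can only go to a host that is
# already set up for (or willing to pick up) the same frequency step, and that
# such eligible work requests arrive at a steady average rate.

def mean_wait_days(eligible_hosts, requests_per_host_per_day):
    """Mean days until the first eligible work request arrives (1 / total rate)."""
    total_rate = eligible_hosts * requests_per_host_per_day  # eligible requests per day
    return 1.0 / total_rate

# Hypothetical figures, purely for illustration:
print(mean_wait_days(20, 0.1))  # S5R3-like: many hosts per step -> 0.5 days
print(mean_wait_days(2, 0.1))   # S5R4-like: ~10x fewer hosts per step -> 5.0 days

The exact numbers mean nothing; the point is the inverse scaling.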
Quote:
If I hadn't taken one of the cores off it would be even larger. Both cores for 5 days straight of pending. Since I dropped it to only 1 core on Einstein those are now continuing the same streak.
Actually, by removing one core you have probably made the behaviour worse. My experience is that the scheduler is more likely to do something sooner if the "unsent" queue of second tasks is getting larger. If you had added an extra core, the scheduler may have noticed the imbalance sooner and decided to do something about it (i.e. assign new hosts) in a more timely fashion. The only way the imbalance is rectified is by the addition of extra hosts to the particular frequency step.
Hundreds or thousands of new hosts are being added each day. The scheduler has to make a decision (for each one) about which frequency band is most deserving of the additional resources. Why wouldn't you try to convince the scheduler that your particular band is one of the most deserving?
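Just to illustrate the kind of decision I mean, here's a deliberately simplified sketch. Again, this is my own guess at the sort of rule involved, not the actual BOINC locality scheduler, and the band labels and backlog counts are invented: when a brand-new host with no data files asks for work, send it to whichever frequency band currently has the biggest backlog of unsent second tasks.

def pick_band_for_new_host(unsent_by_band):
    """unsent_by_band maps a frequency band label to its count of unsent tasks."""
    # The band with the most starved quorums gets the new host first.
    return max(unsent_by_band, key=unsent_by_band.get)

# Invented backlog numbers, purely for illustration:
backlog = {"541.75": 37, "888.10": 5, "1073.40": 12}
print(pick_band_for_new_host(backlog))  # -> 541.75

If anything like that is in play, letting the "unsent" list for your band grow is actually what attracts new hosts to it sooner.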
Quote:
If this weren't so unusual, I wouldn't even have bothered to say that I also had unsent WUs in response to someone else's inquiry.
It's NOT unusual! I have the luxury of about 130 hosts to observe it on and I can assure you that it's quite common at the moment. I have hosts that have just as big an imbalance as yours.
I stand informed! I hadn't really looked at it that way, and not knowing the reasoning behind everything just made it guesswork. I most assuredly appreciate the time and trouble you have gone through to give the details so that I can set up and run more efficiently. Thanks again, Gary, for giving me the insight to help resolve this...
That's quite OK and you're most welcome! I'm sorry if it appeared that I was nagging you as that wasn't my intention. I was concerned to make sure that other casual readers didn't think there really was an ongoing problem that wasn't being addressed.
Quote:
[edit] Hey, at least I didn't detach or reset the project. I think I'm learning.
[/edit]
Good for you :-).
Just out of interest I notice that one of my machines where the "unsents" were accumulating has now "caught up" quite dramatically. I saw that it had a bit of a boost in its RAC and when I checked, sure enough, a lot of the pendings had been granted. Are you seeing any "progress" yet?
I got 3 cleared today, but added another 3..... Sigh. But maybe the dam is about to break. I set the other project to NNW (no new work) last night, as per your suggestion, and am waiting for it to finish out the WUs that still need to be processed. Maybe that will help speed it along.
Just out of interest I notice that one of my machines where the "unsents" were accumulating has now "caught up" quite dramatically. I saw that it had a bit of a boost in its RAC and when I checked, sure enough, a lot of the pendings had been granted. Are you seeing any "progress" yet?
Okay, I gave it a few more days and some of them have caught up. Looking at This WU as an example, what is the reasoning behind the delay? It's sent to me first; after I finish it, a few days later it is then sent to someone else. That delay in sending it out to someone else to validate can, as you said, take up to 5+ days. Until recently the work was sent out to 2 hosts USUALLY within hours of each other, not days. Is this a result of the new server upgrade, or, as you said, is it because of the large band of frequencies that we are now expanding into?
Either way, it seems to me that I would rather get the WU after someone else has already had the opportunity to complete it. With any luck I'll only be the next guy he was waiting for, instead of maybe the 3rd or 4th host down the line after time-outs and client errors, and he'll only have had to wait maybe a week for credit instead of 2 to 6 weeks after he finished it.
Maybe I just need to get a life and not spend so much time watching over my computers and making sure everything is working right.
I also have a case of a long-used machine which recently has seen extremely delayed issue of first quorum partner results.
The last result for which a partner result has been sent out was issued to my host on 18 Sep at 03:11, but only issued to another host 3.5 days later. This is for 541.75 work.
Previously it was doing 541.70 work, for which the last initial partner issues took place a full seven days after issue to my host, and the last few 541.70s in my series still have not gone to a partner at all.
This is not a case of my host having suddenly gobbled a giant dose of work - it is a Q6600 running with a nominal 1.6-day queue.
I only have five hosts, but on a quick review I noticed that another one has just seen a seven-day delay from issue to it to first issue to a partner.
I'll speculate that something has altered the dynamics of WU issue for the current work, making greatly delayed first partner issue much more common than in the past. Berndt has, I think, mentioned a deliberate change to a "more random" pattern of issue. More speculatively, perhaps ATLAS's sudden appearance, followed by an almost equally sudden reduction, has somehow put some transients in the system.
I currently have 5,000 pending credits. This is at the 66.xx count, so realistically, I'm looking at about 18,000 credits. I have WU's pending from August 31st.
I'll speculate that something has altered the dynamics of WU issue for the current work, making greatly delayed first partner issue much more common than in the past. Berndt has, I think, mentioned a deliberate change to a "more random" pattern of issue. More speculatively, perhaps ATLAS's sudden appearance, followed by an almost equally sudden reduction, has somehow put some transients in the system.
I knew it wasn't just my imagination that something was out of the ordinary. I usually check the message boards 3 or 4 times a day, and while I'm at it I generally check to see how my returns are doing. One of the main reasons I've been keeping a close eye on the stats for 1 particular system is that I had troubles with it, it was RMA'ed, and I'm watching for problems. Before the motherboard was RMA'ed it was crashing WUs anywhere between 10 seconds and 10 minutes in. That had gone on over a weekend before I caught up with it, and then I spent a week fighting to fix whatever the problem was. I want to make sure that nothing gets screwed up on this system again and goes on for an extended period of time before I catch it.
It's bad enough with all the client errors, detachments, time-outs and delayed resends; I don't want to be one of the ones causing another wingman to be thrust into a LONG extended wait for someone else to get a unit to validate.
I'm not new at this and have been pretty regular in watching how my work has gone - good and bad. When I said "Something has Changed" and "this is Unusual", it was!!! Einstein has been my MAIN project for 3 years, with a 7 month hiatus after the changeover before the last one, when there were a lot of problems with the AMD CPUs. Here we have another changeover and there are problems as well. That's no big deal as long as someone is aware that there is something wrong. Even if the parameters have changed, that's not bad, but if no one stands up and says "Hey, what's going on here?" then no one is alerted to a potential problem, and no one can explain to those questioning what changed or offer a decent explanation.
Prior to the changeover I was running around 3200+ credits per day "AVERAGE". After the changeover it was 2000+ on average. There were a lot of discussions about the credit adjustment - good and bad. I got a reasonable explanation of what was going on and said that I'm not really that worried about credits as long as they're reasonable and granted in a reasonable time frame; I don't care what other projects do, I'm staying here with Einstein. That was easy....
Now I'm seeing something that has changed "RECENTLY" with the delays between the first host and the second host, and I question it. Those unsent workunits are there because the server hasn't sent them out like it used to; it's waiting a long time after the original was sent, and the result returned, before it is reissued. If the parameters have changed, SAY SO, so that I (we) can make reasonable decisions on how to set up our work, especially if we are working on more than one project. If they haven't changed and this behavior is not an expected result, then it's a problem and someone needs to be alerted so it can be fixed.
If this system, after the latest update, averages say 1500 credits a day and then starts averaging 250, I'm going to look and see what the problem is. Rather than jump in and ask questions right away, I waited until I had a week of this system getting NO credits while seeing WUs completed and returned with no errors; then I wanted to know what was going on...
Truth be known, if there has been a configuration change, whatever explanation is given, and the developers are expecting a certain behavior with the "locality" thingamabob, and then they start getting reports such as mine and others and this is NOT what they wanted, then by golly someone needs to step up and inform them. Or even if it is an expected result, if some of the volunteers think this is too extreme, maybe the developers want to know and make adjustments. I don't expect a noob, generalized reply when I've made it perfectly clear what the standards were at certain points in this project and that those standards have changed. All I really want is a reasonable answer. Gary suggested in a detailed explanation that it was related to the "Locality" feature. Okay, I'll buy that, if that's really what's going on here. In that case, is the expected behavior that someone gets a WU first, finishes it and reports it back to Einstein, and then 3 to 5, maybe 7 days later it is sent to the second host for validation? Is that the expected result? If not, then it needs to be fixed. Somehow I think that isn't really what was expected; in that case the "locality" settings need to be adjusted.
Whatever it is needs to be looked at, and then let us know either that this is what we should be seeing, or "Oops, we need to adjust some parameters." Thanks for pointing it out... Just my 2 cents worth.