Thank you for the excellent
Thank you for the excellent reply. I haven't checked it this morning, as I'm at work. Previously I had set my cache for one day, and of course all of a sudden it filled up to whatever the Einstein limit is. I set it to No New Tasks, and by my reckoning it should complete everything in a couple of days. The screenshot I posted was just a small block of what is there; many of my work units are more than half crunched and will clear quickly. I think the earliest deadline I saw was December 9th, which is easy. If I have to abort a few, then I will. When I enabled hyperthreading, it added about an hour and 15 minutes to each crunch time, but it more than made up for that with the number of WUs crunched.
Steve
Crunching as member of The GPU Users Group team.
RE: Thank you for the
When you have a moment in front of the screen, would you mind giving it another try with work fetch enabled, and one or other level of debug logging on the client? If the cycle of 'request NVidia, receive S5GC1HF' starts again, try and grab a matching server log (link in right-most column of your host summary list) for comparison.
This cross-allocation shouldn't be happening, but we need evidence to nip it in the bud.
RE: RE: Thank you for the
Will do Richard, once I get home this evening.
Steve
Crunching as member of The GPU Users Group team.
Hi, I don't know whether
Hi,
I don't know whether this is a project problem or a BOINC one, or perhaps a combination of both - either way it's a problem and I'd welcome a fix or suggestions about how to prevent it. Einstein@home over the past three days has begun to send me great rafts of WUs, each with an estimated time of 11+ hours, and most due on a short completion schedule. My settings (both local and on the project) are to keep two extra days of work, but the project has decided I need up to 120 extra days of work.
Of course this has led to "high priority" situations and left all my other projects unable to receive a share of my CPU resources. I've taken to aborting all but the few Einstein tasks I know I'll be able to complete, and now I've set the project to accept no new tasks.
Any suggestions?
RE: What I was interested
Quote:
What I was interested in catching was one of those 'new task every minute' events as visible in 2500292's task list - those are the ones which I suspect to be CUDA requests.
His latest log (at the time I caught it) was a work request on that host for his CUDA:
2010-12-02 19:42:37.3358 [PID=6883] Request: [USER#xxxxx] [HOST#2500292] [IP xxx.xxx.xxx.111] client 6.10.58
2010-12-02 19:42:37.3391 [PID=6883 ] [send] effective_ncpus 16 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-02 19:42:37.3391 [PID=6883 ] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2010-12-02 19:42:37.3391 [PID=6883 ] [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-02 19:42:37.3391 [PID=6883 ] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2010-12-02 19:42:37.3391 [PID=6883 ] [send] CUDA: req 21600.86 sec, 1.00 instances; est delay 0.00
2010-12-02 19:42:37.3392 [PID=6883 ] [send] work_req_seconds: 0.00 secs
2010-12-02 19:42:37.3392 [PID=6883 ] [send] available disk 98.80 GB, work_buf_min 0
2010-12-02 19:42:37.3392 [PID=6883 ] [send] active_frac 0.999941 on_frac 0.999675 DCF 1.266993
2010-12-02 19:42:37.3710 [PID=6883 ] [send] [HOST#2500292] is reliable
2010-12-02 19:42:37.3712 [PID=6883 ] [send] set_trust: random choice for error rate 0.000010: yes
2010-12-02 19:42:37.5376 [PID=6883 ] [version] Don't need CPU jobs, skipping version 504 for einstein_S5GC1HF ()
2010-12-02 19:42:37.5377 [PID=6883 ] [version] Don't need CPU jobs, skipping version 704 for einstein_S5GC1HF ()
2010-12-02 19:42:37.5377 [PID=6883 ] [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#6 (i686-apple-darwin) min_version 0
2010-12-02 19:42:37.5377 [PID=6883 ] [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#3 (powerpc-apple-darwin) min_version 0
2010-12-02 19:42:37.5911 [PID=6883 ] [debug] [HOST#2500292] MSG(high) No work sent
2010-12-02 19:42:37.5911 [PID=6883 ] Sending reply to [HOST#2500292]: 0 results, delay req 60.00
2010-12-02 19:42:37.5914 [PID=6883 ] Scheduler ran 0.447 seconds
Looks normal and correct.
Edit: Although the interesting part would be why we have the following sequence:
Quote:
2010-12-02 19:42:37.5376 [PID=6883 ] [version] Don't need CPU jobs, skipping version 504 for einstein_S5GC1HF ()
2010-12-02 19:42:37.5377 [PID=6883 ] [version] Don't need CPU jobs, skipping version 704 for einstein_S5GC1HF ()
That's 5.04 for Mac OS X on Intel and 7.04 for Mac OS X on PPC.
Why the need to check for both? An Apple with an Intel CPU can't run the PPC application, and an Apple with a PPC CPU can't run the Intel application - otherwise the project wouldn't need to make an app for both processors. Or am I seeing that wrong?
hit me as well. all CPU
Hit me as well - all CPU work, two pages of it, 5 hours each. I had a 1-day cache set, and it filled up until I hit a project limit. With 3 cores crunching, I expect to get about half of it done before the deadline hits on 12/13.
Richard, I tried adding
Richard, I tried adding several debug flags, and ended up with a mess of messages. I did capture the server log, which I'll post. If you can tell me which debug flags to use, I will delete the rest of the garbage, and post the results I have saved.
2010-12-02 23:18:46.0145 [PID=4073] Request: [USER#xxxxx] [HOST#2924241] [IP xxx.xxx.xxx.34] client 6.10.58
2010-12-02 23:18:46.0185 [PID=4073 ] [send] effective_ncpus 12 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-02 23:18:46.0185 [PID=4073 ] [send] effective_ngpus 2 max_jobs_on_host_gpu 999999
2010-12-02 23:18:46.0185 [PID=4073 ] [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-02 23:18:46.0185 [PID=4073 ] [send] CPU: req 2725.46 sec, 0.00 instances; est delay 540788.19
2010-12-02 23:18:46.0186 [PID=4073 ] [send] CUDA: req 475544.20 sec, 0.00 instances; est delay 0.00
2010-12-02 23:18:46.0186 [PID=4073 ] [send] work_req_seconds: 2725.46 secs
2010-12-02 23:18:46.0186 [PID=4073 ] [send] available disk 97.30 GB, work_buf_min 0
2010-12-02 23:18:46.0186 [PID=4073 ] [send] active_frac 0.998591 on_frac 0.972763 DCF 1.418769
2010-12-02 23:18:46.1456 [PID=4073 ] [send] [HOST#2924241] is reliable
2010-12-02 23:18:46.1457 [PID=4073 ] [send] set_trust: random choice for error rate 0.007381: yes
2010-12-02 23:18:46.3463 [PID=4073 ] [version] Best version of app einstein_S5GC1HF is ID 231 (6.00 GFLOPS)
2010-12-02 23:18:46.3463 [PID=4073 ] [send] est. duration for WU 88715847: unscaled 10789.98 scaled 15759.32
2010-12-02 23:18:46.3464 [PID=4073 ] [send] [WU#88715847] meets deadline: 540788.19 + 15759.32 < 1209600
2010-12-02 23:18:46.3471 [PID=4073 ] [debug] Sorted list of URLs follows [host timezone: UTC-18000]
2010-12-02 23:18:46.3471 [PID=4073 ] [debug] zone=-21600 url=http://einstein-dl4.phys.uwm.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=-21600 url=http://einstein-dl3.phys.uwm.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=-28800 url=http://einstein.ligo.caltech.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=+03600 url=http://einstein.aei.mpg.de
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=+03600 url=http://einstein-mirror.aei.uni-hannover.de/EatH
2010-12-02 23:18:46.3474 [PID=4073 ] [send] [HOST#2924241] Sending app_version einstein_S5GC1HF 2 304 S5GCESSE2; 6.00 GFLOPS
2010-12-02 23:18:46.3485 [PID=4073 ] [send] est. duration for WU 88715847: unscaled 10789.98 scaled 15759.32
2010-12-02 23:18:46.3485 [PID=4073 ] [HOST#2924241] Sending [RESULT#209494783 h1_1339.00_S5R4__1085_S5GC1HFa_2] (est. dur. 15759.32 seconds)
2010-12-02 23:18:50.5032 [PID=4073 ] [version] have CPU version but no more CPU work needed
2010-12-02 23:18:50.5032 [PID=4073 ] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF ()
2010-12-02 23:18:50.5032 [PID=4073 ] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE)
2010-12-02 23:18:50.5032 [PID=4073 ] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE2)
2010-12-02 23:18:50.5033 [PID=4073 ] [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#2 (windows_intelx86) min_version 0
2010-12-02 23:18:50.5911 [PID=4073 ] Sending reply to [HOST#2924241]: 1 results, delay req 60.00
2010-12-02 23:18:50.5915 [PID=4073 ] Scheduler ran 4.584 seconds
Steve
Crunching as member of The GPU Users Group team.
RE: Any suggestions? I
I don't have an NVIDIA GPU so I have no direct experience and can't experiment. It seems to me that there's not much point in asking for ABP2 work when there is no new work available so I'd change my preferences, either to deselect ABP2 crunching or to stop asking for work for the GPU. If your client isn't continually trying to get the non-existent GPU tasks, it shouldn't continue to get the excess GW tasks (hopefully) :-). Once there are new binary pulsar tasks (I'm sure there will be a big announcement) you could easily reinstate your desired preference settings.
The other way you could handle it would be to disable the use of your GPU using BOINC's client configuration options in a self-constructed cc_config.xml file. Since that would disable the GPU for all projects, you wouldn't want to do that if you wished to use your GPU for any other project.
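For anyone who wants to go that route, here is a minimal sketch of such a cc_config.xml, assuming the <no_gpus> option supported by the 6.10.x clients seen in the logs above. Put it in the BOINC data directory and restart the client (or use the manager's option to re-read the config file):

<cc_config>
  <options>
    <!-- hide all GPUs from the BOINC client; note that this affects every attached project -->
    <no_gpus>1</no_gpus>
  </options>
</cc_config>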
Cheers,
Gary.
Hi Steve, thanks very much
Hi Steve, thanks very much for posting the server log snippet. Richard and Jord will be much more competent than I am in the interpretation but I thought I'd have a go and make a few comments anyway.
Your client seems to be requesting both CPU work (a small amount) and GPU work (quite a lot) so that seems to answer Richard's point about whether or not the server was wrongly cross-allocating CPU work when it couldn't supply GPU work. It would seem the server is just responding to a request coming from the client that the client really shouldn't be making.
If the client is going to continue making such a CPU request every time it also makes an unsatisfied GPU request, it's not surprising that people are getting way too many CPU tasks.
The log is quite informative about the decision-making process for selecting a CPU task to fill that request, but (apart from acknowledging that a substantial GPU request was made) there seems to be no comment about the GPU - not even a 'no work available' type of comment. That is a bit surprising. Maybe there are extra flags that need to be set to make the scheduler more 'chatty' about its decision-making for GPU tasks, or maybe the complexity of the 'locality scheduling' used for selecting GW tasks just needs more output about how and why certain decisions about these CPU tasks were made.
The upshot of all this is that someone with a GPU needs to set client debugging flags to see if we can work out why these extra requests for CPU work are being made. If you set flags relating to work scheduling in the client, you may get an insight as to why the BOINC client is prepared to keep asking for CPU work. To preserve your own sanity, you might like to start with just work_fetch_debug and see what that gives you.
Cheers,
Gary.
RE: To preserve your own
No, better to use flags that only run at the time of making contact and doing the transfers.
That gives less fodder than work_fetch_debug, which runs and outputs every 10 seconds to every minute.
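Going by that description (output only when the client contacts the scheduler or does transfers), the standard cc_config.xml log flags sched_op_debug and http_debug look like the kind of thing meant here - those two names are an assumption, not quoted from the post above. A minimal sketch combining them with the work_fetch_debug flag Gary mentioned:

<cc_config>
  <log_flags>
    <!-- written only when the client contacts a project scheduler -->
    <sched_op_debug>1</sched_op_debug>
    <!-- written only around HTTP requests and transfers -->
    <http_debug>1</http_debug>
    <!-- Gary's noisier suggestion; leave off unless the above isn't enough -->
    <work_fetch_debug>0</work_fetch_debug>
  </log_flags>
</cc_config>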