Scheduler went nuts

SciManStev
Joined: 27 Aug 05
Posts: 154
Credit: 15562799
RAC: 0

I will filter out the garbage

I will filter out the garbage when I get home this evening and see what is left. I may have had those flags set, as I went through and added every flag that I thought might be helpful. The output could have filled the Library of Congress. :) One thing different this time is that after the initial filling with CPU work, a lot of GPU work came in, so when I ran the test last night, there was plenty of GPU work on board. It may have responded more normally. After I ran the test, I aborted enough work to get out of high priority mode, as I was getting tired of it jumping from one unfinished work unit to another. So now the condition no longer exists. I could abort most of my GPU work, but from what I've read there was a shortage of GPU work when this occurred, and now there is much more available. I wish I had had the foresight to capture the data when it first occurred, but I was at work and didn't realize there had been a problem until much later. Work tends to get in the way of a lot of things....

Steve

Crunching as member of The GPU Users Group team.

tolafoph
Joined: 14 Sep 07
Posts: 122
Credit: 74659937
RAC: 0

now it started on my machine

Now it has started on my machine too.

Quote:
12/3/2010 7:29:33 PM Einstein@Home Sending scheduler request: To fetch work.
12/3/2010 7:29:33 PM Einstein@Home Requesting new tasks for CPU and GPU
12/3/2010 7:29:33 PM Einstein@Home [sched_op_debug] CPU work request: 16150.94 seconds; 0.00 CPUs
12/3/2010 7:29:33 PM Einstein@Home [sched_op_debug] NVIDIA GPU work request: 16150.94 seconds; 0.00 GPUs
12/3/2010 7:29:37 PM Einstein@Home Scheduler request completed: got 1 new tasks
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] Server version 611
12/3/2010 7:29:37 PM Einstein@Home Project requested delay of 60 seconds
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] estimated total CPU job duration: 44605 seconds
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] Deferring communication for 1 min 0 sec
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] Reason: requested by project
12/3/2010 7:30:37 PM Einstein@Home [sched_op_debug] Starting scheduler request
12/3/2010 7:30:38 PM Einstein@Home Sending scheduler request: To fetch work.
12/3/2010 7:30:38 PM Einstein@Home Requesting new tasks for CPU and GPU
12/3/2010 7:30:38 PM Einstein@Home [sched_op_debug] CPU work request: 16155.52 seconds; 0.00 CPUs
12/3/2010 7:30:38 PM Einstein@Home [sched_op_debug] NVIDIA GPU work request: 16155.52 seconds; 0.00 GPUs
12/3/2010 7:30:44 PM Einstein@Home Scheduler request completed: got 1 new tasks
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] Server version 611
12/3/2010 7:30:44 PM Einstein@Home Project requested delay of 60 seconds
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] estimated total CPU job duration: 44605 seconds
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] Deferring communication for 1 min 0 sec
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] Reason: requested by project
Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

That's what I

That's what I suspect:

Quote:
12/3/2010 7:29:33 PM Einstein@Home Sending scheduler request: To fetch work.
12/3/2010 7:29:33 PM Einstein@Home Requesting new tasks for CPU and GPU
12/3/2010 7:29:33 PM Einstein@Home [sched_op_debug] CPU work request: 16150.94 seconds; 0.00 CPUs
12/3/2010 7:29:33 PM Einstein@Home [sched_op_debug] NVIDIA GPU work request: 16150.94 seconds; 0.00 GPUs
12/3/2010 7:29:37 PM Einstein@Home Scheduler request completed: got 1 new tasks
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] Server version 611
12/3/2010 7:29:37 PM Einstein@Home Project requested delay of 60 seconds
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] estimated total CPU job duration: 44605 seconds
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] Deferring communication for 1 min 0 sec
12/3/2010 7:29:37 PM Einstein@Home [sched_op_debug] Reason: requested by project
12/3/2010 7:30:37 PM Einstein@Home [sched_op_debug] Starting scheduler request
12/3/2010 7:30:38 PM Einstein@Home Sending scheduler request: To fetch work.
12/3/2010 7:30:38 PM Einstein@Home Requesting new tasks for CPU and GPU
12/3/2010 7:30:38 PM Einstein@Home [sched_op_debug] CPU work request: 16155.52 seconds; 0.00 CPUs
12/3/2010 7:30:38 PM Einstein@Home [sched_op_debug] NVIDIA GPU work request: 16155.52 seconds; 0.00 GPUs
12/3/2010 7:30:44 PM Einstein@Home Scheduler request completed: got 1 new tasks
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] Server version 611
12/3/2010 7:30:44 PM Einstein@Home Project requested delay of 60 seconds
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] estimated total CPU job duration: 44605 seconds
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] Deferring communication for 1 min 0 sec
12/3/2010 7:30:44 PM Einstein@Home [sched_op_debug] Reason: requested by project

Since Einstein tasks occupy one CPU core with each GPU, both types are requested for each GPU task.

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 729882930
RAC: 1191712

Hi! But it's not doing

Hi!

But it's not doing that every time; that host is also sending GPU-only requests:

Quote:

2010-12-04 14:43:37.8436 [PID=11479] Request: [USER#xxxxx] [HOST#2087705] [IP xxx.xxx.xxx.22] client 6.10.58
2010-12-04 14:43:37.8476 [PID=11479] [send] effective_ncpus 2 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-04 14:43:37.8476 [PID=11479] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2010-12-04 14:43:37.8476 [PID=11479] [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-04 14:43:37.8476 [PID=11479] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2010-12-04 14:43:37.8476 [PID=11479] [send] CUDA: req 108000.86 sec, 1.00 instances; est delay 0.00
2010-12-04 14:43:37.8476 [PID=11479] [send] work_req_seconds: 0.00 secs
2010-12-04 14:43:37.8476 [PID=11479] [send] available disk 9.12 GB, work_buf_min 0
2010-12-04 14:43:37.8477 [PID=11479] [send] active_frac 0.997061 on_frac 0.616521 DCF 2.530427
2010-12-04 14:43:37.8511 [PID=11479] [send] [HOST#2087705] is reliable
2010-12-04 14:43:37.8512 [PID=11479] [send] set_trust: random choice for error rate 0.000010: yes
2010-12-04 14:43:38.0382 [PID=11479] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF ()
2010-12-04 14:43:38.0383 [PID=11479] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE)
2010-12-04 14:43:38.0383 [PID=11479] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE2)
2010-12-04 14:43:38.0383 [PID=11479] [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#2 (windows_intelx86) min_version 0
2010-12-04 14:43:38.0489 [PID=11479] [debug] [HOST#2087705] MSG(high) No work sent
2010-12-04 14:43:38.0490 [PID=11479] Sending reply to [HOST#2087705]: 0 results, delay req 60.00
2010-12-04 14:43:38.0492 [PID=11479] Scheduler ran 0.212 seconds

I wonder how this problem correlates with BOINC client versions.

CU
HB

Erik Rausch
Joined: 19 Nov 10
Posts: 4
Credit: 66042
RAC: 0

I'm also having this problem.

I'm also having this problem. I've just aborted around 200 tasks and set E@H to get no new ones. I hope this is resolved soon!
Erik

Metod, S56RKO
Joined: 11 Feb 05
Posts: 135
Credit: 826520559
RAC: 84935

RE: Since Einstein tasks

Quote:
Since Einstein tasks occupy one CPU core with each GPU, both types are requested for each GPU task.

Why should the BOINC client care about CPU and GPU requirements before it is assigned a task that requires a combination of both? If the CPU cache is full and the GPU cache is empty, then BOINC CC should say so in the scheduler request. It's up to the scheduler to figure things out (if the CPU cache is full, it shouldn't assign GPU tasks that also require a non-trivial amount of CPU).

I'd say that there's a bug in BOINC CC.

Metod ...

tolafoph
Joined: 14 Sep 07
Posts: 122
Credit: 74659937
RAC: 0

RE: Hi! But it's not doing

Quote:

Hi!

But it's not doing that every time; that host is also sending GPU-only requests:

Quote:

2010-12-04 14:43:37.8436 [PID=11479] Request: [USER#xxxxx] [HOST#2087705] [IP xxx.xxx.xxx.22] client 6.10.58
2010-12-04 14:43:37.8476 [PID=11479] [send] effective_ncpus 2 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-04 14:43:37.8476 [PID=11479] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2010-12-04 14:43:37.8476 [PID=11479] [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-04 14:43:37.8476 [PID=11479] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2010-12-04 14:43:37.8476 [PID=11479] [send] CUDA: req 108000.86 sec, 1.00 instances; est delay 0.00
2010-12-04 14:43:37.8476 [PID=11479] [send] work_req_seconds: 0.00 secs
2010-12-04 14:43:37.8476 [PID=11479] [send] available disk 9.12 GB, work_buf_min 0
2010-12-04 14:43:37.8477 [PID=11479] [send] active_frac 0.997061 on_frac 0.616521 DCF 2.530427
2010-12-04 14:43:37.8511 [PID=11479] [send] [HOST#2087705] is reliable
2010-12-04 14:43:37.8512 [PID=11479] [send] set_trust: random choice for error rate 0.000010: yes
2010-12-04 14:43:38.0382 [PID=11479] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF ()
2010-12-04 14:43:38.0383 [PID=11479] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE)
2010-12-04 14:43:38.0383 [PID=11479] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE2)
2010-12-04 14:43:38.0383 [PID=11479] [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#2 (windows_intelx86) min_version 0
2010-12-04 14:43:38.0489 [PID=11479] [debug] [HOST#2087705] MSG(high) No work sent
2010-12-04 14:43:38.0490 [PID=11479] Sending reply to [HOST#2087705]: 0 results, delay req 60.00
2010-12-04 14:43:38.0492 [PID=11479] Scheduler ran 0.212 seconds

I wonder how this problem correlates with BOINC client versions.

CU
HB

Hi,

Here are the logs for the GPU-only request.

Quote:
12/4/2010 3:10:11 PM Einstein@Home [sched_op_debug] Starting scheduler request
12/4/2010 3:10:11 PM Einstein@Home Sending scheduler request: To fetch work.
12/4/2010 3:10:11 PM Einstein@Home Reporting 2 completed tasks, requesting new tasks for GPU
12/4/2010 3:10:11 PM Einstein@Home [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs
12/4/2010 3:10:11 PM Einstein@Home [sched_op_debug] NVIDIA GPU work request: 108000.86 seconds; 1.00 GPUs
12/4/2010 3:10:15 PM Einstein@Home Scheduler request completed: got 0 new tasks
12/4/2010 3:10:15 PM Einstein@Home [sched_op_debug] Server version 611
12/4/2010 3:10:15 PM Einstein@Home Message from server: No work sent
12/4/2010 3:10:15 PM Einstein@Home Project requested delay of 60 seconds
12/4/2010 3:10:15 PM Einstein@Home [sched_op_debug] handle_scheduler_reply(): got ack for result h1_1379.20_S5R4__1234_S5GC1HFa_0
12/4/2010 3:10:15 PM Einstein@Home [sched_op_debug] handle_scheduler_reply(): got ack for result h1_1379.20_S5R4__1233_S5GC1HFa_0
12/4/2010 3:10:15 PM Einstein@Home [sched_op_debug] Deferring communication for 1 min 0 sec
12/4/2010 3:10:15 PM Einstein@Home [sched_op_debug] Reason: requested by project
12/4/2010 3:43:27 PM Einstein@Home [sched_op_debug] Starting scheduler request
12/4/2010 3:43:27 PM Einstein@Home Sending scheduler request: To fetch work.
12/4/2010 3:43:27 PM Einstein@Home Requesting new tasks for GPU
12/4/2010 3:43:27 PM Einstein@Home [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs
12/4/2010 3:43:27 PM Einstein@Home [sched_op_debug] NVIDIA GPU work request: 108000.86 seconds; 1.00 GPUs
12/4/2010 3:43:30 PM Einstein@Home Scheduler request completed: got 0 new tasks
12/4/2010 3:43:30 PM Einstein@Home [sched_op_debug] Server version 611
12/4/2010 3:43:30 PM Einstein@Home Message from server: No work sent
12/4/2010 3:43:30 PM Einstein@Home Project requested delay of 60 seconds
12/4/2010 3:43:30 PM Einstein@Home [sched_op_debug] Deferring communication for 1 min 0 sec
12/4/2010 3:43:30 PM Einstein@Home [sched_op_debug] Reason: requested by project
Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: RE: Since Einstein

Quote:
Quote:
Since Einstein tasks occupy one CPU core with each GPU, both types are requested for each GPU task.

Why should the BOINC client care about CPU and GPU requirements before it is assigned a task that requires a combination of both? If the CPU cache is full and the GPU cache is empty, then BOINC CC should say so in the scheduler request. It's up to the scheduler to figure things out (if the CPU cache is full, it shouldn't assign GPU tasks that also require a non-trivial amount of CPU).


You've missed my first sentence:

Quote:
That's what I suspect:

but meanwhile, I think that my suspicion was wrong.

Quote:
I'd say that there's a bug in BOINC CC.


That's what I wanted to say, too. ;-)

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

Mr. Kevvy
Joined: 11 Nov 04
Posts: 87
Credit: 11917606841
RAC: 5389436

Same issue. One of my boxes

Same issue. One of my boxes was set to a one-day cache, but has downloaded about 20 days' worth of work. Unfortunately when SETI@Home kicks in I'll be aborting the vast majority of it. Didn't want to do this, thus the one-day cache!

As has been noted, it also has a compatible CUDA GPU and wasn't downloading any GPU work before it started doing this. I kept getting "No work sent" responses. They were highlighted in red like errors. The logs don't seem to have anything relevant in them as they were probably overwritten, but I didn't know how to retrieve them.

http://einstein.phys.uwm.edu/host_sched_logs/3675/3675852

2010-12-05 11:32:15.5365 [PID=27374]   Request: [USER#xxxxx] [HOST#3675852] [IP xxx.xxx.xxx.81] client 6.10.58
2010-12-05 11:32:15.5417 [PID=27374]    [send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-05 11:32:15.5417 [PID=27374]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2010-12-05 11:32:15.5417 [PID=27374]    [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-05 11:32:15.5418 [PID=27374]    [send] CPU: req 1647.99 sec, 0.00 instances; est delay 866794.08
2010-12-05 11:32:15.5418 [PID=27374]    [send] CUDA: req 1647.99 sec, 0.00 instances; est delay 0.00
2010-12-05 11:32:15.5418 [PID=27374]    [send] work_req_seconds: 1647.99 secs
2010-12-05 11:32:15.5418 [PID=27374]    [send] available disk 99.57 GB, work_buf_min 86400
2010-12-05 11:32:15.5418 [PID=27374]    [send] active_frac 0.999926 on_frac 0.998376 DCF 1.652355
2010-12-05 11:32:15.5707 [PID=27374]    [send] [HOST#3675852] is reliable
2010-12-05 11:32:15.5707 [PID=27374]    [send] set_trust: random choice for error rate 0.003561: yes
2010-12-05 11:32:15.7477 [PID=27374]    [version] Best version of app einstein_S5GC1HF is ID 231 (3.84 GFLOPS)
2010-12-05 11:32:15.7477 [PID=27374]    [send] est. duration for WU 89026395: unscaled 16893.24 scaled 27961.10
2010-12-05 11:32:15.7477 [PID=27374]    [send] [WU#89026395] meets deadline: 866794.08 + 27961.10 < 1209600
2010-12-05 11:32:15.7489 [PID=27374] [debug]   Sorted list of URLs follows [host timezone: UTC-18000]
2010-12-05 11:32:15.7489 [PID=27374] [debug]   zone=-21600 url=http://einstein-dl4.phys.uwm.edu
2010-12-05 11:32:15.7489 [PID=27374] [debug]   zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2010-12-05 11:32:15.7489 [PID=27374] [debug]   zone=-21600 url=http://einstein-dl3.phys.uwm.edu
2010-12-05 11:32:15.7489 [PID=27374] [debug]   zone=-28800 url=http://einstein.ligo.caltech.edu
2010-12-05 11:32:15.7489 [PID=27374] [debug]   zone=+03600 url=http://einstein-mirror.aei.uni-hannover.de/EatH
2010-12-05 11:32:15.7489 [PID=27374] [debug]   zone=+03600 url=http://einstein.aei.mpg.de
2010-12-05 11:32:15.7491 [PID=27374]    [send] [HOST#3675852] Sending app_version einstein_S5GC1HF 2 304 S5GCESSE2; 3.84 GFLOPS
2010-12-05 11:32:15.7503 [PID=27374]    [send] est. duration for WU 89026395: unscaled 16893.24 scaled 27961.10
2010-12-05 11:32:15.7503 [PID=27374]    [HOST#3675852] Sending [RESULT#209962012 h1_1349.45_S5R4__966_S5GC1HFa_0] (est. dur. 27961.10 seconds)
2010-12-05 11:32:17.7716 [PID=27374]    [version] have CPU version but no more CPU work needed
2010-12-05 11:32:17.7716 [PID=27374]    [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF ()
2010-12-05 11:32:17.7716 [PID=27374]    [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE)
2010-12-05 11:32:17.7717 [PID=27374]    [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE2)
2010-12-05 11:32:17.7717 [PID=27374]    [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#2 (windows_intelx86) min_version 0
2010-12-05 11:32:17.7818 [PID=27374]    Sending reply to [HOST#3675852]: 1 results, delay req 60.00
2010-12-05 11:32:17.7822 [PID=27374]    Scheduler ran 2.252 seconds
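
For anyone checking the numbers, the deadline test in the log above can be reproduced with a few lines of Python. This is only a sketch: the scaling formula (unscaled estimate times DCF, divided by on_frac times active_frac) is an assumption inferred from the logged values, not taken from the scheduler source, and the function names are made up for illustration.

```python
def scaled_duration(unscaled, dcf, on_frac, active_frac):
    # Assumed formula: inferred from the log values, not from BOINC source.
    return unscaled * dcf / (on_frac * active_frac)


def meets_deadline(est_delay, duration, delay_bound):
    # Mirrors the logged check "est_delay + duration < delay_bound".
    return est_delay + duration < delay_bound


# Values copied from the scheduler log for HOST#3675852.
est = scaled_duration(
    unscaled=16893.24, dcf=1.652355, on_frac=0.998376, active_frac=0.999926
)
ok = meets_deadline(est_delay=866794.08, duration=est, delay_bound=1209600)

print(f"scaled estimate: {est:.2f} s (log says 27961.10)")
print(f"meets deadline: {ok}")
```

Note that the estimated delay of 866794.08 seconds is already about ten days, so the task only fits because the delay bound is 1209600 seconds (14 days).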
Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: The logs don't seem to

Quote:
The logs don't seem to have anything relevant in them as they were probably overwritten, but I didn't know how to retrieve them.


The log is kept in the BOINC data directory in the files stdoutdae.txt and stdoutdae.old.
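
To pull just the work-request lines out of such a log, something like the following Python sketch might help. It assumes the line format shown in the client logs quoted in this thread; the regular expression and the function names are illustrative and not part of BOINC.

```python
import re

# Matches client log lines such as:
# "... [sched_op_debug] NVIDIA GPU work request: 16150.94 seconds; 0.00 GPUs"
WORK_REQ = re.compile(
    r"\[sched_op_debug\] (?P<resource>.+?) work request: "
    r"(?P<seconds>[\d.]+) seconds; (?P<instances>[\d.]+) (?:CPUs|GPUs)"
)


def work_requests(lines):
    """Yield (resource, seconds, instances) for each work-request line."""
    for line in lines:
        m = WORK_REQ.search(line)
        if m:
            yield m["resource"], float(m["seconds"]), float(m["instances"])


def read_log(path):
    """Parse a log file, e.g. stdoutdae.txt in the BOINC data directory."""
    with open(path, errors="replace") as f:
        return list(work_requests(f))


# Lines copied from the client log earlier in the thread.
sample = [
    "12/3/2010 7:29:33 PM Einstein@Home Requesting new tasks for CPU and GPU",
    "12/3/2010 7:29:33 PM Einstein@Home [sched_op_debug] CPU work request: 16150.94 seconds; 0.00 CPUs",
    "12/3/2010 7:29:33 PM Einstein@Home [sched_op_debug] NVIDIA GPU work request: 16150.94 seconds; 0.00 GPUs",
]
requests = list(work_requests(sample))
for resource, seconds, instances in requests:
    print(resource, seconds, instances)
```

Everything that does not match the pattern is skipped, so it also works on a log with many other debug flags enabled.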

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)
