Way too many opencl-intel GPU tasks for size of request

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 119030411579
RAC: 24687487
Topic 197050

Well, I would not call the scheduler an idiot, but something is definitely odd.
Look here:
https://dl.dropboxusercontent.com/u/50246791/sceduler%20problem%201.PNG

With a min work buffer setting of 0.08 days and an additional work buffer of 0.08 days, I got ~60 (!) opencl-intel_gpu WUs.

Since other projects run as usual, there might be a problem on the Einstein side.

Can someone please check that?

Cheers

Alex

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 3001181963
RAC: 696889

Fortunately, you posted while the scheduler log http://einstein.phys.uwm.edu/host_sched_logs/7703/7703014 was still current. Preserving for inspection:

Quote:
2013-07-10 08:43:58.7476 [PID=15878] Request: [USER#xxxxx] [HOST#7703014] [IP xxx.xxx.xxx.234] client 7.1.17
...
2013-07-10 08:43:58.7558 [PID=15878] [send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2013-07-10 08:43:58.7558 [PID=15878] [send] effective_ngpus 2 max_jobs_on_host_gpu 999999
2013-07-10 08:43:58.7558 [PID=15878] [send] Not using matchmaker scheduling; Not using EDF sim
2013-07-10 08:43:58.7558 [PID=15878] [send] CPU: req 6778.90 sec, 0.00 instances; est delay 0.00
2013-07-10 08:43:58.7559 [PID=15878] [send] ATI: req 0.00 sec, 0.00 instances; est delay 0.00
2013-07-10 08:43:58.7559 [PID=15878] [send] Intel GPU: req 13824.00 sec, 1.00 instances; est delay 0.00
2013-07-10 08:43:58.7559 [PID=15878] [send] work_req_seconds: 6778.90 secs
2013-07-10 08:43:58.7559 [PID=15878] [send] available disk 55.22 GB, work_buf_min 6912
2013-07-10 08:43:58.7559 [PID=15878] [send] active_frac 0.999586 on_frac 0.911716 DCF 1.915984


The Intel GPU work request is for 13824 seconds (0.16 days), which matches your buffer settings, so that looks right.
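For anyone who wants to check the numbers, here is a minimal sketch of the conversion. It assumes only that the two buffer settings from Alex's post are given in days and that the request is their sum; the variable names are mine, not BOINC's.

```python
SECONDS_PER_DAY = 86400

# Values taken from Alex's post (in days); names are illustrative only.
min_buffer_days = 0.08      # "min work buffer"
extra_buffer_days = 0.08    # "additional work buffer"

# The minimum buffer alone, in seconds (compare work_buf_min in the log).
work_buf_min_secs = min_buffer_days * SECONDS_PER_DAY

# Minimum plus additional buffer, in seconds (compare the Intel GPU req).
requested_secs = (min_buffer_days + extra_buffer_days) * SECONDS_PER_DAY

print(round(work_buf_min_secs))  # 6912
print(round(requested_secs))     # 13824
```

Both figures agree with work_buf_min 6912 and the Intel GPU req of 13824.00 sec in the log above.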

Quote:
2013-07-10 08:43:58.7729 [PID=15878] [send] [HOST#7703014] Sending app_version 452 einsteinbinary_BRP4 9 134 opencl-intel_gpu; 22.44 GFLOPS
2013-07-10 08:43:58.7752 [PID=15878] [send] est. duration for WU 168874320: unscaled 779.93 scaled 1639.71
2013-07-10 08:43:58.7752 [PID=15878] [HOST#7703014] Sending [RESULT#389054506 p2030.20130409.G36.27-02.04.C.b4s0g0.00000_3295_1] (est. dur. 1639.71 seconds)
2013-07-10 08:43:58.7760 [PID=15878] [send] est. duration for WU 168875978: unscaled 779.93 scaled 1639.71
2013-07-10 08:43:58.7760 [PID=15878] [send] [WU#168875978] meets deadline: 1639.71 + 1639.71 < 1209600
2013-07-10 08:43:58.7769 [PID=15878] [send] [HOST#7703014] Sending app_version 452 einsteinbinary_BRP4 9 134 opencl-intel_gpu; 22.44 GFLOPS
2013-07-10 08:43:58.7784 [PID=15878] [send] est. duration for WU 168875978: unscaled 779.93 scaled 1639.71
2013-07-10 08:43:58.7784 [PID=15878] [HOST#7703014] Sending [RESULT#389058253 p2030.20130409.G36.27-02.04.C.b1s0g0.00000_316_0] (est. dur. 1639.71 seconds)
2013-07-10 08:43:58.7796 [PID=15878] [send] est. duration for WU 168877345: unscaled 779.93 scaled 1639.71
2013-07-10 08:43:58.7797 [PID=15878] [send] [WU#168877345] meets deadline: 3279.43 + 1639.71 < 1209600
...
2013-07-10 08:43:58.9912 [PID=15878] [HOST#7703014] Sending [RESULT#389062500 p2030.20130409.G36.27-02.04.C.b1s0g0.00000_1351_1] (est. dur. 1639.71 seconds)
2013-07-10 08:43:58.9924 [PID=15878] [send] est. duration for WU 168876492: unscaled 779.93 scaled 1639.71
2013-07-10 08:43:58.9924 [PID=15878] [send] [WU#168876492] meets deadline: 98382.83 + 1639.71 < 1209600
2013-07-10 08:43:58.9933 [PID=15878] [send] [HOST#7703014] Sending app_version 452 einsteinbinary_BRP4 9 134 opencl-intel_gpu; 22.44 GFLOPS
2013-07-10 08:43:58.9947 [PID=15878] [send] est. duration for WU 168876492: unscaled 779.93 scaled 1639.71
2013-07-10 08:43:58.9947 [PID=15878] [HOST#7703014] Sending [RESULT#389059301 p2030.20130409.G36.27-02.04.C.b0s0g0.00000_574_1] (est. dur. 1639.71 seconds)
2013-07-10 08:44:01.0562 [PID=15878] Sending reply to [HOST#7703014]: 61 results, delay req 60.00


But 61 tasks at 1639.71 seconds each add up to about 100,022 seconds, or 1.16 days - a day more than you asked for.
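The arithmetic, as a quick sketch using only the figures from the log (61 results, 1639.71 seconds scaled estimate each, 13824 seconds requested); the variable names are just for illustration:

```python
SECONDS_PER_DAY = 86400

tasks_sent = 61            # "61 results" in the reply line
est_duration = 1639.71     # scaled estimate per task, in seconds
requested_secs = 13824.00  # Intel GPU request, in seconds

total_secs = tasks_sent * est_duration
total_days = total_secs / SECONDS_PER_DAY
excess_days = (total_secs - requested_secs) / SECONDS_PER_DAY

print(f"{total_secs:.2f} s = {total_days:.2f} days")   # 100022.31 s = 1.16 days
print(f"excess over request: {excess_days:.2f} days")  # excess over request: 1.00 days
```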

The question is - why did the server only test against the 1209600 second (14-day) deadline limit, and not against the size of the work request?
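To make the question concrete, here is a rough sketch of the difference between the two tests. This is not the BOINC scheduler code, just a toy loop under my own assumptions: the function `tasks_to_send` and its parameters are hypothetical, and `max_tasks` is only a stand-in for whatever actually stopped the real run at 61, which the log does not show.

```python
def tasks_to_send(request_secs, est_dur, deadline_secs, max_tasks, honor_request):
    """Count tasks a toy scheduler would send (hypothetical model, not BOINC code)."""
    queued = 0.0   # estimated seconds of work already assigned in this reply
    sent = 0
    while sent < max_tasks:
        # Deadline test, mirroring the log's "queued + est < deadline" lines.
        if queued + est_dur >= deadline_secs:
            break
        # Request test: stop once the assigned work covers what was asked for.
        if honor_request and queued >= request_secs:
            break
        queued += est_dur
        sent += 1
    return sent


# Figures from the log; max_tasks=61 is an assumed cap for illustration.
deadline_only = tasks_to_send(13824.0, 1639.71, 1209600, 61, honor_request=False)
with_request = tasks_to_send(13824.0, 1639.71, 1209600, 61, honor_request=True)

print(deadline_only)  # 61
print(with_request)   # 9
```

Under these assumptions, a request of 13824 seconds would be covered by 9 tasks rather than 61.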

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 119030411579
RAC: 24687487

Alex,

I really don't know why you decided to hijack a completely unrelated existing thread when you could have easily started your own ....

So, this one is now for you!!

Cheers,
Gary.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 798066122
RAC: 1204705

Alex & Richard,

Thanks very much for reporting and analyzing this; it is indeed very strange. We are currently looking into the issue.

Cheers
HB

Bernd Machenschalk
Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4343
Credit: 252678742
RAC: 35578

Should be fixed now. Let me know if not by posting here.

BM

