Fermi LAT Gamma-ray pulsar search "FGRP2" - longer tasks

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253472844

RAC: 36319

18 Jun 2013 5:32:52 UTC

Topic 197005

Similar to what we already did with the Radio Pulsar search, we will soon re-tune the runtime of the Fermi LAT Gamma-ray pulsar search tasks. We reduced the runtime of the BRP4 tasks to make them better suited to slower devices; now we will enlarge the runtimes of the FGRP tasks to make these (more) suitable for faster devices (such as GPUs).

The new tasks, which will be sent out later this week, will run about 10x as long as the current ones. Of course the flops estimation, credit etc. will be adjusted accordingly.

Logforme

Joined: 13 Aug 10

Posts: 332

Credit: 1714373961

RAC: 0

Fermi LAT Gamma-ray pulsar search "FGRP2" - longer tasks

18 Jun 2013 7:22:43 UTC

Message 116746

Quote:

now we will enlarge the runtimes of the FGRP tasks to make these (more) suitable for faster devices (such as GPUs).

Does this mean you are working on a GPU version of the FGRP program?
Or, does it mean there already exists a GPU version I don't know of?

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253472844

RAC: 36319

RE: RE: now we will

18 Jun 2013 7:32:59 UTC

Message 116747 in response to message 116746

Quote:

Quote:
now we will enlarge the runtimes of the FGRP tasks to make these (more) suitable for faster devices (such as GPUs).

Does this mean you are working on a GPU version of the FGRP program?
Or, does it mean there already exists a GPU version I don't know of?

This is being tested over at Albert.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253472844

RAC: 36319

The longer-running FGRP2

20 Jun 2013 8:56:14 UTC

Message 116748

The longer-running FGRP2 tasks have been going out since yesterday. Unfortunately the first (~2000) WUs were generated with the old credit setting, which is too low (1/11 of what it should be). If you really care about credit and got FGRP2 work yesterday, you may want to cancel / abort these tasks.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119528648289

RAC: 24957939

RE: The longer-running

20 Jun 2013 16:41:08 UTC

Message 116749 in response to message 116748

Quote:

The longer-running FGRP2 tasks have been going out since yesterday. Unfortunately the first (~2000) WUs were generated with the old credit setting, which is too low (1/11 of what it should be). If you really care about credit and got FGRP2 work yesterday, you may want to cancel / abort these tasks.

Could these have also been sent out with a 'too low' flops estimation? On a very quick check on some tasks that were sent out yesterday, the estimated crunch time is still showing as unchanged from the previous value. From other tasks sent out more recently (a couple of hours ago) the estimated time is little more than double the previous time. If the new tasks will actually be an order of magnitude larger in crunch time, these way too low estimates are really going to screw up work caches on individual hosts. Please let us know urgently so that people who may already have multi-day caches can lower their cache settings if necessary to avoid massive overfetch.

I've set NNT for the moment until I get time to promote some tasks to see what the crunch time actually is.

Cheers,
Gary.
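To put rough numbers on the overfetch concern above, here is a minimal sketch. It assumes a client fills its cache simply by dividing the cache size by the estimated runtime per task; the host figures (4 cores, a 4-day cache, a 3600 s estimate) and both function names are made up for illustration and are not taken from any real host or from the BOINC client.

```python
SECS_PER_DAY = 86400

def tasks_fetched(cache_days, cores, estimated_secs):
    """Tasks a client would hold if it trusts the per-task estimate (simplified)."""
    cache_secs = cache_days * SECS_PER_DAY * cores
    return int(cache_secs // estimated_secs)

def days_of_real_work(n_tasks, actual_secs, cores):
    """How long those tasks really take to crunch on this host."""
    return n_tasks * actual_secs / (cores * SECS_PER_DAY)

# Hypothetical host: the estimate still shows the old value,
# but the tasks actually run 10x as long.
n = tasks_fetched(cache_days=4, cores=4, estimated_secs=3600)
real_days = days_of_real_work(n, actual_secs=36000, cores=4)
print(n, real_days)
```

Under these assumptions a nominal 4-day cache ends up holding ten times that much real work, which is why lowering the cache setting until the estimates are fixed is the safer choice.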

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253472844

RAC: 36319

RE: Could these have also

20 Jun 2013 20:02:56 UTC

Message 116750 in response to message 116749

Quote:

Could these have also been sent out with a 'too low' flops estimation? On a very quick check on some tasks that were sent out yesterday, the estimated crunch time is still showing as unchanged from the previous value. From other tasks sent out more recently (a couple of hours ago) the estimated time is little more than double the previous time. If the new tasks will actually be an order of magnitude larger in crunch time, these way too low estimates are really going to screw up work caches on individual hosts. Please let us know urgently so that people who may already have multi-day caches can lower their cache settings if necessary to avoid massive overfetch.

Good catch!

Indeed it looks like it. This is worse, as it went unnoticed until now. I fixed that, but only for WUs generated from now on.

I'll see what I can do about the WUs already in the DB. Edit: fixed this for the WUs. This, however, will only help for the tasks generated from now on. If you don't want your DCF to go astray, it is probably best to abort the new FGRP2 tasks you got since yesterday.
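Why a badly underestimated task sends the DCF (duration correction factor) astray can be illustrated with a deliberately simplified model. The assumption here is that the factor jumps up at once when a task overruns its estimate and only drifts back down slowly afterwards; the function name and the 0.9/0.1 weights are illustrative, not the BOINC client's actual code.

```python
def update_dcf(dcf, estimated_secs, actual_secs):
    """Simplified DCF update: rise immediately, recover slowly (assumed model)."""
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        return ratio                    # one overrun is enough to raise the factor
    return 0.9 * dcf + 0.1 * ratio      # well-estimated tasks pull it back only gradually

dcf = 1.0
dcf = update_dcf(dcf, estimated_secs=3600, actual_secs=36000)  # a mis-estimated task
after_bad = dcf
dcf = update_dcf(dcf, estimated_secs=3600, actual_secs=3600)   # a correctly estimated task
print(after_bad, dcf)
```

In this model a single mis-estimated task inflates the projected runtime of every other task from the project by the same factor, and many well-estimated tasks are needed before the estimates settle again.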

ROBtheLIONHEART

Joined: 16 Aug 12

Posts: 47

Credit: 58199880

RAC: 0

Thanx for the heads up. I got

20 Jun 2013 23:06:00 UTC

Message 116751 in response to message 116748

Thanx for the heads up. I got a new rig and it's been doing well. Then early this morning I noticed the long run times, checked the size and thought something was wrong on the new rig. Spent hours trying to see what went wrong and of course found nothing. I am very relieved it's not a prob with the new comp!! :) The lesson learned is to always check the boards before panic mode!!

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1960

Credit: 1517729201

RAC: 1747560

RE: Thanx for the heads up.

21 Jun 2013 0:22:29 UTC

Message 116752 in response to message 116751

Quote:

Thanx for the heads up. I got a new rig and it's been doing well. Then early this morning I noticed the long run times, checked the size and thought something was wrong on the new rig. Spent hours trying to see what went wrong and of course found nothing. I am very relieved it's not a prob with the new comp!! :) The lesson learned is to always check the boards before panic mode!!

Well Rob I have a couple of those 650Ti 2GB cards and they aren't my fastest ones but they are doing the new BRP5's over 10,000 seconds faster than yours is.

Maybe most of the problem is that you are still running the BRP4's and GRP's at the same time as the new BRP5's... and of course it also depends on how your Einstein preferences are set.

ROBtheLIONHEART

Joined: 16 Aug 12

Posts: 47

Credit: 58199880

RAC: 0

I was referring to the run

21 Jun 2013 0:53:32 UTC

Message 116753 in response to message 116752

I was referring to the run times on the CPU tasks. They suddenly went way up. I am currently running 4 of the BRP5 on the card at a time (that is 4x?), plus the other CPU tasks. Does that explain the longer run times for the BRP5s? I'm not well versed in the tech. Should I look at adjusting other settings to improve overall production/day? I try to read the boards to learn as I go. I appreciate the help!

In fact one of the new GRPs just errored out. I have a few near completion and will wait to see how they do. If they fail the same way, I'll abort the rest.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119528648289

RAC: 24957939

RE: In fact one of the new

21 Jun 2013 2:36:16 UTC

Message 116754 in response to message 116753

Quote:

In fact one of the new GRPs just errored out. I have a few near completion and will wait to see how they do. If they fail the same way, I'll abort the rest.

The explanation of the 'Max elapsed time exceeded' problem that caused your task to fail has been given in this thread. You may have already seen it.

I notice you had a couple of long running tasks that made it to the finish without quite reaching the limit. It's very annoying for you to see how close the 'error' one must have been to the end when it got terminated. Looks like the limit for you is 90,815 secs.
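As a rough sketch of how such a limit bites: the cutoff is assumed here to be the workunit's floating-point operation bound (the `rsc_fpops_bound` field of a BOINC workunit) divided by the speed the client assumes for the host. The host speed and the bound below are hypothetical values chosen only so that the limit comes out at the 90,815 s mentioned above; the function names are illustrative.

```python
def max_elapsed_secs(fpops_bound, host_flops):
    """Elapsed-time limit implied by the operation bound (assumed relationship)."""
    return fpops_bound / host_flops

def exceeds_limit(actual_secs, fpops_bound, host_flops):
    """True if a task running this long would be stopped before it finishes."""
    return actual_secs > max_elapsed_secs(fpops_bound, host_flops)

HOST_FLOPS = 10**9                  # hypothetical: 1 GFLOPS assumed for the host
FPOPS_BOUND = 90_815 * HOST_FLOPS   # hypothetical bound giving a 90,815 s limit

print(max_elapsed_secs(FPOPS_BOUND, HOST_FLOPS))
print(exceeds_limit(88_000, FPOPS_BOUND, HOST_FLOPS))  # finishes just under the limit
print(exceeds_limit(95_000, FPOPS_BOUND, HOST_FLOPS))  # would be terminated
```

If the bound was derived from an estimate that is roughly 10x too low, tasks of perfectly normal length can land on either side of the cutoff, which matches some long tasks finishing and one failing.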

From looking through your FGRP2 tasks list, I notice that some V1.09 tasks actually crunched quite quickly (at the 'old' rate) so it seems we can't just assume that all tasks branded V1.09 are going to be long running. That's quite unfortunate as it means we can't know exactly when the 'bad' tasks started. I assume we will know when we start getting 'fixed' tasks as they should have a time estimate of around 10x longer than V1.04 tasks.

Maybe Bernd can 'mark' the bad tasks in the DB so that each client can be 'told' to abort them locally after a 'sched_request - sched_reply' cycle passes the information to the client. I don't know what the server-side options might be so I'm planning to wait for further advice from Bernd.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119528648289

RAC: 24957939

Hi Bernd, I'm at home at

21 Jun 2013 7:17:26 UTC

Message 116755

Hi Bernd,

I'm at home at the moment and playing with some hosts I have here. The bulk of my machines (~70) are at a different location and I'm planning to travel there tomorrow to abort (if necessary) tasks that are at risk of 'Max elapsed time exceeded' type problems. Those machines very much look after themselves most of the time and usually have 4 day caches. Virtually all have web-based prefs (4 venues) so I've reduced the cache size drastically, but undoubtedly they will all have gathered some 'at risk' tasks. I'm hoping they won't have started crunching any just yet.

I've observed that the 'new' FGRP2 tasks currently being issued have an estimate that is ~10x that of the 'old' tasks, so it will be very much in my best interests to bite the bullet and replace all 'at risk' tasks with 'new' ones ASAP. I estimate it will take me many hours to visit and fix every single host, so before I actually do this, I would like to know if you have any plans to perhaps cancel the problem tasks on the server somehow so that the client can be told 'automatically' to get rid of them.

I don't particularly want to agonise over what might be a 'good' or 'bad' task on every single machine, and I also am reluctant to just abort everything in sight - perhaps quite unnecessarily. I'm mindful of the pressures on you and the others so I'm not asking you to do anything that's not easy for you to do. However a rough idea of your intentions would be much appreciated.

If these tasks are not neutralised in some way, when people abort them (as they surely will) aren't they just going to be reissued with the same problem for the next recipient? Or will things actually be 'fixed' when the task is resent so that the next recipient 'sees' the correct estimate? It would be good to know that an abort won't cause the problem to be moved on to the next recipient and possibly continue until the limit of 20 is reached.

I (and probably many others) look forward to your comments.

Cheers,
Gary.
