multiple clients & condor fetch-work backfill

Fred Donovan
Joined: 19 Oct 06
Posts: 5
Credit: 117895072
RAC: 0
Topic 195297

Dear All:

We're converting from BOINC backfill on our Condor cluster to fetch-work. boinc --allow_multiple_clients allows two BOINC clients to run, and both contact the scheduler and download work, but one always gets detached (yet runs its tasks anyway ...); the other runs fine. Do the Einstein servers need to be configured to allow multiple clients per host? Any info would be appreciated.
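
For reference, the clients are started along these lines (Condor actually launches them for us; the direct invocation below is just to illustrate the layout):

# one data directory per client; --allow_multiple_clients lets both run at once
(cd /usr1/BOINC/var/slot1 && ./boinc --allow_multiple_clients --no_gui_rpc) &
(cd /usr1/BOINC/var/slot2 && ./boinc --allow_multiple_clients --no_gui_rpc) &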

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 780526860
RAC: 1205388

multiple clients & condor fetch-work backfill

Hi!

I forwarded your question to the devs.

HBE

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 252171734
RAC: 33931

Are you running the two BOINC clients with the same BOINC data directory (e.g. /var/lib/boinc)?

BM

The scheduler message is:
2010-08-31 15:36:41.1450 [PID=2734 ] [CRITICAL] [HOST#2182682] [USER#223946] User has another host with same CPID.
I'm not sure how the CPID of a host is generated, but apparently this leads to a change of the CPID of one of the 'hosts' (actually clients), which invalidates the results stamped with its old host ID; that is what you see as the outcome 'Client detached'. I don't know whether this could or should be changed by project configuration.
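
If you want to compare what the two clients report, the host CPID should be visible in each client's state file. A quick sketch, under the assumption that it is stored as host_cpid in client_state.xml:

# compare the host CPIDs of the two clients (fill in your data-dir paths)
grep host_cpid <data-dir-1>/client_state.xml
grep host_cpid <data-dir-2>/client_state.xml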

BM

Fred Donovan
Joined: 19 Oct 06
Posts: 5
Credit: 117895072
RAC: 0

Message 99495 in response to message 99494

Many thanks. No, it's set up per the wiki

https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureBackfill

/usr1/BOINC/var/slot1 and /usr1/BOINC/var/slot2; I went so far as to put a copy of the boinc executable in each of the var/slotX directories, i.e.:


BOINC_Universe = vanilla ## note: in fetch-work this must be given as 5, the numeric id of the vanilla universe
BOINC_HOME = /usr1/BOINC
BOINC_Executable = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc
BOINC_Output = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.out
BOINC_Error = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.err
BOINC_InitialDir=$(BOINC_HOME)/var/slot$(BOINC_SLOT)
BOINC_Owner = nobody
BOINC_User = nobody
BOINC_Arguments = --allow_multiple_clients --no_gui_rpc --update_prefs http://einstein.phys.uwm.edu/

Could it be that, because these machines were previously running BOINC via Condor's BOINC backfill (though I did detach from the project before configuring fetch-work ...), they 'reattached' rather than starting clean?

I recall reading somewhere (via Google) that the project servers may need to be configured to honor --allow_multiple_clients. Is that the case?

Right now I'm running with only one slot configured for Einstein@Home, so I'm OK for now.

Many thanks for your replies.

-Fred

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 252171734
RAC: 33931

Message 99496 in response to message 99495

I found the configuration option (see here) and turned it on.

I still have to verify exactly what this does in the code, but you should already be able to try on your side whether this helps.
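
For reference, the switch would be set in the project's config.xml; the exact tag name below is from memory and should be treated as an assumption:

<boinc>
    <config>
        <!-- allow several BOINC clients on one physical host (tag name assumed) -->
        <multiple_clients_per_host/>
    </config>
</boinc>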

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 252171734
RAC: 33931

Message 99497 in response to message 99496

It looks like this option slows the scheduler to a halt. I need to disable it for now, pending further investigation.

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 252171734
RAC: 33931

Message 99498 in response to message 99497

Quote:
It looks like this option slows the scheduler to a halt. I need to disable it for now, pending further investigation.

OK, I looked into the code, and this option shouldn't have anything to do with the slowdown; if anything, it should speed things up in certain cases. I have enabled it again.

BM


Fred Donovan
Joined: 19 Oct 06
Posts: 5
Credit: 117895072
RAC: 0

Message 99499 in response to message 99498


Thanks for doing this. I can now run BOINC in both slots, but one slot's jobs are stuck in a permanent state of "suspended" in the status column. (I'm accessing a single node with two copies of boincmgr, one per slot, on alternate GUI RPC ports.) The tasks are nevertheless listed as 'active', and the suspend button is not grayed out (i.e., there is no resume button). The two clients each have an identical list of four tasks.

Two BOINC clients are running, per the Condor StarterLog:

09/01 18:46:27 About to exec /usr1/BOINC/var/slot1/boinc --allow_multiple_clients --gui_rpc_port 10001 --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
09/01 18:46:27 Create_Process succeeded, pid=3832

09/01 18:47:08 About to exec /usr1/BOINC/var/slot2/boinc --allow_multiple_clients --gui_rpc_port 10002 --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
09/01 18:47:08 Create_Process succeeded, pid=3851

slot 1 is running 4 processes:

[root@node12 slot2]# pstree 3832
boinc───4*[einsteinbinary_───{einsteinbinary_}]

slot 2 none:

[root@node12 slot2]# pstree 3851
boinc

The boinc.out files show that each client downloaded identical files, though only slot 1 started running jobs.

Though as I write this, I see that slot 1 just acquired two new, different tasks that are "ready to start"; so now slot 1 lists 6 tasks and slot 2 lists 4, with the original 4 tasks overlapping.
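
For what it's worth, each client can also be queried on its RPC port from the command line; a sketch, assuming the boinccmd tool (boinc_cmd in older versions) is installed and no GUI RPC password is set:

# dump each client's state, including its task list
boinccmd --host localhost:10001 --get_state
boinccmd --host localhost:10002 --get_state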

Sorry for all the trouble.

Thanks,

-Fred

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 252171734
RAC: 33931

Message 99500 in response to message 99499

Hi Fred!

Do you have a global_prefs_override.xml file, and what is its content?

It appears that you have four cores per machine, with 0.5 GB RAM per core (which should be sufficient for E@H if the machine is otherwise idle).

If the intention is to ultimately run one client per Condor slot/VM, I'd write <max_ncpus>1</max_ncpus> (or whatever number of cores should be assigned to one client) in the global_prefs_override.xml. I don't think that <ncpus> (as described in the Condor Wiki) should be used.
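
I.e., a minimal global_prefs_override.xml would look roughly like this (one core per client assumed):

<global_preferences>
    <!-- maximum number of cores this client may use -->
    <max_ncpus>1</max_ncpus>
</global_preferences>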

BM


Fred Donovan
Joined: 19 Oct 06
Posts: 5
Credit: 117895072
RAC: 0

Message 99501 in response to message 99500

My apologies for the delay; I was away for the Labor Day week.
Many thanks for your help. I've added the max CPU setting to my global prefs override:

global_prefs.xml:

http://einstein.phys.uwm.edu/
http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
1179869867
3
60
0.1
100
100
60
100
94
0.001
75
0
0

global_prefs_override.xml

1
1

I'm running each BOINC client as follows from condor_config.local:

START_DELAY_1 = 10
START_DELAY_2 = 30
BOINC_Universe = vanilla
BOINC_HOME = /usr1/BOINC
BOINC_Executable = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc
BOINC_Output = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.out
BOINC_Error = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.err
BOINC_InitialDir=$(BOINC_HOME)/var/slot$(BOINC_SLOT)
BOINC_Owner = nobody
BOINC_User = nobody

#BOINC_Arguments = --allow_multiple_clients --no_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
#BOINC_Arguments = --allow_multiple_clients --gui_rpc_port 1000$(BOINC_SLOT) --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/ --start_delay $(START_DELAY_$(BOINC_SLOT))
BOINC_Arguments = --allow_multiple_clients --gui_rpc_port 1000$(BOINC_SLOT) --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/

STARTD_JOB_HOOK_KEYWORD = BOINC
#SLOT1_JOB_HOOK_KEYWORD = BOINC
BOINC_HOOK_FETCH_WORK = $(BOINC_HOME)/fetch_work.sh
#BOINC_Requirements = RemoteUser =?= "$(BOINC_Owner)" || \
# RemoteUser =?= "$(BOINC_User)" || \
# (State == "Unclaimed" && $(StateTimer) > 60)
BOINC_Requirements = RemoteUser =?= "nobody" && (State == "Unclaimed" && $(StateTimer) > 60)
#RANK = $(RANK) - (Owner =?= "nobody")
RANK = 0

and I can connect to each BOINC client on a separate RPC port (10001 and 10002) with boincmgr.

For slot 1, I have 8 jobs: 4 running and 4 "ready to start"; all are unique, no duplicate jobs.

For slot 2, I have 4 jobs, all suspended; these four are identical to the *first* 4 running jobs in slot 1.

top shows just the first 4 jobs from slot 1.

On the Einstein@Home web site (computer tasks), the 8 unique jobs from slot 1 all show up as 'in progress'.

So at first glance it seems that slot 1 gets 8 jobs, slot 2 gets 4 duplicates, and only slot 1 runs. I think the other 4 jobs from slot 1 aren't running due to high CPU load, but it seems the servers are still confused?

Thanks,

-Fred

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 252171734
RAC: 33931

Message 99502 in response to message 99501

You really have <max_ncpus>100</max_ncpus>? That should read <max_ncpus>1</max_ncpus>. (This is not a percentage, but the maximum number of cores a client should use.)

You need to restart the clients (or they need to be terminated and restarted by Condor) in order for the new settings to be read.

As for the duplicate tasks, I'd reset the project on the second client using the BOINC Manager or boinc_cmd (the command-line tool should be in the BOINC directory). Or stop the client, delete the client_state.xml and its backup, and start it again.
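
A sketch of the command-line variant, assuming the second client still answers on RPC port 10002 as above (add --passwd if a GUI RPC password is set):

# reset Einstein@Home on the second client only
boinccmd --host localhost:10002 --project http://einstein.phys.uwm.edu/ reset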

Note: I'll be a bit out of touch (no internet access) for the next two weeks.

BM

BM
