Dear All:
We're converting our condor cluster from BOINC backfill to fetch-work. With boinc --allow_multiple_clients, two boinc clients run, and both schedule and download work, but one always gets detached (yet runs its tasks anyway ...); the other runs OK. Do the Einstein servers need to be configured to allow multiple clients per host? Any info would be appreciated.
Hi!
I forwarded your question to the devs.
HBE
Are you running the two BOINC Clients with the same BOINC Data directory (e.g. /var/lib/boinc)?
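(Each client normally needs its own data directory. Assuming a standard client build, the two instances would be started with something like
boinc --allow_multiple_clients --dir /var/lib/boinc-slot1
boinc --allow_multiple_clients --dir /var/lib/boinc-slot2
where the --dir paths are just placeholders for two separate directories.)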
BM
The scheduler message is:
2010-08-31 15:36:41.1450 [PID=2734 ] [CRITICAL] [HOST#2182682] [USER#223946] User has another host with same CPID.
I'm not sure how the CPID of a host is generated, but apparently this leads to a change of the CPID of one of the 'hosts' (actually clients), invalidating its results stamped with the old host ID, which you see as the outcome 'Client detached'. I don't know whether this could or should be changed by project configuration.
BM
Many thanks. No, it's set up per the wiki
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureBackfill
The slots use /usr1/BOINC/var/slot1 and /usr1/BOINC/var/slot2; I went so far as to put a copy of the boinc executable in each of the var/slotX directories, i.e.:
BOINC_Universe = vanilla ## note, needs to be 5 in fetch-work
BOINC_HOME = /usr1/BOINC
BOINC_Executable = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc
BOINC_Output = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.out
BOINC_Error = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.err
BOINC_InitialDir=$(BOINC_HOME)/var/slot$(BOINC_SLOT)
BOINC_Owner = nobody
BOINC_User = nobody
BOINC_Arguments = --allow_multiple_clients --no_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
Could it be because these machines were previously running BOINC via condor's BOINC backfill support (though I did detach from the project before configuring fetch-work ...), and thus they 'reattached' rather than starting clean?
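(For the record, the detach was done from the manager; if I have the syntax right, the command-line equivalent would have been something like
boinc_cmd --project http://einstein.phys.uwm.edu/ detach
run once against each slot's client.)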
I recall reading somewhere (via google) that the project servers may need to be configured to honor --allow_multiple_clients?
Right now I'm running with only one slot configured for einstein BOINC, so I'm ok for now.
Many thanks for your replies.
-Fred
I found the configuration option (see here) and turned it on.
I still have to verify exactly what this does in the code, but you should already be able to try on your side whether this helps.
BM
It looks like this option brings the scheduler to a halt. I need to disable it for now, until further investigation.
BM
OK, I looked into the code, and this option shouldn't have anything to do with the slowdown; if anything, it should speed things up in certain cases. I've enabled it again.
BM
Thanks for doing this. I can now run boinc in both slots, but one slot's jobs are in a permanent state of "suspended" in the status column. (I'm accessing a single node with two copies of boincmgr, one for each slot, using alternate gui_rpc ports.) The tasks are listed as 'active' and the suspend button is not grayed out (i.e. there is no resume button). Each slot has an identical list of tasks (4 each).
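(The same state is visible from the command line with something like boinc_cmd --host localhost:10001 --get_state, assuming I have the option names right.)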
Two boinc clients are running; from the condor StarterLog:
09/01 18:46:27 About to exec /usr1/BOINC/var/slot1/boinc --allow_multiple_clients --gui_rpc_port 10001 --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
09/01 18:46:27 Create_Process succeeded, pid=3832
09/01 18:47:08 About to exec /usr1/BOINC/var/slot2/boinc --allow_multiple_clients --gui_rpc_port 10002 --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
09/01 18:47:08 Create_Process succeeded, pid=3851
slot 1 is running 4 processes:
[root@node12 slot2]# pstree 3832
boinc───4*[einsteinbinary_───{einsteinbinary_}]
slot 2 none:
[root@node12 slot2]# pstree 3851
boinc
The boinc.out files show that each client downloaded identical files, though only slot1 started running jobs.
Though as I write this, I see that slot1 just acquired two new, different tasks that are "ready to start"; so now slot1 lists 6 tasks and slot2 lists 4, with the original 4 tasks overlapping.
Sorry for all the trouble.
Thanks,
-Fred
Hi Fred!
Do you have a global_prefs_override.xml file and what's its content?
It appears that you have four cores per machine, with 0.5 GB RAM per core (which should be sufficient for E@H if the machine is otherwise idle).
If the intention is to ultimately run one client per Condor slot/VM, I'd set the CPU limit in global_prefs_override.xml to 1 (or whatever number of cores should be assigned to one client). I don't think the setting described in the Condor Wiki should be used.
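A minimal override file would look roughly like this; I'm writing the CPU element from memory, so please check the exact tag name against the BOINC documentation:
<global_preferences>
   <max_ncpus>1</max_ncpus> <!-- exact element name to be verified -->
</global_preferences>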
BM
My apologies for the delay; I was away for the Labor Day week.
Many thanks for your help. I've added the max CPU setting to my global prefs override:
global_prefs.xml (the XML tags didn't survive the paste, so only the values show):
http://einstein.phys.uwm.edu/
http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
1179869867
3
60
0.1
100
100
60
100
94
0.001
75
0
0
global_prefs_override.xml (again values only):
1
1
I'm running each boinc client as follows from condor.config.local
START_DELAY_1 = 10
START_DELAY_2 = 30
BOINC_Universe = vanilla
BOINC_HOME = /usr1/BOINC
BOINC_Executable = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc
BOINC_Output = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.out
BOINC_Error = $(BOINC_HOME)/var/slot$(BOINC_SLOT)/boinc.err
BOINC_InitialDir=$(BOINC_HOME)/var/slot$(BOINC_SLOT)
BOINC_Owner = nobody
BOINC_User = nobody
#BOINC_Arguments = --allow_multiple_clients --no_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
#BOINC_Arguments = --allow_multiple_clients --gui_rpc_port 1000$(BOINC_SLOT) --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/ --start_delay $(START_DELAY)_$(BOINC_SLOT)
BOINC_Arguments = --allow_multiple_clients --gui_rpc_port 1000$(BOINC_SLOT) --allow_remote_gui_rpc --update_prefs http://einstein.phys.uwm.edu/
STARTD_JOB_HOOK_KEYWORD = BOINC
#SLOT1_JOB_HOOK_KEYWORD = BOINC
BOINC_HOOK_FETCH_WORK = $(BOINC_HOME)/fetch_work.sh
#BOINC_Requirements = RemoteUser =?= "$(BOINC_Owner)" || \
# RemoteUser =?= "$(BOINC_User)" || \
# (State == "Unclaimed" && $(StateTimer) > 60)
BOINC_Requirements = RemoteUser =?= nobody && (State == "Unclaimed" && $(StateTimer) > 60)
#RANK = $(RANK) - (Owner =?= "nobody")
RANK = 0
and can connect to each boinc client on a separate rpc port (10001 and 10002) with boincmgr.
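For completeness, the fetch-work hook referenced above (BOINC_HOOK_FETCH_WORK) is essentially a script along these lines; this is a simplified sketch with the slot number hard-coded, not my exact file:
#!/bin/sh
# Condor passes the slot ClassAd on stdin; printing a job ClassAd on stdout
# (and exiting 0) hands the slot a backfill job.
cat > /dev/null                              # a real hook would inspect the slot ad
SLOT=1                                       # hard-coded here for illustration
echo 'Cmd = "/usr1/BOINC/var/slot'$SLOT'/boinc"'
echo 'Arguments = "--allow_multiple_clients --gui_rpc_port 1000'$SLOT' --allow_remote_gui_rpc"'
echo 'JobUniverse = 5'                       # 5 = vanilla, as noted above
echo 'Owner = "nobody"'
echo 'Iwd = "/usr1/BOINC/var/slot'$SLOT'"'
echo 'Out = "boinc.out"'
echo 'Err = "boinc.err"'
exit 0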
For slot1, I have 8 jobs: 4 running, 4 "ready to start"; all are unique, no duplicate jobs.
For slot2, I have 4 jobs, all suspended; these four are identical to the *first* 4 running jobs in slot1.
top shows just the first 4 jobs from slot1.
On the einstein@home web site (computer tasks), the 8 unique jobs from slot1 all show up as 'in progress'.
So, at first glance it seems that slot1 gets 8 jobs, slot2 gets 4 duplicates, and only slot1 runs. I think the other 4 jobs from slot1 aren't running due to high CPU load, but it seems that the servers are still confused?
Thanks,
-Fred
You really have 100 there? That should read 1 (this is not a percentage, but the maximum number of cores a Client should use).
You need to restart the clients (or have them terminated and restarted by Condor) in order for them to pick up the settings.
As for the duplicate tasks, I'd reset the project on the second Client using the BOINC Manager or boinc_cmd (the command-line tool should be in the BOINC directory). Or stop the client, delete the client_state.xml and its backup, and start it again.
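If I remember the syntax right, the command-line version, pointed at the second client's RPC port, would be something like
boinc_cmd --host localhost:10002 --project http://einstein.phys.uwm.edu/ reset
but please check boinc_cmd --help for the exact option names.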
Note: I'll be a bit out of touch (internet access) for the next two weeks.
BM