einstein checkpoints

[AF>ALSACE>EDLS] Phil68
Joined: 30 Dec 05
Posts: 32
Credit: 39832
RAC: 0
Topic 190698

hi

i'm running albert 4.36 on a sparc-solaris 2.7 machine with boinc 4.43, and my questions are:
1) how can i force checkpoints? (if that's possible)
2) how can i see whether checkpoints are used, and when?
3) what could be the reasons why my WUs go from 0% to nearly 10% and then directly to 100% (after 6 hours without any movement)?

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

einstein checkpoints

Quote:

hi

i'm running albert 4.36 on a sparc-solaris 2.7 machine with boinc 4.43, and my questions are:
1) how can i force checkpoints? (if that's possible)
2) how can i see whether checkpoints are used, and when?
3) what could be the reasons why my WUs go from 0% to nearly 10% and then directly to 100% (after 6 hours without any movement)?

The frequency of checkpointing is determined by your preference setting for 'Write to disk no more than every X minutes'.

To watch checkpointing in progress, look in the slots/N/ directory and find the file called Fstat.out.ckp. This file is written (and changes) at each checkpoint.

Director, Einstein@Home
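
A minimal sketch of how the check described above could be automated, assuming Python is available on the host and the script is started from the BOINC directory that contains slots/. The function names, the number of polls, and the polling interval are illustrative choices, not part of BOINC or the Einstein@Home application. It only watches the modification time of Fstat.out.ckp and reports when it changes.

```python
import os
import time


def checkpoint_mtime(path):
    """Return the file's modification time in seconds, or None if it is absent."""
    try:
        return os.stat(path).st_mtime
    except OSError:
        return None


def watch_checkpoint(path, polls=10, interval=60.0, sleep=time.sleep):
    """Poll `path` and return the list of mtimes at which a change was seen."""
    changes = []
    last = checkpoint_mtime(path)
    for _ in range(polls):
        sleep(interval)
        current = checkpoint_mtime(path)
        # A new or rewritten file shows up as a different modification time.
        if current is not None and current != last:
            print("checkpoint written at", time.ctime(current))
            changes.append(current)
        last = current
    return changes
```

For example, `watch_checkpoint("slots/0/Fstat.out.ckp")` would poll once a minute for ten minutes. If nothing is printed over a period much longer than the 'Write to disk' preference, the file is not being rewritten.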

[AF>ALSACE>EDLS] Phil68
Joined: 30 Dec 05
Posts: 32
Credit: 39832
RAC: 0

Message 24690 in response to message 24689

Quote:

The frequency of checkpointing is determined by your preference setting for 'Write to disk no more than every X minutes'.

To watch checkpointing in progress, look in the slots/N/ directory and find the file called Fstat.out.ckp. This file is written (and changes) at each checkpoint.

thank you very much for your answer.

my settings are: Write to disk at most every 5 seconds
and, for example, my boinc has been running for a long time on 2 WUs...

the time is now 16:42:07, and in the slots directories i have:

1:
total 168
-rw-r--r-- 1 pp staff 73776 Oct 25 17:21 skygrid_0180_z_T06.dat
-rw-r--r-- 1 pp staff 70 Feb 1 04:52 sun
-rw-r--r-- 1 pp staff 72 Feb 1 04:52 earth
-rw-r--r-- 1 pp staff 70 Feb 1 04:52 data.sft
-rw-r--r-- 1 pp staff 77 Feb 1 04:52 conf
-rw-r--r-- 1 pp staff 93 Feb 1 04:52 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 85 Feb 1 04:52 Fstat.out
-rw-r--r-- 1 pp staff 497 Feb 1 12:09 stderr.txt
-rw-r--r-- 1 pp staff 3077 Feb 1 12:09 init_data.xml
-rw-r--r-- 1 pp staff 0 Feb 1 12:09 boinc_finish_called

2:
total 198
-rw-r--r-- 1 pp staff 89989 Oct 25 17:22 skygrid_0200_z_T06.dat
-rw-r--r-- 1 pp staff 70 Feb 1 12:10 sun
-rw-r--r-- 1 pp staff 3068 Feb 1 12:10 init_data.xml
-rw-r--r-- 1 pp staff 72 Feb 1 12:10 earth
-rw-r--r-- 1 pp staff 70 Feb 1 12:10 data.sft
-rw-r--r-- 1 pp staff 77 Feb 1 12:10 conf
-rw-r--r-- 1 pp staff 0 Feb 1 12:10 boinc_lockfile
-rw-r--r-- 1 pp staff 93 Feb 1 12:10 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 86 Feb 1 12:10 Fstat.out
-rw-r--r-- 1 pp staff 432 Feb 1 13:56 stderr.txt
-rw-r--r-- 1 pp staff 26 Feb 1 13:56 Fstat.out.ckp

and the only ckp-file i have is

more 2/Fstat.out.ckp
1237 33506155 672800
DONE
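
A small sketch for reading such a checkpoint file programmatically. The thread does not say what the three numbers mean, so the sketch leaves them unnamed. It assumes only the layout visible in the listing: one line of whitespace-separated integers, followed by a line reading DONE. Treating a missing DONE line as an incomplete checkpoint is an assumption, and `parse_checkpoint` is a hypothetical helper name.

```python
def parse_checkpoint(text):
    """Split checkpoint text into (numbers, complete).

    `numbers` is the tuple of integers on the first non-empty line;
    `complete` is True when a later line reads exactly DONE.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return (), False
    numbers = tuple(int(field) for field in lines[0].split())
    # Assumption: the DONE marker means the checkpoint was written in full.
    complete = "DONE" in lines[1:]
    return numbers, complete
```

Comparing the tuple from two reads taken some minutes apart shows whether the checkpoint content has moved, independently of the file's timestamp.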

and my top shows only one application....

PID USERNAME THR PR NCE SIZE RES STATE TIME FLTS CPU COMMAND
12890 pp 3 0 19 6816K 6112K cpu01 269:46 0 49.40% albert_4.36_spa

and my boincviewer has been showing this for hours:
1 WU at 100%
1 WU at 24.95 %

???

[AF>ALSACE>EDLS] Phil68
Joined: 30 Dec 05
Posts: 32
Credit: 39832
RAC: 0

Message 24691 in response to message 24690

Sorry... i posted that twice, because i don't know where to find some help :-(

some more info: i have 2 WUs that once more seem not to go any further :-(

first the client_state.xml, and then the slots files.
i don't know, perhaps someone can find something wrong here. the checkpoints really don't seem to work on my machine.
Is this message normal: "Fstat file reached MaxFileSizeKB"?

http://einstein.phys.uwm.edu/
z1_0666.5__282_S4R2a_2
1
436
1
2
84342.990000
0.028841
84342.990000
0.000000
0.000000

http://einstein.phys.uwm.edu/
z1_0666.5__281_S4R2a_2
1
436
0
2
3085.580000
0.036817
3085.580000
0.000000
0.000000

slots/0>lst
-rw-r--r-- 1 pp staff 402399 Oct 25 17:24 skygrid_0670_z_T06.dat
drwxr-xr-x 7 pp staff 512 Jan 16 09:01 ..
-rw-r--r-- 1 pp staff 70 Feb 7 00:43 sun
-rw-r--r-- 1 pp staff 3067 Feb 7 00:43 init_data.xml
-rw-r--r-- 1 pp staff 72 Feb 7 00:43 earth
-rw-r--r-- 1 pp staff 70 Feb 7 00:43 data.sft
-rw-r--r-- 1 pp staff 77 Feb 7 00:43 conf
-rw-r--r-- 1 pp staff 0 Feb 7 00:43 boinc_lockfile
-rw-r--r-- 1 pp staff 93 Feb 7 00:43 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 85 Feb 7 00:43 Fstat.out
-rw-r--r-- 1 pp staff 432 Feb 7 01:36 stderr.txt
-rw-r--r-- 1 pp staff 25 Feb 7 01:36 Fstat.out.ckp
drwxr-xr-x 2 pp staff 512 Feb 7 01:36 .

slots/0>more Fstat.out.ckp
818 33369450 668662
DONE
Fstat.out.ckp: END
slots/0>more stderr.txt
2006-02-07 00:43:32.9964 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
2006-02-07 00:43:33.0060 [normal]: Started search at lalDebugLevel = 0
2006-02-07 00:43:34.3525 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-02-07 00:43:34.3667 [normal]: No usable checkpoint found, starting from beginning.
2006-02-07 01:36:04.8990 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.

slots/1>lst
-rw-r--r-- 1 pp staff 402399 Oct 25 17:24 skygrid_0670_z_T06.dat
drwxr-xr-x 7 pp staff 512 Jan 16 09:01 ..
-rw-r--r-- 1 pp staff 70 Feb 6 00:30 sun
-rw-r--r-- 1 pp staff 72 Feb 6 00:30 earth
-rw-r--r-- 1 pp staff 70 Feb 6 00:30 data.sft
-rw-r--r-- 1 pp staff 77 Feb 6 00:30 conf
-rw-r--r-- 1 pp staff 93 Feb 6 00:30 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 85 Feb 6 00:30 Fstat.out
-rw-r--r-- 1 pp staff 0 Feb 7 00:24 boinc_finish_called
-rw-r--r-- 1 pp staff 3077 Feb 7 08:54 init_data.xml
-rw-r--r-- 1 pp staff 0 Feb 7 08:54 boinc_lockfile
-rw-r--r-- 1 pp staff 929 Feb 7 09:36 stderr.txt
-rw-r--r-- 1 pp staff 25 Feb 7 09:36 Fstat.out.ckp
drwxr-xr-x 2 pp staff 512 Feb 7 09:36 .

slots/1>more stderr.txt
2006-02-06 00:30:38.1595 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
2006-02-06 00:30:38.1624 [normal]: Started search at lalDebugLevel = 0
2006-02-06 00:30:38.9200 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-02-06 00:30:38.9258 [normal]: No usable checkpoint found, starting from beginning.
2006-02-06 01:12:33.2474 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
2006-02-07 00:24:42.0435 [normal]: Search finished successfully.

2006-02-07 08:54:36.8763 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
2006-02-07 08:54:36.8791 [normal]: Started search at lalDebugLevel = 0
2006-02-07 08:54:38.4567 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-02-07 08:54:38.4707 [normal]: No usable checkpoint found, starting from beginning.
2006-02-07 09:36:17.4816 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.

slots/1>more Fstat.out.ckp
641 33437018 670670
DONE
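
Every line in the stderr.txt listings above carries a timestamp, so the wall-clock length of each run can be read off mechanically. The following is a rough sketch under the assumption that every line has the `YYYY-MM-DD HH:MM:SS.ffff [level]: message` shape shown in the listings. `parse_log_line` and `run_spans` are hypothetical helper names, not part of BOINC or the albert application.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_log_line(line):
    """Return (timestamp, message) for a stderr.txt line, or None if it does not match."""
    stamp, sep, rest = line.partition(" [")
    if not sep:
        return None
    try:
        when = datetime.strptime(stamp.strip(), STAMP_FORMAT)
    except ValueError:
        return None
    _, _, message = rest.partition("]: ")
    return when, message.strip()


def run_spans(lines):
    """Return one (start, last_seen) pair per application start found in `lines`."""
    spans = []
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        when, message = parsed
        if message.startswith("Start of BOINC application"):
            spans.append([when, when])
        elif spans:
            # Any later line extends the run that is currently open.
            spans[-1][1] = when
    return [(start, last) for start, last in spans]
```

Applied to the slots/1 log, the first run spans about 23 hours 54 minutes from start to "Search finished successfully", and the second run has logged nothing after the compactifying message about 42 minutes in.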

wumpus
Joined: 17 Feb 05
Posts: 50
Credit: 7809074
RAC: 0

I figured I would post here since you seem to have a problem. I looked at the time to process for both your Sunfire V250 and my Sunblade 100. Although I only have two results returned, I am at about 60000 for CPU time on the website. You appear to have gone up to 85000 since Feb 2. Before that, you seemed to be at about 25000 a workunit. In your preferences, do you have it set to use multiple processors? Is your sunblade just used for processing workunits only?

You may want to temporarily give up on using the Boinc Viewer and just watch a terminal window with the client running in it.

Are you attached to any other projects or just Einstein?

wumpus
Joined: 17 Feb 05
Posts: 50
Credit: 7809074
RAC: 0

I went through the stats I downloaded for SPARC machines, and the processing times reported for the workunits are all over the place. I was just looking at Sunblade 100s because there were a lot of them. The Sunblade 100 data includes both Sunblade 100 and 150 models because both report the same. There doesn't seem to be a really good way to judge whether you are running slow or not. Some machines average a high compute time in seconds and others take about half that time. I don't know if some machines are running more than one project. I am just running Einstein.

[AF>ALSACE>EDLS] Phil68
Joined: 30 Dec 05
Posts: 32
Credit: 39832
RAC: 0

Message 24694 in response to message 24692

Quote:

Quote:
I figured I would post here since you seem to have a problem. I looked at the time to process for both your Sunfire V250 and my Sunblade 100. Although I only have two results returned, I am at about 60000 for CPU time on the website. You appear to have gone up to 85000 since Feb 2. Before that, you seemed to be at about 25000 a workunit. In your preferences, do you have it set to use multiple processors? Is your sunblade just used for processing workunits only?

before, i had "2 processors" in my preferences; since yesterday i've been trying with max 1 processor. i also tried running 2 boinc instances with one processor each, the first with only einstein and the second with both seti and einstein. in all these configurations seti works fine and always shows its progress percentage and time, and einstein never does.
this machine is used to compile some applications, and an oracle db is running on it...

Quote:

You may want to temporarily give up on using the Boinc Viewer and just watch a terminal window with the client running in it.

i can try. do you think it's the boincviewer that causes problems for einstein?

Quote:

Are you attached to any other projects or just Einstein?


seti and einstein, but as i said, einstein alone behaves the same...

wumpus
Joined: 17 Feb 05
Posts: 50
Credit: 7809074
RAC: 0

You only need to run one instance of the BOINC client. You should attach it to both projects and have your preferences set to use both processors. If your preferences state that your resource shares are the same for both projects, they should be treated about equally. It seems E@H doesn't advance as quickly if it has to share. Try to read up on how BOINC schedules the applications, or otherwise stick to just one project.

[AF>ALSACE>EDLS] Phil68
Joined: 30 Dec 05
Posts: 32
Credit: 39832
RAC: 0

Message 24696 in response to message 24695

Quote:
You only need to run one instance of the BOINC client. You should attach it to both projects and have your preferences set to use both processors. If your preferences state that your resource shares are the same for both projects, they should be treated about equally. It seems E@H doesn't advance as quickly if it has to share. Try to read up on how BOINC schedules the applications, or otherwise stick to just one project.


yes, it's surely better to run only one boinc. but as einstein didn't show any progress (or only when the WUs were correctly done), i tried many configurations:
1 boinc, switch between apps every 20/10/120 minutes, 1% seti, 50% seti, 2 einstein, 1 proc/2 procs... nothing changed.
even with einstein alone on only one proc, it doesn't work well (it works, but i never see how far along it is).
is there some site where the tags in client_state.xml are explained?
because the differences between client_state and client_state_prev could be analysed, couldn't they?
thank you very much for all your answers and ideas.

ps: does a debug mode exist for boinc and einstein, to find out what it's doing wrong?

wumpus
Joined: 17 Feb 05
Posts: 50
Credit: 7809074
RAC: 0

On mine, with only BOINC running in a terminal window with default settings, I see updates on the terminal screen about every hour. I seem to be getting a completion about every 16 to 17 hours.
Leave your setup at 'use at most one processor', run only one BOINC instance, and detach all other projects. I don't think E@H updates the files all that often. Just try to run it that way for a few days. When you don't have BOINC running at all, type 'prstat' in a terminal window and hit enter. This will give you a view like the top command. If you want to quit, hit 'q'.

http://developers.sun.com/solaris/articles/prstat.html

This link shows the options for prstat. You can use it to see what is taking up processing power on the computer. Since you have Oracle, the machine may just be busy.

wumpus
Joined: 17 Feb 05
Posts: 50
Credit: 7809074
RAC: 0

I got Boinc View running on Windows XP and it seems to update fine from the client on Solaris 10. I can see progress with cpu time and percent done.
