hi
I'm running albert 4.36 on a sparc-solaris 2.7 machine with BOINC 4.43, and my questions are:
1) How can I force checkpoints (if that's possible)?
2) How can I see whether checkpoints are used, and when?
3) What could be the reasons my WUs go from 0% to nearly 10% and then directly to 100% (after 6 hours without any movement)?
einstein checkpoints
The frequency of checkpointing is determined by your preference setting for 'Write to disk no more than every X minutes'.
To watch checkpointing in progress, look in the slots/N/ directory and find the file called Fstat.out.ckp. This file is written (and changes) at each checkpoint.
Director, Einstein@Home
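A minimal sketch of the check described above, assuming the BOINC directory layout mentioned in the answer (a `slots` directory with one subdirectory per running task, each holding an `Fstat.out.ckp` file once a checkpoint has been written). The function name `checkpoint_ages` and its parameters are made up for illustration; it only reads file modification times and does not talk to BOINC.

```python
import os
import time


def checkpoint_ages(slots_dir, ckp_name="Fstat.out.ckp", now=None):
    """Return {slot_name: seconds since the checkpoint file last changed}.

    Slots that have no checkpoint file yet map to None.
    """
    now = time.time() if now is None else now
    ages = {}
    for slot in sorted(os.listdir(slots_dir)):
        slot_path = os.path.join(slots_dir, slot)
        if not os.path.isdir(slot_path):
            continue  # ignore stray files next to the slot directories
        ckp_path = os.path.join(slot_path, ckp_name)
        if os.path.exists(ckp_path):
            # mtime changes every time the application rewrites the file
            ages[slot] = now - os.path.getmtime(ckp_path)
        else:
            ages[slot] = None
    return ages


# Example call from the BOINC directory (path is an assumption):
# print(checkpoint_ages("slots"))
```

Running it every few minutes and comparing the ages shows whether the file is still being rewritten. An age that keeps growing while the task uses CPU suggests that no new checkpoint has been written.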
Thank you very much for your answer.
My setting is: Write to disk at most every 5 seconds.
For example, my BOINC has been running for a long time on 2 WUs.
The time is now 16:42:07, and in the slots directories I have:
1:
total 168
-rw-r--r-- 1 pp staff 73776 Oct 25 17:21 skygrid_0180_z_T06.dat
-rw-r--r-- 1 pp staff 70 Feb 1 04:52 sun
-rw-r--r-- 1 pp staff 72 Feb 1 04:52 earth
-rw-r--r-- 1 pp staff 70 Feb 1 04:52 data.sft
-rw-r--r-- 1 pp staff 77 Feb 1 04:52 conf
-rw-r--r-- 1 pp staff 93 Feb 1 04:52 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 85 Feb 1 04:52 Fstat.out
-rw-r--r-- 1 pp staff 497 Feb 1 12:09 stderr.txt
-rw-r--r-- 1 pp staff 3077 Feb 1 12:09 init_data.xml
-rw-r--r-- 1 pp staff 0 Feb 1 12:09 boinc_finish_called
2:
total 198
-rw-r--r-- 1 pp staff 89989 Oct 25 17:22 skygrid_0200_z_T06.dat
-rw-r--r-- 1 pp staff 70 Feb 1 12:10 sun
-rw-r--r-- 1 pp staff 3068 Feb 1 12:10 init_data.xml
-rw-r--r-- 1 pp staff 72 Feb 1 12:10 earth
-rw-r--r-- 1 pp staff 70 Feb 1 12:10 data.sft
-rw-r--r-- 1 pp staff 77 Feb 1 12:10 conf
-rw-r--r-- 1 pp staff 0 Feb 1 12:10 boinc_lockfile
-rw-r--r-- 1 pp staff 93 Feb 1 12:10 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 86 Feb 1 12:10 Fstat.out
-rw-r--r-- 1 pp staff 432 Feb 1 13:56 stderr.txt
-rw-r--r-- 1 pp staff 26 Feb 1 13:56 Fstat.out.ckp
The only .ckp file I have is:
more 2/Fstat.out.ckp
1237 33506155 672800
DONE
And top shows only one application:
PID USERNAME THR PR NCE SIZE RES STATE TIME FLTS CPU COMMAND
12890 pp 3 0 19 6816K 6112K cpu01 269:46 0 49.40% albert_4.36_spa
And my BoincView has been saying this for hours:
1 WU at 100%
1 WU at 24.95 %
???
Sorry, I posted that twice, because I don't know where to find help :-(
Some more info: I have 2 WUs that once more seem not to get any further :-(
First the client_state.xml, then the slots files.
I don't know, perhaps someone can find something wrong here. The checkpoints really don't seem to work on my machine.
Is this normal: "Fstat file reached MaxFileSizeKB"?
http://einstein.phys.uwm.edu/
z1_0666.5__282_S4R2a_2
1
436
1
2
84342.990000
0.028841
84342.990000
0.000000
0.000000
http://einstein.phys.uwm.edu/
z1_0666.5__281_S4R2a_2
1
436
0
2
3085.580000
0.036817
3085.580000
0.000000
0.000000
slots/0>lst
-rw-r--r-- 1 pp staff 402399 Oct 25 17:24 skygrid_0670_z_T06.dat
drwxr-xr-x 7 pp staff 512 Jan 16 09:01 ..
-rw-r--r-- 1 pp staff 70 Feb 7 00:43 sun
-rw-r--r-- 1 pp staff 3067 Feb 7 00:43 init_data.xml
-rw-r--r-- 1 pp staff 72 Feb 7 00:43 earth
-rw-r--r-- 1 pp staff 70 Feb 7 00:43 data.sft
-rw-r--r-- 1 pp staff 77 Feb 7 00:43 conf
-rw-r--r-- 1 pp staff 0 Feb 7 00:43 boinc_lockfile
-rw-r--r-- 1 pp staff 93 Feb 7 00:43 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 85 Feb 7 00:43 Fstat.out
-rw-r--r-- 1 pp staff 432 Feb 7 01:36 stderr.txt
-rw-r--r-- 1 pp staff 25 Feb 7 01:36 Fstat.out.ckp
drwxr-xr-x 2 pp staff 512 Feb 7 01:36 .
slots/0>more Fstat.out.ckp
818 33369450 668662
DONE
Fstat.out.ckp: END
slots/0>more stderr.txt
2006-02-07 00:43:32.9964 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
2006-02-07 00:43:33.0060 [normal]: Started search at lalDebugLevel = 0
2006-02-07 00:43:34.3525 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-02-07 00:43:34.3667 [normal]: No usable checkpoint found, starting from beginning.
2006-02-07 01:36:04.8990 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
slots/1>lst
-rw-r--r-- 1 pp staff 402399 Oct 25 17:24 skygrid_0670_z_T06.dat
drwxr-xr-x 7 pp staff 512 Jan 16 09:01 ..
-rw-r--r-- 1 pp staff 70 Feb 6 00:30 sun
-rw-r--r-- 1 pp staff 72 Feb 6 00:30 earth
-rw-r--r-- 1 pp staff 70 Feb 6 00:30 data.sft
-rw-r--r-- 1 pp staff 77 Feb 6 00:30 conf
-rw-r--r-- 1 pp staff 93 Feb 6 00:30 albert_4.36_sparc-sun-solaris2.7
-rw-r--r-- 1 pp staff 85 Feb 6 00:30 Fstat.out
-rw-r--r-- 1 pp staff 0 Feb 7 00:24 boinc_finish_called
-rw-r--r-- 1 pp staff 3077 Feb 7 08:54 init_data.xml
-rw-r--r-- 1 pp staff 0 Feb 7 08:54 boinc_lockfile
-rw-r--r-- 1 pp staff 929 Feb 7 09:36 stderr.txt
-rw-r--r-- 1 pp staff 25 Feb 7 09:36 Fstat.out.ckp
drwxr-xr-x 2 pp staff 512 Feb 7 09:36 .
slots/1>more stderr.txt
2006-02-06 00:30:38.1595 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
2006-02-06 00:30:38.1624 [normal]: Started search at lalDebugLevel = 0
2006-02-06 00:30:38.9200 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-02-06 00:30:38.9258 [normal]: No usable checkpoint found, starting from beginning.
2006-02-06 01:12:33.2474 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
2006-02-07 00:24:42.0435 [normal]: Search finished successfully.
2006-02-07 08:54:36.8763 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
2006-02-07 08:54:36.8791 [normal]: Started search at lalDebugLevel = 0
2006-02-07 08:54:38.4567 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-02-07 08:54:38.4707 [normal]: No usable checkpoint found, starting from beginning.
2006-02-07 09:36:17.4816 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
slots/1>more Fstat.out.ckp
641 33437018 670670
DONE
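The stderr.txt listings above carry timestamps, so the wall-clock time of a finished run can be read off them directly. The following is a rough sketch, assuming every line has the shape shown in the post (`date time [normal]: message`). The helper names `parse_stderr` and `run_durations` are hypothetical, and the sample text is copied from the slots/1 log above.

```python
from datetime import datetime

# Lines copied from the slots/1 stderr.txt in the post above.
SAMPLE = """\
2006-02-06 00:30:38.1595 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
2006-02-06 01:12:33.2474 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
2006-02-07 00:24:42.0435 [normal]: Search finished successfully.
2006-02-07 08:54:36.8763 [normal]: Start of BOINC application 'albert_4.36_sparc-sun-solaris2.7'.
"""


def parse_stderr(text):
    """Return a list of (timestamp, message) pairs from stderr.txt lines."""
    events = []
    for line in text.splitlines():
        stamp, sep, message = line.strip().partition(" [normal]: ")
        if not sep:
            continue  # skip lines that do not match the assumed shape
        events.append((datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"), message))
    return events


def run_durations(events):
    """Pair each application start with the next 'Search finished' line."""
    durations = []
    started = None
    for when, message in events:
        if message.startswith("Start of BOINC application"):
            started = when
        elif message.startswith("Search finished") and started is not None:
            durations.append(when - started)
            started = None
    return durations


durations = run_durations(parse_stderr(SAMPLE))
print(durations)  # one finished run; the second start has no finish line yet
```

For the sample, the single finished run took a little under 24 hours of wall-clock time. The second start at 08:54 has no matching finish line, so it is not counted.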
I figured I would post here since you seem to have a problem. I looked at the time to process for both your Sunfire V250 and my Sunblade 100. Although I only have two results returned, I am at about 60000 for CPU time on the website. You appear to have gone up to 85000 since Feb 2. Before that, you seemed to be at about 25000 a workunit. In your preferences, do you have it set to use multiple processors? Is your sunblade just used for processing workunits only?
You may want to temporarily give up on using the Boinc Viewer and just watch a terminal window with the client running in it.
Are you attached to any other projects or just Einstein?
I went through the stats I downloaded for SPARC machines, and the processing times reported for the workunits are all over the place. I was just looking at Sunblade 100s because there were a lot of them. The Sunblade 100 data includes both Sunblade 100 and 150 models because both report the same. There doesn't seem to be a really good way to judge whether you are running slow or not. Some machines average a high compute time in seconds and others take about half that time. I don't know if some machines are running more than one project. I am just running Einstein.
seti and einstein, but as i said, einstein alone works the same...
You only need to run one instance of the BOINC client. It should be attached to both projects, with your preferences set to use both processors. If your preferences give both projects the same resource share, they should be treated about equally. It seems E@H doesn't advance as quickly if it has to share. Try to read up on how BOINC schedules the applications, or otherwise stick to just one project.
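As a rough illustration of the resource-share point above: the long-run fraction of processing a project is meant to receive is its share divided by the sum of all shares. The sketch below only does that arithmetic. The function name `share_fractions` and the project labels are made up, and the real BOINC scheduler also weighs other factors such as deadlines, so treat this as the target ratio and not a prediction of what runs at any given moment.

```python
def share_fractions(shares):
    """Map each project to its share divided by the total of all shares."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("resource shares must sum to a positive number")
    return {name: value / total for name, value in shares.items()}


# Equal shares: each project is meant to get about half the processing time.
print(share_fractions({"einstein": 100, "seti": 100}))
```

With unequal shares such as 99 and 1, the same arithmetic gives one project about 99% of the time and the other about 1%.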
Yes, it's surely better to run only one BOINC. But since Einstein didn't show any progress (or show when the WUs were correctly done), I tried many configurations:
1 BOINC; switch between apps every 20 minutes/10 minutes/120 minutes; 1% SETI, 50% SETI; 2 Einstein; 1 proc/2 procs... nothing changed.
Even with Einstein alone on only one processor, it doesn't work well (it works, but I never see how far along it is in its work).
Is there a site where the tags in client_state.xml are explained?
Then the differences between client_state and client_state_prev could be analysed, couldn't they?
Thank you very much for all your answers and ideas.
PS: Does a debug mode exist for BOINC and Einstein, to find out what it's doing wrong?
On mine, with only BOINC running in a terminal window with default settings, I see updates to the terminal screen about every hour. I seem to be getting a completion about every 16 to 17 hours.
Leave your setup for 'use at most one processor', run only one Boinc instance, and detach all other projects. I don't think E@H updates the files all that often. Just try to run it that way for a few days. When you don't have Boinc running at all, in a terminal window type 'prstat' and hit enter. This will give you a view like the top command. If you want to quit, hit 'Q'.
http://developers.sun.com/solaris/articles/prstat.html
This link shows the commands for prstat. You can use it to see what is taking up processing power on the computer. Since you have Oracle, the machine may just be busy.
I got Boinc View running on Windows XP and it seems to update fine from the client on Solaris 10. I can see progress with cpu time and percent done.