Got first file of S5GC1

tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

I am using BOINC 6.10.44 and my option is set to 0 (no restrictions). All my other projects (SETI, QMC, QuantumFire, AQUA, CPDN) proceed regularly.
Tullio
What I see with the "top" command is that S5GC1 is still using 4.6% of my 5 GB of RAM (my Linux kernel is PAE) while suspended.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 721699461
RAC: 1133852

RE: I am using BOINC

Message 98031 in response to message 98030

Quote:
I am using BOINC 6.10.44 and my option is set to 0 (no restrictions). All my other projects (SETI, QMC, QuantumFire, AQUA, CPDN) proceed regularly.
Tullio
What I see with the "top" command is that S5GC1 is still using 4.6% of my 5 GB of RAM (my Linux kernel is PAE) while suspended.

That's fine, the idea of the workaround is that the app remains in memory and is just halted, so when it resumes it will not have to re-read the checkpoint file.

I have a suspicion that the problem is related to reading the checkpoint file. I haven't had time to look at the source code in detail, tho, so this is just a gut feeling, it could be something completely different. The devs are informed and will look into this tomorrow, I guess.

CU
HB

tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

It just restarted from the latest progress%, without going back. Maybe we can see the light at the tunnel's end.
Tullio

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Same problem here. First pic before shutdown yesterday:

http://img442.imageshack.us/img442/4580/boinc5.jpg

Next pic today after restart:

http://img194.imageshack.us/img194/1871/boinc6.jpg

stderr.txt in slot 0:

2010-05-16 12:31:57.2519 (4662) [debug]: Successfully read checkpoint:122
% --- Cpt:122, total:834, sky:21/139, f1dot:3/6
2010-05-16 12:31:57.2549 (4662) [normal]: sky:21 f1dot:3 CG:9881 FG:10423949
2010-05-16 12:32:23.8038 (4662) [normal]: sky:21 f1dot:4 CG:9881 FG:10423949
c
2010-05-16 12:32:50.1896 (4662) [normal]: sky:21 f1dot:5 CG:9881 FG:10423949
2010-05-16 12:33:16.6749 (4662) [normal]: sky:21 f1dot:6 CG:9881 FG:10423949
2010-05-16 12:33:43.0161 (4662) [normal]: sky:22 f1dot:1 CG:9881 FG:10423949
c

As you can see, the checkpoint 122 was successfully read, but calculation is going on after checkpoint 21, fdot 3/6.

?????

Does checkpoint 122 contain the wrong sky position?
See the difference to the start of a WU:

2010-05-15 22:20:09.2588 (15720) [normal]: INFO: No checkpoint h1_0492.95_S5R4__114_S5GC1a_1_0.cpt found - starting from scratch
% --- Cpt:0, total:834, sky:1/139, f1dot:1/6
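
For anyone who wants to compare these status lines across restarts, here is a minimal Python sketch of a parser for the "% --- Cpt:..." line. The function name `parse_status` and the dictionary keys are made up for illustration; the only thing taken from the app is the line format as it appears in the stderr.txt excerpts above.

```python
import re

# Matches the status line format seen in stderr.txt above, e.g.
# "% --- Cpt:122, total:834, sky:21/139, f1dot:3/6"
STATUS_RE = re.compile(
    r"Cpt:(\d+), total:(\d+), sky:(\d+)/(\d+), f1dot:(\d+)/(\d+)"
)

def parse_status(line):
    """Return the numeric fields of a status line as a dict, or None if no match.

    The key names are hypothetical; they are not taken from the app's source.
    """
    m = STATUS_RE.search(line)
    if m is None:
        return None
    cpt, total, sky, n_sky, f1dot, n_f1dot = map(int, m.groups())
    return {
        "cpt": cpt, "total": total,
        "sky": sky, "n_sky": n_sky,
        "f1dot": f1dot, "n_f1dot": n_f1dot,
    }

resumed = parse_status("% --- Cpt:122, total:834, sky:21/139, f1dot:3/6")
fresh = parse_status("% --- Cpt:0, total:834, sky:1/139, f1dot:1/6")

# In both lines the total equals (number of sky positions) * (number of f1dot values).
print(resumed["total"] == resumed["n_sky"] * resumed["n_f1dot"])
print(fresh["total"] == fresh["n_sky"] * fresh["n_f1dot"])
```

Both lines report the same total of 834, which is 139 sky positions times 6 f1dot values, so the total itself looks consistent between the fresh start and the restart.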

cu,
Michael

[AF>Libristes] erik
Joined: 2 Feb 08
Posts: 13
Credit: 1065572761
RAC: 2310329

RE: It just restarted from

Message 98034 in response to message 98032

Quote:
It just restarted from the latest progress%, without going back. Maybe we can see the light at the tunnel's end.
Tullio


With my old computer, it's running normally. I just do things by hand:
I take the WUs one at a time.
Currently my only problem is with my avatar.

Regards

just a poet

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 721699461
RAC: 1133852

RE: Same problem here.

Message 98035 in response to message 98033

Quote:

Same problem here. First pic before shutdown yesterday:

http://img442.imageshack.us/img442/4580/boinc5.jpg

Next pic today after restart:

http://img194.imageshack.us/img194/1871/boinc6.jpg

stderr.txt in slot 0:

2010-05-16 12:31:57.2519 (4662) [debug]: Successfully read checkpoint:122
% --- Cpt:122, total:834, sky:21/139, f1dot:3/6
2010-05-16 12:31:57.2549 (4662) [normal]: sky:21 f1dot:3 CG:9881 FG:10423949
2010-05-16 12:32:23.8038 (4662) [normal]: sky:21 f1dot:4 CG:9881 FG:10423949
c
2010-05-16 12:32:50.1896 (4662) [normal]: sky:21 f1dot:5 CG:9881 FG:10423949
2010-05-16 12:33:16.6749 (4662) [normal]: sky:21 f1dot:6 CG:9881 FG:10423949
2010-05-16 12:33:43.0161 (4662) [normal]: sky:22 f1dot:1 CG:9881 FG:10423949
c

As you can see, the checkpoint 122 was successfully read, but calculation is going on after checkpoint 21, fdot 3/6.

?????

Does checkpoint 122 contain the wrong sky position?
See the difference to the start of a WU:

2010-05-15 22:20:09.2588 (15720) [normal]: INFO: No checkpoint h1_0492.95_S5R4__114_S5GC1a_1_0.cpt found - starting from scratch
% --- Cpt:0, total:834, sky:1/139, f1dot:1/6

cu,
Michael

Hi Michael!

Unfortunately the first screenshot doesn't load atm, but it would be interesting to see the percentages. Can you give them here?

Quote:

As you can see, the checkpoint 122 was successfully read, but calculation is going on after checkpoint 21, fdot 3/6.

Not quite. It's like this: the main loop of the app goes over a certain number of sky positions, and for each sky position, 6 different spin-down values are tried (that's where the strange name comes from: f1dot = first derivative of frequency with respect to time). The app checkpoints after each spin-down value, not just after each sky position. So if N is the number of sky positions to look at in the WU, the total number of checkpoints is N * 6. If you load the checkpoint with index k, you have to resume with sky position k/6 + 1 (integer division; I guess the index starts with 1) and spin-down value (k mod 6) + 1. For checkpoint 122 that gives sky position 21 and spin-down value 3, which is exactly what your log shows.

So I don't see anything wrong with the debug output so far, the problem seems to happen when writing the checkpoint, not when reading it. It seems that the sky-position index is saved as the checkpoint counter.
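
The index arithmetic above can be sketched in a few lines of Python. This is only an illustration of the k/6 + 1 and (k mod 6) + 1 rule, not the app's actual code; the function name `resume_point` and the constant name are invented here, and the value 6 is the number of spin-down values per sky position mentioned above.

```python
F1DOTS_PER_SKY = 6  # spin-down (f1dot) values tried per sky position, as described above

def resume_point(k, f1dots_per_sky=F1DOTS_PER_SKY):
    """Map a checkpoint counter k to the 1-based (sky, f1dot) pair to resume at.

    Hypothetical helper for illustration; not taken from the app's source.
    """
    sky = k // f1dots_per_sky + 1    # integer division, sky positions start at 1
    f1dot = k % f1dots_per_sky + 1   # spin-down values run from 1 to 6
    return sky, f1dot

print(resume_point(0))    # (1, 1)  -> matches "Cpt:0, ... sky:1/139, f1dot:1/6"
print(resume_point(122))  # (21, 3) -> matches "Cpt:122, ... sky:21/139, f1dot:3/6"
```

Both status lines in the posted stderr.txt agree with this mapping, which is why the reading side looks fine to me.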

Thanks for posting this
hb

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: Hi

Message 98036 in response to message 98035

Quote:

Hi Michael!

Unfortunately the first screenshot doesn't load atm, but it would be interesting to see the percentages. Can you give them here?

New link: http://img243.imageshack.us/img243/4580/boinc5.jpg

Quote:
Quote:

As you can see, the checkpoint 122 was successfully read, but calculation is going on after checkpoint 21, fdot 3/6.

Not quite. It's like this: the main loop of the app goes over a certain number of sky positions, and for each sky position, 6 different spin-down values are tried (that's where the strange name comes from: f1dot = first derivative of frequency with respect to time). The app checkpoints after each spin-down value, not just after each sky position. So if N is the number of sky positions to look at in the WU, the total number of checkpoints is N * 6. If you load the checkpoint with index k, you have to resume with sky position k/6 + 1 (integer division; I guess the index starts with 1) and spin-down value (k mod 6) + 1. For checkpoint 122 that gives sky position 21 and spin-down value 3, which is exactly what your log shows.

So I don't see anything wrong with the debug output so far, the problem might happen when writing the checkpoint, not when reading it.

Anyway, thanks for posting this
hb

You are probably right... thought about it later and share your opinion that it can be caused by writing checkpoint data too.
Last checkpoint written yesterday was
2010-05-16 04:08:02.4796 (15720) [normal]: sky:123 f1dot:6 CG:9881 FG:10423949
So the checkpoint index should be (123 - 1) * 6 + 6 = 738, but it probably isn't.
It is read as 122(???) and transformed to sky:21 f1dot:2.
Next calculation is done with sky:21 f1dot:3.
With a checkpoint value of 738 the next calculation would be done with sky:124 f1dot:1, which would be correct.
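
To double-check the numbers, here is a small Python sketch of the inverse mapping, under the k/6 + 1 and (k mod 6) + 1 rule quoted above. The function names are invented for illustration and nothing here is taken from the app's source; it only assumes 1-based sky and f1dot indices and 6 f1dot values per sky position.

```python
def checkpoint_index(sky, f1dot, f1dots_per_sky=6):
    """Number of completed steps once the 1-based step (sky, f1dot) has finished.

    Hypothetical helper; assumes the counter simply counts finished steps.
    """
    return (sky - 1) * f1dots_per_sky + f1dot

def resume_point(k, f1dots_per_sky=6):
    """1-based (sky, f1dot) pair to resume at for checkpoint counter k."""
    return k // f1dots_per_sky + 1, k % f1dots_per_sky + 1

k = checkpoint_index(123, 6)
print(k)                        # 738
print(resume_point(k))          # (124, 1)
print(checkpoint_index(21, 2))  # 122
```

So a counter of 738 would resume at sky:124 f1dot:1, while the stored value of 122 corresponds to sky:21 f1dot:2 being the last finished step.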

I can give you data from yesterday's checkpointing too, if you like.

cu,
Michael

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 721699461
RAC: 1133852

RE: I can give you data

Message 98037 in response to message 98036

Quote:

I can give you data from yesterday's checkpointing too, if you like.

cu,
Michael

Thanks, but I think the problem is already clear now. Very annoying but easy to fix; I guess Bernd will create a new app soon to fix this.

CU
HB

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 721699461
RAC: 1133852

BTW, the problem seems to be limited to the Linux version of the app. Windows and Mac users need not worry or consider the workaround discussed above.

CU
HB

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: BTW, the problem seems

Message 98039 in response to message 98038

Quote:

BTW, the problem seems to be limited to the Linux version of the app. Windows and Mac users need not worry or consider the workaround discussed above.

CU
HB

Hi Bikeman,

there is no workaround for me besides killing all work on the hosts that only run part of the day.
Hope this is fixed soon, otherwise client_state.xml will be my friend. ;)

cu,
Michael

[edit: typo]
