I am using BOINC 6.10.44 and my option is set to 0 (no restrictions). All my other projects (SETI, QMC, QuantumFire, AQUA, CPDN) proceed regularly.
Tullio
What I see with the "top" command is that S5GC1 is still using 4.6% of my 5 GB of RAM (my Linux kernel is PAE) while suspended.
That's fine. The idea of the workaround is that the app remains in memory and is just halted, so when it resumes it does not have to re-read the checkpoint file.
I have a suspicion that the problem is related to reading the checkpoint file. I haven't had time to look at the source code in detail, though, so this is just a gut feeling; it could be something completely different. The devs are informed and will look into this tomorrow, I guess.
CU
HB
It just restarted from the latest progress %, without going back. Maybe we can see the light at the end of the tunnel.
Tullio
Same problem here. First pic before shutdown yesterday:
http://img442.imageshack.us/img442/4580/boinc5.jpg
Next pic today after restart:
http://img194.imageshack.us/img194/1871/boinc6.jpg
stderr.txt in slot 0:
[pre]2010-05-16 12:31:57.2519 (4662) [debug]: Successfully read checkpoint:122
% --- Cpt:122, total:834, sky:21/139, f1dot:3/6
2010-05-16 12:31:57.2549 (4662) [normal]: sky:21 f1dot:3 CG:9881 FG:10423949
2010-05-16 12:32:23.8038 (4662) [normal]: sky:21 f1dot:4 CG:9881 FG:10423949
c
2010-05-16 12:32:50.1896 (4662) [normal]: sky:21 f1dot:5 CG:9881 FG:10423949
2010-05-16 12:33:16.6749 (4662) [normal]: sky:21 f1dot:6 CG:9881 FG:10423949
2010-05-16 12:33:43.0161 (4662) [normal]: sky:22 f1dot:1 CG:9881 FG:10423949
c[/pre]
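For anyone who wants to cross-check a status line like the one in this log, a small parser can pull out the counters and compare them. This is only a sketch: the regular expression is inferred from the two status lines quoted in this thread, not from the app's source, and `parse_status` is a made-up helper name.

```python
import re

# Pattern inferred from the status lines quoted in this thread
# (an assumption, not taken from the app's source code).
STATUS_RE = re.compile(
    r"Cpt:(\d+), total:(\d+), sky:(\d+)/(\d+), f1dot:(\d+)/(\d+)"
)


def parse_status(line):
    """Return the six counters of a '% --- Cpt:...' status line as a dict."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError("not a status line: %r" % line)
    keys = ("cpt", "total", "sky", "n_sky", "f1dot", "n_f1dot")
    return dict(zip(keys, map(int, match.groups())))


status = parse_status("% --- Cpt:122, total:834, sky:21/139, f1dot:3/6")

# The total should equal sky positions times f1dot values per position.
assert status["total"] == status["n_sky"] * status["n_f1dot"]

# Share of the work unit that the checkpoint counter represents.
print("progress: %.1f%%" % (100.0 * status["cpt"] / status["total"]))
```

Comparing the printed share with the progress shown in the BOINC manager before the shutdown gives a quick indication of how much work the restart has thrown away.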
As you can see, checkpoint 122 was read successfully, but the calculation continues at sky position 21, f1dot 3/6.
?????
Does checkpoint 122 contain the wrong sky position?
Compare this with the start of a WU:
[pre]2010-05-15 22:20:09.2588 (15720) [normal]: INFO: No checkpoint h1_0492.95_S5R4__114_S5GC1a_1_0.cpt found - starting from scratch
% --- Cpt:0, total:834, sky:1/139, f1dot:1/6[/pre]
cu,
Michael
On my old computer it's running normally. I just do things the handcrafted way:
I take the WUs one at a time.
Actually, the only problem I have is with my avatar.
Regards
just a poet
Hi Michael!
Unfortunately the first screenshot doesn't load at the moment, but it would be interesting to see the percentages. Can you give them here?
About checkpoint 122 and sky position 21: it's not quite like that. The main loop of the app goes over a certain number of sky positions, and for each sky position, 6 different spin-down values are tried (that's where the strange name comes from: f1dot = first derivative of the frequency with respect to time). The app checkpoints after each spin-down value, not just after each sky position. So if N is the number of sky positions to look at in the WU, the total number of checkpoints is N * 6. If you load the checkpoint with index k, you have to resume with sky position k/6 + 1 (integer division; I guess the index starts with 1) and spin-down value (k mod 6) + 1.
So I don't see anything wrong with the debug output so far. The problem seems to happen when writing the checkpoint, not when reading it: it seems that the sky-position index is saved as the checkpoint counter.
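The index arithmetic can be written down as a short sketch. The function names are made up for illustration, and the 1-based numbering is the guess made in this post, although it does agree with the two status lines quoted in this thread (Cpt:0 gives sky 1, f1dot 1; Cpt:122 gives sky 21, f1dot 3).

```python
def resume_position(k, n_f1dot=6):
    """Map a checkpoint counter k to the (sky, f1dot) pair to resume with.

    Both returned indices are 1-based, as in the log output.
    n_f1dot is the number of spin-down values tried per sky position.
    """
    sky = k // n_f1dot + 1
    f1dot = k % n_f1dot + 1
    return sky, f1dot


def checkpoint_counter(sky, f1dot, n_f1dot=6):
    """Inverse mapping: the counter at which (sky, f1dot) is the next step."""
    return (sky - 1) * n_f1dot + (f1dot - 1)


print(resume_position(0))         # start of a WU
print(resume_position(122))       # counter read back after the restart
print(checkpoint_counter(21, 3))  # round trip back to the counter
```

Having the inverse mapping as well makes it easy to work out which counter value a given log line such as "sky:21 f1dot:3" should correspond to.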
Thanks for posting this
hb
New link: http://img243.imageshack.us/img243/4580/boinc5.jpg
You are probably right... I thought about it later and agree that it could be caused by writing the checkpoint data too.
The last checkpoint written yesterday was:
[pre]2010-05-16 04:08:02.4796 (15720) [normal]: sky:123 f1dot:6 CG:9881 FG:10423949[/pre]
So, going by the k/6 + 1 rule, the checkpoint index should be (123 - 1) * 6 + 6 = 738, but it probably isn't.
It is read as 122(???) and transformed to sky:21 f1dot:2.
The next calculation is done with sky:21 f1dot:3.
With a checkpoint value of 738, the next calculation would be done with sky:124 f1dot:1, which would be correct.
I can give you data from yesterday's checkpointing too, if you like.
cu,
Michael
Thanks, but I think the problem is already clear now. It is very annoying but easy to fix; I guess Bernd will build a new app soon.
CU
HB
BTW, the problem seems to be limited to the Linux version of the app. Windows and Mac users need not worry or consider the workaround discussed above.
CU
HB
Hi Bikeman,
There is no workaround for me besides killing all work on the hosts that only run part of the day.
Hope this is fixed soon, otherwise client_state.xml will be my friend. ;)
cu,
Michael