In one of the latest WU batches for the FGRP5 project, there are strange problems with the naming of tasks and data files.
What normal task names and the corresponding file names look like:
LATeah2112F_1384.0_207988_0.0_0
LATeah2112F_1384.0_242638_0.0_0
LATeah2112F_1384.0_608542_0.0_0
LATeah2112F_1400.0_41162_0.0_0
LATeah2112F_1400.0_41184_0.0_0
LATeah2112F_1400.0_41206_0.0_0
What the problematic WU names look like:
LATeah2113F_56.0_1208_-1.6e-11_0
LATeah2113F_72.0_598_-8.499999999999999e-11_0
LATeah2113F_72.0_598_-9.299999999999996e-11_0
LATeah2113F_72.0_56_-8.4e-11_2
LATeah2113F_72.0_762_-2.2999999999999998e-11_0
LATeah2113F_72.0_782_-4.000000000000003e-11_1
LATeah2113F_72.0_782_-4.400000000000004e-11_1
LATeah2113F_72.0_1078_-8.999999999999997e-11_0
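Looks like a rounding issue: some of the zeros appear to be replaced by extremely small "almost zero" floating-point values such as -8.999999999999997e-11 (that is, about -0.00000000009). The 16 significant digits point to a double-precision (FP64) variable rather than FP32, which holds only about 7.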
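A minimal sketch of the rounding effect in Python, whose floats are FP64. The helper names `to_float32` and `accumulate` are made up for this illustration, and the grid (start -1e-10, step 1e-12) is only an assumption about how such values might be generated; it is not taken from the project's workunit generator.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(start: float, step: float, count: int) -> list:
    """Build a grid by repeated addition, a common source of rounding noise."""
    values, x = [], start
    for _ in range(count):
        values.append(x)
        x += step
    return values


seen = -8.999999999999997e-11  # value copied from one of the task names
nominal = -9e-11               # the "clean" value it is close to

# As FP64 the two literals are different numbers ...
print(seen == nominal)
# ... but the gap is far below FP32 precision, so FP32 cannot tell them apart.
print(to_float32(seen) == to_float32(nominal))

# Hypothetical grid (assumed start and step): print the full FP64 digits.
for value in accumulate(-1e-10, 1e-12, 12):
    print(repr(value))
```

If the name were formatted from an FP32 value or with a fixed number of digits, the long tails would not show up, which is why they look like unrounded FP64 output.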
The same applies to the names of some input and output files for such WUs. I quite often see file names like 'LATeah2113F_88.0_2970_-9.099999999999997e-11_0_1'.
It could be just a cosmetic display defect. But I found that many such tasks (probably all, though I didn't check every one because there are too many) also have other, more significant problems:
1 - Checkpointing in such WU batches is partially broken: checkpoints are written at only two points, 45% and ~89.95% (though multiple times around 89%). Depending on CPU speed, many hours can pass between checkpoints, so a lot of useful computation is lost if the computer, BOINC, or the task is restarted, or if BOINC switches between projects frequently (unless the user has enabled the option to leave suspended tasks in memory).
2 - Progress reporting to the BOINC client is disrupted in the same way (visible if the user has disabled BOINC's task progress interpolation via the <fraction_done_exact/> flag, or by checking boinc_task_state.xml in the slot's working directory). WU progress jumps 0% - 45% - 89.95% - 100% without any intermediate values.
Examples of the stderr.txt and boinc_task_state.xml files for a task that has been running ~6 hours on a Ryzen 2700 (about 75-80% done, as one FGRP5 WU usually takes about 7.5-8 hours to finish on this machine):
01:20:19 (1580): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43
01:20:19 (1580): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
01:20:19 (1580): [debug]: 2.1e+015 fp, 4.2e+009 fp/s, 500823 s, 139h07m02s87
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah2113F.dat --alpha 3.5340718238 --delta -1.0671047766 --skyRadius 0.0008901179185 --ldiBins 15 --f0start 72.0 --f0Band 16 --firstSkyPoint 3014 --numSkyPoints 2 --f1dot -5.700000000000008e-11 --f1dotBand 1e-12 --df1dot 1.004320633e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 57569.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out
output files: 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0' 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah2113F_88.0_3014_-5.600000000000008e-11_0_1'
01:20:19 (1580): [debug]: Flags: i386 SSE GNUC X86 GNUX86
01:20:19 (1580): [debug]: Set up communication with graphics process.
read_checkpoint(): Couldn't open file 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out.cpt': No such file or directory (2)
INFO: Major Windows version: 6
% C 1 0
====================
<active_task>
    <project_master_url>https://einstein.phys.uwm.edu/</project_master_url>
    <result_name>LATeah2113F_88.0_3014_-5.600000000000008e-11_0</result_name>
    <checkpoint_cpu_time>11602.730000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>11837.349976</checkpoint_elapsed_time>
    <fraction_done>0.450000</fraction_done>
    <peak_working_set_size>685195264</peak_working_set_size>
    <peak_swap_size>681181184</peak_swap_size>
    <peak_disk_usage>13367</peak_disk_usage>
</active_task>
As you can see, only 1 checkpoint was saved, at 45%, about 3.3 hours (11837 s) after the task started, and none since then (the next ~2.5 h). But near 89% it can write up to 10-15 checkpoints in a row.
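The same check can be scripted. Below is a minimal sketch, assuming boinc_task_state.xml has the layout shown above with `<active_task>` as the root element; the function name `checkpoint_summary` is made up for this example, and the sample text is a shortened copy of the file above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the boinc_task_state.xml content quoted in this post.
SAMPLE = """<active_task>
    <result_name>LATeah2113F_88.0_3014_-5.600000000000008e-11_0</result_name>
    <checkpoint_cpu_time>11602.730000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>11837.349976</checkpoint_elapsed_time>
    <fraction_done>0.450000</fraction_done>
</active_task>"""


def checkpoint_summary(xml_text: str) -> dict:
    """Extract the progress and the time of the last checkpoint from the XML."""
    root = ET.fromstring(xml_text)
    elapsed_seconds = float(root.findtext("checkpoint_elapsed_time"))
    return {
        "result_name": root.findtext("result_name"),
        "fraction_done": float(root.findtext("fraction_done")),
        "checkpoint_elapsed_hours": elapsed_seconds / 3600.0,
    }


summary = checkpoint_summary(SAMPLE)
print(
    f"{summary['result_name']}: {summary['fraction_done']:.0%} at last checkpoint, "
    f"written {summary['checkpoint_elapsed_hours']:.2f} h after start"
)
```

To use it on a live task, read the file from the task's slot directory and pass its text to `checkpoint_summary`.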
Link to an example of such a "bad" WU in the DB: https://einsteinathome.org/task/1704029984
And normal WU for comparison: https://einsteinathome.org/task/1703391437
Note the difference in checkpoints: 12 checkpoints in 2 blocks (1 at 45% and 11 at ~89%) versus 31 checkpoints in 22 blocks written at approximately regular intervals, apart from a run of probably excessive checkpoints written in a row at ~89% there as well.
Copyright © 2024 Einstein@Home. All rights reserved.
Mad_Max wrote:... there are
Values listed in the task name are not used in the calculations being performed. They are probably just there to remind the researchers of the parameter space being probed. So there is nothing "problematic" with an unusual name.
Unfortunately, what you are listing are things that are not problems but simply unavoidable characteristics.
A checkpoint can only be written when a particular set of calculations has been completed. In the parameters you listed in the stderr.txt snippet, there are two key values: --numSkyPoints 2 and --toplist 10. This tells you that for the main calculation stage, which covers the first 90% of progress, there will only be 2 'skypoints' and so only two opportunities to write a checkpoint. If you examine previous tasks, you might find values in the 50 - 100 range, hence many more checkpointing opportunities with those. So checkpoints only at 45% and 90% are quite normal for the current tasks.
The 'toplist' is a list of the top candidates (in this case 10) found in the main analysis. Each of these is 'recalculated' and a checkpoint is written after each one. That's why you are seeing 10 checkpoints during the 90-100% stage.
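The arithmetic above can be sketched as follows. This is a simplified model, not the application's actual logic: it assumes one checkpoint per completed sky point, spread evenly over the first 90%, plus one checkpoint per toplist candidate afterwards. The helper names are made up, and the command line is an excerpt of the options from the stderr.txt quoted earlier.

```python
def parse_options(command_line: str) -> dict:
    """Collect '--name value' pairs from a whitespace-separated command line."""
    tokens = command_line.split()
    options = {}
    for name, value in zip(tokens, tokens[1:]):
        if name.startswith("--") and not value.startswith("--"):
            options[name[2:]] = value
    return options


def expected_checkpoints(num_sky_points: int, toplist: int, main_share: float = 0.9):
    """Return (progress fractions of main-stage checkpoints, follow-up checkpoint count)."""
    main = [round(main_share * i / num_sky_points, 4)
            for i in range(1, num_sky_points + 1)]
    return main, toplist


# Excerpt of the options from the log; the rest are omitted for brevity.
COMMAND_LINE = (
    "--f0start 72.0 --f0Band 16 --firstSkyPoint 3014 --numSkyPoints 2 "
    "--f1dot -5.700000000000008e-11 --f1dotBand 1e-12 --toplist 10"
)

opts = parse_options(COMMAND_LINE)
main, followup = expected_checkpoints(int(opts["numSkyPoints"]), int(opts["toplist"]))
print("main-stage checkpoints at:", main)
print("follow-up checkpoints:", followup)
```

With a value such as 50 for numSkyPoints, the same function yields 50 evenly spaced main-stage checkpoints, which matches the behaviour of the older tasks.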
Cheers,
Gary.
Mad_Max wrote: How
Workunits with few skypoints (e.g. just two), which therefore checkpoint only a few times, typically have a number < 100 immediately after the workunit name's prefix "LATeahNNNNF" (the raw data file name):
...just as the suspected "anomalous" examples above.
Scrooge McDuck
I got the impression that mad_max doesn't mean the NNs, but the part after them ...
cheers
sfv
San-Fernando-Valley
Yes, sure. But Mad_Max also suspected that the rounding anomalies in the file names have something to do with the observed behaviour of these WUs.
I don't know whether these 'anomalies' can be observed only in WUs that checkpoint seldom or in others too. It's not relevant: the numbers are not used in the calculations, as Gary wrote.
My point was that these WUs can easily be identified by the small number (NN) in their WU name, without the need to check log files or the WU's XML configuration. Gary already explained the true reason for the few checkpoints, using the log file and the numSkyPoints command-line parameter.
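A rough sketch of that name check, assuming the fields of a task name are separated by underscores and that the number right after the prefix is the one meant here. The threshold of 100 is the rule of thumb from this thread, not a documented limit, and both function names are made up for the example.

```python
def number_after_prefix(task_name: str) -> float:
    """Return the number that follows the 'LATeahNNNNF' prefix in a task name."""
    return float(task_name.split("_")[1])


def likely_few_checkpoints(task_name: str, threshold: float = 100.0) -> bool:
    """Heuristic from this thread: a small number after the prefix means few sky points."""
    return number_after_prefix(task_name) < threshold


# Task names copied from the first post.
NAMES = [
    "LATeah2112F_1384.0_207988_0.0_0",
    "LATeah2112F_1400.0_41162_0.0_0",
    "LATeah2113F_56.0_1208_-1.6e-11_0",
    "LATeah2113F_72.0_598_-8.499999999999999e-11_0",
]

for name in NAMES:
    print(name, likely_few_checkpoints(name))
```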