Unless someone is prepared to host an MD5/fanout online lookup list, where we can all chip in details of any data files we happen to have access to?
I wrote a simple script to generate the file_info block ...
Since all the data files and their associated MD5 checksums live in the various (hundreds of) download fanout directories, I assume your script does a search through these various directories until it finds the particular data filename it is looking for. Once it finds the file it can construct the URL and also retrieve the MD5 at the same time.
If you are prepared to share, I would be very interested in perusing a copy of your script.
When I did this trick to grab extra R3 resend tasks, I didn't need to find the data files and their MD5 sums in the fanout structure since I already had what I needed. I have approximately 150 hosts in total and I was using about 20 of them to exclusively crunch R3 resends. If a given host exhausted a particular data frequency, it would normally switch to downloading R4 since I was using the "dual capability" app_info.xml file at the time.
Rather than let a host switch to R4, I "borrowed" a known productive set of data files and MD5 sums from another machine that was still getting work and copied and pasted the information into the statefile of the "dry" host so that it could share in the supply of tasks as well. By repeating that for a number of different known productive data frequencies and for quite a number of different hosts, I ended up being able to grab hundreds of extra resend tasks over the month or so from the time R4 started.
Another trick I used quite successfully was to use the extra days of cache to load up about a 10-12 day supply. Often in doing this, the supply of resends would become exhausted for the particular set of frequencies in use on that host, and all the data files listed in the statefile would become marked for deletion. At that point I would remove all the delete tags and set the cache back down to, say, 1 day only. After several days of allowing the cache to dwindle, I would then set the cache high enough to start looking for work again. In most cases, new resend tasks would have appeared on the server in that interval, so the host could start getting tasks where none had been available several days earlier. Once a host starts feeding again, the scheduler becomes very keen to find even more tasks in frequencies several steps away from the current one. I remember one host ending up with about 70 large data files because I refused to let the files it had get deleted :-).
Cheers,
Gary.
Since all the data files and their associated MD5 checksums live in the various (hundreds of) download fanout directories, I assume your script does a search through these various directories until it finds the particular data filename it is looking for. Once it finds the file it can construct the URL and also retrieve the MD5 at the same time.
Figuring out the file locations is, indeed, the key. Here is a PHP script to walk the download directories and make a master list of the file locations:
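In case that listing doesn't come through, here is a rough stand-in in plain /bin/sh. It is only a sketch, and it leans on an assumption the PHP version may not need: that the web server publishes browsable Apache-style directory indexes for download/ and each fanout subdirectory. If it doesn't, you'll need some other way of enumerating the tree.
#!/bin/sh
#
# Sketch only: build einstein-master-file-list (one URL per line) by
# crawling the download area over HTTP.  Assumes directory indexes are
# enabled for download/ and each fanout subdirectory.
#
PATH=/bin:/usr/bin:/sbin
BASEURL=http://einstein.phys.uwm.edu/download
# print the names linked from one directory index, skipping the sort
# links, absolute links and the parent-directory entry
index_links () {
fetch -q -o - "${1}/" |
sed -n 's/.*href="\([^"]*\)".*/\1/p' |
grep -v '^?' | grep -v '^/' | grep -v '^\.\.'
}
for DIR in `index_links ${BASEURL}`
do
# fanout subdirectories are listed with a trailing slash
case ${DIR} in
*/)
for FILE in `index_links ${BASEURL}/${DIR%/}`
do
echo "${BASEURL}/${DIR}${FILE}"
done
;;
esac
done > einstein-master-file-list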
If you save the output of the above script to a file named einstein-master-file-list (warning: it will take a long time to run to completion), you can then use the following script to generate the file_info block:
#!/bin/sh
#
# emit-einstein-file-info: given a downloaded data file (and, if present,
# its .md5 companion file), print a <file_info> block ready for pasting
# into client_state.xml
#
PATH=/bin:/usr/bin:/sbin
MASTERFILE=einstein-master-file-list
if test "X${1}" = "X"
then
echo "Usage: ${0} filename"
exit 1
fi
FILENAME=${1}
NBYTES=`cat ${1} | wc -c | awk '{print $1}'`
# prefer the project-supplied .md5 file; fall back to computing the sum locally
if test -s ${1}.md5
then
MD5SUM=`cat ${1}.md5 | awk '{print $1}'`
fi
if test "X${MD5SUM}" = "X"
then
MD5SUM=`md5 ${1} | awk '{print $4}'`
fi
# recover the fanout path (download/<dir>/<file>) from the master list
SOURCEPATH=`grep ${1} ${MASTERFILE} | grep -v md5 | cut -d/ -f4-`
echo "<file_info>
    <name>${FILENAME}</name>
    <nbytes>${NBYTES}.000000</nbytes>
    <max_nbytes>0.000000</max_nbytes>
    <md5_cksum>${MD5SUM}</md5_cksum>
    <status>1</status>"
# the big data files are kept sticky for re-use; skygrid files are not
if test `echo ${FILENAME} | grep -c skygrid` -eq 0
then
echo "    <sticky/>"
fi
echo "    <url>http://einstein.phys.uwm.edu/${SOURCEPATH}</url>
    <url>http://einstein.ligo.caltech.edu/${SOURCEPATH}</url>
    <url>http://einstein.astro.gla.ac.uk/${SOURCEPATH}</url>
    <url>http://einstein.aei.mpg.de/${SOURCEPATH}</url>
</file_info>"
That script, call it emit-einstein-file-info, expects the datapack to already be available in the current directory when the script is run, so here's a wrapper script to grab the files and put them in the current directory:
#!/bin/sh
#
# Retrieve a data pack file specified on the command line,
# and create the client_state.xml data block for it
#
# Should be able to copy/paste just the data block into the machine which
# will be doing the work, and the boinc-client will download the data file
# itself
PATH=/usr/bin:/bin:/sbin
cd `dirname ${0}`
if test "X${1}" = "X"
then
echo "Usage: ${0} frequency"
echo "e.g.: ${0} 1004.65"
exit 1
fi
# grab each data file for the requested frequency band, plus its .md5
# companion, then print the matching file_info block
for TARGURL in `grep "${1}.*_S5R4" einstein-master-file-list | fgrep -v .md5`
do
fetch ${TARGURL}
fetch ${TARGURL}.md5
./emit-einstein-file-info `basename ${TARGURL}`
done
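For what it's worth, if you save the wrapper as, say, get-einstein-datapack (the name is just for illustration; pick whatever you like) alongside emit-einstein-file-info, a run for one frequency band could look like this, capturing the emitted blocks for pasting into client_state.xml on the dry host:
# illustrative only: fetch the 1004.65 datapack into the script directory
# and keep the generated file_info blocks for copy/paste
./get-einstein-datapack 1004.65 > file_info-1004.65.txt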
These scripts are set up to run on FreeBSD. Under Linux, the command to get an MD5 checksum is md5sum rather than md5, and its output format is different, so you'll need to adjust that line accordingly. Linux systems generally use wget (or curl) instead of fetch, too. Other things may differ as well, but this should give you a place to start and a nudge in the right direction.
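For instance, the Linux-flavoured substitutions might look roughly like this (a sketch assuming GNU coreutils md5sum and wget):
# in emit-einstein-file-info: md5sum prints the checksum in field 1,
# not field 4 as FreeBSD's md5 does
MD5SUM=`md5sum ${1} | awk '{print $1}'`
# in the wrapper: wget in place of fetch (both save into the current directory)
wget ${TARGURL}
wget ${TARGURL}.md5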