The following is a task line:
Einstein@Home Hierarchical S5 all-sky GW search #6 3.01 h1_1066.80_S5R4_1651_S5R6a_0 17:30:56 0.000% 05:07:30 12/18/09 2:53:17 PM Running ...
Any suggestions welcome. I would like to continue to run Einstein@Home.
OK, you have a task named h1_1066.80_S5R4_1651_S5R6a which has clocked up over 17.5 hours of crunch time and made no progress. Interestingly, the remaining time still shows as just over 5 hours which is (I imagine) the value that all your GW tasks start with.
I'd like you to browse your BOINC Data folder and find and enter the slots directory. You will have several folders in there named 0, 1, 2, ..., and you will need to look in each one until you find the one that contains files with the above h1_1066.80_S5R4_1651_S5R6a name as part (at least) of the filename. When you find that particular slot folder, make sure you have a detailed view of the contents (filenames, sizes, dates, etc.) and paste a copy of those details into a message here. In particular, I want to see if a checkpoint file is being created and if any output is being created, including both program output and stderr output.
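If clicking through each slot folder by hand is tedious, the search described above could be sketched in Python roughly as follows. This is only an illustration: the function names `find_task_slot` and `describe_slot` are made up for this example, and you would have to supply the path to your own slots directory, since it depends on where BOINC keeps its data on your machine.

```python
from datetime import datetime
from pathlib import Path


def find_task_slot(slots_dir, task_name):
    """Return the first slot folder holding a file whose name contains
    task_name, or None if no slot matches."""
    for slot in sorted(Path(slots_dir).iterdir()):
        if slot.is_dir() and any(task_name in f.name for f in slot.iterdir()):
            return slot
    return None


def describe_slot(slot):
    """Return (filename, size in bytes, last-modified time) for each file
    in the given slot folder, sorted by filename."""
    rows = []
    for f in sorted(Path(slot).iterdir()):
        if f.is_file():
            info = f.stat()
            rows.append((f.name, info.st_size,
                         datetime.fromtimestamp(info.st_mtime)))
    return rows
```

You would call `find_task_slot(<your slots path>, "h1_1066.80_S5R4_1651_S5R6a")` and then print the rows from `describe_slot` to get the names, sizes and dates in one go.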
I was actually trying to get you to give me a listing of the files that were in the particular slot directory - for example here is a file listing from one of my machines.
In particular, if everything is running normally, you would see a large checkpoint file (.cpt extension), an output file Hough.out, a messages file stderr.txt and a large skygrid file.
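As a rough sketch of that check, the following Python helper takes the filenames from a slot listing and reports which of the expected files are absent. The name `missing_outputs` and the matching rules (a `.cpt` suffix, a `skygrid_` prefix) are assumptions made for illustration, based only on the filenames mentioned in this thread.

```python
def missing_outputs(filenames):
    """Return labels for the expected files that do not appear in a slot
    directory listing. Only names are checked, not sizes."""
    names = [name.lower() for name in filenames]
    checks = {
        "checkpoint (.cpt)": lambda n: n.endswith(".cpt"),
        "Hough.out": lambda n: n == "hough.out",
        "stderr.txt": lambda n: n == "stderr.txt",
        "skygrid": lambda n: n.startswith("skygrid_"),
    }
    return [label for label, matches in checks.items()
            if not any(matches(n) for n in names)]
```

Note that this looks at names only. A skygrid file that is present but tiny would still pass, so the sizes need to be looked at separately.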
Most of the 1 KB files you mention are so small because they are links to the much larger real files that exist elsewhere. When the contents of the slot directory are set up at the start of crunching a task, all these links to the real files are created. Some of the real files are compressed, e.g. skygrid. If you don't have a large file called skygrid_xxxxHz_S5R5.dat in your slot folder, perhaps the problem is to do with decompression of data files right at the outset.
You can generate a list of files in the slot directory by executing the command
dir /a /b /-p /o:gen >filelisting.txt
from within the slot directory. You can prepend a suitable path in front of the output filename 'filelisting.txt' so that the listing doesn't get written into the slot directory if you'd rather not pollute that directory. You can always delete filelisting.txt once you have pasted the contents into a message here.
I appear to have the same issue on my Win7 64 Bit machine with the I5-750 processor. I have no issues running SETI@Home (CUDA and standard), ClimatePrediction.Net, or the Einstein@Home Pulsar Search applications (CUDA and Standard).
As reported, when running the SkySearch application the time to completion shows 4 hours, but after 25 hours I still have 0% completion. I've had this occur on 4 WUs so far and I don't believe I have completed any of these on this machine. One of the initial WUs had gotten to 6% and then effectively froze. I have another machine, a standard Pentium 4, and it doesn't appear to have a problem.
I have noticed this behavior in my Message Log: the app appears to be restarting every three minutes, which may be why it doesn't make any progress.
12/18/2009 6:55:24 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 6:58:30 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:01:32 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:04:38 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:07:40 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:10:46 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:13:48 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:16:54 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:19:57 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:23:00 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:26:02 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
12/18/2009 7:29:08 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301
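To put a number on the restart cadence in the log lines above, a small Python sketch like this could be used. It assumes each message starts with a date, time and AM/PM marker in exactly the format shown in this log; the function name `restart_intervals` is made up for this example.

```python
from datetime import datetime


def restart_intervals(log_lines):
    """Return the number of seconds between consecutive 'Restarting task'
    entries, assuming lines start with 'MM/DD/YYYY H:MM:SS AM|PM'."""
    stamps = []
    for line in log_lines:
        if "Restarting task" not in line:
            continue
        date_part, time_part, ampm = line.split()[:3]
        stamps.append(datetime.strptime(
            f"{date_part} {time_part} {ampm}", "%m/%d/%Y %I:%M:%S %p"))
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


sample = [
    "12/18/2009 6:55:24 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301",
    "12/18/2009 6:58:30 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301",
    "12/18/2009 7:01:32 AM Einstein@Home Restarting task h1_1027.65_S5R4__1317_S5R6a_0 using einstein_S5R6 version 301",
]
print(restart_intervals(sample))  # prints [186, 182]
```

For the first three entries the gaps come out at 186 and 182 seconds, i.e. a restart roughly every three minutes.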
The task manager displays many of these jobs, each using around 104 KB of memory and 0 CPU. Due to the sheer number, I'm simply going to reboot to get them out of memory.
List of files from the Slot directory (I simply suspended the task for now, so I still have everything on my machine):
I appear to have the same issue on my Win7 64 Bit machine with the I5-750 processor....
Yes you do indeed! Thanks very much for your very thorough description of the problem.
If you look at your file listing and compare it with the example I provided, the first and most obvious difference is that mine contains a stderr.txt file and a checkpoint file whereas yours doesn't. The checkpoint file has the .cpt extension. It takes a couple of minutes of initial crunching to get to the point of writing the first checkpoint. Your task is obviously stalling before it gets that far.
The next thing I'd like you to check is the size of the skygrid_1030Hz_S5R5.dat file. There should be a file of exactly the same name in your E@H project folder which should be much smaller in size because it is received compressed. When the task is set up in the slot directory, BOINC decompresses it and retains the same name. The file in the slot directory should be 4 or 5 times larger in size. Can you please confirm the sizes.
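The size comparison requested above could also be done with a couple of lines of Python. This is only a sketch: the names `size_ratio` and `looks_decompressed` are made up for this example, and the threshold of 2.0 is an arbitrary illustrative cut-off below the "4 or 5 times" expected here.

```python
import os


def size_ratio(project_path, slot_path):
    """Return the slot copy's size divided by the project copy's size.
    Assumes the project copy is not empty."""
    return os.path.getsize(slot_path) / os.path.getsize(project_path)


def looks_decompressed(ratio, minimum=2.0):
    """True if the slot copy is at least `minimum` times the size of the
    project copy. The default threshold is an assumption, not a BOINC rule."""
    return ratio >= minimum
```

A ratio well below 1 would suggest the slot copy is still just a link or was never expanded.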
Quote:
If you need more information please post back and I'll respond as quickly as I can.
Gundolf has pointed you to a workaround suggested by Bikeman. In the message immediately after the linked one, I have published the app_info.xml file that you can try out. Read both messages carefully, as the instructions are not trivial, particularly if you haven't used the anonymous platform (AP) mechanism previously.
Essentially, what AP is going to do is bypass the step where a separate application chooses the appropriate level of optimisation to be used. As your CPU can handle SSE2 or higher, the einstein_S5R6_3.01_windows_intelx86_2.exe is the correct app to use and AP will run it directly without letting the switcher app make the choice.
There has been a report already that this workaround has allowed crunching to progress normally and it would be really good to get more such reports if possible so that we can suggest it with confidence until the real problem is corrected. Please be aware that tasks that have already started may very well fail immediately but that new tasks should be OK. Your computers are hidden so I can't see what you actually have in the way of tasks already on your machine.
If you have a number of tasks suspended because of the problem, they will probably fail the minute you resume them. This isn't really a problem at all but it will temporarily reduce your daily allowed limit. It is actually quite easy to retrieve all tasks without having any failures if you are prepared to do a bit of (fairly simple) surgery on your state file (client_state.xml) at the time you insert the app_info.xml and shut down BOINC. However, although simple, it's not something to be undertaken lightly because if you accidentally corrupt your state file you could lose tasks for other projects. If you are comfortable with simple text editing with something like Notepad and are willing to try, I'd be quite happy to explain further.
Found the skygrid_1030Hz_S5R5.dat file in the project directory and it is 589 KB. The version in the slot directory is only 1 KB so something isn't right. I extracted the content of the project version, by renaming it to a zip, and as you stated its expanded size was about 5x larger at 2,797 KB.
Is it as simple as replacing the dat file to at least get the two current WUs processing?
While we are talking about file size, all the other files in the slot directory are 1 KB with the exception of init_data.xml, which is 5 KB. I'm unsure if replacing the dat file alone will truly be enough to resolve this. The einstein_S5R6_3.01_windows_intelx86*.exe files are also way smaller than their equivalents in the project folder.
Quote:
Let us know what you want to do.
I'm holding off on the workaround for the moment because the file sizes all being odd implies that something more may be going on. I'll see what you recommend after this post.
I didn't realize until later, but I also had a lot of conhost processes running before I suspended and rebooted. With the skysearch suspended, I only have 1 active one now so these are probably tied in with the hung skysearch processes.
While we are talking about file size, all the other files in the slot directory are 1 KB with the exception of init_data.xml, which is 5 KB. I'm unsure if replacing the dat file alone will truly be enough to resolve this. The einstein_S5R6_3.01_windows_intelx86*.exe files are also way smaller than their equivalents in the project folder.
That's quite okay, since all those 1KB files are "soft links" to the actual data files. The only other files are the skygrid file, a lockfile, the checkpoint file (.cpt) and two stderr files (.txt, .old).
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
Found the skygrid_1030Hz_S5R5.dat file in the project directory and it is 589 KB. The version in the slot directory is only 1 KB so something isn't right. I extracted the content of the project version, by renaming it to a zip, and as you stated its expanded size was about 5x larger at 2,797 KB.
Looks like there really is a problem with BOINC's unzip libs under Win7.
Quote:
Is it as simple as replacing the dat file to at least get the two current WUs processing?
That would be a good start. It would be interesting to see if that kick starts things.
Quote:
While we are talking about file size, all the other files in the slot directory are 1 KB with the exception of init_data.xml, which is 5 KB. I'm unsure if replacing the dat file alone will truly be enough to resolve this. The einstein_S5R6_3.01_windows_intelx86*.exe files are also way smaller than their equivalents in the project folder.
No need to worry about the 1KB files - links to the real files, just as Gundolf says.
Quote:
I'm holding off on the workaround for the moment because the file sizes all being odd implies that something more may be going on. I'll see what you recommend after this post.
No need to hold off - go for it.
Quote:
I didn't realize until later, but I also had a lot of conhost processes running before I suspended and rebooted. With the skysearch suspended, I only have 1 active one now so these are probably tied in with the hung skysearch processes.
Others seem to experience this as well. Seems to be part of the problem.
Looks like there really is a problem with BOINC's unzip libs under Win7.
Could be, but not necessarily: The absence of the stderr.txt file in the slot directory seems to indicate that the app isn't even started in the first place, so I guess it gets stuck even before it has a chance to unzip anything.
I wonder whether this may be related to anti-virus software as well. When you google for "conhost", you learn that it is a perfectly legitimate process under Win 7 but you also see many reports about false positive alarms by anti-virus software that wasn't certified for Win 7 (e.g. keeping the AV software after migrating from Vista, I guess).
So if all those who experienced this problem would be so kind to tell us what kind of AV software they are using (if any), that might help. If you prefer not to disclose this on the web, you can also send me this info via PM.
Could be, but not necessarily: The absence of the stderr.txt file in the slot directory seems to indicate that the app isn't even started in the first place, so I guess it gets stuck even before it has a chance to unzip anything.
Is the skygrid file in the project directory supposed to remain compressed, with the application unpacking it into the slot directory when it runs?
Quote:
I wonder whether this may be related to anti-virus software as well. When you google for "conhost", you learn that it is a perfectly legitimate process under Win 7 but you also see many reports about false positive alarms by anti-virus software that wasn't certified for Win 7 (e.g. keeping the AV software after migrating from Vista, I guess).
So if all those who experienced this problem would be so kind to tell us what kind of AV software they are using (if any), that might help. If you prefer not to disclose this on the web, you can also send me this info via PM.
Thanks
Bikeman
AV info sent via PM.
I tried simply copying the extracted version of skygrid and it seemed to work until the first checkpoint, at 5 min, and then the task restarted again. The period seemed a bit longer this time. I double-checked Task Manager and I now had 3 einstein_S5R6_3.01_windows_intelx86_2.exe processes stuck, and likely 3 conhost.exe processes with them (ending the process tree for the S5R6 apps also removed 1 conhost each). I only saw 2 restarts in the messages window: the first was the initial one when the app started and the second was the restart reported 5 minutes in. I'm unsure where the 3rd instance of the app came from.
I tried the XML workaround and it didn't really work as expected. Apparently installing it obliterated the two WUs I had for SkySearch. It also seemed unable to download the 3.12 versions of the pulsar application, and it made no attempt to retrieve the CUDA versions of the app. This may have been my fault, since I had initially tried the app_info.xml without the Pulsar Search app, hoping that I could override only the settings for SkySearch. BOINC deleted the local version of the Pulsar Search app when I restarted it.
I saw no attempt to download WUs from the server for SkySearch, so I could not determine whether the new configuration would allow me to run the SkySearch app successfully.
If I understand correctly, all desired apps must be manually specified in the app_info.xml for it to work correctly. It may also be necessary to download the apps manually somehow (I'm unsure whether the issue was a temporary server problem). I removed the app_info.xml temporarily and only received work for the CUDA version of the pulsar search, for which BOINC downloaded all required apps as normal.
In the slot directory listing, init_data.xml is the only file that appears to have been touched after the initial download of the WU.
If you need more information please post back and I'll respond as quickly as I can.
Again, the stderr.txt file is missing in that slot.
It seems to be a problem with i5/i7 hosts running Win7 64-bit. See this thread for a possible workaround.
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)