Starting a new thread as the other one got hijacked. :-)
I managed to catch an error using Sysinternals' FileMon. The zipped log is about 90 KB, which expands to 1.6 MB.
I don't want to include a block of text of that size in this forum. : )
So if someone wants to review the log, tell me where to send it.
There are some 'name collision' errors.
Copyright © 2024 Einstein@Home. All rights reserved.
Albert error 10 [ Resolved ]
Hi Claude,
Send it to me at wgdebug(at)yahoo(dot)com.
If you continue running FileMon, would you make sure these options are set:
Advanced output
Clock time
Show milliseconds
Would you also get pslist and psservice from Sysinternals, run them, and send the output with the FileMon trace? You can run them from a command prompt, redirecting the output to a file like this:
pslist >pslist.txt
psservice >psservice.txt
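If you'd rather collect both snapshots in one go, here is a minimal Python sketch that does the same thing as the two redirects above. The function names `capture_to_file` and `collect_snapshots` are made up for illustration, and it assumes pslist and psservice are on the PATH.

```python
import os
import subprocess


def capture_to_file(command, out_path):
    """Run `command` and write its stdout and stderr to `out_path`.

    Equivalent in spirit to `command >out_path` at a command prompt,
    except that stderr is captured as well. Returns the exit code.
    """
    with open(out_path, "wb") as fh:
        result = subprocess.run(
            command, stdout=fh, stderr=subprocess.STDOUT, check=False
        )
    return result.returncode


def collect_snapshots(tools=("pslist", "psservice"), directory="."):
    """Run each tool and save its output as <tool>.txt in `directory`.

    Returns a dict of tool name -> exit code, or None if the tool
    could not be started (for example, it is not on the PATH).
    """
    codes = {}
    for tool in tools:
        out_path = os.path.join(directory, tool + ".txt")
        try:
            codes[tool] = capture_to_file([tool], out_path)
        except FileNotFoundError:
            codes[tool] = None
    return codes
```

Calling `collect_snapshots()` from the folder where you want the files should leave pslist.txt and psservice.txt there, ready to attach.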
Thanks,
Walt
Thanks Walt.
Message sent.
Well, I'm still getting error 10's. About 30 to 50% of the WU's error out, at random, on two of my three WinXpHome systems. The three Linux systems haven't had any of these errors.
I don't understand how my two systems are the only ones, out of the thousands of computers doing Albert WU's, to get these errors.
I see no problems from the Seti and Climate Prediction computations, or any of the other processes that run on these systems.
It seems to be something like the program is trying to write to a file before the file is created, or unlocked, or something.
Hi Claude,
The trace you sent shows something is reading the same file Albert is using. Each time Albert writes a block of data, the other application reads the file. When Albert tries to delete the old file, the delete fails with "sharing violation" and the subsequent rename fails with "Name Collision".
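To make that sequence concrete, here is a rough Python sketch of a delete-then-rename update with a retry. This is not what the Albert app actually does; `replace_file` and its parameters are hypothetical, and it assumes a Windows sharing violation shows up in Python as `PermissionError`. The `remove` and `rename` arguments exist only so the failure can be simulated.

```python
import os
import time


def replace_file(new_path, final_path, retries=5, delay=0.2,
                 remove=os.remove, rename=os.rename):
    """Delete `final_path`, then rename `new_path` onto it.

    If another program has the old file open, the delete can fail
    (assumed here to raise PermissionError). Instead of giving up, or
    going on to a rename that would collide with the old name, wait
    and try the whole step again. Returns the attempt number that
    succeeded; re-raises after the last attempt.
    """
    for attempt in range(1, retries + 1):
        try:
            if os.path.exists(final_path):
                remove(final_path)        # fails while a reader holds the file
            rename(new_path, final_path)  # only reached once the old file is gone
            return attempt
        except PermissionError:
            if attempt == retries:
                raise
            time.sleep(delay)
```

Without the retry, a single failed delete leaves the old file in place, which is exactly the situation where the following rename reports a collision.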
I'll send a couple of things to try in email.
Walt
Walt, Et Al,
Received your email, thought about what was similar and different between the three computers, and realized that the two systems that were getting the errors had their BOINC folders 'shared', the other system didn't.
I unshared both folders and ran filemon overnight. I don't see that interleaved write/read in the log this morning, so sharing the folders might be the problem.
I've emailed the log to you, and will continue to monitor for a while, to see if the problem occurs again.
Claude
Hi Claude,
Didn't think about file sharing. Explains a lot actually.
The share by itself won't cause any problem. But it means that a program on another computer could be accessing the BOINC files thru the network. Like a network based file backup utility. Or the BOINC folder was mapped on the other PC, and the utilities were reading the files over the network as though the folder were local.
I looked thru one of the other traces you sent, the one with 10 minutes of all the file I/O, and it suggests that another PC was accessing the BOINC directory. If you look thru that trace, you'll see trace entries like this:
1:05:00.948 PM svchost.exe:732 IRP_MJ_DIRECTORY_CONTROL C:\$Extend\$ObjId SUCCESS Change Notify
That "Change Notify" appears every time a file is changed, renamed or deleted. It notifies an application that a file was changed, in case the program wants to do something with the file. Explorer uses that facility to keep its file list up-to-date, but utility programs also use it to see what files change, especially on remote filesystems accessed over the network.
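If you want to pull those entries out of a big log without scrolling, a small Python sketch like this could do it. It assumes the lines look like the one quoted above (space-separated, clock time with milliseconds), and the list of result strings only covers the ones mentioned in this thread; a real saved log may be laid out differently, so treat the pattern as a starting point.

```python
import re

# Assumption: only the result strings quoted in this thread are handled.
RESULTS = ("SUCCESS", "SHARING VIOLATION", "NAME COLLISION")

LINE_RE = re.compile(
    r"^(?P<time>\d{1,2}:\d{2}:\d{2}\.\d{3} [AP]M)\s+"  # clock time with ms
    r"(?P<process>.+?):(?P<pid>\d+)\s+"                # e.g. svchost.exe:732
    r"(?P<request>\S+)\s+"                             # e.g. IRP_MJ_DIRECTORY_CONTROL
    r"(?P<path>.+?)\s+"                                # file path
    r"(?P<result>" + "|".join(RESULTS) + r")"          # result column
    r"(?:\s+(?P<other>.*))?$",                         # trailing detail, if any
    re.IGNORECASE,
)


def parse_line(line):
    """Split one trace line into named fields, or return None if it doesn't fit."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def change_notifies(lines):
    """Return the parsed entries whose trailing detail is 'Change Notify'."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry and (entry["other"] or "").lower().startswith("change notify"):
            found.append(entry)
    return found
```

Each returned entry is a dict, so `entry["process"]` and `entry["pid"]` tell you which process asked for the notification.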
You could try enabling the share again and see if the problem comes back, although it would be better to check with FileMon than to wait for another error. If you see the excess file activity from "system:4", you can check the active shares like this:
Open Computer Management: right-click "My Computer" and click "Manage". Or click "Start", "Administrative Tools", "Computer Management". If you don't have a menu item for admin tools, look in the Control Panel.
In the Computer Management dialog, double-click "Shared Folders" to open it. In the left pane, click "Sessions" to see what sessions are active, and click "Open Files" to see what files are being used by the other systems. It's not a dynamic display; you have to press the F5 key to refresh it.
If you see open files, run FileMon on the other PC (the one listed in "Sessions"), selecting "Network" in the "Volumes" menu item (unselect the other items). That should show all the remote file activity.
Walt
Walt,
Verified.
I enabled 'share' on blnt30, and FileMon showed a flurry of activity.
'Computer Management' showed the einstein data file open.
I disabled 'share', and FileMon went back to the usual entries.
I had the folder 'shared' because BoincView had used that method to get data from remote systems. When BoincView went to the 'RPC' method I followed, but didn't disable the 'share', as it seemed a convenient way of checking things without having to go to the remote systems ( in another room ).
I can cope with having to walk a few feet to do the checking. :-)
Thanks,
Claude
Walt,
Now all that someone has to do is find out why the Einstein API has trouble with a 'share' being set when Seti & CP don't.
:-)
Hi Claude,
Knowing which program on the "remote" system is accessing the files would be a great help. Running FileMon on the remote PCs should give you that information:
-Enable the share on the local machine.
-Run FileMon on the local machine to verify the excessive file activity is taking place.
-Use the Computer Management console ("Shared Folders", "Sessions") to see which systems are accessing the share.
-Run FileMon on the remote machine:
--Enable "Volume" menu item "Network" (only that one, clear the others)
--Set "Include" filter to "*".
--Remove "Exclude" filter
--Set all the trace options at the bottom of the filter dialog
--Enable options "Advanced output", "Clock time", "Show milliseconds".
-Start tracing.
You should see trace entries for the "remote" file read activity, with the name of the process reading it.
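Once you have the remote trace, a quick tally by process name makes the culprit stand out. Below is a minimal Python sketch under the same assumption about the line layout as the entry quoted earlier in the thread (a `name:pid` token near the start of each line); `tally_processes` is a made-up helper, not part of FileMon.

```python
import re
from collections import Counter

# A process token looks like "svchost.exe:732": it starts with a letter,
# contains no other colon, and ends in a numeric PID. Timestamps start
# with a digit and drive-letter paths have no PID, so neither matches.
PROC_TOKEN = re.compile(r"([A-Za-z][^:\s]*):(\d+)")


def tally_processes(lines, path_fragment):
    """Count trace lines per process for lines mentioning `path_fragment`.

    The comparison is case-insensitive. Only the first process-like
    token on each line is counted.
    """
    counts = Counter()
    needle = path_fragment.lower()
    for line in lines:
        if needle not in line.lower():
            continue
        for token in line.split():
            match = PROC_TOKEN.fullmatch(token)
            if match:
                counts[match.group(1)] += 1
                break
    return counts
```

Something like `tally_processes(open("remote.log"), "boinc").most_common(3)` (with "remote.log" standing in for wherever you saved the trace) lists the busiest readers first.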
Ok, done.
I've emailed you the FileMon log results of this test.
[ It's BoincLogX. ]
:-)
Claude
[edit] Of course, without the share enabled, BoincLogX can't get data from the remote systems. :-(
[/edit]