You're right, it doesn't affect these cpus, and I don't know at the moment whether Athlon XPs are also affected by the performance gap.
Athlon XPs fall into the same category wrt. this problem as Intel Pentium IIIs: yes, the Windows version will run significantly slower (probably roughly 30%) compared to the Linux app. The reason is that the non-SSE2 variant of the "modf" function used by the math lib in the Windows app is very slow indeed. And no, the experimental fix mentioned above won't help, because Athlon XPs (AFAIK) don't support SSE2.
Yeah, I was talking about my Venice box. I haven't done anything about the app on my Core so far, mainly because I don't have a clue what's wrong there. But if it's an individual issue, it doesn't really matter - I don't mind running Linux (was more or less planning it anyway, just hadn't gotten around to making the effort).
I can't say anything about my AMD's performance after our "quick-tuning" yet, since the WU hasn't been sent to anyone else yet and I have no idea what it's worth (it's a 400 MHz one, does that tell you something?). The WU is at 36.6% after about 8.5 hours, which hints at a completion time of around 24 hours or so.
Good morning Annika!
The WU should be in the 300-350 credit range, I guess. The "fix" doesn't seem to level the playing field between the Windows and the Linux app completely, but it should narrow the gap.
It's 400 Hz btw, not MHz (it's somehow related to the spin rate of the pulsars we are looking for, and a pulsar spinning a few hundred million times per second would probably mean a Nobel Prize for its discoverer).
Okay, thanks for explaining. When I get home from Uni (around 7 pm), the WU should be more than half crunched, so I'll be able to get some fairly good estimates. Any idea how big the Win penalty for this kind of box usually is, so I have something to compare to? I've seen everything from people on the board writing about a 20% difference all the way to a friend's Opteron, which is a good 70% (!!!) faster under Linux.
No idea about the penalty for your Venice; I guess Michael and his database of statistics will be helpful there, but he's probably still taking a well-deserved nap. Can hardly wait to see your results!!!
CU
BRM
Hi,
my first result is finished and uploaded; also, some other members of our team have patched their app and successfully finished WUs.
My c/h rose from ~14 to ~19!
This eliminates the AMD/Win penalty. :-)
Some stats from my data:
[pre]
A64: 8.2 - 8.8 [c/(h·GHz)] Linux
A64 X2: 8.2 - 8.8 [c/(h·GHz)] Linux
A64: 4.6 - 5.2 [c/(h·GHz)] Windows
A64 X2: 4.7 - 5.2 [c/(h·GHz)] Windows
[/pre]
Because the Einstein app scales with CPU clock, there is no need to look at the different clock rates; cache size is also pretty uninteresting. The former S5R1 and S5R2 apps ran in L1 cache, and even the smaller cache of Intel cpus was big enough. In my data there are certainly some hosts which are oc'd and therefore influence the results above, but I suppose one can find them in both OS groups.
My first result equals 7.25 [c/(h·GHz)].
I should say that running BOINC natively with only one Einstein app, without CPU affinity, and one VMware Linux cruncher dedicated to one core ended up with about a 50% resource share for each task. But Task Manager showed more than 105,000,000 page faults for the Einstein Win app. VMware, in contrast, only produced 430,000 page faults after running for a couple of days. So maybe running BOINC without another full-load process dedicated to one core alongside it will improve the speed even further. Also, the example imho shows that page faults don't really bother the app and do not dramatically reduce its speed.
When we get other results, we can draw conclusions about this.
The Intel Core cpus show really big differences in my data, and therefore it's impossible to get good stats without knowing the exact clock rate.
When will one of the developers give a statement about this ugly lib issue? ;-)
cu,
Michael
Thanks for the stats, this looks really promising, doesn't it!!! I expected a 30% rise in performance.
Unless you've already done so, I'll drop Bernd an email just in case he has missed the whole discussion.
CU
BRM
Yes, looks very good. :-)
I haven't mailed Bernd, so go ahead.
Btw, I don't think this patch harms any cpus that are not SSE2 capable. There must be another switch in the code to filter out Intel SSE1 and non-SSE cpus. This will probably work on AMD too; it should be something like what's described on that web page about Intel compilers. But if there is some place where SSE1 instructions are used, this might accelerate AMD Athlon XPs too.
But this is just a guess.
cu
Michael
Yes, the detection mechanism seems to me to be as described in the article: first, detect the feature bits to check for SSE2, then check the vendor, and if it is "AuthenticAMD", reset the results just obtained from CPUID to a bare minimum. Not a nice thing to do, IMHO.
I didn't see any SSE instructions, and I doubt very much that Athlon XPs or P IIIs will see any performance increase whatsoever from changing the CPU detection code. For those platforms to reach the same levels of performance as under Linux/gcc, a better implementation of the modf function is needed.
In the meantime, I think it's a matter of courtesy to keep the number of modified clients to a minimum until Bernd OKs the change. Trying out the change was essential to verify our hypothesis, but let's wait for the official OK before everybody patches the app. If 1000 people are patching and one of them makes a mistake, it can mess up quite a few results. As a software engineer, I'd prefer that the new version be formally tested, approved, and only then released with a new version number before it's widely used, so any negative effects are traceable.
Hey guys, just a quick update. My WU has about 2 hours left; total crunching time should amount to between 20.5 and 21 hours. I still don't know the exact credit value, though. Btw, I'm getting a friend from Uni to check this with one or two of his AMD boxes, so we'll get some more results. Mailing Bernd is a great idea imo.