I also programmed from the late '70s. The thing then was that hardware was really expensive relative to staff, so yes, you squeezed everything to fit the machine and used an army of programmers to do it. Today the situation is reversed: if you need another GB of RAM, you put it in; if you need a new staff member, you have to think big money.
Today, though, I mostly program deeply embedded systems, often with very small microcontrollers, so again it is a case of managing your resources efficiently.
1) The biggest nightmare I had in the '70s (outside of having to break things into overlays if the code got too big for memory) was communication between the CPUs in use. We should be very happy with the progress of Ethernet. Back then I had one "master controller" running three "slave processors", all of which had to talk over a serial link with home-made semaphores, etc.; there's a sketch of the idea after this list.
2) In what I thought was an interesting thread, someone (who?) over at SETI is trying to get the entire app to run in cache to get more performance.
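For the curious, here's roughly the flavor of those home-made semaphores, sketched in C. Everything in it is made up for illustration (no real hardware; the "slave" side is a stub so the snippet actually runs), whereas a real system would block on UART registers instead:

/* Hypothetical polled handshake between a master and one slave over a
 * serial link -- the "home-made semaphore" idea, not code from any
 * real system. The link is stubbed with a loopback that answers the
 * way a well-behaved slave would. */
#include <stdint.h>
#include <stdio.h>

#define REQ 0x01  /* master: "send me your result" */
#define ACK 0x06  /* slave:  "data follows"        */

/* Stub link: pretends a slave sits on the other end. A real system
 * would have blocking UART read/write routines here instead. */
static uint8_t pending;

static void uart_write_byte(uint8_t b) { pending = b; }

static uint8_t uart_read_byte(void)
{
    static int state;
    if (pending == REQ && state == 0) { state = 1; return ACK; }
    state = 0;
    return 42;                /* the slave's one-byte "result" */
}

/* Master side: raise the request "semaphore", spin until the slave
 * acknowledges, then read the payload. */
static uint8_t master_fetch_result(void)
{
    uart_write_byte(REQ);
    while (uart_read_byte() != ACK)
        ;                     /* busy-wait on the handshake byte */
    return uart_read_byte();  /* one-byte payload for simplicity */
}

int main(void)
{
    printf("slave result: %d\n", master_fetch_result());
    return 0;
}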
Quote:
What about the bigger L2 caches that accompany the quad cores? Is that a feature that can be utilized?
Probably not. Akos was able to tune the S4/S5R1 apps to fit in the L1 caches of AMD processors, and almost managed to do the same with the significantly smaller ones on Intel chips. Unless the current app is much more data-hungry, the only benefit a really large L2/L3 cache is likely to provide is a reduced chance of data being swapped to main memory when a second app is using the core.
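To make the cache-fit point concrete, here's a rough C sketch one could run (the buffer sizes are assumptions; tune them to your own chip). Both calls perform the same total number of memory touches, but only the small buffer stays resident in cache, so it finishes dramatically faster:

/* Rough illustration of why fitting the working set in cache matters:
 * walk a cache-sized buffer and an oversized one, doing equal numbers
 * of touches, and time both. Sizes are guesses, not tuned figures. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double walk(volatile char *buf, size_t size, int passes)
{
    clock_t t0 = clock();
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < size; i += 64)  /* one touch per cache line */
            buf[i]++;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    size_t small = 32 * 1024;        /* roughly L1-sized (an assumption)   */
    size_t large = 16 * 1024 * 1024; /* far bigger than most L2s of the day */
    char *a = calloc(small, 1), *b = calloc(large, 1);
    if (!a || !b) return 1;
    /* Equal total touches (512*102400 == 262144*200), so only locality differs. */
    printf("fits in cache:   %.3fs\n", walk(a, small, 102400));
    printf("blows the cache: %.3fs\n", walk(b, large, 200));
    free(a); free(b);
    return 0;
}

That gap is the whole game: once the hot code and data fit in L1, main-memory latency mostly drops out of the picture.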
Quote:
Also, could calculations be broken up whereby each core takes a part of the calculation? On the machine-language level, taking a 32-bit code and breaking it into four 8-bit codes, one for each core to work on. I don't know if we are talking parallel cores at this level.
What you're describing is impossible, and it wouldn't actually gain anything even if it could be done: each of the 32 bits in an operation is already computed in parallel by the hardware. In theory a work unit could be broken into several parallelizable parts and run concurrently on multiple cores, but even in the best case that wouldn't gain any throughput over running multiple WUs in parallel. In reality, the overhead needed to keep the threads synchronized would result in lower throughput, and Einstein WUs aren't large enough for a faster turnaround to be a real benefit. CPDN, on the other hand, could benefit from this sort of parallelization, but their science app is much more complex, which would make the implementation far harder.
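As a toy illustration of that synchronization overhead (a contrived C/pthreads sketch, nothing to do with the real BOINC apps; build with gcc -O2 -pthread): four threads cooperating on one sum through a mutex do exactly the same arithmetic as four threads each summing independently, yet run many times slower.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N (1L << 22)
#define NTHREADS 4

static double shared_sum;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* One "WU" split across threads: every step needs the lock. */
static void *cooperative(void *arg)
{
    (void)arg;
    for (long i = 0; i < N; i++) {
        pthread_mutex_lock(&lock);
        shared_sum += 1.0;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* One whole "WU" per thread: no sharing, no locks.
 * volatile keeps the compiler from collapsing the loop. */
static void *independent(void *arg)
{
    volatile double *sum = arg;
    for (long i = 0; i < N; i++)
        *sum += 1.0;
    return NULL;
}

/* Run NTHREADS copies of fn and return the wall-clock time taken. */
static double run(void *(*fn)(void *), double *sums)
{
    pthread_t t[NTHREADS];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fn, sums ? &sums[i] : NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    double sums[NTHREADS] = {0};
    printf("cooperative: %.3fs\n", run(cooperative, NULL));
    printf("independent: %.3fs\n", run(independent, sums));
    return 0;
}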
Quote:
Remember too, each core also has a math coprocessor, which could also be doing calculations.
Einstein (and almost all other) science apps are primarily floating-point math, and all floating-point math is done on the coprocessor. The main CPU is a purely integer unit; it is capable of doing floating point in software, but doing so is extremely slow.
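To see why software floating point is so slow, here's a deliberately simplified C sketch of the unpack/compute/repack dance an emulator does for even a trivial operation. Real soft-float routines also handle rounding, denormals, infinities, and NaNs, which is why one emulated op can cost dozens of integer instructions where the FPU needs one:

/* Simplified, incomplete emulation of "multiply a float by two" using
 * only integer operations: unpack the IEEE-754 fields, bump the
 * exponent, repack. Zero, denormals, infinities and NaNs are skipped. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

static float soft_mul_by_two(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);        /* unpack the bit pattern       */
    uint32_t exp = (bits >> 23) & 0xFF;    /* isolate the exponent field   */
    if (exp != 0 && exp != 0xFF)           /* normal number: add 1 to exp  */
        bits += 1u << 23;
    memcpy(&x, &bits, sizeof x);           /* repack */
    return x;
}

int main(void)
{
    printf("%f\n", soft_mul_by_two(3.25f));  /* prints 6.500000 */
    return 0;
}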
Thx.
They were just some crazy ideas. :)
I just ordered a Power Mac with eight 3 GHz Xeon cores. My wallet is a whole lot lighter now. I figure you only live once; might as well go out crunching. Not that I plan on leaving this world anytime soon. Soon I'll be able to compare dual vs. quad cores, although my dual cores are relatively slow at 1.83 and 1.66 GHz. I have been impressed with the crunch times of some of the fast Xeons I have run up against.