Yeah, I think there's quite a back story here. From the 21/08/2013 update:
Quote:
After 5 years of having to constantly “do more with less” it finally looks like our ship has come in! I can’t say more than that for now, but I will say that the stronger Adapteva is financially the more likely it is that the Parallella platform will be a long term success!
but from a 27/09/2013 forum post:
Quote:
Sorry for the lack of communication!! We have been in a pretty delicate position (nothing related to the board or the chips). Hopefully some day I can tell everyone the whole horrific story...
FWIW my guess is that they've been occupied at a business, not technical, level with a potential big backer or contract or somesuch that didn't go as well as hoped .....
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
On every clock cycle, the following operations can occur:
- 64 bits of instructions can be fetched from memory to the program sequencer.
- 64 bits of data can be passed between the local memory and the CPU’s register file.
- 64 bits can be written into the local memory from the network interface.
- 64 bits can be transferred from the local memory to the network using the local DMA.
Oh, I see. But being able to perform one such action per cycle only gives you the throughput. It doesn't tell you how long it will take to finish these actions, i.e. the latency. From your other post:
Quote:
Every router in the mesh is connected to the north, east, west, south, and to a mesh node. Write transactions move through the network, with a latency of 1.5 clock cycles per routing hop. A transaction traversing from the left edge to right edge of a 64-core chip would thus take 12 clock cycles.
That's what I was getting at: the latency to finish that write depends on the distance between the cores on the chip and can be much more than 5 clocks (still fast, though!). I hope we didn't just talk about different "how long"s all the time: "How long does it take the sender to send the write?" versus "How long does it take for the write to arrive?". Actually, your initial statement was "I can write into the memory (or was it register?) of another core in 5 cycles". So that's the total latency to finish the write.
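To put rough numbers on that distance dependence, here is a minimal sketch. The 1.5 cycles per hop comes from the quoted text; the hop count itself (Manhattan distance between cores plus one hop for the final router-to-node step) is purely my assumption, picked because it reproduces the quoted 12 cycles for an edge-to-edge trip across 8 columns. The function names are made up for illustration.

```python
# Rough model of mesh write latency on an 8x8 (64-core) grid.
# ASSUMPTION: hops = Manhattan distance + 1 (router-to-node step). This is a
# guess that happens to reproduce the quoted edge-to-edge figure of 12 cycles.
CYCLES_PER_HOP = 1.5  # figure taken from the quoted text


def hops(src, dst):
    """Routing hops between two cores given as (row, col) tuples."""
    (r0, c0), (r1, c1) = src, dst
    return abs(r1 - r0) + abs(c1 - c0) + 1


def write_latency_cycles(src, dst):
    """Estimated cycles for a write to travel from src to dst."""
    return CYCLES_PER_HOP * hops(src, dst)


if __name__ == "__main__":
    print(write_latency_cycles((0, 0), (0, 1)))  # direct neighbour
    print(write_latency_cycles((0, 0), (0, 7)))  # left edge to right edge
    print(write_latency_cycles((0, 0), (7, 7)))  # corner to corner
```

Under this model a neighbour write costs 3 cycles, edge to edge costs 12, and corner to corner costs 22.5, so the spread between best and worst case is what matters when placing communicating tasks.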
Quote:
In theory at least, one might 'unroll' a loop to perform the same essential calculations on several cores, with each core doing what might have been done for a single loop iteration ie. accounting for different values of whatever loop variable(s) would have otherwise been updated per round of the loop.
That's what I'm occasionally using in MATLAB with a parfor loop. The overhead there is significant, though: individual iterations have to run for tens or, better, hundreds of ms before this provides any benefit, which greatly limits its applicability. So I'm a bit jealous of what you could do at a low level. On the other hand I'm not all that keen on spending the time to hand-tweak such details ;)
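A back-of-the-envelope way to see why the iteration time matters so much: the sketch below assumes a fixed dispatch overhead per iteration and perfectly even work sharing, which is a simplification and not a description of how parfor actually schedules work. The 10 ms overhead and 4 workers are made-up illustrative numbers.

```python
# Toy model: each iteration costs t_iter_ms of work plus overhead_ms of
# dispatch cost when run in parallel. ASSUMPTION: overhead is per iteration
# and the work divides evenly over the workers.


def parallel_speedup(t_iter_ms, overhead_ms, workers):
    """Serial time per iteration divided by effective parallel time."""
    parallel_ms = (t_iter_ms + overhead_ms) / workers
    return t_iter_ms / parallel_ms


def break_even_iter_ms(overhead_ms, workers):
    """Iteration time above which the parallel loop beats the serial one."""
    return overhead_ms / (workers - 1)


if __name__ == "__main__":
    for t in (1, 10, 100):
        print(t, "ms per iteration ->", round(parallel_speedup(t, 10, 4), 2), "x")
    print("break-even at", round(break_even_iter_ms(10, 4), 2), "ms")
```

With these numbers a 1 ms iteration runs slower in parallel than serially, while a 100 ms iteration gets most of the 4x, which matches the rule of thumb above.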
Quote:
A subtle bit here is we are using RISC processors which by definition will/may/could have an expanded code memory footprint for a given task(s) c/w their CISC cousins ( but not necessarily ).
Considering Parallella is starting from scratch here and that the individual cores are fairly simple, I'd actually expect their instruction footprint to be smaller than x86's. Especially if 16-bit instructions can sometimes be used.
Quote:
Sorry for the lack of communication!! We have been in a pretty delicate position (nothing related to the board or the chips). Hopefully some day I can tell everyone the whole horrific story...
my guess is that they've been occupied at a business, not technical, level with a potential big backer or contract or somesuch that didn't go as well as hoped .....
I suspect a challenge on their intellectual property.
There are some who can live without wild things and some who cannot. - Aldo Leopold
Quote:
Sorry for the lack of communication!! We have been in a pretty delicate position (nothing related to the board or the chips). Hopefully some day I can tell everyone the whole horrific story...
my guess is that they've been occupied at a business, not technical, level with a potential big backer or contract or somesuch that didn't go as well as hoped .....
I suspect a challenge on their intellectual property.
Well they have a new logo but I don't think that was it.
@Rod + @MarkJ : Intellectual property challenge, yeah there's a thought. It'd be the sort of thing a big player might do to squash a start-up, but I speculate. IMHO ( FWIW ) I reckon their design is brilliant so certainly well worth a patent, which they have.
@MrS :
Quote:
I hope we didn't just talk about different "how long"s all the time ...
Oooops, I think there may have been a tad of that. Sorry :-O :-)
Yup, throughput of one per cycle with latency of five cycles.
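To make the throughput versus latency distinction concrete, here is a small sketch. It assumes an ideal pipeline with no stalls or network contention: the first write lands after the full latency, and each later one arrives one cycle after its predecessor. The function name is just for illustration.

```python
# Ideal pipeline model: one write issued per cycle, each taking latency_cycles
# to complete. ASSUMPTION: no stalls, no contention on the mesh.


def cycles_for_writes(n_writes, latency_cycles=5):
    """Cycles from issuing the first write until the last one completes."""
    if n_writes <= 0:
        return 0
    return latency_cycles + (n_writes - 1)


if __name__ == "__main__":
    for n in (1, 10, 100):
        total = cycles_for_writes(n)
        print(n, "writes:", total, "cycles,", round(total / n, 2), "per write")
```

A single write pays the whole 5 cycles, while a long burst approaches one cycle per write, which is why both "how long"s are right depending on what you count.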
Quote:
On the other hand I'm not all that keen on spending the time to hand-tweak such details ;)
One early task for me, when the card arrives, is to create a wide set of suitably parameterised assembler macros. Their implementation of the superscalar aspect is intriguing I think, and with a bit of clever ordering the CPU can really hand off a lot of stuff simultaneously. Here the dependencies within the pipeline can be mitigated by attention to the parallel scheduling rules and cycle separations, i.e. avoid stalls.
Quote:
Especially if 16 bit instructions can sometimes be used.
Yup, using the general registers 0 through 7 with short immediates is now on my list of features to ruthlessly exploit at assembler level. Within 16 bits you would, at most, get room for a signed immediate of 3 bits ( simm3 ), i.e. -4 to +3.
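For anyone wanting to check which constants would fit, the range follows from ordinary two's-complement arithmetic. This is a generic sketch, not tied to any particular assembler; the helper names are hypothetical.

```python
# Range of an n-bit two's-complement immediate, and a fit check.
# Generic arithmetic only; nothing here is specific to one instruction set.


def signed_range(bits):
    """Return (lowest, highest) value representable in a signed field."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def fits_signed(value, bits=3):
    """True if value can be encoded in a signed field of the given width."""
    lowest, highest = signed_range(bits)
    return lowest <= value <= highest


if __name__ == "__main__":
    print(signed_range(3))
    print([v for v in range(-6, 6) if fits_signed(v)])
```

Anything outside that window would need the longer instruction form, so a macro could test the constant first and pick the encoding accordingly.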
Cheers, Mike.
( edit ) More info from Andreas : here and here .... :-)
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Well I must say : I am getting itchy fingers for the alleged imminent Parallella delivery ! :-)
Anyway, while waiting I have produced these musings upon possible approaches to the Parallella for FFT, keeping the horrible mathematical mud in the appendices. :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
On the other hand I'm not all that keen on spending the time to hand-tweak such details ;)
... and I'm glad that there are others who are keen to do so :D
(in a clever way, without wasting their time, of course)
MrS
Scanning for our furry friends since Jan 2002
... just be careful with your pets and such ;)
MrS
Scanning for our furry friends since Jan 2002
I do so love XKCD, it fills the hole left by Gary Larson when he retired. :-)
Cheers, Mike.
( edit ) Subtitle/hover is 'That cat has some serious periodic components' ....
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
He already had over 300 posts when I discovered XKCD some time ago.. went through all of them :)
I even made myself an A0 poster with some old favorites; it's still happily hanging on the bathroom door. Too bad most guests have trouble getting the (slightly) nerdy jokes in English!
MrS
Scanning for our furry friends since Jan 2002