Parallella, Raspberry Pi, FPGA & All That Stuff

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 316195615
RAC: 336307

HB has just announced a test version on Albert :

Quote:

We do have a NEON enabled ARM Linux app version on Albert@Home now:

Please test at http://albert.phys.uwm.edu


The soon-to-be-arriving Parallella boards have Zynq 7020s ( dual ARM cores plus NEON ) onboard, and you get a full Ubuntu desktop ( boot from SD ) and of course an Ethernet port.

Now while this doesn't involve the Epiphany parallel wizz-fizz aspect yet, it has sprung all sorts of thoughts from my fevered & caffeine-soaked brow ! :-) :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 725259475
RAC: 1207463

Yup, I also had the Parallella in mind when releasing this. Unfortunately I'm only backer #5000ish, so I will have to wait some time before mine will arrive.

It will be really interesting to see what kind of extra performance one can get from the "coprocessor" chip when trying to optimize E@H apps for it, as it has only very limited on-chip RAM and the main RAM is connected to it via a link that is not extremely fast.

Cheers
HB

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 316195615
RAC: 336307

Quote:

Yup, I also had the Parallella in mind when releasing this. Unfortunately I'm only backer #5000ish, so I will have to wait some time before mine will arrive.

It will be really interesting to see what kind of extra performance one can get from the "coprocessor" chip when trying to optimize E@H apps for it, as it has only very limited on-chip RAM and the main RAM is connected to it via a link that is not extremely fast.


Yeah, I've been having looooong thoughts about those communication/bandwidth parameters. The speed asymmetry is around two orders of magnitude, so much so that even a packet-type delivery service with the simplest of wrap/compression followed by unwrap/decompression algorithms might look good. Add that to a hierarchical distribution method - send a sample into one node and let it distribute copies & subsets to the others. The benefit after overhead for this on a 16-core chip might be doubtful, but how might that scale upwards to 4096 ? :-)
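
Here's a minimal C sketch of that hierarchical fan-out, with a hypothetical send_to_core() standing in for whatever transfer primitive ( DMA, or a direct write into another core's local RAM ) the real SDK provides :

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical stand-in for a core-to-core transfer ; the real
       primitive is not modelled here. */
    static void send_to_core(int core, const float *buf, size_t n)
    {
        printf("-> core %d receives %zu elements\n", core, n);
    }

    /* Scatter buf[0..n) across cores lo..hi-1 in O(log cores) rounds :
       the current holder keeps its own slice, hands the upper half of
       the remainder to the leader of the upper core range, and both
       halves then split recursively ( on real hardware, in parallel ). */
    static void fan_out(int lo, int hi, const float *buf, size_t n)
    {
        int cores = hi - lo;
        if (cores <= 1)
            return;                               /* only my slice left  */
        int mid = lo + cores / 2;
        size_t cut = n * (size_t)(mid - lo) / cores;
        send_to_core(mid, buf + cut, n - cut);    /* upper half moves on */
        fan_out(mid, hi, buf + cut, n - cut);     /* ( runs on 'mid' )   */
        fan_out(lo, mid, buf, cut);               /* keep splitting mine */
    }

    int main(void)
    {
        static float sample[4096];                /* one incoming sample */
        fan_out(0, 16, sample, 4096);             /* a 16-core Epiphany  */
        return 0;
    }

Scaling from 16 to 4096 cores only adds fan-out rounds logarithmically, which is what makes the overhead question interesting.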

[ The joy of such a novel development platform is that one can think of and try all sorts of ideas that might be considered whacky elsewhere, and not get too embarrassed! I tend to get a lot of ideas while sitting in the dentist's chair. The stress/need for a significantly focused distraction seems to work well for me there .... :-) ]

ALSO : There has been a late tweak of the Epiphany chip and DRAM voltages, mildly downwards toward the middle of the nominal range. This ought to improve power consumption, heat issues etc, hopefully without impacting switching speed too much.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 577833544
RAC: 196348

Quote:
I tend to get a lot of ideas while sitting in the dentist's chair. The stress/need for a significantly focused distraction seems to work well for me there .... :-) ]


Want some more candy? :D

MrS

Scanning for our furry friends since Jan 2002

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 316195615
RAC: 336307

Quote:
Quote:
I tend to get a lot of ideas while sitting in the dentist's chair. The stress/need for a significantly focused distraction seems to work well for me there .... :-) ]

Want some more candy? :D


Well I'm not yet sure that they are good ideas, just plentiful ! :0)

For instance here's an idea which probably has no practical advantage at all IMHO : the Epiphany cores can place themselves in an idle state, but another core ( or circuitry external to the chip, suitably wired ) can, via an interrupt line, wake it up again. Fair enough. Therein lies a curious possibility, bearing in mind that the memory architecture of the chip is flat and unrestricted ( one core can access any other core's local RAM ). Start with some core ( call it #1 ) and have it execute a section of code held in its local RAM but acting on data somewhere in the global address space ( wherever, it doesn't matter ). Then it puts itself to idle/sleep, but before it does it sends a wakeup to core #2. This core #2 performs another section of code held in its RAM - presumably next in sequence along some logical/task prescription - on the same piece of data. When done it puts itself to sleep and taps the next core ( #3 ) on the shoulder for it to awaken and thence do its bit. Etc.

So an area of data in memory has some series of operations performed on it - just like a single sequential CPU - but each sub-portion of some total algorithm is implemented by each of a succession of cores. No single core holds the entire algorithm. Think of a relay race and a baton getting passed. You'd have a ripple of activity travelling around the chip as each core is activated by the previous one, does its work, activates the next one and then goes to sleep. At least, perhaps, until its 'turn' comes around again ....
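
As a sketch of what each core's tiny program might look like under this relay scheme ( core_idle() and core_wake() are hypothetical stand-ins for the idle state and inter-core interrupt described above, not real SDK calls ) :

    /* One core's entire program under the baton-pass scheme : sleep,
       run my single stage over the shared data, wake my successor,
       sleep again. No core ever holds more than its own stage. */

    typedef void (*stage_fn)(float *data, int n);

    /* Hypothetical primitives - placeholders for the chip's idle
       mechanism and wakeup interrupt. */
    static void core_idle(void)     { /* halt until next interrupt */ }
    static void core_wake(int core) { (void)core; /* raise an IRQ  */ }

    struct relay_core {
        stage_fn stage;   /* the only code fragment this core holds */
        int      next;    /* whom to tap on the shoulder when done  */
    };

    void relay_main(struct relay_core *me, float *shared, int n)
    {
        for (;;) {
            core_idle();              /* wait for predecessor's wakeup */
            me->stage(shared, n);     /* my sub-portion of the total   */
            core_wake(me->next);      /* pass the baton, back to sleep */
        }
    }

The ripple around the chip is then just the chain of 'next' indices ; a subset of cores, or the two-core 'tennis' pair mentioned below, is only a different wiring of that field.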

[ASIDE : This idea comes from two sources ( NB I'm really re-inventing known processing paradigms here within the Parallella setting ). Firstly the production line concept, obviously, but in a Ferrari style where we leave the piece to be worked upon in place and call in a succession of specialist workers to complete the design. Secondly the idea of re-entrancy - if you want several passes over the whole workpiece just do more than one lap/circuit of the entire core-set. But of course, like all computing tasks : something, somewhere has to put a halt on activity and/or output a result. ]

Now why would you bother? At present the only reason I can think of is that you might have some large block of code that can't fit on one core ( currently low memory per core ). But then you would put that on an ordinary sequential machine and forget about Parallella altogether. If nothing else you could trivially demonstrate that a 'parallel machine' can behave sequentially ( big deal! ). You could do tasks with only subsets of the total core set, traversing a linked list of cores. A given 16/64 core set could have several subsets ( non-overlapping ? ) each working on unrelated tasks on separate data loads. Maybe two-core pairs that play 'tennis' with the one 'ball' .... etc

What I'm looking for is permutations of core activity that are in a sense 'minimally-organised' from an external point of view, meaning the bulk of the administrative overhead of workload allocation is done on the Epiphany chip. Think of this like the Roman general Agricola ( later Emperor ? ) who was instructed to go to Britain to govern. He was given an overall task from the Senate - manage Roman interests well - but not given instruction in detail, so he decided upon delegation and resource allocation himself. Something like that. :-)

Cheers, Mike.

( edit ) No, not Emperor. Indeed he did rather too well at his job and was recalled, then sent into retirement from public duties where oddly he died of poison in his food. So it's good to be good, but not to be too good ....

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0

Quote:
Quote:
Quote:
I tend to get a lot of ideas while sitting in the dentist's chair. The stress/need for a significantly focused distraction seems to work well for me there .... :-) ]

Want some more candy? :D

Well I'm not yet sure that they are good ideas, just plentiful ! :0)

[ snip - quoted in full above ]


I can only follow this in a very basic, conceptual sense, but what came to my mind as I read it is the old Monty Python bit about the British "joke warfare" they allegedly used against the Germans. The joke was so funny, it caused the enemy to laugh themselves to death instantly, but that meant no British soldier could understand it, and no British researcher could develop the whole joke or work on its translation, so they divided it up into chunks for different teams to work on. (I suppose this was satire on how the military conducted nuclear weapons research and development.) Am I nuts for seeing a similarity here?

David

Miserable old git
Patiently waiting for the asteroid with my name on it.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 316195615
RAC: 336307

Quote:
I can only follow this in a very basic, conceptual sense, but what came to my mind as I read it is the old Monty Python bit about the British "joke warfare" they allegedly used against the Germans. The joke was so funny, it caused the enemy to laugh themselves to death instantly, but that meant no British soldier could understand it, and no British researcher could develop the whole joke or work on its translation, so they divided it up into chunks for different teams to work on. (I suppose this was satire on how the military conducted nuclear weapons research and development.) Am I nuts for seeing a similarity here?


Not at all, there is a deep similarity indeed with secret weapons development ! Including Monty Python ... :-)

One basic tension in parallel programming is the degree of looseness between computing elements vs their mutual coherence. There's no ideal set point ; it'll be problem-domain dependent.

If on the one hand there is no conceptual overlap b/w work done on different compute units then why bother calling/treating it as parallel ( or sequential for that matter ) ? So if you do a Microsoft Word article about tulips on your Windows machine and I do a digital hardware design for a deepsea robot on a Linux machine, where do they intersect ? Only perhaps at some high level of abstraction.

On the other hand you may do an E@H BRP WU on your Windows machine, and I can do the same WU on a Linux box. Ideally we would view each host as interchangeable. The overlap here is the same data/content and identical algorithms ( albeit different specific implementations ). The quorum validation causes/permits alignment.

In any case there is an administrative layer at a minimum doing two things : splitting & distribution, followed by collection & aggregation. This is largely done server side at E@H.

Cheers, Mike.

( edit ) Regarding Agricola, I love the way Edward Gibbon put it : "Engaged in the pursuit of pleasure, or in the exercise of tyranny, the first Caesars seldom showed themselves to the armies, or to the provinces; nor were they disposed to suffer, that those triumphs which their indolence neglected, should be usurped by the conduct and valor of their lieutenants. The military fame of a subject was considered as an insolent invasion of the Imperial prerogative; and it became the duty, as well as interest, of every Roman general, to guard the frontiers entrusted to his care, without aspiring to conquests which might have proved no less fatal to himself than to the vanquished barbarians." :-)

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 577833544
RAC: 196348

This scheme sounds like one operating mode of the original Cell processor. Here the individual cores (7 in the PS3, 8 for supercomputers) could be daisy-chained to form something I'd just call a macro-pipeline for now (that's too long ago to remember the official name), not unlike a classic vector processor from the '80s, where the individual elements would travel through the pipeline one after the other (not in parallel like today). And the pipeline could be unusually long, since you'd know you're working on long vectors and don't have to be afraid of stalling. Here daisy-chaining more cores should enable one to run more (or more complex) instructions on each element. A massively parallel chip like a GPU, on the other hand, might need to keep an excessive number of threads (elements) in flight to deal with such algorithms. This could overwhelm the registers or caches, in which case performance will plummet.
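
Roughly the shape of the idea in C, with the chain simulated in one process - on Cell each stage would live on its own core and elements would stream between them :

    #include <stdio.h>

    /* A 'macro-pipeline' in miniature : one stage per core, elements
       travelling through the chain one after the other. Longer chains
       cost only pipeline fill, since long vectors keep every stage busy. */

    #define STAGES 4

    static float scale (float x) { return 2.0f * x; }
    static float bias  (float x) { return x + 1.0f; }
    static float square(float x) { return x * x;    }
    static float shift (float x) { return x - 0.5f; }

    static float (*chain[STAGES])(float) = { scale, bias, square, shift };

    int main(void)
    {
        float v[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

        for (int i = 0; i < 8; i++) {         /* elements enter in turn */
            float x = v[i];
            for (int s = 0; s < STAGES; s++)  /* ... and ripple through */
                x = chain[s](x);
            printf("%g ", x);
        }
        printf("\n");
        return 0;
    }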

Maybe such a programming scheme could enable one to parallelize algorithms which contain a comparably large sequential section (per element) at their core. On the other hand.. this sounds more like hand-crafting code than something done automatically. Or do I underestimate compilers?

There's certainly an energy cost involved in communication between cores, even if they're as tiny as Epiphanys. This cost would increase with the distance to the next core. I understand you want to keep the data being worked on in the same place all the time, at least logically. But still.. the computational results have to travel at some point. In some lecture a few years ago I heard there was a paradigm shift starting, in the sense that it might be more energy efficient to recalculate something locally than to pass it around on the chip. This had future single-core P4 designs in mind, though, where you'd have had to drive a >100 mm² die at >4 GHz.

MrS

Scanning for our furry friends since Jan 2002

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 316195615
RAC: 336307

Quote:
This scheme sounds like one operating mode of the original Cell processor. Here the individual cores (7 in the PS3, 8 for supercomputers) could be daisy-chained to form something I'd just call a macro-pipeline for now (that's too long ago to remember the official name), not unlike a classic vector processor from the '80s, where the individual elements would travel through the pipeline one after the other (not in parallel like today). And the pipeline could be unusually long, since you'd know you're working on long vectors and don't have to be afraid of stalling. Here daisy-chaining more cores should enable one to run more (or more complex) instructions on each element. A massively parallel chip like a GPU, on the other hand, might need to keep an excessive number of threads (elements) in flight to deal with such algorithms. This could overwhelm the registers or caches, in which case performance will plummet.


Ultimately the devil is in the detail for such things. Does/can/should the left hand know what the right hand is doing ? In fundamental terms data and code are the same thing : they are information entities with distinct states/representations that have a cost. If I choose to examine a random byte within your computer's RAM I could not by inspection of that byte alone determine whether it was ( part of ) a data element or an instruction. That byte's role is 'revealed' when it gets to the processor - does it go to a register as an operand or pass into the instruction decode circuitry? Either way the processor as a state machine moves to a new configuration. The cost may be measured in time ( cycles ) or space ( bytes ) or energy or whatever is the key metric for a specification. This is the view taken in The Emperor's New Mind ( Roger Penrose ).

So one might have a phone book ( correlating names with numbers ) but you still ( implicitly or otherwise ) require structure to that data and a method to traverse/search the structure. Change the structure and you must change the method - presuming that you want to maintain some total informative content ( however defined ). You can compress the structure but there must be a corresponding expansion of the 'unwrapping' algorithm. If you want a quicker & simpler algorithm then you must accept an expanded data representation. So you could even list the phone book entries quite randomly, there's no law against that, but then your search method must account for that. One can view the entire E@H project as seeking structure within data sets, with some minimum of a priori assumptions, and hence a code base to suit those precepts. You get the idea.
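
The phone book point in two functions : with unstructured data the method must do all the work, while paying to impose order on the data buys a cheaper method - same total informative content, different split between structure and algorithm :

    #include <stdio.h>
    #include <string.h>

    struct entry { const char *name; const char *number; };

    /* Randomly ordered book : the search method carries the burden. */
    static const char *scan(const struct entry *b, int n, const char *who)
    {
        for (int i = 0; i < n; i++)               /* O(n) crawl */
            if (strcmp(b[i].name, who) == 0)
                return b[i].number;
        return NULL;
    }

    /* Sorted book : structure in the data permits a simpler O(log n)
       bisection - but only because we paid for the sort beforehand. */
    static const char *bisect(const struct entry *b, int n, const char *who)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            int c = strcmp(b[mid].name, who);
            if (c == 0) return b[mid].number;
            if (c < 0)  lo = mid + 1;
            else        hi = mid - 1;
        }
        return NULL;
    }

    int main(void)
    {
        const struct entry book[] = { { "Agricola", "55-01" },
                                      { "Gibbon",   "55-02" },
                                      { "Pascal",   "55-03" } };
        printf("%s / %s\n", scan(book, 3, "Pascal"),
                            bisect(book, 3, "Pascal"));
        return 0;
    }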

Thus if one passes the data around cores, you would hopefully do that because it is more efficient to move data than to move/store code. In the former USSR they pioneered the idea of production-line cataract surgery - literally the patients would move along a large room from bay to bay where each of a series of smaller/simpler procedures was performed ; the sum total for a given patient was the removal of a diseased eye lens and replacement with an artificial one. You could deduce this was done to keep the code-per-station low ( a single surgeon with well honed skills has a single task, repeated many times ) and moving the data was easier ( one patient is much like another ). If you have restricted resources of some type then such trade-offs ought to be considered.

Current Epiphany realisations are extremely constrained by both data and code size ( at any given moment ) within a core. However what is a tremendous boon is the fastest/blitziest write hardware, a deliberate design choice, so if one can mold an algorithm around ( synchronous ? ) data writes you can get massive throughput. IF ....

Quote:
Maybe such a programming scheme could enable one to parallelize algorithms which contain a comparably large sequential section (per element) at their core. On the other hand.. this sounds more like hand-crafting code than something done automatically. Or do I underestimate compilers?


With the presently available SDK ( C/C++ ) compilation can be optimised per core, so it is the developer's responsibility to leverage trans-core efficiencies via whatever tricks are deemed suitable.

Quote:
There's certainly an energy cost involved in communication between cores, even if they're as tiny as Epiphanys. This cost would increase with the distance to the next core. I understand you want to keep the data being worked on in the same place all the time, at least logically. But still.. the computational results have to travel at some point. In some lecture a few years ago I heard there was a paradigm shift starting, in the sense that it might be more energy efficient to recalculate something locally than to pass it around on the chip. This had future single-core P4 designs in mind, though, where you'd have had to drive a >100 mm² die at >4 GHz.


Again it depends on what aspect is more precious .... you are right in that you ultimately have to extract an answer, eventually aggregating the work of many cores.

Cheers, Mike.

( edit ) BTW someone is looking at overlays for Epiphany - an old idea redux - where code memory areas become like a hot-bunk on a submarine, rather like data unions.
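
A sketch of that hot-bunk idea with a plain data union - two phases that never run at once sharing the same scarce on-core bytes, which is what code overlays do for instructions :

    #include <stdio.h>

    /* One bunk, two occupants : whichever phase is 'awake' owns the
       bytes. Sizes here are purely illustrative. */
    union scratch {
        float twiddles[512];    /* occupant during the FFT phase   */
        int   histogram[512];   /* occupant during a later phase   */
    };

    int main(void)
    {
        union scratch s;
        printf("one bunk of %zu bytes, two occupants\n", sizeof s);
        (void)s;
        return 0;
    }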

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 316195615
RAC: 336307

FWIW I've broadly decided how I might parallelise the FFT on Epiphany ! If you will bear with my description ( thinking out loud ) ..... :-) :-)

If you examine the matrix approach that I have outlined elsewhere, then a methodology comes to mind.

Firstly : the input data vector ( a one dimensional array, and for a typical E@H application these have ~ 2^22 elements ) is to be partitioned into ( nearly ) equal chunks and distributed one chunk per core. The exact data-load-per-core depends upon how much available memory per core you want to commit, whether you are doing pure reals and/or to/from full complex, in addition to the ( time series ) vector length and chunk choice. I say 'nearly' as such labour division may have a non-zero remainder ( otherwise you bind yourself to only particular transform sizes that neatly place upon an Epiphany chip ), in which case the core taking on the vector's tail, say, can be allocated a quieter time. No matter. Given that the first FFT-type operation on that vector is to permute its elements, each core can thus receive a shuffled chunk during distribution to said cores. I am working on a 'rations distribution' sub-algorithm ( using my Roman Legion model/analogy ) that achieves that. I've discovered a neat & tidy sequence of assembly instructions ( LDR, BITR, LSR and STR ) which does so at a rate of 4 cycles per 32-bit operand, and which if ordered correctly will 'pipeline' nicely within a very short & fast loop*. This will run like a high speed letter 'multiplexer' that you might find at a regional postal sorting centre. Recall that the general register set is nine-way ported, meaning that nine different CPU abilities may be accessed per cycle provided there are no register conflicts/overlaps.
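
In C, the shape of that LDR / BITR / LSR / STR loop - the portable helper below stands in for the single-instruction BITR, and the quoted cycle counts are of course estimates to be verified on silicon :

    #include <stdint.h>

    /* Portable stand-in for Epiphany's BITR : reverse all 32 bits. */
    static uint32_t bitr32(uint32_t x)
    {
        x = (x >> 16) | (x << 16);
        x = ((x & 0xFF00FF00u) >> 8) | ((x & 0x00FF00FFu) << 8);
        x = ((x & 0xF0F0F0F0u) >> 4) | ((x & 0x0F0F0F0Fu) << 4);
        x = ((x & 0xCCCCCCCCu) >> 2) | ((x & 0x33333333u) << 2);
        x = ((x & 0xAAAAAAAAu) >> 1) | ((x & 0x55555555u) << 1);
        return x;
    }

    /* Scatter src[0..n) into dst in bit-reversed order, n = 1 << log2n.
       Per element : load ( LDR ), reverse the index ( BITR ), shift the
       surviving log2n bits down ( LSR ), store permuted ( STR ). */
    void shuffle(const float *src, float *dst, unsigned log2n)
    {
        uint32_t n = 1u << log2n;
        for (uint32_t i = 0; i < n; i++) {
            float x = src[i];                        /* LDR        */
            uint32_t j = bitr32(i) >> (32 - log2n);  /* BITR ; LSR */
            dst[j] = x;                              /* STR        */
        }
    }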

Secondly : the next operation is to pre-multiply ( left side ) that permuted vector by the 'middle' Fourier component matrix. But because of all those zeroes in the suburbs ( that being the point of the FFT matrix factorisation ), only a segment of each row needs to multiply the chunk that a core holds. Clearly you need to match suburb sizes to those chunk lengths ( that's the pivotal point of this approach, actually ) and so the inner product thus achieved is somewhat shorter & quicker, because you avoid having to bother with multiplying by something you already know is zero.
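
Step two per core, in miniature - a dense block standing in for whichever twiddle layout is finally chosen, with the zero suburbs simply never stored or visited :

    #include <complex.h>

    /* Each output element is a short inner product : one row-segment of
       the factor matrix times this core's local chunk. */
    void block_mul(const float complex *block, /* rows x len, row-major */
                   const float complex *chunk, /* this core's data      */
                   float complex *out,
                   int rows, int len)
    {
        for (int r = 0; r < rows; r++) {
            float complex acc = 0.0f;
            for (int k = 0; k < len; k++)     /* only the non-zero      */
                acc += block[r * len + k] * chunk[k]; /* segment        */
            out[r] = acc;
        }
    }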

Thirdly : same idea again but this time with the mixed identity/diagonal array. This gives, per core, a final subset of elements of the output vector ( a frequency spectrum ).

Fourthly : collect up the segments from each core and return to sender ( Roman Legion again ). An inverse of the first steps above, without the need for permutation, but possibly with scaling depending on the transform definition and direction. Indeed all of the above may be internally switched to do either forward or back transform without much grief.

Overall : for a given size transform much can be pre-computed during a setup phase ie. before and outside of some high level looping construct. For that a given core needs to know :

(a) the value of PI ( doh!! ),
(b) operand choices ( real or complex ),
(c) the overall transform size,
(d) the per-core chunk size,
(e) the assigned position along the input vector of its chunk

After all you are not going to set up a 2^22 length transform and then process only one vector! Leave the coefficients etc ( blocks/suburbs ) on core for the entire series of input vector operands. That's OK because Epiphany is a co-processor to be invoked as needed by some leading hardware ( the ARM on the Parallella, or a PC via Ethernet ). This also has the simplicity/advantage of only one sequential read-in ( to a single core which coordinates distribution of data and later gathering up of results ) of the entire input vector, followed by one sequential read-out from the array. All transfers can/will be phrased as write transactions at assembly level.
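
Those per-core givens (a)-(e) collect naturally into a little one-time 'plan', in the spirit of FFTW/cuFFT plans - field names here are purely illustrative :

    #include <stddef.h>
    #include <complex.h>

    /* Everything a core needs to know, computed once during setup and
       then kept on-core for the whole series of input vectors. */
    struct core_plan {
        int            complex_data; /* (b) real or complex operands    */
        size_t         n_total;      /* (c) overall transform length    */
        size_t         n_chunk;      /* (d) this core's chunk length    */
        size_t         offset;       /* (e) chunk position in the input */
        float complex *twiddles;     /* (a) coefficients precomputed    */
                                     /*     from PI, resident on-core   */
    };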

You can hide/wrap all of this behind a few ( CUDA-FFT interface imitated ) C style calls!! The invoking code/hardware doesn't need to know a damn thing about Parallellas or Epiphanys, it's just a pick-up and delivery service. :-)
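
The facade might look like this - names invented, deliberately shaped like cuFFT's plan / execute / destroy triple :

    #include <stddef.h>

    /* Opaque handle : the caller never sees a single Epiphany detail. */
    typedef struct pfft_plan_s *pfft_plan;

    pfft_plan pfft_plan_1d(size_t n, int complex_data);   /* setup once */
    int       pfft_execute(pfft_plan p,
                           const float *in, float *out,
                           int direction);                /* per vector */
    void      pfft_destroy(pfft_plan p);                  /* tear down  */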

Cheers, Mike.

* As Epiphany has dual-issue scheduling rules applying to the RISC instruction pipeline ( ie. superscalar ), for large vectors this will probably amortise to about 2 cycles per 32-bit operand ( meaning full data distribution to all participating cores ). But I'll have to check that number. Keen huh ! :-)

( edit ) Reminder : a 32-bit operand is a single precision float. So that's a single real operand or half a complex one.

( edit ) Clever punters may note that the pattern of sparsity ( where the zero entries are ) of the Fourier coefficients array materially varies from that of the mixed identity/diagonal array. This means step three above is sufficiently different from step two in its traversal of elements to accumulate an inner product. I've yet to decide how to reconcile that ; at worst there might need to be an overlap b/w adjacent cores as to the portions of the input vector that they hold, and/or retrieval of values from a nearby core. Hmmmmm ....

( edit )

Quote:
This will run like a high speed letter 'multiplexer' ...


Oooops, that'd be 'demultiplexer' :-)

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
