Development in hardware – Why?
Option: array of custom processing nodes
Step 1: analyze the application and extract the component tasks
Step 2: design the custom processors
Step 3: program the FPGA
Step 4: assign the tasks to the processors and set up the connection network
← Multi-cellular organization
← ???
← Growth (cellular division)
Development in hardware – Why?
Step 2: as a function of the tasks, design one (or more) custom processors.
[Figure: custom processors assembled from functional blocks (×, +, ÷, ≠, FFT, DCT), with IN and OUT ports]
Cellular differentiation
• Cells adapt their physical structure to fit the “application”
• Can circuits/processors do the same?
  – Physically? No
  – Logically? Yes, but…
• Can they do it easily (dare we say, automatically)?
Cellular differentiation
Needed: adaptable cellular architecture
That is, a processor architecture that is:
• Customizable
• Compact
• Powerful
• Easy to design and modify
• Amenable to evolution and learning
Possible solution: MOVE architectures
The MOVE paradigm
• One single instruction: move
• Data displacements trigger operations
• Architecture based around data (data-centric rather than operation-centric)
• Regular structure: functional units + data network
• Scalable and modular architecture
Example: Sum of two values
Conventional architecture:
  add R1, R2, R3;
MOVE architecture:
  move O(Fxxx), I1(Fsum)
  move O(Fyyy), I2(Fsum)
  move O(Fsum), I(Fzzz)
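The three-move example can be mimicked in a few lines of Python. This is only a sketch of the idea that a data displacement triggers the operation. The class names, the port strings, and the choice of `I2` as the port that fires the addition are assumptions made for illustration; the slide does not say which input of `Fsum` is the trigger.

```python
class AdderFU:
    """Hypothetical functional unit: writing to the trigger port fires the add."""

    def __init__(self):
        self.i1 = 0    # operand register (port I1)
        self.out = 0   # result register (port O)

    def write(self, port, value):
        if port == "I1":
            self.i1 = value              # plain operand latch, nothing is computed
        elif port == "I2":
            self.out = self.i1 + value   # assumed trigger port: the move starts the add
        else:
            raise KeyError(port)

    def read(self, port):
        if port != "O":
            raise KeyError(port)
        return self.out


class RegisterFU:
    """Trivial one-value unit standing in for Fxxx, Fyyy and Fzzz."""

    def __init__(self, value=0):
        self.value = value

    def write(self, port, value):
        self.value = value

    def read(self, port):
        return self.value


def run(program, units):
    """Execute a list of moves given as (src_unit, src_port, dst_unit, dst_port)."""
    for src, src_port, dst, dst_port in program:
        units[dst].write(dst_port, units[src].read(src_port))


units = {"Fxxx": RegisterFU(3), "Fyyy": RegisterFU(4),
         "Fsum": AdderFU(), "Fzzz": RegisterFU()}
program = [
    ("Fxxx", "O", "Fsum", "I1"),   # move O(Fxxx), I1(Fsum)
    ("Fyyy", "O", "Fsum", "I2"),   # move O(Fyyy), I2(Fsum)
    ("Fsum", "O", "Fzzz", "I"),    # move O(Fsum), I(Fzzz)
]
run(program, units)
print(units["Fzzz"].value)  # 7
```

Note that `run` never inspects what a unit does: adding a custom unit only means adding another entry to `units`, which is the property the later slides rely on.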
Cellular differentiation
Main features:
• Conventional fetch/decode mechanism – compatible with bio-inspired mechanisms
• No pipeline: computation carried out in specialized functional units (FU)
• Communication carried out in specialized communication units (CU)
• Only one instruction that MOVEs data to and from the CUs and FUs (dataflow architecture)
Cellular differentiation
Main advantages:
• Can be easily customized by introducing application-specific functional and communication units.
• Perfectly fits the requirements of systolic arrays (arbitrarily complex communication patterns).
• The introduction of custom components does not affect the assembler language, the code structure, the fetch and decode units, or the transport bus.
Example – Automatic Synthesis
[Figure labels: Phenotype Layer, Mapping Layer, Genotype Layer; Application-specific (parallel) functions, Developmental algorithm, Genetic code]
Example – Automatic Synthesis
Phenotype Layer
Mapping Layer
Genotype Layer
Totipotent Cell
Example – Automatic Synthesis
Totipotent Cell
Programmable Logic
Example – Automatic Synthesis
Programmable Logic
Cellular Array
Implementation - The BioWall
Development in hardware – Why?
Option: array of custom processing nodes
Step 1: analyze the application and extract the component tasks
Step 2: design the custom processors
Step 3: program the FPGA
Step 4: assign the tasks to the processors and set up the connection network
← Multi-cellular organization
← ???
← Cell specialization
← Growth (cellular division)
Phenotype Layer
Cell design and specialization
Application code (parallel)
Within a MOVE framework, the specialization (differentiation) of a cell corresponds to the selection of the functional and communication units that can most efficiently implement the desired application.
FU extraction
Extracting the optimal FUs from the code is a complex problem!
FU extraction
How about having a quick peek at biology?
Idea: let us use evolution!!
In fact, this approach is much closer to biology than simply evolving code: in nature, the hardware (the cell) and the software (the genome) have evolved together!
FU extraction
Idea: let us use evolution!!
FU extraction
First step: profiling the code (standard compilation technique)
FU extraction
Second step: transform into tree (standard compilation technique)
Third step: represent as 1-D genome
Fourth step: run the GA (with some fancy optimizations)
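Steps two to four can be sketched as follows, under several assumptions that are not in the slides: the code tree is a nested tuple, the 1-D genome holds one bit per operator node (1 meaning "implement the subtree rooted here in hardware"), and the GA is a bare elitist loop with single-bit mutation. The toy fitness is purely illustrative, and none of the "fancy optimizations" are modelled.

```python
import random


def internal_nodes(tree, out=None):
    """Pre-order list of operator nodes; leaves are plain strings."""
    if out is None:
        out = []
    if isinstance(tree, tuple):
        out.append(tree)
        for child in tree[1:]:
            internal_nodes(child, out)
    return out


def evolve(n_genes, fitness, generations=50, pop_size=20, seed=0):
    """Minimal generational GA over bit-string genomes (elitism + mutation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the better half unchanged
        children = []
        for parent in parents:
            child = parent[:]
            child[rng.randrange(n_genes)] ^= 1  # single bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Second step: the code as a tree. Third step: one gene per operator node.
tree = ("+", ("*", "a", "b"), ("-", ("*", "c", "d"), "e"))
nodes = internal_nodes(tree)


def toy_fitness(genome):
    """Assumed scoring: multiplications gain 2 in hardware, every block costs 0.5."""
    return sum((2.0 if node[0] == "*" else 0.0) * g - 0.5 * g
               for node, g in zip(nodes, genome))


# Fourth step: run the GA on the genome.
best = evolve(len(nodes), toy_fitness)
print([node[0] for node in nodes], best)
```

In a real flow the toy fitness would be replaced by the relative fitness defined on the next slide, evaluated from size and timing estimates.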
Fitness evaluation
s = size of the new processor
t = execution time of the program on the new processor
α = execution time of the program on a minimal processor
β = hardware area to implement the minimal processor (which has, by definition, a fitness of 1)
hwLimit = maximum hardware allowed to implement the new processor
Note:
• Relative fitness function
• When out of allowed hardware range, logarithmic decrease
• The hardware investment has to be small enough to be retained
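The formula itself is not part of this transcript, so the function below is not the authors' fitness. It is one hypothetical form that satisfies the stated properties: it is relative (the minimal processor, with s = β and t = α, scores exactly 1), and it falls off logarithmically once s exceeds hwLimit. The exact shape of the penalty is an assumption.

```python
import math


def relative_fitness(s, t, alpha, beta, hw_limit):
    """Hypothetical relative fitness; NOT the formula from the original slide.

    s, t       size and execution time of the candidate processor
    alpha      execution time on the minimal processor
    beta       area of the minimal processor
    hw_limit   maximum area allowed for the candidate
    """
    if min(s, t, alpha, beta, hw_limit) <= 0:
        raise ValueError("all quantities must be positive")
    if beta > hw_limit:
        raise ValueError("the minimal processor must fit within hw_limit")
    speedup = alpha / t
    if s <= hw_limit:
        return speedup
    # Out of the allowed range: damp the score logarithmically with the overshoot.
    return speedup / (1.0 + math.log(s / hw_limit))


print(relative_fitness(s=100, t=50, alpha=50, beta=100, hw_limit=110))  # 1.0
```

With this shape a candidate that doubles the speed but needs twice the allowed area still scores above 1, only less than a candidate reaching the same speed within the limit; a steeper penalty may be closer to what the slide intends.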
Determining hardware size
How can the size of the new FU be estimated (the β parameter of the fitness)?
The idea:
• Determine the size of each basic building block (+, -, AND, …)
• What to do with assignments or loops?
• Compute how many of them are used for a new FU
• The characterization has to be done for every target platform.
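A minimal sketch of this counting scheme: characterize each building block once per platform, count how often each one appears in the candidate FU, and sum. The sizes in `BLOCK_SIZE` are made-up placeholders, the tuple tree format is an assumption, and assignments and loops are left out, as the slide leaves that question open.

```python
from collections import Counter

# Hypothetical per-block sizes (say, in logic cells) for ONE target platform.
BLOCK_SIZE = {"+": 8, "-": 8, "AND": 1, "*": 64}


def count_ops(tree, counts=None):
    """Count how many times each operator occurs in a nested-tuple tree."""
    if counts is None:
        counts = Counter()
    if isinstance(tree, tuple):
        counts[tree[0]] += 1
        for child in tree[1:]:
            count_ops(child, counts)
    return counts


def estimate_size(tree, block_size=BLOCK_SIZE):
    """Estimated FU size = sum over operators of (block size x occurrences)."""
    return sum(block_size[op] * n for op, n in count_ops(tree).items())


fu = ("+", ("*", "a", "b"), ("AND", "c", "d"))
print(estimate_size(fu))  # 73
```

Retargeting to another platform only means swapping the `block_size` table.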
Determining hardware execution time
Use the same idea used for size:
• Compute the time needed for each elementary function
• Take the targeted clock period as a basis
• When the estimated time > clock period, add 1 to the total time → small jumps in the fitness landscape
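One way to read this rule is sketched below: walk the chain of elementary functions, accumulate their delays, and start a new clock cycle whenever the accumulated delay would exceed the clock period. The integer picosecond units and the restart of the accumulator at each cycle boundary are assumptions; the slide only states that 1 is added to the total time.

```python
def estimate_cycles(delays_ps, clock_period_ps):
    """Estimate clock cycles for a chain of elementary functions.

    delays_ps        per-function delays along the chain, in picoseconds
    clock_period_ps  targeted clock period, in picoseconds
    """
    cycles = 1
    elapsed = 0
    for delay in delays_ps:
        if delay > clock_period_ps:
            raise ValueError("a single function does not fit in one clock period")
        if elapsed + delay > clock_period_ps:
            cycles += 1       # estimated time exceeds the period: add 1
            elapsed = delay   # assumed: the chain restarts in the new cycle
        else:
            elapsed += delay
    return cycles


print(estimate_cycles([400, 400], 1000))       # 1
print(estimate_cycles([400, 400, 400], 1000))  # 2
```

Because the result is a whole number of cycles, adding one more operator either changes nothing or costs a full cycle, which is where the small jumps in the fitness landscape come from.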
Pattern-matching optimization
How to find reusable FUs?
• The GA behaves a bit like random mutations → difficult to find reusability this way
• Help the GA a bit: each time a new HW block is defined, search the whole tree to replace similar pieces of code
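The search-and-replace step might look like the sketch below. Here "similar" is taken to mean the same operators in the same positions, whatever the leaf operands are; that definition, the tuple tree format, and the `HW_MAC` block name are all assumptions.

```python
def same_shape(a, b):
    """True when two trees have the same operators in the same positions."""
    if isinstance(a, tuple) and isinstance(b, tuple):
        return (a[0] == b[0] and len(a) == len(b)
                and all(same_shape(x, y) for x, y in zip(a[1:], b[1:])))
    # A leaf only matches another leaf, never a whole subtree.
    return not isinstance(a, tuple) and not isinstance(b, tuple)


def leaves(tree):
    """Left-to-right list of leaf operands."""
    if isinstance(tree, tuple):
        return [leaf for child in tree[1:] for leaf in leaves(child)]
    return [tree]


def replace_matches(tree, pattern, block_name):
    """Replace every subtree shaped like `pattern` by a call to the HW block."""
    if same_shape(tree, pattern):
        return (block_name, *leaves(tree))
    if isinstance(tree, tuple):
        return (tree[0], *(replace_matches(c, pattern, block_name)
                           for c in tree[1:]))
    return tree


pattern = ("+", ("*", "x", "y"), "z")   # newly defined HW block: multiply-add
code = ("-", ("+", ("*", "a", "b"), "c"), ("+", ("*", "d", "e"), "f"))
print(replace_matches(code, pattern, "HW_MAC"))
# ('-', ('HW_MAC', 'a', 'b', 'c'), ('HW_MAC', 'd', 'e', 'f'))
```

Both occurrences are rewritten in a single pass, so a block discovered once by the GA is reused everywhere it fits.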
Non-optimal block pruning
• “Cleaning” phase made at each step
• Removes HW blocks that are non-optimal from the fitness point-of-view
• To see if a block is useful, compute the fitness with and without this block implemented in HW. If the software solution has a better fitness, the block is non-optimal and can be removed.
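The with/without comparison can be sketched as a simple loop. The fitness function is passed in as a parameter; the toy one used here, with its gain and cost tables and block names, is hypothetical and only serves to make the example runnable.

```python
def prune_blocks(blocks, fitness):
    """Drop every hardware block whose removal improves the fitness.

    blocks   iterable of block names currently implemented in hardware
    fitness  callable taking a set of hardware blocks and returning a score
    """
    kept = set(blocks)
    for block in sorted(blocks):           # fixed order keeps the result repeatable
        without = kept - {block}
        if fitness(without) > fitness(kept):
            kept = without                 # the software version wins: remove it
    return kept


# Hypothetical numbers: speedup gained versus area cost of each block.
GAIN = {"mac": 1.0, "shift": 0.1, "cmp": 0.3}
COST = {"mac": 0.4, "shift": 0.3, "cmp": 0.2}


def toy_fitness(hw_blocks):
    return 1.0 + sum(GAIN[b] - COST[b] for b in hw_blocks)


print(sorted(prune_blocks({"mac", "shift", "cmp"}, toy_fitness)))  # ['cmp', 'mac']
```

Here `shift` costs more area than it saves in time, so it is the only block sent back to software.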
FU extraction - Interface
STANDARD DOMAIN
FU extraction - Results
Example (functions from FACT factorization algorithm):
• Hardware increase (estimated): 10% (fixed)
• Speedup (estimated): 2.27 (227%)
Other results:
All were obtained in a few seconds