CNR-Bioinformatics, Dec. 19, Napoli

Many-core processors: the integrated approach to the computational and execution models

Lorenzo Verdoscia and Roberto Vaccaro

Institute for High Performance Computing and Networking

National Research Council, Italy

lorenzo.verdoscia@na.icar.cnr.it

The Landscape of Parallel Computing Research: A View From Berkeley http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

From our architectural point of view, this new trend raises at least two questions: how to exploit such spatial parallelism, and how to program such systems.

What is D3AS

The first query brings us to seriously reconsider the dataflow paradigm, given the fine-grain nature of its operations. In fact, instead of carrying out a set of operations in sequence, as a von Neumann processor does, a many-core dataflow processor could calculate a function by first connecting and configuring a number of identical simple cores as a dataflow graph and then letting data flow asynchronously through them.

What is D3AS

The second query brings us to seriously reconsider the functional programming style, given its intrinsic simplicity in writing parallel programs. In fact, functional languages have three key properties that make them attractive for parallel programming:

■ they have powerful mechanisms for abstracting over both computation and coordination;

■ they eliminate unnecessary dependencies;

■ their high-level coordination achieves a largely architecture-independent style of parallelism.

What is D3AS

Agenda

■ The hHLDS model

■ CHIARA language

■ Dataflow graph generation and mapping

■ D3AS general architecture

■ Future work

D3AS (Demand Data Driven Architecture System):

a high performance reconfigurable computing system demonstrator, which exploits FPGA technology where

the computational model is functional

the execution model is dataflow

and whose architecture is highly scalable, with nodes characterized by

dynamic configurability

transparent hardware reconfiguration

Design methodology:

develop the right computational model alongside languages & hardware

The methodological approach

[Figure: diagram relating the Computational Model (functional, demand driven, CHIARA language; dataflow, data driven, homogeneous High-Level Dataflow System), the reconfigurable Physical Model, and the Real Machine (hundreds of thousands of identical MPFUs).]

Firing rules in the classical model

Let $A = \{a_1, \dots, a_n\}$ be the set of actors

and $L = \{l_1, \dots, l_m\}$ be the set of links

A dataflow graph is a labelled directed graph

$G = (N, E)$ where

$N = A \cup L$ is the set of nodes

$E \subseteq (A \times L) \cup (L \times A)$ is the set of edges

Firing rule of an actor: a token on each input link and no token on each output link

Effect: consumes all input tokens and produces a token on its output link
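The classical firing rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Actor` class, the dictionary that records which links hold a token, and the example `add` actor are names introduced here, not part of the original model description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical representation: each link holds at most one token (None = empty).
Links = Dict[str, Optional[float]]

@dataclass
class Actor:
    name: str
    inputs: List[str]            # names of the input links
    output: str                  # name of the output link
    op: Callable[..., float]

def enabled(actor: Actor, links: Links) -> bool:
    """Classical rule: a token on each input link and no token on the output link."""
    inputs_full = all(links[link] is not None for link in actor.inputs)
    return inputs_full and links[actor.output] is None

def fire(actor: Actor, links: Links) -> None:
    """Consume all input tokens and produce one token on the output link."""
    args = [links[link] for link in actor.inputs]
    for link in actor.inputs:
        links[link] = None
    links[actor.output] = actor.op(*args)

add = Actor("add", ["l1", "l2"], "l3", lambda x, y: x + y)
links: Links = {"l1": 2.0, "l2": 3.0, "l3": None}
was_enabled = enabled(add, links)
fire(add, links)
still_enabled = enabled(add, links)   # inputs are now empty, output is occupied
```

After firing, the actor cannot fire again until new input tokens arrive and the output token has been consumed downstream.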

The homogeneous High Level Dataflow System (hHLDS) model

Special actors in the classical model

Switch and Decider are characterized by heterogeneous I/O conditions

The hHLDS model

homogeneous High Level Dataflow System

Any actor has two input links and one output link, and consumes and produces only data tokens

Firing rule of an actor: a token on each input link

Effect: consumes all input tokens and can produce a token on its output link
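A minimal Python sketch of an hHLDS-style actor follows. It assumes that "can produce a token" may be modelled by an operation that returns `None` when no token is emitted; the `HActor` class and the `pass_if_le` conditional actor are hypothetical names used only for illustration.

```python
from typing import Callable, Optional

Token = float

class HActor:
    """hHLDS-style actor: exactly two input links and one output link."""

    def __init__(self, op: Callable[[Token, Token], Optional[Token]]):
        self.op = op
        self.left: Optional[Token] = None
        self.right: Optional[Token] = None

    def ready(self) -> bool:
        # Firing rule: a data token on each input link (the output is not tested).
        return self.left is not None and self.right is not None

    def fire(self) -> Optional[Token]:
        # Consume both input tokens; the operation may or may not produce a token.
        a, b = self.left, self.right
        self.left = self.right = None
        return self.op(a, b)

mul = HActor(lambda x, y: x * y)
# Hypothetical conditional actor: emits its left token only when left <= right.
pass_if_le = HActor(lambda x, y: x if x <= y else None)

mul.left, mul.right = 3.0, 4.0
product = mul.fire()

pass_if_le.left, pass_if_le.right = 5.0, 2.0
nothing = pass_if_le.fire()
```

Because every actor has the same shape, arithmetic and conditional behaviour differ only in the operation plugged into the actor, not in its I/O conditions.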

[Figure: example hHLDS graphs; recoverable labels: "a+b*c*" and "If b≤c then a".]

Comparison between the two models

[Figure: dataflow graphs of the following program in the two models]

input (a, c)
b := 1;
repeat
  if a > 1 then a := a \ 2
  else a := a * 5;
  b := b * 3;
until b = c;
output (d)

The hHLDS model

CHIARA language

CHIARA is a dialect of Backus's FP, described by the tuple $(O, F, \mathcal{F}, :, D)$ where:

■ $O$ is a set of objects;

■ $F$ is a set of functions (or operators) from objects to objects;

■ $\mathcal{F}$ is a set of functional forms (functionals) from functions to functions;

■ $:$ is the application operation;

■ $D$ is a set of function definitions.

CHIARA objects

■ Atoms: include integer, fixed-point and floating-point numbers, Boolean constants, characters and strings

■ Sequences: denoted with angle brackets, e.g. < 1, 2, 3 >. The empty sequence <> is the only object which is both an atom and a sequence

■ Undefined: a special object (UDF) called bottom, written ⊥, which is usually used to denote errors or exceptions. Sequences are bottom-preserving: < 1; 2; < 3; 5 >; ⊥ > = ⊥
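The bottom-preserving behaviour of sequences can be illustrated with a small Python sketch. Modelling sequences as tuples and bottom as a singleton called `BOTTOM` is an assumption made here for illustration; it says nothing about how CHIARA is actually implemented.

```python
class Bottom:
    """Singleton standing in for the undefined object (bottom)."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "bottom"

BOTTOM = Bottom()

def seq(*items):
    """Build a sequence; any bottom element makes the whole sequence bottom."""
    if any(item is BOTTOM for item in items):
        return BOTTOM
    return tuple(items)

ok = seq(1, 2, seq(3, 5))
bad = seq(1, 2, seq(3, 5), BOTTOM)      # < 1; 2; < 3; 5 >; bottom >
nested_bad = seq(1, seq(2, BOTTOM))     # bottom propagates through nesting
empty = seq()                           # the empty sequence <>
```

Because an inner sequence containing bottom is itself bottom, the property propagates outwards through any level of nesting.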

CHIARA language

CHIARA functions

Two kinds of operators can be applied to objects:

■ Elementary: the commonly used binary operators and some new ones

■ Combinator: operators that affect the structure of the objects to which they are applied (combine sequences, transpose sequences of sequences, etc.).

CHIARA language

Elementary operators

Keyword: +
Name: arithmetic addition
Action: if x is a pair of numbers, + : x produces their sum, otherwise ⊥.
Definition: + : < x1; x2 > = x1 + x2

Keyword: -
Name: arithmetic subtraction
Action: if x is a pair of numbers, - : x produces their difference, otherwise ⊥.
Definition: - : < x1; x2 > = x1 - x2

Keyword: *
Name: arithmetic multiplication
Action: if x is a pair of numbers, * : x produces their product, otherwise ⊥.
Definition: * : < x1; x2 > = x1 * x2

Keyword: /
Name: arithmetic division
Action: if x is a pair of numbers, / : x produces their quotient, otherwise ⊥.
Definition: / : < x1; x2 > = x1 / x2

Keyword: lt
Name: less than
Action: if x is a pair of objects, lt : x produces T if the first object is less than the second, otherwise ⊥.
Definition: lt : < x1, x2 >

CHIARA language

Elementary operators

Keyword: gt
Name: greater than
Action: if x is a pair of objects, gt : x produces T if the first object is greater than the second, otherwise ⊥.
Definition: gt : < x1, x2 >

…

Keyword: lst
Name: loop start
Action: if x is a pair of objects, lst : x produces the first object when applied the first time, otherwise the second object.
Definition: lst : < x1; x2 >

Keyword: sL
Name: select left
Action: if x is a pair of objects, sL : x produces the first object, otherwise ⊥.
Definition: sL : < x1; x2 > = x1

Keyword: sR
Name: select right
Action: if x is a pair of objects, sR : x produces the second object, otherwise ⊥.
Definition: sR : < x1; x2 > = x2
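Some of the elementary operators can be mimicked in Python to make their behaviour on non-pairs concrete. The sketch below assumes pairs are 2-tuples, maps T and F to Python's `True` and `False`, and uses a sentinel named `BOTTOM` for the undefined result; division and `lst` are left out because their edge cases (division by zero, internal state) are not specified here.

```python
from numbers import Number

BOTTOM = object()   # stand-in for the undefined object

def is_pair(x):
    return isinstance(x, tuple) and len(x) == 2

def is_number(v):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(v, Number) and not isinstance(v, bool)

def arith(op):
    """Lift a binary Python function to an operator on pairs of numbers."""
    def apply(x):
        if is_pair(x) and all(is_number(v) for v in x):
            return op(x[0], x[1])
        return BOTTOM
    return apply

add = arith(lambda a, b: a + b)
sub = arith(lambda a, b: a - b)
mul = arith(lambda a, b: a * b)

def lt(x):
    return (x[0] < x[1]) if is_pair(x) else BOTTOM

def sL(x):
    return x[0] if is_pair(x) else BOTTOM

def sR(x):
    return x[1] if is_pair(x) else BOTTOM
```

For instance, `add((2, 3))` yields 5, while `add((2, "a"))` and `sR(5)` both yield `BOTTOM`.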

CHIARA language

Combinator operators

Keyword: i
Name: selector
Action: if i and n are two natural numbers and x is a non-empty sequence of n ≥ i elements, i : x produces the ith element, otherwise ⊥.
Definition: i : < x1, .., xi, .., xn > = xi

Keyword: id
Name: identity
Action: if x is an object, id : x produces x.
Definition: id : x = x

Keyword: apndL
Name: append left
Action: if x is a pair of objects, where the second one is a sequence, apndL : x produces a sequence, otherwise ⊥.
Definition: apndL : < z, < y1, ..., yn > > = < z, y1, ..., yn >

Keyword: apndR
Name: append right
Action: if x is a pair of objects, where the first one is a sequence, apndR : x produces a sequence, otherwise ⊥.
Definition: apndR : < < y1, ..., yn >, z > = < y1, ..., yn, z >

…

Combinator operators

Keyword: concat
Name: concatenate
Action: if x is a pair of sequences, concat : x produces a sequence where the second one is concatenated to the first one, otherwise ⊥.
Definition: concat : < < x11, ..., x1n >, < x21, ..., x2n > > = < x11, ..., x1n, x21, ..., x2n >

Keyword: distL
Name: distribute from left
Action: if x is a pair of objects, where the second one is a sequence, distL : x produces a sequence of pairs, otherwise ⊥.
Definition: distL : < z, < y1, .., yn > > = < < z, y1 >, .., < z, yn > >

Keyword: trans
Name: transpose
Action: if x is a sequence of objects, trans : x produces a transposition of this sequence, otherwise ⊥.
Definition: trans : < < x11, ..., x1m >, …, < xn1, ..., xnm > > = < < x11, ..., xn1 >, …, < x1m, ..., xnm > >

…
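The combinators listed so far can be sketched in Python with tuples standing in for sequences. The function names mirror the keywords, `select(i)` plays the role of the selector `i`, and `BOTTOM` is a sentinel introduced here for the undefined result; all of this is an illustrative assumption rather than the CHIARA implementation.

```python
BOTTOM = object()   # stand-in for the undefined object

def is_seq(x):
    return isinstance(x, tuple)

def select(i):
    """Selector i: the i-th element (1-based) of a sequence with at least i elements."""
    def apply(x):
        return x[i - 1] if is_seq(x) and 1 <= i <= len(x) else BOTTOM
    return apply

def apndL(x):
    if is_seq(x) and len(x) == 2 and is_seq(x[1]):
        return (x[0],) + x[1]
    return BOTTOM

def apndR(x):
    if is_seq(x) and len(x) == 2 and is_seq(x[0]):
        return x[0] + (x[1],)
    return BOTTOM

def concat(x):
    if is_seq(x) and len(x) == 2 and all(is_seq(s) for s in x):
        return x[0] + x[1]
    return BOTTOM

def distL(x):
    if is_seq(x) and len(x) == 2 and is_seq(x[1]):
        return tuple((x[0], y) for y in x[1])
    return BOTTOM

def trans(x):
    # All rows must be sequences of the same length.
    if is_seq(x) and x and all(is_seq(r) and len(r) == len(x[0]) for r in x):
        return tuple(zip(*x))
    return BOTTOM
```

As an example, `distL((9, (1, 2)))` gives `((9, 1), (9, 2))`, and `trans(((1, 2, 3), (4, 5, 6)))` gives `((1, 4), (2, 5), (3, 6))`.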

Functional forms

CHIARA functional forms are used to define new functions from existing functions and combinators

Functionals in CHIARA include the functional forms of Backus’s FP and some new ones

Functional forms

Keyword: %
Name: constant
Action: if x is a value and y is an object, %x : y produces x, otherwise ⊥.
Definition: %x : y = x

Keyword: °
Name: composition
Action: it permits the application of the composition of two functions to an object x. The composition of two functions, f ° g, is the function obtained by applying first g and then f to an object x.
Definition: (f ° g) : x = f : (g : x)

Keyword: -->
Name: conditional
Action: it permits the application of one of the two functions q and r to an object according to the Boolean value of a condition p.
Definition: (p --> q; r) : x =
    q : x   if p : x = T
    r : x   if p : x = F
    ⊥       otherwise

Keyword: case
…

Functional forms

Keyword: [...]
Name: construction
Action: it permits the application of a sequence of functions f1, ..., fn to an object x.
Definition: [f1, ..., fn] : x = < f1 : x, ..., fn : x >

Keyword: &
Name: apply to all
Action: it permits the application of the same function f to every element of a sequence x.
Definition: &f : < x1, ..., xn > = < f : x1, ..., f : xn >

Keyword: |
Name: insert
Action: if x is a sequence of at least two elements, it recursively applies the function f to the pair formed by the head of x and the result of inserting f into the tail. The recursion stops when the tail contains one object; otherwise ⊥.
Definition: |f : < x1, x2, ..., xn > = f : < x1, |f : < x2, ..., xn > > = f : < x1, f : < x2, |f : < x3, ..., xn > > >

…

Functional forms

Keyword: !
Name: binary insert
Action: it breaks up the argument into pairs and applies the functional parameter f to each pair, then applies itself recursively to the sequence of results. It stops when the object contains one pair.
Definition: !f : < x1, x2, ..., xn > = !f : < f : < x1, x2 >, f : < x3, x4 >, ..., f : < xn-1, xn > > = …

Keyword: while
Definition: (while p, f) : x =
    (while p, f) : (f : x)   if p : x = T
    x                        if p : x = F

Keyword: repeat
…
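The functional forms can be read as higher-order functions. The Python sketch below is one possible rendering under explicit assumptions: sequences are tuples, T and F are `True` and `False`, `BOTTOM` is a sentinel for the undefined result, and binary insert is only defined here for sequences whose length is a power of two, since the behaviour for other lengths is not stated.

```python
BOTTOM = object()   # stand-in for the undefined object

def const(value):                       # %x
    return lambda _y: value

def compose(f, g):                      # f ° g
    return lambda x: f(g(x))

def cond(p, q, r):                      # p --> q; r
    def apply(x):
        test = p(x)
        if test is True:
            return q(x)
        if test is False:
            return r(x)
        return BOTTOM
    return apply

def construct(*fs):                     # [f1, ..., fn]
    return lambda x: tuple(f(x) for f in fs)

def apply_to_all(f):                    # &f
    return lambda xs: tuple(f(x) for x in xs)

def insert(f):                          # |f, right-nested: f:<x1, |f:<x2, ..., xn>>
    def apply(xs):
        if len(xs) < 2:
            return BOTTOM
        acc = xs[-1]
        for x in reversed(xs[:-1]):
            acc = f((x, acc))
        return acc
    return apply

def binary_insert(f):                   # !f, a balanced reduction tree
    def apply(xs):
        n = len(xs)
        if n < 2 or n & (n - 1):        # assumption: length is a power of two
            return BOTTOM
        level = tuple(xs)
        while len(level) > 1:
            level = tuple(f((level[i], level[i + 1]))
                          for i in range(0, len(level), 2))
        return level[0]
    return apply

def while_(p, f):                       # (while p, f)
    def apply(x):
        while p(x) is True:
            x = f(x)
        return x
    return apply

sub = lambda pair: pair[0] - pair[1]
right_nested = insert(sub)((10, 3, 2))          # 10 - (3 - 2)
tree = binary_insert(sub)((10, 3, 2, 1))        # (10 - 3) - (2 - 1)
```

The difference between the two insert forms shows up with a non-associative operator such as subtraction: `insert` nests to the right, whereas `binary_insert` combines neighbouring pairs level by level, which exposes more parallelism.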

The assembly language

a functionally complete subset of elementary operators is the assembly language for a D3AS many-core processor

more complex functions are obtained by applying the rule of metacomposition

the dataflow graphs that are produced can be directly mapped onto the hardware and executed

CHIARA language

New functions

The def construct permits the definition of new functions from existing functions, combinators, functional forms, and other already defined functions.

For example:

def max = (gt ° [1,2] --> 1;2)

max:<5,6> = 6
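The same definition can be traced in Python, assuming sequences are tuples and the functional forms are written as higher-order functions. The helper names (`select`, `construct`, `compose`, `cond`, `chiara_max`) are introduced here only for illustration.

```python
def select(i):
    """Selector i: the i-th element (1-based) of a sequence."""
    return lambda x: x[i - 1]

def construct(*fs):
    """[f1, ..., fn] : x = < f1 : x, ..., fn : x >"""
    return lambda x: tuple(f(x) for f in fs)

def compose(f, g):
    """(f ° g) : x = f : (g : x)"""
    return lambda x: f(g(x))

def cond(p, q, r):
    """(p --> q; r) : x applies q when p : x is true, r otherwise."""
    return lambda x: q(x) if p(x) else r(x)

def gt(x):
    return x[0] > x[1]

# def max = (gt ° [1,2] --> 1;2)
chiara_max = cond(compose(gt, construct(select(1), select(2))),
                  select(1), select(2))

result = chiara_max((5, 6))
```

For `<5,6>` the condition `gt ° [1,2]` evaluates to false, so selector 2 is applied and the result is 6.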

Dataflow graph mapping

communication between many-core processors is slower than communication within a many-core processor

mapping is an NP-hard problem

Dataflow graph generation and mapping

Compilation process

The whole compilation process is composed of two steps:

compilation, producing the dataflow graph from CHIARA programs (function definitions plus expressions to be evaluated)

mapping, aimed at implementing the produced dataflow graph onto the D3AS prototype

Dataflow graph generation

the CHIARA compiler, in conjunction with front-end tools, generates the

Global Dataflow Graph Table (GDGT)

Global Dataflow Graph Table (GDGT)

Node#  Func  Apply level  Constr level  Insert level  Left in  Right in  Out
..     ...   .            .             .             ..       ..        ..
43     MUL   1            0             0             %1       %30       47
44     MUL   1            0             0             %2       %30       47
45     MUL   1            0             0             %3       %30       48
46     MUL   1            0             0             %4       %30       48
47     ADD   0            0             1             43       44        49
48     ADD   0            0             1             45       46        49
49     ADD   0            0             2             47       48        out
50     MUL   1            0             0             %1       %40       54
51     MUL   1            0             0             %2       %40       54
52     MUL   1            0             0             %3       %40       55
53     MUL   1            0             0             %4       %40       55
54     ADD   0            0             1             50       51        56
55     ADD   0            0             1             52       53        56
56     ADD   0            0             2             54       55        out
..     ...   .            .             .             ..       ..        ..
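To make the table concrete, the rows for nodes 43 to 49 can be evaluated with a small data-driven interpreter. The sketch assumes that an entry such as `%30` denotes the constant 30 (by analogy with the `%` constant functional form) and ignores the three level columns; the `evaluate` function is hypothetical and is not the D3AS tool chain.

```python
import operator

OPS = {"MUL": operator.mul, "ADD": operator.add}

# (node, func, left input, right input, output) copied from rows 43-49 of the table;
# the three level columns are omitted because this sketch does not use them.
ROWS = [
    (43, "MUL", "%1", "%30", "47"),
    (44, "MUL", "%2", "%30", "47"),
    (45, "MUL", "%3", "%30", "48"),
    (46, "MUL", "%4", "%30", "48"),
    (47, "ADD", "43", "44", "49"),
    (48, "ADD", "45", "46", "49"),
    (49, "ADD", "47", "48", "out"),
]

def evaluate(rows):
    """Fire each node once both operands are available; return the value sent to 'out'."""
    values = {}
    pending = list(rows)
    result = None
    while pending:
        progressed = False
        for row in list(pending):
            node, func, left, right, out = row
            operands = []
            for ref in (left, right):
                if ref.startswith("%"):          # assumption: %k is the constant k
                    operands.append(int(ref[1:]))
                elif int(ref) in values:
                    operands.append(values[int(ref)])
            if len(operands) == 2:
                values[node] = OPS[func](*operands)
                if out == "out":
                    result = values[node]
                pending.remove(row)
                progressed = True
        if not progressed:
            raise ValueError("graph has unsatisfied inputs")
    return result

total = evaluate(ROWS)
```

Under that reading the subgraph computes 1*30 + 2*30 + 3*30 + 4*30, and the order in which the rows are listed does not matter because a node fires only when its operands exist.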

Visualization of Compiler Graph

The next step

the compiler extracts from the GDGT two tables:

Dataflow Graph Description (DGD) table, that contains, for each node, the binary operation and interconnection codes for the Graph Setter of a Processing Subsystem

Initial Input Value (IIV) table, that contains the binary information about input program data tokens

Dataflow graph mapping

The presence of functionals:

permits the adoption of strategies that try to cluster parallelism exploitation

suggests handy ways to partition the dataflow graph into smaller, loosely connected graphs that can be run on the single platform-processors
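As a toy illustration of partitioning, the edges of the GDGT excerpt (nodes 43 to 56) can be grouped into connected components. This is the simplest possible partitioning and is offered only as a sketch; it is not the mapping strategy of the D3AS compiler, and the `components` helper is a name introduced here.

```python
from collections import defaultdict

# Edges (producer -> consumer) taken from the Out column of the GDGT excerpt.
EDGES = [(43, 47), (44, 47), (45, 48), (46, 48), (47, 49), (48, 49),
         (50, 54), (51, 54), (52, 55), (53, 55), (54, 56), (55, 56)]

def components(edges):
    """Group nodes into connected components of the undirected version of the graph."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, parts = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            node = stack.pop()
            if node in part:
                continue
            part.add(node)
            stack.extend(neighbours[node] - part)
        seen |= part
        parts.append(sorted(part))
    return parts

parts = components(EDGES)
```

The excerpt splits into two independent subgraphs, nodes 43 to 49 and nodes 50 to 56, each of which could be placed on a separate processor without any communication between them.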

D3AS general architecture

Reconfigurable Hardware System (RHS)

Capable of mapping and executing, in a completely asynchronous manner, dataflow graphs created with the hHLDS model.

Constituted by three subsystems:

■ Actor Realization Subsystem (ARS)

Capable of creating a one-to-one correspondence between graph actors and Functional Units.

■ Token flow Realization Subsystem (TRS)

Implements the graph edges.

■ Graph Mapping Subsystem (GMS)

Devoted to storing the RHS context information.

■ ARS: constituted by N identical Multipurpose Functional Units (MPFUs).

■ TRS: constituted by 3 sets of N buffer registers and a crossbar switch interconnect.

■ GMS: constituted by a set of buffers and logic circuitry.

Critical Parameters in the RHS design.

■ N_MPFU: the number of MPFUs constituting the ARS;

■ C_MPFU: the logical and functional complexity of the MPFUs;

■ INT_TRS: the type of interconnect for the TRS.

The number of MPFUs implementable on a VLSI device depends on:

■ interconnect complexity;

■ logical and functional complexity of the MPFU.

RHS/D3AS Fundamental Building Block

Many-core Dataflow Processor (MDP)

A many-core chip replicating the D3AS general architecture, with n MPFUs interconnected via a non-blocking crossbar switch network.

Architecture with globally pure dataflow model

N: number of graph actors

n: number of MPFUs per MDP

The RHS is configured by interconnecting K = N/n MDPs with a second-level non-blocking crossbar switch interconnection network.

D3AS general architecture with Hybrid Dataflow Model

The graph is partitioned into subgraphs, and the RHS is configured by interconnecting m = N/n MDPs with a second-level message-passing interconnection network.

Dataflow graph edges between subgraphs mapped on different MDPs are virtualized by messages routed through the network.

Communicating Dataflow Processes (CDP)
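The arithmetic behind the hybrid model can be sketched as follows. Rounding N/n up when N is not a multiple of n, placing actors on MDPs in index order, and the six-actor chain are all assumptions made for this example; `mdp_count` and `split_edges` are hypothetical helper names.

```python
import math

def mdp_count(num_actors: int, mpfus_per_mdp: int) -> int:
    """Number of MDPs needed for N actors with n MPFUs each (N/n rounded up)."""
    return math.ceil(num_actors / mpfus_per_mdp)

def split_edges(edges, placement):
    """Separate edges kept inside one MDP from edges that become network messages."""
    local = [(a, b) for a, b in edges if placement[a] == placement[b]]
    remote = [(a, b) for a, b in edges if placement[a] != placement[b]]
    return local, remote

# Hypothetical example: 6 actors, 4 MPFUs per MDP, actors placed in index order.
N, n = 6, 4
placement = {actor: actor // n for actor in range(N)}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
local, remote = split_edges(edges, placement)
```

Here two MDPs are needed, four edges stay inside an MDP, and the single edge from actor 3 to actor 4 would be carried as a message over the second-level network.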

D3AS general architecture demonstrator

[Figure: block diagram of the D3AS demonstrator on a GIDEL board. Labelled blocks: Routing Subsystem, Kernel Subsystem, Processing Subsystem, Packet Assembler, Packet Deassembler, WK-recursive Message Manager, Destination List, Graph Setter, MPFU Interconnect, MPFU #1 ... MPFU #n, MPFU OP Code, Token_Out, Token_In A and Token_In B ensemble buffers, Control. MPFU detail: latches, Enable In, Enable Out, LST test, validity test, ALU+MULT.]

Some results

Matrix Multiplication

Given two matrices A(n,n) and B(n,n), their product is a matrix C(n,n) whose generic element is given by the following formula:

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \qquad i, j = 1, \dots, n$$
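A direct transcription of the element formula into Python is shown below as a reference, using plain lists of lists. It is a sequential sketch of the definition only and does not reflect how the computation is laid out on the dataflow hardware.

```python
def matmul(a, b):
    """c[i][j] = sum over k of a[i][k] * b[k][j] for square n x n lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

c = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each element of C needs n products followed by the summation of those products, and all n*n elements are independent of one another.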

Matrix Multiplication: we used two values of n, n=32 and n=64

Some results

Matrix Multiplication

We compared the performance of a platform-processor with that of an IA-32 Pentium IV.

We measured performance in terms of cycles per instruction (CPI) because our FPGA platform-processor executes an operation in 30 ns, against 0.5 ns for the Pentium.

Some results

IA-32 Pentium IV vs D3AS (cycles per instruction)

       Pentium                      Platform-Processor
n      Products  Sums    Total      Products  Sums   Total
32     8192      7939    16131      -         -      1027
64     65561     64537   130098     4096      4108   8204
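The ratio between the two Total columns can be computed directly from the table. The snippet only divides the reported totals; it is not a wall-clock comparison, since the two machines take different times per operation.

```python
# Totals copied from the table: {n: (Pentium total, platform-processor total)}
TOTALS = {32: (16131, 1027), 64: (130098, 8204)}

def ratio(n):
    """Pentium total divided by platform-processor total for matrix size n."""
    pentium, platform = TOTALS[n]
    return pentium / platform

ratios = {n: round(ratio(n), 2) for n in TOTALS}
```

For both matrix sizes the Pentium total is a little under 16 times the platform-processor total.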

Some results

Zeroes of a function (f=x*x+3x-1.75)

assembly code generated compiling the C source code: 122 sequential assembly code lines

Some results

Zeroes of a function

our compiler generates a GDGT with only 28 micro-instructions organized on 12 sequential steps.

Node  Func  Apply level  Constr level  Insert level  Left input  Right input  Output
1     LST   0            0             0             26          -1%          5-5-6-3-17
2     LST   0            0             0             27          1%           3-21
3     ADD   0            0             0             1           2            4
4     DIV   0            3             0             3           2%           7-7-8-18-19-20-22
5     MUL   0            0             0             1           1            9
6     MUL   0            0             0             1           3%           9
7     MUL   0            0             0             4           4            10
8     MUL   0            0             0             4           3%           10
9     ADD   0            0             0             5           6            11
10    ADD   0            0             0             7           8            12
11    SUB   0            0             0             9           1.75%        13
12    SUB   0            0             0             10          1.75%        13
13    MUL   0            0             0             11          12           14-15-16
14    LT    0            0             0             13          0%           17-18
15    EQ    0            0             0             13          0%           19-20
16    GT    0            0             0             13          0%           30-22
17    ADD   0            0             0             1           14           30
18    ADD   0            0             0             14          4            28
19    ADD   0            0             0             4           15           30
20    ADD   0            0             0             15          4            28
21    ADD   0            0             0             2           16           28
22    ADD   0            0             0             16          4            30
23    SUB   0            0             0             30          28           24-25
24    GEQ   0            0             0             23          0.01%        26-27
25    LT    0            0             0             23          0.01%        29
26    ADD   0            0             0             30          24           1
27    ADD   0            0             0             28          24           2
28    MRG   0            0             0             18-20-                   23-27-29
29    ADD   0            0             0             28          25           Out
30    MRG   0            0             0             17-19-
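The constants that appear in the table (initial values -1 and 1, a division by 2, a product compared with 0, and a threshold of 0.01) are consistent with a bisection search for a zero of f, although that reading is an assumption and not something the table states. Under that assumption, a sequential Python equivalent might look like this:

```python
def f(x):
    return x * x + 3 * x - 1.75

def bisect(func, a, b, tol=0.01):
    """Halve [a, b] until it is narrower than tol or an exact zero is hit."""
    while b - a >= tol:
        mid = (a + b) / 2
        sign = func(a) * func(mid)
        if sign == 0:
            # One of the two evaluated points is already a zero.
            return mid if func(mid) == 0 else a
        if sign > 0:
            a = mid      # no sign change in [a, mid]: the zero lies in [mid, b]
        else:
            b = mid
    return (a + b) / 2

root = bisect(f, -1.0, 1.0)
```

On the interval [-1, 1] the search converges to x = 0.5, which is a zero of x*x + 3x - 1.75 (the other zero, x = -3.5, lies outside the interval).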


Future work

To evaluate which applications perform better on the architecture with the globally pure dataflow model and which on the architecture with the hybrid dataflow model.

How to generalize pipelining inside the MDP
