Post on 19-Dec-2015
L. Verdoscia & R. Vaccaro – Many-core processors: the integrated approach to the computational and execution models
CNR-Bioinformatics, Dec. 19, Napoli
Many-core processors: the integrated approach to the
computational and execution models
Lorenzo Verdoscia and Roberto Vaccaro
Institute for High Performance Computing and Networking
National Research Council – Italy
lorenzo.verdoscia@na.icar.cnr.it
The Landscape of Parallel Computing Research: A View From Berkeley http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
From our architectural point of view, this new trend raises at least two questions: how to exploit such spatial parallelism, and how to program such systems.
What is D3AS
The first question leads us to seriously reconsider the dataflow paradigm, given the fine-grained nature of its operations. Instead of carrying out a set of operations in sequence, as a von Neumann processor does, a many-core dataflow processor could compute a function by first connecting and configuring a number of identical simple cores as a dataflow graph and then letting data flow asynchronously through them.
The second question leads us to seriously reconsider the functional programming style, given its intrinsic simplicity for writing parallel programs. Functional languages have three key properties that make them attractive for parallel programming:
they have powerful mechanisms for abstracting over both computation and coordination;
they eliminate unnecessary dependencies;
their high-level coordination achieves a largely architecture-independent style of parallelism.
Agenda
The hHLDS model
CHIARA language
Dataflow graph generation and mapping
D3AS general architecture
Future work
D3AS (Demand Data Driven Architecture System):
a high performance reconfigurable computing system demonstrator, which exploits FPGA technology where
the computational model is functional
the execution model is dataflow
and whose architecture is highly scalable, with nodes characterized by
a dynamic configurability
a transparent hardware reconfiguration
Design methodology:
develop the right computational model alongside languages & hardware
The methodological approach
[Figure: the methodological approach]
Physical Model (reconfigurable)
Computational Model: data driven (dataflow) and demand driven (functional)
Real Machine (hundreds of thousands of identical MPFUs)
homogeneous High-Level Dataflow System (hHLDS); CHIARA language
Firing rules in the classical model
Let $A = \{a_1, \dots, a_n\}$ be the set of actors
and $L = \{l_1, \dots, l_n\}$ be the set of links.
A dataflow graph is a labelled directed graph
$G = (N, E)$, where
$N = A \cup L$ is the set of nodes
$E \subseteq (A \times L) \cup (L \times A)$ is the set of edges
Firing of an actor: a token on each input link and no token on each output link
Effect: consumes all input tokens and produces a token on its output link
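The classical firing rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Actor` class, the `links` dictionary (link name to token, with `None` meaning "no token") and the function names are hypothetical and not part of the original model description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Actor:
    name: str
    op: Callable[..., object]
    inputs: List[str]   # names of the input links
    output: str         # name of the single output link


def can_fire(actor: Actor, links: Dict[str, Optional[object]]) -> bool:
    """Classical rule: a token on every input link and no token on the output link."""
    inputs_ready = all(links[l] is not None for l in actor.inputs)
    output_free = links[actor.output] is None
    return inputs_ready and output_free


def fire(actor: Actor, links: Dict[str, Optional[object]]) -> None:
    """Effect: consume all input tokens and produce one token on the output link."""
    args = [links[l] for l in actor.inputs]
    for l in actor.inputs:
        links[l] = None
    links[actor.output] = actor.op(*args)


# Usage: an adder with tokens 2 and 3 waiting on its input links.
links = {"l1": 2, "l2": 3, "l3": None}
add = Actor("add", lambda x, y: x + y, ["l1", "l2"], "l3")
fired = can_fire(add, links)
if fired:
    fire(add, links)
```

After firing, the adder is no longer enabled: its input links are empty and its output link still holds the unconsumed result.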
The homogeneous High Level Dataflow System (hHLDS) model
Special actors in the classical model
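[Figure: the Merge, Switch, Decider, and Gate actors, with data inputs A and B, control input L, and T/F selections]
The special actors (Merge, Switch, Decider, Gate) are characterized by heterogeneous I/O conditions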
The hHLDS model
homogeneous High Level Dataflow System
Any actor has two input links and one output link, and consumes and produces only data tokens
Firing of an actor: a token on each input link
Effect: consumes all input tokens and can produce a token on its output link
[Figure: example hHLDS graphs for the expression a + b * c and for the conditional "if b ≤ c then a"]
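A minimal Python sketch of an hHLDS-style actor may help here: two inputs, one output, enabled as soon as both input tokens are present, and allowed to produce no token at all. The convention used for the relational actor (emit the neutral data token 0 when the condition holds, emit nothing otherwise) is an assumption made for this illustration, as are the names `make_actor`, `leq` and `if_leq_then`.

```python
def make_actor(op):
    """Build a two-input, one-output actor; None stands for 'no token'."""
    def fire(left, right):
        if left is None or right is None:
            return None          # not enabled: a token is missing on an input link
        return op(left, right)   # may itself return None (no output token)
    return fire


add = make_actor(lambda x, y: x + y)

# Assumed convention: the comparison emits the data token 0 when it holds
# and no token when it fails, so only data tokens ever travel on links.
leq = make_actor(lambda x, y: 0 if x <= y else None)


def if_leq_then(a, b, c):
    """'if b <= c then a' built from the two homogeneous actors above."""
    return add(a, leq(b, c))


taken = if_leq_then(7, 2, 3)      # condition holds, a flows through
not_taken = if_leq_then(7, 5, 3)  # condition fails, no token is produced
```

With this convention no special Switch, Merge or Gate actor is needed: the adder simply never fires when the comparison produces no token.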
Comparison between the two models
[Figure: a) the classical dataflow graph, with T/F switch and merge actors, and b) the hHLDS dataflow graph, with nodes numbered 1 to 14, for the program below]
input (a, c)
b := 1;
repeat
  if a > 1 then a := a \ 2
  else a := a * 5
  b := b * 3;
until b = c;
output (d)
CHIARA language
A dialect of Backus's FP, defined by the tuple $(O, F, \mathcal{F}, :, D)$ where:
O is a set of objects;
F is a set of functions (or operators) from objects to objects;
$\mathcal{F}$ is a set of functional forms (functionals) from functions to functions;
: is the application operation;
D is a set of function definitions.
CHIARA objects
Atoms: integer, fixed-point, and floating-point numbers, Boolean constants, characters, and strings
Sequences: denoted with angle brackets, e.g. < 1, 2, 3 >. The empty sequence <> is the only object that is both an atom and a sequence
Undefined: a special object (or UDF) called bottom (⊥), usually used to denote errors or exceptions. Sequences are bottom-preserving: < 1; 2; < 3; 5 >; ⊥ > = ⊥
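The bottom-preserving behaviour of sequences can be illustrated with a small Python sketch. It assumes a hypothetical encoding in which CHIARA sequences are Python tuples and bottom is a dedicated sentinel; neither the `BOTTOM` name nor the `seq` helper comes from CHIARA itself.

```python
class _Bottom:
    """Sentinel standing in for the undefined object (bottom)."""
    def __repr__(self):
        return "⊥"


BOTTOM = _Bottom()


def seq(*items):
    """Build a sequence; any bottom element makes the whole sequence bottom."""
    if any(item is BOTTOM for item in items):
        return BOTTOM
    return tuple(items)


example = seq(1, 2, seq(3, 5), BOTTOM)   # the case shown on the slide
nested = seq(1, seq(2, BOTTOM))          # bottom propagates outwards through nesting
empty = seq()                            # the empty sequence <>
```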
CHIARA functions
Two kinds of operators can be applied to objects:
Elementary: the commonly used binary operators and some new ones
Combinator: operators that affect the structure of the objects to which they are applied (combine sequences, transpose sequences of sequences, etc.).
Elementary operators
Keyword | Name | Action | Definition
+ | arithmetic addition | if x is a pair of numbers, + : x produces their sum, otherwise ⊥ | + : < x1; x2 > = x1 + x2
- | arithmetic subtraction | if x is a pair of numbers, - : x produces their difference, otherwise ⊥ | - : < x1; x2 > = x1 - x2
* | arithmetic multiplication | if x is a pair of numbers, * : x produces their product, otherwise ⊥ | * : < x1; x2 > = x1 * x2
/ | arithmetic division | if x is a pair of numbers, / : x produces their quotient, otherwise ⊥ | / : < x1; x2 > = x1 / x2
lt | less than | if x is a pair of objects, lt : x produces T if the first object is less than the second, otherwise ⊥ | lt : < x1, x2 >
Elementary operators
Keyword | Name | Action | Definition
gt | greater than | if x is a pair of objects, gt : x produces T if the first object is greater than the second, otherwise ⊥ | gt : < x1, x2 >
… | … | … | …
lst | loop start | if x is a pair of objects, lst : x produces the first object when applied the first time, otherwise the second object | lst : < x1; x2 >
sL | select left | if x is a pair of objects, sL : x produces the first object, otherwise ⊥ | sL : < x1; x2 > = x1
sR | select right | if x is a pair of objects, sR : x produces the second object, otherwise ⊥ | sR : < x1; x2 > = x2
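A few of the elementary operators can be mimicked in Python to make their behaviour concrete. This is a sketch under assumptions: pairs are encoded as two-element tuples, bottom is encoded as `None`, and `lst` is modelled as a closure that remembers whether it has already been applied; the helper names are hypothetical.

```python
BOTTOM = None  # stand-in for the undefined object


def is_pair(x):
    return isinstance(x, tuple) and len(x) == 2


def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def arith(op):
    """Lift a binary Python function to an operator on pairs of numbers."""
    def apply(x):
        if is_pair(x) and all(is_number(v) for v in x):
            return op(x[0], x[1])
        return BOTTOM
    return apply


add = arith(lambda a, b: a + b)
sub = arith(lambda a, b: a - b)
mul = arith(lambda a, b: a * b)


def sL(x):
    """Select left: the first object of a pair."""
    return x[0] if is_pair(x) else BOTTOM


def sR(x):
    """Select right: the second object of a pair."""
    return x[1] if is_pair(x) else BOTTOM


def make_lst():
    """Loop start: first object on the first application, second object afterwards."""
    first_time = True

    def lst(x):
        nonlocal first_time
        if not is_pair(x):
            return BOTTOM
        if first_time:
            first_time = False
            return x[0]
        return x[1]
    return lst


lst = make_lst()
first, second = lst((10, 20)), lst((10, 20))
```

The stateful `lst` is what lets a loop start from an initial value and then switch to the value fed back by the loop body.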
Combinator operators
Keyword | Name | Action | Definition
i | selector | if i and n are two natural numbers and x is a non-empty sequence of n ≥ i elements, i : x produces the ith element, otherwise ⊥ | i : < x1,..,xi,..,xn > = xi
id | identity | if x is an object, id : x produces x | id : x = x
apndL | append left | if x is a pair of objects where the second one is a sequence, apndL : x produces a sequence, otherwise ⊥ | apndL : < z, < y1,...,yn >> = < z, y1,...,yn >
apndR | append right | if x is a pair of objects where the first one is a sequence, apndR : x produces a sequence, otherwise ⊥ | apndR : << y1,...,yn >, z > = < y1,...,yn, z >
… | … | … | …
Combinator operators
Keyword | Name | Action | Definition
concat | concatenate | if x is a pair of sequences, concat : x produces a sequence in which the second one is concatenated to the first one, otherwise ⊥ | concat : << x11,...,x1n >, < x21,...,x2n >> = < x11,...,x1n, x21,...,x2n >
distL | distribute from left | if x is a pair of objects where the second one is a sequence, distL : x produces a sequence of pairs, otherwise ⊥ | distL : < z, < y1,..,yn >> = << z,y1 >,.., < z,yn >>
trans | transpose | if x is a sequence of sequences, trans : x produces the transposition of this sequence, otherwise ⊥ | trans : << x11,...,x1m >,…,< xn1,...,xnm >> = << x11,...,xn1 >,…,< x1m,...,xnm >>
… | … | … | …
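The combinators listed above only rearrange structure, which a short Python sketch can make concrete. As an assumption for illustration, sequences are encoded as tuples and bottom as `None`; the function names follow the keywords in the tables, but the encoding is not part of CHIARA.

```python
BOTTOM = None  # stand-in for the undefined object


def is_seq(x):
    return isinstance(x, tuple)


def is_pair(x):
    return is_seq(x) and len(x) == 2


def selector(i):
    """i : x produces the i-th element (1-based) of a long enough sequence."""
    def apply(x):
        return x[i - 1] if is_seq(x) and 1 <= i <= len(x) else BOTTOM
    return apply


def apndL(x):
    return (x[0],) + x[1] if is_pair(x) and is_seq(x[1]) else BOTTOM


def apndR(x):
    return x[0] + (x[1],) if is_pair(x) and is_seq(x[0]) else BOTTOM


def concat(x):
    return x[0] + x[1] if is_pair(x) and is_seq(x[0]) and is_seq(x[1]) else BOTTOM


def distL(x):
    if is_pair(x) and is_seq(x[1]):
        return tuple((x[0], y) for y in x[1])
    return BOTTOM


def trans(x):
    """Transpose a sequence of equal-length sequences."""
    if not is_seq(x) or not all(is_seq(row) for row in x):
        return BOTTOM
    if len({len(row) for row in x}) > 1:
        return BOTTOM
    return tuple(zip(*x))


transposed = trans(((1, 2, 3), (4, 5, 6)))
distributed = distL((0, (1, 2, 3)))
```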
Functional forms
CHIARA functional forms are used to define new functions from existing functions and combinators
Functionals in CHIARA include the functional forms of Backus’s FP and some new ones
Functional forms
Keyword | Name | Action | Definition
% | constant | if x is a value and y is an object, %x : y produces x, otherwise ⊥ | %x : y = x
° | composition | it permits the application of the composition of two functions to an object x. The composition of two functions, f ° g, is the function obtained by applying first g and then f to an object x | (f ° g) : x = f : (g : x)
−−> | conditional | it permits the application of one of the two functions q and r to an object according to the Boolean value of a condition p | (p −−> q; r) : x = q : x if p : x = T; r : x if p : x = F; ⊥ otherwise
case | … | ... | …
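The constant, composition and conditional forms map naturally onto higher-order functions. The Python sketch below is an illustration under assumptions: bottom is encoded as `None`, T and F as Python `True` and `False`, and the names `const`, `compose` and `cond` are hypothetical.

```python
BOTTOM = None  # stand-in for the undefined object


def const(x):
    """%x : y = x for any defined object y."""
    return lambda y: BOTTOM if y is BOTTOM else x


def compose(f, g):
    """(f ° g) : x = f : (g : x)"""
    return lambda x: f(g(x))


def cond(p, q, r):
    """(p --> q; r) : x applies q or r according to p : x, bottom otherwise."""
    def apply(x):
        test = p(x)
        if test is True:
            return q(x)
        if test is False:
            return r(x)
        return BOTTOM
    return apply


# Usage: absolute value as a conditional, then a composition with doubling.
negate = lambda x: -x
identity = lambda x: x
absolute = cond(lambda x: x < 0, negate, identity)
double_abs = compose(lambda x: 2 * x, absolute)
```

Note that `cond` returns bottom when the predicate yields anything other than T or F, matching the third branch of the definition.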
Functional forms
Keyword | Name | Action | Definition
[...] | construction | it permits the application of a sequence of functions f1, ..., fn to an object x | [f1, ..., fn] : x = < f1 : x, ..., fn : x >
& | apply to all | it permits the application of the same function f to every element of a sequence x | &f : < x1, ..., xn > = < f : x1, ..., f : xn >
"|" | insert | if x is a sequence of at least two elements, it recursively applies the function f to the pair formed by the head and the result of inserting f into the tail, stopping when the tail contains one object; otherwise ⊥ | |f : < x1, x2, ..., xn > = f : < x1, |f : < x2, ..., xn >> = f : < x1, f : < x2, |f : < x3, ..., xn >>>
… | … | … | …
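Construction, apply to all and insert can be sketched in Python as follows. The encoding (tuples for sequences, `None` for bottom) and the function names are assumptions made for illustration; the recursion in `insert` follows the right-nested definition given in the table.

```python
BOTTOM = None  # stand-in for the undefined object


def construction(*fs):
    """[f1, ..., fn] : x = < f1 : x, ..., fn : x >"""
    return lambda x: tuple(f(x) for f in fs)


def apply_to_all(f):
    """&f : < x1, ..., xn > = < f : x1, ..., f : xn >"""
    return lambda xs: tuple(f(x) for x in xs)


def insert(f):
    """|f : < x1, ..., xn > = f : < x1, |f : < x2, ..., xn >>, for n >= 2."""
    def apply(xs):
        if not isinstance(xs, tuple) or len(xs) < 2:
            return BOTTOM
        if len(xs) == 2:
            return f(xs)
        return f((xs[0], apply(xs[1:])))
    return apply


add = lambda pair: pair[0] + pair[1]
sub = lambda pair: pair[0] - pair[1]
mul = lambda pair: pair[0] * pair[1]

# Usage: an inner product written as "insert add" after "apply-to-all mul".
pairs = tuple(zip((1, 2, 3, 4), (30, 30, 30, 30)))
inner_product = insert(add)(apply_to_all(mul)(pairs))
right_nested = insert(sub)((10, 3, 2))   # 10 - (3 - 2)
```

The subtraction example shows that insert nests to the right, which matters for non-associative functions.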
Functional forms
Keyword | Name | Action | Definition
! | binary insert | it splits the argument into pairs and applies the functional parameter f to each pair, then applies itself recursively to the sequence of results; it stops when the object contains a single pair | !f : < x1, x2, ..., xn > = !f : < f : < x1,x2 >, f : < x3,x4 >, ..., f : < xn-1,xn >> = …
while | | | (while p, f) : x = (while p, f) : (f : x) if p : x = T; x if p : x = F
repeat | … | |
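Binary insert amounts to a tree-shaped reduction, and while to a guarded iteration. The Python sketch below illustrates both under assumptions: tuples encode sequences, `None` encodes bottom, and an odd leftover element is carried unchanged to the next level, a case the slide does not specify.

```python
BOTTOM = None  # stand-in for the undefined object


def binary_insert(f):
    """!f: reduce a sequence pairwise, level by level, until one pair remains."""
    def apply(xs):
        if not isinstance(xs, tuple) or len(xs) < 2:
            return BOTTOM
        while len(xs) > 2:
            level = [f((xs[i], xs[i + 1])) for i in range(0, len(xs) - 1, 2)]
            if len(xs) % 2:          # assumption: carry an unpaired last element
                level.append(xs[-1])
            xs = tuple(level)
        return f(xs)
    return apply


def while_form(p, f):
    """(while p, f) : x keeps applying f while p : x is T and returns x when it is F."""
    def apply(x):
        while True:
            test = p(x)
            if test is True:
                x = f(x)
            elif test is False:
                return x
            else:
                return BOTTOM   # assumption: a non-Boolean test yields bottom
    return apply


add = lambda pair: pair[0] + pair[1]
sub = lambda pair: pair[0] - pair[1]

tree_sum = binary_insert(add)((1, 2, 3, 4, 5, 6, 7, 8))
tree_sub = binary_insert(sub)((8, 4, 2, 1))     # (8 - 4) - (2 - 1)
doubled = while_form(lambda x: x < 100, lambda x: 2 * x)(3)
```

Because all pairs of one level are independent, each level of the reduction can run in parallel, which is the point of offering binary insert alongside the sequential insert.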
The assembly language
a functionally complete subset of the elementary operators forms the assembly language of a D3AS many-core processor
more complex functions are obtained by applying the rule of metacomposition
the dataflow graphs produced can be directly mapped onto and executed on the hardware
New functions
The def construct permits the definition of new functions from existing functions, combinators, functional forms, and other already defined functions.
For example:
def max = (gt ° [1,2] --> 1;2)
max:<5,6> = 6
[Figure: dataflow graph generated for max, with inputs a and b]
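The max definition can be transliterated into Python to check the example. This sketch assumes tuples for sequences and Python Booleans for T and F; `selector`, `construction`, `compose`, `cond` and `chiara_max` are hypothetical helper names, not CHIARA syntax.

```python
def selector(i):
    """i : x picks the i-th element (1-based)."""
    return lambda x: x[i - 1]


def construction(*fs):
    """[f1, ..., fn] : x = < f1 : x, ..., fn : x >"""
    return lambda x: tuple(f(x) for f in fs)


def compose(f, g):
    """(f ° g) : x = f : (g : x)"""
    return lambda x: f(g(x))


def cond(p, q, r):
    """(p --> q; r) : x"""
    return lambda x: q(x) if p(x) else r(x)


def gt(x):
    """gt : < x1, x2 > is T when x1 is greater than x2."""
    return x[0] > x[1]


# def max = (gt ° [1,2] --> 1;2)
chiara_max = cond(compose(gt, construction(selector(1), selector(2))),
                  selector(1),
                  selector(2))

result = chiara_max((5, 6))
```

The construction [1,2] rebuilds the pair before the comparison, so the definition reads the same way whether the argument is a pair or a longer sequence whose first two elements are compared.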
Dataflow graph mapping
communication between many-core processors is slower than communication within a many-core processor
the mapping problem is NP-hard
Dataflow graph generation and mapping
Compilation process
The whole compilation process is composed of two steps:
compilation, producing the dataflow graph from CHIARA programs (function definitions plus expressions to be evaluated)
mapping, aimed at implementing the produced dataflow graph onto the D3AS prototype
Dataflow graph generation
the CHIARA compiler, in conjunction with front-end tools, generates the Global Dataflow Graph Table (GDGT)
Global Dataflow Graph Table (GDGT)
Node#  Func  Apply level  Constr level  Insert level  Left In  Right In  Out
..     ...   .            .             .             ..       ..        ..
43     MUL   1            0             0             %1       %30       47
44     MUL   1            0             0             %2       %30       47
45     MUL   1            0             0             %3       %30       48
46     MUL   1            0             0             %4       %30       48
47     ADD   0            0             1             43       44        49
48     ADD   0            0             1             45       46        49
49     ADD   0            0             2             47       48        out
50     MUL   1            0             0             %1       %40       54
51     MUL   1            0             0             %2       %40       54
52     MUL   1            0             0             %3       %40       55
53     MUL   1            0             0             %4       %40       55
54     ADD   0            0             1             50       51        56
55     ADD   0            0             1             52       53        56
56     ADD   0            0             2             54       55        out
..     ...   .            .             .             ..       ..        ..
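To see how such a table describes a computation, the rows for nodes 43 to 49 can be evaluated with a small Python sketch. It assumes that an operand written `%k` is the immediate constant k and that a plain number refers to the output of another node; the dictionary layout and function names are hypothetical and ignore the level columns.

```python
import operator

# Rows 43-49 of the table: node -> (function, left input, right input).
GDGT = {
    43: ("MUL", "%1", "%30"),
    44: ("MUL", "%2", "%30"),
    45: ("MUL", "%3", "%30"),
    46: ("MUL", "%4", "%30"),
    47: ("ADD", 43, 44),
    48: ("ADD", 45, 46),
    49: ("ADD", 47, 48),
}

OPS = {"MUL": operator.mul, "ADD": operator.add}


def operand(ref, table, cache):
    """Resolve an operand: '%k' is taken as the constant k, otherwise a node id."""
    if isinstance(ref, str) and ref.startswith("%"):
        return int(ref[1:])
    return evaluate(ref, table, cache)


def evaluate(node, table=GDGT, cache=None):
    """Compute the token produced by a node, evaluating its producers first."""
    cache = {} if cache is None else cache
    if node not in cache:
        func, left, right = table[node]
        cache[node] = OPS[func](operand(left, table, cache),
                                operand(right, table, cache))
    return cache[node]


result = evaluate(49)   # (1*30 + 2*30) + (3*30 + 4*30)
```

Under these assumptions the four multiplications are independent and the additions form a two-level tree, which matches the insert levels 1 and 2 recorded in the table.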
Visualization of Compiler Graph
The next step
the compiler extracts from the GDGT two tables:
Dataflow Graph Description (DGD) table, that contains, for each node, the binary operation and interconnection codes for the Graph Setter of a Processing Subsystem
Initial Input Value (IIV) table, that contains the binary information about input program data tokens
Dataflow graph mapping
The presence of functionals:
permits the adoption of strategies that try to cluster parallelism exploitation
suggests handy ways to partition the dataflow graph into smaller, loosely connected graphs that can be run on the single platform-processors
D3AS general architecture
Reconfigurable Hardware System (RHS)
Capable of mapping and executing, in a completely asynchronous manner, dataflow graphs created with the hHLDS model.
Constituted by three subsystems
■ Actor Realization Subsystem (ARS)
Capable of creating a one-to-one correspondence between graph actors and Functional Units.
■ Token flow Realization Subsystem (TRS)
Implements the graph edges.
■ Graph Mapping Subsystem (GMS)
Devoted to storing the RHS context information.
■ ARS: constituted by N identical Multipurpose Functional Units (MPFUs).
■ TRS: constituted by 3 sets of N buffer registers and a crossbar switch interconnect.
■ GMS: constituted by a set of buffers and logic circuitry.
D3AS general architecture
Critical Parameters in the RHS design.
■ N_MPFU: the number of MPFUs constituting the ARS;
■ C_MPFU: the logical and functional complexity of the MPFUs;
■ INT_TRS: the type of interconnect for the TRS.
The number of MPFUs implementable on a VLSI device depends on:
■ the interconnect complexity;
■ the logical and functional complexity of the MPFU.
D3AS general architecture
RHS/D3AS Fundamental Building Block
Many-core Dataflow Processor (MDP)
A many-core chip replicating the D3AS general architecture, with n MPFUs interconnected via a non-blocking crossbar switch network.
D3AS general architecture
Architecture with globally pure dataflow model
N: number of graph actors
n: number of MPFUs in an MDP
The RHS is configured by interconnecting K = N/n MDPs through a 2nd-level non-blocking crossbar switch interconnection network.
D3AS general architecture with Hybrid Dataflow Model
N > n
The graph is partitioned into subgraphs, and the RHS is configured by interconnecting m = N/n MDPs through a 2nd-level message-passing interconnection network.
Dataflow graph edges between subgraphs mapped onto different MDPs are virtualized by messages routed through the network.
Communicating Dataflow Processes (CDP)
D3AS general architecture demonstrator
GIDEL board
[Figure: block diagram of the demonstrator, showing the Routing Subsystem, Kernel Subsystem, and Processing Subsystem. Labelled elements: Packet Assembler, Packet Deassembler, WK-recursive Message Manager, Destination List, Control Unit, Graph Table, Graph Configuration Table, Graph Setter, MPFU Interconnect, MPFU #1 to MPFU #n, Control Section, and the Token Out, Token_In A, and Token_In B ensemble buffers. Labelled signals: Enable Signals, Token_In A, Token_In B, Token_Out, Message_In, Message_Out, Interconnect Code, MPFU OP Code.]
[Figure: internal structure of an MPFU (Control, Operation Code Register, input latches with Enable In and Enable Out, LST test, validity test, ALU+MULT, output latch) and the connection of MPFU #1, #k, #m, and #n to the Token_in A buffer and to the Graph Setter]
Matrix Multiplication
Given two matrices A(n,n) and B(n,n), their product generates a matrix C(n,n) whose generic element is given by the following formula:
$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}, \qquad i, j = 1, \dots, n$$
Some results
Matrix Multiplication
we used two values of n: n = 32 and n = 64
Matrix Multiplication
we compared the performance of a platform-processor with that of an IA-32 Pentium IV
we measured performance in terms of CPI because our FPGA platform-processor executes an operation in 30 ns, versus 0.5 ns for the Pentium.
IA-32 Pentium IV vs D3AS (cycles per instruction)
      Pentium                      Platform-Processor
n     Products  Sums   Total      Products  Sums  Total
32    8192      7939   16131      -         -     1027
64    65561     64537  130098     4096      4108  8204
Zeroes of a function (f=x*x+3x-1.75)
assembly code generated compiling the C source code: 122 sequential assembly code lines
Zeroes of a function
our compiler generates a GDGT with only 28 micro-instructions organized in 12 sequential steps.
Node  Func  Apply level  Constr level  Insert level  Left input  Right input  Output
1     LST   0            0             0             26          -1%          5-5-6-3-17
2     LST   0            0             0             27          1%           3-21
3     ADD   0            0             0             1           2            4
4     DIV   0            3             0             3           2%           7-7-8-18-19-20-22
5     MUL   0            0             0             1           1            9
6     MUL   0            0             0             1           3%           9
7     MUL   0            0             0             4           4            10
8     MUL   0            0             0             4           3%           10
9     ADD   0            0             0             5           6            11
10    ADD   0            0             0             7           8            12
11    SUB   0            0             0             9           1.75%        13
12    SUB   0            0             0             10          1.75%        13
13    MUL   0            0             0             11          12           14-15-16
14    LT    0            0             0             13          0%           17-18
15    EQ    0            0             0             13          0%           19-20
16    GT    0            0             0             13          0%           30-22
17    ADD   0            0             0             1           14           30
18    ADD   0            0             0             14          4            28
19    ADD   0            0             0             4           15           30
20    ADD   0            0             0             15          4            28
21    ADD   0            0             0             2           16           28
22    ADD   0            0             0             16          4            30
23    SUB   0            0             0             30          28           24-25
24    GEQ   0            0             0             23          0.01%        26-27
25    LT    0            0             0             23          0.01%        29
26    ADD   0            0             0             30          24           1
27    ADD   0            0             0             28          24           2
28    MRG   0            0             0             18-20-21                 23-27-29
29    ADD   0            0             0             28          25           Out
30    MRG   0            0             0             17-19-22                 23-26
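The table appears to encode an interval-halving (bisection) search: two loop-start values -1 and 1, a midpoint obtained by an ADD followed by a DIV by 2, the product of the function values at one bound and at the midpoint compared with 0, and a termination test against 0.01. Read that way, a sequential Python sketch of the same computation looks as follows; this reading, the function names, and the choice of returning the midpoint are assumptions, not the authors' code.

```python
def f(x):
    """The function from the slide: f = x*x + 3x - 1.75."""
    return x * x + 3 * x - 1.75


def bisect(func, lo, hi, tol=0.01):
    """Halve [lo, hi] until it is narrower than tol; assumes a sign change inside."""
    if func(lo) * func(hi) > 0:
        raise ValueError("func must change sign on [lo, hi]")
    while hi - lo >= tol:
        mid = (lo + hi) / 2
        product = func(lo) * func(mid)
        if product < 0:        # the zero lies in [lo, mid]
            hi = mid
        elif product > 0:      # the zero lies in [mid, hi]
            lo = mid
        else:                  # func(mid) is exactly zero
            return mid
    return (lo + hi) / 2


root = bisect(f, -1.0, 1.0)
```

In the dataflow version the three comparisons (LT, EQ, GT) are evaluated side by side and only the branch whose condition holds produces tokens, whereas the sequential sketch tests them one after another.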
Future work
To evaluate which applications perform better on the architecture with the globally pure dataflow model and which on the one with the hybrid dataflow model.
How to generalize pipelining inside the MDP