Compiling High-Level Descriptions on a Heterogeneous System

José Gabriel de Figueiredo Coutinho
Department of Computing, Imperial College London

The Programming Challenge of Heterogeneous Architectures Workshop
University of Birmingham

July 2-3, 2009

http://www.hartes.org/

Overview

1. hArtes Project
2. Research
   a) Task Transformation
   b) Mapping Selection
   c) High-Level Synthesis
3. Harmonic toolchain
4. Challenges

Why Heterogeneous Systems?

Because...
• orders of magnitude faster than conventional single-core processors
• target computation-hungry applications:
  » financial modeling
  » pharmaceutical applications
  » simulation of real-life complex systems
• strategy: mix conventional processors with specialised processors

However... how to develop applications?
  » portability... new system, new application?
  » design exploration... how to decide the partitioning and mapping?
  » optimisation... how to exploit specialised processors (FPGAs, DSPs)?
  » control vs automation... how do developers interact with the compilation process?

1. hArtes Project - Consortium

Atmel Roma (Italy)

Faital (Italy)

Fraunhofer IGD (Germany)

Imperial College (U.K.)

INRIA (France)

Leaff (Italy)

Politecnico di Bari (Italy)

Politecnico di Milano (Italy)

Scaleo Chip (France)

Thales Communications (France)

Thomson (France)

TU Delft (Netherlands)

UP delle Marche (Italy)

Università di Ferrara (Italy)

Université d'Avignon (France)

15 partners in 5 countries

Scope

hArtes: Holistic Approach to Reconfigurable real Time Embedded Systems

www.hartes.org

[Figure: the hArtes Tool-Chain, with inputs labelled ".c source code" and "Algorithm Exploration Tools" and targets GPP, DSP and FPGA]

Applications

Enhanced in-car audio and video:
» Multichannel audio system
» Automatic Echo Cancellation (AEC)
» Automatic Speech and Speaker Recognition (ASR)
» Adaptive filtering
» Video Transcoding
» Intra-cabin communication

[Figure: Audio and Video Applications on top of Hardware Platforms (multi-purpose hardware)]

Hardware Platforms

Atmel Diopsis 940H Evaluation Board (ARM+DSP)

hArtes Hardware Platform (ARM+DSP+FPGA)

Toolchain

The hArtes toolchain is composed of three toolboxes:

1) Algorithm Exploration Toolbox
2) Design Space Exploration Toolbox
3) System Synthesis Toolbox

Mapping Selection

Algorithm Exploration Toolbox: SciLab

[Figure: SciLab algorithm exploration example, a Level-1 data flow diagram of an MP3 encoder.
Processes: P1 Psychoacoustic Analysis; P2 Sub Band Analysis; P3 MDCT; P4 Computation of bits required through noise allocation; P5 Bitstream Multiplexer; P6 Encoding loop controller.
Data stores: D5 Sub Band Samples; D6 Granule Information; D7 Frame Information; D8 Reservoir Information; D9 MPEG Information; D10 Wave Information; D13 Psychoacoustic parameters; D14 Quantized values; D15 Scalefactors; D16 MP3 Bitstream.
Tables: DTab-18 (used by process P4); DTab-22 (psychoacoustic analysis); DTab-23 (sub band analysis); DTab-24 (MDCT).
Input: PCM samples and encoding parameters. Output: frame of encoded MP3 bitstream.]

[Figure: hArtes flow from Physical Model (a partial differential equation in u(x, y, t) with coefficients T, ρ and μ) and Algorithm, through SCILAB, to SCILAB2C and the Design Exploration Toolbox]

Algorithm Exploration Toolbox: Nu-Tech

Thanks to the plug-in architecture, developers can write their own NUTs (NU-Tech Satellites) and immediately plug them into the graphical design environment.

hArtes Design Exploration Toolbox

NU-Tech Graphical Algorithm Exploration (GAE) is the hArtes platform for validating complex algorithms.

Design Space Exploration Toolbox

[Figure: stages shown are Profiling, Task Partitioning, Task Transformation and Data Representation Optimisation. The input source enters the toolbox and the stages exchange annotated C. Partners shown: Politecnico di Milano (Italy), TU Delft (Netherlands)]

System Synthesis Toolbox

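[Figure: annotated C enters Mapping Selection and Code Generation, which emit generic GPP code (C + macros), GPP Molen code, DSP C code and FPGA code. These go to the GPP compiler, the Molen compiler, the DSP compiler and C2VHDL respectively, giving three ELF objects and a bitstream. The Linker and Loader produce the executable code (ELF). Partners shown: Atmel Roma (Italy), TU Delft (Netherlands)]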

Accelerating an application

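Design Exploration and Synthesis

[Figure: a C description and a system description are the inputs. Partitioning produces tasks; Task Transformation produces implementations (T1_gpp1, T1_gpp2, T1_gpp3, T1_dsp1, T2_dsp2, T3_dsp3, T3_fpga2); Cost Estimation annotates them with costs (32, 34, 42, 43, 54); Mapping Selection picks one implementation per task (labels shown: T1 DSP2, T3 GPP, T4 DSP5)]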

2. a) Task transformation

What are task transformations?
• Source-to-source transformations
• Pattern matching on syntax or dataflow

Why use them?
• Compilers cannot include all optimisations
• Use knowledge of domain or platform experts
• Use to influence task mapping

How to use them?
• Write in C++ using ROSE framework: hard
• Write in our domain-specific language, CML: easier

Who writes them?
• Domain or platform experts
• Developers needing design-space exploration

CML for task transformations

Basic CML: 3 parts to a transform
• Pattern: syntax to match, label elements
• Conditions based on dataflow
• Resulting pattern to substitute

Proposed novel aspects of extended CML
• Systematic description of dataflow conditions
• Parameterised transforms
• Features for labelling subpatterns
• Probabilities for machine learning

Extend: CML code matching DFGs
• s1 -> s2 matches true dependence arc from s1 to s2
• s1 -/> s2 matches antidependence arc from s1 to s2
• s1 -@-> s2 matches output dependence arc from s1 to s2

Requirements: CML language

Aim: compact transformation description
• Describe transformations on
  » Abstract Syntax Tree (AST)
  » Data Flow Graph (DFG)
• Support transformations specific to
  » Application domain: embedded media
  » Target technology: CPU + DSP + FPGA
• Allow parameterisable transforms, e.g. unrolling factor

Interpretation
• Can change transform without recompilation
• Saves time, eases learning curve
• Can rapidly explore transform design space
• Customise existing transforms

Facilitate cost estimate: e.g. number of registers

CML example: replace multiply-by-n with shift

Replacing a multiply by a shift is usually an optimisation in hardware: lower area, greater speed.

    transform times2ToShift {
      pattern {
        expr(1) * n
      }
      conditions {
        n & (n-1) == 0
      }
      result {
        expr(1) << LOG2(n)
      }
    }

• Transform name: times2ToShift
• Pattern section: syntax pattern with labelled parts; here, an expression multiplied by n
• expr(1): labelled subexpression
• Conditions section: optional; only replace if all conditions are true
• Result section: what to replace the matched pattern with if the conditions apply; here, the labelled expression with a shift replacing the multiply
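The rewrite performed by times2ToShift can be sketched in plain C. This is only an illustration under assumptions: the helper names (`is_power_of_two`, `log2_pow2`, `times_or_shift`) are hypothetical and not part of CML, and unsigned operands are used so that the shift is well defined. Note that in C the condition needs explicit parentheses, because `==` binds tighter than `&`; how CML itself parses `n & (n-1) == 0` is not stated on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* True when n is a power of two: the slide's condition, plus n != 0. */
static int is_power_of_two(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Integer log2 of a power of two: the position of its single set bit. */
static unsigned log2_pow2(uint32_t n) {
    unsigned shift = 0;
    assert(is_power_of_two(n));
    while (n > 1) {
        n >>= 1;
        shift++;
    }
    return shift;
}

/* Hypothetical helper: multiply x by n, using a shift when the
 * condition holds and an ordinary multiply otherwise. */
static uint32_t times_or_shift(uint32_t x, uint32_t n) {
    if (is_power_of_two(n))
        return x << log2_pow2(n);
    return x * n;
}
```

With these definitions, `times_or_shift(5, 8)` takes the shift path and `times_or_shift(5, 6)` falls back to the multiply; both agree with `x * n`.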

Simple CML example

Eliminate addition with zero: Expr + 0 => Expr
Not always applicable in floating point (e.g. -0.0 + 0 = +0.0, so the rewrite changes the sign of zero)

CML:

    transform addZero {
      pattern { expr(1) + 0 }
      result  { expr(1) }
    }

C++ (visitor pattern):

    class AddZero : public Avisitor {
      Expr * result;
    public:
      void visit(Add * a) {
        // recurse to left-hand side; the rewritten operand is left in result
        a->getLhs()->accept(this);
        Expr * x = result;
        if (IntLiteral * il = dynamic_cast<IntLiteral*>(a->getRhs())) {
          if (il->getValue() == 0) {
            // pattern matched: replace the addition with its left-hand side
            result = x;
          } else {
            // non-zero literal: rebuild the addition with the literal operand
            result = new Add(x, il);
          }
        } else {
          a->getRhs()->accept(this);
          result = new Add(x, result);
        }
      }
    };

• CML: match any addition of zero and label the left-hand side; if the pattern matches, replace it with expr(1)
• C++: the pattern is matched in several stages, with the left-hand side held in x; if the pattern matches, replace it with x

CML Interpreter

CML:

    transform addZero {
      pattern { expr(a) + 0 }
      result  { expr(a) }
    }

[Figure: the CML parser turns the transform into a CML AST, Add(CMLExpr a, IntLiteral 0); the interpreter applies it to the source AST, whose nodes are SgAddOp, SgIntVal 1, SgAddOp, SgIntVal 2 and SgIntVal 0]

Interpret:
• Depth-first visit of source AST
• At each node, if the node matches the root of the CML pattern:
  » match the pattern depth-first, in postorder
  » save labelled nodes ("a" in the example)
  » exit at first mismatch
• If the pattern matches and the conditions apply, visit the result pattern to apply the result
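The interpreter steps can be illustrated with a minimal C sketch of the addZero transform on a toy expression tree. Everything here is an assumption made for illustration: the `Node` type and the function names are hypothetical and are not the ROSE or CML data structures, and the sketch rewrites children before testing the node itself, which is one possible visiting order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature AST: integer literals and binary additions. */
typedef enum { NODE_INT, NODE_ADD } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;               /* used when kind == NODE_INT */
    struct Node *lhs, *rhs;  /* used when kind == NODE_ADD */
} Node;

/* Pattern "expr(a) + 0": an addition whose right operand is literal 0. */
static int matches_add_zero(const Node *n) {
    return n->kind == NODE_ADD &&
           n->rhs->kind == NODE_INT && n->rhs->value == 0;
}

/* Depth-first rewrite. Returns the node that replaces n in its parent. */
static Node *eliminate_add_zero(Node *n) {
    assert(n != NULL);
    if (n->kind == NODE_ADD) {
        n->lhs = eliminate_add_zero(n->lhs);
        n->rhs = eliminate_add_zero(n->rhs);
        if (matches_add_zero(n))
            return n->lhs;   /* result pattern: the labelled expr(a) */
    }
    return n;
}

/* Evaluate a tree, to check that a rewrite preserves its value. */
static int eval(const Node *n) {
    return n->kind == NODE_INT ? n->value : eval(n->lhs) + eval(n->rhs);
}
```

Applied to the tree 1 + (2 + 0), the inner addition matches and is replaced by its left operand, leaving 1 + 2.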

Ray tracing: Design Space Exploration

[Figure: tree of transform sequences; each node shows the last transform applied and the run time in seconds. Timed nodes: Start 46.0; Simple parallel 23.3; Simple parallel 23.0; Simple parallel 22.6; Simple parallel 22.2; Pixel-cyclic parallel 20.1. Intermediate nodes: Loop interchange (twice) and Loop coalesce]

• Start: simple, sequential loop
• Add transforms to aid parallelisation
• Best result from pixel-cyclic parallel

Loop coalescing

    transform loopCoalesce {
      pattern {
        for(var(0)=0;var(0)<expr(1);var(0)++){
          for(var(2)=0;var(2)<expr(3);var(2)++){
            stmt(4);
          }
        }
      }
      result {
        // single loop with new variable nv
        // range from 0 to product of trip
        // counts of original loops
        for(int nv=0;nv<expr(1)*expr(3);nv++){
          // generate variable values
          // in terms of nv
          // note: not strength-reduced
          var(0) = nv / expr(3);
          var(2) = nv % expr(3);
          // the original body
          stmt(4);
        }
      }
    }

• Replace loop nest with single loop; it should run in the same order as the original
• Declare new variable to control replacement loop
• Synthesise old variables in terms of the new variable; this allows the body to be copied unmodified
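A small C sketch can show that the coalesced loop visits the same iterations in the same order as the loop nest. The function names and the recorded value `i * 1000 + j` are hypothetical, chosen only so that each (i, j) pair is distinguishable; the old indices are recovered by dividing and taking the remainder by the inner trip count.

```c
#include <assert.h>

/* Original loop nest: record the visit order as i * 1000 + j. */
static int visit_nested(int rows, int cols, int *out) {
    int count = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            out[count++] = i * 1000 + j;
    return count;
}

/* Coalesced version: one loop over nv, with the old indices
 * synthesised from nv so that the body is unchanged. */
static int visit_coalesced(int rows, int cols, int *out) {
    int count = 0;
    assert(rows >= 0 && cols >= 0);
    for (int nv = 0; nv < rows * cols; nv++) {
        int i = nv / cols;   /* outer index */
        int j = nv % cols;   /* inner index */
        out[count++] = i * 1000 + j;
    }
    return count;
}
```

The division and remainder in every iteration are the cost the slide's comment refers to as "not strength-reduced".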

Experimental work: combine with model-based transforms

• CML transforms are pattern-based: match syntax or dataflow patterns
• Model-based transforms: map to underlying mathematical model + solution method
• Combine pattern-based with model-based: simplify model-based transforms (transform code into the preferred input)

Experimental work: combine with verification framework

• Design verification flow is based on symbolic simulation and equivalence checking
• The symbolic simulation results (outputs) from source and target code are compared using an equivalence checker (Yices)
• Limitations:
  » subset of C
  » integer types only
  » loop count constant

2. b) Mapping Selection

Overall goal
Given an application, find the best implementation for a heterogeneous computing system such that the execution time is minimised

Proposed techniques
• Integrated mapping and scheduling technique
• Multiple neighbourhood functions
• Multi-loop parallelisation

Mapping Selection: Design Flow

Tabu search
• Generate neighbours iteratively
• Minimise processing time

Mapping criteria
• Implementations and costs associated with each task
• Available processing elements
• Communication cost
• Configuration cost

[Figure: tasks and the architecture description feed task mapping and scheduling; tabu search proposes a mapping and scheduling solution, and the processing time estimator returns its overall processing time]

Integrated technique

Clustering + Mapping + Scheduling
• Integrated in one neighbourhood function
• Move tasks between processing elements

Extended solution space
• Contains good solutions

Example task graph: tk1 -> tk2 (df12), tk1 -> tk3 (df13), tk3 -> tk4 (df34)

tk1 : {t11=100, t12=1000}
tk2 : {t21=400, t22=200}
tk3 : {t31=2000, t32=400}
tk4 : {t41=100, t42=1000}

df12=10
df13=30
df34=20

[Figure: schedules of tk1 to tk4 on CPU and FPGA, with idle slots marked]
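The cost data above can be turned into a small processing-time estimate. The following C sketch rests on several assumptions that the slide does not state: the second index of each time is taken to mean 1 = CPU and 2 = FPGA, a transfer cost is charged only when producer and consumer sit on different processing elements, and tasks are scheduled in index order. The exhaustive search merely stands in for tabu search because there are only 16 mappings; all function names are hypothetical.

```c
#include <assert.h>

#define NUM_TASKS 4
#define NUM_PES   2   /* 0 = CPU, 1 = FPGA (assumed meaning of the index) */

/* Execution times from the slide: exec_time[task][pe]. */
static const int exec_time[NUM_TASKS][NUM_PES] = {
    {100, 1000}, {400, 200}, {2000, 400}, {100, 1000}
};

/* comm[p][s]: transfer cost on arc p -> s, or -1 if there is no arc. */
static const int comm[NUM_TASKS][NUM_TASKS] = {
    {-1, 10, 30, -1},
    {-1, -1, -1, -1},
    {-1, -1, -1, 20},
    {-1, -1, -1, -1}
};

/* Overall processing time of one mapping. Tasks are scheduled in
 * index order, which is a topological order of this graph. */
static int processing_time(const int pe_of[NUM_TASKS]) {
    int pe_free[NUM_PES] = {0, 0};
    int finish[NUM_TASKS];
    int makespan = 0;
    for (int t = 0; t < NUM_TASKS; t++) {
        int pe = pe_of[t];
        assert(pe >= 0 && pe < NUM_PES);
        int start = pe_free[pe];
        for (int p = 0; p < t; p++) {
            if (comm[p][t] < 0) continue;
            int ready = finish[p] + (pe_of[p] != pe ? comm[p][t] : 0);
            if (ready > start) start = ready;
        }
        finish[t] = start + exec_time[t][pe];
        pe_free[pe] = finish[t];
        if (finish[t] > makespan) makespan = finish[t];
    }
    return makespan;
}

/* Try all 2^4 mappings; store the best one and return its time. */
static int best_mapping(int best_pe_of[NUM_TASKS]) {
    int best = -1;
    for (int mask = 0; mask < (1 << NUM_TASKS); mask++) {
        int pe_of[NUM_TASKS];
        for (int t = 0; t < NUM_TASKS; t++) pe_of[t] = (mask >> t) & 1;
        int time = processing_time(pe_of);
        if (best < 0 || time < best) {
            best = time;
            for (int t = 0; t < NUM_TASKS; t++) best_pe_of[t] = pe_of[t];
        }
    }
    return best;
}
```

Under these assumptions the all-CPU mapping takes 2600 time units, while the best mapping places only tk3 on the FPGA and takes 650. A real mapping problem has far too many candidates to enumerate, which is where a neighbourhood-based search such as tabu search comes in.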

Multiple neighbourhood functions

Multiple neighbourhood functions
• Increase diversification
• Search for better solutions

Parallel search
• Multi-processor systems

[Figure: the solution space, with moves (1st move, 2nd move) leading from the initial solution towards the optimal solution; each point is a mapping and scheduling solution, here tasks tk1 to tk6 on PE1 and PE2]

Experiments (80 – 112 tasks)
• FIR filtering
• Matrix multiplication
• Hidden Markov model decoding
• BGM interest rate model

Legend:
• INT, TABU [Porto, 1995]
• SEP, TABU [Wiangtong, 2005]
• INT, TABU, MultNF [This work]

Multi-loop parallelisation

Find the best unrolling factor for each loop

Iterative approach
Unrolling configuration
• Unrolling factors of all loops

    for (.....) { fun(...); }

[Figure: iterative flow for an application with multiple loops. Unrolling configurations generation; for each configuration, loop unrolling and fission, task graph generation, mapping and scheduling, and quality score calculation, giving the configuration qualities; configuration selection. If the termination condition is not reached (No), new unrolling configurations are generated; otherwise (Yes) the best unrolling configuration is output]
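To make the notion of an unrolling factor concrete, here is a minimal C sketch of a loop unrolled by a factor of 4. The factor, the body `fun` and the function names are hypothetical, chosen for illustration; the sketch shows only the unrolling step, not fission or the search over configurations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop body standing in for fun(...). */
static int fun(int x) { return 3 * x + 1; }

/* Original loop: one call per iteration. */
static long sum_rolled(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += fun(data[i]);
    return sum;
}

/* Unrolled by 4: four independent partial sums per iteration,
 * followed by a remainder loop for the last n % 4 elements. */
static long sum_unrolled4(const int *data, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    assert(data != NULL || n == 0);
    for (; i + 4 <= n; i += 4) {
        s0 += fun(data[i]);
        s1 += fun(data[i + 1]);
        s2 += fun(data[i + 2]);
        s3 += fun(data[i + 3]);
    }
    for (; i < n; i++)
        s0 += fun(data[i]);
    return s0 + s1 + s2 + s3;
}
```

The four calls in the unrolled body do not depend on each other, so they can become separate tasks in the task graph; a larger factor exposes more parallelism at the cost of more resources, which is the trade-off the iterative search explores.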

Loops Results

• IWR: speech recognition
• SUSAN: corner detection for image processing
• N-Body: particle modeling

2. c) High-Level Synthesis

Existing work: behavioural or structural. Our work: Haydn.

             Behavioural               Structural
Benefits     rapid development         implement non-obvious designs
             high maintainability      more control over optimisation
Drawbacks    difficult to control      low productivity
             poor error management     poor maintainability

R1: Rapid Development
R2: Design Exploration
R3: Extensibility
R4: Manual Control

Haydn interpretation rules

    {
      delta = b*b - ((a*c) << 2);
      if (delta > 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }

[Figure: two interpretations of the same code. Behavioural interpretation: a graph of the operations on a, b and c that produce delta and num_sol, with true/false branches. Structural interpretation (Handel-C): the corresponding datapath with multiplexers, with operations labelled "executed at cycle 1" and "executed at cycle 2"]
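For reference, the behavioural description can be written as an ordinary C function. This is a sketch under assumptions: the code appears to classify a quadratic a·x² + b·x + c by the sign of its discriminant, the function names are hypothetical, and `4 * a * c` is used in place of `(a*c) << 2` because left-shifting a negative signed value is undefined behaviour in C.

```c
#include <assert.h>

/* Number of real solutions of a*x^2 + b*x + c = 0, decided by the
 * sign of the discriminant delta = b^2 - 4ac. */
static int num_solutions(int a, int b, int c) {
    int delta = b * b - 4 * a * c;
    if (delta > 0) return 2;
    if (delta == 0) return 1;
    return 0;
}

/* Usage check with three small quadratics. */
static void check_num_solutions(void) {
    assert(num_solutions(1, 3, 2) == 2);   /* delta = 1  */
    assert(num_solutions(1, 2, 1) == 1);   /* delta = 0  */
    assert(num_solutions(1, 0, 1) == 0);   /* delta = -4 */
}
```

Nothing in this version says when each operation executes; that timing is what the structural interpretation adds.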

Rapid development

[Figure: unscheduling (behavioural interpretation) gives the operation graph; scheduling under constraints gives the pipelined code below; synthesis (structural interpretation) gives a datapath with two pipelined multipliers, pmult[0] and pmult[1]. Stages 1-7: multiplications; stage 8: tmp0 and tmp1 (<< 2); stage 9: tmp2 (subtraction); stage 10: delta and num_sol. All in cycle 1]

    par {
      // ================== [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[1].in(a,c);
      // ================== [stage 8]
      tmp0 = pipe_mult[0].q;
      tmp1 = pipe_mult[1].q << 2;
      // ================== [stage 9]
      tmp2 = tmp0 - tmp1;
      // ================== [stage 10]
      if (tmp2 > 0) num_sol = 2;
      else if (tmp2 == 0) num_sol = 1;
      else num_sol = 0;
      delta = tmp2;
    }

Design exploration

    par {
      { // ================= [stage 1]
        pipe_mult[0].in(b,b);
        pipe_mult[0].in(a,c);
      }
      ....
    }

[Figure: the design with two pipelined multipliers (pmult[0] and pmult[1]; stages 1-7, 8, 9 and 10; cycle 1) is rescheduled under new constraints and synthesised (structural interpretation) into a design that uses only pmult[0], shared over cycle 1 and cycle 2 (stages 1-4, 5 and 6)]

Abstraction

    par {
      { // ================= [stage 1]
        pipe_mult[0].in(b,b);
        pipe_mult[0].in(a,c);
      }
      { // ================== [stage 4]
        delay;
        tmp0 = pipe_mult[0].q;
      }
      { // ================== [stage 5]
        tmp1 = pipe_mult[0].q << 2;
        tmp2 = tmp0 - tmp1;
      }
      { // ================== [stage 6]
        if (tmp2 > 0) num_sol = 2;
        else if (tmp2 == 0) num_sol = 1;
        else num_sol = 0;
        delta = tmp2;
      }
    }

[Figure: abstraction maps the scheduled code back to the operation graph for delta and num_sol]

Design quality

    {
      delta = b*b - ((a*c) << 2);
      if (delta > 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }

[Figure: unscheduling followed by scheduling under constraints, with user intervention: manual scheduling of the << 2 operation]

Unscheduling

[Figure: unscheduling shown in four steps, 1 to 4]

Haydn transformations: interactive mode

    @resources.set (*; UNITS:6);
    {
      @HLS.run(II:1);
      // original code
    }

    @resources.set (*; UNITS:6);
    {
      // transformed code
    }

Haydn-C: GARCH walk kernel

[Figure: Haydn-C source with the constraints and the kernel specification marked]

Design exploration: batch mode

[Figure: constraints]

• 5 multiplications:
  » 1 cycle per result => 5 multipliers
  » 2 cycles per result => 3 multipliers
  » 5 cycles per result => 1 multiplier
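The multiplier counts above follow from a ceiling division: with 5 multiplications per result and one result every II cycles, at least ceil(5 / II) multipliers are needed. A minimal C sketch, assuming each multiplier is pipelined and can start one multiplication per cycle (the function name is hypothetical):

```c
#include <assert.h>

/* Minimum number of multipliers needed to sustain one result every
 * `interval` cycles: ceil(ops_per_result / interval). */
static int multipliers_needed(int ops_per_result, int interval) {
    assert(ops_per_result >= 0 && interval > 0);
    return (ops_per_result + interval - 1) / interval;
}
```

For 5 multiplications this gives 5, 3 and 1 multipliers at intervals of 1, 2 and 5 cycles, matching the figures on the slide.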

Evaluation: speed vs area

Initiation interval vs area

3. Harmonic Toolchain: Design Flow

[Figure: Harmonic design flow.
Inputs: C source files and a hardware description.
Task Partitioning splits the application into tasks (task A, task B); later stages can request a new partition.
Task Transformation Engine: applies application and domain specific transformations to the input tasks and produces implementations. Transforms are given as CML descriptions (pattern to match, matching conditions, result pattern) or as ROSE C++ transforms, drawn from generic transform libraries (GPP transforms, DSP transforms, FPGA transforms), with input task parameters.
Mapping Selection assigns tasks to processing elements, for example:

    task A1 (FPGA), task A2 (FPGA), task A3 (DSP)
    task B1 (GPP)
    task B2 (DSP)
    ...

Back end: C code (specific to each PE) goes to the GPP compiler and the DSP compiler; Haydn (HLS) produces Handel-C (a cycle-accurate description) for FPGA synthesis.
Outputs: binaries and bitstream, with runtime support.]

Tools and Annotations

Haydn (High-Level Synthesis)

    {
      #pragma haydn pipeline II(1)
      s = SQRT(a);
      y = (s + b) * (c + d);
    }

    par {
      sqrt_v4.in(a);
      adder_v4[0].in(sqrt_v4.res, b);
      adder_v4[1].in(c, d);
      mult_v4.in(adder_v4[0].res, adder_v4[1].res);
      y = mult_v4.res;
    }

Task Partitioning

    #pragma map cluster
    void d0d2Sci2CMixRealChTmpd2(...) {
      ...
      ssOpStarsa1(a,x,t1);
      ...
      ssOpStarsa2(b,y,t2);
      ...
      ssOpPlusaa1(t1,t2,z);
    }

Source-Files

    void filter(...) {
      {
        #pragma omp section
        {
          #pragma map call_hw \
            impl(MAGIC, 14) \
            param(x,1000,r) \
            param(h,100, rw)
          filter(x, h);
        }

        #pragma omp section
        {
          #pragma map call_hw \
            impl(ARM, 15) \
            param(y,2000,r) \
            param(i,50, rw)
          filter2(y, i);
        }
      }
    }

Mapping Selection

    void foo(...) {
      ...
      #pragma omp parallel sections num_threads(2)
      {
        #pragma omp section
        {
          #pragma map call_hw \
            impl(MAGIC, 14) \
            param(x,1000,r) \
            param(h,100, rw)
          filter(x, h);
        }

        #pragma omp section
        {
          #pragma map call_hw \
            impl(ARM, 15) \
            param(y,2000,r) \
            param(i,50, rw)
          filter2(y, i);
        }
      }
    }

tasks/implementations

ROSE source infrastructure

• Software analysis and optimization for scientific applications
• Tool for building source-to-source translators
• Support for C, C++, Fortran, binary
• Loop optimizations
• Lab and academic use
• Software engineering
• Performance analysis
• Domain-specific analysis and optimizations
• Development of new optimization approaches

http://rosecompiler.org

4. Challenges

Theoretical:
• define and meet global constraints (application/platform)
• correctness: verify transformation results
• effective combination of static and dynamic analysis

Practical:
• reuse legacy code
• incremental approach for using toolchain
• create modular toolchain that can evolve with new applications and platforms

5. Summary

1. hArtes Project: complete toolchain targeting heterogeneous systems
2. Research
   • Task Transformations: CML language for describing transformations
   • Mapping Selection: integrated approach with multiple neighbourhood functions
   • High-Level Synthesis (Haydn): combined behavioural and structural approach
3. Harmonic toolchain: modular, to enable customisation and technology evolution
