Compiling High-Level Descriptions on a Heterogeneous System
José Gabriel de Figueiredo Coutinho
Department of Computing, Imperial College London
The Programming Challenge of Heterogeneous Architectures Workshop
University of Birmingham, July 2-3, 2009
Overview
1. hArtes Project
2. Research
 a) Task Transformation
 b) Mapping Selection
 c) High-Level Synthesis
3. Harmonic toolchain
4. Challenges
Why Heterogeneous Systems?
Because...
• orders of magnitude faster than conventional single-core processors
• target computation-hungry applications:
 » financial modeling
 » pharmaceutical applications
 » simulation of real-life complex systems
• strategy: mix conventional processors with specialised processors
However... how to develop applications?
 » portability... new system, new application?
 » design exploration... how to decide the partitioning and mapping?
 » optimisation... how to exploit specialised processors (FPGAs, DSPs)?
 » control vs automation... how do developers interact with the compilation process?
1. hArtes Project - Consortium
Atmel Roma (Italy)
Faital (Italy)
Fraunhofer IGD (Germany)
Imperial College (U.K.)
INRIA (France)
Leaff (Italy)
Politecnico di Bari (Italy)
Politecnico di Milano (Italy)
Scaleo Chip (France)
Thales Communications (France)
Thomson (France)
TU Delft (Netherlands)
UP delle Marche (Italy)
Università di Ferrara (Italy)
Université d'Avignon (France)
15 partners in 5 countries
Scope
hArtes: Holistic Approach to Reconfigurable real-Time Embedded Systems
www.hartes.org
[Figure: hArtes Tool-Chain. Labels: Algorithm Exploration Tools, .c source code, GPP, DSP, FPGA]
Applications
Enhanced in-car audio and video:
 » Multichannel audio system
 » Automatic Echo Cancellation (AEC)
 » Automatic Speech and Speaker Recognition (ASR)
 » Adaptive filtering
 » Video Transcoding
 » Intra-cabin communication
[Figure labels: Hardware Platforms (multi-purpose hardware); Audio and Video Applications]
Hardware Platforms
• Atmel Diopsis 940H Evaluation Board (ARM+DSP)
• hArtes Hardware Platform (ARM+DSP+FPGA)
Toolchain
The hArtes toolchain is composed of three toolboxes:
1) Algorithm Exploration Toolbox
2) Design Space Exploration Toolbox
3) System Synthesis Toolbox
[Figure label: Mapping Selection]
Algorithm Exploration Toolbox: SciLab
[Figure: Level-1 data flow diagram of an MP3 encoder. Processes: P1 Psychoacoustic Analysis, P2 Sub Band Analysis, P3 MDCT, P4 Computation of bits required through noise allocation, P5 Bitstream Multiplexer, P6 Encoding loop controller. Data stores: D5 Sub Band Samples, D6 Granule Information, D7 Frame Information, D8 Reservoir Information, D9 MPEG Information, D10 Wave Information, D13 Psychoacoustic parameters, D14 Quantized values, D15 Scalefactors, D16 MP3 Bitstream, and table stores DTab-18, DTab-22, DTab-23 and DTab-24. Input: PCM samples and encoding parameters. Output: frame of encoded MP3 bitstream.]
[Figure: a physical model (an equation in u(x, y, t) with parameters T, ρ and μ) and an algorithm are explored in SCILAB, then passed to SCILAB2C and the hArtes Design Exploration Toolbox.]
Algorithm Exploration Toolbox: Nu-Tech
• Thanks to the plug-in architecture, developers can write their own NUTs (NU-Tech satellites) and immediately plug them into the graphical design environment.
• NU-Tech Graphical Algorithm Exploration (GAE) is the hArtes platform for validating complex algorithms.
[Figure label: hArtes Design Exploration Toolbox]
Design Space Exploration Toolbox
[Figure: Design Space Exploration Toolbox flow. Stages: Profiling, Task Partitioning, Task Transformation, Data Representation Optimisation. The stages start from the input source and exchange Annotated C. Contributors shown: Politecnico di Milano (Italy), Imperial College (U.K.), TU Delft (Netherlands).]
System Synthesis Toolbox
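[Figure: System Synthesis Toolbox flow. Annotated C passes through Mapping Selection and Code Generation, which emit generic GPP code (C + macros), GPP Molen code, DSP C code and FPGA code. The back ends (GPP compiler, Molen, DSP compiler, C2VHDL) produce three ELF objects and a bitstream, which the linker and loader combine into executable code (ELF). Contributors shown: Imperial College (U.K.), Atmel Roma (Italy), TU Delft (Netherlands).]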
Accelerating an application
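Design Exploration and Synthesis
[Figure: a C description is partitioned into tasks. Task transformation derives alternative implementations for each task (T1_gpp1, T1_gpp2, T1_gpp3, T1_dsp1, T2_dsp2, T3_dsp3, T3_fpga2), each with an estimated cost (values shown: 32, 34, 42, 43, 54). Mapping selection uses cost estimation and the system description to assign tasks T1 to T4 to GPP and DSP processing elements.]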
2. a) Task transformation
What are task transformations?
• Source-to-source transformations
• Pattern matching on syntax or dataflow
Why use them?
• Compilers cannot include all optimisations
• Use knowledge of domain or platform experts
• Use to influence task mapping
How to use them?
• Write in C++ using the ROSE framework: hard
• Write in our domain-specific language, CML: easier
Who writes them?
• Domain or platform experts
• Developers needing design-space exploration
CML for task transformations
Basic CML: 3 parts to a transform
• Pattern: syntax to match, label elements
• Conditions based on dataflow
• Resulting pattern to substitute
Proposed novel aspects of extended CML
• Systematic description of dataflow conditions
• Parameterised transforms
• Features for labelling subpatterns
• Probabilities for machine learning
Extension: CML code matching DFGs
• s1 -> s2 matches a true dependence arc from s1 to s2
• s1 -/> s2 matches an antidependence arc from s1 to s2
• s1 -@-> s2 matches an output dependence arc from s1 to s2
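The three kinds of dependence arc can be illustrated with a small sketch. The statement model below (sets of variables read and written) and the function name `dependences` are assumptions made for illustration; they are not part of CML, and the check ignores intervening statements.

```python
from typing import FrozenSet, List, NamedTuple


class Stmt(NamedTuple):
    """A statement summarised by the variables it reads and writes (hypothetical model)."""
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def dependences(s1: Stmt, s2: Stmt) -> List[str]:
    """Classify dependence arcs from s1 to s2, where s1 executes before s2."""
    arcs = []
    if s1.writes & s2.reads:
        arcs.append("true")    # s2 reads a value that s1 wrote
    if s1.reads & s2.writes:
        arcs.append("anti")    # s2 overwrites a value that s1 read
    if s1.writes & s2.writes:
        arcs.append("output")  # both statements write the same variable
    return arcs


# s1: x = a + b;   s2: y = x * 2;   s3: a = 0;   s4: x = 1;
s1 = Stmt("s1", frozenset({"a", "b"}), frozenset({"x"}))
s2 = Stmt("s2", frozenset({"x"}), frozenset({"y"}))
s3 = Stmt("s3", frozenset(), frozenset({"a"}))
s4 = Stmt("s4", frozenset(), frozenset({"x"}))

print(dependences(s1, s2))  # arc that s1 -> s2 would match
print(dependences(s1, s3))  # arc that s1 -/> s3 would match
print(dependences(s1, s4))  # arc that s1 -@-> s4 would match
```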
Requirements: CML language
• Aim: compact transformation description
• Describe transformations on
 » Abstract Syntax Tree (AST)
 » Data Flow Graph (DFG)
• Support transformations specific to
 » Application domain: embedded media
 » Target technology: CPU + DSP + FPGA
• Allow parameterisable transforms, e.g. unrolling factor
• Interpretation
 » Can change a transform without recompilation
 » Saves time, eases learning curve
 » Can rapidly explore the transform design space
 » Customise existing transforms
• Facilitate cost estimates, e.g. number of registers
CML example: replace multiply-by-n with shift
Replacing multiplies by shifts is usually an optimisation in hardware: lower area, greater speed.

    transform times2ToShift {
      pattern {
        expr(1) * n
      }
      conditions {
        (n & (n-1)) == 0
      }
      result {
        expr(1) << LOG2(n)
      }
    }

• Transform name: times2ToShift
• Pattern section: syntax pattern with labelled parts; here, an expression multiplied by n
• expr(1): labelled subexpression
• Conditions section: optional; only replace if all conditions are true
• Result section: what to replace the matched pattern with if the conditions apply; here, the labelled expression with a shift replacing the multiply
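The condition and the result of this transform can be checked with a short sketch. The function names are hypothetical, and the `n > 0` guard is an added assumption that the slide does not state.

```python
def is_power_of_two(n: int) -> bool:
    """Condition of the transform: n has exactly one bit set."""
    # The n > 0 guard is an assumption; n = 0 would otherwise pass the bit test.
    return n > 0 and (n & (n - 1)) == 0


def times_to_shift(x: int, n: int) -> int:
    """Compute x * n with a shift when the condition holds, else with a multiply."""
    if is_power_of_two(n):
        return x << (n.bit_length() - 1)  # bit_length() - 1 equals log2(n) here
    return x * n


for n in (1, 2, 8, 12):
    print(n, is_power_of_two(n), times_to_shift(5, n))
```

The explicit parentheses around `n & (n - 1)` matter in C-like languages, where `==` binds more tightly than `&`.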
Simple CML example
Eliminate addition with zero
• Expr + 0 => Expr
• Not always applicable in floating point (e.g. -0.0 + 0 = +0.0, so the sign of zero changes)

CML:

    transform addZero {
      pattern { expr(1) + 0 }
      result { expr(1) }
    }

C++ (visitor pattern):

    class AddZero : public Avisitor {
      Expr * result;
    public:
      void visit(Add * a) {
        // recurse to left-hand side
        a->getLhs()->accept(this);
        Expr * x = result;
        if (IntLiteral * il = dynamic_cast<IntLiteral*>(a->getRhs())) {
          if (il->getValue() == 0) {
            // pattern matched: replace the addition with x
            result = x;
          } else {
            // non-zero literal: rebuild the addition unchanged
            result = new Add(x, il);
          }
        } else {
          // right-hand side is not a literal: recurse into it
          a->getRhs()->accept(this);
          result = new Add(x, result);
        }
      }
    };

• Match any addition to zero; label the left-hand side (expr(1) in CML, x in C++)
• The C++ version matches the pattern in several stages
• If the pattern matched, replace with expr(1)/x
CML Interpreter
CML:

    transform addZero {
      pattern { expr(a) + 0 }
      result { expr(a) }
    }

[Figure: the CML parser turns the transform into a CML AST (Add with children CMLExpr a and IntLiteral 0). The interpreter takes the CML AST and the source AST (SgAddOp nodes over SgIntVal 1, SgIntVal 2 and SgIntVal 0).]
Interpret:
• Depth-first visit of source AST
• At each node, if the node matches the root of the CML pattern:
 » Match pattern in depth-first postorder
 » Save labelled nodes ("a" in example)
 » Exit at first mismatch
• If patterns match and conditions apply:
 » Visit result pattern to apply result
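The interpretation steps can be sketched on a toy AST. The tuple encoding (`"add"`, `"int"`, `"var"`, `"label"`) and the function names are assumptions for illustration; the real interpreter works on ROSE nodes such as SgAddOp. This sketch also rewrites children before trying the transform at a node, which is a simplification.

```python
ADD_ZERO_PATTERN = ("add", ("label", "a"), ("int", 0))   # expr(a) + 0
ADD_ZERO_RESULT = ("label", "a")                          # expr(a)


def match(pattern, node, bindings):
    """Match pattern against node, saving labelled subtrees in bindings."""
    if pattern[0] == "label":            # a labelled expression matches any subtree
        bindings[pattern[1]] = node
        return True
    if pattern[0] != node[0] or len(pattern) != len(node):
        return False                     # exit at first mismatch
    if pattern[0] in ("int", "var"):     # leaves carry a value or a name
        return pattern[1] == node[1]
    return all(match(p, n, bindings) for p, n in zip(pattern[1:], node[1:]))


def build(result, bindings):
    """Instantiate the result pattern from the saved labelled subtrees."""
    if result[0] == "label":
        return bindings[result[1]]
    if result[0] in ("int", "var"):
        return result
    return (result[0],) + tuple(build(r, bindings) for r in result[1:])


def apply_transform(node, pattern, result):
    """Rewrite the children first, then try the transform at this node."""
    if node[0] not in ("int", "var"):
        node = (node[0],) + tuple(apply_transform(c, pattern, result) for c in node[1:])
    bindings = {}
    if match(pattern, node, bindings):
        return build(result, bindings)
    return node


# (x + 0) + (2 + 0)
source = ("add", ("add", ("var", "x"), ("int", 0)), ("add", ("int", 2), ("int", 0)))
print(apply_transform(source, ADD_ZERO_PATTERN, ADD_ZERO_RESULT))
```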
Ray tracing: Design Space Exploration
[Figure: tree of transform sequences. Key: each node shows the last transform applied and the execution time in seconds. Nodes: Start 46.0; Simple parallel 23.3; Simple parallel 23.0; Loop interchange (twice); Loop coalesce; Simple parallel 22.6; Simple parallel 22.2; Pixel-cyclic parallel 20.1.]
• Start: simple, sequential loop
• Add transforms to aid parallelisation
• Best result from pixel-cyclic parallel
Loop coalescing

    transform loopCoalesce {
      pattern {
        for(var(0)=0;var(0)<expr(1);var(0)++){
          for(var(2)=0;var(2)<expr(3);var(2)++){
            stmt(4);
          }
        }
      }
      result {
        // single loop with new variable nv
        // range from 0 to product of trip
        // counts of original loops
        for(int nv=0;nv<expr(1)*expr(3);nv++){
          // generate variable values
          // in terms of nv
          // note: not strength-reduced
          var(0) = nv / expr(3);
          var(2) = nv % expr(3);
          // the original body
          stmt(4);
        }
      }
    }

• Replace the loop nest with a single loop
• Should run in the same order as the original
• Declare a new variable to control the replacement loop
• Synthesise the old variables in terms of the new variable
• This allows the body to be copied unmodified
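A quick way to see that coalescing preserves the iteration order is to record the index pairs visited by both forms. This is a minimal sketch with hypothetical function names; the inner trip count `m` is the divisor used to rebuild both indices.

```python
def nested(n, m):
    """Original loop nest: record the (i, j) pairs in execution order."""
    order = []
    for i in range(n):
        for j in range(m):
            order.append((i, j))
    return order


def coalesced(n, m):
    """Single loop over n * m iterations; i and j are rebuilt from nv."""
    order = []
    for nv in range(n * m):
        i = nv // m   # outer index: quotient by the inner trip count
        j = nv % m    # inner index: remainder by the inner trip count
        order.append((i, j))
    return order


print(nested(2, 3) == coalesced(2, 3))
```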
Experimental work: combine with model-based transforms
• CML transforms are pattern-based
 » Match syntax or dataflow patterns
• Model-based patterns
 » Map to an underlying mathematical model + solution method
• Combine pattern-based with model-based
 » Simplify model-based (transform into preferred input)
Experimental work: combine with verification framework
• Design verification flow is based on symbolic simulation and equivalence checking
• The symbolic simulation results (outputs) from source and target code are compared using an equivalence checker (Yices)
• Limitations
 » subset of C
 » integer types only
 » loop count constant
2. b) Mapping Selection
Overall goal
• Given an application, find the best implementation for a heterogeneous computing system such that the execution time is minimised
Proposed techniques
• Integrated mapping and scheduling technique
• Multiple neighbourhood functions
• Multi-loop parallelisation
Mapping Selection: Design Flow
Tabu search
• Generate neighbours iteratively
• Minimise processing time
Mapping criteria
• Implementations and costs associated with each task
• Available processing elements
• Communication cost
• Configuration cost
[Figure: tasks and an architecture description feed task mapping and scheduling. Tabu search proposes a mapping and scheduling solution; the processing time estimator returns its overall processing time. The output is a mapping and scheduling solution.]
Integrated technique
Clustering + Mapping + Scheduling
• Integrated in one neighbourhood function
• Move tasks between processing elements
Extended solution space
• Contains good solutions
Example: task graph with tasks tk1 to tk4 and data flows df12, df13 and df34
• tk1 : {t11=100, t12=1000}
• tk2 : {t21=400, t22=200}
• tk3 : {t31=2000, t32=400}
• tk4 : {t41=100, t42=1000}
• df12=10, df13=30, df34=20
[Figure: three candidate schedules of tk1 to tk4 on CPU and FPGA, with idle slots marked.]
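The interplay of tabu search and the processing time estimator can be sketched on the four-task example. Several points are assumptions, not taken from the slides: the first time of each pair is read as the CPU time and the second as the FPGA time, a data-flow cost is charged only when producer and consumer sit on different processing elements, tasks are scheduled in the fixed order tk1 to tk4, and the single "move one task" neighbourhood and the tabu tenure are illustrative choices.

```python
from collections import deque

TASKS = ["tk1", "tk2", "tk3", "tk4"]           # already in topological order
# Assumption: (CPU time, FPGA time) for each task.
TIME = {"tk1": (100, 1000), "tk2": (400, 200), "tk3": (2000, 400), "tk4": (100, 1000)}
# Assumption: transfer cost applies only across processing elements.
EDGES = {("tk1", "tk2"): 10, ("tk1", "tk3"): 30, ("tk3", "tk4"): 20}
PES = ("CPU", "FPGA")


def processing_time(mapping):
    """Estimate the overall processing time of a mapping (one PE index per task)."""
    pe_free = [0, 0]                            # time at which each PE becomes idle
    finish = {}
    for task, pe in zip(TASKS, mapping):
        ready = 0
        for (src, dst), cost in EDGES.items():
            if dst == task:
                transfer = cost if mapping[TASKS.index(src)] != pe else 0
                ready = max(ready, finish[src] + transfer)
        start = max(ready, pe_free[pe])
        finish[task] = start + TIME[task][pe]
        pe_free[pe] = finish[task]
    return max(finish.values())


def tabu_search(start, iterations=10, tenure=2):
    """Move one task to the other PE per step; recently moved tasks are tabu."""
    current, best = start, start
    tabu = deque(maxlen=tenure)
    for _ in range(iterations):
        candidates = []
        for i in range(len(TASKS)):
            if i in tabu:
                continue
            neighbour = current[:i] + (1 - current[i],) + current[i + 1:]
            candidates.append((processing_time(neighbour), i, neighbour))
        cost, moved, current = min(candidates)  # best non-tabu move, even if worse
        tabu.append(moved)
        if cost < processing_time(best):
            best = current
    return best, processing_time(best)


best, cost = tabu_search((0, 0, 0, 0))          # start with every task on the CPU
print([PES[p] for p in best], cost)
```

Under these assumptions the search moves tk3 to the FPGA and keeps the other tasks on the CPU, for an estimated time of 650 against 2600 for the all-CPU start.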
Multiple neighbourhood functions
• Increase diversification
• Search for better solutions
Parallel search
• Multi-processor systems
[Figure: solution space in which each point is a mapping and scheduling solution (e.g. tasks tk1 to tk6 on PE1 and PE2); the search goes from the initial solution through a 1st move and a 2nd move towards the optimal solution.]
Experiments (80 to 112 tasks)
• FIR filtering
• Matrix multiplication
• Hidden Markov model decoding
• BGM interest rate model
Approaches compared
• INT, TABU [Porto, 1995]
• SEP, TABU [Wiangtong, 2005]
• INT, TABU, MultNF [This work]
Multi-loop parallelisation
• Find the best unrolling factor for each loop
• Iterative approach
• Unrolling configuration: the unrolling factors of all loops
[Figure: flowchart. Input: an application with multiple loops, e.g. for (.....) { fun(...); }. Unrolling configurations are generated; each configuration goes through loop unrolling and fission, task graph generation, mapping and scheduling, and quality score calculation. The resulting configuration qualities drive configuration selection. If the termination condition is not reached (No), new configurations are generated; otherwise (Yes) the best unrolling configuration is returned.]
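The iterative loop of generation, scoring, selection and termination can be sketched as follows. Everything numeric here is hypothetical: the loop trip counts, the resource budget and especially the closed-form `quality` score, which stands in for the real flow of unrolling, task graph generation, mapping and scheduling.

```python
import math

# Hypothetical application: (trip count, time per iteration) for each loop.
LOOPS = [(64, 3), (16, 10), (8, 1)]
BUDGET = 8   # hypothetical limit on the total number of parallel loop copies


def quality(config):
    """Toy score: estimated time when loop i runs as config[i] parallel copies."""
    if sum(config) > BUDGET:
        return math.inf                       # configuration does not fit
    return sum(math.ceil(trip / u) * t for (trip, t), u in zip(LOOPS, config))


def generate(config):
    """New configurations: double the unrolling factor of one loop at a time."""
    return [config[:i] + (config[i] * 2,) + config[i + 1:] for i in range(len(config))]


def best_unrolling():
    """Iterate until no generated configuration improves on the current best."""
    best = (1,) * len(LOOPS)
    while True:
        candidate = min(generate(best), key=quality)   # configuration selection
        if quality(candidate) >= quality(best):        # termination condition
            return best, quality(best)
        best = candidate


print(best_unrolling())
```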
Loops Results
• IWR: speech recognition
• SUSAN: corner detection for image processing
• N-Body: particle modeling
2. c) High-Level Synthesis
Existing work: Behavioural, Structural
Our work: Haydn

              Behavioural               Structural
Benefits      rapid development         implement non-obvious designs
              high maintainability      more control over optimisation
Drawbacks     difficult to control      low productivity
              poor error management     poor maintainability

R1: Rapid Development
R2: Design Exploration
R3: Extensibility
R4: Manual Control
Haydn interpretation rules

    {
      delta = b*b - ((a*c) << 2);
      if (delta > 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }

[Figure: the same code under two interpretations. Behavioural interpretation: a dataflow graph over a, b and c with multiply, shift (<< 2), subtract and compare nodes that select the value of num_sol (2, 1 or 0). Structural interpretation (Handel-C): a datapath with multiplexers, annotated "executed at cycle 1" and "executed at cycle 2".]
Rapid development
Input: the behavioural description of delta and num_sol from the Haydn interpretation rules slide.
Flow: unscheduling (behavioural interpretation), then scheduling under constraints, then synthesis (structural interpretation).
Result:

    par {
      // ================== [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[1].in(a,c);
      // ================== [stage 8]
      tmp0 = pipe_mult[0].q;
      tmp1 = pipe_mult[1].q << 2;
      // ================== [stage 9]
      tmp2 = tmp0 - tmp1;
      // ================== [stage 10]
      if (tmp2 > 0) num_sol = 2;
      else if (tmp2 == 0) num_sol = 1;
      else num_sol = 0;
      delta = tmp2;
    }

[Figure: pipelined datapath. pmult[0] and pmult[1] occupy stages 1-7; tmp0 and tmp1 (<< 2) are produced at stage 8; tmp2 (subtraction) at stage 9; the comparisons select num_sol through multiplexers and delta is written at stage 10.]
Design exploration
Starting point: the two-multiplier par block (stages 1, 8, 9 and 10) from the Rapid development slide.
Alternative design with a single multiplier:

    par {
      { // ================= [stage 1]
        pipe_mult[0].in(b,b);
        pipe_mult[0].in(a,c);
      }
      ....
    }

Flow: unscheduling (behavioural interpretation), then scheduling under constraints, then synthesis (structural interpretation).
[Figure: two datapaths. Two multipliers: pmult[0] and pmult[1] in stages 1-7, followed by stages 8, 9 and 10. One multiplier: pmult[0] used in cycle 1 and cycle 2 of stages 1-4, followed by stages 5 and 6.]
Abstraction
Structural description (single multiplier):

    par {
      { // ================= [stage 1]
        pipe_mult[0].in(b,b);
        pipe_mult[0].in(a,c);
      }
      { // ================== [stage 4]
        delay;
        tmp0 = pipe_mult[0].q;
      }
      { // ================== [stage 5]
        tmp1 = pipe_mult[0].q << 2;
        tmp2 = tmp0 - tmp1;
      }
      { // ================== [stage 6]
        if (tmp2 > 0) num_sol = 2;
        else if (tmp2 == 0) num_sol = 1;
        else num_sol = 0;
        delta = tmp2;
      }
    }

Unscheduling (behavioural interpretation) recovers the dataflow graph; abstraction then yields the behavioural description:

    {
      delta = b*b - ((a*c) << 2);
      if (delta > 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }
Design quality
Input: the behavioural description of delta and num_sol from the Haydn interpretation rules slide.
Flow: unscheduling, then scheduling under constraints, giving the two-multiplier par block (stages 1, 8, 9 and 10) from the Rapid development slide.
User intervention: manual scheduling of the << 2 operation.
[Figure: dataflow graph without the << 2 node, which is placed by manual scheduling.]
Unscheduling
[Figure: unscheduling illustrated in four steps, numbered 1 to 4.]
Haydn transformations: interactive mode
Before:

    @resources.set (*; UNITS:6);
    {
      @HLS.run(II:1);
      // original code
    }

After:

    @resources.set (*; UNITS:6);
    {
      // transformed code
    }
Haydn-C: GARCH walk kernel
[Figure: Haydn-C source of the GARCH walk kernel, divided into a constraints part and a kernel specification part.]
Design exploration: batch mode
[Figure label: constraints]
• 5 multiplications:
 » 1 cycle per result => 5 multipliers
 » 2 cycles per result => 3 multipliers
 » 5 cycles per result => 1 multiplier
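The multiplier counts follow from dividing the number of multiplications by the cycles available per result and rounding up. A minimal sketch, assuming each multiplier is pipelined and can start one multiplication per cycle (the function name is hypothetical):

```python
import math


def multipliers_needed(multiplications: int, cycles_per_result: int) -> int:
    """Fewest multipliers that fit all multiplications into the initiation interval."""
    # Assumption: each multiplier can start one multiplication per cycle.
    return math.ceil(multiplications / cycles_per_result)


for cycles in (1, 2, 5):
    print(cycles, multipliers_needed(5, cycles))
```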
Evaluation: speed vs area
[Figure: speed vs area]
Initiation interval vs area
[Figure: initiation interval vs area]
3. Harmonic Toolchain: Design Flow
[Figure: design flow, summarised below]
• Input: C source files, hardware description
• Task Partitioning: e.g. task A into task A1 (FPGA), task A2 (FPGA), task A3 (DSP); task B into task B1 (GPP), task B2 (DSP), ...
• Task Transformation Engine: applies application and domain specific transformations to produce implementations
 » inputs: CML transforms, ROSE C++ transforms, input task parameters
 » generic transform libraries: GPP transforms, DSP transforms, FPGA transforms
 » CML description: pattern to match, matching conditions, result pattern
• Mapping Selection: may request a new partition
• C code (specific to each PE) goes to the GPP compiler, the DSP compiler and Haydn (HLS)
• Haydn produces Handel-C (cycle-accurate description) for FPGA synthesis
• Output: binaries, bitstream, runtime support
Tools and Annotations

Haydn (High-Level Synthesis)
Annotated source:

    {
      #pragma haydn pipeline II(1)
      s = SQRT(a);
      y = (s + b) * (c + d);
    }

Resulting par block:

    par {
      sqrt_v4.in(a);
      adder_v4[0].in(sqrt_v4.res, b);
      adder_v4[1].in(c, d);
      mult_v4.in(adder_v4[0].res, adder_v4[1].res);
      y = mult_v4.res;
    }

Task Partitioning

    #pragma map cluster
    void d0d2Sci2CMixRealChTmpd2(...) {
      ...
      ssOpStarsa1(a,x,t1);
      ...
      ssOpStarsa2(b,y,t2);
      ...
      ssOpPlusaa1(t1,t2,z);
    }

Mapping Selection (source files annotated with tasks/implementations)

    void foo(...) {
      ...
      #pragma omp parallel sections num_threads(2)
      {
        #pragma omp section
        {
          #pragma map call_hw \
            impl(MAGIC, 14) \
            param(x,1000,r) \
            param(h,100, rw)
          filter(x, h);
        }
        #pragma omp section
        {
          #pragma map call_hw \
            impl(ARM, 15) \
            param(y,2000,r) \
            param(i,50, rw)
          filter2(y, i);
        }
      }
    }
ROSE source infrastructure
• Software analysis and optimization for scientific applications
• Tool for building source-to-source translators
• Support for C, C++, Fortran, binary
• Loop optimizations
• Lab and academic use
• Software engineering
• Performance analysis
• Domain-specific analysis and optimizations
• Development of new optimization approaches
http://rosecompiler.org
4. Challenges
Theoretical:
• define and meet global constraints (application/platform)
• correctness: verify transformation results
• effective combination of static and dynamic analysis
Practical:
• reuse legacy code
• incremental approach for using the toolchain
• create a modular toolchain that can evolve with new applications and platforms
5. Summary
1. hArtes Project
 • complete toolchain targeting heterogeneous systems
2. Research
 • Task Transformations: CML language for describing transformations
 • Mapping Selection: integrated approach with multiple neighbourhood functions
 • High-Level Synthesis (Haydn): combined behavioural and structural approach
3. Harmonic toolchain
 • modular: enable customisation and technology evolution