Parallel Programming and Timing Analysis on Embedded Multicores
Posted on 23-Mar-2016
Parallel Programming and Timing Analysis on Embedded Multicores
Eugene Yip, The University of Auckland
Supervisors: Dr. Partha Roop, Dr. Morteza Biglari-Abhari
Advisor: Dr. Alain Girault
Outline
• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions
Introduction
• Safety-critical systems:
– Perform specific tasks.
– Must behave correctly at all times.
– Must comply with strict safety standards [IEC 61508, DO-178].
– Time-predictability is useful in real-time designs.
[Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.
Introduction
• Safety-critical systems:
– Shifting from single-core to multicore processors.
– Better power and execution performance.

[Diagram: Core0 … Coren connected by a system bus to shared resources.]

[Blake et al 2009] A Survey of Multicore Processors.
[Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.
Introduction
• Parallel programming:
– Has moved from supercomputers to mainstream computers.
– Threaded programming model.
– Frameworks designed for systems without resource constraints or safety concerns.
– Aimed at improving average-case performance (flops), not time-predictability.
Introduction
• Parallel programming:
– The programmer is responsible for shared resources.
– Concurrency errors:
• Deadlock
• Race condition
• Atomicity violation
• Order violation
– Non-deterministic thread interleaving.
– Determinism is essential for understanding and debugging.
[McDowell et al 1989] Debugging Concurrent Programs.
Introduction
• Synchronous languages:
– Deterministic concurrency.
– Based on the synchrony hypothesis.
– Threads execute in lock-step to a global clock.
– Concurrency is logical and typically compiled away.
[Benveniste et al 2003] The Synchronous Languages 12 Years Later.
[Diagram: inputs are sampled and outputs emitted at each global tick (ticks 1–4).]
Introduction
• Synchronous languages
[Diagram: physical time across global ticks 1–4. The reaction time is the time the program takes to compute each tick; the time between ticks is defined by the timing requirements of the system. Must validate: max(reaction time) < min(time between ticks).]
[Benveniste et al 2003] The Synchronous Languages 12 Years Later.
Introduction
• Synchronous languages:
– Esterel
– Lustre
– Signal
– Synchronous extensions to C:
• PRET-C
• Reactive C with shared variables
• Synchronous C (SC – see Michael’s talk)
• Esterel C Language
[Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[Boussinot 1993] Reactive Shared Variables Based Systems.
[Hanxleden et al 2009] SyncCharts in C – A Proposal for Light-Weight, Deterministic Concurrency.
[Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
– Concurrent threads are scheduled sequentially in a cooperative manner.
– Atomic execution of threads ensures thread-safe access to shared variables.
– Writes to shared variables are delayed to the end of the global tick.
– At the end of the global tick, the writes are combined and assigned to the shared variable.
– The “combine function” is associative and commutative.
Outline
• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions
ForeC language (“Foresee”)
• Deterministic parallel programming of embedded multicores.
• C with a minimal set of synchronous constructs for deterministic parallelism.
• Fork/join parallelism (explicit).
• Shared memory model.
• Deterministic thread communication using shared variables.
ForeC language
• Constructs:
– par(t1, …, tn)
• Forks threads t1 to tn to execute in parallel, in any order.
• The parent thread is suspended until all child threads terminate.
– thread t1(...) {b}
• Thread definition.
– pause
• Synchronisation barrier.
• When a thread pauses, it completes a local tick.
• When all threads pause, the program completes a global tick.
ForeC language
• Constructs:
– abort {b} when (c)
• Preempts the body b when the condition c is true. The condition is checked before executing the body.
– weak abort {b} when (c)
• Preempts the body b when the body reaches a pause and the condition c is true. The condition is checked before executing the body.
ForeC language
• Variable type qualifiers:
– input
• The variable gets its value from the environment.
– output
• The variable emits its value to the environment.
ForeC language
• Variable type qualifiers:
– shared
• A variable that may be accessed by multiple threads.
• At the start of its local tick, a thread creates local copies of the shared variables that it accesses.
• During its local tick, the thread modifies its local copies (isolation).
• At the end of the global tick, copies that have been modified are combined using a commutative and associative function (the combine function).
• The combined result is committed back to the original shared variable.
ForeC language

shared int x = 0;

void main(void) {
  x = 1;
  par(t0(), t1());
  x = x - 1;
}

thread t0(void) {
  x = 10;
  x = x + 1;
  pause;
  x = x + 1;
}

thread t1(void) {
  x = x * 2;
  pause;
  x = x * 2;
}

[Diagram: Concurrent Control-Flow Graph (CCFG) of the program. Node types: Graph Start, Fork, Join, Computation, Condition, Pause, Abort, Graph End. main computes x = 1, forks t0 and t1, and computes x = x - 1 after the join.]
ForeC language
• Sequential control-flow along a single path.
• Parallel control-flow along the branches from a fork node.
• The global tick ends when all threads pause or terminate.
ForeC language
State of the shared variables, step by step (global tick 1):
• Initially, global x = 0.
• Thread main creates a local copy of x (main’s copy = 0).
• main executes x = 1 (main’s copy = 1).
• Fork: threads t0 and t1 take over main’s copy of the shared variable x (t0’s copy = 1, t1’s copy = 1).
• t0 executes x = 10, then x = x + 1 (t0’s copy = 11).
• t1 executes x = x * 2 (t1’s copy = 2).
• The global tick is reached. The copies of x are combined using a (programmer-defined) associative and commutative function. Assume the combine function for x implements summation.
ForeC language
• The combined value (11 + 2 = 13) is assigned back to x: global x = 13.
Next global tick:
• The active threads create fresh copies of x (t0’s copy = 13, t1’s copy = 13).
• t0 executes x = x + 1 (t0’s copy = 14).
• t1 executes x = x * 2 (t1’s copy = 26).
• Threads t0 and t1 terminate and join back to the parent thread main. Their local copies of x are combined into a single copy (14 + 26 = 40) and given back to main.
ForeC language
• main resumes with its copy = 40 and executes x = x - 1 (main’s copy = 39).
• At the end of the global tick, the copy is committed: global x = 39.
ForeC language
• Shared variables:
– Threads modify local copies of shared variables.
– This isolates each thread’s execution behaviour.
– The order/interleaving of thread execution has no impact on the final result.
– Prevents concurrency errors.
– Combine functions are associative and commutative, so the order of combining doesn’t matter.
Scheduling
• Light-weight static scheduling:
– Takes advantage of multicore performance while delivering time-predictability.
– Thread allocation and the scheduling order on each core are decided at compile time by the programmer.
– Cooperative (non-preemptive) scheduling.
– The fork/join semantics and the notion of a global tick are preserved via synchronisation.
Scheduling
• One core performs housekeeping tasks at the end of the global tick:
– Combining shared variables.
– Emitting outputs.
– Sampling inputs and starting the next global tick.
Outline
• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions
Timing analysis
• Compute the program’s worst-case reaction time (WCRT).

[Diagram: physical time across global ticks 1–4. The reaction time is the time the program takes to compute each tick; the time between ticks is defined by the timing requirements of the system. Must validate: max(reaction time) < min(time between ticks).]
Timing analysis
Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
• Max-Plus
• Model checking
Timing analysis
Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
– The execution time of the program is described as a set of integer equations.
– Solving ILP is known to be NP-hard.
• Max-Plus
• Model checking
[Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.
Timing analysis
Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
• Max-Plus
– Compute the WCRT of each thread.
– From the thread WCRTs, compute the WCRT of the program.
– Assumes there is a global tick in which all threads execute their worst case.
• Model checking
Timing analysis
Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
• Max-Plus
• Model checking
– Eliminates false paths by explicit path exploration (reachability over the program’s CFG).
– Binary search: check that the WCRT is less than “x”.
– State-space explosion problem.
– Trades off analysis time for precision.
– Provides an execution trace for the WCRT.
Timing analysis
• Our approach, using reachability:
– Same benefits as model checking, but a binary search for the WCRT is not required.
– To handle state-space explosion:
• Reduce the program’s CCFG before analysis.

[Flow: Program binary (annotated) → reconstruct the program’s CCFG → find the global ticks (reachability) → WCRT.]
Timing analysis
• Programs will execute on the following multicore:

[Diagram: Core0 … Coren, each with its own data and instruction memories, connected by a TDMA shared bus to global memory.]
Timing analysis
• Computing the execution time:
1. Overlapping of thread execution time from parallelism and inter-core synchronisations.
2. Scheduling overheads.
3. Variable delay in accessing the shared bus.
Timing analysis
1. Overlapping of thread execution time from parallelism and inter-core synchronisations.
• An integer counter tracks each core’s execution time.
• Synchronisation occurs when forking/joining and when ending the global tick.
• Advance the execution time of the participating cores.

[Diagram: main and t1 on Core 1, t2 on Core 2; the two cores’ timelines overlap and synchronise at the fork, the join, and the end of the global tick.]
Timing analysis
2. Scheduling overheads.
– Synchronisation: fork/join and global tick.
• Via global memory.
– Thread context-switching.
• Copying of shared variables at the start and end of a thread’s local tick, via global memory.

[Diagram: Core 1 and Core 2 timelines for main, t1 and t2, showing synchronisation and thread context-switch overheads up to the global tick.]
Timing analysis
2. Scheduling overheads.
– The required scheduling routines are statically known.
– Analyse the scheduling control-flow.
– Compute the execution time of each scheduling overhead.

[Diagram: Core 1/Core 2 schedules for main, t1 and t2.]
Timing analysis
3. Variable delay in accessing the shared bus.
– Global memory is accessed by the scheduling routines.
– The TDMA bus delay has to be considered.

[Diagram: TDMA bus schedule alternating Core 1 and Core 2 slots (1 2 1 2 …); an access from a core must wait for that core’s slot.]
Timing analysis
• CCFG optimisations:
– merge: reduces the number of CFG nodes that need to be traversed for each local tick.
– merge-b: reduces the number of alternate paths between CFG nodes.
Timing analysis
• CCFG optimisations:
– merge: reduces the number of CFG nodes that need to be traversed for each local tick.

[Diagram: two alternate chains of computation nodes with costs 1 + 3 and 1 + 4 + 1; merge folds each chain into a single node, of cost 4 and cost 6 respectively.]
Timing analysis
• CCFG optimisations:
– merge-b: reduces the number of possible paths between CFG nodes.
• Reduces the number of reachable global ticks.

[Diagram: merge-b collapses the two merged branches (costs 4 and 6) into a single node of cost 6.]
Outline
• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions
Results
• For the proposed reachability-based timing analysis, we demonstrate:
– the precision of the computed WCRT.
– the efficiency of the analysis, in terms of analysis time.
Results
• Timing analysis tool:

[Flow: Program binary (annotated) → program CCFG (optimisations) → explicit path exploration (reachability) or implicit path exploration (Max-Plus), taking the three factors into account → WCRT.]
Results
• Multicore simulator (Xilinx MicroBlaze):
– Based on http://www.jwhitham.org/c/smmu.html and extended to be cycle-accurate and to support multiple cores and a TDMA bus.

[Diagram: Core0 … Coren, each with 16KB data and 16KB instruction memories (1-cycle access), connected by a TDMA shared bus (5 cycles/core; bus schedule round = 5 × no. of cores) to 32KB of global memory (5-cycle access).]
Results
• Benchmark programs: a mix of control/data computations, thread structures and computation loads.

* [Pop et al 2011] A Stream-Computing Extension to OpenMP.
# [Nemer et al 2006] A Free Real-Time Benchmark.

[Table: benchmark programs; * and # mark their sources.]
Results
• Each benchmark program was distributed over a varying number of cores:
– up to the maximum number of parallel threads.
• Observed WCRT:
– test vectors elicit the different execution paths.
• Computed WCRT:
– Reachability
– Max-Plus
802.11a Results
• WCRT decreases until 5 cores.
• Global memory becomes increasingly expensive.
• Scheduling overheads.

[Chart: WCRT (clock cycles, 0–200,000) vs number of cores (1–10) for Observed, Reachability and Max-Plus.]
802.11a Results
Reachability:
• ~2% over-estimation.
• The benefit of explicit path exploration.

[Chart: WCRT (clock cycles) vs cores for Observed, Reachability and Max-Plus.]
802.11a Results
Max-Plus:
• Loss of execution context: uses only the thread WCRTs.
• Assumes one global tick in which all threads execute their worst case.
• Uses the maximum execution time of the scheduling routines.

[Chart: WCRT (clock cycles) vs cores for Observed, Reachability and Max-Plus.]
802.11a Results
Both approaches:
• The estimation of the synchronisation cost is conservative: it assumes that the receiver only starts after the last sender.

[Chart: WCRT (clock cycles) vs cores for Observed, Reachability and Max-Plus.]
802.11a Results

[Chart: analysis time (seconds, 0–2,500) vs cores (1–10) for Reachability. Max-Plus takes less than 2 seconds.]
802.11a Results
merge:
• Reduction of ~9.34× in analysis time.
merge-b:
• Reduction of ~342×.
• Less than 7 seconds.

[Chart: analysis time (seconds) vs cores for Reachability, Reachability (merge) and Reachability (merge-b).]
802.11a Results
• A reduction in states gives a reduction in analysis time.

[Chart: number of global ticks explored vs cores.]
Results
Reachability:
• ~1 to 8% over-estimation.
• The loss in precision comes mainly from over-estimating the synchronisation costs.

[Charts: WCRT (clock cycles) vs cores for FmRadio, Fly by Wire, Life and Matrix (Observed, Reachability, Max-Plus).]
Results
Max-Plus:
• Over-estimation is very dependent on program structure.
• FmRadio and Life are very imprecise: loops iterate over par statement(s) multiple times, so the over-estimations are multiplied.
• Matrix is quite precise: it executes in one global tick, so the thread-WCRT assumption is valid.

[Charts: WCRT (clock cycles) vs cores for FmRadio, Fly by Wire, Life and Matrix (Observed, Reachability, Max-Plus).]
Results
• Timing trace of the WCRT:
– For each core: thread start/end times, context-switching, fork/join, ...
– Can be used to tune the thread distribution.
– Was used to find good thread distributions for each benchmark program.
Outline
• Introduction
• ForeC language
• Timing analysis
• Results
• Conclusions
Conclusions
• ForeC: a language for deterministic parallel programming.
• Based on the synchronous framework.
• Achieves WCRT speedup while providing time-predictability.
• Very precise, fast and scalable timing analysis for multicore programs using reachability.
Future work
• Complete the formal semantics of ForeC.
• Prune additional infeasible paths using value analysis.
• WCRT-guided, automatic thread distribution.
• Include the cache hierarchy in the analysis.
Questions?
Introduction
• Existing parallel programming solutions:
– Shared memory model:
• OpenMP, Pthreads
• Intel Cilk Plus, Thread Building Blocks
• Unified Parallel C, ParC, X10
– Message passing model:
• MPI, SHIM
– They provide ways to manage shared resources but do not prevent concurrency errors.
[OpenMP] http://openmp.org
[Pthreads] https://computing.llnl.gov/tutorials/pthreads/
[X10] http://x10-lang.org/
[Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus
[Intel Thread Building Blocks] http://threadingbuildingblocks.org/
[Unified Parallel C] http://upc.lbl.gov/
[Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing.
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[SHIM] SHIM: A Language for Hardware/Software Integration.
Introduction
• Deterministic runtime support:
– Pthreads:
• dOS, Grace, Kendo, CoreDet, Dthreads
– OpenMP:
• Deterministic OMP
– Concept of logical time.
– Each logical time step is broken into an execution phase and a communication phase.
[Bergan et al 2010] Deterministic Process Groups in dOS.
[Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software.
[Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution.
[Liu et al 2011] Dthreads: Efficient Deterministic Multithreading.
[Aviram 2012] Deterministic OpenMP.
ForeC language
• The behaviour of shared variables is similar to:
– Intel Cilk Plus (reducers)
– Unified Parallel C (collectives)
– DOMP (workspace consistency)
– Grace (copy-on-write)
– Dthreads (copy-on-write)
ForeC language
• Parallel programming patterns:
– Specifying an appropriate combine function.
– A sacrifice for deterministic parallel programs.
– Map-reduce
– Scatter-gather
– Software pipelining
– Delayed broadcast or point-to-point communication.