TRANSCRIPT
Trace-driven simulation of multithreaded applications in gem5
ARM gem5 Research Summit 2017 Workshop
Gilles SASSATELLI [email protected]
Alejandro NOCUA, Florent BRUGUIER, Anastasiia BUTKO
Cambridge, UK – September 11, 2017
◼ Mont-Blanc 1, 2 & 3 projects (FP7, H2020)
● Getting ARM technology ready for HPC: HW, SW & Apps
● Advances in energy efficiency towards Exascale
◼ Initial effort: using gem5 for performance prediction (2011)
● STE Nova A9500 SoC (dual-core Cortex-A9)
● Fed publicly available parameters into a gem5 FS model
● 1.5% - 18% error, due to rough DRAM model and interconnect
Background motivations
Experimental Setup
◼ Simulator - gem5: event-driven, full system (FS)
● multiple CPU models: simple, in-order, out-of-order, etc.
● different memory systems: bus-based, Ruby NoC, caches, etc.
◼ Real hardware - Snowball Sky board
● STE Nova A9500 SoC (ARM dual-core)
◼ Benchmarks: SPLASH-2, ALPBench, Stream
● Pthread programming
Comparison & Analysis
◼ Results
● Execution time error: between 1.4% and 18%
◼ Sources of error
● Specification
● Memory & interconnect
Butko, A., Garibotti, R., Ost, L., and Sassatelli, G., "Accuracy Evaluation of GEM5 Simulator System," in 2012 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), York, UK, July 2012, pp. 1-7.
Parameters
◼ Specification information:
● CPU - number of cores, core types, etc.
● Memory - caches, coherency, main memory, etc.
● Interconnect - bus parameters, …
◼ Component implementation:
● Execution behavior
● Communication protocols
[Diagram: two cores (Core0, Core1) with private L1 I/D caches, a shared L2 and main memory]
◼ Calibration against real hardware
● Using gem5 for performance prediction
● And McPAT for power estimation
◼ Then on to exploring fancier architecture configurations
● Heterogeneous single-ISA multicores à la big.LITTLE
● Asymmetric, 3 levels of heterogeneity, etc.
Background motivations cont’d
[Diagram: example configurations - asymmetric, big.LITTLE and big.medium.LITTLE clusters of CPUs, each cluster with its own L2, sharing RAM]
Accuracy assessments
◼ Exynos 5 Octa SoC model in gem5/McPAT
● 8 heterogeneous cores
● Independent clock per cluster
● 2 L2 caches, crossbar bus
◼ Results
● Execution time error ~20%
● Power error ~12%
● EtoS ~20%
[Diagram: 'LITTLE' cluster (4× A7 + L2) and 'big' cluster (4× A15 + L2) connected through an interconnect to memory]
Butko, A., Gamatié, A., Sassatelli, G., Torres, L., and Robert, M., "Design Exploration for Next Generation High-Performance Manycore On-chip Systems: Application to big.LITTLE Architectures," in 2015 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2015.
[Figure residue, Fig. 4: performance and energy trade-offs - panels (a) bfs, (b) backprop, (c) heartwall, (d) hotspot, (e) kmeans, (f) lud, (g) nw, (h) nn, (i) srad v1, (j) CPI and execution time comparison]
◼ Not ready for manycores - too slow!
● 1K-1M (simulated) instructions per second
● Scales poorly with system size
● Already much better than RTL, though
◼ Trading off speed vs. accuracy? Any sweet spot?
Background motivations cont’d
Towards energy-efficiency
◼ Mobile/embedded technology
● power efficiency, low cost… performance?
● Samsung Exynos 5 Octa: 9 GFLOPS
◼ Supercomputer node
● Intel Xeon E5: 300 GFLOPS
◼ Idea: Multicore → Manycore
● Design space exploration through simulation
State-of-the-art
[Chart: accuracy (low → high) vs. speed (low → high) of simulation approaches - RTL and cycle-accurate sit at high accuracy / low speed; cycle-approximate, JIT, distributed and trace-driven trade accuracy for speed; the ideal point is high accuracy and high speed]
State-of-the-art
[Chart, annotated: on the same accuracy-vs-speed plane, trace-driven and cycle-approximate approaches sit closest to the ideal corner; a 1-core FS simulation reaches ~10s of MIPS]
Simulation speed
◼ Event-driven simulation (cycle-approximate)
● Abstracting events into traces
[Diagram: CPU/cache/memory event timelines; the CPU's activity is abstracted into a sequence of Trace records over time]
◼ ~70% of simulation effort goes into the CPU
● Abstracting away CPU cores sounds like a good idea
● Between 2 consecutive L1 cache misses, (in-order) CPU cores perform "consistently"
Trace-driven simulation principle
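The principle above can be sketched in a few lines: the CPU is collapsed into a trace of (compute cycles until the next L1 miss, miss address) pairs, and replay alternates constant compute phases with memory accesses. This is an illustrative model, not gem5 code; `TraceEntry` and `replayLength` are made-up names.

```cpp
#include <vector>

// Illustrative sketch of the trace-driven principle: an in-order CPU is
// assumed to behave consistently between two consecutive L1 misses, so it
// can be replaced by a trace of (compute phase, miss address) pairs.
struct TraceEntry {
    unsigned computeCycles;  // cycles of CPU work until the next L1 miss
    unsigned long addr;      // address of the missing access
};

// Total replayed cycles under a given (here: fixed) miss service latency.
unsigned long replayLength(const std::vector<TraceEntry>& trace,
                           unsigned missLatency) {
    unsigned long cycles = 0;
    for (const TraceEntry& e : trace) {
        cycles += e.computeCycles;  // compute phase: constant by assumption
        cycles += missLatency;      // memory system serves the miss
    }
    return cycles;
}
```

Changing `missLatency` models a different memory subsystem without re-simulating the CPU, which is the whole point of the approach.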
Simulation speed
◼ Event-driven simulation (cycle-approximate)
[Diagram: FS simulation vs. trace-driven timelines - the compute phases Tc1, Tc1, Tc2 between misses are preserved while the memory events are replayed]
◼ SimMATE: 2-stage process
● Trace collection: tracing only L1-miss-related transactions
● Trace replay: using trace injectors that initiate transactions as previously recorded
SimMATE
Trace-Driven Approach
◼ Main concept: trace collection, then trace simulation
[Diagram: (1) trace collection - cores 0…n with private caches see hits & misses, and misses are recorded into memory traces; (2) trace simulation - trace injectors TI 0 … TI n replay the misses through the interconnect, L2 and memory]
◼ Trace collection = freezing
● CPU parameters alongside application SW
● Private cache sizes, speeds, etc.
◼ Trace replay allows exploring the rest
● L2 size, policy, etc.
● Interconnect type & speed
● Main memory speed
SimMATE for faster DSE
Trace-Driven Approach
◼ Flexibility
[Diagram: trace collection is performed 1 time; trace simulation can then be run multiple times, with the injectors TI 0 … TI n replaying misses against different interconnect, L2 and memory configurations]
◼ Trace replay performs event (re-)scheduling
● Simple "time shifting" approach
● Maintaining constant compute phases
SimMATE for faster DSE cont'd
Trace-Driven Approach
◼ Trace Injector: event scheduling
[Diagram sequence: the injector (TI) issues recorded request TR1 to the memory & interconnect; if the acknowledgement arrives at TA1' instead of the recorded TA1, the resulting delay shifts all subsequent events, so the next request fires at TR2' - computation phases keep their recorded duration while communication phases stretch or shrink with the new configuration]
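The time-shifting rule in the diagrams (request TR1, recorded acknowledge TA1 vs. replayed TA1', delay pushing TR2 to TR2') can be written down as follows. `Rec` and `timeShift` are illustrative names; the only assumed semantics are the ones stated above: compute phases keep their recorded duration, communication phases take the new latency.

```cpp
#include <vector>

// One recorded transaction: request time and acknowledge time.
struct Rec { double req, ack; };

// Replay the trace under a new memory latency: the compute phase between an
// acknowledge and the next request keeps its recorded duration, while each
// communication phase takes the new latency; every later event shifts.
std::vector<Rec> timeShift(const std::vector<Rec>& trace, double newLat) {
    std::vector<Rec> out;
    double prevAck = 0.0;     // acknowledge time in the replay
    double prevAckRec = 0.0;  // acknowledge time in the recording
    for (const Rec& r : trace) {
        double compute = r.req - prevAckRec;  // recorded compute phase
        Rec s{prevAck + compute, 0.0};        // shifted request (TRi')
        s.ack = s.req + newLat;               // shifted acknowledge (TAi')
        out.push_back(s);
        prevAck = s.ack;
        prevAckRec = r.ack;
    }
    return out;
}
```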
◼ Tuning DRAM latency
● Collection performed with 30 ns
● TD simulation from 5 ns to 55 ns
● FS used as reference
SimMATE Benchmarking
Validation
◼ External memory latency
[Diagram: traces collected at 30 ns are replayed at 5, 15, 45 and 55 ns; each TD simulation is compared against the matching FS simulation to obtain an error percentage]
Validation
◼ External memory latency
● Collection - 30 ns; simulation - 5 ns, 15 ns, 30 ns, 45 ns, 55 ns
◼ Execution time error [%] (30 ns is the original collection point):

          5 ns   15 ns   30 ns   45 ns   55 ns
  MJPEG   0.6%   0.5%    0.1%    0.4%    0.4%
  FFT     2.8%   4.5%    5.8%    5.1%    6%
◼ Tuning L2 size
● Collection performed without L2
● TD simulation from 0 to 16 MB L2
● Errors originate from cold-start bias / cache warm-up
SimMATE Benchmarking
Validation
◼ L2 cache
● Short-duration application → cold-start bias → cache warm-up matters
[Chart: execution time error [%] vs. L2 cache size (original, 256 kB, 1 MB, 8 MB, 16 MB) for traces covering ET, ET+init and ET+OS boot; reported errors range from 0.1% to 18%]
◼ Having these traces collected makes it easy to:
● Perform "trace replication", i.e. emulate more CPU cores for scalability studies
● This corresponds to weak-scaling experiments, i.e. the per-core workload remains the same
Multithreaded applications
Manycore simulation
◼ Trace Replication
[Diagram: traces collected from cores 0…n (threads 0…n, with their cache-miss counts) are replicated onto a larger set of trace injectors TI 0 … TI m, all sharing the interconnect and memory]
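A minimal sketch of the replication step: with no address offsetting, M injectors simply replay copies of the N collected per-core traces. The round-robin assignment and the names are illustrative.

```cpp
#include <vector>

// Weak-scaling "trace replication" sketch: injector i replays a copy of
// collected trace (i mod N). Purely illustrative; real SimMATE wires trace
// injector objects to trace files.
std::vector<int> assignTraces(int nTraces, int mInjectors) {
    std::vector<int> a(mInjectors);
    for (int i = 0; i < mInjectors; ++i)
        a[i] = i % nTraces;  // replicate the N traces across M injectors
    return a;
}
```

Because every injector carries the same per-core workload, total work grows with M, which is exactly the weak-scaling setup described above.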
◼ Yet synchronizations must be accounted for!
● Using whatever API: POSIX threads, OpenMP 3.0, …
● Approach: embed synchronizations into traces
● Have an arbiter that takes care of locking (when a barrier is reached) and unlocking TIs
Multithreaded applications
Trace Arbiter
Manycore simulation
◼ Trace Synchronization
[Diagram: during collection, threads 1 and 2 plus the OS produce both memory traces and synchronization traces (barrier, join); during simulation, traces 1 and 2 are replayed under the arbiter against the interconnect and memory]
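The arbiter's locking/unlocking role can be sketched as follows, assuming an event-driven design in which each injector stalled at a barrier leaves a wake-up callback; the last arriving injector releases everyone. `TraceArbiter` and its members are hypothetical names, not the actual ESM classes.

```cpp
#include <functional>
#include <vector>

// Sketch (assumed design) of the trace arbiter's barrier handling: trace
// injectors that reach a barrier are locked (their resume callback stored);
// when the expected number has arrived, all of them are unlocked.
struct TraceArbiter {
    int expected;                                // number of participating TIs
    std::vector<std::function<void()>> stalled;  // resume callbacks of locked TIs
    explicit TraceArbiter(int n) : expected(n) {}

    void reachBarrier(std::function<void()> resume) {
        stalled.push_back(std::move(resume));
        if (static_cast<int>(stalled.size()) == expected) {
            for (auto& r : stalled) r();  // last one in: unlock everybody
            stalled.clear();              // ready for the next barrier
        }
    }
};
```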
◼ And most ARM application processors are OoO (out-of-order)
● Meaning multiple outstanding memory transactions
● The assumption of constant time between 2 misses does not hold
◼ big.LITTLE & other heterogeneous friends everywhere
● And there, microarchitecture details cannot be overlooked
Limitation: in-order only!
Simulation speed
◼ Event-driven simulation (cycle-approximate)
[Diagram: CPU/cache/memory timeline with compute phases Tc1, Tc1, Tc2]
◼ Modeling microarchitecture timing & dependencies
● Tracing with the O3 model + probes, without L2 cache
● Replay done in a smart "elastic" fashion
◼ Smart TraceCPU
● Updating a dependency graph, pushing ready instructions into a queue for issue
Elastic Traces: trace-driven simulation for OoO
http://gem5.org/TraceCPU
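The "dependency graph + ready queue" mechanism amounts to a dependency-driven traversal: an instruction whose pending-dependency count drops to zero becomes ready for issue. A simplified sketch (not the actual TraceCPU code; `Node` and `issueOrder` are illustrative):

```cpp
#include <queue>
#include <vector>

// Each traced instruction waits on a number of dependencies (register or
// memory order); successors are the instructions that depend on it.
struct Node {
    std::vector<int> succs;  // dependent instructions
    int pending = 0;         // unmet dependencies
};

// Issue instructions as their dependencies resolve, FIFO among ready ones.
std::vector<int> issueOrder(std::vector<Node>& g) {
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(g.size()); ++i)
        if (g[i].pending == 0) ready.push(i);  // initially independent
    std::vector<int> order;
    while (!ready.empty()) {
        int n = ready.front(); ready.pop();
        order.push_back(n);                    // "issue" instruction n
        for (int s : g[n].succs)
            if (--g[s].pending == 0) ready.push(s);  // now ready
    }
    return order;
}
```

This is what lets replay stay "elastic": timing changes reorder issue naturally instead of replaying fixed timestamps.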
(a) Trace replay: the traces collected on N cores are used to replay M cores
(b) Replay overview including the modified TraceCPU and arbiter
Figure 4: Trace replay approach
collection. Three types of events are considered during parallel sections: instructions executed, dependencies (load/store), and synchronization events. Instruction and data dependency traces are captured thanks to an augmented TraceCPU model. The following information is captured into the synchronization trace for each CPU and each event:
• Tick: the tick count of a CPU on entering the parallel region.
• Program Counter: the program counter at the beginning of a parallel region; it will be used during the replay phase for identifying parallel sections.
• Thread ID: the thread ID assigned by the scheduler.
• Event Type: an enumerated type that encodes events corresponding to parallel for, critical and barrier.
• Number of instructions: the number of instructions executed by a CPU during the recorded event.
• Number of data accesses: the number of data accesses performed during the recorded event.
It has to be noted that for each thread under analysis the Tick and PC information will be the same, since all threads are created at the same time; this means the information in the synchronization traces is the same across threads. In the case of the dependency trace, the load and store information is only collected between the CPU and the L1 caches. At the end of the trace collection phase, three Google Protobuf files are obtained per simulated CPU core with the required data for the replay phase:
• Instruction Executed Trace File.
• Dependency Trace File (LOAD/STORE).
• Synchronization Event Trace File.
4) Trace replay: As illustrated in Figure 4(a), collected traces can be replayed on target architecture configurations in different ways. Letting N be the number of cores used in trace collection and M in trace replay, two main purposes are considered:
• Parameter exploration: we perform an architectural exploration in which we replay the exact number of simulated cores, i.e., N = M. The objective is to analyze the influence of architectural parameters such as cache sizes, interconnect bandwidth or memory latency.
• Replication: we perform a scalability analysis by targeting a higher core count than that of the initial system from which the traces are captured, i.e., M > N. This is achieved by simulating more trace injectors. Note that the replication mechanism allows us to perform weak-scaling analysis, as the problem size is increased by the ratio M/N. In addition, the current implementation has no address-offsetting mechanism, which means that most of the resources are shared among cores.
5) Implementation: Figure 4(b) shows the interplay of the principal objects involved during the replay phase in ElasticSimMATE. A number of TraceCPU objects operate and check whether LOAD/STORE dependencies are met on the basis of the traces they access, as per the Elastic Traces base model. These further read the synchronization trace and keep track of the parallel regions. The actual behaviour when entering a parallel region is as follows:
• Init: whenever such a region is detected on a TraceCPU, a notification is sent to the arbiter so as to
◼ SimMATE + Elastic Traces = ElasticSimMATE
● Enabling both OoO + multithreaded applications
● Key: embed synchronization information at tracing time
ElasticSimMATE (ESM)
[Diagram: elastic traces plus barriers etc. feed a global stop/go arbiter]
◼ Proper tracing of synchronizations
● API-dependent: OpenMP 3.x
● Tracing whenever entering or leaving a parallel region, barrier, etc.
ElasticSimMATE (ESM)

#pragma omp parallel
for (i = 0; i < n; i++) {
    /* do some work */
}

/* as executed by the OMP runtime + scheduler (SW/HW), per thread: */
for (i = 0; i < n; i++) {
    OMP_runtime_call();
    /* do some work */
}
Sample synchronization trace record:
  Tick    PC      th_id  type  IC       DC
  389178  177216  0      1     1081157  165738
  …
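The record fields listed above can be mirrored in a plain struct; below is a hypothetical parser for a whitespace-separated form of the record (the real ESM traces are Protobuf messages, so this is only for illustration).

```cpp
#include <sstream>
#include <string>

// Fields of one synchronization-trace event, named after the slide's header
// row (Tick, PC, th_id, type, IC, DC). Names are assumptions for this sketch.
struct SyncEvent {
    unsigned long tick;          // tick count on entering the parallel region
    unsigned long pc;            // program counter of the region
    int threadId;                // thread ID assigned by the scheduler
    int type;                    // parallel-for / critical / barrier code
    unsigned long instrCount;    // instructions executed during the event
    unsigned long dataAccesses;  // data accesses during the event
};

SyncEvent parseSyncEvent(const std::string& line) {
    SyncEvent e{};
    std::istringstream in(line);
    in >> e.tick >> e.pc >> e.threadId >> e.type
       >> e.instrCount >> e.dataAccesses;
    return e;
}
```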
◼ ESM flow wrap-up
● Using the BSC Mercurium compiler / Nanos++ runtime
● Tweaked runtime so that custom m5 pseudo-instructions produce trace records
ESM wrap-up

Unmodified sources:
#pragma omp parallel
for (i = 0; i < n; i++) { /* do something */ }

As lowered onto the Nanos++ runtime (thread creation & such calls m5_trace(TYPE, th_id)):
{
    nanos_create_team();
    for (i = 0; i < n; i++) {
        /* do something */
    }
    nanos_end_team();
}

[Diagram: unmodified sources are compiled into app binaries, run through gem5 trace collection on a reference architecture to generate instruction, dependency and synchronization traces, which gem5 trace replay then uses on target architectures]
◼ Two main use cases:
● Fast parameter exploration
● Scalability study: "trace replication"
◼ Speedup & accuracy?
● Experiments on Rodinia application kernels
Benchmarking - 1 core, HOTSPOT
[Charts: execution time [s] (FS, ET, ESM) and error percentage [%] (FS vs ET, FS vs ESM) across L2 cache sizes 0 MB, 16 kB, 128 kB, 512 kB, 1 MB, 2 MB]
Benchmarking - 1 core, KMEANS
[Charts: execution time [ms] (FS, ET, ESM) and error percentage [%] (FS vs ET, FS vs ESM) across L2 cache sizes 0 MB - 2 MB]
Benchmarking - 1 core, CANNEAL
[Charts: execution time [s] (FS, ET, ESM) and error percentage [%] (FS vs ET, FS vs ESM) across L2 cache sizes 0 MB - 2 MB]
Benchmarking - L2, KMEANS
[Charts: L2 cache miss rate [%] and overall L2 miss latency [cycles] (FS vs ESM) across L2 cache sizes 16 kB - 2 MB]
Benchmarking - up to 128 cores, Blackscholes & Hotspot
[Charts: execution time [s] and simulation time [min] for 1, 2, 4, …, 128 cores under trace replication]
◼ Scalability analysis has limits
● Requires additional features such as address offsetting
● Weak scaling only (replicated per-core workloads)
◼ Programming models are moving from loops to tasks
● OpenMP 4.0, OmpSs
● Still pragma-based
● More parallelism available at run-time… more opportunities for smart job scheduling
Perspectives - Manycore simulation
[Diagram: BSC OmpSs Cholesky decomposition - a ready-task queue feeds a scheduler that dispatches tasks to hardware]
◼ Unbinding traces from cores
● One trace per task, not per core!
● Assign traces to cores by emulating runtime behaviour in trace replay
● This is real strong scaling
Perspectives
[Diagram: during trace collection, per-task traces (task 1 … task N) plus task dependencies are recorded; during trace simulation, an emulated scheduler/runtime dispatches them from a ready-task queue to the trace injectors]
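Emulating the runtime in replay boils down to list scheduling: ready task traces are dispatched to whichever core frees up first, so the same task set finishes faster on more cores (strong scaling). A minimal sketch on identical cores; task durations stand in for replayed task traces, and all names are illustrative.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

// Greedy list scheduling of ready task traces onto identical cores: each
// task goes to the earliest-free core; returns the resulting makespan.
double scheduleTasks(const std::vector<double>& taskDurations, int cores) {
    // Min-heap of per-core "free at" times, all cores free at t = 0.
    std::priority_queue<double, std::vector<double>,
                        std::greater<double>> freeAt;
    for (int c = 0; c < cores; ++c) freeAt.push(0.0);
    double makespan = 0.0;
    for (double d : taskDurations) {
        double start = freeAt.top();  // earliest-available core
        freeAt.pop();
        freeAt.push(start + d);       // core busy until task completes
        makespan = std::max(makespan, start + d);
    }
    return makespan;
}
```

With task dependencies added (as in the OmpSs diagram), only tasks whose predecessors finished would enter the ready queue; the dispatch rule stays the same.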
◼ Current ESM prototype
● ~5×-10× speedup for low core counts; probably more for tens/thousands
● Nice solution for fast DSE
● Remaining accuracy issues for some applications
  - Common to ESM & Elastic Traces
  - Under investigation with ARM
◼ Use cases
● Exploration of the memory subsystem
● Some microarchitecture parameters (Elastic Traces)
● …
◼ Future directions
● Ruby compatibility
● Could be combined with other initiatives (dist-gem5)
● Can be extended to other programming models / APIs (tasking, MPI…)
Conclusion
[Chart: Canneal - L2 cache miss latency [cycles] vs. L2 cache size (16 kB - 2 MB), FS vs ESM, 1 core]
http://montblanc-project.eu